PTP-FAKD · Progressive Token Pruning for Vision Transformers

Top-1 accuracy: 89.60% (CIFAR-100, ViT-B/16)

Above teacher: +2.57 percentage points

Attention FLOPs: 0.265× (73.5% reduction)

Parameters added: 0 over base ViT-B/16

The problem

Vision Transformers process images as sequences of patch tokens. Self-attention scales quadratically with sequence length: for ViT-B/16 on a 224×224 image, the 196 patch tokens alone produce 38,416 pairwise scores per attention head per layer (38,809 once the CLS token is included). Token pruning reduces this cost, but pruning at the input layer discards tokens before they carry semantic meaning, so the wrong tokens get cut and accuracy drops.
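The token and score counts follow from simple arithmetic. The sketch below assumes a square image, a square patch of side 16, and a single CLS token; the function names are illustrative and not taken from the project.

```python
def num_tokens(image_size: int, patch_size: int, with_cls: bool = True) -> int:
    """Sequence length for a square image split into square patches."""
    patches_per_side = image_size // patch_size
    n_patches = patches_per_side ** 2
    return n_patches + (1 if with_cls else 0)


def pairwise_scores(seq_len: int) -> int:
    """Entries in one attention matrix (per head, per layer)."""
    return seq_len * seq_len


patches_only = num_tokens(224, 16, with_cls=False)
with_cls = num_tokens(224, 16)
print(patches_only, pairwise_scores(patches_only))  # 196 38416
print(with_cls, pairwise_scores(with_cls))          # 197 38809
```

Halving the sequence length cuts the attention matrix to a quarter of its size, which is why removing tokens pays off faster than linearly.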

Our approach

✂ PTP — Progressive Token Pruning

Prune progressively at 1/3 and 2/3 of encoder depth (after blocks 4 and 8 of 12 in ViT-B/16), scoring tokens by the L₂ norm of their post-stage hidden representations. No additional learned parameters.
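A single pruning step can be sketched in plain Python. This is an illustration under stated assumptions rather than the project's implementation: tokens are plain lists of floats instead of tensors, the CLS token sits at index 0, `keep` counts the CLS token, and surviving tokens keep their original order. The names `l2_norm` and `prune_tokens` are hypothetical.

```python
import math
from typing import List

Token = List[float]


def l2_norm(vec: Token) -> float:
    """Euclidean norm of one token's hidden vector."""
    return math.sqrt(sum(x * x for x in vec))


def prune_tokens(tokens: List[Token], keep: int) -> List[Token]:
    """Keep the CLS token (index 0) plus the highest-norm patch tokens.

    `keep` is the total number of tokens retained, CLS included.
    """
    if not 1 <= keep <= len(tokens):
        raise ValueError("keep must be between 1 and len(tokens)")
    patch_indices = range(1, len(tokens))
    # Rank patch tokens by norm, largest first; no learned scorer is involved.
    ranked = sorted(patch_indices, key=lambda i: l2_norm(tokens[i]), reverse=True)
    # Restore positional order among the survivors.
    kept = sorted(ranked[: keep - 1])
    return [tokens[0]] + [tokens[i] for i in kept]


hidden = [[0.1, 0.1], [3.0, 4.0], [0.2, 0.0], [1.0, 1.0], [0.0, 0.5]]
print(prune_tokens(hidden, keep=3))  # [[0.1, 0.1], [3.0, 4.0], [1.0, 1.0]]
```

Because the score is just a norm of activations the model already computes, the step adds no trainable weights.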

⚖ FAKD — Feature-Aware KD

Align CLS representations between sparse student and frozen dense teacher exactly at pruning boundaries, where divergence is structurally greatest.

Key finding

A controlled 2×2 ablation isolates pruning timing as the dominant accuracy contributor. Progressive pruning without any distillation already reaches 88.98% — beating the dense teacher by 1.95 points. Distillation adds a further 0.62. The schedule, not the supervision, is what drives the gain.

Why it matters

The 73.5% reduction in attention FLOPs translates directly to lower inference latency in edge deployment scenarios — mobile devices, embedded systems, real-time video pipelines — where quadratic attention is the primary throughput bottleneck. Because PTP-FAKD adds no parameters, it applies directly to pretrained ViT-B/16 checkpoints without architectural changes, and the framework partitions naturally to deeper models like ViT-L/16 (24 blocks, stages of 8) and ViT-H/14 (32 blocks, stages of ~11).
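The stage partition for different depths can be sketched as follows. How the remainder is distributed when the depth is not divisible by three is an assumption here (earlier stages get the extra block); the source only says "stages of ~11" for ViT-H/14. Function names are illustrative.

```python
from typing import List


def split_into_stages(depth: int, num_stages: int = 3) -> List[int]:
    """Split `depth` encoder blocks into `num_stages` near-equal stages.

    Assumption: earlier stages receive the extra block when the
    depth is not evenly divisible.
    """
    base, extra = divmod(depth, num_stages)
    return [base + (1 if s < extra else 0) for s in range(num_stages)]


def pruning_points(depth: int, num_stages: int = 3) -> List[int]:
    """1-based block indices after which a pruning step would run."""
    points, total = [], 0
    for size in split_into_stages(depth, num_stages)[:-1]:
        total += size
        points.append(total)
    return points


for name, depth in [("ViT-B/16", 12), ("ViT-L/16", 24), ("ViT-H/14", 32)]:
    print(name, split_into_stages(depth), pruning_points(depth))
# ViT-B/16 [4, 4, 4] [4, 8]
# ViT-L/16 [8, 8, 8] [8, 16]
# ViT-H/14 [11, 11, 10] [11, 22]
```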

An input image becomes 197 tokens (196 patches + CLS). The student processes them through three stages of 4 blocks each, pruning between stages. The frozen teacher processes the full sequence in parallel and supervises the student at logit and feature level.

Why this works

Naive pruning scores tokens at the embedding layer, where representations carry texture statistics but no semantic content. A high-norm token there often corresponds to a high-frequency edge patch, not a class-relevant region. Pruning after 4 and 8 blocks of attention lets context propagate first, so that by then high-norm tokens correspond to object-centric regions. The classification head ends up attending to semantically concentrated tokens, which plausibly acts as an implicit regularizer and may explain why the sparse student outperforms the dense teacher.

Loss formulation

L = (1 − α)·L_CE + α·L_KD + β·L_feat

Cross-entropy with label smoothing (ε = 0.1), KL-divergence distillation at temperature T = 4.0, and L₂-normalized MSE on CLS features. α anneals from 0.70 to 0.50 over training. β = 0.3 is fixed.
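The three terms can be written out in a few lines of plain Python. This is a sketch under several assumptions the source does not state: label smoothing mixes the one-hot target with a uniform distribution, the KD term is KL(teacher ‖ student) scaled by T², α anneals linearly, and the feature term is summed over the pruning boundaries. All function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def smoothed_ce(logits: Sequence[float], target: int, eps: float = 0.1) -> float:
    """Cross-entropy against (1 - eps) * one-hot + eps / K (assumed form)."""
    probs, k = softmax(logits), len(logits)
    loss = 0.0
    for i, p in enumerate(probs):
        q = (1.0 - eps) * (1.0 if i == target else 0.0) + eps / k
        loss -= q * math.log(max(p, 1e-12))
    return loss


def kd_loss(student_logits, teacher_logits, T: float = 4.0) -> float:
    """KL(teacher || student) on softened distributions, times T^2 (assumed)."""
    p_t, p_s = softmax(teacher_logits, T), softmax(student_logits, T)
    kl = sum(t * (math.log(t) - math.log(max(s, 1e-12)))
             for t, s in zip(p_t, p_s) if t > 0)
    return kl * T * T


def l2_normalize(vec: Sequence[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def feature_loss(student_cls, teacher_cls) -> float:
    """MSE between L2-normalized CLS vectors."""
    s, t = l2_normalize(student_cls), l2_normalize(teacher_cls)
    return sum((a - b) ** 2 for a, b in zip(s, t)) / len(s)


def alpha_at(progress: float, start: float = 0.70, end: float = 0.50) -> float:
    """Linear anneal (assumed); `progress` runs from 0.0 to 1.0."""
    progress = min(max(progress, 0.0), 1.0)
    return start + (end - start) * progress


def total_loss(s_logits, t_logits, target: int,
               cls_pairs: Sequence[Tuple[Sequence[float], Sequence[float]]],
               progress: float, beta: float = 0.3) -> float:
    """L = (1 - alpha) * L_CE + alpha * L_KD + beta * L_feat."""
    alpha = alpha_at(progress)
    l_feat = sum(feature_loss(s, t) for s, t in cls_pairs)  # one pair per boundary
    return ((1 - alpha) * smoothed_ce(s_logits, target)
            + alpha * kd_loss(s_logits, t_logits)
            + beta * l_feat)
```

Normalizing the CLS vectors before the MSE makes the feature term depend on direction only, so a student whose features differ from the teacher's by a scale factor is not penalized.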

Pruning ratios

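κ₁ = 0.85 keeps 85% of tokens after stage 1 (197 → 167), and κ₂ = 0.50 leaves 98 tokens after stage 2 (167 → 98). The reported counts imply that both ratios are measured against the original 197-token sequence, so κ₂ keeps half of the input tokens rather than half of the 167 that survive stage 1. The CLS token is never pruned. Both ratios were selected via grid search on a held-out 10% validation split of CIFAR-100, kept independent of the test set.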
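A quick check of the token counts shows which reading of the ratios reproduces 197 → 167 → 98. Rounding down and measuring each ratio against the original sequence length are inferences from the reported numbers, not statements from the source; both helper names are illustrative.

```python
import math
from typing import List, Sequence


def tokens_kept(initial: int, keep_ratios: Sequence[float]) -> List[int]:
    """Token count after each step, each ratio measured against `initial`."""
    return [math.floor(r * initial) for r in keep_ratios]


def tokens_kept_sequential(initial: int, keep_ratios: Sequence[float]) -> List[int]:
    """Alternative reading: each ratio applies to the tokens still alive."""
    counts, current = [], initial
    for r in keep_ratios:
        current = math.floor(r * current)
        counts.append(current)
    return counts


print(tokens_kept(197, [0.85, 0.50]))             # [167, 98]
print(tokens_kept_sequential(197, [0.85, 0.50]))  # [167, 83]
```

Only the first reading matches the counts given for the two pruning steps.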

CIFAR-100 ablation

2×2 ablation breakdown

The four-corner experiment that isolates pruning timing from supervision objective.

Naive · CE only: 78.29% (−8.74 vs teacher)

Naive · feature KD: 84.91% (−2.12 vs teacher)

Progressive · CE only: 88.98% (+1.95 vs teacher)

★ PTP-FAKD (ours): 89.60% (+2.57 vs teacher)
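The four corners can be turned into per-factor gains with a few subtractions. The accuracies are the ones reported in this ablation; the teacher accuracy is not stated directly and is derived here from the "+2.57 vs teacher" figure. The dictionary keys are illustrative labels.

```python
# Top-1 accuracies (%) for the four ablation corners.
acc = {
    ("naive", "ce"): 78.29,
    ("naive", "kd"): 84.91,
    ("progressive", "ce"): 88.98,
    ("progressive", "kd"): 89.60,
}


def delta(a: float, b: float) -> float:
    """Difference in percentage points, rounded to two decimals."""
    return round(a - b, 2)


# Teacher accuracy implied by the reported +2.57 margin.
teacher = delta(acc[("progressive", "kd")], 2.57)

# Gain from switching naive -> progressive, for each supervision setting.
timing_gain = {sup: delta(acc[("progressive", sup)], acc[("naive", sup)])
               for sup in ("ce", "kd")}

# Gain from adding distillation, for each pruning schedule.
kd_gain = {sched: delta(acc[(sched, "kd")], acc[(sched, "ce")])
           for sched in ("naive", "progressive")}

print(teacher)      # 87.03
print(timing_gain)  # {'ce': 10.69, 'kd': 4.69}
print(kd_gain)      # {'naive': 6.62, 'progressive': 0.62}
```

Distillation recovers 6.62 points under naive pruning but only 0.62 once pruning is progressive, while the schedule change is worth 10.69 points with cross-entropy alone.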

Accuracy vs compute

CIFAR-10 transfer

Logit-only KD on a sparse ViT-B/16 student transfers cleanly to CIFAR-10, closing 30% of the teacher gap.

Dense teacher: 96.89%

Naive sparse: 91.97%

Sparse + logit KD: 93.45%
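The "30% of the teacher gap" figure can be checked directly from the three CIFAR-10 accuracies. The helper name `gap_closed` is illustrative.

```python
def gap_closed(teacher: float, baseline: float, improved: float) -> float:
    """Fraction of the teacher-baseline gap recovered by `improved`."""
    return (improved - baseline) / (teacher - baseline)


fraction = gap_closed(teacher=96.89, baseline=91.97, improved=93.45)
print(f"{fraction:.1%}")  # 30.1%
```

Logit KD adds 1.48 points against a 4.92-point gap between the naive sparse student and the dense teacher.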

Token retention demo (illustrative): compares naive pruning, which scores tokens at the input, with progressive pruning, which scores them after each stage, starting from the full 197-token sequence. The patterns mirror real PTP-FAKD inference behavior shown in Fig. 5 of the paper.