IECCT 2026 · Paper ID 1523

Progressive Token Pruning with Feature-Aware Distillation for ViTs

Omswaroop T M · Sushmith S Mokashi · Bathala Harsha · Shashikala H K · Pranshu Jain

School of Computer Science and Engineering, Jain (Deemed-to-be University), Bengaluru

Top-1 accuracy: 89.60% (CIFAR-100, ViT-B/16)
Above teacher: +2.57 percentage points
Attention FLOPs: 0.265× (73.5% reduction)
Parameters added: 0 (over base ViT-B/16)

The problem

Vision Transformers process images as sequences of patch tokens. Self-attention scales quadratically with sequence length: for ViT-B/16 on a 224×224 image, the 14×14 = 196 patch tokens produce 196² = 38,416 pairwise scores per attention head per layer. Token pruning reduces this cost, but pruning at the input layer discards tokens before they carry semantic meaning, so the wrong tokens get cut and accuracy drops.
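The token count behind that figure can be checked with a few lines of arithmetic. This is only an illustrative sketch: the function name `attention_scores` and the `include_cls` flag are ours, not from the paper, and the 38,416 figure corresponds to counting the 196 patch tokens without the CLS token.

```python
def attention_scores(image_size: int, patch_size: int, include_cls: bool = False) -> int:
    """Pairwise attention scores per head per layer for a square image."""
    patches_per_side = image_size // patch_size      # 224 // 16 = 14
    tokens = patches_per_side ** 2                   # 14 * 14 = 196 patch tokens
    if include_cls:
        tokens += 1                                  # CLS token brings it to 197
    return tokens * tokens                           # every token attends to every token


print(attention_scores(224, 16))                     # 38416
print(attention_scores(224, 16, include_cls=True))   # 38809
```

Halving the token count cuts this term to a quarter, which is why removing tokens inside the encoder pays off so quickly.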

Our approach

PTP — Progressive Token Pruning

Prune progressively at the 1/3 and 2/3 points of the encoder depth, scoring tokens by the L₂ norm of their post-stage hidden representations. No additional learned parameters.
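A single pruning step of this kind might look like the sketch below. It is written in plain Python for readability and rests on assumptions the poster does not state: that tokens with the largest L₂ norm are the ones kept, that the CLS token (index 0) is exempt from pruning, and that a fixed `keep_ratio` is used. The names `prune_tokens` and `keep_ratio` are hypothetical.

```python
import math


def l2_norm(vec):
    """Euclidean norm of one token's hidden vector."""
    return math.sqrt(sum(x * x for x in vec))


def prune_tokens(tokens, keep_ratio):
    """Keep CLS plus the highest-norm patch tokens (assumed scoring direction).

    tokens: list of hidden vectors; tokens[0] is assumed to be CLS.
    keep_ratio: hypothetical fraction of patch tokens to keep.
    """
    cls_token, patches = tokens[0], tokens[1:]
    n_keep = max(1, math.ceil(keep_ratio * len(patches)))
    # Rank patch indices by L2 norm, largest first. No learned parameters involved.
    ranked = sorted(range(len(patches)), key=lambda i: l2_norm(patches[i]), reverse=True)
    kept = sorted(ranked[:n_keep])  # restore the original spatial order
    return [cls_token] + [patches[i] for i in kept]


hidden = [[0.1, 0.1], [3.0, 4.0], [0.0, 1.0], [1.0, 1.0], [6.0, 8.0]]
print(prune_tokens(hidden, keep_ratio=0.5))
# [[0.1, 0.1], [3.0, 4.0], [6.0, 8.0]]
```

In a full model this step would run on the hidden states leaving a stage, so the scores reflect what the encoder has already computed rather than raw patch content.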

FAKD — Feature-Aware KD

Align CLS representations between sparse student and frozen dense teacher exactly at pruning boundaries, where divergence is structurally greatest.
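One way to picture the alignment term is as a distance between student and teacher CLS vectors, averaged over the pruning boundaries. The poster does not specify the distance or the weighting, so the mean squared error and the weight `lam` below are assumptions, and the function names are hypothetical.

```python
def cls_alignment_loss(student_cls, teacher_cls):
    """Mean squared error between CLS vectors, averaged over pruning boundaries.

    student_cls, teacher_cls: one CLS vector per pruning boundary.
    MSE is an assumed choice of distance, not confirmed by the source.
    """
    per_boundary = []
    for s, t in zip(student_cls, teacher_cls):
        per_boundary.append(sum((a - b) ** 2 for a, b in zip(s, t)) / len(s))
    return sum(per_boundary) / len(per_boundary)


def total_loss(task_loss, student_cls, teacher_cls, lam=1.0):
    """Task loss plus the weighted alignment term (lam is a hypothetical weight)."""
    return task_loss + lam * cls_alignment_loss(student_cls, teacher_cls)


student = [[1.0, 2.0], [0.0, 0.0]]   # CLS at the two pruning boundaries
teacher = [[1.0, 0.0], [0.0, 2.0]]   # frozen teacher: treated as constants
print(cls_alignment_loss(student, teacher))     # 2.0
print(total_loss(0.5, student, teacher, 0.1))   # 0.7
```

Because the teacher is frozen, only the student would receive gradients from this term in an actual training loop.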

Key finding

A controlled 2×2 ablation isolates pruning timing as the dominant accuracy contributor. Progressive pruning without any distillation already reaches 88.98%, beating the dense teacher (87.03%) by 1.95 points. Distillation adds a further 0.62 points, for the final 89.60%. The schedule, not the supervision, is what drives the gain.

Why it matters

The 73.5% reduction in attention FLOPs translates to lower inference latency in edge deployment scenarios (mobile devices, embedded systems, real-time video pipelines) where quadratic attention is the primary throughput bottleneck. Because PTP-FAKD adds no parameters, it applies directly to pretrained ViT-B/16 checkpoints without architectural changes, and the stage partitioning extends naturally to deeper models such as ViT-L/16 (24 blocks, stages of 8) and ViT-H/14 (32 blocks, stages of ~11).
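The stage sizes quoted above follow from splitting the encoder depth into three near-equal stages, which the sketch below reproduces. The second function estimates how much of the quadratic attention term survives for a given set of per-stage keep fractions. Those fractions are purely illustrative and are not the paper's configuration, so the result is not expected to match the reported 0.265×. Both function names are hypothetical.

```python
def stage_sizes(depth, n_stages=3):
    """Split `depth` encoder blocks into near-equal stages, larger stages first."""
    base, extra = divmod(depth, n_stages)
    return [base + 1 if i < extra else base for i in range(n_stages)]


def relative_quadratic_cost(sizes, keep_fractions):
    """Fraction of the N^2 attention term left after pruning.

    keep_fractions[i] is the share of tokens alive during stage i
    (illustrative values only; not taken from the paper).
    """
    pruned = sum(n_blocks * frac ** 2 for n_blocks, frac in zip(sizes, keep_fractions))
    return pruned / sum(sizes)


print(stage_sizes(12))   # [4, 4, 4]     ViT-B/16
print(stage_sizes(24))   # [8, 8, 8]     ViT-L/16
print(stage_sizes(32))   # [11, 11, 10]  ViT-H/14
print(relative_quadratic_cost(stage_sizes(12), (1.0, 0.5, 0.25)))  # 0.4375
```

The quadratic dependence on the keep fraction is what makes later stages cheap: a stage running on a quarter of the tokens costs about 6% of its dense attention budget.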