Click here to flash read.
Transformers yield state-of-the-art results across many tasks. However, their
heuristically designed architecture impose huge computational costs during
inference. This work aims on challenging the common design philosophy of the
Vision Transformer (ViT) model with uniform dimension across all the stacked
blocks in a model stage, where we redistribute the parameters both across
transformer blocks and between different structures within the block via the
first systematic attempt on global structural pruning. Dealing with diverse ViT
structural components, we derive a novel Hessian-based structural pruning
criteria comparable across all layers and structures, with latency-aware
regularization for direct latency reduction. Performing iterative pruning on
the DeiT-Base model leads to a new architecture family called NViT (Novel ViT),
with a novel parameter redistribution that utilizes parameters more
efficiently. On ImageNet-1K, NViT-Base achieves a 2.6x FLOPs reduction, 5.1x
parameter reduction, and 1.9x run-time speedup over the DeiT-Base model in a
near lossless manner. Smaller NViT variants achieve more than 1% accuracy gain
at the same throughput of the DeiT Small/Tiny variants, as well as a lossless
3.3x parameter reduction over the SWIN-Small model. These results outperform
prior art by a large margin. Further analysis is provided on the parameter
redistribution insight of NViT, where we show the high prunability of ViT
models, distinct sensitivity within ViT block, and unique parameter
distribution trend across stacked ViT blocks. Our insights provide viability
for a simple yet effective parameter redistribution rule towards more efficient
ViTs for off-the-shelf performance boost.
No creative common's license