CAS-ViT: Convolutional Additive Self-attention Vision Transformers for Efficient Mobile Applications
Vision Transformers (ViTs) mark a revolutionary advance in neural networks, owing to the powerful global-context modeling capability of their token mixers. However, the pairwise token affinity and complex matrix operations limit their deployment in resource-constrained scenarios a... ...