Vision Transformer (ViT) has emerged as a prominent architecture for various computer vision tasks. In ViT, we divide the input image into patch tokens and process them through a stack of self-attention blocks. However, unlike Convolutional Neural Networks (CNNs), ViT’s simple architecture has no informative inductive bias (e.g., locality). As a result, ViTs require a large amount of data for pre-training. Various data-efficient approaches (DeiT) have been proposed to train ViTs effectively on balanced datasets. However, limited literature discusses the use of ViTs for datasets with long-tailed imbalance. In this work, we introduce DeiT-LT to tackle the problem of training ViTs from scratch on long-tailed datasets. In DeiT-LT, we introduce an efficient and effective way of distilling from a CNN via a distillation (DIST) token, using out-of-distribution images and re-weighting the distillation loss to enhance focus on tail classes. This leads to the learning of local, CNN-like features in early ViT blocks, improving generalization for tail classes. Further, to mitigate overfitting, we propose distilling from flat CNN teachers, which leads to learning low-rank, generalizable features for DIST tokens across all ViT blocks. With the proposed DeiT-LT scheme, the DIST token becomes an expert on the tail classes, and the classifier (CLS) token becomes an expert on the head classes. Together, the two experts help to effectively learn features related to both the majority and minority classes, using a distinct set of tokens within the same ViT architecture. We show the effectiveness of DeiT-LT for training ViTs from scratch on datasets ranging from small-scale CIFAR-10 LT to large-scale iNaturalist-2018.
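To make the two-token idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes hard-label distillation (the DIST token is trained on the teacher's predicted class), inverse-frequency class weights as one possible re-weighting, and an equal 0.5/0.5 mix of the two loss terms. The names `inverse_frequency_weights` and `two_token_loss` are hypothetical and chosen for illustration only.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def inverse_frequency_weights(class_counts):
    """Assumed re-weighting: weight proportional to 1 / class count, scaled to mean 1."""
    inv = [1.0 / c for c in class_counts]
    scale = len(inv) / sum(inv)
    return [w * scale for w in inv]


def two_token_loss(cls_logits, dist_logits, label, teacher_logits, class_weights):
    """Loss for one sample with separate CLS and DIST heads.

    CLS head: plain cross-entropy against the ground-truth label.
    DIST head: cross-entropy against the teacher's predicted class,
    scaled by that class's weight so tail classes count more.
    The 0.5 / 0.5 mix is an assumption of this sketch.
    """
    ce_cls = -log_softmax(cls_logits)[label]
    teacher_label = max(range(len(teacher_logits)), key=teacher_logits.__getitem__)
    ce_dist = -class_weights[teacher_label] * log_softmax(dist_logits)[teacher_label]
    return 0.5 * ce_cls + 0.5 * ce_dist


# Toy long-tailed setup: one head class, one medium class, one tail class.
counts = [5000, 500, 50]
weights = inverse_frequency_weights(counts)

# Same student logits; the teacher predicts either the head or the tail class.
loss_head = two_token_loss([0.0, 0.0, 0.0], [0.0, 0.0, 0.0], 0, [3.0, 0.2, 0.1], weights)
loss_tail = two_token_loss([0.0, 0.0, 0.0], [0.0, 0.0, 0.0], 0, [0.1, 0.2, 3.0], weights)
```

In this toy setup the distillation term is larger when the teacher predicts the tail class, so the DIST head is pushed harder on tail classes, while the CLS head sees the unweighted ground-truth loss.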