Distillation is the process of using a larger "teacher" model to train a smaller "student" model, which has been shown to be more effective than training the small model from scratch [1]. It can involve using intermediate states of the larger model to guide the smaller one, or using a large generative model to produce new text on which the smaller model is trained.
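One common form of this is soft-label distillation, where the student is trained to match the teacher's temperature-softened output distribution rather than hard labels. Below is a minimal NumPy sketch of that loss (temperature-scaled KL divergence between teacher and student distributions); the function names and the temperature value are illustrative, not from the source.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on softened distributions.

    The T**2 factor keeps gradient magnitudes comparable
    across temperatures (a standard convention).
    """
    p = softmax(teacher_logits, T)  # teacher's soft targets
    q = softmax(student_logits, T)  # student's predictions
    kl = np.sum(p * (np.log(p) - np.log(q)), axis=-1)
    return float(T**2 * kl.mean())
```

In practice this term is usually mixed with the ordinary cross-entropy loss on ground-truth labels; when student and teacher logits agree, the loss is zero.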

Image credit: C. Hsieh et al., https://arxiv.org/pdf/2305.02301.pdf
