[Day 76] Finishing the Retrieval-based LM talk, and learning about distillation, quantization and pruning

 Hello :)
Today is Day 76!


  • Finished sections 6 and 7 of the retrieval-based LMs talk (ACL 2023): multilingual retrieval-based LMs, and retrieval-based LMs' challenges and opportunities
  • Covered lecture 11 of CMU 11-711 Advanced NLP - Distillation, Quantization, and Pruning

Section 6: Multilingual retrieval-based LMs

Section 7: Retrieval-based LMs' challenges and opportunities

Lecture 11: Distillation, Quantization, and Pruning

Problem: The best models for NLP tasks are massive. So how can we deploy NLP systems cheaply, efficiently and equitably, without giving up too much performance?

Answer: Model compression

Quantization - keep the model the same but reduce the number of bits
Pruning - remove parts of the model while retaining performance
Distillation - train a smaller model to imitate the larger model

Quantization - no parameters are removed or retrained; each one is just stored with at most k bits of precision

Amongst other methods, we can use post-training quantization
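
For instance, here is a minimal sketch of post-training dynamic quantization in PyTorch (the two-layer model is a hypothetical stand-in for a trained network):

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a trained NLP model.
model = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 2))

# Post-training dynamic quantization: weights of the listed module types
# are stored as int8; activations are quantized on the fly at inference.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

print(quantized)  # the Linear layers are now dynamically quantized
```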

At the extreme, we can even binarize the parameters and activations (a single bit each)
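
A minimal sketch of what binarizing the weights could look like (XNOR-Net-style; scaling by the mean absolute value is one common choice, and the helper name is mine):

```python
import torch

def binarize(w: torch.Tensor) -> torch.Tensor:
    # Keep only the sign of each weight (1 bit of information), scaled by
    # the mean absolute value so the overall magnitude stays calibrated.
    alpha = w.abs().mean()
    return alpha * torch.sign(w)

w = torch.randn(4, 4)
print(binarize(w))  # every entry is +alpha or -alpha
```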

Pruning - a number of parameters are set to zero, the rest are unchanged

One option is magnitude pruning (Han et al., 2015; See et al., 2016): set to zero a chosen percentage of the parameters, the ones with the smallest magnitude.
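
A small sketch using PyTorch's built-in pruning utilities (the layer and the 50% are arbitrary choices for illustration):

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(768, 768)  # stand-in for a layer of a trained model

# Magnitude (L1) pruning: zero out the 50% of weights with the smallest
# absolute value; PyTorch applies this as a binary mask over `weight`.
prune.l1_unstructured(layer, name="weight", amount=0.5)

sparsity = (layer.weight == 0).float().mean().item()
print(f"fraction of zeroed weights: {sparsity:.2f}")  # ~0.50
```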


With machine translation, researchers found that we can remove almost half the parameters with almost no effect on downstream tasks. This is a type of unstructured pruning, because we remove individual parameters from anywhere in the model, wherever their magnitude happens to be small.

Problems with unstructured pruning:


we can make the model's weight matrices sparse, but if our hardware cannot take advantage of this sparsity, we don't gain anything in speed or memory - and making sparse models actually run faster is currently an active research area.

A more immediately useful idea is structured pruning (Xia et al., 2022).

Instead of zeroing individual parameters, we remove entire components/layers from the model. For example:


We can remove almost half the heads of a transformer with a negligible impact on performance.
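
As a sketch, the Hugging Face transformers library supports this kind of head pruning directly (which heads to drop - here an arbitrary half of the heads in the first two layers - would normally be chosen by an importance score):

```python
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

# Structured pruning: remove whole attention heads rather than
# individual weights. The choice of heads here is arbitrary.
model.prune_heads({0: [0, 1, 2, 3, 4, 5], 1: [0, 1, 2, 3, 4, 5]})

print(model.config.pruned_heads)  # records which heads were removed
```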


Distillation - train one model (the 'student') to replicate the behaviour of another large(r) model (the 'teacher'); ~all params are changed

Train the teacher

Generate soft targets - instead of using the actual (hard) labels for training, the teacher outputs soft targets, which are probability distributions over the classes

Train the student on the soft targets - it learns to mimic the behaviour of the teacher by minimizing the difference between its predictions and the soft targets
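
Putting the steps together, a minimal sketch of the classic distillation loss (Hinton et al., 2015); the temperature and the mixing weight alpha are illustrative values, not ones from the lecture:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    # Soft-target term: KL divergence between the temperature-softened
    # teacher and student distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2  # rescale so gradients match the hard-label term

    # Hard-target term: ordinary cross-entropy against the true labels.
    hard = F.cross_entropy(student_logits, labels)

    return alpha * soft + (1 - alpha) * hard
```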

An example is DistilBERT (Sanh et al., 2019).





That is all for today!
See you tomorrow :)
