Activation-aware Weight Quantization (AWQ)

Activation-aware Weight Quantization (AWQ) is a post-training method for quantizing large language models that preserves most of their accuracy without the memory overhead of quantization-aware training. Quantizing very large language models is difficult because of outliers.

Outliers are values, in the weights or the activations, that are far larger than the rest of the distribution. They stretch the range that must be represented at quantization time, making it harder to maintain performance while reducing weight precision. AWQ accounts for them by using activation statistics to identify the small fraction of weight channels that matter most, then computing per-channel scale factors that protect those channels during quantization, thereby maintaining model performance.
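A minimal sketch of this idea, on a toy linear layer with one outlier activation channel. The 4-bit round-to-nearest quantizer, the `alpha = 0.5` exponent, and the `[1, 2]` cap on the scale factors are illustrative assumptions here, not the paper's exact recipe (AWQ grid-searches the scaling exponent per layer):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy layer: weights W (out x in) and calibration activations X.
# Input channel 3 carries outlier activations, as seen in real LLMs.
W = rng.normal(0.0, 0.02, size=(64, 64))
X = rng.normal(0.0, 1.0, size=(128, 64))
X[:, 3] *= 50.0

def quantize(w, bits=4):
    """Symmetric round-to-nearest quantization, dequantized back to float."""
    qmax = 2 ** (bits - 1) - 1
    step = np.abs(w).max() / qmax
    return np.clip(np.round(w / step), -qmax, qmax) * step

ref = X @ W.T  # full-precision output

# Baseline: plain round-to-nearest quantization of the weights.
err_rtn = np.abs(ref - X @ quantize(W).T).mean()

# AWQ-style: measure per-input-channel activation magnitude on the
# calibration set, scale the matching weight columns up before
# quantization, then fold the inverse scale back into the weights.
# This is equivalent to dividing the activations by s at runtime.
act_mag = np.abs(X).mean(axis=0)
s = np.clip((act_mag / act_mag.mean()) ** 0.5, 1.0, 2.0)
W_awq = quantize(W * s) / s
err_awq = np.abs(ref - X @ W_awq.T).mean()

print(f"RTN error: {err_rtn:.5f}  AWQ-style error: {err_awq:.5f}")
```

Scaling the salient channel up shrinks its relative rounding error, and because only a handful of channels are scaled, the quantization step for the rest barely grows, so the mean output error drops versus plain round-to-nearest.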

Image credit: J. Lin et al.