Quantization

Quantization is a machine learning optimization technique applicable to deep neural networks. Neural networks store a large number of learned parameters (commonly called weights) that encode the model's knowledge of the task it was trained to perform. During training, these numbers must be stored at relatively high precision so that the model learns from the data effectively. At inference time, it is often possible to decrease the precision at which the weights are stored without a substantial drop in model ability. This can substantially reduce the amount of memory the model occupies and, with proper optimization, can also speed up the model's processing of incoming data.
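
The idea can be sketched in a few lines. Below is a minimal, illustrative example (not any particular library's implementation) of symmetric per-tensor int8 quantization: each float32 weight is mapped to an 8-bit integer via a single scale factor, shrinking storage fourfold, and an approximate float value can be recovered by multiplying back by that scale. The function names here are hypothetical, chosen for clarity.

```python
import numpy as np

def quantize_int8(weights):
    # One scale for the whole tensor, chosen so the largest-magnitude
    # weight maps to the edge of the int8 range [-127, 127].
    scale = np.max(np.abs(weights)) / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover an approximation of the original float32 weights.
    return q.astype(np.float32) * scale

weights = np.array([0.5, -1.2, 0.03, 0.9], dtype=np.float32)
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# q takes 1 byte per weight instead of 4, and `restored` differs from
# `weights` only by a small rounding error.
```

Real systems refine this basic recipe in many ways (per-channel scales, asymmetric zero points, calibration on sample data), but the core trade of precision for memory is the same.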