Compression in AI typically refers to model compression: reducing the size of a neural network without significant loss of accuracy. Machine learning engineers commonly rely on four main techniques:

1) Quantization

2) Pruning

3) Knowledge distillation

4) Low-rank factorization
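To make the first technique concrete, here is a minimal sketch of symmetric per-tensor int8 quantization in NumPy. It is an illustrative example, not TitanML's implementation: the function names and the toy weight matrix are invented for demonstration. Each float32 weight is mapped to an 8-bit integer plus a single shared scale factor, cutting storage roughly 4x while keeping the reconstruction error bounded by about half the scale.

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization: map floats to [-127, 127]."""
    scale = np.max(np.abs(weights)) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 values and the scale."""
    return q.astype(np.float32) * scale

# Toy example: a small random weight matrix (hypothetical data).
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

# int8 storage is ~4x smaller than float32; per-weight rounding error
# is bounded by roughly scale / 2.
max_err = np.max(np.abs(w - w_hat))
```

Pruning, distillation, and low-rank factorization follow the same spirit: trade a small, controlled approximation error for a smaller or faster model.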

At TitanML, our Titan Takeoff Inference Server builds in a number of accuracy-preserving compression techniques, enabling large language model deployment everywhere, including on-prem.
