PV-Tuning for Extreme LLM Compression
Our paper PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression, which introduces a new fine-tuning technique for highly compressed LLMs, has been published. We demonstrate that PV-Tuning improves quantized model accuracy compared to leading existing approaches such as QuIP# and AQLM.
The paper is available on arXiv: https://arxiv.org/abs/2405.14852.
I was glad to work on this paper with my colleagues:
- Prof. Peter Richtárik, Ivan Ilin, and Kai Yi from the KAUST AI Initiative
- Prof. Dan Alistarh from IST Austria and Neural Magic
- Vladimir Malinovskii and Denis Kuznedelev from Yandex
- Denis Mazur from the Moscow Institute of Physics and Technology
Abstract
There has been significant interest in “extreme” compression of large language models (LLMs), i.e., to 1-2 bits per parameter, which allows such models to be executed efficiently on resource-constrained devices.
Existing work has focused on improved one-shot quantization techniques and weight representations; yet, purely post-training approaches are reaching diminishing returns in terms of the accuracy-vs-bit-width trade-off. State-of-the-art quantization methods such as QuIP# and AQLM include fine-tuning (part of) the compressed parameters over a limited amount of calibration data; however, such fine-tuning techniques over compressed weights often make exclusive use of straight-through estimators (STE), whose performance is not well-understood in this setting. In this work, we question the use of STE for extreme LLM compression, showing that it can be sub-optimal, and perform a systematic study of quantization-aware fine-tuning strategies for LLMs.
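For readers unfamiliar with straight-through estimation, here is a minimal PyTorch sketch of the idea: the forward pass uses the quantized weights, while the backward pass treats quantization as the identity, so gradients flow through the non-differentiable rounding unchanged. The round-to-grid quantizer below is a toy stand-in for a real 1-2 bit codebook, not one of the quantizers studied in the paper.

```python
import torch

class STEQuantize(torch.autograd.Function):
    @staticmethod
    def forward(ctx, weight, scale):
        # Forward: snap weights to a coarse grid (toy scalar quantizer).
        return torch.round(weight / scale) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # Backward: pass gradients through unchanged (the "straight-through"
        # part), ignoring the zero gradient of the rounding operation.
        # `scale` is a plain float, so it receives no gradient.
        return grad_output, None

weight = torch.randn(16, 16, requires_grad=True)
q_weight = STEQuantize.apply(weight, 0.5)
loss = q_weight.sum()
loss.backward()  # weight.grad is all ones, as if no quantization happened
```

The backward pass silently mismatches the forward pass, which is exactly the behavior the paper argues can be sub-optimal in the extreme-compression regime.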
We propose PV-Tuning, a representation-agnostic framework that generalizes and improves upon existing fine-tuning strategies and provides convergence guarantees in restricted cases. On the practical side, when used for 1-2 bit vector quantization, PV-Tuning outperforms prior techniques for highly performant models such as Llama and Mistral. Using PV-Tuning, we achieve the first Pareto-optimal quantization for Llama 2 family models at 2 bits per parameter.
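As a rough intuition for what going "beyond straight-through estimation" can look like, here is a hedged toy sketch of an alternating scheme in the spirit of PV-Tuning: a continuous step updates the codebook with the discrete assignments frozen, and a discrete step re-selects assignments directly, with no STE involved. The scalar codebook and least-squares objective are illustrative assumptions of mine; the actual method optimizes the full model loss over vector-quantized representations, so please refer to the paper and the official implementation linked below for the real algorithm.

```python
import torch

torch.manual_seed(0)

# Toy setup: approximate a target weight vector with a tiny scalar codebook.
target = torch.randn(256)
codebook = torch.randn(4, requires_grad=True)  # continuous parameters
codes = torch.randint(0, 4, (256,))            # discrete assignments
opt = torch.optim.Adam([codebook], lr=1e-2)

def loss_fn():
    return ((codebook[codes] - target) ** 2).mean()

for step in range(200):
    # Continuous step: optimize the codebook with the discrete codes frozen.
    opt.zero_grad()
    loss = loss_fn()
    loss.backward()
    opt.step()

    # Discrete step: re-assign each weight to the codebook entry that
    # minimizes its reconstruction error (a direct discrete update).
    with torch.no_grad():
        dist = (target[:, None] - codebook[None, :]) ** 2  # shape (256, 4)
        codes = dist.argmin(dim=1)

print(f"final loss: {loss_fn().item():.5f}")
```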
Further Links and Media Coverage
- Prof. Dan Alistarh’s Twitter post: https://twitter.com/DAlistarh/status/1796530164215820766
- Prof. Peter Richtárik’s news post on his website: https://richtarik.org/index.html
- The official implementation: https://github.com/Vahe1994/AQLM/tree/pv-tuning