PV-Tuning for Extreme LLM Compression at NeurIPS 24


The paper “PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression” has been accepted at NeurIPS 2024.


Our recent work, “PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression,” has been accepted for presentation and inclusion in the proceedings of the Conference on Neural Information Processing Systems (NeurIPS) 2024. The conference will be held at the Vancouver Convention Center in Vancouver, Canada.

The work has been selected for an oral presentation.

This year, the NeurIPS main track received 15,671 valid paper submissions, of which the program committee accepted 25.8%. Only 61 papers were selected for oral presentation at NeurIPS 2024.

We demonstrated that PV-Tuning improves quantized model accuracy over leading existing approaches, including state-of-the-art methods such as QuIP# and AQLM.

The paper is available on arXiv: https://arxiv.org/abs/2405.14852.

I was glad to work on this paper with my peers.


Abstract

There has been significant interest in “extreme” compression of large language models (LLMs), i.e., to 1-2 bits per parameter, which allows such models to be executed efficiently on resource-constrained devices.

Existing work focused on improved one-shot quantization techniques and weight representations; yet, purely post-training approaches are reaching diminishing returns in terms of the accuracy-vs-bit-width trade-off. State-of-the-art quantization methods such as QuIP# and AQLM include fine-tuning (part of) the compressed parameters over a limited amount of calibration data; however, such fine-tuning techniques over compressed weights often make exclusive use of straight-through estimators (STE), whose performance is not well-understood in this setting. In this work, we question the use of STE for extreme LLM compression, showing that it can be sub-optimal, and perform a systematic study of quantization-aware fine-tuning strategies for LLMs.

We propose PV-Tuning, a representation-agnostic framework that generalizes and improves upon existing fine-tuning strategies and provides convergence guarantees in restricted cases. On the practical side, when used for 1-2 bit vector quantization, PV-Tuning outperforms prior techniques for highly performant models such as Llama and Mistral. Using PV-Tuning, we achieve the first Pareto-optimal quantization for Llama 2 family models at 2 bits per parameter.
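
To make the straight-through estimator (STE) discussion above concrete, here is a minimal, self-contained sketch of STE-based fine-tuning on a toy scalar-quantized regression problem. The problem sizes, the 2-bit codebook, and the learning rate are illustrative assumptions; the paper itself works with vector quantization of full LLMs, not this toy model.

```python
import torch

torch.manual_seed(0)

# Toy setup: fit y = x @ w_true with a scalar-quantized weight vector.
n, d = 256, 16
x = torch.randn(n, d)
y = x @ torch.randn(d)

codebook = torch.linspace(-2.0, 2.0, 4)   # 4 codes ~ 2 bits per weight
w = torch.randn(d, requires_grad=True)    # continuous latent weights

def quantize(v):
    """Map each entry of v to its nearest codebook value."""
    idx = (v.reshape(-1, 1) - codebook.reshape(1, -1)).abs().argmin(dim=1)
    return codebook[idx]

for step in range(200):
    # Forward with quantized weights, backward as if quantization were
    # the identity map: the straight-through estimator (STE).
    w_q = quantize(w.detach()) + (w - w.detach())
    loss = ((x @ w_q - y) ** 2).mean()
    loss.backward()
    with torch.no_grad():
        w -= 0.05 * w.grad
        w.grad = None

print(f"final STE loss: {((x @ quantize(w.detach()) - y) ** 2).mean().item():.4f}")
```

The key line is the `quantize(w.detach()) + (w - w.detach())` trick: the forward pass sees the quantized weights, but the gradient flows to the latent `w` as if quantization were the identity, which is exactly the bias the paper questions.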
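For contrast, here is a sketch of the alternating idea behind PV-Tuning on the same toy task: a continuous gradient step on the code values (the “V” part) with the discrete assignments frozen, alternated with a discrete re-assignment step (the “P” part) with the values frozen. The greedy coordinate-wise search below is a simplified stand-in for the paper's discrete update rule, not the authors' implementation.

```python
import torch

torch.manual_seed(0)

# Same toy regression task as in the STE sketch above.
n, d = 256, 16
x = torch.randn(n, d)
y = x @ torch.randn(d)

codebook = torch.linspace(-2.0, 2.0, 4)       # continuous part "V": code values
idx = torch.randint(0, len(codebook), (d,))   # discrete part "P": assignments

def loss_at(cb, assignment):
    """Loss of the linear model whose weights are cb[assignment]."""
    return ((x @ cb[assignment] - y) ** 2).mean()

for step in range(30):
    # V step: gradient descent on the code values, assignments frozen.
    cb = codebook.clone().requires_grad_(True)
    loss_at(cb, idx).backward()
    with torch.no_grad():
        codebook = codebook - 0.05 * cb.grad
    # P step: with the values frozen, greedily re-pick each weight's code
    # so that the training loss decreases (coordinate-wise search).
    with torch.no_grad():
        for i in range(d):
            losses = []
            for c in range(len(codebook)):
                idx[i] = c
                losses.append(loss_at(codebook, idx).item())
            idx[i] = int(torch.tensor(losses).argmin())

print(f"final PV loss: {loss_at(codebook, idx).item():.4f}")
```

The contrast with the STE sketch is the point: STE pushes a biased gradient through a latent copy of the weights, whereas the alternating scheme optimizes the discrete assignments and the continuous values directly, which is the gap PV-Tuning is designed to close.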



Written on October 2, 2024