Presentation
MIXQ: Taming Dynamic Outliers in Mixed-Precision Quantization by Online Prediction
Description
Mixed-precision quantization has been shown to be a promising method for enhancing the efficiency of LLMs. This technique boosts computational efficiency by processing most values on low-precision, high-throughput compute units, and it maintains accuracy by processing outliers in high precision. However, because outliers are dynamic, irregular, and sparse, this approach is far from using the hardware efficiently.
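The split described above can be illustrated with a minimal sketch. It is not MIXQ's kernel: the function names, the symmetric INT8 scheme, and the threshold of 6.0 are assumptions chosen for illustration. Activation channels whose magnitude exceeds the threshold are treated as outliers and multiplied in floating point; everything else goes through an integer dot product.

```python
def quantize_int8(values):
    """Symmetric INT8 quantization (illustrative): returns (ints, scale)."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    return [round(v / scale) for v in values], scale


def mixed_precision_matvec(x, weight, threshold=6.0):
    """Compute weight @ x, keeping outlier channels of x in high precision.

    `threshold` is an assumed, illustrative cutoff for what counts as an outlier.
    """
    outliers = [i for i, v in enumerate(x) if abs(v) > threshold]
    outlier_set = set(outliers)
    # Zero the outlier channels so they do not inflate the quantization scale.
    x_normal = [0.0 if i in outlier_set else v for i, v in enumerate(x)]
    qx, sx = quantize_int8(x_normal)

    out = []
    for row in weight:
        qw, sw = quantize_int8(row)
        acc = sum(a * b for a, b in zip(qx, qw))       # integer accumulation
        low = acc * sx * sw                            # dequantize the low-precision part
        high = sum(x[i] * row[i] for i in outliers)    # outliers stay in floating point
        out.append(low + high)
    return out, outliers


x = [0.5, -1.2, 40.0, 0.3]                 # channel 2 is an outlier
weight = [[0.1, 0.2, -0.3, 0.4],
          [1.0, -0.5, 0.25, 0.0]]
result, outliers = mixed_precision_matvec(x, weight)
reference = [sum(a * b for a, b in zip(x, row)) for row in weight]
```

In this toy case the result stays close to the floating-point reference because the single large channel never enters the INT8 scale. The cost is that the outlier set must be found for every input before quantization can start, which is the inefficiency the abstract points to.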
In this work, we propose MIXQ, an efficient mixed-precision quantization system. Through an in-depth analysis of outlier distribution, we introduce a locality-based outlier prediction algorithm that predicts all outliers for 95.8% of tokens. Building on this accurate prediction, we propose a quantization ahead of detection (QAD) technique that verifies the correctness of the prediction. We also propose a new data structure for efficient outlier processing. Evaluation shows that MIXQ achieves 1.52x and 1.78x speedup over FP16 and Bitsandbytes on 8-bit quantization, and 1.48x, 1.93x, and 6x speedup over QUIK, FP16, and AWQ on 4-bit quantization.
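The abstract does not spell out the prediction or QAD algorithms, so the following is only a sketch of one plausible reading, with hypothetical names throughout. It assumes that locality means the outlier channels of the previous token are reused as the prediction for the next one, and that verification comes for free from the maximum magnitude that quantization has to compute anyway for its scale.

```python
def quantize_ahead_of_detection(x, predicted, threshold=6.0):
    """Quantize x assuming `predicted` holds its outlier channels (hypothetical sketch).

    Returns (qx, scale, outliers, hit). `hit` is True when the prediction covered
    every outlier, which is checked using the max-abs value needed for the scale.
    """
    rest = [0.0 if i in predicted else v for i, v in enumerate(x)]
    max_abs = max((abs(v) for v in rest), default=0.0)
    hit = max_abs <= threshold            # verification as a by-product of quantization
    outliers = set(predicted)
    if not hit:
        # Misprediction: fall back to explicit detection and redo the split.
        outliers |= {i for i, v in enumerate(x) if abs(v) > threshold}
        rest = [0.0 if i in outliers else v for i, v in enumerate(x)]
        max_abs = max((abs(v) for v in rest), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    qx = [round(v / scale) for v in rest]
    return qx, scale, outliers, hit


def run_stream(tokens, threshold=6.0):
    """Carry the outlier set from one token to the next and record hits."""
    predicted, hits = set(), []
    for x in tokens:
        _, _, predicted, hit = quantize_ahead_of_detection(x, predicted, threshold)
        hits.append(hit)
    return hits, predicted


tokens = [
    [0.5, 9.0, 0.1, -0.2],    # channel 1 is an outlier, nothing predicted yet
    [0.3, -8.0, 0.2, 0.1],    # same channel again: prediction holds
    [0.1, 7.5, 12.0, 0.0],    # channel 2 appears: prediction misses
    [0.2, 8.0, -11.0, 0.3],   # both channels predicted
]
hits, final_outliers = run_stream(tokens)
```

On a hit, the expensive detection pass is skipped entirely; on a miss, the sketch pays for detection and a second pass. The predicted set only grows here, which is safe (extra channels are simply handled in high precision) but is a simplification, not something the abstract states.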