As language models grow in size and complexity, the need for efficient quantization becomes more pressing. However, one challenge that has emerged is the presence of outlier channels: feature dimensions whose values are orders of magnitude higher than others. While important for model performance, these outlier channels can pose significant challenges for low-bitwidth integer quantization, which is necessary for achieving speed-ups.
Figure 1: (Top) Average absolute activation values of a KV projection layer in a 1B language model trained with (a) standard training, (b) QAT with learned clipping values on the layer's inputs, and (c) QAT on the inputs plus kurtosis regularization on the layer's outputs. For the QAT runs, we show the learned clip value as a green 2D manifold. (Bottom) Parameter values of the individual weights in the same KV projection for each model after training. QAT-only training makes the model's weights harder to quantize, whereas kurtosis regularization mitigates this.
In this blog post, we will discuss the problem of accurate quantization for language models and how to improve it using activation regularization. Specifically, we will cover the following topics:
1. The problem of activation quantization
2. The emergence of outlier channels in language models
3. A simple strategy for mitigating the effect of outlier channels
4. Results and implications
The problem of activation quantization
To enable the use of low-bitwidth integer matrix multiplications, both the activations and the weights of a language model must be quantized. However, large outlier values in the activations cause high quantization error, which in turn leads to poor post-training quantization (PTQ) accuracy.
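To see why a single outlier channel is so damaging, consider symmetric per-tensor quantization, where one scale is shared by every channel. The sketch below (a minimal illustration, not code from the paper; the tensor shapes and the 100x outlier are arbitrary choices) fake-quantizes an activation matrix to INT4 and compares the mean squared error with and without one planted outlier channel.

```python
# Minimal sketch of symmetric per-tensor INT4 fake quantization, showing how
# one outlier channel inflates the scale and hence the error for all channels.
import torch

def quantize_per_tensor(x: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Fake-quantize x with a single symmetric scale shared by the whole tensor."""
    qmax = 2 ** (n_bits - 1) - 1          # e.g. 7 for INT4
    scale = x.abs().max() / qmax          # one scale for every channel
    return torch.round(x / scale).clamp(-qmax - 1, qmax) * scale

torch.manual_seed(0)
acts = torch.randn(1024, 512)             # hypothetical activation matrix
err_clean = (quantize_per_tensor(acts) - acts).pow(2).mean()

acts_outlier = acts.clone()
acts_outlier[:, 0] *= 100.0               # one outlier channel, ~100x larger
err_outlier = (quantize_per_tensor(acts_outlier) - acts_outlier).pow(2).mean()

print(f"MSE without outlier channel: {err_clean.item():.5f}")
print(f"MSE with outlier channel:    {err_outlier.item():.5f}")  # much larger
```

Because the shared scale is set by the outlier's magnitude, the remaining channels are squeezed into only a few of the available integer levels, which is exactly the failure mode described above.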
Existing works have explored various approaches to mitigating the effect of outlier channels on per-tensor activation quantization. However, achieving INT4 activation quantization with PTQ methods remains an open challenge, with current methods still incurring nontrivial degradations in perplexity.
The emergence of outlier channels in language models
From a pretraining perspective, outlier channels are found to emerge relatively early in training, particularly in the output projection of the first layer and the query-key-value projections of the subsequent layers. These outlier channels are also more prevalent in layers connected to the residual stream.
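One simple way to watch this happen during pretraining is to flag channels whose average magnitude dwarfs the typical channel. The helper below is a hypothetical monitoring utility of my own (not from the paper), and the 20x threshold is an illustrative choice.

```python
# Hypothetical helper for spotting outlier channels during training:
# flag channels whose mean |activation| exceeds `ratio` times the median channel.
import torch

def outlier_channels(acts: torch.Tensor, ratio: float = 20.0) -> torch.Tensor:
    """acts: (tokens, hidden_dim). Returns indices of outlier channels."""
    per_channel = acts.abs().mean(dim=0)               # (hidden_dim,)
    return torch.nonzero(per_channel > ratio * per_channel.median()).flatten()

# Example: a fake activation batch with one planted outlier channel.
torch.manual_seed(0)
acts = torch.randn(4096, 768)
acts[:, 42] *= 100.0
print(outlier_channels(acts))  # tensor([42])
```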
A simple strategy for mitigating the effect of outlier channels
To mitigate the effect of outlier channels, a simple strategy is proposed that regularizes a layer's input and output activations. On the input side, a quantization-aware training (QAT) approach is used to learn a clipping value for each activation layer. On the output side, the activations are further regularized via their kurtosis, which discourages the formation of outliers in the first place; see the sketch below.
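The following is a rough sketch of that two-part recipe, assuming PyTorch: a linear layer whose inputs are clipped to a learned value and fake-quantized with a straight-through estimator, plus a kurtosis penalty on its outputs. The class and function names, the initial clip value, and the penalty weight are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch: (i) learned-clip fake INT4 quantization on a layer's inputs (QAT),
# (ii) a kurtosis penalty on its outputs. Hyperparameters are illustrative.
import torch
import torch.nn as nn

class ClippedFakeQuantLinear(nn.Module):
    def __init__(self, d_in: int, d_out: int, n_bits: int = 4, init_clip: float = 6.0):
        super().__init__()
        self.linear = nn.Linear(d_in, d_out)
        self.clip = nn.Parameter(torch.tensor(init_clip))   # learned clip value
        self.qmax = 2 ** (n_bits - 1) - 1

    def fake_quant(self, x: torch.Tensor) -> torch.Tensor:
        # Clip inputs to [-clip, clip], quantize on a grid set by the clip,
        # and use a straight-through estimator so gradients reach x and clip.
        x = torch.minimum(torch.maximum(x, -self.clip), self.clip)
        scale = self.clip / self.qmax
        x_q = torch.round(x / scale) * scale
        return x + (x_q - x).detach()                        # STE

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear(self.fake_quant(x))

def kurtosis_penalty(y: torch.Tensor, target: float = 3.0) -> torch.Tensor:
    # Push the output distribution's kurtosis toward a Gaussian's (3.0),
    # computed over all elements for simplicity.
    y = y - y.mean()
    kurt = (y ** 4).mean() / (y ** 2).mean().clamp_min(1e-12) ** 2
    return (kurt - target) ** 2

# Usage sketch: add the kurtosis penalty to the task loss with a small weight.
layer = ClippedFakeQuantLinear(768, 768)
x = torch.randn(32, 768)
out = layer(x)
loss = out.pow(2).mean() + 1e-3 * kurtosis_penalty(out)   # stand-in task loss
loss.backward()
```

The intuition is that the learned clip keeps the quantization grid tight on the input side, while the kurtosis term keeps the output distribution light-tailed so that the next layer's inputs (and the layer's weights) stay easy to quantize.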
Results and implications
The proposed approach is able to train a W4A4 language model at a reasonable scale (1 billion parameters trained on 20B tokens) that is competitive with the standard-precision W16A16 baseline. This suggests that activation regularization is a viable way to mitigate the effect of outlier channels in language model quantization.
In conclusion, outlier channels pose a significant challenge for language model quantization. However, with a better understanding of their emergence and a simple strategy for activation regularization, we can mitigate their effect and achieve competitive results with low-bitwidth integer quantization.