
# AWQ + OmniQuant

OmniQuant uses **Learnable Weight Clipping (LWC)** and **Learnable Equivalent Transformation (LET)** to optimize quantized models, and it often outperforms non-learning-based algorithms. However, its training is unstable and sensitive to hyperparameters, so tuning them takes considerable time. This increases training costs and can still lead to suboptimal results.

To address these issues, we have improved OmniQuant in LLMC. We use AWQ to generate `clipping parameters` and `transformation parameters`, which then serve as the initial values for OmniQuant's `LWC` and `LET`, respectively. This high-quality initialization significantly reduces OmniQuant's training time while improving its accuracy.
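As a rough illustration of what this initialization means for `LWC`, the sketch below fake-quantizes one weight group with OmniQuant-style clipping, where the clipping strengths are sigmoids of learnable raw parameters. It is not LLMC's code: the function names are hypothetical, and mapping an AWQ clip ratio to the raw parameter through an inverse sigmoid is only an assumption about how such an initialization could work.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def logit(p: float) -> float:
    """Inverse of sigmoid: turns a clip ratio in (0, 1) into a raw parameter."""
    return math.log(p / (1.0 - p))


def lwc_fake_quant(weights, raw_gamma, raw_beta, n_bits=4):
    """Asymmetric fake quantization of one weight group with learnable clipping.

    gamma and beta shrink the max and min of the group before the
    quantization step size and zero point are computed.
    """
    gamma, beta = sigmoid(raw_gamma), sigmoid(raw_beta)
    w_max = gamma * max(weights)
    w_min = beta * min(weights)
    levels = 2 ** n_bits - 1
    step = max((w_max - w_min) / levels, 1e-8)  # guard against a zero range
    zero_point = -round(w_min / step)
    dequantized = []
    for w in weights:
        q = min(max(round(w / step) + zero_point, 0), levels)  # clamp to the integer grid
        dequantized.append((q - zero_point) * step)
    return dequantized


# Assumption: AWQ's grid search produced a clip ratio of 0.9 for this group.
awq_clip_ratio = 0.9
raw_init = logit(awq_clip_ratio)  # sigmoid(raw_init) recovers the ratio

weights = [-1.0, -0.5, 0.0, 0.5, 1.0]
dequantized = lwc_fake_quant(weights, raw_init, raw_init, n_bits=4)
print(dequantized)
```

Starting training from `raw_init` instead of an arbitrary value means the learnable clipping begins at the range AWQ already found, which is one plausible reason fewer epochs are needed.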

## 1.1 Weight-only Quantization

For the `w4a16g128` setting, we provide [configuration files combining AWQ and OmniQuant](https://github.com/ModelTC/llmc/tree/main/configs/quantization/combination/awq_comb_omni/w4a16g128).

### 1.1.1 Run AWQ

**Step One**, run the AWQ-related [configuration file](https://github.com/ModelTC/llmc/tree/main/configs/quantization/combination/awq_comb_omni/w4a16g128/step_1_awq.yml). Note that in this step, you need to set the `save_trans` parameter to `True` to save the transformed model.

```yaml
# configs/quantization/combination/awq_comb_omni/w4a16g128/step_1_awq.yml

save:
    # Save the AWQ-transformed model for OmniQuant.
    save_trans: True
    save_fake: False
    save_path: /path/to/save_awq_trans/
```

Set the following variables in `scripts/run_llmc.sh`, then run the script:
```bash
# scripts/run_llmc.sh
llmc=llmc_path
export PYTHONPATH=$llmc:$PYTHONPATH

task_name=step_1_awq
config=${llmc}/configs/quantization/combination/awq_comb_omni/w4a16g128/step_1_awq.yml
```

### 1.1.2 Run OmniQuant

**Step Two**, load the AWQ-transformed model and run the OmniQuant-related [configuration file](https://github.com/ModelTC/llmc/tree/main/configs/quantization/combination/awq_comb_omni/w4a16g128/step_2_omniq.yml). In this step, set the `search_clip_init` parameter to `True` to initialize `LWC` using the `clipping parameters` generated by AWQ grid search.

```yaml
# configs/quantization/combination/awq_comb_omni/w4a16g128/step_2_omniq.yml
model:
    type: model_type
    # Load AWQ-transformed model
    path: /path/to/save_awq_trans/transformed_model
    torch_dtype: auto
```

```yaml
quant:
    special:
        search_clip_init: True
```

Set the following variables in `scripts/run_llmc.sh`, then run the script:
```bash
# scripts/run_llmc.sh

llmc=llmc_path
export PYTHONPATH=$llmc:$PYTHONPATH

task_name=step_2_omniq
config=${llmc}/configs/quantization/combination/awq_comb_omni/w4a16g128/step_2_omniq.yml
```

By running these two steps, LLMC can achieve better results in **weight-only quantization** than those reported in the original OmniQuant [paper](https://arxiv.org/abs/2308.13137). More importantly, LLMC needs only 5 epochs to do so, far fewer than the 20 or 40 epochs required in the [original paper](https://arxiv.org/abs/2308.13137), which significantly reduces training time.

Please note that in **weight-only quantization**, AWQ's `clipping parameters` and `transformation parameters` do not need to be stored for use by OmniQuant; only the transformed model needs to be saved. This is because Learnable Equivalent Transformation (`LET`) mainly addresses outliers in activation quantization, so OmniQuant does not need `LET` in weight-only quantization. Meanwhile, OmniQuant in LLMC automatically handles the initialization of Learnable Weight Clipping (`LWC`) from AWQ's `clipping parameters`.
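To see why a saved transformed model can carry AWQ's transformation on its own, the sketch below illustrates the general idea of an equivalent transformation: dividing each activation channel by a scale and multiplying the matching weight row by the same scale leaves the layer output unchanged. The matrices, the scales, and the helper names are hypothetical, and this is a toy illustration rather than LLMC's implementation.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def apply_scale(x, w, s):
    """Divide activation channel j by s[j] and multiply weight row j by s[j]."""
    x_t = [[v / s[j] for j, v in enumerate(row)] for row in x]
    w_t = [[v * s[j] for v in row] for j, row in enumerate(w)]
    return x_t, w_t


# Toy activations (tokens x channels); the second channel has large values.
x = [[1.0, 20.0], [2.0, -40.0]]
# Toy weights (input channels x output channels).
w = [[0.5, -1.0], [0.25, 0.75]]
# Hypothetical per-channel scales, standing in for searched ones.
s = [1.0, 4.0]

x_t, w_t = apply_scale(x, w, s)
original = matmul(x, w)
transformed = matmul(x_t, w_t)

print(original)     # output of the untransformed layer
print(transformed)  # same output after the scales are folded in
```

Once the scales are folded into the weights (and their inverse into whatever produces the activations), nothing extra has to be passed along, which is consistent with only the transformed model being saved in the weight-only flow.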

## 1.2 Weight-Activation Quantization

For the `w8a8` setting, we provide [configuration files combining AWQ and OmniQuant](https://github.com/ModelTC/llmc/tree/main/configs/quantization/combination/awq_comb_omni/w8a8).

### 1.2.1 Run AWQ


**Step One**, run the AWQ-related [configuration file](https://github.com/ModelTC/llmc/tree/main/configs/quantization/combination/awq_comb_omni/w8a8/step_1_awq.yml). Note that in this step, you need to set the `save_clip` and `save_scale` parameters to `True` to save the `clipping parameters` and `transformation parameters`. Also, make sure to use `learnable` as the weight calibration method since only `learnable` supports saving and loading of the `clipping parameters`.

```yaml
# configs/quantization/combination/awq_comb_omni/w8a8/step_1_awq.yml
quant:
    weight:
        bit: 8
        symmetric: False
        granularity: per_channel
        group_size: -1
        calib_algo: learnable
    act:
        bit: 8
        symmetric: False
        granularity: per_token
        calib_algo: minmax
```

```yaml
save:
    save_scale: True
    scale_path: /path/to/scale
    save_clip: True
    clip_path: /path/to/clip
```

Set the following variables in `scripts/run_llmc.sh`, then run the script:
```bash
# scripts/run_llmc.sh
llmc=llmc_path
export PYTHONPATH=$llmc:$PYTHONPATH

task_name=step_1_awq
config=${llmc}/configs/quantization/combination/awq_comb_omni/w8a8/step_1_awq.yml
```

### 1.2.2 Run OmniQuant

**Step Two**, run the OmniQuant-related [configuration file](https://github.com/ModelTC/llmc/tree/main/configs/quantization/combination/awq_comb_omni/w8a8/step_2_omniq.yml). In this step, the `clipping parameters` and `transformation parameters` generated by AWQ are loaded and used as the starting point for training OmniQuant's `LWC` and `LET`.

```yaml
# configs/quantization/combination/awq_comb_omni/w8a8/step_2_omniq.yml
quant:
    special:
        # Use AWQ's searched clip factors to initialize OmniQuant's clip factors,
        # then refine them through learning (LWC).
        search_clip_init: True
        load_clip: True
        # Must match clip_path in step_1_awq.yml.
        clip_path: /path/to/clip
        # Use AWQ's searched scale factors to initialize OmniQuant's scale factors,
        # then refine them through learning (LET).
        search_scale_init: True
        # Must match scale_path in step_1_awq.yml.
        scale_path: /path/to/scale
```

In this step, set both `search_clip_init` and `search_scale_init` to `True` so that the `clipping parameters` and `transformation parameters` generated by AWQ initialize `LWC` and `LET`, respectively.
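Because Step Two reads what Step One wrote, the two configuration files have to agree on `clip_path` and `scale_path`. The sketch below is a hypothetical helper, not part of LLMC, that checks this on configurations that have already been parsed into dictionaries. It assumes the key layout shown in the snippets above, and the dictionaries here stand in for the parsed YAML files.

```python
def check_handoff(step1_cfg: dict, step2_cfg: dict) -> list:
    """Return a list of problems; an empty list means the hand-off looks consistent."""
    save = step1_cfg.get("save", {})
    special = step2_cfg.get("quant", {}).get("special", {})
    problems = []
    for kind in ("clip", "scale"):
        if not save.get(f"save_{kind}"):
            problems.append(f"step 1: save_{kind} is not True")
        if not special.get(f"search_{kind}_init"):
            problems.append(f"step 2: search_{kind}_init is not True")
        written = save.get(f"{kind}_path")
        read = special.get(f"{kind}_path")
        if written != read:
            problems.append(
                f"{kind}_path mismatch: step 1 writes {written!r}, step 2 reads {read!r}"
            )
    return problems


# Stand-ins for the parsed step_1_awq.yml and step_2_omniq.yml files.
step1 = {"save": {"save_scale": True, "scale_path": "/path/to/scale",
                  "save_clip": True, "clip_path": "/path/to/clip"}}
step2_ok = {"quant": {"special": {
    "search_clip_init": True, "load_clip": True, "clip_path": "/path/to/clip",
    "search_scale_init": True, "scale_path": "/path/to/scale"}}}
# Same settings with the two paths accidentally swapped.
step2_swapped = {"quant": {"special": {
    "search_clip_init": True, "load_clip": True, "clip_path": "/path/to/scale",
    "search_scale_init": True, "scale_path": "/path/to/clip"}}}

print(check_handoff(step1, step2_ok))
print(check_handoff(step1, step2_swapped))
```

Running a check like this before launching Step Two can catch a swapped or stale path early, before any training time is spent.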

Set the following variables in `scripts/run_llmc.sh`, then run the script:
```bash
# scripts/run_llmc.sh
llmc=llmc_path
export PYTHONPATH=$llmc:$PYTHONPATH

task_name=step_2_omniq
config=${llmc}/configs/quantization/combination/awq_comb_omni/w8a8/step_2_omniq.yml
```

By running these two steps, LLMC can achieve better results in **weight-activation quantization** than those reported in the [original paper](https://arxiv.org/abs/2308.13137), and it only requires 5 epochs.