
# Installing LLMC

```bash
git clone https://github.com/ModelTC/llmc.git
cd llmc/
pip install -r requirements.txt
```

# Preparing the Model

**LLMC** currently supports only models in Hugging Face format. For example, the `Qwen2-0.5B` model is available [here](https://huggingface.co/Qwen/Qwen2-0.5B). Download instructions can be found [here](https://zhuanlan.zhihu.com/p/663712983).

Users in Mainland China can also use the [Hugging Face mirror](https://hf-mirror.com/).

A simple download example:

```bash
pip install -U hf-transfer

HF_ENDPOINT=https://hf-mirror.com HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download --resume-download Qwen/Qwen2-0.5B --local-dir Qwen2-0.5B
```
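
The same download can also be scripted from Python. The sketch below only assembles the command line and environment variables shown above and leaves running them to the caller; `build_download` and its parameters are illustrative names, not part of LLMC or `huggingface_hub`.

```python
import os
import subprocess


def build_download(repo_id, local_dir, mirror=None, fast_transfer=False):
    """Return (argv, env) for the huggingface-cli download shown above."""
    env = dict(os.environ)
    if mirror:
        # Point the Hub client at a mirror, e.g. https://hf-mirror.com
        env["HF_ENDPOINT"] = mirror
    if fast_transfer:
        # Requires `pip install -U hf-transfer`
        env["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
    argv = [
        "huggingface-cli", "download", "--resume-download",
        repo_id, "--local-dir", local_dir,
    ]
    return argv, env


# Example call (performs the actual download):
# argv, env = build_download("Qwen/Qwen2-0.5B", "Qwen2-0.5B",
#                            mirror="https://hf-mirror.com", fast_transfer=True)
# subprocess.run(argv, env=env, check=True)
```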

# Downloading the Dataset

**LLMC** uses two kinds of datasets: calibration datasets and evaluation datasets. The calibration dataset can be downloaded with [this script](https://github.com/ModelTC/llmc/blob/main/tools/download_calib_dataset.py), and the evaluation dataset with [this script](https://github.com/ModelTC/llmc/blob/main/tools/download_eval_dataset.py).

Alternatively, **LLMC** can download a dataset online if you set `download` to `True` in the config:

```yaml
calib:
    name: pileval
    download: True
```

# Setting Configuration Files

All configuration files can be found [here](https://github.com/ModelTC/llmc/blob/main/configs/), and their options are documented [in this section](https://llmc-en.readthedocs.io/en/latest/configs.html). For example, the SmoothQuant config is available [here](https://github.com/ModelTC/llmc/blob/main/configs/quantization/methods/SmoothQuant/smoothquant_w_a.yml):

```yaml
base:
    seed: &seed 42
model:
    type: Qwen2 # Set model name, supporting models like Llama, Qwen2, Llava, Gemma2, etc.
    path: # Set the model weight path
    torch_dtype: auto
calib:
    name: pileval
    download: False
    path: # Set calibration dataset path
    n_samples: 512
    bs: 1
    seq_len: 512
    preproc: pileval_smooth
    seed: *seed
eval:
    eval_pos: [pretrain, transformed, fake_quant]
    name: wikitext2
    download: False
    path: # Set evaluation dataset path
    bs: 1
    seq_len: 2048
quant:
    method: SmoothQuant
    weight:
        bit: 8
        symmetric: True
        granularity: per_channel
    act:
        bit: 8
        symmetric: True
        granularity: per_token
save:
    save_vllm: True # If set to True, the real quantized integer model is saved for inference with the vLLM engine
    save_trans: False # If set to True, adjusted floating-point weights will be saved
    save_path: ./save
```

For more options and details about `save`, please refer to [this section](https://llmc-en.readthedocs.io/en/latest/configs.html).

**LLMC** provides many [algorithm configuration files](https://github.com/ModelTC/llmc/tree/main/configs/quantization/methods) under the `configs/quantization/methods` path for reference.
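
Before launching a run, it can help to confirm that the empty `path:` fields in a config like the SmoothQuant example above have been filled in, since an empty YAML value is parsed as `None`. The following is a minimal sketch that works on an already parsed config dictionary (for instance one loaded with PyYAML's `yaml.safe_load`, which is an assumption about your tooling); `unset_paths` is a hypothetical helper, not an LLMC function.

```python
def unset_paths(cfg):
    """Return dotted names of path fields that are still empty in a parsed config."""
    missing = []
    # The model weight path is always needed.
    if not (cfg.get("model") or {}).get("path"):
        missing.append("model.path")
    for section in ("calib", "eval"):
        sec = cfg.get(section) or {}
        # A dataset path is only needed when the dataset is not downloaded online.
        if sec and not sec.get("download", False) and not sec.get("path"):
            missing.append(f"{section}.path")
    return missing


# A parsed config with the same shape as the example, paths left empty.
example = {
    "model": {"type": "Qwen2", "path": None},
    "calib": {"name": "pileval", "download": False, "path": None},
    "eval": {"name": "wikitext2", "download": False, "path": None},
}
print(unset_paths(example))
```

Setting `download: True` for a dataset section removes that section from the returned list, which matches the online download option described earlier.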

# Running LLMC

**LLMC** does not need to be installed as a package; simply set the local path of **LLMC** in the [run script](https://github.com/ModelTC/llmc/blob/main/scripts/run_llmc.sh) as follows:

```bash
llmc=/path/to/llmc
export PYTHONPATH=$llmc:$PYTHONPATH
```

Next, set the configuration path in the [run script](https://github.com/ModelTC/llmc/blob/main/scripts/run_llmc.sh) to match the algorithm you want to run. For example, `${llmc}/configs/quantization/methods/SmoothQuant/smoothquant_w_a.yml` is the SmoothQuant quantization configuration file. `task_name` sets the name of the log file that **LLMC** generates during execution.

```bash
task_name=smooth_w_a
config=${llmc}/configs/quantization/methods/SmoothQuant/smoothquant_w_a.yml
```

Once you have modified the LLMC path and config path in the run script, execute it:

```bash
bash run_llmc.sh
```

# Quantization Inference

If the configuration file enables saving a real quantized model, for example with `save_vllm: True`, the saved model can be used directly for inference with the corresponding inference backend. For more details, refer to the `Backend` section of the [documentation](https://llmc-en.readthedocs.io/en/latest).
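
As an illustration for the vLLM backend, the sketch below loads a saved model with vLLM's offline `LLM` class and generates text. The exact directory that LLMC writes under `save_path` is not stated here, so `model_path` is a placeholder you need to fill in; `run_inference` and the sampling values are illustrative choices, not LLMC defaults.

```python
def run_inference(model_path, prompts, max_tokens=64):
    """Generate completions from a saved quantized model with vLLM."""
    # Imported lazily so this file can be loaded on machines without vLLM.
    from vllm import LLM, SamplingParams

    llm = LLM(model=model_path)
    params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=max_tokens)
    outputs = llm.generate(prompts, params)
    # Each result carries one or more completions; keep the first one.
    return [out.outputs[0].text for out in outputs]


# Example call (requires a GPU and an installed vLLM):
# texts = run_inference("/path/to/saved/quantized/model", ["Hello, my name is"])
```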

# FAQ

**<font color=red> Q1 </font>** 

`ValueError: Tokenizer class xxx does not exist or is not currently imported.`

**<font color=green> Solution </font>** 

Upgrade `transformers`: `pip install transformers --upgrade`

**<font color=red> Q2 </font>** 

When you run a large model that does not fit on a single GPU, evaluation runs out of GPU memory.

**<font color=green> Solution </font>** 

Run evaluation block by block by enabling `inference_per_block` in the `eval` section of the config, and increase `bs` as far as GPU memory allows to improve inference speed.

```yaml
bs: 10
inference_per_block: True
```

**<font color=red> Q3 </font>** 

`Exception: ./save/transformed_model existed before. Need check.`

**<font color=green> Solution </font>** 

The save path already exists. Set `save_path` to a directory that does not exist yet.
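
If you launch runs from a script, one way to avoid this error is to pick an unused directory automatically. The following is a minimal sketch; `fresh_save_path` is a hypothetical helper and the `_1`, `_2`, ... suffix scheme is an arbitrary choice, not LLMC behavior.

```python
from pathlib import Path


def fresh_save_path(base):
    """Return base if it does not exist, else the first base_1, base_2, ... that does not."""
    base = Path(base)
    if not base.exists():
        return base
    n = 1
    while True:
        # Keep the parent directory and append a numeric suffix to the last component.
        candidate = base.with_name(f"{base.name}_{n}")
        if not candidate.exists():
            return candidate
        n += 1


print(fresh_save_path("./save"))
```

The returned path can then be written to `save_path` in the config before starting the run.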
