Installing LLMC

git clone https://github.com/ModelTC/llmc.git
cd llmc/
pip install -r requirements.txt

Preparing the Model

LLMC currently supports only Hugging Face format models. For example, you can find the Qwen2-0.5B model here. Instructions for downloading can be found here.

Users in Mainland China can also use the Hugging Face mirror.

A simple download example:

pip install -U hf-transfer

HF_ENDPOINT=https://hf-mirror.com HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download --resume-download Qwen/Qwen2-0.5B --local-dir Qwen2-0.5B
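Before pointing LLMC at the downloaded directory, it can help to confirm that it looks like a Hugging Face checkpoint. The following is a minimal sketch and not part of LLMC: the helper name looks_like_hf_model is hypothetical, and it only assumes the usual Hugging Face layout (a config.json plus *.safetensors or *.bin weight files). LLMC itself may need more than this, such as tokenizer files.

```python
from pathlib import Path


def looks_like_hf_model(model_dir):
    """Return True if model_dir has a config.json and at least one weight file.

    Hypothetical helper: it checks the common Hugging Face layout only,
    not everything LLMC may need.
    """
    d = Path(model_dir)
    has_config = (d / "config.json").is_file()
    # Weights are usually stored as *.safetensors or, for older checkpoints, *.bin
    has_weights = any(d.glob("*.safetensors")) or any(d.glob("*.bin"))
    return has_config and has_weights
```

For the download above, this would be called as looks_like_hf_model("Qwen2-0.5B").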

Downloading the Dataset

LLMC requires two kinds of datasets: a calibration dataset and an evaluation dataset. The calibration dataset can be downloaded here and the evaluation dataset can be downloaded here.

LLMC can also download datasets online when download is set to True in the config:

calib:
    name: pileval
    download: True

Setting Configuration Files

All configuration files can be found here, and the configuration options are described in this section. For example, the SmoothQuant config is available here.

base:
    seed: &seed 42
model:
    type: Qwen2 # Set model name, supporting models like Llama, Qwen2, Llava, Gemma2, etc.
    path: # Set the model weight path
    torch_dtype: auto
calib:
    name: pileval
    download: False
    path: # Set calibration dataset path
    n_samples: 512
    bs: 1
    seq_len: 512
    preproc: pileval_smooth
    seed: *seed
eval:
    eval_pos: [pretrain, transformed, fake_quant]
    name: wikitext2
    download: False
    path: # Set evaluation dataset path
    bs: 1
    seq_len: 2048
quant:
    method: SmoothQuant
    weight:
        bit: 8
        symmetric: True
        granularity: per_channel
    act:
        bit: 8
        symmetric: True
        granularity: per_token
save:
    save_vllm: True # If set to True, the real quantized integer model is saved for inference with the vLLM engine
    save_trans: False # If set to True, adjusted floating-point weights will be saved
    save_path: ./save

For more options and details about save, please refer to this section.

LLMC provides many algorithm configuration files under the configs/quantization/methods path for reference.
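To make the quant settings in the SmoothQuant config above more concrete, here is a minimal pure-Python sketch of what bit: 8 with symmetric: True usually means: each row gets its own scale and no zero-point. With granularity: per_channel a row would be one weight output channel, and with per_token it would be one token's activations. This is an illustration under those assumptions, not LLMC's implementation. The function names are hypothetical, the clamp range [-127, 127] is one common choice, and the sketch covers only the final rounding step, not the smoothing that SmoothQuant applies beforehand.

```python
def quantize_symmetric(rows, bit=8):
    """Quantize each row with its own scale (symmetric, no zero-point).

    Illustrative only: a row stands for one output channel (weights)
    or one token (activations).
    """
    qmax = 2 ** (bit - 1) - 1  # 127 for 8 bits
    q_rows, scales = [], []
    for row in rows:
        max_abs = max(abs(v) for v in row)
        # An all-zero row gets a dummy scale to avoid dividing by zero
        scale = max_abs / qmax if max_abs > 0 else 1.0
        q = [max(-qmax, min(qmax, round(v / scale))) for v in row]
        q_rows.append(q)
        scales.append(scale)
    return q_rows, scales


def dequantize(q_rows, scales):
    """Map integers back to floats using the per-row scales."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]
```

Because every row has its own scale, a single large value only reduces the precision of its own channel or token rather than the whole tensor.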

Running LLMC

LLMC does not need to be installed as a package; simply set the local path of LLMC in the run script as follows:

llmc=/path/to/llmc
export PYTHONPATH=$llmc:$PYTHONPATH

You need to modify the configuration path in the run script according to the algorithm you want to run. For example, ${llmc}/configs/quantization/methods/SmoothQuant/smoothquant_w_a.yml refers to the SmoothQuant quantization configuration file. task_name specifies the name of the log file generated by LLMC during execution.

task_name=smooth_w_a
config=${llmc}/configs/quantization/methods/SmoothQuant/smoothquant_w_a.yml
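The reference configs leave entries such as path: empty for you to fill in, so a quick pre-flight check before launching can save a failed run. The sketch below is a hypothetical helper and not part of LLMC: it scans the config text line by line with a regular expression rather than a YAML parser, and it assumes the two-level layout shown in the SmoothQuant example (top-level sections with indented keys).

```python
import re


def unset_paths(config_text):
    """Return (section, key) pairs whose *path key is empty or only a comment.

    Hypothetical pre-flight helper; assumes top-level sections such as
    model, calib, eval and save, each with indented keys.
    """
    missing = []
    section = None
    for line in config_text.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        # An unindented "name:" line opens a new top-level section
        top = re.match(r"^(\w+):\s*(#.*)?$", line)
        if top:
            section = top.group(1)
            continue
        # An indented key ending in "path" with no value before the comment
        m = re.match(r"^\s+(\w*path):\s*(#.*)?$", line)
        if m:
            missing.append((section, m.group(1)))
    return missing
```

An empty result means every path entry has a value. Note that a dataset path may legitimately stay empty when download is set to True for that section.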

Once you have modified the LLMC path and config path in the run script, execute it:

bash run_llmc.sh

Quantization Inference

If you have set the option to save real quantized models in the configuration file, such as save_vllm: True, then the saved real quantized models can be directly used for inference with the corresponding inference backends. For more details, refer to the Backend section of the documentation.

FAQ

Q1

ValueError: Tokenizer class xxx does not exist or is not currently imported.

Solution

pip install transformers --upgrade

Q2

If you are running a large model and a single GPU cannot hold the entire model, the GPU will run out of memory during evaluation.

Solution

Run inference block by block: turn on inference_per_block, and increase bs as appropriate to improve inference speed without running out of GPU memory.

bs: 10
inference_per_block: True

Q3

Exception: ./save/transformed_model existed before. Need check.

Solution

The save path points to a directory that already exists. Change it to a directory that does not exist yet.
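If you rerun experiments often, one way to avoid this error is to derive an unused directory name before writing it into save_path. The following is a minimal sketch with a hypothetical helper name, fresh_save_path. It is not something LLMC provides, and it only picks a name; it does not create the directory.

```python
from pathlib import Path


def fresh_save_path(base):
    """Return base if it does not exist, else base_1, base_2, ... (first unused)."""
    base = Path(base)
    if not base.exists():
        return base
    n = 1
    while True:
        candidate = base.with_name(f"{base.name}_{n}")
        if not candidate.exists():
            return candidate
        n += 1
```

For example, fresh_save_path("./save") returns save_1 when ./save is already present, and that value can then be used as save_path in the config.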