Installing LLMC
git clone https://github.com/ModelTC/llmc.git
cd llmc/
pip install -r requirements.txt
Preparing the Model
LLMC currently supports only models in the Hugging Face format. For example, you can find the Qwen2-0.5B model here. Instructions for downloading can be found here.
Users in Mainland China can also use the Hugging Face mirror.
A simple download looks like this:
pip install -U hf-transfer
HF_ENDPOINT=https://hf-mirror.com HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download --resume-download Qwen/Qwen2-0.5B --local-dir Qwen2-0.5B
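The same download can also be scripted from Python. The sketch below assumes the `huggingface_hub` package is installed and relies on its `snapshot_download` function; `download_model` is a hypothetical helper name, and the mirror endpoint is the one used in the command above. `HF_ENDPOINT` is set before `huggingface_hub` is imported so that the variable is already in place when the library loads.

```python
import os

# Point the Hub client at the mirror before huggingface_hub is imported.
# Remove this line to use the default Hugging Face endpoint.
os.environ.setdefault("HF_ENDPOINT", "https://hf-mirror.com")


def download_model(repo_id: str = "Qwen/Qwen2-0.5B",
                   local_dir: str = "Qwen2-0.5B") -> str:
    """Download a model snapshot and return the local directory path."""
    # Imported lazily so the environment variable above is already set.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir)
```

Calling `download_model()` should place the model files in `Qwen2-0.5B`, which can then be used as the model path in the configuration file.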
Downloading the Dataset
LLMC requires two kinds of datasets: calibration datasets and evaluation datasets. The calibration dataset can be downloaded here and the evaluation dataset can be downloaded here.
LLMC can also download datasets online when download is set to True in the config:
calib:
    name: pileval
    download: True
Setting Configuration Files
All configuration files can be found here, and details on the configuration files can be referenced in this section. For example, the SmoothQuant config is available here.
base:
    seed: &seed 42
model:
    type: Qwen2 # Set model name, supporting models like Llama, Qwen2, Llava, Gemma2, etc.
    path: # Set the model weight path
    torch_dtype: auto
calib:
    name: pileval
    download: False
    path: # Set calibration dataset path
    n_samples: 512
    bs: 1
    seq_len: 512
    preproc: pileval_smooth
    seed: *seed
eval:
    eval_pos: [pretrain, transformed, fake_quant]
    name: wikitext2
    download: False
    path: # Set evaluation dataset path
    bs: 1
    seq_len: 2048
quant:
    method: SmoothQuant
    weight:
        bit: 8
        symmetric: True
        granularity: per_channel
    act:
        bit: 8
        symmetric: True
        granularity: per_token
save:
    save_vllm: True # If set to True, the real quantized integer model is saved for inference with VLLM engine
    save_trans: False # If set to True, adjusted floating-point weights will be saved
    save_path: ./save
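The config above leaves three path fields blank. Before launching a run, it can help to confirm that they were filled in. The following is a minimal sketch, assuming the YAML has already been parsed into a Python dict (for example with PyYAML's `yaml.safe_load`) and assuming a dataset path is only needed when `download` is False. `find_unset_paths` is a hypothetical helper and is not part of LLMC.

```python
def find_unset_paths(cfg: dict) -> list:
    """Return the config keys whose required path is still empty."""
    missing = []
    # The model weight path is always needed.
    if not cfg.get("model", {}).get("path"):
        missing.append("model.path")
    # Assumption: dataset paths are only needed when online download is off.
    for section in ("calib", "eval"):
        sec = cfg.get(section, {})
        if not sec.get("download", False) and not sec.get("path"):
            missing.append(f"{section}.path")
    return missing


# Mirrors the YAML above with the paths left blank (blank YAML values parse as None).
example = {
    "model": {"type": "Qwen2", "path": None, "torch_dtype": "auto"},
    "calib": {"name": "pileval", "download": False, "path": None},
    "eval": {"name": "wikitext2", "download": False, "path": None},
}
print(find_unset_paths(example))  # ['model.path', 'calib.path', 'eval.path']
```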
For more options and details about save, please refer to this section.
LLMC provides many algorithm configuration files under the configs/quantization/methods path for reference.
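To make the `weight` and `act` settings in the SmoothQuant config more concrete, here is a generic illustration of symmetric 8-bit quantization with one scale per row. It is a plain-Python sketch and not LLMC's implementation. It assumes that each row of a weight matrix is one output channel (`per_channel`) and that each row of an activation matrix is one token (`per_token`). The function names and the clamping range are choices made for this example.

```python
def fake_quant_symmetric(values, bit=8):
    """Quantize then dequantize one group of floats with a shared scale."""
    qmax = 2 ** (bit - 1) - 1                 # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    out = []
    for v in values:
        q = round(v / scale)                  # map to the integer grid
        q = max(-qmax - 1, min(qmax, q))      # clamp to the signed integer range
        out.append(q * scale)                 # map back to floating point
    return out


def fake_quant_rows(matrix, bit=8):
    """Quantize each row independently, so every row gets its own scale."""
    return [fake_quant_symmetric(row, bit) for row in matrix]


weights = [[0.5, -1.27, 0.01], [10.0, 2.0, -5.0]]
for original, quantized in zip(weights, fake_quant_rows(weights)):
    print(original, "->", quantized)
```

Because the scale is computed per row, the large values in the second row do not reduce the resolution available to the small values in the first row.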
Running LLMC
LLMC does not require installation; simply modify the local path of LLMC in the run script as follows:
llmc=/path/to/llmc
export PYTHONPATH=$llmc:$PYTHONPATH
You need to modify the configuration path in the run script according to the algorithm you want to run. For example, ${llmc}/configs/quantization/methods/SmoothQuant/smoothquant_w_a.yml refers to the SmoothQuant quantization configuration file. task_name specifies the name of the log file generated by LLMC during execution.
task_name=smooth_w_a
config=${llmc}/configs/quantization/methods/SmoothQuant/smoothquant_w_a.yml
Once you have modified the LLMC path and config path in the run script, execute it:
bash run_llmc.sh
Quantization Inference
If you have set the option to save real quantized models in the configuration file, such as save_vllm: True, then the saved real quantized models can be directly used for inference with the corresponding inference backends. For more details, refer to the Backend section of the documentation.
FAQ
Q1
ValueError: Tokenizer class xxx does not exist or is not currently imported.
Solution
pip install transformers --upgrade
Q2
If you are running a large model that does not fit on a single GPU, GPU memory will run out during evaluation.
Solution
Run inference block by block: turn on inference_per_block, and increase bs as far as GPU memory allows to improve inference speed.
bs: 10
inference_per_block: True
Q3
Exception: ./save/transformed_model existed before. Need check.
Solution
The save path already exists. Change save_path to a directory that does not exist yet.
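One way to avoid this error is to pick an unused directory name before editing `save_path`. The sketch below uses only the Python standard library. `fresh_save_path` is a hypothetical helper, not something LLMC provides, and it only checks whether the top-level directory exists, which is a simplification of the check behind the exception above.

```python
from pathlib import Path


def fresh_save_path(base: str) -> Path:
    """Return `base` if it does not exist yet, else base_1, base_2, ..."""
    root = Path(base)
    candidate = root
    suffix = 1
    while candidate.exists():
        # Keep the parent directory and append a numeric suffix to the name.
        candidate = root.with_name(f"{root.name}_{suffix}")
        suffix += 1
    return candidate


# The printed value can be copied into save_path in the config.
print(fresh_save_path("./save"))
```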