Configs’ brief description#
All configurations can be found here
Here’s a brief config example
base:
    seed: &seed 42 # Set random seed
model:
    type: model_type # Type of the model
    path: model path # Path to the model
    tokenizer_mode: fast # Type of the model's tokenizer
    torch_dtype: auto # Data type of the model
calib:
    name: pileval # Name of the calibration dataset
    download: False # Whether to download the calibration dataset online
    path: calib data path # Path to the calibration dataset
    n_samples: 512 # Number of samples in the calibration dataset
    bs: 1 # Batch size for the calibration dataset
    seq_len: 512 # Sequence length for the calibration dataset
    preproc: pileval_smooth # Preprocessing method for the calibration dataset
    seed: *seed # Random seed for the calibration dataset
eval:
    eval_pos: [pretrain, transformed, fake_quant] # Evaluation points
    name: wikitext2 # Name of the evaluation dataset
    download: False # Whether to download the evaluation dataset online
    path: eval data path # Path to the evaluation dataset
    bs: 1 # Batch size for the evaluation dataset
    seq_len: 2048 # Sequence length for the evaluation dataset
    eval_token_consist: False # Whether to evaluate the consistency of tokens between the quantized and original models
quant:
    method: SmoothQuant # Compression method
    weight:
        bit: 8 # Number of quantization bits for weights
        symmetric: True # Whether weight quantization is symmetric
        granularity: per_channel # Granularity of weight quantization
    act:
        bit: 8 # Number of quantization bits for activations
        symmetric: True # Whether activation quantization is symmetric
        granularity: per_token # Granularity of activation quantization
    special: # Special parameters required by the quantization algorithm. Refer to the comments in the configuration file and the original paper for usage.
save:
    save_vllm: False # Whether to save the real quantized model for VLLM inference
    save_sgl: False # Whether to save the real quantized model for Sglang inference
    save_autoawq: False # Whether to save the real quantized model for AutoAWQ inference
    save_mlcllm: False # Whether to save the real quantized model for MLC-LLM inference
    save_trans: False # Whether to save the model after weight transformation
    save_fake: False # Whether to save the fake quantized weights
    save_path: /path/to/save # Save path
Configs’ detailed description#
base#
base.seed
Sets the random seed. This value is used to set all random seeds across the entire framework.
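As a rough illustration of what seeding "all" random sources usually involves, the sketch below seeds Python's random module and, when they are installed, NumPy and PyTorch. The helper name seed_all is hypothetical and is not taken from llmc; the framework's own seeding routine may cover more or fewer sources.

```python
import os
import random


def seed_all(seed: int) -> None:
    """Seed common sources of randomness (hypothetical helper, not llmc's API)."""
    # Only affects Python processes started after this point.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    try:
        import numpy as np

        np.random.seed(seed)
    except ImportError:
        pass  # NumPy is not installed; nothing to seed
    try:
        import torch

        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # silently ignored when CUDA is unavailable
    except ImportError:
        pass  # PyTorch is not installed; nothing to seed


seed_all(42)
first = random.random()
seed_all(42)
second = random.random()
print(first == second)  # the same seed reproduces the same draw
```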
model#
model.type
The type of the model. Llama, Qwen2, Llava, Gemma2, and other models are supported. All models supported by llmc are listed here.
model.path
Currently, LLMC only supports models in Hugging Face format, and you can use the following code to check whether the model can be loaded normally.
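from transformers import AutoModelForCausalLM, AutoConfig

model_path = "/path/to/model"  # replace with the path to your model

# Load the config first, allowing custom model code from the checkpoint
model_config = AutoConfig.from_pretrained(
    model_path, trust_remote_code=True
)
# Load the weights with the data type stored in the checkpoint
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    config=model_config,
    trust_remote_code=True,
    torch_dtype="auto",
    low_cpu_mem_usage=True,
)
print(model)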
If the above code cannot load your model, the likely causes are:
Your model is not in Hugging Face format.
Your version of transformers is too old. Run pip install transformers --upgrade to upgrade it.
Before running llmc, make sure that the above code loads your model successfully; otherwise, llmc will not be able to load your model either.
model.tokenizer_mode
Choose whether to use a Slow or Fast tokenizer
model.torch_dtype
You can set the data types of model weights:
auto
torch.float16
torch.bfloat16
torch.float32
where auto will follow the original data type setting of the weight file
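As a rough guide to how this choice affects memory, float16 and bfloat16 store each parameter in 2 bytes, while float32 uses 4. The sketch below estimates the size of the weights alone; the helper name weight_gib and the 7-billion-parameter figure are illustrative assumptions, and auto is left out because its size depends on the checkpoint.

```python
# Bytes needed to store one parameter in each explicit data type.
BYTES_PER_PARAM = {
    "torch.float16": 2,
    "torch.bfloat16": 2,
    "torch.float32": 4,
}


def weight_gib(n_params: int, torch_dtype: str) -> float:
    """Estimate weight storage in GiB (hypothetical helper; ignores activations)."""
    return n_params * BYTES_PER_PARAM[torch_dtype] / 1024**3


for dtype in BYTES_PER_PARAM:
    print(dtype, round(weight_gib(7_000_000_000, dtype), 1), "GiB")
```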
calib#
calib.name
The name of the calibration dataset. The following calibration datasets are currently supported:
pileval
wikitext2
c4
ptb
custom
where custom indicates a user-defined calibration dataset. Refer to the Custom Calibration Dataset section of the advanced usage document for specific instructions.
calib.download
Indicates whether the calibration dataset is downloaded online at runtime.
If set to True, calib.path does not need to be set; llmc downloads the dataset automatically.
If set to False, calib.path must be set; llmc reads the dataset from that path and does not need an Internet connection.
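The rule above can be expressed as a small consistency check on the calib section. This is a minimal sketch: check_dataset_cfg is a hypothetical helper rather than part of llmc, and treating a missing download key as False is an assumption made here.

```python
def check_dataset_cfg(cfg: dict, section: str = "calib") -> None:
    """Raise ValueError when download is False but no path is given."""
    download = cfg.get("download", False)  # assumption: a missing key means False
    path = cfg.get("path")
    if not download and not path:
        raise ValueError(
            f"{section}.download is False, so {section}.path must be set"
        )


# Both of these configurations are consistent.
check_dataset_cfg({"name": "pileval", "download": True})
check_dataset_cfg({"name": "pileval", "download": False, "path": "/data/pileval"})

# This one is not: the dataset is neither downloaded nor available locally.
try:
    check_dataset_cfg({"name": "pileval", "download": False})
except ValueError as err:
    print(err)
```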
calib.path
If calib.download is set to False, calib.path must point to the directory where the calibration dataset is stored.
The data stored in this path must be a dataset in Arrow format.
To download a dataset from Hugging Face and save it in Arrow format, you can use the following code:
from datasets import load_dataset
calib_dataset = load_dataset(...)
calib_dataset.save_to_disk(...)
A dataset saved in this format can be loaded with:
from datasets import load_from_disk
data = load_from_disk(...)
LLMC provides download scripts for the datasets above.
The calibration dataset can be downloaded here.
The command is python download_calib_dataset.py --save_path [calib dataset save path]
The test dataset can be downloaded here.
The command is python download_eval_dataset.py --save_path [eval dataset save path]
To use other datasets, follow the Arrow-format download method above and adapt it as needed.
calib.n_samples
The number of samples selected from the dataset for calibration.
calib.bs
The batch size for the calibration data. If set to -1, all the data is packed into a single batch.
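The batching rule can be sketched as follows. make_batches is a hypothetical helper written for illustration; it is not llmc's data loader and only mirrors the behavior described above.

```python
from typing import List, Sequence


def make_batches(samples: Sequence, bs: int) -> List[list]:
    """Split samples into batches of size bs; bs == -1 yields one batch."""
    if bs == -1:
        return [list(samples)]
    if bs < 1:
        raise ValueError("bs must be a positive integer or -1")
    return [list(samples[i:i + bs]) for i in range(0, len(samples), bs)]


samples = list(range(512))  # stands in for n_samples = 512 calibration samples
print(len(make_batches(samples, 1)))   # one sample per batch
print(len(make_batches(samples, -1)))  # everything in a single batch
```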
calib.seq_len
The sequence length of the calibration data
calib.preproc
The preprocessing method for the calibration data. llmc currently implements the following preprocessing methods:
wikitext2_gptq
ptb_gptq
c4_gptq
pileval_awq
pileval_smooth
pileval_omni
general
random_truncate_txt
With the exception of general, the rest of the preprocessing can be found here
general is implemented in the general_preproc function in the base_dataset
calib.seed
The random seed used in data preprocessing. By default, it follows the base.seed setting.
eval#
eval.eval_pos
The positions at which the model is evaluated. Three positions are currently supported:
pretrain
transformed
fake_quant
eval_pos must be a list. The list can be empty, in which case no evaluation is performed.
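A config loader could reject malformed values early. The sketch below is a hypothetical check based only on the rules stated above; it is not llmc's own validation code.

```python
ALLOWED_EVAL_POS = ("pretrain", "transformed", "fake_quant")


def check_eval_pos(eval_pos):
    """Return eval_pos unchanged if it is a list of supported positions."""
    if not isinstance(eval_pos, list):
        raise TypeError("eval.eval_pos must be a list")
    unknown = [pos for pos in eval_pos if pos not in ALLOWED_EVAL_POS]
    if unknown:
        raise ValueError(f"unsupported eval positions: {unknown}")
    return eval_pos


print(check_eval_pos(["pretrain", "fake_quant"]))
print(check_eval_pos([]))  # an empty list is valid and disables evaluation
```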
eval.name
The name of the evaluation dataset. The following test datasets are currently supported:
wikitext2
c4
ptb
For details about how to download the test datasets, see the download scripts described under calib.path.
eval.download
Indicates whether the eval dataset needs to be downloaded online at runtime, see calib.download
eval.path
Refer to calib.path
eval.bs
Eval batch size
eval.seq_len
The sequence length of the eval data
eval.inference_per_block
If the model is too large to fit in the memory of a single GPU during evaluation, enable inference_per_block. You can then increase bs as far as GPU memory allows to speed up inference.
Here’s a config example
bs: 10
inference_per_block: True
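The idea behind block-wise inference can be pictured with a toy simulation: if only one block is resident at a time and every batch passes through it before the next block is loaded, peak memory is bounded by the largest block rather than by the whole model. This is a conceptual sketch under that assumption; the names and sizes are made up, and llmc's actual mechanism is not shown here.

```python
from typing import Callable, List, Tuple

Block = Tuple[int, Callable[[float], float]]  # (size in MB, forward function)


def run_per_block(blocks: List[Block], batches: List[List[float]]):
    """Run all batches through one block at a time and track peak resident size."""
    peak_mb = 0
    for size_mb, forward in blocks:
        peak_mb = max(peak_mb, size_mb)  # only the current block is resident
        batches = [[forward(x) for x in batch] for batch in batches]
        # the block would be offloaded here before the next one is loaded
    return batches, peak_mb


blocks = [(400, lambda x: x + 1), (400, lambda x: x * 2), (400, lambda x: x - 3)]
outputs, peak = run_per_block(blocks, [[1.0, 2.0], [3.0]])
total = sum(size for size, _ in blocks)
print(outputs, "peak:", peak, "MB vs whole model:", total, "MB")
```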
Eval multiple datasets at the same time
LLMC also supports the simultaneous evaluation of multiple datasets
Below is an example of evaluating a single wikitext2 dataset
eval:
    name: wikitext2
    path: wikitext2 path
Here’s an example of evaluating multiple datasets
eval:
    name: [wikitext2, c4, ptb]
    path: The common upper directory of these data sets
Note that when evaluating multiple datasets, name must be given as a list, and the datasets must follow this directory layout:
upper-level directory
    wikitext2
    c4
    ptb
If you use the LLMC download script directly, the shared upper-level directory is the dataset storage path specified by --save_path.
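Under the layout described above, the directory of each dataset can be derived from name and path. The helper below is a hypothetical sketch of that mapping, not llmc's loader, and the /data/eval paths are placeholders.

```python
import os
from typing import Dict, List, Union


def dataset_dirs(name: Union[str, List[str]], path: str) -> Dict[str, str]:
    """Map each dataset name to the directory it is expected to live in."""
    if isinstance(name, str):
        return {name: path}  # single dataset: path points at it directly
    # several datasets: path is the shared upper-level directory
    return {n: os.path.join(path, n) for n in name}


print(dataset_dirs("wikitext2", "/data/eval/wikitext2"))
print(dataset_dirs(["wikitext2", "c4", "ptb"], "/data/eval"))
```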
quant#
quant.method
The name of the quantization algorithm to use. All quantization algorithms supported by LLMC can be viewed here.
quant.weight
Quantization settings for weights
quant.weight.bit
The number of bits used to quantize the weights
quant.weight.symmetric
Whether weight quantization is symmetric
quant.weight.granularity
The granularity of weight quantization. The following granularities are supported (written with underscores in the config, for example per_channel):
per tensor
per channel
per group
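The granularity decides how many values share one quantization scale. The sketch below uses a common symmetric quantize-then-dequantize formulation in plain Python to show the effect; it is not llmc's implementation, it assumes each row of the weight matrix is one output channel, and the group_size parameter is an assumption introduced for this example.

```python
from typing import List


def fake_quant(values: List[float], bit: int = 8) -> List[float]:
    """Symmetric quantize-dequantize of values that share a single scale."""
    qmax = 2 ** (bit - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def quant_weight(w, granularity: str, bit: int = 8, group_size: int = 2):
    """Apply fake quantization with one scale per tensor, row, or group."""
    if granularity == "per_tensor":
        n = len(w[0])
        flat = fake_quant([v for row in w for v in row], bit)
        return [flat[i * n:(i + 1) * n] for i in range(len(w))]
    if granularity == "per_channel":
        return [fake_quant(row, bit) for row in w]
    if granularity == "per_group":
        return [
            [q for i in range(0, len(row), group_size)
             for q in fake_quant(row[i:i + group_size], bit)]
            for row in w
        ]
    raise ValueError(f"unknown granularity: {granularity}")


def row_errors(w, wq):
    """Largest absolute error in each row."""
    return [max(abs(a - b) for a, b in zip(r, rq)) for r, rq in zip(w, wq)]


# Row 0 holds small values, row 1 holds values 100 times larger.
w = [[0.01, -0.02, 0.03, 0.04], [1.0, -2.0, 3.0, 4.0]]
for g in ("per_tensor", "per_channel", "per_group"):
    print(g, row_errors(w, quant_weight(w, g, bit=4)))
```

With a single per-tensor scale, the large row dictates the step size and the small row loses most of its precision; finer granularities give each row or group its own scale.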
quant.act
Quantization settings for activations
quant.act.bit
The number of bits used to quantize the activations
quant.act.symmetric
Whether activation quantization is symmetric
quant.act.granularity
The granularity of activation quantization. The following granularities are supported (written with underscores in the config, for example per_token):
per tensor
per token
per head
If quant.method is set to RTN, activation quantization also supports a static per-tensor setting. The following is a W8A8 configuration with static per-tensor activation quantization:
quant:
    method: RTN
    weight:
        bit: 8
        symmetric: True
        granularity: per_channel
    act:
        bit: 8
        symmetric: True
        granularity: per_tensor
        static: True
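In the usual meaning of the terms, static quantization fixes the activation scale ahead of time from calibration data, whereas dynamic quantization recomputes the scale from each incoming input. The sketch below contrasts the two in plain Python; it illustrates the general idea under that assumption, uses made-up activation values, and is not llmc's code.

```python
def scale_for(values, bit=8):
    """Symmetric scale that maps the largest magnitude to the top integer level."""
    qmax = 2 ** (bit - 1) - 1
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize(values, scale, bit=8):
    """Round to integer levels and clamp to the symmetric range."""
    qmax = 2 ** (bit - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


# Static per-tensor: one scale computed once from calibration activations.
calib_acts = [[0.5, -1.0, 2.0], [4.0, -0.25, 1.5]]
static_scale = scale_for([v for token in calib_acts for v in token])

# At inference time the static scale is reused as is, while a dynamic
# per-token scheme derives a fresh scale from the token itself.
new_token = [0.1, -0.2, 0.3]
static_q = quantize(new_token, static_scale)
dynamic_q = quantize(new_token, scale_for(new_token))
print("static: ", static_q)
print("dynamic:", dynamic_q)
```

The static variant avoids computing a scale at inference time, at the cost of using fewer integer levels when the input is smaller than what was seen during calibration.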
sparse#
sparse.method
The name of the sparsification algorithm used. This includes both model sparsification and reduction of visual tokens. All supported algorithms can be found in the corresponding files.
It’s worth noting that for model sparsification, you need to specify the exact algorithm name, whereas for token reduction, you only need to set it to TokenReduction first, and then specify the exact algorithm under special.
sparse:
    method: Wanda

sparse:
    method: TokenReduction
    special:
        method: FastV
save#
save.save_vllm
Whether to save as a VLLM inference backend-supported real quantized model.
When this option is enabled, the saved model weights will significantly shrink (real quantization), and it can be directly loaded for inference using the VLLM backend. This improves inference speed and reduces memory usage. For more details on the VLLM inference backend, refer to this section.
save.save_sgl
Whether to save as a Sglang inference backend-supported real quantized model.
When this option is enabled, the saved model weights will significantly shrink (real quantization), and it can be directly loaded for inference using the Sglang backend. This improves inference speed and reduces memory usage. For more details on the Sglang inference backend, refer to this section.
save.save_autoawq
Whether to save as an AutoAWQ inference backend-supported real quantized model.
When this option is enabled, the saved model weights will significantly shrink (real quantization), and it can be directly loaded for inference using the AutoAWQ backend. This improves inference speed and reduces memory usage. For more details on the AutoAWQ inference backend, refer to this section.
save.save_mlcllm
Whether to save as an MLC-LLM inference backend-supported real quantized model.
When this option is enabled, the saved model weights will significantly shrink (real quantization), and it can be directly loaded for inference using the MLC-LLM backend. This improves inference speed and reduces memory usage. For more details on the MLC-LLM inference backend, refer to this section.
save.save_trans
Whether to save the adjusted model weights.
The saved weights are adjusted to be more suitable for quantization, possibly containing fewer outliers. They are still saved in fp16/bf16 format (with the same file size as the original model). When deploying the model in the inference engine, the engine’s built-in naive quantization needs to be used to achieve quantized inference.
Unlike save_vllm and similar options, this option requires the inference engine to perform real quantization, while llmc provides a floating-point model weight that is more suitable for quantization.
For example, the save_trans models exported by algorithms such as SmoothQuant, Os+, AWQ, and Quarot have fewer outliers and are more suitable for quantization.
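The kind of adjustment described above can be illustrated with a SmoothQuant-style rescaling: dividing each activation channel by a factor and multiplying the matching weight row by the same factor leaves the layer output unchanged while shrinking activation outliers. The numbers and scaling factors below are chosen by hand for illustration; real algorithms derive them from calibration data, and this sketch is not llmc's implementation.

```python
def matvec(x, w):
    """Compute x @ w for a vector x and a matrix w given as rows per input channel."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


x = [100.0, 0.5]                  # input channel 0 is an activation outlier
w = [[0.01, 0.02], [1.0, -1.0]]   # one row per input channel
s = [10.0, 1.0]                   # hand-picked per-channel smoothing factors

x_smooth = [xi / si for xi, si in zip(x, s)]
w_smooth = [[wij * si for wij in row] for row, si in zip(w, s)]

print("original output:   ", matvec(x, w))
print("transformed output:", matvec(x_smooth, w_smooth))
print("largest activation:", max(map(abs, x)), "->", max(map(abs, x_smooth)))
```

The transformed weights remain floating-point, which matches the description above: the file size is unchanged, and the inference engine still has to perform the actual quantization.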
save.save_fake
Whether to save the fake quantized model.
save.save_path
The path where the model is saved. This path must be a new, non-existent directory; otherwise, LLMC terminates the run with an error message.
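To avoid a failed run, the path can be checked before launching. The helper below is a hypothetical pre-flight check, not part of llmc; it only mirrors the rule stated above and uses a temporary directory for the demonstration.

```python
import os
import tempfile


def check_save_path(save_path: str) -> None:
    """Raise FileExistsError if save_path already exists."""
    if os.path.exists(save_path):
        raise FileExistsError(f"{save_path} already exists; choose a new directory")


with tempfile.TemporaryDirectory() as tmp:
    fresh = os.path.join(tmp, "llmc_out")
    check_save_path(fresh)  # passes: the directory does not exist yet
    os.makedirs(fresh)
    try:
        check_save_path(fresh)  # fails: the directory now exists
    except FileExistsError as err:
        print("rejected:", err)
```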