Configs’ brief description#

All configurations can be found here

Here’s a brief config example

base:
    seed: &seed 42 # Set random seed
model:
    type: model_type # Type of the model
    path: model path # Path to the model
    tokenizer_mode: fast # Type of the model's tokenizer
    torch_dtype: auto # Data type of the model
calib:
    name: pileval # Name of the calibration dataset
    download: False # Whether to download the calibration dataset online
    path: calib data path # Path to the calibration dataset
    n_samples: 512 # Number of samples in the calibration dataset
    bs: 1 # Batch size for the calibration dataset
    seq_len: 512 # Sequence length for the calibration dataset
    preproc: pileval_smooth # Preprocessing method for the calibration dataset
    seed: *seed # Random seed for the calibration dataset
eval:
    eval_pos: [pretrain, transformed, fake_quant] # Evaluation points
    name: wikitext2 # Name of the evaluation dataset
    download: False # Whether to download the evaluation dataset online
    path: eval data path # Path to the evaluation dataset
    bs: 1 # Batch size for the evaluation dataset
    seq_len: 2048 # Sequence length for the evaluation dataset
    eval_token_consist: False # Whether to evaluate the consistency of tokens between the quantized and original models
quant:
    method: SmoothQuant # Compression method
    weight:
        bit: 8 # Number of quantization bits for weights
        symmetric: True # Whether weight quantization is symmetric
        granularity: per_channel # Granularity of weight quantization
    act:
        bit: 8 # Number of quantization bits for activations
        symmetric: True # Whether activation quantization is symmetric
        granularity: per_token # Granularity of activation quantization
    special: # Special parameters required for the quantization algorithm. Refer to the comments in the configuration file and the original paper for usage.
save:
    save_vllm: False # Whether to save the real quantized model for VLLM inference
    save_sgl: False # Whether to save the real quantized model for Sglang inference
    save_autoawq: False # Whether to save the real quantized model for AutoAWQ inference
    save_mlcllm: False # Whether to save the real quantized model for MLC-LLM inference
    save_trans: False # Whether to save the model after weight transformation
    save_fake: False # Whether to save the fake quantized weights
    save_path: /path/to/save # Save path

Configs’ detailed description#

base#

base.seed

Sets the random seed, which is applied to all random number generators across the entire framework.
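As a rough illustration of what a framework-wide seed usually covers, the sketch below seeds Python's random module and, when they are installed, NumPy and PyTorch. The helper name seed_all is hypothetical and is not taken from llmc's source.

```python
import random


def seed_all(seed: int) -> None:
    """Seed the common random number generators (hypothetical helper, not llmc's API)."""
    random.seed(seed)
    try:
        import numpy as np

        np.random.seed(seed)
    except ImportError:
        pass  # NumPy is optional for this sketch
    try:
        import torch

        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass  # PyTorch is optional for this sketch


# Re-seeding with the same value reproduces the same random sequence.
seed_all(42)
first = random.random()
seed_all(42)
second = random.random()
print(first == second)  # True
```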

model#

model.type

The type of the model. Supported models include Llama, Qwen2, Llava, Gemma2, and others. You can check all the models supported by llmc here.

model.path

Currently, LLMC only supports models in Hugging Face format, and you can use the following code to check whether the model can be loaded normally.

from transformers import AutoModelForCausalLM, AutoConfig


model_path = "/path/to/model"  # Replace with the path to your model
model_config = AutoConfig.from_pretrained(
    model_path, trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    config=model_config,
    trust_remote_code=True,
    torch_dtype="auto",
    low_cpu_mem_usage=True,
)

print(model)

If the code above fails to load your model, the likely causes are:

Your model is not in Hugging Face format.
Your version of transformers is too old. You can upgrade it with pip install transformers --upgrade.

Before llmc runs, make sure that the above code can load your model successfully, otherwise llmc will not be able to load your model.

model.tokenizer_mode

Choose whether to use a Slow or Fast tokenizer

model.torch_dtype

You can set the data types of model weights:

auto
torch.float16
torch.bfloat16
torch.float32

where auto will follow the original data type setting of the weight file

calib#

calib.name

The name of the calibration dataset. The following calibration datasets are currently supported:

pileval
wikitext2
c4
ptb
custom

Here, custom means a user-defined calibration dataset. For specific instructions, refer to the Custom Calibration Dataset section of the advanced usage document.

calib.download

Indicates whether the calibration dataset needs to be downloaded online at runtime

If set to True, you do not need to set calib.path; llmc will download the dataset automatically.

If set to False, you must set calib.path. llmc will read the dataset from that path and can run without internet access.
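The rule above can be sketched as a small validation step. The helper check_dataset_cfg and the example path /data/pileval are hypothetical and used only for illustration; llmc's own checks may differ.

```python
def check_dataset_cfg(cfg: dict) -> None:
    """Validate the download/path rule for a calib or eval section.

    Hypothetical helper for illustration; llmc's own checks may differ.
    """
    if cfg.get("download", False):
        return  # the dataset is fetched online, so no local path is required
    if not cfg.get("path"):
        raise ValueError("download is False, so path must point to a local dataset")


# Both of these configurations satisfy the rule.
check_dataset_cfg({"name": "pileval", "download": True})
check_dataset_cfg({"name": "pileval", "download": False, "path": "/data/pileval"})
```

Calling the helper with download set to False and no path raises a ValueError.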

calib.path

If calib.download is set to False, you need to set calib.path, which indicates the path where the calibration dataset is stored

The data stored in this path must be a dataset in arrow format

To download the dataset in Arrow format from Hugging Face, you can use the following code

from datasets import load_dataset
calib_dataset = load_dataset(...)
calib_dataset.save_to_disk(...)

Datasets saved in this format can then be loaded with:

from datasets import load_from_disk
data = load_from_disk(...)

LLMC provides download scripts for the datasets listed above.

The calibration dataset can be downloaded here.

The execution command is python download_calib_dataset.py --save_path [calib dataset save path]

The test dataset can be downloaded here.

The execution command is python download_eval_dataset.py --save_path [eval dataset save path]

If you want to use more datasets, you can refer to the download method of the arrow format dataset above and modify it yourself

calib.n_samples

The number of samples selected from the dataset for calibration.
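One common way to pick a fixed number of calibration samples reproducibly is seeded random sampling. The sketch below assumes that strategy; pick_calib_samples is a hypothetical helper, and llmc's preprocessing methods may select samples differently.

```python
import random


def pick_calib_samples(dataset, n_samples, seed):
    """Pick up to n_samples items reproducibly (hypothetical helper, not llmc's API)."""
    rng = random.Random(seed)  # local generator, leaves the global random state untouched
    return rng.sample(dataset, min(n_samples, len(dataset)))


texts = [f"document {i}" for i in range(1000)]
subset = pick_calib_samples(texts, n_samples=512, seed=42)
print(len(subset))  # 512
```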

calib.bs

The batch size used for the calibration data. If set to -1, all the data is packed into a single batch.
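The effect of the batch size, including the special value -1, can be illustrated with a minimal sketch. The function make_batches is hypothetical and only mirrors the behaviour described above.

```python
def make_batches(samples: list, bs: int) -> list:
    """Split samples into batches of size bs; bs == -1 yields one batch with everything."""
    if bs == -1:
        return [samples]
    if bs <= 0:
        raise ValueError("bs must be a positive integer or -1")
    return [samples[i:i + bs] for i in range(0, len(samples), bs)]


samples = list(range(5))
print(make_batches(samples, 2))   # [[0, 1], [2, 3], [4]]
print(make_batches(samples, -1))  # [[0, 1, 2, 3, 4]]
```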

calib.seq_len

The sequence length of the calibration data

calib.preproc

The preprocessing method for the calibration data. llmc currently implements the following preprocessing methods:

wikitext2_gptq
ptb_gptq
c4_gptq
pileval_awq
pileval_smooth
pileval_omni
general
random_truncate_txt

With the exception of general, the implementations of these preprocessing methods can be found here.

general is implemented in the general_preproc function in base_dataset.

calib.seed

The random seed in the data preprocessing follows the base.seed setting by default

eval#

eval.eval_pos

The positions at which the model is evaluated. Three positions are currently supported:

pretrain
transformed
fake_quant

eval_pos must be a list. The list can be empty, in which case no evaluation is performed.
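A minimal sketch of this rule is shown below. The helper check_eval_pos is hypothetical; it only encodes the three position names listed above and the empty-list behaviour.

```python
VALID_EVAL_POS = ("pretrain", "transformed", "fake_quant")


def check_eval_pos(eval_pos) -> bool:
    """Validate eval_pos and report whether any evaluation will run.

    Hypothetical helper for illustration; not part of llmc.
    """
    if not isinstance(eval_pos, list):
        raise TypeError("eval_pos must be a list")
    unknown = [pos for pos in eval_pos if pos not in VALID_EVAL_POS]
    if unknown:
        raise ValueError(f"unsupported eval positions: {unknown}")
    return len(eval_pos) > 0


print(check_eval_pos(["pretrain", "fake_quant"]))  # True
print(check_eval_pos([]))                          # False
```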

eval.name

The name of the evaluation dataset. The following evaluation datasets are currently supported:

wikitext2
c4
ptb

For details on how to download the evaluation datasets, see the download instructions under calib.path.

eval.download

Indicates whether the eval dataset needs to be downloaded online at runtime, see calib.download

eval.path

Refer to calib.path

eval.bs

Eval batch size

eval.seq_len

The sequence length of the eval data

eval.inference_per_block

If the model is too large to fit in the memory of a single GPU during evaluation, enable inference_per_block. You can then increase bs as far as GPU memory allows to speed up inference.

Here’s a config example

bs: 10
inference_per_block: True

Eval multiple datasets at the same time

LLMC also supports the simultaneous evaluation of multiple datasets

Below is an example of evaluating a single wikitext2 dataset

eval:
    name: wikitext2
    path: wikitext2 path

Here’s an example of evaluating multiple datasets

eval:
    name: [wikitext2, c4, ptb]
    path: The common upper directory of these data sets

Note that when evaluating multiple datasets, name must be given as a list, and the datasets must be laid out in the following directory structure:

upper-level directory
- wikitext2
- c4
- ptb

If you use the LLMC download script directly, the shared upper-level directory is the dataset storage path specified by --save_path.
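The directory rule can be sketched as follows. The sketch assumes that a single name uses path directly, while a list of names expects one subdirectory per dataset under path. The helper resolve_eval_paths and the example paths are hypothetical.

```python
import os


def resolve_eval_paths(name, path: str) -> dict:
    """Map each evaluation dataset name to its directory.

    Assumption for this sketch: a single name uses `path` directly, while a
    list of names expects one subdirectory per dataset under `path`.
    """
    if isinstance(name, str):
        return {name: path}
    return {n: os.path.join(path, n) for n in name}


print(resolve_eval_paths("wikitext2", "/data/wikitext2"))
print(resolve_eval_paths(["wikitext2", "c4", "ptb"], "/data/eval"))
```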

quant#

quant.method

The name of the quantization algorithm to use. All quantization algorithms supported by LLMC can be viewed here.

quant.weight

Quantization settings for weights

quant.weight.bit

The number of quantization bits for the weights

quant.weight.symmetric

Whether weight quantization is symmetric
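To make the difference concrete, the sketch below applies generic uniform quantization to a small list of values in both modes. Symmetric quantization fixes the zero point at 0 and uses a signed integer range, while asymmetric quantization shifts the grid with a non-zero zero point to cover the observed minimum and maximum. This is a textbook formulation with hypothetical helper names, not llmc's implementation.

```python
def quant_params(values, bit, symmetric):
    """Return (scale, zero_point, qmin, qmax) for uniform integer quantization."""
    lo, hi = min(values), max(values)
    if symmetric:
        # Signed range centred on zero; the zero point is fixed at 0.
        qmin, qmax = -(2 ** (bit - 1)), 2 ** (bit - 1) - 1
        scale = max(abs(lo), abs(hi)) / qmax
        zero_point = 0
    else:
        # Unsigned range; the zero point shifts the grid to cover [lo, hi].
        qmin, qmax = 0, 2 ** bit - 1
        scale = (hi - lo) / (qmax - qmin)
        zero_point = round(-lo / scale) if scale else 0
    return (scale or 1.0), zero_point, qmin, qmax


def fake_quant(values, bit=8, symmetric=True):
    """Quantize to integers, then map back to floats (fake quantization)."""
    scale, zero_point, qmin, qmax = quant_params(values, bit, symmetric)
    out = []
    for v in values:
        q = min(max(round(v / scale) + zero_point, qmin), qmax)
        out.append((q - zero_point) * scale)
    return out


weights = [-1.0, 0.0, 0.5, 3.0]
print(fake_quant(weights, symmetric=True))
print(fake_quant(weights, symmetric=False))
```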

quant.weight.granularity

The quantization granularity of the weights. The following granularities are supported:

per tensor
per channel
per group
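The granularity determines how many values share one quantization scale. The sketch below computes symmetric scales for a small weight matrix under each of the three settings. It assumes that each row is one output channel and that groups are contiguous slices of a row; the group_size parameter and the helper names are hypothetical, and the per_group spelling is assumed by analogy with per_channel in the config example.

```python
def max_abs(values):
    """Largest absolute value in an iterable of floats."""
    return max(abs(v) for v in values)


def weight_scales(weight, granularity, bit=8, group_size=2):
    """Compute symmetric quantization scales for a 2-D weight (list of rows)."""
    qmax = 2 ** (bit - 1) - 1
    if granularity == "per_tensor":
        # One scale shared by the whole matrix.
        return max_abs(v for row in weight for v in row) / qmax
    if granularity == "per_channel":
        # One scale per row (assumed to be one output channel).
        return [max_abs(row) / qmax for row in weight]
    if granularity == "per_group":
        # One scale per contiguous group of group_size values within each row.
        return [
            [max_abs(row[i:i + group_size]) / qmax for i in range(0, len(row), group_size)]
            for row in weight
        ]
    raise ValueError(f"unknown granularity: {granularity}")


weight = [[0.1, -0.2, 1.27, 0.4], [0.05, 0.0, -0.0254, 0.0127]]
per_tensor = weight_scales(weight, "per_tensor")
per_channel = weight_scales(weight, "per_channel")
per_group = weight_scales(weight, "per_group")
print(per_tensor)
print(per_channel)
print(per_group)
```

Finer granularity gives rows or groups with small values their own, smaller scale, at the cost of storing more scales.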

quant.act

Quantization settings for activations

quant.act.bit

The number of quantization bits for the activations

quant.act.symmetric

Whether activation quantization is symmetric

quant.act.granularity

The quantization granularity of the activations. The following granularities are supported:

per tensor
per token
per head

If quant.method is set to RTN, activation quantization also supports a static per-tensor setting. The following is a W8A8 configuration with static per-tensor activation quantization:

quant:
    method: RTN
    weight:
        bit: 8
        symmetric: True
        granularity: per_channel
    act:
        bit: 8
        symmetric: True
        granularity: per_tensor
        static: True
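In static quantization, the activation scale is typically fixed ahead of time from calibration data, instead of being recomputed for each input at runtime. The sketch below illustrates that general idea in plain Python for a symmetric per-tensor scale. The helper names are hypothetical, and this is not llmc's implementation.

```python
def calibrate_static_scale(calib_activations, bit=8):
    """Derive one symmetric per-tensor scale from calibration batches."""
    qmax = 2 ** (bit - 1) - 1
    largest = max(abs(x) for batch in calib_activations for x in batch)
    return largest / qmax


def quantize(batch, scale, bit=8):
    """Quantize with a fixed scale; values outside the calibrated range are clipped."""
    qmin, qmax = -(2 ** (bit - 1)), 2 ** (bit - 1) - 1
    return [min(max(round(x / scale), qmin), qmax) for x in batch]


# The scale is computed once, offline, from calibration data.
calib = [[0.5, -2.0], [1.0, 2.54]]
scale = calibrate_static_scale(calib)

# At inference time the same scale is reused, so 5.0 is clipped to 127.
print(quantize([1.0, 5.0], scale))  # [50, 127]
```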

sparse#

sparse.method

The name of the sparsification algorithm used. This includes both model sparsification and reduction of visual tokens. All supported algorithms can be found in the corresponding files.

It’s worth noting that for model sparsification, you need to specify the exact algorithm name, whereas for token reduction, you only need to set it to TokenReduction first, and then specify the exact algorithm under special.

sparse:
    method: Wanda

sparse:
    method: TokenReduction
    special:
        method: FastV

save#

save.save_vllm

Whether to save a real quantized model supported by the VLLM inference backend.

When this option is enabled, the saved model weights will significantly shrink (real quantization), and it can be directly loaded for inference using the VLLM backend. This improves inference speed and reduces memory usage. For more details on the VLLM inference backend, refer to this section.

save.save_sgl

Whether to save a real quantized model supported by the Sglang inference backend.

When this option is enabled, the saved model weights will significantly shrink (real quantization), and it can be directly loaded for inference using the Sglang backend. This improves inference speed and reduces memory usage. For more details on the Sglang inference backend, refer to this section.

save.save_autoawq

Whether to save a real quantized model supported by the AutoAWQ inference backend.

When this option is enabled, the saved model weights will significantly shrink (real quantization), and it can be directly loaded for inference using the AutoAWQ backend. This improves inference speed and reduces memory usage. For more details on the AutoAWQ inference backend, refer to this section.

save.save_mlcllm

Whether to save a real quantized model supported by the MLC-LLM inference backend.

When this option is enabled, the saved model weights will significantly shrink (real quantization), and it can be directly loaded for inference using the MLC-LLM backend. This improves inference speed and reduces memory usage. For more details on the MLC-LLM inference backend, refer to this section.

save.save_trans

Whether to save the adjusted model weights.

The saved weights are adjusted to be more suitable for quantization, possibly containing fewer outliers. They are still saved in fp16/bf16 format (with the same file size as the original model). When deploying the model in the inference engine, the engine’s built-in naive quantization needs to be used to achieve quantized inference.

Unlike save_vllm and similar options, this option requires the inference engine to perform real quantization, while llmc provides a floating-point model weight that is more suitable for quantization.

For example, the save_trans models exported by algorithms such as SmoothQuant, Os+, AWQ, and Quarot have fewer outliers and are more suitable for quantization.

save.save_fake

Whether to save the fake quantized model.

save.save_path

The path where the model is saved. This path must be a new, non-existent directory. Otherwise, LLMC will terminate the run and report an error.
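If you want to catch this before launching a long run, a pre-flight check along the following lines may help. The helper check_save_path is hypothetical and only approximates the rule described above; the exact error raised by LLMC may differ.

```python
import os
import tempfile


def check_save_path(save_path: str) -> None:
    """Raise if save_path already exists (hypothetical pre-flight check)."""
    if os.path.exists(save_path):
        raise FileExistsError(f"{save_path} already exists; choose a new directory")


rejected = False
with tempfile.TemporaryDirectory() as tmp:
    fresh = os.path.join(tmp, "quant_model")
    check_save_path(fresh)  # passes: the directory does not exist yet
    os.makedirs(fresh)
    try:
        check_save_path(fresh)  # fails: the directory now exists
    except FileExistsError as err:
        rejected = True
        print("rejected:", err)
```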
