VLLM Quantized Inference

VLLM is an efficient backend specifically designed to meet the inference needs of large language models. By optimizing memory management and computational efficiency, it significantly speeds up the inference process.

LLMC supports exporting quantized models in the formats required by VLLM and, through its support for multiple algorithms (AWQ, GPTQ, QuaRot, and others), can maintain high quantization accuracy while preserving inference speed. Combining LLMC with VLLM lets users accelerate inference and reduce memory use with little loss of accuracy, which suits scenarios that require efficient serving of large language models.

1.1 Environment Setup

To use VLLM for quantized inference, first, install and configure the VLLM environment:

pip install vllm

1.2 Quantization Formats

VLLM supports the following common quantization formats and schemes, covering fixed-point integer and FP8 quantization:

W4A16: Weights are int4, activations are float16.
W8A16: Weights are int8, activations are float16.
W8A8: Weights are int8, activations are int8.
FP8 (E4M3, E5M2): Weights are float8, activations are float8.
Per-channel/group weight quantization: Quantization applied per channel or group.
Per-tensor weight quantization: Quantization applied per tensor.
Per-token dynamic activation quantization: Dynamic quantization for each token to further improve precision.
Per-tensor static activation quantization: Static quantization for each tensor to enhance efficiency.
Symmetric weight/activation quantization: The quantization parameters consist of a scale only, with no zero point.

Therefore, when quantizing models with LLMC, make sure that the bit settings for weights and activations are in formats supported by VLLM.
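
As a rough illustration of that check, the sketch below restates the weight/activation pairs from the list above as a lookup table. The names `SUPPORTED_FORMATS`, `vllm_format_name`, and `check_format` are hypothetical helpers for this example only and are not part of LLMC or VLLM; the FP8 entries assume weights and activations use the same FP8 variant.

```python
# Hypothetical lookup table restating the formats listed above.
# Keys are (weight type, activation type); values are the format label.
SUPPORTED_FORMATS = {
    ("int4", "float16"): "W4A16",
    ("int8", "float16"): "W8A16",
    ("int8", "int8"): "W8A8",
    ("e4m3", "e4m3"): "FP8 (E4M3)",  # assumption: same variant on both sides
    ("e5m2", "e5m2"): "FP8 (E5M2)",  # assumption: same variant on both sides
}


def vllm_format_name(weight_type, act_type="float16"):
    """Return the format label for a weight/activation pair, or None if unlisted."""
    return SUPPORTED_FORMATS.get((weight_type, act_type))


def check_format(weight_type, act_type="float16"):
    """Raise ValueError if the pair is not in the list above."""
    name = vllm_format_name(weight_type, act_type)
    if name is None:
        raise ValueError(
            f"weight={weight_type}, act={act_type} is not a listed VLLM format"
        )
    return name
```

Running such a check before a long quantization job can catch an unsupported bit setting early, for example a 3-bit weight configuration.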

1.3 Using LLMC for Model Quantization

1.3.1 Calibration Data

In this chapter, we use the Pileval and Wikitext academic datasets as calibration data. For downloading and preprocessing calibration data, refer to this chapter.

In practical use, we recommend using real deployment data for offline quantization calibration.

1.3.2 Choosing a Quantization Algorithm

W8A16

In the W8A16 quantization setting, large language models typically do not experience significant accuracy degradation. In this case, we recommend using the simple RTN (Round to Nearest) algorithm, which does not require additional calibration steps and runs quickly.
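
To make the idea concrete, here is a minimal pure-Python sketch of symmetric round-to-nearest quantization applied per group. It is an illustration of the technique under simple assumptions (absolute-maximum scaling, one scale per group), not LLMC's implementation; the function names are hypothetical.

```python
def rtn_quantize_group(values, bits=8):
    """Symmetric round-to-nearest quantization of one group of weights.

    Returns the integer codes and the single scale shared by the group.
    """
    qmax = 2 ** (bits - 1) - 1   # 127 for 8 bits
    qmin = -(2 ** (bits - 1))    # -128 for 8 bits
    max_abs = max(abs(v) for v in values)
    # Assumption: the scale maps the largest magnitude onto qmax.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Python's round() rounds exact ties to the nearest even integer.
    codes = [min(qmax, max(qmin, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


def rtn_quantize(weights, bits=8, group_size=128):
    """Split a flat weight list into groups and quantize each independently."""
    groups = [weights[i:i + group_size] for i in range(0, len(weights), group_size)]
    return [rtn_quantize_group(g, bits) for g in groups]
```

No calibration data appears anywhere in this sketch, which is why RTN runs quickly: each scale depends only on the weights in its own group.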

You can refer to the RTN W8A16 weight quantization configuration file.

# configs/quantization/backend/vllm/rtn_w8a16.yml
quant:
    method: RTN
    weight:
        bit: 8
        symmetric: True
        granularity: per_group
        group_size: 128
        need_pack: True

Make sure to set the need_pack parameter to True, which packs 8-bit weights into torch.int32 format for direct VLLM loading and inference.
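
The sketch below illustrates what packing low-bit values into 32-bit words means: four 8-bit values (or eight 4-bit values) share one word. The bit order and signed encoding used here are assumptions chosen for illustration; the actual layout LLMC writes and VLLM reads may differ, and the function names are hypothetical.

```python
def pack_to_int32(qvalues, bits):
    """Pack signed `bits`-wide integers into 32-bit words, lowest bits first.

    Words are returned as non-negative Python ints in [0, 2**32).
    """
    per_word = 32 // bits
    mask = (1 << bits) - 1
    if len(qvalues) % per_word:
        raise ValueError(f"length must be a multiple of {per_word}")
    words = []
    for i in range(0, len(qvalues), per_word):
        word = 0
        for j, q in enumerate(qvalues[i:i + per_word]):
            # `q & mask` keeps the two's-complement bit pattern of q.
            word |= (q & mask) << (j * bits)
        words.append(word)
    return words


def unpack_from_int32(words, bits):
    """Inverse of pack_to_int32: recover the signed `bits`-wide integers."""
    per_word = 32 // bits
    mask = (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    out = []
    for word in words:
        for j in range(per_word):
            u = (word >> (j * bits)) & mask
            out.append(u - (1 << bits) if u & sign_bit else u)
    return out
```

With `bits=4` the same functions pack eight values per word, which is the case relevant to W4A16.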

W4A16

In the W4A16 quantization setting, RTN (Round to Nearest) cannot ensure accuracy, so more advanced quantization algorithms are needed to maintain model accuracy. In this case, we recommend using the AWQ algorithm from LLMC.

You can refer to the AWQ W4A16 weight quantization configuration file.

# configs/quantization/backend/vllm/awq_w4a16.yml
quant:
    method: Awq
    weight:
        bit: 4
        symmetric: True
        granularity: per_group
        group_size: 128
        need_pack: True
    special:
        trans: True
        trans_version: v2
        weight_clip: True
    quant_out: True  

Make sure to set the need_pack parameter to True, which packs 4-bit weights into torch.int32 format for direct VLLM loading and inference.

If AWQ cannot meet accuracy requirements, we recommend using the AWQ + OmniQuant combination algorithm described in this chapter to further improve accuracy. The corresponding configuration file is also provided.

W8A8

In the W8A8 quantization setting, we also recommend using the AWQ algorithm. In most cases, AWQ outperforms SmoothQuant and OS+ and provides better quantization accuracy.

You can refer to the AWQ W8A8 quantization configuration file.

# configs/quantization/backend/vllm/awq_w8a8.yml
quant:
    method: Awq
    weight:
        bit: 8
        symmetric: True
        granularity: per_channel
        group_size: -1
    act:
        bit: 8
        symmetric: True
        granularity: per_token
    special:
        trans: True
        trans_version: v2
        weight_clip: True
    quant_out: True 

If AWQ cannot meet accuracy requirements, we recommend using the QuaRot + GPTQ combination algorithm described in this chapter to further improve accuracy. The corresponding configuration file is also provided.

FP8-Dynamic

In FP8 quantization, LLMC supports weight quantization per-channel and activation quantization dynamically per-token. In this case, the RTN (Round to Nearest) algorithm is sufficient. However, we recommend using the AWQ algorithm for better quantization accuracy. For implementation details, refer to the AWQ FP8 configuration file.

# configs/quantization/backend/vllm/fp8/awq_fp8.yml
quant:
    method: Awq
    quant_type: float_quant
    weight:
        # Support ["e4m3", "e5m2"]
        bit: e4m3
        symmetric: True
        granularity: per_channel
        use_qtorch: True
    act:
        # Support ["e4m3", "e5m2"]
        bit: e4m3
        symmetric: True
        granularity: per_token
        use_qtorch: True
    special:
        trans: True
        trans_version: v2
        weight_clip: True
    quant_out: True

Ensure that quant_type is set to float_quant to indicate floating-point quantization. Additionally, set use_qtorch to True, as LLMC’s FP8 implementation depends on certain functionalities from the QPyTorch library.

Install QPyTorch with the following command:

pip install qtorch

FP8-Static

In FP8 quantization, LLMC also supports weight quantization per-tensor and activation quantization statically per-tensor. In this case, we recommend using the AWQ algorithm while adjusting the activation ranges. Refer to the AWQ FP8 static quantization configuration file.

# configs/quantization/backend/vllm/fp8/awq_fp8_static.yml
quant:
    method: Awq
    quant_type: float_quant
    weight:
        # Support ["e4m3", "e5m2"]
        bit: e4m3
        symmetric: True
        granularity: per_tensor
        use_qtorch: True
    act:
        # Support ["e4m3", "e5m2"]
        bit: e4m3
        symmetric: True
        granularity: per_tensor
        use_qtorch: True
        static: True
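
The difference between the two FP8 activation schemes can be sketched as follows: dynamic per-token quantization derives one scale per token at run time, while static per-tensor quantization fixes a single scale from calibration data. This is a simplified illustration using absolute-maximum scaling. The maxima in `FP8_MAX` assume the OCP FP8 formats (448 for E4M3, 57344 for E5M2), which may not match the ranges LLMC or QPyTorch use internally, and the function names are hypothetical.

```python
# Assumption: largest finite magnitudes of the OCP FP8 formats.
FP8_MAX = {"e4m3": 448.0, "e5m2": 57344.0}


def per_token_dynamic_scales(activations, fmt="e4m3"):
    """One scale per token (row), computed from the live activations."""
    fmax = FP8_MAX[fmt]
    scales = []
    for row in activations:
        max_abs = max(abs(x) for x in row)
        scales.append(max_abs / fmax if max_abs > 0 else 1.0)
    return scales


def per_tensor_static_scale(calibration_batches, fmt="e4m3"):
    """A single scale computed once from calibration data and then frozen."""
    fmax = FP8_MAX[fmt]
    max_abs = max(
        abs(x) for batch in calibration_batches for row in batch for x in row
    )
    return max_abs / fmax if max_abs > 0 else 1.0
```

The static variant avoids computing a maximum at inference time, at the cost of depending on how well the calibration data matches real inputs.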

1.3.3 Exporting Real Quantized Model

save:
    save_vllm: True
    save_path: /path/to/save_for_vllm_rtn_w8a16/

Make sure to set save_vllm to True. For W4A16 and W8A16 quantization settings, LLMC will export the weights in torch.int32 format for direct VLLM loading, and it will also export the quantization parameters.

For W8A8 quantization settings, LLMC will export the weights in torch.int8 format for direct VLLM loading, along with the relevant quantization parameters.

1.3.4 Running LLMC

Modify the configuration file path in the run script and execute:

# scripts/run_llmc.sh
llmc=llmc_path
export PYTHONPATH=$llmc:$PYTHONPATH

task_name=rtn_for_vllm
config=${llmc}/configs/quantization/backend/vllm/rtn_w8a16.yml

After LLMC finishes running, the real quantized model will be stored at the save.save_path.

1.4 Using VLLM for Inference

1.4.1 Offline Inference

We provide an example that performs offline batch inference on a dataset using VLLM. Replace model_path in the example with the path set in save.save_path, and then run the following commands:

cd examples/backend/vllm

python infer_with_vllm.py
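
The contents of infer_with_vllm.py are not reproduced here. As a rough idea of what offline batch inference looks like with VLLM's Python API, the sketch below uses the `LLM` and `SamplingParams` classes; the helper names `generate_offline` and `format_results` and the default sampling values are assumptions for illustration and may differ from the provided example.

```python
def generate_offline(model_path, prompts, max_tokens=128, temperature=0.0):
    """Run batch generation on a saved quantized model and return (prompt, text) pairs."""
    # Imported lazily so this module can be loaded without VLLM installed.
    from vllm import LLM, SamplingParams

    llm = LLM(model=model_path)  # model_path: the directory set in save.save_path
    params = SamplingParams(temperature=temperature, max_tokens=max_tokens)
    outputs = llm.generate(prompts, params)
    # Each request output carries its prompt and a list of completions.
    return [(out.prompt, out.outputs[0].text) for out in outputs]


def format_results(results):
    """Render (prompt, text) pairs as printable lines."""
    return [f"Prompt: {p!r} -> Output: {t!r}" for p, t in results]
```

Calling `generate_offline` with the exported model directory and a list of prompt strings, then printing each line of `format_results`, gives a quick sanity check that the quantized model loads and generates text.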

1.4.2 Inference Service

vLLM can be deployed as a server that implements the OpenAI API protocol. This allows vLLM to be used as a drop-in replacement for applications using the OpenAI API. By default, it starts the server at http://localhost:8000. You can specify the address with the --host and --port arguments. In the commands below, replace model_path with the path to the saved quantized model.

Start the server:

vllm serve model_path 

Query the server:

curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{
    "model": "model_path",
    "prompt": "What is the AI?",
    "max_tokens": 128,
    "temperature": 0
}'
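
The same request can be issued from Python. The sketch below mirrors the curl call using only the standard library; `build_completion_request` and `query_server` are hypothetical helper names, and the response parsing assumes the OpenAI completions response shape (`choices[0].text`).

```python
import json
import urllib.request


def build_completion_request(model, prompt, base_url="http://localhost:8000",
                             max_tokens=128, temperature=0):
    """Build a POST request equivalent to the curl command above."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def query_server(model, prompt, **kwargs):
    """Send the request to a running server and return the generated text."""
    req = build_completion_request(model, prompt, **kwargs)
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]
```

Here `model` should be the same model_path value that was passed to `vllm serve`, since the server identifies the model by that name unless configured otherwise.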
