SGLang Quantized Inference#

SGLang is a fast-serving framework for large language models and vision-language models. By co-designing the backend runtime and frontend language, it makes interactions with models faster and more controllable.

1.1 Environment Setup#

To use SGLang for quantized inference, you first need to install and configure the SGLang environment:

pip install --upgrade pip
pip install "sglang[all]"

# Install FlashInfer CUDA kernels
pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/

1.2 Quantization Format#

Same as VLLM.

1.3 Using LLMC for Model Quantization#

1.3.1 Calibration Data#

In this section, we use the Plieval and Wikitext academic datasets as calibration data. For downloading and preprocessing calibration data, please refer to this section.

For real use cases, it is recommended to use real deployment scenario data for offline quantization calibration.

1.3.2 Choosing a Quantization Algorithm#

W8A16

Under the W8A16 quantization setting, the accuracy of large language models generally does not show significant issues. In this case, we recommend using the simplest RTN (Round to Nearest) algorithm, which does not require additional calibration steps and runs quickly.

The specific implementation can be found in the RTN W8A16 weight quantization configuration file.

# configs/quantization/backend/sglang/rtn_w8a16.yml
quant:
    method: RTN
    weight:
        bit: 8
        symmetric: True
        granularity: per_group
        group_size: 128
        need_pack: True

Please note that in this step, the need_pack parameter must be set to True, which will “pack” the 8-bit weights into the torch.int32 format for SGLang to directly load for inference.

W4A16

Under the W4A16 quantization setting, RTN (Round to Nearest) cannot ensure accuracy, so higher-order quantization algorithms are required to maintain model accuracy. In this case, we recommend using the AWQ algorithm from LLMC.

The specific implementation can be found in the AWQ W4A16 weight quantization configuration file.

# configs/quantization/backend/sglang/awq_w4a16.yml
quant:
    method: Awq
    weight:
        bit: 4
        symmetric: True
        granularity: per_group
        group_size: 128
        need_pack: True
    special:
        trans: True
        trans_version: v2
        weight_clip: True
    quant_out: True  

Please note that in this step, the need_pack parameter must be set to True, which will “pack” the 4-bit weights into the torch.int32 format for SGlang to directly load for inference.

Additionally, if AWQ does not meet accuracy requirements, we recommend using the AWQ + OmniQuant combined algorithm as introduced in this section to further improve accuracy. The corresponding configuration file is also provided.

W8A8

Under the W8A8 quantization setting, we also recommend using the AWQ algorithm. AWQ generally outperforms SmoothQuant and OS+ in most cases, providing better quantization accuracy.

The specific implementation can be found in the AWQ W8A8 configuration file.

# configs/quantization/backend/sglang/awq_w8a8.yml
quant:
    method: Awq
    weight:
        bit: 8
        symmetric: True
        granularity: per_channel
        group_size: -1
    act:
        bit: 8
        symmetric: True
        granularity: per_token
    special:
        trans: True
        trans_version: v2
        weight_clip: True
    quant_out: True 

Additionally, if AWQ does not meet accuracy requirements, we recommend using the Quarot + GPTQ combined algorithm as introduced in this section to further improve accuracy. The corresponding configuration file is also provided.

FP8-Dynamic

In FP8 quantization, LLMC supports weight quantization per-channel and activation quantization dynamically per-token. In this case, the RTN (Round to Nearest) algorithm is sufficient. However, we recommend using the AWQ algorithm for better quantization accuracy. For implementation details, refer to the AWQ FP8 configuration file.

# configs/quantization/backend/sglang/fp8/awq_fp8.yml
quant:
    method: Awq
    quant_type: float_quant
    weight:
        # Support ["e4m3", "e5m2"]
        bit: e4m3
        symmetric: True
        granularity: per_channel
        use_qtorch: True
    act:
        # Support ["e4m3", "e5m2"]
        bit: e4m3
        symmetric: True
        granularity: per_token
        use_qtorch: True
    special:
        trans: True
        trans_version: v2
        weight_clip: True
    quant_out: True

Ensure that quant_type is set to float_quant to indicate floating-point quantization. Additionally, set use_qtorch to True, as LLMC’s FP8 implementation depends on certain functionalities from the QPyTorch library.

Install QPyTorch with the following command:

pip install qtorch

FP8-Static

In FP8 quantization, LLMC also supports weight quantization per-tensor and activation quantization statically per-tensor. In this case, we recommend using the AWQ algorithm while adjusting the activation ranges. Refer to the AWQ FP8 static quantization configuration file.

# configs/quantization/backend/sglang/fp8/awq_fp8_static.yml
quant:
    method: Awq
    quant_type: float-quant
    weight:
        # Support ["e4m3", "e5m2"]
        bit: e4m3
        symmetric: True
        granularity: per_tensor
        use_qtorch: True
    act:
        # Support ["e4m3", "e5m2"]
        bit: e4m3
        symmetric: True
        granularity: per_tensor
        use_qtorch: True
        static: True

1.3.3 Exporting Real Quantized Model#

save:
    save_sgl: True
    save_path: /path/to/save_for_sglang_rtn_w8a16/

Please note that you must set save_sgl to True. For W4A16 and W8A16 quantization settings, LLMC will “pack” the weights into torch.int32 format for direct loading by SGlang, while also exporting the quantization parameters.

For the W8A8 quantization setting, LLMC will quantize the weights into torch.int8 format for direct loading by SGlang, and export the relevant quantization parameters as well.

1.3.4 Running LLMC#

Modify the configuration file path in the script and run:

# scripts/run_llmc.sh
llmc=llmc_path
export PYTHONPATH=$llmc:$PYTHONPATH

task_name=rtn_for_sglang
config=${llmc}/configs/quantization/backend/sglang/rtn_w8a16.yml

Once LLMC finishes running, the quantized model will be stored at the save.save_path location.

1.4 Using Sglang for Inference#

1.4.1 Inference Service#

By default, it will start the server at http://localhost:10000. Replace model_path with the quantized model saved under the save.save_path.

Start the service:

python -m sglang.launch_server --model-path model_path

Call the service:

curl http://localhost:10000/generate   -H "Content-Type: application/json"   -d '{
    "text": "Once upon a time,",
    "sampling_params": {
      "max_new_tokens": 16,
      "temperature": 0
    }
  }'

Additionally, we have built an example that uses SGLang for inference.

cd examples/backend/sglang

python infer_with_sglang.py