Model accuracy test V1#
Accuracy test pipeline#
LLMC supports basic perplexity (PPL) evaluation, but it does not itself support evaluation on other downstream tasks.
It is common practice to use evaluation tools to directly test the inference of the model, including but not limited to:
However, this evaluation method is not efficient. We recommend separating inference from accuracy evaluation: the model is run by an inference engine and served as an API, and the evaluation tool evaluates that API. This approach has the following benefits:
Using an efficient inference engine for model inference speeds up the entire evaluation process
Model inference and model evaluation are decoupled, each handling only its own concern, so the code structure is clearer
Running the model on an inference engine is closer to the actual deployment scenario, which makes it easier to match the accuracy of the deployed model
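The decoupling described above can be sketched as follows. This is only an illustration, not LLMC or OpenCompass code: `exact_match_accuracy` and `fake_generate` are hypothetical names. The scorer only needs a callable that maps a prompt to generated text, so the same scoring code works whether that callable wraps an inference-engine API or a stub.

```python
from typing import Callable, Iterable, Tuple


def exact_match_accuracy(
    generate: Callable[[str], str],
    samples: Iterable[Tuple[str, str]],
) -> float:
    """Score a text-generation backend by exact match on (prompt, expected) pairs."""
    pairs = list(samples)
    if not pairs:
        return 0.0
    correct = sum(
        1
        for prompt, expected in pairs
        if generate(prompt).strip() == expected.strip()
    )
    return correct / len(pairs)


def fake_generate(prompt: str) -> str:
    """Hypothetical stand-in for a call to an inference-engine API."""
    canned = {"1+1=": "2", "2+2=": "5"}
    return canned.get(prompt, "")


# One of the two canned answers matches, so the score is 0.5.
score = exact_match_accuracy(fake_generate, [("1+1=", "2"), ("2+2=", "4")])
print(score)
```

Swapping `fake_generate` for a function that posts to a served model leaves the scoring code unchanged, which is the point of separating the two concerns.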
We recommend the following compression-deployment-evaluation pipeline, which the rest of this page describes: LLMC compression, lightllm inference, opencompass evaluation
Here are the links to the relevant tools:
Use of the lightLLM inference engine#
The official lightllm repository has more detailed documentation; what follows is a brief quick start
Start a service for a float model
Install lightllm
git clone https://github.com/ModelTC/lightllm.git
cd lightllm
pip install -v -e .
Start a service
python -m lightllm.server.api_server --model_dir /path/to/model \
--host 0.0.0.0 \
--port 1030 \
--nccl_port 2066 \
--max_req_input_len 6144 \
--max_req_total_len 8192 \
--tp 2 \
--trust_remote_code \
--max_total_token_num 120000
The above command starts a service on port 1030 of the local machine, using 2 GPUs
The --tp argument sets the number of GPUs used for tensor-parallel inference, which is suitable for larger models.
The max_total_token_num in the above command affects throughput during the test, and can be set according to the lightllm documentation. As long as GPU memory is not exhausted, larger values are better.
If you want to run multiple lightllm services on the same machine, give each service its own port and nccl_port so that they do not conflict.
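As a sketch of the port rule above, the helper below builds one launch command per model and gives each a distinct port and nccl_port. `build_launch_commands` and the example model paths are hypothetical. Only flags that appear in the command above are used, GPU assignment is not handled, and nothing is launched.

```python
from typing import List


def build_launch_commands(
    model_dirs: List[str],
    base_port: int = 1030,
    base_nccl_port: int = 2066,
    tp: int = 2,
) -> List[List[str]]:
    """Return one argument list per service, each with its own port pair."""
    commands = []
    for index, model_dir in enumerate(model_dirs):
        commands.append([
            "python", "-m", "lightllm.server.api_server",
            "--model_dir", model_dir,
            "--host", "0.0.0.0",
            "--port", str(base_port + index),            # unique HTTP port
            "--nccl_port", str(base_nccl_port + index),  # unique NCCL port
            "--tp", str(tp),
            "--trust_remote_code",
        ])
    return commands


# Placeholder paths, for illustration only.
commands = build_launch_commands(["/path/to/model_a", "/path/to/model_b"])
for command in commands:
    print(" ".join(command))
```

Each argument list could then be passed to `subprocess.Popen` if you want to start the services from Python.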
Simple testing of the service
Run the following Python script
import json

import requests

# Endpoint of the lightllm service started above
url = 'http://localhost:1030/generate'
headers = {'Content-Type': 'application/json'}
data = {
    'inputs': 'What is AI?',
    'parameters': {
        'do_sample': False,    # greedy decoding
        'ignore_eos': False,   # stop at the end-of-sequence token
        'max_new_tokens': 128,
    }
}
response = requests.post(url, headers=headers, data=json.dumps(data))
if response.status_code == 200:
    print(response.json())
else:
    print('Error:', response.status_code, response.text)
If the script prints the generated response rather than an error, the service is working correctly
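Because a service can take a while to load the model, the check above may fail if it runs too early. The following sketch polls until the service answers, using only the standard library. `post_generate` and `wait_until_ready` are hypothetical helpers. The `/generate` endpoint and payload fields are taken from the script above, while the retry counts are arbitrary assumptions.

```python
import json
import time
import urllib.request
from typing import Callable


def post_generate(url: str, prompt: str, timeout_s: float = 30.0) -> object:
    """Send one short generation request and return the decoded JSON reply."""
    payload = {
        "inputs": prompt,
        "parameters": {"do_sample": False, "max_new_tokens": 8},
    }
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.loads(response.read().decode("utf-8"))


def wait_until_ready(
    probe: Callable[[], object],
    attempts: int = 30,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `probe` until it succeeds; return False if every attempt fails."""
    for attempt in range(attempts):
        try:
            probe()
            return True
        except (OSError, ValueError):
            # OSError covers connection and HTTP errors raised by urllib;
            # ValueError covers a reply that is not valid JSON.
            if attempt < attempts - 1:
                sleep(interval_s)
    return False
```

A typical call would be `wait_until_ready(lambda: post_generate("http://localhost:1030/generate", "What is AI?"))`, run before starting the evaluation.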
Start a service for a quantized model
python -m lightllm.server.api_server --model_dir /path/to/model \
--host 0.0.0.0 \
--port 1030 \
--nccl_port 2066 \
--max_req_input_len 6144 \
--max_req_total_len 8192 \
--tp 2 \
--trust_remote_code \
--max_total_token_num 120000 \
--mode triton_w4a16
The --mode triton_w4a16 argument added to the command enables naive w4a16 quantization
After the service has started, verify again that it responds correctly
The model path used in the above command points to the original pre-trained model, which has not been modified by LLMC. To serve a model processed by LLMC, follow the LLMC documentation, enable save_trans to save the transformed model, and then run the naive quantization service command above with that model's path.
Use of the opencompass evaluation tool#
The official opencompass repository has more detailed documentation; what follows is a brief quick start
Install opencompass
git clone https://github.com/open-compass/opencompass.git
cd opencompass
pip install -v -e .
Modify the config
The config file is here. OpenCompass uses this configuration file to evaluate the accuracy of the lightllm API service. Note that the port in its url must match the port of the lightllm service started above
To select the evaluation datasets, modify this part of the code
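A quick way to catch a port mismatch is to parse the url and compare its port with the one passed to the service. This sketch does not read the OpenCompass config. `url_uses_port` is a hypothetical helper, and the url is the one used in the test script earlier on this page.

```python
from urllib.parse import urlsplit


def url_uses_port(url: str, expected_port: int) -> bool:
    """Return True if the url names exactly the expected port."""
    return urlsplit(url).port == expected_port


# The service above was started with --port 1030.
print(url_uses_port("http://localhost:1030/generate", 1030))
```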
with read_base():
    from .summarizers.leaderboard import summarizer
    from .datasets.humaneval.deprecated_humaneval_gen_a82cae import humaneval_datasets
The above snippet selects the humaneval dataset for testing. More supported datasets can be found here
Dataset download
It is necessary to prepare the dataset according to the OpenCompass documentation.
Run accuracy tests
After modifying the above configuration file, you can run the following command
python run.py configs/eval_lightllm.py
Once inference and metric calculation have finished, the evaluation results are available. An output folder is generated in the current directory: its logs subfolder records the evaluation logs, and its summary subfile records the accuracy on the evaluated datasets
Use of the lm-evaluation-harness evaluation tool#
In addition to the methods above, we also recommend lm-evaluation-harness, which is already integrated into LLMC. After cloning LLMC's submodules, you can use the following commands to evaluate a quantized or full-precision model:
export CUDA_VISIBLE_DEVICES=4,5,6,7
llmc=./llmc
lm_eval=./llmc/lm-evaluation-harness
export PYTHONPATH=$llmc:$PYTHONPATH
export PYTHONPATH=$llmc:$lm_eval:$PYTHONPATH
# Replace the config file with the one you want to use (e.g., an RTN config pointing to the
# algorithm-transformed model, or a config without the quant part pointing to the original model).
# Whether to pass `--quarot` depends on the transformation algorithm used beforehand.
accelerate launch --multi_gpu --num_processes 4 llmc/tools/llm_eval.py \
--config llmc/configs/quantization/RTN/rtn_quarot.yml \
--model hf \
--quarot \
--tasks lambada_openai,arc_easy \
--model_args parallelize=False \
--batch_size 64 \
--output_path ./save/lm_eval \
--log_samples
The lm-evaluation-harness command line is preserved, with only two additional arguments: --config and --quarot. --config loads either the transformed model (saved by save_trans) or the original Hugging Face model, depending on the model path in the config. To evaluate the full-precision model, remove the quant part of the config. Only RTN quantization is supported, and all related quantization granularities must match the settings of the transformed model. --quarot is used if the model was transformed by QuaRot.
Remark: Please omit parallelize (or set parallelize=False) and pretrained=* in --model_args for evaluation.
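After evaluating both the full-precision and the quantized model, the per-task scores can be compared as in the sketch below. `accuracy_drops` is a hypothetical helper that takes plain `{task: score}` dictionaries which you extract from the results yourself. The numbers are illustrative placeholders, not measured results.

```python
from typing import Dict


def accuracy_drops(
    full_precision: Dict[str, float],
    quantized: Dict[str, float],
) -> Dict[str, float]:
    """Per-task drop (full precision minus quantized) for tasks present in both."""
    shared = sorted(set(full_precision) & set(quantized))
    return {task: full_precision[task] - quantized[task] for task in shared}


# Placeholder scores for the two tasks named in the command above.
drops = accuracy_drops(
    {"lambada_openai": 0.5, "arc_easy": 0.75},
    {"lambada_openai": 0.5, "arc_easy": 0.5},
)
print(drops)
```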
FAQ#
Q1
In OpenCompass, what do the different suffixes on configuration files for the same dataset mean?
Solution
Different suffixes represent different prompt templates. For detailed questions about OpenCompass, please refer to the OpenCompass documentation
Q2
The Humaneval accuracy of the LLaMA model is too low
Solution
You may need to delete the \n at the end of each entry in the Humaneval json file in the dataset provided by OpenCompass, and then retest
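One way to apply this fix is sketched below. It assumes the file is in JSON Lines form (one JSON object per line) and that the trailing \n sits in a field named `prompt`. Both are assumptions, so check your copy of the dataset and adjust `field` accordingly. `strip_trailing_newline` is a hypothetical helper that works on lines already read into memory.

```python
import json
from typing import List


def strip_trailing_newline(jsonl_lines: List[str], field: str = "prompt") -> List[str]:
    """Remove one trailing newline from `field` in each JSON Lines record."""
    cleaned = []
    for line in jsonl_lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        value = record.get(field)
        if isinstance(value, str) and value.endswith("\n"):
            record[field] = value[:-1]
        cleaned.append(json.dumps(record, ensure_ascii=False))
    return cleaned


# Made-up record, for illustration only.
example = ['{"task_id": "demo/0", "prompt": "def f():\\n"}']
print(strip_trailing_newline(example))
```

Write the returned lines to a copy of the file instead of overwriting the original, so the change is easy to revert.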
Q3
The test is still not fast enough
Solution
Check whether max_total_token_num was set to a reasonable value when starting the lightllm service. If it is too small, test concurrency will be low
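A rough way to reason about this is sketched below. It assumes that max_total_token_num bounds the total number of tokens the service holds across all running requests, and that every request may grow to max_req_total_len. Under that worst-case assumption, their ratio gives a lower bound on concurrency. `worst_case_concurrency` is a hypothetical helper, and the real scheduling behavior is described in the lightllm documentation.

```python
def worst_case_concurrency(max_total_token_num: int, max_req_total_len: int) -> int:
    """Requests that fit at once if each one uses its full token budget."""
    if max_req_total_len <= 0:
        raise ValueError("max_req_total_len must be positive")
    return max_total_token_num // max_req_total_len


# Values from the launch command above: 120000 // 8192 == 14
print(worst_case_concurrency(120000, 8192))
```

Shorter requests allow more concurrency than this bound, but a small max_total_token_num lowers it in every case.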