
Testing Methods for Large Language Models

news 2026/6/23 4:17:36

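Preface

As large models have advanced over the past few years, more and more AI practitioners around me have come to admit (as I did long ago) that building large models well is a job for the very few people whose brains have compute to spare. For everyone else, what matters most is learning how to use large models well.

A prerequisite for using large models well is knowing how to test them. This article records the mainstream industry methods for testing large-model inference services.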

vLLM benchmarks.serve

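vLLM is among the most widely used open-source inference engines for large models. benchmarks.serve (invoked as vllm bench serve in the example below) is the performance benchmarking tool that ships with vLLM. It is built to measure the throughput, latency, and concurrency capacity of a vLLM inference service. The key metrics are: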

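Throughput: req/s (requests per second) and token/s (tokens generated per second)
Latency:
- TTFT (Time To First Token): latency until the first token arrives; the metric users notice most
- TPOT (Time Per Output Token): average time to generate each output token, excluding the first
- ITL (Inter-Token Latency): delay between consecutive tokens
- P95/P99 tail latency: behavior in the worst cases, which sets the floor for user experience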
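To make these definitions concrete, here is a minimal sketch that derives TTFT, TPOT, ITL, and percentiles from per-token arrival timestamps. The function names and the sample timestamps are hypothetical, and the nearest-rank percentile is an assumption chosen for simplicity; vLLM's own tool may interpolate differently, so small differences from its reported values are to be expected.

```python
import math


def request_metrics(send_time, token_times):
    """Latency metrics (in seconds) for one streamed request.

    send_time:   time the request was sent.
    token_times: arrival time of each output token, in order.
    """
    # TTFT: wait from sending the request to receiving the first token.
    ttft = token_times[0] - send_time
    # ITL: gap between each pair of consecutive tokens.
    itl = [b - a for a, b in zip(token_times, token_times[1:])]
    # TPOT: decode time averaged over every token after the first.
    n = len(token_times)
    tpot = (token_times[-1] - token_times[0]) / (n - 1) if n > 1 else 0.0
    return {"ttft": ttft, "tpot": tpot, "itl": itl}


def percentile(values, p):
    """Nearest-rank percentile: smallest sample with at least p% at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical timestamps (seconds) for two requests sent at t = 0.
m1 = request_metrics(0.0, [0.5, 0.6, 0.7, 0.9])
m2 = request_metrics(0.0, [1.0, 1.1, 1.2, 1.3])

ttfts = [m1["ttft"], m2["ttft"]]
all_itl = m1["itl"] + m2["itl"]
print("P99 TTFT (s):", percentile(ttfts, 99))
print("P95 ITL  (s):", round(percentile(all_itl, 95), 3))
```

Note that TPOT is one number per request, while ITL is one sample per token gap, so ITL percentiles expose individual stalls that a per-request average smooths over.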

An example invocation and its output:

# Max input tokens: 1024; max output tokens: 1024; concurrency: 26
vllm bench serve \
  --backend openai-chat \
  --base-url http://100.124.110.110:8008 \
  --endpoint /v1/chat/completions \
  --dataset-name random \
  --model test \
  --tokenizer /data/Qwen/Qwen3.5-397B-A17B-w8a8-mtp/Qwen3.5-397B-A17B-w8a8-mtp/ \
  --seed 1024 \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --num-prompts 104 \
  --max-concurrency 26 \
  --request-rate inf \
  --metric-percentiles 95,99 \
  --trust-remote-code \
  --save-result \
  --result-filename /workspace/bench_1k1k_concurrency26.json

INFO 04-03 02:36:14 [__init__.py:44] Available plugins for group vllm.platform_plugins:
INFO 04-03 02:36:14 [__init__.py:46] - ascend -> vllm_ascend:register
INFO 04-03 02:36:14 [__init__.py:49] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 04-03 02:36:14 [__init__.py:239] Platform plugin ascend is activated
INFO 04-03 02:36:23 [__init__.py:110] Registered model loader `<class 'vllm_ascend.model_loader.netloader.netloader.ModelNetLoaderElastic'>` with load format `netloader`
INFO 04-03 02:36:23 [__init__.py:110] Registered model loader `<class 'vllm_ascend.model_loader.rfork.rfork_loader.RForkModelLoader'>` with load format `rfork`
Namespace(subparser='bench', bench_type='serve', dispatch_function=<function BenchmarkServingSubcommand.cmd at 0xffff0d3a8540>, trust_remote_code=True, seed=1024, num_prompts=104, dataset_name='random', no_stream=False, dataset_path=None, no_oversample=False, skip_chat_template=False, enable_multimodal_chat=False, disable_shuffle=False, custom_output_len=256, spec_bench_output_len=256, spec_bench_category=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, blazedit_min_distance=0.0, blazedit_max_distance=1.0, asr_max_audio_len_sec=inf, asr_min_audio_len_sec=0.0, random_input_len=1024, random_output_len=1024, random_range_ratio=0.0, random_prefix_len=0, random_batch_size=1, no_reranker=False, random_mm_base_items_per_request=1, random_mm_num_mm_items_range_ratio=0.0, random_mm_limit_mm_per_prompt={'image': 255, 'video': 1}, random_mm_bucket_config={(256, 256, 1): 0.5, (720, 1280, 1): 0.5, (720, 1280, 16): 0.0}, hf_subset=None, hf_split=None, hf_name=None, hf_output_len=None, prefix_repetition_prefix_len=256, prefix_repetition_suffix_len=256, prefix_repetition_num_prefixes=10, prefix_repetition_output_len=128, label=None, backend='openai-chat', base_url='http://100.124.17.133:8008', host='127.0.0.1', port=8000, endpoint='/v1/chat/completions', header=None, max_concurrency=26, model='test', input_len=None, output_len=None, tokenizer='/data/Qwen/Qwen3.5-397B-A17B-w8a8-mtp/Qwen3.5-397B-A17B-w8a8-mtp/', tokenizer_mode='auto', use_beam_search=False, logprobs=None, request_rate=inf, burstiness=1.0, disable_tqdm=False, num_warmups=0, profile=False, save_result=True, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename='/workspace/bench_1k1k_concurrency26.json', ignore_eos=False, percentile_metrics=None, metric_percentiles='95,99', goodput=None, request_id_prefix='bench-29601d17-', top_p=None, top_k=None, min_p=None, temperature=None, frequency_penalty=None, presence_penalty=None, repetition_penalty=None, served_model_name=None, lora_modules=None, ramp_up_strategy=None, ramp_up_start_rps=None, ramp_up_end_rps=None, ready_check_timeout_sec=0, extra_body=None, skip_tokenizer_init=False, insecure=False, plot_timeline=False, timeline_itl_thresholds=[25.0, 50.0], plot_dataset_stats=False)
INFO 04-03 02:36:25 [datasets.py:631] Sampling input_len from [1024, 1024] and output_len from [1024, 1024]
WARNING: vllm bench serve no longer sets temperature==0 (greedy) in requests by default. The default will be determined on the server side and can be model/API specific. For the old behavior, include --temperature=0.
Starting initial single prompt test run...
Skipping endpoint ready check.
Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 26
100%|████████████████████████████████████████| 104/104 [04:03<00:00, 2.34s/it]
tip: install termplotlib and gnuplot to plot the metrics
============ Serving Benchmark Result ============
Successful requests:                     104
Failed requests:                         0
Maximum request concurrency:             26
Benchmark duration (s):                  243.64
Total input tokens:                      106496
Total generated tokens:                  106496
Request throughput (req/s):              0.43
Output token throughput (tok/s):         437.10
Peak output token throughput (tok/s):    520.00
Peak concurrent requests:                45.00
Total token throughput (tok/s):          874.20
---------------Time to First Token----------------
Mean TTFT (ms):                          2486.02
Median TTFT (ms):                        2457.77
P95 TTFT (ms):                           4266.63
P99 TTFT (ms):                           4273.27
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          57.09
Median TPOT (ms):                        57.04
P95 TPOT (ms):                           58.37
P99 TPOT (ms):                           58.89
---------------Inter-token Latency----------------
Mean ITL (ms):                           57.11
Median ITL (ms):                         55.33
P95 ITL (ms):                            59.11
P99 ITL (ms):                            67.50
==================================================
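The figures in the report above are tied together by simple arithmetic, which makes for a useful sanity check when reading any benchmark result. The sketch below recomputes the throughput lines from the totals and estimates the run duration from the mean latencies. All inputs are copied from the report; the "waves" model (every request has the same length, so 104 requests at concurrency 26 finish in four back-to-back batches) is a simplifying assumption, not something the tool itself reports.

```python
# Values copied from the benchmark report above.
num_requests = 104
max_concurrency = 26
duration_s = 243.64
total_input_tokens = 106496
total_output_tokens = 106496
mean_ttft_ms = 2486.02
mean_tpot_ms = 57.09
output_len = 1024  # --random-output-len

# Throughput is a total divided by the wall-clock duration.
request_throughput = num_requests / duration_s
output_throughput = total_output_tokens / duration_s
total_throughput = (total_input_tokens + total_output_tokens) / duration_s

# Rough latency of one request: first token, then (output_len - 1) decode steps.
est_request_s = (mean_ttft_ms + (output_len - 1) * mean_tpot_ms) / 1000

# Assumption: equal-length requests run in back-to-back waves of max_concurrency.
waves = num_requests / max_concurrency
est_duration_s = waves * est_request_s

print(f"request throughput: {request_throughput:.2f} req/s")
print(f"output throughput:  {output_throughput:.1f} tok/s")
print(f"total throughput:   {total_throughput:.1f} tok/s")
print(f"estimated duration: {est_duration_s:.1f} s (reported {duration_s} s)")
```

The recomputed throughputs agree with the report to within rounding of the printed duration, and the estimated duration lands within roughly 0.1 s of the reported 243.64 s. That agreement suggests the run was dominated by decoding at about 57 ms per token per request.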

AISBench

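AISBench is a benchmark for evaluating the performance and accuracy of large models and servers. It was initiated by the China Electronics Standardization Institute, with Huawei, Inspur, and other companies taking part in building it. To get started, see https://github.com/AISBench/benchmark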

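AISBench also natively supports load testing vLLM services. Follow the README in the repository above to install it and edit the configuration files, then run the command below for a first evaluation.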

ais_bench --models vllm_api_general_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt --summarizer example

If the run fails and the log file contains the following:

FileExistsError: Dataset path: /root/ais_bench/benchmark/datasets/utils/../../../../ais_bench/datasets/gsm8k is not exist!

download the dataset as described in https://gitee.com/aisbench/benchmark/blob/master/ais_bench/benchmark/configs/datasets/demo/README.md, then run the command again.
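The path in the error message is hard to read because of its chain of ".." segments. As a small aid, the sketch below collapses them to show which directory the loader is looking for. The path string is copied verbatim from the error above; whether your installation uses the same root depends on where AISBench was installed, so treat the result as specific to this log.

```python
import posixpath

# Path copied verbatim from the error message above.
reported = "/root/ais_bench/benchmark/datasets/utils/../../../../ais_bench/datasets/gsm8k"

# Collapse the ".." segments to get the directory the loader expects.
expected_dir = posixpath.normpath(reported)
print(expected_dir)  # /root/ais_bench/datasets/gsm8k
```

In this log the dataset is therefore expected under /root/ais_bench/datasets/gsm8k, which is the directory to check after following the download instructions.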
