
[Experiment Report] sglang, vllm, and transformers under forced serial inference

We consider workloads that force serial inference: each request must finish decoding before the next one starts.

  • Frameworks under test: transformers, vllm, sglang.

  • With / without speculative decoding.

    Speculative decoding here means EAGLE3. It is easy to find EAGLE heads trained on English corpora. Note: an EAGLE head trained on English data performs very poorly on Chinese prompts, but it still achieves accept-length > 1.

  • The base model is qwen3-8b, running on a single L40 (which is why the hundreds of tps casually reported on HuggingFace look terrifying by comparison). Sampling uses temperature ≠ 0; the model outputs may differ across runs, but that clearly does not affect the tps statistics. All runs use 16-bit precision.

    The main inference parameters are listed above, but many other factors affect throughput and it is hard to control every variable; since these are casual notes, this will have to do.

  • For metrics I only look at tokens per second, mainly to get a feel for the order of magnitude; I did not bother with mean \(\pm\) std statistics.

  • 11 prompts were used as input.

  1. transformers + eagle3
    That is, clone the EAGLE GitHub repo directly and generate with its eagenerate.
    Since there is no official timing tool, tps is computed by timing the eagenerate call, counting the generated tokens, and dividing.

    Generation time: 19.689866304397583s for 1128 tokens, speed: 57.28835242258957 tokens/s
    Generation time: 24.006053924560547s for 1469 tokens, speed: 61.192897617257664 tokens/s
    Generation time: 33.72415637969971s for 2217 tokens, speed: 65.73922784127895 tokens/s
    Generation time: 24.192238330841064s for 1477 tokens, speed: 61.05263927220292 tokens/s
    Generation time: 21.344391345977783s for 1268 tokens, speed: 59.40670686957521 tokens/s
    Generation time: 16.566300868988037s for 1122 tokens, speed: 67.72785360311629 tokens/s
    Generation time: 25.769388437271118s for 1559 tokens, speed: 60.49813730717671 tokens/s
    Generation time: 35.69959473609924s for 2051 tokens, speed: 57.4516325790679 tokens/s
    Generation time: 24.897949934005737s for 1422 tokens, speed: 57.11313597180247 tokens/s
    Generation time: 16.427077054977417s for 855 tokens, speed: 52.04821266367253 tokens/s
    Generation time: 26.052607536315918s for 1550 tokens, speed: 59.49500439982387 tokens/s
    

    Memory usage is the standard base-model footprint + the eagle-head footprint + the kv-cache reserved for max_length tokens.
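    The timing method described above can be sketched as follows; `measure_tps` is a hypothetical helper, and `generate_fn` stands in for the EAGLE repo's eagenerate:

    ```python
    import time

    def measure_tps(generate_fn, input_ids):
        """Time one generation call and report tokens per second.

        generate_fn stands in for the EAGLE repo's eagenerate; it is
        assumed to return the full output ids, prompt included.
        """
        start = time.time()
        output_ids = generate_fn(input_ids)
        elapsed = time.time() - start
        new_tokens = len(output_ids) - len(input_ids)
        tps = new_tokens / elapsed
        print(f"Generation time: {elapsed}s for {new_tokens} tokens, "
              f"speed: {tps} tokens/s")
        return tps
    ```

    This counts only newly generated tokens, not the prompt, which matches the "for N tokens" figures in the log above.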

  2. vllm + nothing

    [00:27<00:00, 27.20s/it, est. speed input: 103.04 toks/s, output: 42.86 toks/s]
    [00:15<00:00, 15.15s/it, est. speed input: 166.97 toks/s, output: 42.70 toks/s]
    [00:34<00:00, 34.53s/it, est. speed input: 82.83 toks/s, output: 42.89 toks/s]
    [00:38<00:00, 38.48s/it, est. speed input: 80.00 toks/s, output: 42.73 toks/s]
    [00:19<00:00, 19.70s/it, est. speed input: 147.42 toks/s, output: 42.59 toks/s]
    [00:36<00:00, 36.50s/it, est. speed input: 102.99 toks/s, output: 42.27 toks/s]
    [00:24<00:00, 24.61s/it, est. speed input: 107.77 toks/s, output: 42.91 toks/s]
    [00:51<00:00, 51.44s/it, est. speed input: 57.66 toks/s, output: 42.77 toks/s]
    [00:33<00:00, 33.09s/it, est. speed input: 89.75 toks/s, output: 42.76 toks/s]
    [00:36<00:00, 36.10s/it, est. speed input: 89.97 toks/s, output: 42.60 toks/s]
    [00:51<00:00, 51.08s/it, est. speed input: 55.79 toks/s, output: 42.83 toks/s]
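    For reference, the serial vLLM run looks roughly like this; the model path and sampling settings are my assumptions, not taken from the post:

    ```python
    from vllm import LLM, SamplingParams

    llm = LLM(model="Qwen/Qwen3-8B", dtype="bfloat16")         # assumed model path
    params = SamplingParams(temperature=0.7, max_tokens=4096)  # assumed values

    prompts = [...]  # the 11 prompts

    # Forced serial: one generate() call per prompt, so at most one request
    # is ever in flight; each call prints one of the progress lines above.
    for prompt in prompts:
        outputs = llm.generate([prompt], params)
    ```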
    
  3. sglang + nothing

    Decode batch, #running-req: 1, #token: 3778, token usage: 0.46, cuda graph: True, gen throughput (token/s): 44.34
    Decode batch, #running-req: 1, #token: 4284, token usage: 0.52, cuda graph: True, gen throughput (token/s): 44.16
    Decode batch, #running-req: 1, #token: 4780, token usage: 0.58, cuda graph: True, gen throughput (token/s): 43.91
    Decode batch, #running-req: 1, #token: 5326, token usage: 0.65, cuda graph: True, gen throughput (token/s): 43.71
    Decode batch, #running-req: 1, #token: 4643, token usage: 0.57, cuda graph: True, gen throughput (token/s): 43.93
    Decode batch, #running-req: 1, #token: 4403, token usage: 0.54, cuda graph: True, gen throughput (token/s): 44.13
    Decode batch, #running-req: 1, #token: 4644, token usage: 0.57, cuda graph: True, gen throughput (token/s): 43.94
    Decode batch, #running-req: 1, #token: 4403, token usage: 0.54, cuda graph: True, gen throughput (token/s): 44.07
    Decode batch, #running-req: 1, #token: 4418, token usage: 0.54, cuda graph: True, gen throughput (token/s): 44.11
    Decode batch, #running-req: 1, #token: 5092, token usage: 0.62, cuda graph: True, gen throughput (token/s): 43.81
    Decode batch, #running-req: 1, #token: 5012, token usage: 0.61, cuda graph: True, gen throughput (token/s): 43.84
    

    This looks about 1 tps better than vllm + nothing, which is not significant and may well be sampling noise, so we ignore it.
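    A sketch of the sglang side using the offline engine API (the original may have used the HTTP server instead; the model path and sampling values are assumptions):

    ```python
    import sglang as sgl

    llm = sgl.Engine(model_path="Qwen/Qwen3-8B")  # assumed model path

    prompts = [...]  # the 11 prompts

    # Forced serial: submit one prompt at a time, matching the
    # "#running-req: 1" entries in the decode-batch log above.
    for prompt in prompts:
        out = llm.generate(prompt, {"temperature": 0.7, "max_new_tokens": 4096})
    ```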

  4. vllm + eagle3

    [00:14<00:00, 14.67s/it, est. speed input: 225.82 toks/s, output: 56.59 toks/s]
    [00:23<00:00, 23.12s/it, est. speed input: 107.67 toks/s, output: 59.56 toks/s]
    [00:30<00:00, 30.37s/it, est. speed input: 75.89 toks/s, output: 62.78 toks/s]
    [00:30<00:00, 30.48s/it, est. speed input: 85.35 toks/s, output: 60.75 toks/s]
    [00:21<00:00, 21.64s/it, est. speed input: 142.03 toks/s, output: 60.78 toks/s]
    [00:31<00:00, 31.46s/it, est. speed input: 108.05 toks/s, output: 69.17 toks/s]
    [00:32<00:00, 32.62s/it, est. speed input: 95.64 toks/s, output: 62.65 toks/s]
    [00:39<00:00, 39.54s/it, est. speed input: 83.36 toks/s, output: 61.13 toks/s]
    [00:31<00:00, 31.13s/it, est. speed input: 106.44 toks/s, output: 61.07 toks/s]
    [00:30<00:00, 30.32s/it, est. speed input: 101.60 toks/s, output: 62.59 toks/s]
    

    With eagle3, tokens per second jumps from the previous 42~43 to 59~62.

    There are GitHub issues questioning this result, since a gain of roughly 50% is well below expectations; compare the sglang + eagle numbers below.
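    For completeness, a sketch of enabling EAGLE3 in vLLM. The speculative-decoding interface has changed across vLLM releases, so the exact keys below are an assumption, and the draft-head path is a placeholder:

    ```python
    from vllm import LLM

    llm = LLM(
        model="Qwen/Qwen3-8B",
        speculative_config={
            "method": "eagle3",
            "model": "path/to/eagle3-draft-head",  # placeholder draft head
            "num_speculative_tokens": 4,           # illustrative value
        },
    )
    ```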

  5. sglang + eagle

    Decode batch, #running-req: 1, #token: 4365, token usage: 0.53, accept len: 3.45, accept rate: 0.06, cuda graph: True, gen throughput (token/s): 80.28,
    Decode batch, #running-req: 1, #token: 3652, token usage: 0.45, accept len: 3.23, accept rate: 0.05, cuda graph: True, gen throughput (token/s): 75.12,
    Decode batch, #running-req: 1, #token: 4962, token usage: 0.61, accept len: 4.22, accept rate: 0.07, cuda graph: True, gen throughput (token/s): 97.56,
    Decode batch, #running-req: 1, #token: 5539, token usage: 0.68, accept len: 3.08, accept rate: 0.05, cuda graph: True, gen throughput (token/s): 71.04,
    Decode batch, #running-req: 1, #token: 5156, token usage: 0.63, accept len: 3.42, accept rate: 0.06, cuda graph: True, gen throughput (token/s): 79.04,
    Decode batch, #running-req: 1, #token: 4107, token usage: 0.50, accept len: 4.38, accept rate: 0.07, cuda graph: True, gen throughput (token/s): 101.87,
    Decode batch, #running-req: 1, #token: 4976, token usage: 0.61, accept len: 4.00, accept rate: 0.07, cuda graph: True, gen throughput (token/s): 92.61,
    Decode batch, #running-req: 1, #token: 4957, token usage: 0.61, accept len: 4.40, accept rate: 0.07, cuda graph: True, gen throughput (token/s): 101.80,
    Decode batch, #running-req: 1, #token: 4508, token usage: 0.55, accept len: 4.65, accept rate: 0.08, cuda graph: True, gen throughput (token/s): 108.06,
    Decode batch, #running-req: 1, #token: 4950, token usage: 0.60, accept len: 4.10, accept rate: 0.07, cuda graph: True, gen throughput (token/s): 94.86,
    Decode batch, #running-req: 1, #token: 5085, token usage: 0.62, accept len: 3.65, accept rate: 0.06, cuda graph: True, gen throughput (token/s): 84.59,
    

    tps jumps from ~44 to 71~108, roughly a 1.6~2.5× gain; the effect is striking.
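    The sglang run maps the usual EAGLE3 server flags onto the offline engine; the draft-head path is a placeholder and the speculative settings are illustrative, not tuned:

    ```python
    import sglang as sgl

    llm = sgl.Engine(
        model_path="Qwen/Qwen3-8B",
        speculative_algorithm="EAGLE3",
        speculative_draft_model_path="path/to/eagle3-draft-head",
        speculative_num_steps=5,         # draft depth per verification round
        speculative_eagle_topk=4,        # branches kept per draft step
        speculative_num_draft_tokens=8,  # tokens verified per round
    )
    ```

    With accept len around 3~4.6, each verification pass of the target model emits that many tokens on average, which is roughly what lifts throughput from ~44 to 71~108 tps once draft-head overhead is subtracted.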


For certain reasons we can also run with a concurrency of 2.

The EAGLE3 repo does not provide a batchsize \(\neq\) 1 implementation, and I could not be bothered to write one, so the transformers + eagle data point is missing there.

