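Testing Tencent's Hunyuan Hy-MT2 multilingual translation model with the Vulkan build of llama.cpp
First, download the model in GGUF format from the hf-mirror site: https://hf-mirror.com/tencent/Hy-MT2-1.8B-GGUF/tree/main. The modelscope site does not offer this format yet: https://modelscope.cn/models/Tencent-Hunyuan/Hy-MT2-1.8B
Download the following file: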
C:\d>curl -LO https://hf-mirror.com/tencent/Hy-MT2-1.8B-GGUF/resolve/main/Hy-MT2-1.8B-Q4_K_M.gguf -C -
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1365    0  1365    0     0   1408      0                             0
100 1.05G  100 1.05G    0     0  7.64M      0  02:21   02:21           8.39M
Next, go to the llama.cpp GitHub repository and download the latest prebuilt binaries. Choose the Vulkan build: compared with the CPU build, the only difference is an additional 56 MB ggml-vulkan.dll, which detects the graphics card automatically.
C:\d>curl -LO https://github.com/ggml-org/llama.cpp/releases/download/b9279/llama-b9279-bin-win-cpu-x64.zip
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0          00:03              0
 22 15.18M  22 3.35M    0     0  33555      0  07:54   01:44   06:10  30056^C
C:\d>curl -LO https://github.com/ggml-org/llama.cpp/releases/download/b9279/llama-b9279-bin-win-vulkan-x64.zip -C -
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0                             0
100 31.17M 100 31.17M   0     0  35429      0  15:22   15:22          38437
To make the benchmark output easier to read, here is an excerpt explaining what the parameters mean.
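Parameters
What is Q4_0
Q4_0 is a 4-bit quantization format. Its point is not a "stronger" model but a model that is smaller, uses less VRAM, and fits on more devices. These leaderboards mostly standardize on Llama 2 7B, Q4_0; the main goal is to reduce the number of variables so that results from different GPUs are easier to compare side by side.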
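The "smaller" part can be made concrete with a little arithmetic. The sketch below is not from the excerpt: it assumes the ggml Q4_0 layout of 32 weights per block sharing one 16-bit scale, and it uses the 1.05 GiB size and 1.79 B parameter count reported for the file tested in this post. The function names are made up for illustration.

```python
GIB = 2 ** 30

def bits_per_weight(block_bytes, weights_per_block):
    """Storage cost of one weight, in bits, for a block-quantized format."""
    return block_bytes * 8 / weights_per_block

def model_size_gib(n_params, bpw):
    """Approximate file size for n_params weights stored at bpw bits each."""
    return n_params * bpw / 8 / GIB

def effective_bpw(size_gib, n_params):
    """Average bits per weight implied by a file size and a parameter count."""
    return size_gib * GIB * 8 / n_params

# Assumed Q4_0 block: 32 four-bit values (16 bytes) plus one fp16 scale (2 bytes).
Q4_0_BPW = bits_per_weight(block_bytes=16 + 2, weights_per_block=32)

print(f"Q4_0: {Q4_0_BPW} bits per weight")
print(f"1.79 B params at fp16: {model_size_gib(1.79e9, 16):.2f} GiB")
print(f"1.79 B params at Q4_0: {model_size_gib(1.79e9, Q4_0_BPW):.2f} GiB")
print(f"tested file: {effective_bpw(1.05, 1.79e9):.2f} bits per weight")
```

The tested file comes out slightly above the Q4_0 figure, which is plausible because it is a Q4_K_M file rather than Q4_0 and not every tensor need be stored in the same format.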
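What is pp512
pp512 can generally be read as "prompt processing, 512 tokens", that is, the throughput when processing 512 input tokens.
pp = prompt processing
512 = the input length is 512 tokens
t/s = tokens per second
It is closer to "how fast the prompt is ingested". This phase usually parallelizes well, so the numbers tend to be high.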
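What is tg128
tg128 can generally be read as "text generation, 128 tokens", that is, the speed when generating 128 tokens in a row.
tg = text generation
128 = 128 tokens generated consecutively
t/s = tokens per second
It is closer to what we normally perceive as "how fast the model answers". Because generation proceeds token by token, each step depending on the previous one, it is usually much lower than pp512.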
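To see how the two figures combine, here is a minimal sketch that turns a pp and a tg throughput into a rough response-time estimate. The throughput values are round illustrative numbers chosen for the example, not measurements, and the estimate ignores model loading, sampling, and other overhead.

```python
def estimate_seconds(prompt_tokens, gen_tokens, pp_tps, tg_tps):
    """Return (prompt time, generation time, total) in seconds."""
    prompt_s = prompt_tokens / pp_tps   # prompt phase runs at the pp rate
    gen_s = gen_tokens / tg_tps         # generation phase runs at the tg rate
    return prompt_s, gen_s, prompt_s + gen_s

# Illustrative throughputs only (assumed, in tokens per second).
PP_TPS = 600.0
TG_TPS = 45.0

prompt_s, gen_s, total_s = estimate_seconds(512, 128, PP_TPS, TG_TPS)
print(f"prompt: {prompt_s:.2f} s, generation: {gen_s:.2f} s, total: {total_s:.2f} s")
```

Even though the prompt here is four times longer than the output, the generation phase dominates the estimate, which is why tg128 tracks perceived speed more closely.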
Benchmark:
C:\d\llama260522>llama-bench -m ..\Hy-MT2-1.8B-Q4_K_M.gguf -ngl 0
load_backend: loaded RPC backend from C:\d\llama260522\ggml-rpc.dll
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon 780M Graphics (AMD proprietary driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
load_backend: loaded Vulkan backend from C:\d\llama260522\ggml-vulkan.dll
load_backend: loaded CPU backend from C:\d\llama260522\ggml-cpu-zen4.dll
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| hunyuan-dense 1.8B Q4_K - Medium |   1.05 GiB |     1.79 B | Vulkan     |   0 |           pp512 |       592.38 ± 13.29 |
| hunyuan-dense 1.8B Q4_K - Medium |   1.05 GiB |     1.79 B | Vulkan     |   0 |           tg128 |         45.02 ± 0.42 |
build: 47c0eda9d (9279)
As the output shows, it detected my integrated GPU, an AMD Radeon 780M Graphics. I then renamed ggml-vulkan.dll and ran the benchmark again. This time the backend is the CPU: pp512 drops by about 43% (592.38 → 339.36 t/s), while tg128 is essentially unchanged (45.02 vs. 45.39 t/s).
C:\d\llama260522>llama-bench -m ..\Hy-MT2-1.8B-Q4_K_M.gguf -ngl 0
load_backend: loaded RPC backend from C:\d\llama260522\ggml-rpc.dll
load_backend: loaded CPU backend from C:\d\llama260522\ggml-cpu-zen4.dll
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| hunyuan-dense 1.8B Q4_K - Medium |   1.05 GiB |     1.79 B | CPU        |       8 |           pp512 |       339.36 ± 10.26 |
| hunyuan-dense 1.8B Q4_K - Medium |   1.05 GiB |     1.79 B | CPU        |       8 |           tg128 |         45.39 ± 0.11 |
build: 47c0eda9d (9279)
Consulting the documentation at https://juejin.cn/post/7382216166486540339, I learned the following:
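-ngl N, --n-gpu-layers N:
When built with GPU support, this option offloads some of the model's layers to the GPU for computation.
This usually improves performance.
In the runs so far this parameter was 0. Next I restored the renamed file and dropped the -ngl 0 argument: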
C:\d\llama260522>llama-bench -m ..\Hy-MT2-1.8B-Q4_K_M.gguf
load_backend: loaded RPC backend from C:\d\llama260522\ggml-rpc.dll
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon 780M Graphics (AMD proprietary driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
load_backend: loaded Vulkan backend from C:\d\llama260522\ggml-vulkan.dll
load_backend: loaded CPU backend from C:\d\llama260522\ggml-cpu-zen4.dll
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| hunyuan-dense 1.8B Q4_K - Medium |   1.05 GiB |     1.79 B | Vulkan     |  99 |           pp512 |        844.50 ± 9.69 |
| hunyuan-dense 1.8B Q4_K - Medium |   1.05 GiB |     1.79 B | Vulkan     |  99 |           tg128 |         59.84 ± 0.31 |
build: 47c0eda9d (9279)
With -ngl omitted, the table reports ngl = 99, so the layers are offloaded to the GPU. Both figures improve over the Vulkan run with -ngl 0: pp512 by about 43% (592.38 → 844.50 t/s) and tg128 by about 33% (45.02 → 59.84 t/s).
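The comparisons between the three runs can be recomputed directly from the tables. The sketch below parses the markdown rows that llama-bench printed above and reports relative changes; the parsing helper is my own and only assumes that the last two columns are the test name and the "mean ± stddev" throughput, as in the output shown in this post.

```python
import re

def parse_bench_table(text):
    """Map test name (e.g. 'pp512') to mean t/s from a llama-bench markdown table."""
    results = {}
    for line in text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) < 2:
            continue
        match = re.match(r"([0-9.]+)\s*±", cells[-1])
        if match:  # header and separator rows do not match and are skipped
            results[cells[-2]] = float(match.group(1))
    return results

def percent_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0

# Rows copied from the three runs above.
VULKAN_NGL0 = """
| hunyuan-dense 1.8B Q4_K - Medium | 1.05 GiB | 1.79 B | Vulkan | 0 | pp512 | 592.38 ± 13.29 |
| hunyuan-dense 1.8B Q4_K - Medium | 1.05 GiB | 1.79 B | Vulkan | 0 | tg128 | 45.02 ± 0.42 |
"""
CPU_ONLY = """
| hunyuan-dense 1.8B Q4_K - Medium | 1.05 GiB | 1.79 B | CPU | 8 | pp512 | 339.36 ± 10.26 |
| hunyuan-dense 1.8B Q4_K - Medium | 1.05 GiB | 1.79 B | CPU | 8 | tg128 | 45.39 ± 0.11 |
"""
VULKAN_NGL99 = """
| hunyuan-dense 1.8B Q4_K - Medium | 1.05 GiB | 1.79 B | Vulkan | 99 | pp512 | 844.50 ± 9.69 |
| hunyuan-dense 1.8B Q4_K - Medium | 1.05 GiB | 1.79 B | Vulkan | 99 | tg128 | 59.84 ± 0.31 |
"""

base = parse_bench_table(VULKAN_NGL0)
for label, table in (("CPU only", CPU_ONLY), ("Vulkan, ngl 99", VULKAN_NGL99)):
    run = parse_bench_table(table)
    for test in ("pp512", "tg128"):
        change = percent_change(base[test], run[test])
        print(f"{label:>15} {test}: {change:+.1f}% vs. Vulkan, ngl 0")
```

The ± values printed by llama-bench are ignored here; for differences as small as the tg128 gap between the CPU run and the Vulkan -ngl 0 run, they are worth keeping in mind before reading anything into the change.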
Running a completion example
C:\d\llama260522>llama-completion --model ..\Hy-MT2-1.8B-Q4_K_M.gguf -p "Translate the following segment into Chinese, without additional explanation:Hello" --jinja -ngl 0 -n 64 -st
0.00.078.290 I llama_completion: llama backend init
0.00.078.296 I llama_completion: load the model and apply lora adapter, if any
0.00.078.303 I common_init_result: fitting params to device memory ...
0.00.078.304 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
0.00.408.458 W load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
0.09.586.475 I common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
0.20.135.187 I llama_completion: llama threadpool init, n_threads = 8
0.20.136.500 I llama_completion: chat template is available, enabling conversation mode (disable it with -no-cnv)
0.20.136.506 W *** User-specified prompt will pre-start conversation, did you mean to set --system-prompt (-sys) instead?
0.20.148.699 I llama_completion: chat template example:
<|hy_begin▁of▁sentence|>You are a helpful assistant<|hy_place▁holder▁no▁3|><|hy_User|>Hello<|hy_Assistant|>Hi there<|hy_place▁holder▁no▁2|><|hy_User|>How are you?<|hy_Assistant|>
0.20.148.709 I
0.20.149.456 I system_info: n_threads = 8 (n_threads_batch = 8) / 16 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
0.20.149.458 I
0.20.161.695 I sampler seed: 3367966364
0.20.161.906 I sampler params:
    repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
    dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
    top_k = 20, top_p = 0.800, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.700
    mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000, adaptive_target = -1.000, adaptive_decay = 0.900
0.20.162.115 I sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist
0.20.162.118 I generate: n_ctx = 262144, n_batch = 2048, n_predict = 64, n_keep = 0
0.20.162.118 I
Translate the following segment into Chinese, without additional explanation:Hello你好 [end of text]
0.21.819.501 I common_perf_print: sampling time = 0.64 ms
0.21.819.505 I common_perf_print: samplers time = 0.09 ms / 17 tokens
0.21.819.506 I common_perf_print: load time = 19767.05 ms
0.21.819.511 I common_perf_print: prompt eval time = 1611.80 ms / 15 tokens ( 107.45 ms per token, 9.31 tokens per second)
0.21.819.513 I common_perf_print: eval time = 36.48 ms / 1 runs ( 36.48 ms per token, 27.41 tokens per second)
0.21.819.514 I common_perf_print: total time = 1684.68 ms / 16 tokens
0.21.819.515 I common_perf_print: unaccounted time = 35.77 ms / 2.1 % (total - sampling - prompt eval - eval) / (total)
0.21.819.516 I common_perf_print: graphs reused = 0
C:\d\llama260522>
Next I tested with the CLI. For some reason it exits after translating a single sentence, and feeding the text in from a file behaves the same way. A likely explanation is the -st flag on the command line: in llama.cpp it selects single-turn mode, which runs one conversation turn and then exits.
C:\d\llama260522>llama-cli -m ..\Hy-MT2-1.8B-Q4_K_M.gguf
Loading model... /
C:\d\llama260522>llama-cli -m ..\Hy-MT2-1.8B-Q4_K_M.gguf --jinja -ngl 0 -n 64 -st
Loading model...
build : b9279-47c0eda9d
model : Hy-MT2-1.8B-Q4_K_M.gguf
modalities : text
> 请将以下文本准确翻译为英文。
Please translate the text accurately into English.
[ Prompt: 9.9 t/s | Generation: 50.7 t/s ]
Exiting...
C:\d\llama260522>llama-cli -m ..\Hy-MT2-1.8B-Q4_K_M.gguf --jinja -ngl 0 -n 64 -st
Loading model...
build : b9279-47c0eda9d
model : Hy-MT2-1.8B-Q4_K_M.gguf
modalities : text
> /read ..\eng.txt
Loaded text from '..\eng.txt'
> 译成中文
--- 文件:..\eng.txt ---
简要总结:Lance 是一种开放性的 Lakehouse 格式,专为 AI 工作负载设计。LanceDB 与 DuckDB Labs 合作,让您能够直接在 DuckDB SQL 中执行快速向量和混合搜索,而无需中断您的分析工作流
[ Prompt: 49.9 t/s | Generation: 43.1 t/s ]
Exiting...
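Since each invocation above handles one request and then exits, one workaround for translating several segments is to launch llama-completion once per segment. The sketch below is only that, a sketch: it reuses the exact flags and prompt wording from the completion example earlier in this post, while the function names, the default paths, and the assumption that the translated text arrives on standard output are mine and untested against a real build. Reloading the model for every segment is slow, so this suits small batches only.

```python
import subprocess

# Prompt wording taken from the completion example in this post.
PROMPT_TEMPLATE = (
    "Translate the following segment into Chinese, "
    "without additional explanation:{text}"
)

def build_command(exe, model, text, n_predict=64):
    """Build the argument list, mirroring the flags used in the example above."""
    return [
        exe, "--model", model,
        "-p", PROMPT_TEMPLATE.format(text=text),
        "--jinja", "-ngl", "0", "-n", str(n_predict), "-st",
    ]

def translate_segments(segments, exe="llama-completion",
                       model=r"..\Hy-MT2-1.8B-Q4_K_M.gguf",
                       runner=subprocess.run):
    """Run one process per non-empty segment and collect whatever it wrote to stdout.

    Assumption: the generated text is on stdout and the log lines are not.
    The runner argument exists so the function can be exercised without a model.
    """
    outputs = []
    for segment in segments:
        segment = segment.strip()
        if not segment:
            continue  # skip blank lines
        proc = runner(
            build_command(exe, model, segment),
            capture_output=True, text=True,
            encoding="utf-8", errors="replace",
        )
        outputs.append(proc.stdout)
    return outputs
```

A call such as translate_segments(open("eng.txt", encoding="utf-8")) would then process a file line by line; the file name here is just the one used in the session above.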