
SenseVoice-Small Speech Recognition Model: The Complete ONNX Export Workflow, from HuggingFace to Inference Deployment

news 2026/3/26 22:32:02


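1. Project Overview and Core Value

SenseVoice-Small is a multilingual speech recognition model built for high accuracy. Beyond speech-to-text, it also performs emotion recognition and audio event detection. The model was trained on more than 400,000 hours of multilingual data, supports more than 50 languages, and outperforms the Whisper model in practical tests.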

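Key strengths:

Multilingual recognition: covers more than 50 languages, including Mandarin Chinese, Cantonese, English, Japanese, and Korean
Rich-text output: transcribes speech and also recognizes emotion and audio events (music, applause, laughter, and so on)
Very fast inference: a non-autoregressive end-to-end architecture processes 10 seconds of audio in about 70 milliseconds
Easy deployment: ships with a complete service deployment solution and supports clients written in several programming languages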
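
The speed figure above can be restated as a real-time factor (RTF), the ratio of processing time to audio duration. The following is a minimal sketch of that arithmetic using only the numbers quoted above; the function name `real_time_factor` is illustrative and not part of any library.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Figures quoted above: 10 s of audio processed in about 70 ms
rtf = real_time_factor(0.070, 10.0)
speedup = 1.0 / rtf  # how many times faster than real time

print(f"RTF = {rtf:.3f}, about {speedup:.0f}x faster than real time")
```

With the quoted figures this works out to an RTF of 0.007, roughly 143 times faster than real time. Actual numbers depend on hardware and on whether the model is quantized.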

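2. Environment Setup and Model Download

2.1 System Requirements and Dependency Installation

Before you begin, set up a Python environment with the following dependencies: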

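    # Create a virtual environment (recommended)
    python -m venv sensevoice-env
    source sensevoice-env/bin/activate        # Linux/Mac
    # or: sensevoice-env\Scripts\activate     # Windows

    # Install the core dependencies
    pip install torch torchaudio
    pip install modelscope onnx onnxruntime
    pip install gradio soundfile
    # funasr provides the SenseVoice model code; librosa is used in section 7.2
    pip install funasr librosa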
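
After installation, a quick check that every package can be found saves debugging time later. The sketch below uses only the standard library; the module list mirrors the install commands above, and `find_missing` is an illustrative helper name.

```python
import importlib.util

# Import names of the packages installed above
REQUIRED_MODULES = ["torch", "torchaudio", "modelscope", "onnx",
                    "onnxruntime", "gradio", "soundfile"]


def find_missing(modules):
    """Return the module names that cannot be found in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable")
```

`find_spec` only locates a module without importing it, so the check stays fast even for heavy packages such as torch.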

2.2 Downloading the SenseVoice-Small Model

Fetch the pretrained model from ModelScope:

    from modelscope import snapshot_download

    # 'iic/SenseVoiceSmall' is the ModelScope ID of SenseVoice-Small
    model_dir = snapshot_download('iic/SenseVoiceSmall')
    print(f"Model downloaded to: {model_dir}")

3. ONNX Export Workflow

3.1 Preparing the Export Script

Create an export script that converts the PyTorch model to ONNX format:

    import torch
    from modelscope.models import Model

    # Load the original PyTorch model and switch to inference mode
    model = Model.from_pretrained('iic/SenseVoiceSmall')
    model.eval()

    # Example input: 1 second of audio at 16 kHz.
    # Assumption: forward() accepts a raw waveform tensor of shape
    # (batch, samples). If the model expects extracted features instead,
    # shape dummy_input to match its forward() signature.
    dummy_input = torch.randn(1, 16000)

    # Export to ONNX
    torch.onnx.export(
        model,
        dummy_input,
        "sensevoice_small.onnx",
        export_params=True,          # store the trained weights in the file
        opset_version=13,
        do_constant_folding=True,    # pre-compute constant subgraphs
        input_names=['audio_input'],
        output_names=['text_output'],
        dynamic_axes={               # allow variable batch size and length
            'audio_input': {0: 'batch_size', 1: 'audio_length'},
            'text_output': {0: 'batch_size', 1: 'text_length'},
        },
    )
    print("ONNX export finished")
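
Once the export has run, it is worth confirming that the file loads and passes ONNX's structural checks before moving on. This is a minimal sketch, assuming the `onnx` package from section 2.1 is installed; `check_onnx_file` is an illustrative helper name, not part of any library.

```python
from pathlib import Path


def check_onnx_file(path):
    """Validate an ONNX file and return the names of its graph inputs."""
    model_file = Path(path)
    if not model_file.is_file():
        raise FileNotFoundError(f"ONNX file not found: {model_file}")

    import onnx  # imported lazily so the path check works without onnx

    model = onnx.load(str(model_file))
    onnx.checker.check_model(model)  # raises if the graph is malformed
    return [graph_input.name for graph_input in model.graph.input]
```

Calling `check_onnx_file("sensevoice_small.onnx")` should return the input names chosen at export time. If the list differs from what the export script declared, the inference code in section 5 will need the actual names.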

3.2 Model Quantization (Optional but Recommended)

To speed up inference and reduce memory use, you can quantize the ONNX model:

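    from onnxruntime.quantization import quantize_dynamic, QuantType

    # Paths of the exported model and of the quantized output
    model_path = "sensevoice_small.onnx"
    quantized_model_path = "sensevoice_small_quantized.onnx"

    # Dynamic quantization: weights are stored as 8-bit integers,
    # activations are quantized on the fly at run time
    quantize_dynamic(
        model_path,
        quantized_model_path,
        weight_type=QuantType.QUInt8,
    )
    print("Model quantization finished")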
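
A simple way to see what quantization bought is to compare the two files on disk. The sketch below uses only the standard library; `size_report` is an illustrative helper, and the file names in the usage note are the ones used in the script above.

```python
import os


def size_report(original_path, quantized_path):
    """Return both file sizes in MiB and the quantized/original size ratio."""
    original = os.path.getsize(original_path)
    quantized = os.path.getsize(quantized_path)
    if original == 0:
        raise ValueError("original file is empty")
    mib = 1024 * 1024
    return {
        "original_mib": original / mib,
        "quantized_mib": quantized / mib,
        "ratio": quantized / original,
    }
```

For example, `size_report("sensevoice_small.onnx", "sensevoice_small_quantized.onnx")` returns a dictionary whose `ratio` entry is the fraction of the original size that remains. A smaller file does not guarantee equal accuracy, so it is still worth comparing transcripts from both models on a few test clips.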

4. Building a Front End with Gradio

4.1 Creating the WebUI Application

Build a user-friendly speech recognition interface with Gradio:

    import gradio as gr
    from modelscope.pipelines import pipeline
    from modelscope.utils.constant import Tasks

    # Initialize the inference pipeline
    asr_pipeline = pipeline(
        task=Tasks.auto_speech_recognition,
        model='iic/SenseVoiceSmall',
    )

    def transcribe_audio(audio_path):
        """Run speech recognition on one audio file."""
        if audio_path is None:
            return "Please upload or record audio first"
        result = asr_pipeline(audio_path)
        # Depending on the backend version, the pipeline may return a dict
        # or a list with one dict per input, so handle both shapes
        if isinstance(result, list):
            result = result[0] if result else {}
        return result.get('text', '')

    # Build the Gradio interface
    with gr.Blocks(title="SenseVoice Speech Recognition") as demo:
        gr.Markdown("# 🎙️ SenseVoice-Small Speech Recognition Demo")
        gr.Markdown("Upload an audio file or record speech to try multilingual recognition")

        with gr.Row():
            with gr.Column():
                audio_input = gr.Audio(
                    sources=["upload", "microphone"],
                    type="filepath",
                    label="Upload or record audio",
                )
                btn = gr.Button("Start recognition", variant="primary")
            with gr.Column():
                text_output = gr.Textbox(
                    label="Recognition result",
                    lines=5,
                    placeholder="The recognition result will appear here...",
                )

        # Example audio (replace with paths to files that exist locally)
        gr.Examples(
            examples=["example_audio1.wav", "example_audio2.wav"],
            inputs=audio_input,
            label="Example audio",
        )

        btn.click(fn=transcribe_audio, inputs=audio_input, outputs=text_output)

    # Start the service
    if __name__ == "__main__":
        demo.launch(server_name="0.0.0.0", server_port=7860)

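4.2 Interface Overview

Main areas:

Audio input: supports file upload and live recording
Control button: starts recognition and triggers inference
Result area: shows the recognized text
Example audio: provides sample files for testing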

5. Model Inference and Performance Optimization

5.1 Inference with ONNX Runtime

Use ONNX Runtime for efficient inference:

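    import numpy as np
    import onnxruntime as ort
    import soundfile as sf

    class SenseVoiceONNX:
        def __init__(self, model_path):
            # Create the ONNX Runtime session
            self.session = ort.InferenceSession(
                model_path,
                providers=['CPUExecutionProvider'],  # or CUDA/TensorRT, depending on hardware
            )

        def preprocess_audio(self, audio_path, target_sr=16000):
            """Load audio as mono float32 at 16 kHz, shaped (1, samples)."""
            # With dtype='float32', soundfile already scales samples to [-1.0, 1.0]
            audio, sr = sf.read(audio_path, dtype='float32')
            if audio.ndim > 1:
                audio = audio.mean(axis=1)  # downmix to mono
            if sr != target_sr:
                # Simple linear-interpolation resampling; prefer a proper
                # resampler (for example librosa) when quality matters
                n_out = int(round(len(audio) * target_sr / sr))
                x_old = np.arange(len(audio)) / sr
                x_new = np.arange(n_out) / target_sr
                audio = np.interp(x_new, x_old, audio).astype(np.float32)
            return audio.reshape(1, -1)

        def infer(self, audio_path):
            """Run inference and return the model's first (raw) output."""
            processed_audio = self.preprocess_audio(audio_path)
            inputs = {self.session.get_inputs()[0].name: processed_audio}
            outputs = self.session.run(None, inputs)
            return outputs[0]

    # Usage example
    onnx_model = SenseVoiceONNX("sensevoice_small_quantized.onnx")
    result = onnx_model.infer("test_audio.wav")
    print(f"Recognition output: {result}")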
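
The `infer` method above returns the model's raw first output, which still has to be turned into tokens. The sketch below shows greedy CTC decoding as one possible post-processing step. It rests on assumptions the text does not state: that the exported graph emits per-frame token scores from a CTC head, and that token id 0 is the blank symbol. The toy vocabulary is hypothetical.

```python
def ctc_greedy_decode(frame_scores, blank_id=0):
    """Greedy CTC decoding for one utterance.

    frame_scores: list of per-frame score lists, shape (frames, vocab_size).
    Returns the token ids left after collapsing repeats and removing blanks.
    """
    # Best-scoring token id for every frame
    best_ids = [max(range(len(scores)), key=scores.__getitem__)
                for scores in frame_scores]

    decoded = []
    previous = None
    for token_id in best_ids:
        # Keep a token only when it differs from the previous frame
        # and is not the blank symbol
        if token_id != previous and token_id != blank_id:
            decoded.append(token_id)
        previous = token_id
    return decoded


# Toy example: vocabulary of 3 symbols, id 0 is the blank (hypothetical)
toy_vocab = {1: "a", 2: "b"}
toy_scores = [
    [0.1, 0.8, 0.1],    # a
    [0.1, 0.7, 0.2],    # a again (repeat, collapsed)
    [0.9, 0.05, 0.05],  # blank
    [0.1, 0.8, 0.1],    # a (a new token, because a blank came between)
    [0.2, 0.1, 0.7],    # b
]
token_ids = ctc_greedy_decode(toy_scores)
text = "".join(toy_vocab[i] for i in token_ids)
print(token_ids, text)
```

In a real pipeline the token ids would be mapped to text with the tokenizer that ships with the model rather than a hand-written dictionary.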

5.2 Performance Optimization Tips

Ways to speed up inference:

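    import numpy as np
    import onnxruntime as ort

    # 1. Use the quantized model
    quantized_session = ort.InferenceSession(
        "sensevoice_small_quantized.onnx",
        providers=['CPUExecutionProvider'],
    )

    # 2. Batch processing (if the exported model has a dynamic batch axis)
    def batch_inference(session, audio_arrays):
        """Run several 1-D float32 waveforms through the model in one call."""
        # Clips differ in length, so zero-pad each one to the longest
        max_len = max(len(a) for a in audio_arrays)
        batch = np.zeros((len(audio_arrays), max_len), dtype=np.float32)
        for i, a in enumerate(audio_arrays):
            batch[i, :len(a)] = a
        input_name = session.get_inputs()[0].name
        return session.run(None, {input_name: batch})

    # 3. Use GPU acceleration (if available); falls back to CPU otherwise
    gpu_session = ort.InferenceSession(
        "sensevoice_small.onnx",
        providers=['CUDAExecutionProvider', 'CPUExecutionProvider'],
    )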
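
Before choosing between these options, it helps to measure latency the same way for each configuration. Here is a minimal timing sketch using only the standard library; `benchmark` is an illustrative helper, and the warm-up and run counts are arbitrary defaults.

```python
import statistics
import time


def benchmark(fn, n_warmup=2, n_runs=10):
    """Call fn() repeatedly and return latency statistics in milliseconds."""
    for _ in range(n_warmup):
        fn()  # warm-up runs are not timed
    timings_ms = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "mean_ms": statistics.mean(timings_ms),
        "median_ms": statistics.median(timings_ms),
        "max_ms": max(timings_ms),
        "runs": n_runs,
    }
```

To compare sessions, wrap each inference call in a zero-argument function, for example `benchmark(lambda: onnx_model.infer("test_audio.wav"))`, and run it once per configuration on the same audio file. The median is usually a steadier figure than the mean when other processes share the machine.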

6. Practical Application Examples

6.1 Multilingual Transcription

SenseVoice-Small transcribes speech in many languages:

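    def multi_language_transcription(audio_path, language='auto'):
        """
        Multilingual speech transcription.

        language: 'zh' (Chinese), 'en' (English), 'ja' (Japanese),
                  'ko' (Korean), or 'auto' (automatic detection)
        """
        # Relies on asr_pipeline from section 4.1. A language-specific
        # strategy can be chosen here; this version does not forward
        # `language` to the pipeline.
        result = asr_pipeline(audio_path)
        if isinstance(result, list):  # some versions return one dict per input
            result = result[0] if result else {}
        return {
            'text': result.get('text', ''),
            'language': result.get('language', 'unknown'),
            # None when the pipeline reports no confidence score
            'confidence': result.get('confidence'),
        }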
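
User-supplied language codes are worth validating before they reach the model. The sketch below normalizes a code against the values listed in the docstring above and falls back to automatic detection otherwise; `normalize_language` and `SUPPORTED_LANGUAGES` are illustrative names.

```python
# Language codes listed in the docstring above
SUPPORTED_LANGUAGES = {"auto", "zh", "en", "ja", "ko"}


def normalize_language(code, default="auto"):
    """Return a supported lowercase language code, or the default."""
    if not isinstance(code, str):
        return default
    cleaned = code.strip().lower()
    return cleaned if cleaned in SUPPORTED_LANGUAGES else default


print(normalize_language(" EN "))  # a supported code, cleaned up
print(normalize_language("fr"))    # not in the list, falls back to the default
```

Falling back to `'auto'` keeps the call working for unexpected input. Raising an error instead would be the stricter choice if silent fallbacks are undesirable.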

6.2 Integrating Emotion Recognition

Combine transcription with emotion recognition:

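    def analyze_speech_with_emotion(audio_path):
        """Speech recognition combined with emotion analysis."""
        # Speech recognition (asr_pipeline comes from section 4.1)
        asr_result = asr_pipeline(audio_path)
        if isinstance(asr_result, list):  # some versions return one dict per input
            asr_result = asr_result[0] if asr_result else {}

        # Placeholder: plug the emotion analysis logic in here.
        # The values below are hard-coded examples, not model output.
        emotion_result = {
            'emotion': 'positive',
            'confidence': 0.85,
        }

        return {
            'transcription': asr_result.get('text', ''),
            'emotion': emotion_result['emotion'],
            'emotion_confidence': emotion_result['confidence'],
        }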
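
One way to fill in the placeholder is to read the emotion from the transcript itself. The sketch below assumes that the model's rich-text output embeds tags of the form `<|NAME|>` in front of the text. That format, the tag sets and the sample string are assumptions for illustration and should be checked against the model's documentation.

```python
import re

# Matches tags of the form <|NAME|>
TAG_PATTERN = re.compile(r"<\|([^|<>]+)\|>")

# Tag sets assumed for illustration; verify them against the model's docs
EMOTION_TAGS = {"HAPPY", "SAD", "ANGRY", "NEUTRAL"}
EVENT_TAGS = {"Speech", "BGM", "Applause", "Laughter"}


def parse_rich_text(raw):
    """Split a tagged transcript into plain text, emotion and events."""
    tags = TAG_PATTERN.findall(raw)
    emotion = next((t for t in tags if t in EMOTION_TAGS), None)
    events = [t for t in tags if t in EVENT_TAGS]
    text = TAG_PATTERN.sub("", raw).strip()
    return {"transcription": text, "emotion": emotion, "events": events}


# Hypothetical tagged output, for illustration only
sample = "<|en|><|HAPPY|><|Laughter|><|withitn|>That was a great show."
parsed = parse_rich_text(sample)
print(parsed)
```

Tags that fall outside both sets are simply stripped from the text, so unknown tags do not leak into the transcription.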

7. Common Problems and Solutions

7.1 Model Loading Problems

Problem 1: model download fails

    # Solution: install through a package mirror in mainland China
    pip install modelscope -i https://mirror.baidu.com/pypi/simple

Problem 2: out of memory

    # Solution: use the quantized model or reduce the batch size
    quantized_model = SenseVoiceONNX("sensevoice_small_quantized.onnx")
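
Long recordings are another common cause of memory pressure. Processing them in fixed-size chunks bounds the input size per inference call. The sketch below only computes the chunk boundaries; `chunk_ranges` is an illustrative helper, and the 30-second default is an arbitrary assumption.

```python
def chunk_ranges(total_samples, sample_rate=16000, chunk_seconds=30.0):
    """Return (start, end) sample index pairs covering the whole signal."""
    if total_samples < 0:
        raise ValueError("total_samples must not be negative")
    chunk_size = int(sample_rate * chunk_seconds)
    if chunk_size <= 0:
        raise ValueError("sample_rate * chunk_seconds must be at least 1")
    return [(start, min(start + chunk_size, total_samples))
            for start in range(0, total_samples, chunk_size)]


# 70 seconds of 16 kHz audio split into 30-second pieces
ranges = chunk_ranges(70 * 16000)
print(ranges)
```

Each pair can be used to slice the waveform array before inference. Fixed boundaries may cut through a word, so splitting at silences detected by a voice activity detector usually gives cleaner transcripts.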

7.2 Audio Processing Problems

Problem: unsupported audio format

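    import librosa
    import soundfile as sf

    def convert_audio_format(input_path, output_path, target_sr=16000):
        """Convert audio to a standard format (mono, resampled to target_sr)."""
        # librosa.load decodes the file, downmixes to mono and resamples
        audio, sr = librosa.load(input_path, sr=target_sr)
        # soundfile picks the container from the output file extension
        sf.write(output_path, audio, target_sr)
        return output_path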
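
Conversion can be skipped when a file is already in the expected format. The sketch below inspects a WAV header with the standard library's `wave` module and reports whether conversion is needed. Treating 16-bit mono PCM at 16 kHz as the target format is an assumption based on the sample rate used throughout this article, and `needs_conversion` is an illustrative helper name.

```python
import wave


def needs_conversion(wav_path, target_sr=16000):
    """Return True if the file is not 16-bit mono PCM WAV at target_sr."""
    try:
        with wave.open(wav_path, "rb") as wav_file:
            return not (
                wav_file.getframerate() == target_sr
                and wav_file.getnchannels() == 1
                and wav_file.getsampwidth() == 2  # bytes per sample
            )
    except (wave.Error, EOFError):
        # Not a PCM WAV file that the wave module can parse
        return True
```

Files in other containers such as MP3 or FLAC fail the header check and are reported as needing conversion, which is the safe default.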