
Getting Started with SenseVoice-small-ONNX: Fine-Tuning and Adapting to Domain-Specific Vocabularies (Legal, Medical)

news 2026/3/31 23:55:31

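1. Background and Motivation

Speech recognition is spreading quickly into professional fields, but general-purpose models tend to perform poorly in vertical domains such as law and medicine. Specialist terms and domain-specific phrasing cause frequent recognition errors, and correcting them by hand costs time.

SenseVoice-small-ONNX is a lightweight multilingual speech recognition model. With ONNX quantization it runs inference efficiently, with a reported processing time of about 70 ms for 10 seconds of audio. To be useful in a professional setting, however, it still needs domain adaptation.

This article walks through how to build domain dictionaries for fields such as law and medicine, and how to train and fine-tune around SenseVoice-small-ONNX so that recognition holds up in those settings.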

2. Environment Setup and Model Deployment

2.1 Setting Up the Base Environment

Make sure Python 3.8 or later is installed, then install the required dependencies:

    # Create a dedicated environment
    conda create -n sensevoice-finetune python=3.9
    conda activate sensevoice-finetune

    # Install core dependencies
    pip install funasr-onnx torch torchaudio librosa
    pip install pandas tqdm matplotlib

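2.2 Downloading and Verifying the Model

This walkthrough assumes the quantized SenseVoice-small-ONNX model is already available locally, so it can be loaded directly from its directory: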

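    import os

    from funasr_onnx import SenseVoiceSmall

    # Path to the quantized model directory
    model_path = "/root/ai-models/danieldong/sensevoice-small-onnx-quant"

    # Check that the directory exists, then load the model
    if os.path.exists(model_path):
        model = SenseVoiceSmall(model_path, batch_size=1, quantize=True)
        # Report success only after the model has actually been constructed
        print("Model loaded successfully!")
    else:
        print("Model not found: download it first or check the path")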

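3. Preparing Domain Dictionary Data

3.1 Building a Legal Dictionary

Legal documents contain a large number of specialist terms that need to be collected and annotated. The following example builds a legal dictionary from a CSV file with one term, its pinyin pronunciation, and a weight per row: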

    # Example contents of legal_terms.csv
    # term,pronunciation,weight
    # 最高人民法院,zuigao renmin fayuan,1.0
    # 民事诉讼,minshi susong,0.9
    # 刑事诉讼法,xingshi susong fa,0.9
    # 司法解释,sifa jieshi,0.8
    # 合同纠纷,hetong jiufen,0.85

    import pandas as pd

    def build_legal_dictionary(csv_path, output_path):
        """Build a legal-domain dictionary file (tab-separated, no header)."""
        df = pd.read_csv(csv_path)
        with open(output_path, 'w', encoding='utf-8') as f:
            for _, row in df.iterrows():
                f.write(f"{row['term']}\t{row['pronunciation']}\t{row['weight']}\n")
        print(f"Legal dictionary saved to: {output_path}")

    # Usage example
    build_legal_dictionary('legal_terms.csv', 'legal_dict.txt')
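
A hand-edited CSV easily picks up duplicated terms or out-of-range weights, so it can help to validate the rows before writing the dictionary file. The following is a minimal sketch that uses only the standard library. `validate_terms` is a hypothetical helper, and the [0, 1] weight range is an assumption taken from the sample rows above.

```python
import csv
import io


def validate_terms(csv_text):
    """Return (valid_rows, problems) for CSV text with term,pronunciation,weight columns."""
    valid, problems, seen = [], [], set()
    reader = csv.DictReader(io.StringIO(csv_text))
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        term = (row.get("term") or "").strip()
        pron = (row.get("pronunciation") or "").strip()
        try:
            weight = float(row.get("weight") or "")
        except ValueError:
            problems.append((line_no, "weight is not a number"))
            continue
        if not term or not pron:
            problems.append((line_no, "empty term or pronunciation"))
        elif term in seen:
            problems.append((line_no, f"duplicate term: {term}"))
        elif not 0.0 <= weight <= 1.0:  # assumed valid range
            problems.append((line_no, "weight outside [0, 1]"))
        else:
            seen.add(term)
            valid.append((term, pron, weight))
    return valid, problems


sample = (
    "term,pronunciation,weight\n"
    "最高人民法院,zuigao renmin fayuan,1.0\n"
    "民事诉讼,minshi susong,0.9\n"
    "民事诉讼,minshi susong,0.8\n"
    "司法解释,sifa jieshi,1.5\n"
)
valid_rows, problems = validate_terms(sample)
for line_no, message in problems:
    print(f"line {line_no}: {message}")
```

Rows that fail a check are reported with their line number instead of being silently dropped, which makes the source CSV easier to fix.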

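3.2 Collecting Medical Terminology

Medical terminology is more specialized and more varied, and usually has to be gathered from several sources: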

    def collect_medical_terms():
        """Collect medical terms (hard-coded here; in practice merged from several sources)."""
        medical_terms = [
            # Diseases
            ("糖尿病", "tangniaobing", 1.0),
            ("高血压", "gaoxueya", 1.0),
            ("冠心病", "guanxinbing", 0.9),
            # Procedures and examinations
            ("核磁共振", "heci gongzhen", 0.9),
            ("心电图", "xindiantu", 0.85),
            ("腹腔镜", "fuqiangjing", 0.8),
            # Drugs
            ("阿司匹林", "asipilin", 0.9),
            ("胰岛素", "yidaosu", 0.9),
            ("抗生素", "kangshengsu", 0.85),
        ]
        return medical_terms

    # Save the medical dictionary (tab-separated, same layout as the legal one)
    medical_terms = collect_medical_terms()
    with open('medical_dict.txt', 'w', encoding='utf-8') as f:
        for term, pronunciation, weight in medical_terms:
            f.write(f"{term}\t{pronunciation}\t{weight}\n")
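
When terms really do come from several sources, the same term often appears more than once with different weights. One possible way to combine the lists, sketched here under the assumption that the highest weight should win, is shown below. `merge_term_sources` and the two sample lists are hypothetical.

```python
def merge_term_sources(*sources):
    """Merge (term, pronunciation, weight) lists, keeping the highest weight per term."""
    merged = {}
    for source in sources:
        for term, pronunciation, weight in source:
            current = merged.get(term)
            if current is None or weight > current[1]:
                merged[term] = (pronunciation, weight)
    # Sort by descending weight, then by term, so the output is deterministic.
    return sorted(
        ((t, p, w) for t, (p, w) in merged.items()),
        key=lambda item: (-item[2], item[0]),
    )


# Hypothetical sources with one overlapping term
textbook_terms = [("糖尿病", "tangniaobing", 0.9), ("心电图", "xindiantu", 0.85)]
clinic_terms = [("糖尿病", "tangniaobing", 1.0), ("胰岛素", "yidaosu", 0.9)]
merged_terms = merge_term_sources(textbook_terms, clinic_terms)
```

The merged list can be written out with the same tab-separated loop used for `medical_dict.txt`.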

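4. Fine-Tuning in Practice

4.1 Preparing Training Data

Fine-tuning requires domain-specific paired audio and transcripts. The function below pairs each .wav file with a .txt transcript of the same base name and writes a manifest with one JSON object per line: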

    import json
    import os

    def prepare_training_data(audio_dir, transcript_dir, output_file):
        """Build a training manifest (JSON Lines: one {'audio', 'text'} object per line)."""
        data_list = []

        # Walk the audio files; sorted() keeps the manifest order reproducible
        for audio_file in sorted(os.listdir(audio_dir)):
            if audio_file.endswith('.wav'):
                base_name = os.path.splitext(audio_file)[0]
                transcript_file = os.path.join(transcript_dir, f"{base_name}.txt")

                # Keep only audio files that have a matching transcript
                if os.path.exists(transcript_file):
                    with open(transcript_file, 'r', encoding='utf-8') as f:
                        transcript = f.read().strip()
                    data_list.append({
                        'audio': os.path.join(audio_dir, audio_file),
                        'text': transcript
                    })

        # Write the manifest
        with open(output_file, 'w', encoding='utf-8') as f:
            for item in data_list:
                f.write(json.dumps(item, ensure_ascii=False) + '\n')

        print(f"Training data ready: {len(data_list)} items")
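
A held-out validation set is needed to judge whether fine-tuning actually helps, so the manifest is usually split before training. The sketch below shows one reproducible way to do that. `split_manifest`, the 80/20 ratio, and the sample records are illustrative assumptions and not part of the original pipeline.

```python
import json
import random


def split_manifest(lines, val_ratio=0.1, seed=42):
    """Split JSON Lines records into (train, validation) lists, reproducibly."""
    records = [json.loads(line) for line in lines if line.strip()]
    rng = random.Random(seed)  # local RNG so the global random state is untouched
    rng.shuffle(records)
    # Hold out at least one record when there is more than one to split
    n_val = max(1, int(len(records) * val_ratio)) if len(records) > 1 else 0
    return records[n_val:], records[:n_val]


# Hypothetical manifest lines in the same {'audio', 'text'} layout
manifest_lines = [
    json.dumps({"audio": f"clip_{i}.wav", "text": f"sample {i}"}) for i in range(20)
]
train_records, val_records = split_manifest(manifest_lines, val_ratio=0.2)
print(len(train_records), len(val_records))
```

Fixing the seed means the same utterances land in the validation set on every run, which keeps before/after comparisons fair.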

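4.2 Domain-Adaptation Training

The listing below sketches domain-adaptation training through a fine-tuning interface. The `SenseVoiceFineTuner` class should be read as an illustrative interface, not a confirmed API: funasr_onnx is an inference runtime, and quantized ONNX weights are not normally trained directly. In practice, fine-tuning is usually run on the PyTorch checkpoint, and the result is then exported to ONNX and quantized again.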

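    # NOTE: SenseVoiceFineTuner is an illustrative interface. The published
    # funasr_onnx package does not appear to export a class with this name,
    # so treat this listing as pseudocode for the overall workflow.
    from funasr_onnx import SenseVoiceFineTuner

    def fine_tune_model(base_model_path, train_list, output_dir, domain_dict=None):
        """Fine-tune the model for a specific domain."""
        # Initialize the fine-tuner
        fine_tuner = SenseVoiceFineTuner(
            model_path=base_model_path,
            output_dir=output_dir
        )

        # Training parameters
        train_config = {
            'batch_size': 4,
            'learning_rate': 1e-5,
            'num_epochs': 10,
            'max_duration': 20  # maximum audio length in seconds
        }

        # Load the domain dictionary, if one was given
        if domain_dict:
            fine_tuner.load_dictionary(domain_dict)

        # Start training
        fine_tuner.train(
            data_list=train_list,
            **train_config
        )
        print(f"Fine-tuning finished, model saved to: {output_dir}")

    # Usage example
    fine_tune_model(
        base_model_path="/root/ai-models/danieldong/sensevoice-small-onnx-quant",
        train_list="train_data_list.json",
        output_dir="./fine_tuned_model",
        domain_dict="legal_dict.txt"
    )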

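5. Dictionary Integration and Optimization

5.1 Adjusting Dictionary Weights

Terms differ in importance from one scenario to another, so their weights should be adjusted. The function below blends each term's recognition accuracy and usage frequency into a new weight; both inputs are expected to be normalized to the range [0, 1]: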

    import pandas as pd

    def optimize_dictionary_weights(dict_path, usage_stats):
        """Re-weight dictionary entries from usage statistics.

        usage_stats maps term -> {'accuracy': float, 'frequency': float},
        with both values expected in [0, 1].
        """
        df = pd.read_csv(dict_path, sep='\t', header=None,
                         names=['term', 'pronunciation', 'weight'])

        # Adjust the weight of every term that has statistics
        for term, stats in usage_stats.items():
            if term in df['term'].values:
                # Blend recognition accuracy (70%) and usage frequency (30%)
                accuracy = stats['accuracy']
                frequency = stats['frequency']
                new_weight = min(1.0, accuracy * 0.7 + frequency * 0.3)
                df.loc[df['term'] == term, 'weight'] = new_weight

        # Save the re-weighted dictionary next to the original file
        df.to_csv(dict_path.replace('.txt', '_optimized.txt'),
                  sep='\t', index=False, header=False)
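
The blend in the listing above only behaves sensibly if the frequency is on the same 0 to 1 scale as the accuracy. As a small worked example, the sketch below normalizes raw occurrence counts by the largest count and then applies the same 0.7/0.3 formula. `normalize_frequencies`, `blended_weight`, and the sample numbers are hypothetical.

```python
def normalize_frequencies(raw_counts):
    """Scale raw occurrence counts to [0, 1] by dividing by the largest count."""
    top = max(raw_counts.values(), default=0)
    if top == 0:
        return {term: 0.0 for term in raw_counts}
    return {term: count / top for term, count in raw_counts.items()}


def blended_weight(accuracy, frequency):
    """Same blend as the listing above: 70% accuracy, 30% frequency, capped at 1."""
    return min(1.0, accuracy * 0.7 + frequency * 0.3)


# Made-up statistics for two legal terms
raw_counts = {"合同纠纷": 50, "司法解释": 10}
accuracies = {"合同纠纷": 0.9, "司法解释": 0.6}

frequencies = normalize_frequencies(raw_counts)
new_weights = {
    term: round(blended_weight(accuracies[term], frequencies[term]), 3)
    for term in raw_counts
}
# 合同纠纷: 0.9 * 0.7 + 1.0 * 0.3 = 0.93
# 司法解释: 0.6 * 0.7 + 0.2 * 0.3 = 0.48
```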

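5.2 Loading Dictionaries Dynamically

The class below loads domain dictionaries at runtime. In this simplified version the dictionary is applied only as a post-processing step on the recognized text; it does not influence decoding inside the model.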

    import pandas as pd
    from funasr_onnx import SenseVoiceSmall

    class DomainAdaptedRecognizer:
        """Speech recognizer with switchable domain dictionaries."""

        def __init__(self, model_path, domain_dicts=None):
            self.model = SenseVoiceSmall(model_path, quantize=True)
            self.domain_dicts = domain_dicts or {}

        def load_domain_dict(self, domain_name, dict_path):
            """Load a tab-separated domain dictionary under the given name."""
            df = pd.read_csv(dict_path, sep='\t', header=None,
                             names=['term', 'pronunciation', 'weight'])
            self.domain_dicts[domain_name] = df.to_dict('records')

        def recognize_with_domain(self, audio_path, domain_name, language="auto"):
            """Recognize one audio file and post-process it with a domain dictionary."""
            domain_terms = self.domain_dicts.get(domain_name, [])

            # Simplified: ideally the dictionary would be integrated into inference.
            # The raw output may still carry language/emotion tags that need stripping.
            result = self.model([audio_path], language=language, use_itn=True)

            # Post-process the first (only) result with the domain terms
            processed_result = self._postprocess_with_domain(result[0], domain_terms)
            return processed_result

        def _postprocess_with_domain(self, text, domain_terms):
            """Correct the text using domain terms (currently a pass-through)."""
            for term in domain_terms:
                if term['term'] in text:
                    # More elaborate matching and replacement logic can be added here
                    pass
            return text
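
The post-processing hook in the recognizer above is left empty. One way it could be filled in, shown here as a minimal sketch and not as a mechanism of the model itself, is to overwrite any same-length substring that differs from a dictionary term in only a few positions. `correct_near_misses` and its `threshold` parameter are hypothetical, and the approach only handles substitution errors (a wrong character in the right place), not insertions or deletions.

```python
def positional_similarity(a, b):
    """Fraction of positions at which two equal-length strings agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def correct_near_misses(text, terms, threshold=0.8):
    """Replace same-length substrings that almost match a dictionary term."""
    # Longer terms first, so a short term cannot break up a longer match.
    for term in sorted(set(terms), key=lambda t: (-len(t), t)):
        n = len(term)
        if n == 0:
            continue
        best_ratio, best_start = 0.0, -1
        # Slide a window of the term's length across the text
        for start in range(len(text) - n + 1):
            ratio = positional_similarity(text[start:start + n], term)
            if ratio > best_ratio:
                best_ratio, best_start = ratio, start
        if best_ratio >= threshold:
            text = text[:best_start] + term + text[best_start + n:]
    return text


# "法苑" is a plausible homophone error for "法院"
corrected = correct_near_misses("案件已上诉至最高人民法苑", ["最高人民法院", "民事诉讼"])
print(corrected)
```

The default threshold is deliberately conservative. With 0.8, a four-character term with one wrong character scores 0.75 and is left alone, which avoids rewriting a legitimate term such as 刑事诉讼 into 民事诉讼 when only the latter is in the dictionary.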

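6. Validation and Performance Testing

6.1 Evaluating Recognition Accuracy

Compare recognition quality before and after adaptation. The function below counts an utterance as correct when its similarity to the reference transcript exceeds 0.8: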

    from difflib import SequenceMatcher

    def evaluate_domain_accuracy(test_data, recognizer, domain_name):
        """Estimate domain recognition accuracy over a list of {'audio', 'text'} items."""
        total_count = len(test_data)
        if total_count == 0:
            raise ValueError("test_data is empty")

        correct_count = 0
        for test_item in test_data:
            audio_path = test_item['audio']
            expected_text = test_item['text']

            # Recognize with the domain dictionary applied
            result = recognizer.recognize_with_domain(audio_path, domain_name)

            # Simplified scoring: similarity ratio against the reference
            similarity = calculate_similarity(result, expected_text)
            if similarity > 0.8:  # similarity threshold
                correct_count += 1

        accuracy = correct_count / total_count
        print(f"Recognition accuracy for domain {domain_name}: {accuracy:.2%}")
        return accuracy

    def calculate_similarity(text1, text2):
        """Similarity ratio in [0, 1] between two strings."""
        return SequenceMatcher(None, text1, text2).ratio()
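
The 0.8 similarity threshold gives a per-utterance pass or fail. Speech recognition results for Chinese are more commonly reported as character error rate (CER): the edit distance between hypothesis and reference, divided by the reference length. The sketch below computes it with a plain dynamic-programming edit distance; the function names and sample sentences are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (two-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def character_error_rate(references, hypotheses):
    """Total edit distance divided by total reference length over a test set."""
    total_errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    return total_errors / total_chars if total_chars else 0.0


# Made-up references and recognizer outputs
refs = ["患者有高血压病史", "建议做心电图检查"]
hyps = ["患者有高血鸭病史", "建议做心电图"]
cer = character_error_rate(refs, hyps)
print(f"CER: {cer:.2%}")  # 1 substitution + 2 deletions over 16 characters
```

Unlike a thresholded pass rate, CER moves with every corrected character, so small gains from a dictionary or from fine-tuning stay visible.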

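6.2 Measuring Performance Impact

Measure how much the domain dictionary adds to inference time by timing the same audio with and without dictionary post-processing: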

    import time

    def test_performance_impact(recognizer, audio_path, domain_name, num_runs=10):
        """Measure the overhead of domain post-processing on inference time."""
        # Baseline: the bare model, called with the same options as the domain path
        base_times = []
        for _ in range(num_runs):
            start_time = time.perf_counter()
            recognizer.model([audio_path], language="auto", use_itn=True)
            base_times.append(time.perf_counter() - start_time)

        # With the domain dictionary applied
        domain_times = []
        for _ in range(num_runs):
            start_time = time.perf_counter()
            recognizer.recognize_with_domain(audio_path, domain_name)
            domain_times.append(time.perf_counter() - start_time)

        avg_base = sum(base_times) / num_runs
        avg_domain = sum(domain_times) / num_runs
        overhead = (avg_domain - avg_base) / avg_base * 100

        print(f"Baseline inference time: {avg_base:.3f}s")
        print(f"Domain inference time: {avg_domain:.3f}s")
        print(f"Overhead: {overhead:.1f}%")
        return avg_base, avg_domain, overhead
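
Averages over a handful of runs are easily skewed by the first call, which often pays one-off costs such as lazy initialization. A slightly more robust pattern, sketched below with a hypothetical `benchmark` helper, is to discard a few warm-up calls and report the median alongside the mean. The lambda is a stand-in workload, not a recognizer call.

```python
import statistics
import time


def benchmark(fn, num_runs=10, warmup=2):
    """Time fn() after warm-up calls; return mean and median in seconds."""
    for _ in range(warmup):
        fn()  # results discarded: these calls absorb one-off start-up costs
    timings = []
    for _ in range(num_runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return {
        "mean": statistics.mean(timings),
        "median": statistics.median(timings),
        "runs": len(timings),
    }


# Stand-in workload; in practice fn would wrap the recognizer call
stats = benchmark(lambda: sum(range(10_000)), num_runs=5)
print(f"mean={stats['mean']:.6f}s median={stats['median']:.6f}s")
```

In the setting above, `fn` could be something like `lambda: recognizer.recognize_with_domain(audio_path, domain_name)`.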

7. Application Examples

7.1 Legal Consultation

In legal consultation, recognizing legal terms accurately is essential:

    class LegalConsultationRecognizer:
        """Speech recognizer specialized for legal consultations."""

        def __init__(self, model_path):
            self.recognizer = DomainAdaptedRecognizer(model_path)
            self.recognizer.load_domain_dict('legal', 'legal_dict_optimized.txt')

        def transcribe_legal_conversation(self, audio_path):
            """Transcribe a legal consultation recording."""
            result = self.recognizer.recognize_with_domain(audio_path, 'legal')

            # Legal-specific text post-processing
            processed_text = self._legal_text_postprocess(result)
            return processed_text

        def _legal_text_postprocess(self, text):
            """Legal-specific post-processing (placeholder, returns the text unchanged)."""
            # Steps to add here:
            # - apply legal document formatting
            # - verify term accuracy
            # - format references to statute articles
            return text

7.2 Medical Diagnosis

In medical settings, medical terms must be recognized accurately:

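    class MedicalRecordRecognizer:
        """Speech recognizer specialized for medical records."""

        def __init__(self, model_path):
            self.recognizer = DomainAdaptedRecognizer(model_path)
            self.recognizer.load_domain_dict('medical', 'medical_dict_optimized.txt')

        def transcribe_medical_notes(self, audio_path, doctor_specialty=None):
            """Transcribe dictated medical notes."""
            result = self.recognizer.recognize_with_domain(audio_path, 'medical')

            # Apply extra processing for the doctor's specialty, if given
            if doctor_specialty:
                result = self._specialty_specific_processing(result, doctor_specialty)
            return result

        def _specialty_specific_processing(self, text, specialty):
            """Specialty-specific post-processing."""
            # Different specialties may prefer different terms
            specialty_dicts = {
                'cardiology': self._load_cardiology_terms(),
                'neurology': self._load_neurology_terms()
            }
            if specialty in specialty_dicts:
                return self._apply_specialty_terms(text, specialty_dicts[specialty])
            return text

        # The three helpers below were referenced but not defined in the original
        # listing; they are placeholder stubs so that the class runs as written.
        def _load_cardiology_terms(self):
            return []

        def _load_neurology_terms(self):
            return []

        def _apply_specialty_terms(self, text, terms):
            return text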