
Qwen3-ASR-1.7B Hands-On Tutorial: Pairing It with Qwen3-ForcedAligner-0.6B

news 2026/4/28 15:23:38

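1. Introduction: From Speech to Subtitles, You Need a Complete Pipeline

If you are looking for a speech recognition tool that runs offline and supports multiple languages, Qwen3-ASR-1.7B may already be on your radar. It transcribes audio to text accurately, supports Chinese, English, Japanese, Korean, and other languages, and runs entirely on your own machine, so your audio never leaves it.

You may also have noticed one gap: it tells you what was said but not when. For video subtitles, analysis of speaking pace in meetings, or speech annotation, a transcript without timestamps is like a recipe without cooking times.

That is the problem this tutorial solves. We will build a complete speech-processing pipeline step by step: Qwen3-ASR-1.7B first converts the audio to text, then Qwen3-ForcedAligner-0.6B attaches precise timestamps to every word and sentence. The final output looks like this:

[00:00:01.200 - 00:00:03.500] Hello everyone, welcome to today's meeting
[00:00:03.600 - 00:00:05.800] Today we will discuss three main topics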
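The bracketed layout above can be produced from plain second offsets. Here is a minimal sketch, assuming the aligner reports start and end times as floating-point seconds. The names to_clock and format_line are illustrative only and are not part of either model's API.

```python
def to_clock(seconds: float) -> str:
    """Format a non-negative offset in seconds as HH:MM:SS.mmm."""
    # Work in integer milliseconds so 59.9996 s becomes 00:01:00.000, not 00:00:60.000
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def format_line(start: float, end: float, text: str) -> str:
    """Render one sentence in the '[start - end] text' layout."""
    return f"[{to_clock(start)} - {to_clock(end)}] {text}"


print(format_line(1.2, 3.5, "Hello everyone, welcome to today's meeting"))
```

Rounding to whole milliseconds before splitting into hours, minutes, and seconds avoids the carry problem that appears when the seconds field is formatted as a float on its own.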

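Learning goals:

Deploy and use the Qwen3-ASR-1.7B speech recognition model on your own
Deploy the Qwen3-ForcedAligner-0.6B timestamp alignment model
Understand how to chain the two models into a complete audio-to-timestamped-text workflow
Get code examples and deployment scripts you can reuse directly

Prerequisites: basic Python, familiarity with running commands in a terminal, and a general idea of what an API call is. No background in speech processing or deep learning is required.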

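2. Step One: Deploy the Qwen3-ASR-1.7B Speech Recognition Model

2.1 Quick Deployment and Verification

Deploying Qwen3-ASR-1.7B is simple: a few clicks and a short wait.

Deployment steps:

Select the image: in your cloud platform or local environment, find the image named ins-asr-1.7b-v1
Start the instance: click the "Deploy" button and the system creates the runtime environment automatically
Wait for startup: the first start takes 15-20 seconds to load the model weights; once the status changes to "Started", the service is ready

Verify that it works:

After deployment, open http://<your-instance-IP>:7860 in a browser. You will see a minimal test page. Run through the following check:

1. Upload a WAV audio file on the page (a phone recording converted to WAV is fine)
2. Set the language to "auto" (automatic detection) or "zh" (Chinese)
3. Click the "Start recognition" button
4. Wait 1-3 seconds and check whether the transcript appears on the right

If you see output like the following, the ASR model is working:

Recognition result
━━━━━━━━━━━━━━━━━━━
Detected language: Chinese
Transcript: The weather is lovely today, let's go out for a walk
━━━━━━━━━━━━━━━━━━━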

2.2 Calling the ASR Service Through the API

The web page is convenient, but automation needs the API. Once the model is running, it exposes an API service on port 7861 in addition to the web interface on port 7860.

Python example:

    import requests


    def transcribe_audio(audio_file_path, language="auto"):
        """
        Call the Qwen3-ASR-1.7B API to transcribe speech.

        Args:
            audio_file_path: path to a WAV audio file
            language: language code, one of "zh", "en", "ja", "ko", "yue", "auto"

        Returns:
            The transcribed text, or None on failure.
        """
        # API address (assumes the service runs locally on port 7861)
        api_url = "http://localhost:7861/transcribe"

        try:
            # Open the file inside the try block so a missing file is reported too
            with open(audio_file_path, 'rb') as audio:
                response = requests.post(
                    api_url,
                    files={'audio': audio},
                    data={'language': language},
                    timeout=30,
                )
            if response.status_code == 200:
                return response.json().get('text', '')
            print(f"Request failed, status code: {response.status_code}")
            return None
        except Exception as e:
            print(f"Error calling the API: {e}")
            return None


    # Usage example
    if __name__ == "__main__":
        # Transcribe Chinese audio
        text = transcribe_audio("meeting_chinese.wav", language="zh")
        print(f"Transcript: {text}")

        # Detect the language automatically
        text_auto = transcribe_audio("presentation.wav", language="auto")
        print(f"Auto-detected transcript: {text_auto}")

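Key parameters:

Audio format: must be WAV, mono; a 16 kHz sample rate is recommended
Language:
- zh: Chinese (Mandarin)
- en: English
- ja: Japanese
- ko: Korean
- yue: Cantonese
- auto: automatic detection (recommended)
File size: keep each file under 50 MB and under 5 minutes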
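Checking a file against these limits before uploading saves a failed request. The sketch below uses only Python's standard wave module; the function name check_wav and its default thresholds are illustrative choices that mirror the list above, not something the ASR service provides.

```python
import os
import wave


def check_wav(path: str, max_mb: float = 50, max_seconds: float = 300) -> list:
    """Return a list of problems found; an empty list means the file looks fine."""
    problems = []
    if os.path.getsize(path) > max_mb * 1024 * 1024:
        problems.append(f"file is larger than {max_mb} MB")
    try:
        with wave.open(path, "rb") as wav:
            channels = wav.getnchannels()
            rate = wav.getframerate()
            seconds = wav.getnframes() / rate
    except (wave.Error, EOFError) as exc:
        # The wave module only reads uncompressed PCM WAV files
        return problems + [f"not a readable PCM WAV file: {exc}"]
    if channels != 1:
        problems.append(f"expected mono, found {channels} channels")
    if rate != 16000:
        problems.append(f"sample rate is {rate} Hz; 16000 Hz is recommended")
    if seconds > max_seconds:
        problems.append(f"duration is {seconds:.1f} s, over {max_seconds} s")
    return problems
```

A typical use is to call check_wav(path) first and only send the file when the returned list is empty; otherwise convert or split the audio.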

2.3 Handling Common Problems

In practice you may run into the following problems.

Problem 1: the audio format is not supported

    # Solution: convert the format with the pydub library
    from pydub import AudioSegment


    def convert_to_wav(input_file, output_file="converted.wav"):
        """
        Convert common audio formats to WAV.
        Supported: mp3, m4a, flac, ogg, etc. (pydub needs ffmpeg to decode them)
        """
        audio = AudioSegment.from_file(input_file)
        audio = audio.set_channels(1)        # downmix to mono
        audio = audio.set_frame_rate(16000)  # resample to 16 kHz
        audio.export(output_file, format="wav")
        return output_file


    # Usage example
    wav_file = convert_to_wav("recording.mp3")
    text = transcribe_audio(wav_file)

Problem 2: long audio

    import os
    from pydub import AudioSegment


    def split_long_audio(audio_file, segment_duration=300):
        """
        Split long audio into shorter pieces (5 minutes each by default).

        Args:
            audio_file: path to the audio file
            segment_duration: length of each piece in seconds

        Returns:
            List of paths to the pieces.
        """
        audio = AudioSegment.from_wav(audio_file)
        duration_ms = len(audio)  # pydub reports length in milliseconds
        segment_ms = segment_duration * 1000

        output_files = []
        base_name = os.path.splitext(audio_file)[0]

        for i in range(0, duration_ms, segment_ms):
            segment = audio[i:i + segment_ms]
            if len(segment) < 1000:  # skip pieces shorter than 1 second
                continue
            output_file = f"{base_name}_part{i // segment_ms + 1}.wav"
            segment.export(output_file, format="wav")
            output_files.append(output_file)

        return output_files


    # Usage example: a 30-minute meeting recording, 3 minutes per piece
    segments = split_long_audio("long_meeting.wav", segment_duration=180)
    for segment in segments:
        text = transcribe_audio(segment)
        print(f"Segment transcript: {text}")
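When audio is split like this, it helps to know where each piece starts in the original recording, because timestamps produced for a piece begin at zero. The following sketch computes the piece boundaries as plain numbers, with no audio library involved. The helper name segment_bounds is hypothetical.

```python
def segment_bounds(duration_ms: int, segment_ms: int, min_ms: int = 1000) -> list:
    """Return (start_ms, end_ms) pairs covering the audio, dropping a tail shorter than min_ms."""
    bounds = []
    for start in range(0, duration_ms, segment_ms):
        end = min(start + segment_ms, duration_ms)
        if end - start >= min_ms:
            bounds.append((start, end))
    return bounds


# A 7.5-minute recording cut into 3-minute pieces
for start, end in segment_bounds(450_000, 180_000):
    print(f"{start / 1000:.0f}s -> {end / 1000:.0f}s")
```

Each start value is the offset to add back to a piece's timestamps when results are merged later.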

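3. Step Two: Deploy the Qwen3-ForcedAligner-0.6B Timestamp Alignment Model

3.1 What the Alignment Model Does and How to Deploy It

Qwen3-ASR-1.7B tells us what was said; Qwen3-ForcedAligner-0.6B tells us when. The aligner analyzes the audio waveform together with the transcript and assigns a timestamp to every word and every sentence.

Deploying the alignment model:

Find the image: look for the image named ins-aligner-qwen3-0.6b-v1
Deploy and start: as with the ASR model, click Deploy and wait for startup
Access the service: the alignment model usually runs on port 7862 (check the image documentation for the actual port)

Core features of the alignment model:

Word-level alignment: start and end time of each word
Sentence-level alignment: time range of each sentence
Multilingual support: the same languages as the ASR model
High precision: timestamps at millisecond resolution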

3.2 Calling the Alignment Model API

The alignment model takes two inputs, the audio file and its transcript, and returns the text with timestamps.

Python example:

    import requests


    def align_audio_text(audio_file_path, text, language="zh"):
        """
        Call the alignment model to obtain timestamps.

        Args:
            audio_file_path: path to a WAV audio file
            text: transcript (from the ASR model)
            language: language code

        Returns:
            Timestamped text data, or None on failure.
        """
        # Alignment API address (assumes the service runs on port 7862)
        align_api_url = "http://localhost:7862/align"

        try:
            with open(audio_file_path, 'rb') as audio:
                response = requests.post(
                    align_api_url,
                    files={'audio': audio},
                    data={'text': text, 'language': language},
                    timeout=30,
                )
            if response.status_code == 200:
                return response.json()
            print(f"Alignment failed, status code: {response.status_code}")
            return None
        except Exception as e:
            print(f"Error during alignment: {e}")
            return None


    # Usage example
    if __name__ == "__main__":
        # Get the transcript first
        text = transcribe_audio("sample.wav", language="zh")

        if text:
            # Then align it against the audio
            aligned_result = align_audio_text("sample.wav", text, language="zh")

            if aligned_result:
                print("Alignment result:")
                for word_info in aligned_result.get('words', []):
                    print(f"word: {word_info['word']}, "
                          f"start: {word_info['start']:.3f}s, "
                          f"end: {word_info['end']:.3f}s")
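Before building subtitles from an alignment response, a quick sanity check can catch malformed entries. The sketch below assumes the response shape used by the example above: a words list whose items carry word, start, and end, and a sentences list whose items carry text, start, and end. That shape is taken from this tutorial's code; verify it against your image's documentation. The function name validate_alignment is illustrative.

```python
def validate_alignment(result: dict) -> list:
    """Return a list of problems in an alignment result; an empty list means it passed."""
    problems = []
    for key, label in (("words", "word"), ("sentences", "text")):
        previous_end = 0.0
        for i, item in enumerate(result.get(key, [])):
            if label not in item:
                problems.append(f"{key}[{i}] has no '{label}' field")
            start, end = item.get("start"), item.get("end")
            if not isinstance(start, (int, float)) or not isinstance(end, (int, float)):
                problems.append(f"{key}[{i}] has a non-numeric start or end")
                continue
            if end < start:
                problems.append(f"{key}[{i}] ends before it starts")
            if start < previous_end:
                problems.append(f"{key}[{i}] overlaps the previous entry")
            previous_end = end
    return problems
```

An empty return value means every entry has its text field, numeric times, and a non-decreasing timeline.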

3.3 Parsing the Alignment Result

The alignment result can be rendered into several formats for different uses:

    def parse_alignment_result(result):
        """
        Parse an alignment result and produce several output formats.

        Args:
            result: JSON data returned by the alignment model

        Returns:
            Timestamped text in several formats, or None if result is empty.
        """
        if not result:
            return None

        words = result.get('words', [])
        sentences = result.get('sentences', [])

        # Format 1: SRT subtitles (the most common)
        srt_output = []
        for i, sentence in enumerate(sentences, 1):
            start_time = format_timestamp(sentence['start'])
            end_time = format_timestamp(sentence['end'])
            srt_output.append(f"{i}\n{start_time} --> {end_time}\n{sentence['text']}\n")

        # Format 2: simple timestamped lines
        simple_output = [
            f"[{s['start']:.2f}-{s['end']:.2f}s] {s['text']}" for s in sentences
        ]

        # Format 3: word-level detail (for speech analysis)
        word_output = [
            f"{w['word']}({w['start']:.3f}-{w['end']:.3f})" for w in words
        ]

        return {
            'srt': '\n'.join(srt_output),
            'simple': '\n'.join(simple_output),
            'words': ' '.join(word_output)
        }


    def format_timestamp(seconds):
        """Format seconds as an SRT timestamp: HH:MM:SS,mmm"""
        # Integer milliseconds avoid "00:00:60,000" when the seconds field rounds up
        total_ms = int(round(seconds * 1000))
        hours, rem = divmod(total_ms, 3600000)
        minutes, rem = divmod(rem, 60000)
        secs, ms = divmod(rem, 1000)
        return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

4. Step Three: Build the Complete Speech-Processing Pipeline

Now we chain the two models into one workflow. It can process a single file or a batch of files.

4.1 Complete Processing Script

    import os
    import json
    import tempfile
    import requests
    from typing import Dict, List, Optional
    from dataclasses import dataclass
    from pathlib import Path


    @dataclass
    class ProcessingConfig:
        """Processing configuration."""
        asr_api_url: str = "http://localhost:7861/transcribe"
        align_api_url: str = "http://localhost:7862/align"
        language: str = "auto"
        output_format: str = "srt"       # srt, json, simple
        max_audio_duration: int = 300    # maximum audio duration (seconds)
        segment_long_audio: bool = True  # whether to split long audio


    class AudioProcessor:
        """Main audio-processing class."""

        def __init__(self, config: ProcessingConfig):
            self.config = config
            self.supported_formats = ['.wav']

        def process_audio_file(self, audio_path: str) -> Dict:
            """Process one audio file and return a dict with all results."""
            print(f"Processing file: {audio_path}")

            # 1. Check the file format
            if not self._check_audio_format(audio_path):
                converted_path = self._convert_audio_format(audio_path)
                if not converted_path:
                    return {"error": "Unsupported audio format and conversion failed"}
                audio_path = converted_path

            # 2. Check the audio duration
            duration = self._get_audio_duration(audio_path)
            if duration > self.config.max_audio_duration and self.config.segment_long_audio:
                print(f"Audio is too long ({duration:.1f} s); splitting it")
                return self._process_long_audio(audio_path, duration)

            return self._process_single(audio_path, duration)

        def _process_single(self, audio_path: str, duration: float,
                            write_output: bool = True) -> Dict:
            """Recognize, align, and optionally format one short audio file."""
            # 3. Speech recognition
            print("Running speech recognition...")
            transcription = self._transcribe(audio_path)
            if not transcription or 'text' not in transcription:
                return {"error": "Speech recognition failed", "file": audio_path}

            # 4. Timestamp alignment
            print("Running timestamp alignment...")
            alignment = self._align(audio_path, transcription['text'])
            if not alignment:
                return {
                    "file": audio_path,
                    "transcription": transcription['text'],
                    "alignment": None,
                    "warning": "Timestamp alignment failed; only the transcript is available"
                }

            # 5. Format the output (skipped for pieces of a long file,
            #    which are formatted once after merging)
            formatted_output = None
            if write_output:
                formatted_output = self._format_output(
                    transcription['text'], alignment, audio_path
                )

            return {
                "file": audio_path,
                "duration": duration,
                "language": transcription.get('language', 'unknown'),
                "transcription": transcription['text'],
                "alignment": alignment,
                "formatted": formatted_output,
                "success": True
            }

        def _transcribe(self, audio_path: str) -> Optional[Dict]:
            """Call the ASR model."""
            try:
                with open(audio_path, 'rb') as f:
                    response = requests.post(
                        self.config.asr_api_url,
                        files={'audio': f},
                        data={'language': self.config.language},
                        timeout=30
                    )
                if response.status_code == 200:
                    return response.json()
                print(f"ASR request failed: {response.status_code}")
                return None
            except Exception as e:
                print(f"Error during transcription: {e}")
                return None

        def _align(self, audio_path: str, text: str) -> Optional[Dict]:
            """Call the alignment model to obtain timestamps."""
            try:
                with open(audio_path, 'rb') as f:
                    response = requests.post(
                        self.config.align_api_url,
                        files={'audio': f},
                        data={'text': text, 'language': self.config.language},
                        timeout=30
                    )
                if response.status_code == 200:
                    return response.json()
                print(f"Alignment request failed: {response.status_code}")
                return None
            except Exception as e:
                print(f"Error during alignment: {e}")
                return None

        def _format_output(self, text: str, alignment: Dict, audio_path: str) -> Dict:
            """Format the result according to the configuration."""
            base_name = Path(audio_path).stem

            if self.config.output_format == "srt":
                return self._create_srt(alignment, base_name)
            elif self.config.output_format == "json":
                return {
                    "text": text,
                    "alignment": alignment,
                    "metadata": {
                        "file": audio_path,
                        "language": self.config.language
                    }
                }
            else:  # simple format
                return self._create_simple_text(alignment)

        def _create_srt(self, alignment: Dict, base_name: str) -> Dict:
            """Generate SRT subtitles and save them to <base_name>.srt."""
            sentences = alignment.get('sentences', [])
            srt_content = []

            for i, sentence in enumerate(sentences, 1):
                start = self._seconds_to_srt_time(sentence['start'])
                end = self._seconds_to_srt_time(sentence['end'])
                srt_content.append(f"{i}\n{start} --> {end}\n{sentence['text']}\n")

            srt_text = '\n'.join(srt_content)

            # Save to a file
            output_file = f"{base_name}.srt"
            with open(output_file, 'w', encoding='utf-8') as f:
                f.write(srt_text)

            return {
                "format": "srt",
                "content": srt_text,
                "file": output_file,
                "sentence_count": len(sentences)
            }

        def _seconds_to_srt_time(self, seconds: float) -> str:
            """Convert seconds to the SRT time format HH:MM:SS,mmm."""
            # Integer milliseconds avoid "00:00:60,000" when the seconds field rounds up
            total_ms = int(round(seconds * 1000))
            hours, rem = divmod(total_ms, 3600000)
            minutes, rem = divmod(rem, 60000)
            secs, ms = divmod(rem, 1000)
            return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

        def _create_simple_text(self, alignment: Dict) -> Dict:
            """Generate the simple text format."""
            sentences = alignment.get('sentences', [])
            lines = [
                f"[{s['start']:.2f}-{s['end']:.2f}] {s['text']}" for s in sentences
            ]
            return {
                "format": "simple",
                "content": '\n'.join(lines),
                "sentence_count": len(sentences)
            }

        def _check_audio_format(self, audio_path: str) -> bool:
            """Check whether the audio format is supported."""
            return Path(audio_path).suffix.lower() in self.supported_formats

        def _convert_audio_format(self, audio_path: str) -> Optional[str]:
            """Convert the audio to WAV."""
            try:
                from pydub import AudioSegment

                output_path = Path(audio_path).with_suffix('.wav')
                audio = AudioSegment.from_file(audio_path)
                audio = audio.set_channels(1).set_frame_rate(16000)
                audio.export(output_path, format='wav')

                print(f"Converted: {audio_path} -> {output_path}")
                return str(output_path)
            except Exception as e:
                print(f"Format conversion failed: {e}")
                return None

        def _get_audio_duration(self, audio_path: str) -> float:
            """Return the audio duration in seconds (0.0 if it cannot be read)."""
            try:
                from pydub import AudioSegment
                audio = AudioSegment.from_file(audio_path)
                return len(audio) / 1000.0  # milliseconds to seconds
            except Exception:
                return 0.0

        def _process_long_audio(self, audio_path: str, duration: float) -> Dict:
            """Handle long audio: split it, then process each piece."""
            from pydub import AudioSegment

            audio = AudioSegment.from_wav(audio_path)
            segment_length = self.config.max_audio_duration * 1000  # milliseconds
            starts = list(range(0, len(audio), segment_length))
            all_results = []

            # The temporary pieces are deleted when the block exits
            with tempfile.TemporaryDirectory() as temp_dir:
                for i, start_ms in enumerate(starts):
                    end_ms = min(start_ms + segment_length, len(audio))
                    if end_ms - start_ms < 1000:  # skip pieces shorter than 1 second
                        continue

                    segment_path = os.path.join(temp_dir, f"segment_{i+1}.wav")
                    audio[start_ms:end_ms].export(segment_path, format='wav')

                    print(f"Processing segment {i+1}/{len(starts)}")
                    result = self._process_single(
                        segment_path, (end_ms - start_ms) / 1000.0, write_output=False
                    )
                    # Remember where this piece starts in the original file
                    result['offset'] = start_ms / 1000.0
                    all_results.append(result)

            return self._merge_segment_results(all_results, audio_path, duration)

        def _merge_segment_results(self, segment_results: List[Dict],
                                   original_file: str, duration: float) -> Dict:
            """Merge the per-segment results."""
            successful = [r for r in segment_results if r.get('success', False)]
            if not successful:
                return {"error": "All segments failed", "file": original_file}

            # Merge the transcripts
            full_text = ' '.join(r.get('transcription', '') for r in successful)

            # Merge the alignments (timestamps are shifted by each piece's offset)
            merged_alignment = self._merge_alignments(successful)

            return {
                "file": original_file,
                "duration": duration,
                "transcription": full_text,
                "alignment": merged_alignment,
                "formatted": self._format_output(full_text, merged_alignment, original_file),
                "segments": segment_results,
                "segment_count": len(segment_results),
                "success": True
            }

        def _merge_alignments(self, segment_results: List[Dict]) -> Dict:
            """Merge the alignments of several pieces onto the original timeline."""
            merged = {'words': [], 'sentences': []}

            for result in segment_results:
                if not result.get('success', False):
                    continue
                # Use the piece's real start position, not the end of the last
                # sentence, so trailing silence does not shift later pieces
                offset = result.get('offset', 0.0)
                alignment = result.get('alignment') or {}
                for key in ('words', 'sentences'):
                    for item in alignment.get(key, []):
                        merged[key].append({
                            **item,
                            'start': item['start'] + offset,
                            'end': item['end'] + offset
                        })

            return merged


    # Usage example
    if __name__ == "__main__":
        # Configure the processing parameters
        config = ProcessingConfig(
            asr_api_url="http://localhost:7861/transcribe",
            align_api_url="http://localhost:7862/align",
            language="auto",
            output_format="srt",
            max_audio_duration=180,  # 3 minutes
            segment_long_audio=True
        )

        # Create the processor
        processor = AudioProcessor(config)

        # Process a single file
        result = processor.process_audio_file("meeting_recording.wav")

        if result.get('success', False):
            print("Processing succeeded!")
            print(f"Transcript: {result['transcription'][:100]}...")  # first 100 characters
            print(f"Output file: {result['formatted'].get('file', 'N/A')}")
        else:
            print(f"Processing failed: {result.get('error', 'unknown error')}")
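Timestamps returned for a piece of a long recording start at zero, so each piece needs its start position in the original file added back. Using the end of the last recognized sentence as the offset drifts whenever a piece ends in silence. The sketch below shows the shift on toy data; shift_alignment and merge_segments are illustrative names, and the words/sentences layout follows this tutorial's examples.

```python
def shift_alignment(alignment: dict, offset: float) -> dict:
    """Return a copy of the alignment with every start/end moved by offset seconds."""
    return {
        key: [
            {**item, "start": item["start"] + offset, "end": item["end"] + offset}
            for item in alignment.get(key, [])
        ]
        for key in ("words", "sentences")
    }


def merge_segments(segments: list) -> dict:
    """Merge (offset_seconds, alignment) pairs into one alignment on the original timeline."""
    merged = {"words": [], "sentences": []}
    for offset, alignment in segments:
        shifted = shift_alignment(alignment, offset)
        merged["words"].extend(shifted["words"])
        merged["sentences"].extend(shifted["sentences"])
    return merged


# Two 180-second pieces: the second piece's timestamps are shifted by 180 s
first = {"sentences": [{"text": "Opening remarks", "start": 1.0, "end": 2.5}]}
second = {"sentences": [{"text": "Next topic", "start": 0.5, "end": 3.0}]}
for sentence in merge_segments([(0.0, first), (180.0, second)])["sentences"]:
    print(sentence)
```

Building new dictionaries instead of editing the originals keeps the per-piece results intact in case they are needed for debugging.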

4.2 Batch Processing and Automation

For workloads with many audio files, the script above can be extended. The code below reuses its imports (os, json, Path, List, Dict) and classes:

    import glob
    from tqdm import tqdm  # progress bar library


    def batch_process_audio_files(input_folder: str, output_folder: str, config: ProcessingConfig):
        """
        Process every audio file in a folder.

        Args:
            input_folder: folder containing the audio files
            output_folder: folder where results are saved
            config: processing configuration
        """
        # Create the output folder
        os.makedirs(output_folder, exist_ok=True)

        # Collect all audio files
        audio_files = []
        for pattern in ['*.wav', '*.mp3', '*.m4a']:
            audio_files.extend(glob.glob(os.path.join(input_folder, pattern)))

        print(f"Found {len(audio_files)} audio files")

        # Create the processor
        processor = AudioProcessor(config)

        # Process each file
        results = []
        for audio_file in tqdm(audio_files, desc="Progress"):
            try:
                result = processor.process_audio_file(audio_file)

                # Save the result
                output_file = os.path.join(
                    output_folder,
                    f"{Path(audio_file).stem}_result.json"
                )
                with open(output_file, 'w', encoding='utf-8') as f:
                    json.dump(result, f, ensure_ascii=False, indent=2)

                results.append({
                    'file': audio_file,
                    'success': result.get('success', False),
                    'output': output_file
                })
            except Exception as e:
                print(f"Error while processing {audio_file}: {e}")
                results.append({
                    'file': audio_file,
                    'success': False,
                    'error': str(e)
                })

        # Generate the processing report
        generate_report(results, output_folder)
        return results


    def generate_report(results: List[Dict], output_folder: str):
        """Generate the processing report."""
        total = len(results)
        success = sum(1 for r in results if r.get('success', False))
        failed = total - success
        success_rate = success / total if total > 0 else 0  # avoid dividing by zero

        report = {
            "summary": {
                "total_files": total,
                "successful": success,
                "failed": failed,
                "success_rate": success_rate
            },
            "details": results
        }

        report_file = os.path.join(output_folder, "processing_report.json")
        with open(report_file, 'w', encoding='utf-8') as f:
            json.dump(report, f, ensure_ascii=False, indent=2)

        print("\nProcessing complete!")
        print(f"Total: {total} files")
        print(f"Succeeded: {success} ({success_rate * 100:.1f}%)")
        print(f"Failed: {failed}")
        print(f"Detailed report saved to: {report_file}")

5. Application Scenarios and Optimization Tips

5.1 Typical Application Scenarios

Scenario 1: automatic video subtitle generation

    # Configuration tuned for video subtitles
    video_config = ProcessingConfig(
        language="auto",
        output_format="srt",
        max_audio_duration=300,
        segment_long_audio=True
    )


    # Subtitle-specific processing: merge short sentences, split long ones
    def optimize_for_subtitles(alignment_result):
        """Adjust an alignment result so it displays better as subtitles."""
        sentences = alignment_result.get('sentences', [])
        optimized = []

        current_text = ""
        current_start = 0
        current_end = 0

        for sentence in sentences:
            sentence_text = sentence['text']
            sentence_duration = sentence['end'] - sentence['start']

            # Consecutive short sentences are accumulated into one entry
            if len(sentence_text) < 10 and sentence_duration < 2.0:
                if not current_text:
                    current_start = sentence['start']
                current_text += " " + sentence_text
                current_end = sentence['end']
            else:
                # Flush any accumulated short sentences first
                if current_text:
                    optimized.append({
                        'text': current_text.strip(),
                        'start': current_start,
                        'end': current_end
                    })
                    current_text = ""

                if len(sentence_text) > 100:
                    # Split long sentences at punctuation.
                    # split_long_sentence is not defined in this tutorial;
                    # you need to supply your own implementation.
                    parts = split_long_sentence(sentence_text, sentence['start'], sentence['end'])
                    optimized.extend(parts)
                else:
                    optimized.append(sentence)

        # Flush the short sentences accumulated at the end
        if current_text:
            optimized.append({
                'text': current_text.strip(),
                'start': current_start,
                'end': current_end
            })

        return {'sentences': optimized}
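The subtitle function above calls split_long_sentence without defining it. One possible version is sketched below. It splits at Chinese and Western punctuation and shares the sentence's time range between the parts in proportion to their character counts. That proportional split is only an approximation and an assumption of this sketch; word-level timestamps from the aligner would give more accurate boundaries, and a period inside a number such as 3.5 would also be treated as a split point.

```python
import re

# Punctuation that ends a clause (Chinese and Western)
_CLAUSE = re.compile(r"[^，。！？；,.!?;]+[，。！？；,.!?;]*")


def split_long_sentence(text: str, start: float, end: float) -> list:
    """Split text at punctuation and share the time range by character count."""
    # Each match keeps its trailing punctuation attached to the clause
    parts = [p.strip() for p in _CLAUSE.findall(text)]
    parts = [p for p in parts if p]
    if len(parts) <= 1:
        return [{"text": text, "start": start, "end": end}]

    total_chars = sum(len(p) for p in parts)
    duration = end - start
    result, cursor = [], start
    for part in parts:
        part_end = cursor + duration * len(part) / total_chars
        result.append({"text": part, "start": cursor, "end": part_end})
        cursor = part_end
    result[-1]["end"] = end  # absorb floating-point drift
    return result
```

The returned entries use the same text/start/end keys as the sentence entries, so they can be appended to the optimized list directly.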

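Scenario 2: automatic meeting minutes

    def generate_meeting_minutes(audio_file, participants=None):
        """
        Generate structured meeting minutes.
        Includes: speaking timeline, key topics, action items.

        The helpers segment_by_time, detect_topic_changes, extract_action_items,
        and generate_summary are not defined in this tutorial; you need to
        supply your own implementations.
        """
        # 1. Get the timestamped transcript (uses the AudioProcessor class from section 4.1)
        processor = AudioProcessor(ProcessingConfig())
        result = processor.process_audio_file(audio_file)

        if not result.get('success', False):
            return None

        # 2. Extract the key information
        sentences = result['alignment'].get('sentences', [])

        # Group by time (one group per 5 minutes)
        segments = segment_by_time(sentences, interval=300)

        # 3. Detect topic changes (keyword based)
        topics = detect_topic_changes(sentences)

        # 4. Extract action items (sentences containing words such as "需要", "安排", "负责")
        action_items = extract_action_items(sentences)

        # 5. Build the structured minutes
        minutes = {
            "meeting_info": {
                "duration": result.get('duration', 0),
                "language": result.get('language', 'unknown'),
                "total_sentences": len(sentences)
            },
            "transcription": result['transcription'],
            "timeline": segments,
            "topics": topics,
            "action_items": action_items,
            "summary": generate_summary(sentences)
        }

        return minutes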
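generate_meeting_minutes relies on helpers such as segment_by_time and extract_action_items that are not shown. The sketch below gives one simple way to write those two, assuming each sentence is a dictionary with text, start, and end keys as in the alignment examples. The keyword list comes from the comment in the function above; everything else is an illustrative choice, and plain keyword matching will miss action items that are phrased differently.

```python
def segment_by_time(sentences: list, interval: float = 300) -> list:
    """Group sentences into consecutive windows of `interval` seconds by start time."""
    buckets = {}
    for sentence in sentences:
        index = int(sentence["start"] // interval)
        buckets.setdefault(index, []).append(sentence)
    return [
        {"start": index * interval, "end": (index + 1) * interval, "sentences": buckets[index]}
        for index in sorted(buckets)
    ]


# Keywords for "need to", "arrange", "be responsible for"
ACTION_KEYWORDS = ("需要", "安排", "负责")


def extract_action_items(sentences: list, keywords=ACTION_KEYWORDS) -> list:
    """Return the sentences whose text contains any of the keywords."""
    return [s for s in sentences if any(k in s["text"] for k in keywords)]
```

Windows with no speech are simply absent from the timeline, which keeps the minutes compact for meetings with long pauses.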

6. Performance Optimization and Best Practices

6.1 Resource Configuration

Running both models at the same time takes a fair amount of compute. Some suggestions:

    from typing import Dict, List


    # Resource estimation and batching
    class ResourceOptimizer:
        """Resource optimization helpers."""

        @staticmethod
        def estimate_resource_requirements(audio_duration: float,
                                           concurrent_tasks: int = 1) -> Dict:
            """
            Estimate the resources needed for processing.

            Args:
                audio_duration: total audio duration (seconds)
                concurrent_tasks: number of concurrent tasks

            Returns:
                Estimated resource requirements.
            """
            # ASR model GPU memory: 10-14 GB
            # Alignment model GPU memory: about 4-6 GB
            # Recommended total GPU memory: 16 GB or more

            estimated_time = audio_duration * 0.3  # real-time factor (RTF) = 0.3
            memory_per_task = 16  # GB (both models combined)

            return {
                "total_audio_duration": audio_duration,
                "estimated_processing_time": estimated_time,
                "recommended_gpu_memory": memory_per_task * concurrent_tasks,
                "recommended_system_memory": 32 * concurrent_tasks,  # GB
                "concurrent_tasks_supported": concurrent_tasks,
                "notes": "Estimated with RTF=0.3; actual time varies with audio content"
            }

        @staticmethod
        def optimize_batch_processing(file_list: List[str],
                                      available_memory: int) -> List[List[str]]:
            """
            Group files for batch processing.
            Files are grouped by available memory (GB) to avoid running out of memory.
            """
            # Assume each file needs 2 GB of memory to process (conservative estimate),
            # so a group holds at most available_memory // 2 files
            files_per_group = max(1, available_memory // 2)
            return [
                file_list[i:i + files_per_group]
                for i in range(0, len(file_list), files_per_group)
            ]
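The time estimate is simple arithmetic: processing time equals audio duration multiplied by the real-time factor (RTF). A worked example, using the RTF of 0.3 quoted above as an assumed value rather than a measured one:

```python
def estimate_processing_seconds(audio_seconds: float, rtf: float = 0.3) -> float:
    """Processing time = audio duration x real-time factor."""
    return audio_seconds * rtf


minutes = 30
estimate = estimate_processing_seconds(minutes * 60)
print(f"{minutes} min of audio -> about {estimate / 60:.0f} min of processing")
```

With these numbers, 1800 seconds of audio times 0.3 gives 540 seconds, or about 9 minutes. Measure the RTF on your own hardware before relying on the estimate for scheduling.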

6.2 Error Handling and Retries

In production, robust error handling is essential. The class below extends AudioProcessor from section 4.1:

    import glob
    import json
    import os
    import time
    from typing import Dict


    class RobustAudioProcessor(AudioProcessor):
        """Audio processor with retries and error recovery."""

        def __init__(self, config: ProcessingConfig, max_retries: int = 3):
            super().__init__(config)
            self.max_retries = max_retries
            self.error_log = []

        def process_with_retry(self, audio_path: str) -> Dict:
            """Process a file, retrying on failure."""
            last_error = "unknown error"

            for attempt in range(self.max_retries):
                try:
                    result = self.process_audio_file(audio_path)
                    if result.get('success', False):
                        return result
                    last_error = result.get('error', 'unknown error')
                except Exception as e:
                    last_error = str(e)

                self.error_log.append({
                    'file': audio_path,
                    'attempt': attempt + 1,
                    'error': last_error,
                    'timestamp': time.time()
                })

                if attempt < self.max_retries - 1:
                    print(f"Attempt {attempt + 1} failed; retrying...")
                    time.sleep(2 ** attempt)  # exponential backoff: 1 s, 2 s, 4 s, ...

            print(f"All {self.max_retries} attempts failed")
            return {
                'file': audio_path,
                'success': False,
                'error': f"Processing failed: {last_error}",
                'attempts': self.max_retries
            }

        def recover_partial_results(self, audio_path: str, temp_dir: str = None) -> Dict:
            """
            Try to recover partial results.
            Used to resume after processing was interrupted.
            """
            if not temp_dir:
                temp_dir = os.path.join(os.path.dirname(audio_path), ".temp")

            recovery_data = {
                'file': audio_path,
                'recovered': False,
                'partial_results': []
            }

            # Look for temporary result files
            if os.path.exists(temp_dir):
                for temp_file in glob.glob(os.path.join(temp_dir, "*.json")):
                    try:
                        with open(temp_file, 'r', encoding='utf-8') as f:
                            data = json.load(f)
                        if data.get('file', '').endswith(audio_path):
                            recovery_data['partial_results'].append(data)
                    except (OSError, json.JSONDecodeError):
                        continue  # skip unreadable or corrupt files

            if recovery_data['partial_results']:
                recovery_data['recovered'] = True
                recovery_data['message'] = (
                    f"Recovered {len(recovery_data['partial_results'])} partial results"
                )

            return recovery_data
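The retry pattern can also be written as a small standalone helper that works with any function, which makes it easy to test without a running service. This is a generic sketch, not part of either model's tooling; retry_with_backoff and its parameters are illustrative names. The sleep function is passed in so that tests can replace it with a recorder instead of actually waiting.

```python
import time


def retry_with_backoff(func, max_retries: int = 3, base_delay: float = 1.0, sleep=time.sleep):
    """Call func() until it succeeds, waiting base_delay * 2**attempt between failures."""
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    last_error = None
    for attempt in range(max_retries):
        try:
            return func()
        except Exception as exc:
            last_error = exc
            # Wait only if another attempt will follow
            if attempt < max_retries - 1:
                sleep(base_delay * 2 ** attempt)
    raise last_error
```

With the defaults, a call that keeps failing is attempted three times with waits of 1 and 2 seconds in between, and the last exception is re-raised so the caller can log it.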

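7. Summary

7.1 Key Takeaways

In this tutorial we built the complete workflow from speech recognition to timestamp alignment:

Deploying and using Qwen3-ASR-1.7B: we deployed this multilingual speech recognition model and called it through its API to turn audio into text. The key point is that it supports five languages (Mandarin Chinese, English, Japanese, Korean, and Cantonese) and can detect the language automatically.
Integrating Qwen3-ForcedAligner-0.6B: we deployed the timestamp alignment model and saw how it adds precise timing to a transcript, which is essential for subtitles and for analyzing speaking pace.
Building the complete pipeline: we chained the two models into an end-to-end system that accepts several audio formats, converts them automatically, handles long audio, and outputs results in multiple formats.
Applying it to real scenarios: for video subtitle generation and meeting minutes, we provided tuned configurations and dedicated functions.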