Marqo Speech Search System: Unlocking the Informational Value of Audio Content

news 2026/3/26 17:07:57

marqo: Vector search for humans. Also available on cloud - cloud.marqo.ai. Project address: https://gitcode.com/gh_mirrors/ma/marqo

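Introduction: The Value and Challenges of Speech Data

Speech is one of the most natural forms of human communication, and a great deal of valuable information worldwide exists as audio: video soundtracks, films, television programs, phone recordings, meeting recordings, and more. Retrieving and using this speech data is difficult, however, for three reasons:

High dimensionality: in the time domain audio is a waveform, and it needs a high sampling rate (typically 16 kHz to 40 kHz) to accurately reproduce sound as the human ear perceives it
Low information density: English is spoken at about 2.5 words per second on average, so each word takes thousands of floating-point values to represent
Hard to extract content: raw audio cannot be searched directly and must first be converted to text or another indexable form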
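
As a rough check on the density claim, the arithmetic can be worked through directly. This is a minimal sketch that assumes one float per sample and uses the two ends of the sampling range quoted above; the function name is illustrative only.

```python
def samples_per_word(sample_rate_hz: int, words_per_second: float) -> float:
    """Average number of audio samples (one float each) spanned by one spoken word."""
    return sample_rate_hz / words_per_second


if __name__ == "__main__":
    # 2.5 words per second is the average English speaking rate cited in the text.
    for rate in (16_000, 40_000):
        print(f"{rate} Hz: {samples_per_word(rate, 2.5):.0f} floats per word")
```

At 16 kHz a single word spans 6,400 samples, and at 40 kHz it spans 16,000, which is why raw waveforms are a poor unit for indexing.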

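System Architecture Overview

The Marqo speech search system turns audio into a searchable knowledge base through the following stages:

Data collection: obtain audio content from multiple sources
Speaker diarization: identify the different speakers and the time spans in which each one talks
Speech-to-text: convert the spoken content into indexable text
Index construction: build a search index with Marqo
Question answering: generate natural-language answers from the retrieved results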

Core Implementation in Detail

1. Data Collection and Preprocessing

The system collects audio from several kinds of sources:

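    import os

    # Assumption: ABS_FILE_FOLDER is the directory that contains this script.
    ABS_FILE_FOLDER = os.path.dirname(os.path.abspath(__file__))

    class AudioWrangler():
        def __init__(self, output_path: str, clean_up: bool = True):
            self.output_path = output_path
            self.tmp_dir = 'downloads'
            # Create the temporary download directory if it does not exist yet
            os.makedirs(os.path.join(ABS_FILE_FOLDER, self.tmp_dir), exist_ok=True)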

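Key features:

Audio extraction from YouTube videos (using the yt_dlp library)
Download of audio files from the web (multiple formats supported)
Batch downloads (parallel downloading supported)
Conversion of all inputs to WAV (using the Pydub library)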
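
The article does not show how the parallel batch download is implemented. One way it could be sketched is with a thread pool from the standard library; `download_all`, `fetch`, and `fake_fetch` are hypothetical names introduced for illustration and are not part of `AudioWrangler` as shown.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def download_all(
    urls: List[str],
    fetch: Callable[[str], str],
    max_workers: int = 4,
) -> Dict[str, str]:
    """Run `fetch` over every URL in parallel and map each URL to its result.

    `fetch` is a hypothetical callable that downloads one URL and returns
    the local path of the saved file.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each URL with its own path.
        paths = list(pool.map(fetch, urls))
    return dict(zip(urls, paths))


def fake_fetch(url: str) -> str:
    """Stand-in for a real downloader: derives a local WAV path from the URL."""
    name = url.rstrip("/").split("/")[-1].rsplit(".", 1)[0]
    return f"downloads/{name}.wav"
```

In a real pipeline `fetch` would wrap the yt_dlp or HTTP download plus the WAV conversion; threads suit this step because it is dominated by network waits rather than CPU work.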

2. Speaker Diarization and Speech Recognition

Speaker diarization identifies the time spans in which each speaker in the audio is talking:

    from typing import List, Set, Tuple

    # Method of the annotator class (excerpt)
    def annotate(self, file: str) -> List[Tuple[float, float, Set[str]]]:
        diarization = self.annotation_pipeline(file)
        speaker_times = []
        for t in diarization.get_timeline():
            start, end = t.start, t.end
            # Split long segments into chunks of at most 30 seconds
            while end - start > 0:
                speaker_times.append(
                    (start, min(start + 30, end), diarization.get_labels(t))
                )
                start += 30
        return speaker_times

Technical notes:

Uses Pyannote's speaker diarization model (requires a Hugging Face token)
Splits long audio segments automatically
Handles overlapping speakers
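
The chunking loop inside `annotate` can be isolated and checked without running a diarization model. The sketch below reproduces only that loop; `split_segment` and its `max_len` parameter are names introduced here for illustration.

```python
from typing import List, Tuple


def split_segment(
    start: float, end: float, max_len: float = 30.0
) -> List[Tuple[float, float]]:
    """Split [start, end) into consecutive chunks of at most max_len seconds."""
    chunks = []
    while end - start > 0:
        # The last chunk is clipped to the end of the segment.
        chunks.append((start, min(start + max_len, end)))
        start += max_len
    return chunks
```

For example, a 70-second segment starting at 0 yields three chunks: 0 to 30, 30 to 60, and 60 to 70. Segments of 30 seconds or less pass through as a single chunk.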

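Speech recognition uses Facebook's S2T model:

    from transformers import Speech2TextForConditionalGeneration

    # Inside the transcriber's __init__ (excerpt)
    self.transcription_model = Speech2TextForConditionalGeneration.from_pretrained(
        f"facebook/s2t-{self._model_size}-librispeech-asr"
    )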

3. Building the Marqo Index

The processed speech data is converted into structured documents and indexed:

    from typing import Any, Dict, List, Optional

    import marqo

    def index_transcriptions(
        annotated_transcriptions: List[Dict[str, Any]],
        index: str,
        mq: marqo.Client,
        tensor_fields: Optional[List[str]] = None,  # None avoids a shared mutable default
        device: str = "cpu",
        batch_size: int = 32,
    ) -> Dict[str, str]:
        tensor_fields = tensor_fields or []
        # Filter out invalid transcriptions
        annotated_transcriptions = [
            at
            for at in annotated_transcriptions
            if len(at["transcription"]) > 5 or len({*at["transcription"]}) > 4
        ]
        response = mq.index(index).add_documents(
            annotated_transcriptions,
            tensor_fields=tensor_fields,
            device=device,
            client_batch_size=batch_size,
        )
        return response

Indexed fields:

Speaker label (SPEAKER_00 and so on)
Start and end timestamps
Transcribed text
Path to the original audio file
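
A document carrying these fields might look like the sketch below. Only the `transcription` key appears in the article's code; the other key names (`speaker`, `start`, `end`, `file`) and both helper functions are assumptions made for illustration. `is_meaningful` restates the filter used in `index_transcriptions`.

```python
from typing import Any, Dict


def is_meaningful(text: str) -> bool:
    """Keep text with more than 5 characters or more than 4 distinct characters."""
    return len(text) > 5 or len(set(text)) > 4


def to_document(
    speaker: str, start: float, end: float, text: str, file: str
) -> Dict[str, Any]:
    """Build one indexable document; key names other than 'transcription' are hypothetical."""
    return {
        "speaker": speaker,
        "start": start,
        "end": end,
        "transcription": text,
        "file": file,
    }


doc = to_document("SPEAKER_00", 0.0, 12.5, "welcome to the meeting", "audio/meeting.wav")
```

Very short outputs such as "uh" fail the check and are dropped before indexing, which keeps near-empty transcriptions out of the search results.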

4. Question Answering

Retrieved results are combined with a language model to generate natural-language answers:

    TEMPLATE = """
    You are a question answerer, given the CONTEXT provided you will answer the QUESTION...

    CONTEXT:
    =========
    {context}
    QUESTION:
    =========
    {question}
    """
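
To show how retrieved hits could be folded into such a prompt, here is a minimal sketch. `PROMPT` is an illustrative stand-in rather than the article's template (whose instruction text is elided), `build_prompt` is a hypothetical helper, and the `speaker` key on each hit is an assumed field name. Marqo search responses carry their results in a `hits` list, so `hits` here is that list.

```python
from typing import Any, Dict, List

# Illustrative stand-in; the article's own template is elided and not reproduced.
PROMPT = "CONTEXT:\n=========\n{context}\nQUESTION:\n=========\n{question}\n"


def build_prompt(question: str, hits: List[Dict[str, Any]], limit: int = 3) -> str:
    """Join the top transcriptions into a context block and fill the prompt."""
    context = "\n".join(
        f"{hit['speaker']}: {hit['transcription']}" for hit in hits[:limit]
    )
    return PROMPT.format(context=context, question=question)


example_hits = [
    {"speaker": "SPEAKER_00", "transcription": "the launch is planned for May"},
    {"speaker": "SPEAKER_01", "transcription": "we still need the final budget"},
]
prompt = build_prompt("When is the launch?", example_hits)
```

Capping the context at a few hits keeps the prompt short; the resulting string would then be passed to whichever language model the system uses.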

Question-answering flow: