
LLM Multimodal Development

news 2026/8/3 23:21:21

Converting between images and text; TTS, ASR, and OCR

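TTS (Text-to-Speech): technology that converts written text into spoken audio.

ASR (Automatic Speech Recognition): technology that converts a speech signal into text.

OCR (Optical Character Recognition): technology that converts the text in images or scanned documents into editable text.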

TTS

OpenAI's tts-1 model is optimized for the speed of audio generation:

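from openai import OpenAI

# The client reads the API key from the OPENAI_API_KEY environment variable
client = OpenAI()

speech_file_path = "AI_speech.mp3"

# Request speech synthesis; "xxx" stands in for the text to be spoken
response = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="xxx"
)

# Write the returned audio to an MP3 file
response.stream_to_file(speech_file_path)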

The tts-1-hd model, by contrast, is optimized for audio quality.
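The choice between the two models can be wrapped in a small helper. This is a minimal sketch rather than an official pattern: the names `pick_tts_model` and `synthesize` and the `prefer_quality` flag are hypothetical and introduced only for illustration, while the `client.audio.speech.create` call and the two model names are the ones used above.

```python
def pick_tts_model(prefer_quality: bool = False) -> str:
    """Return "tts-1" when speed matters, "tts-1-hd" when quality matters."""
    return "tts-1-hd" if prefer_quality else "tts-1"


def synthesize(client, text: str, path: str, prefer_quality: bool = False) -> str:
    """Generate speech for `text` and save it to `path`.

    `client` is assumed to be an openai.OpenAI() instance (hypothetical helper,
    not part of the OpenAI SDK).
    """
    response = client.audio.speech.create(
        model=pick_tts_model(prefer_quality),
        voice="alloy",
        input=text,
    )
    # Same file-writing call as in the example above
    response.stream_to_file(path)
    return path
```

With a real client, a call such as `synthesize(OpenAI(), "xxx", "AI_speech_hd.mp3", prefer_quality=True)` would request the higher-quality model, at the cost of slower generation.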

ASR

Automatic speech recognition (ASR) is another field that has benefited from the development of large language models.

# Import the required libraries
import os
import base64  # frame encoding

import cv2  # video processing
# Audio processing; this is the moviepy 1.x import path
# (moviepy 2.x uses `from moviepy import VideoFileClip`)
from moviepy.editor import VideoFileClip

VIDEO_FILE = "Good_Driver.mp4"


def extract_frames_and_audio(video_file, interval=2):
    encoded_frames = []
    file_name, _ = os.path.splitext(video_file)

    video_capture = cv2.VideoCapture(video_file)
    total_frame_count = int(video_capture.get(cv2.CAP_PROP_FRAME_COUNT))
    frame_rate = video_capture.get(cv2.CAP_PROP_FPS)
    # Number of frames to skip per sample; at least 1 so the loop always advances
    frames_interval = max(1, int(frame_rate * interval))
    current_frame = 0

    # Loop through the video and extract frames at the specified sampling rate
    while current_frame < total_frame_count - 1:
        video_capture.set(cv2.CAP_PROP_POS_FRAMES, current_frame)
        success, frame = video_capture.read()
        if not success:
            break
        # Encode the frame as JPEG, then as a base64 string
        _, buffer = cv2.imencode(".jpg", frame)
        encoded_frames.append(base64.b64encode(buffer).decode("utf-8"))
        current_frame += frames_interval
    video_capture.release()

    # Extract the audio track from the video
    audio_output = f"{file_name}.mp3"
    video_clip = VideoFileClip(video_file)
    video_clip.audio.write_audiofile(audio_output, bitrate="32k")
    video_clip.audio.close()
    video_clip.close()

    print(f"Extracted {len(encoded_frames)} frames")
    print(f"Audio extracted to {audio_output}")
    return encoded_frames, audio_output


# Extract one frame every 2 seconds (sampling rate)
encoded_frames, audio_output = extract_frames_and_audio(VIDEO_FILE, interval=2)
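The extracted audio file is the natural input to the recognition step itself. As a hedged sketch of that step, the file can be sent to OpenAI's audio transcription endpoint. Two things here are assumptions not stated in the text above: the `whisper-1` model name, and the helper `transcribe_audio`, which is a hypothetical name introduced for illustration.

```python
def transcribe_audio(client, audio_path: str, model: str = "whisper-1") -> str:
    """Send an audio file to the transcription endpoint and return its text.

    `client` is assumed to be an openai.OpenAI() instance. The default model
    "whisper-1" is an assumption, not something named in the original article.
    """
    # The endpoint expects a binary file object
    with open(audio_path, "rb") as audio_file:
        transcription = client.audio.transcriptions.create(
            model=model,
            file=audio_file,
        )
    # The response object exposes the recognized text as `.text`
    return transcription.text
```

With a real client this might be called as `transcribe_audio(OpenAI(), audio_output)`, using the `audio_output` path returned by the extraction function. Passing the client in as an argument keeps the helper easy to exercise with a stub in tests.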
