
Multilingual Speech Recognition with SenseVoice-Small ONNX: A Hands-On Java Guide

news 2026/3/27 0:53:18


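1. Introduction

Speech recognition is becoming increasingly important in enterprise application development. Transcribing customer service calls, generating meeting minutes automatically, and translating in real time across languages all depend on an efficient, reliable speech recognition solution. SenseVoice-Small is a lightweight multilingual speech recognition model that supports Chinese, English, Japanese, Korean, and other languages. It is described as recognizing speech more accurately than comparable models while also offering fast inference.

For Java developers, integrating such an AI model into the Spring Boot framework is a topic worth exploring. Python has traditionally dominated the AI field, but Java remains a mainstay of enterprise applications. This article walks step by step through integrating the SenseVoice-Small ONNX model into a Java environment, so that you can use modern speech recognition within the familiar Java ecosystem.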

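2. Environment Setup and Dependency Configuration

2.1 System Requirements and Base Environment

Before starting, make sure your development environment meets the following requirements:

JDK 11 or later (Spring Boot 3.x requires JDK 17 or later)
Maven 3.6+ or Gradle 7+
Spring Boot 2.7+ or 3.0+
At least 4 GB of free memory (model inference needs a fair amount of memory)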

2.2 Core Dependencies

Add the required dependencies to pom.xml:

<dependencies>
    <!-- Spring Boot web support -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>

    <!-- ONNX Runtime Java SDK -->
    <dependency>
        <groupId>com.microsoft.onnxruntime</groupId>
        <artifactId>onnxruntime</artifactId>
        <version>1.16.0</version>
    </dependency>

    <!-- Apache Tika core (file type detection; audio decoding itself uses the JDK's javax.sound.sampled) -->
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-core</artifactId>
        <version>2.7.0</version>
    </dependency>

    <!-- File handling utilities -->
    <dependency>
        <groupId>commons-io</groupId>
        <artifactId>commons-io</artifactId>
        <version>2.13.0</version>
    </dependency>
</dependencies>

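3. Model Preparation and Loading

3.1 Obtaining the SenseVoice-Small ONNX Model

First obtain the pretrained ONNX model file, which can be downloaded from ModelScope or HuggingFace. Once the file is in place, a Spring component loads it at application startup:

import ai.onnxruntime.OrtEnvironment;
import ai.onnxruntime.OrtException;
import ai.onnxruntime.OrtSession;
import jakarta.annotation.PostConstruct;   // javax.annotation.* on Spring Boot 2.x
import jakarta.annotation.PreDestroy;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Component;

@Component
public class ModelLoader {

    private final String modelPath;
    private OrtEnvironment environment;
    private OrtSession session;

    // Constructor injection; also lets the loader be created manually with new ModelLoader(path)
    public ModelLoader(@Value("${model.sensevoice.path}") String modelPath) {
        this.modelPath = modelPath;
    }

    @PostConstruct
    public void init() throws OrtException {
        environment = OrtEnvironment.getEnvironment();
        OrtSession.SessionOptions sessionOptions = new OrtSession.SessionOptions();
        // Configure session options
        sessionOptions.setOptimizationLevel(OrtSession.SessionOptions.OptLevel.ALL_OPT);
        sessionOptions.setInterOpNumThreads(4);
        sessionOptions.setIntraOpNumThreads(4);
        // Load the model
        session = environment.createSession(modelPath, sessionOptions);
    }

    @PreDestroy
    public void close() throws OrtException {
        // The session holds native memory, so release it on shutdown
        if (session != null) {
            session.close();
        }
    }

    public OrtSession getSession() {
        return session;
    }
}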

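3.2 Model Inputs and Outputs

The SenseVoice-Small model has the following input and output structure:

Input: audio feature matrix of shape [1, sequence length, 560]
Output: probability distribution over recognition results
Additional parameters: language ID and text normalization flag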
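The 560-dimensional input is seven times the 80 Mel bins used for FBank extraction later in this article, which suggests that consecutive frames are stacked before inference (often called low frame rate, or LFR, stacking). The sketch below shows one way this could be done. The class name `LfrStacker`, the stride of 6 frames, and the edge handling (repeating the first and last frame) are assumptions, not values confirmed by this article; check them against the frontend configuration shipped with the model.

```java
/** Hypothetical helper: stacks m consecutive frames with a stride of n frames. */
final class LfrStacker {

    /**
     * @param frames input features, shape [numFrames][dim] (for example dim = 80)
     * @param m      number of frames stacked per output row (7 gives 7 * 80 = 560)
     * @param n      stride in frames between output rows (6 is an assumption)
     * @return stacked features, shape [ceil((numFrames + (m - 1) / 2) / n)][m * dim]
     */
    static float[][] stack(float[][] frames, int m, int n) {
        int t = frames.length;
        if (t == 0) {
            return new float[0][0];
        }
        int dim = frames[0].length;
        int leftPad = (m - 1) / 2;                 // virtual copies of the first frame
        int outLen = (t + leftPad + n - 1) / n;    // ceil((t + leftPad) / n)
        float[][] out = new float[outLen][m * dim];
        for (int i = 0; i < outLen; i++) {
            for (int j = 0; j < m; j++) {
                int src = i * n + j - leftPad;
                // Clamp to the valid range: this repeats the first or last frame at the edges
                src = Math.max(0, Math.min(t - 1, src));
                System.arraycopy(frames[src], 0, out[i], j * dim, dim);
            }
        }
        return out;
    }
}
```

With 80-dimensional input frames, `LfrStacker.stack(frames, 7, 6)` yields rows of length 560, matching the last dimension listed above.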

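4. Audio Preprocessing

4.1 Reading Audio Files and Converting Formats

import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioInputStream;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.UnsupportedAudioFileException;
import java.io.File;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import org.springframework.stereotype.Service;

@Service
public class AudioPreprocessor {

    public float[] loadAndConvertAudio(String audioPath)
            throws IOException, UnsupportedAudioFileException {
        try (AudioInputStream source = AudioSystem.getAudioInputStream(new File(audioPath))) {
            AudioFormat sourceFormat = source.getFormat();
            int channels = sourceFormat.getChannels();
            // Decode to 16-bit signed little-endian PCM at the source sample rate.
            // Feature extraction later assumes 16 kHz, so resample separately if the rate differs.
            AudioFormat targetFormat = new AudioFormat(
                    AudioFormat.Encoding.PCM_SIGNED,
                    sourceFormat.getSampleRate(),
                    16,                              // bits per sample
                    channels,
                    channels * 2,                    // frame size in bytes
                    sourceFormat.getSampleRate(),    // frame rate
                    false);                          // little-endian
            try (AudioInputStream converted =
                         AudioSystem.getAudioInputStream(targetFormat, source)) {
                return convertBytesToFloatArray(converted.readAllBytes(), channels);
            }
        }
    }

    // Converts interleaved 16-bit PCM to mono floats in [-1, 1) by averaging the channels
    private float[] convertBytesToFloatArray(byte[] audioBytes, int channels) {
        ByteBuffer buffer = ByteBuffer.wrap(audioBytes).order(ByteOrder.LITTLE_ENDIAN);
        int frames = audioBytes.length / (2 * channels);
        float[] mono = new float[frames];
        for (int i = 0; i < frames; i++) {
            float sum = 0f;
            for (int c = 0; c < channels; c++) {
                sum += buffer.getShort() / 32768f;
            }
            mono[i] = sum / channels;
        }
        return mono;
    }
}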
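The feature extraction step in the next section is called with a sample rate of 16000, so audio recorded at another rate has to be resampled first. The following is a minimal sketch of a linear-interpolation resampler; the class name `LinearResampler` is hypothetical. Linear interpolation applies no low-pass filter, so downsampling this way can introduce aliasing; a production system would more likely use a filtered resampler.

```java
/** Hypothetical helper: resamples mono audio by linear interpolation. */
final class LinearResampler {

    static float[] resample(float[] input, int srcRate, int dstRate) {
        if (srcRate <= 0 || dstRate <= 0) {
            throw new IllegalArgumentException("sample rates must be positive");
        }
        if (input.length == 0 || srcRate == dstRate) {
            return input.clone();
        }
        int outLen = (int) ((long) input.length * dstRate / srcRate);
        float[] out = new float[outLen];
        double step = (double) srcRate / dstRate;   // input samples per output sample
        for (int i = 0; i < outLen; i++) {
            double pos = i * step;
            int i0 = Math.min((int) pos, input.length - 1);
            int i1 = Math.min(i0 + 1, input.length - 1);
            float frac = (float) (pos - i0);
            // Weighted average of the two neighbouring input samples
            out[i] = input[i0] * (1f - frac) + input[i1] * frac;
        }
        return out;
    }
}
```

For example, `LinearResampler.resample(samples, 48000, 16000)` reduces one second of 48 kHz audio to 16000 samples.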

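4.2 Feature Extraction and Normalization

import ai.onnxruntime.OnnxTensor;
import ai.onnxruntime.OrtEnvironment;
import ai.onnxruntime.OrtException;
import java.nio.FloatBuffer;

public class FeatureExtractor {

    public static OnnxTensor extractFeatures(float[] audioData) throws OrtException {
        // Compute FBank features (16 kHz audio, 80 Mel bins)
        float[][] fbankFeatures = computeFbank(audioData, 16000, 80);
        // Apply mean-variance normalization
        normalizeFeatures(fbankFeatures);
        // Convert to an ONNX tensor of shape [1, frames, 80].
        // Section 3.2 lists a feature dimension of 560, so these 80-dimensional frames
        // presumably still need to be stacked before they match the model input.
        long[] shape = {1, (long) fbankFeatures.length, 80};
        return OnnxTensor.createTensor(OrtEnvironment.getEnvironment(),
                FloatBuffer.wrap(flattenArray(fbankFeatures)), shape);
    }

    private static float[][] computeFbank(float[] audio, int sampleRate, int numMelBins) {
        // Placeholder: FBank extraction is not implemented here.
        // A full implementation covers pre-emphasis, framing, windowing,
        // FFT, and application of the Mel filterbank.
        return new float[0][0];
    }

    // normalizeFeatures and flattenArray are helper methods that are not shown here
}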
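The extractor above calls `normalizeFeatures` and `flattenArray` without showing them. A minimal sketch of both follows, under the assumption that normalization is computed per utterance and per feature dimension. The class name `FeatureUtils` is hypothetical, and if the model ships precomputed global mean and variance statistics, those should be used instead of per-utterance values.

```java
/** Hypothetical helpers for the normalization and flattening steps. */
final class FeatureUtils {

    /** In-place mean-variance normalization of each feature dimension across all frames. */
    static void normalizeFeatures(float[][] feats) {
        if (feats.length == 0) {
            return;
        }
        int dim = feats[0].length;
        for (int d = 0; d < dim; d++) {
            double sum = 0.0;
            double sumSq = 0.0;
            for (float[] frame : feats) {
                sum += frame[d];
                sumSq += (double) frame[d] * frame[d];
            }
            double mean = sum / feats.length;
            double variance = Math.max(sumSq / feats.length - mean * mean, 0.0);
            double std = Math.sqrt(variance) + 1e-8;   // epsilon avoids division by zero
            for (float[] frame : feats) {
                frame[d] = (float) ((frame[d] - mean) / std);
            }
        }
    }

    /** Flattens [frames][dim] into one row-major array, as expected by a tensor buffer. */
    static float[] flattenArray(float[][] feats) {
        if (feats.length == 0) {
            return new float[0];
        }
        int dim = feats[0].length;
        float[] flat = new float[feats.length * dim];
        for (int i = 0; i < feats.length; i++) {
            System.arraycopy(feats[i], 0, flat, i * dim, dim);
        }
        return flat;
    }
}
```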

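5. Spring Boot Integration in Practice

5.1 Configuration Class Design

import org.springframework.beans.factory.annotation.Value;
import org.springframework.boot.autoconfigure.condition.ConditionalOnProperty;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

// Explicit bean definitions are an alternative to the @Component/@Service annotations
// on these classes. Use one approach or the other: registering the same bean name
// twice can fail at startup.
@Configuration
public class SpeechRecognitionConfig {

    @Bean
    @ConditionalOnProperty(name = "speech.recognition.enabled", havingValue = "true")
    public SpeechRecognitionService speechRecognitionService(
            ModelLoader modelLoader, AudioPreprocessor audioPreprocessor) {
        // The service constructor takes both collaborators
        return new SpeechRecognitionService(modelLoader, audioPreprocessor);
    }

    @Bean
    public ModelLoader modelLoader(@Value("${model.sensevoice.path}") String modelPath) {
        // Requires a ModelLoader constructor that accepts the model path
        return new ModelLoader(modelPath);
    }
}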

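5.2 Core Service Implementation

import ai.onnxruntime.OnnxTensor;
import ai.onnxruntime.OrtEnvironment;
import ai.onnxruntime.OrtException;
import ai.onnxruntime.OrtSession;
import java.nio.LongBuffer;
import java.util.HashMap;
import java.util.Map;
import org.springframework.stereotype.Service;

@Service
public class SpeechRecognitionService {

    private final ModelLoader modelLoader;
    private final AudioPreprocessor audioPreprocessor;

    public SpeechRecognitionService(ModelLoader modelLoader,
                                    AudioPreprocessor audioPreprocessor) {
        this.modelLoader = modelLoader;
        this.audioPreprocessor = audioPreprocessor;
    }

    public RecognitionResult recognizeSpeech(String audioPath, String language) {
        try {
            // 1. Preprocess the audio
            float[] audioData = audioPreprocessor.loadAndConvertAudio(audioPath);
            // 2. Extract features (tensors hold native memory, so close them when done)
            try (OnnxTensor features = FeatureExtractor.extractFeatures(audioData)) {
                // 3. Prepare the model inputs
                Map<String, OnnxTensor> inputs = prepareModelInputs(features, language);
                // 4. Run inference
                try (OnnxTensor languageTensor = inputs.get("language");
                     OrtSession.Result results = modelLoader.getSession().run(inputs)) {
                    // 5. Process the output while the result is still open
                    return processRecognitionResult(results);
                }
            }
        } catch (Exception e) {
            throw new RecognitionException("Speech recognition failed", e);
        }
    }

    private Map<String, OnnxTensor> prepareModelInputs(OnnxTensor features, String language)
            throws OrtException {
        // Input names and element types must match the exported model;
        // session.getInputNames() lists the names it expects.
        Map<String, OnnxTensor> inputs = new HashMap<>();
        inputs.put("x", features);
        // Add the language ID
        long[] languageId = getLanguageId(language);
        inputs.put("language", OnnxTensor.createTensor(
                OrtEnvironment.getEnvironment(), LongBuffer.wrap(languageId), new long[]{1}));
        return inputs;
    }

    // getLanguageId and processRecognitionResult are helper methods that are not shown here
}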
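The service above leaves `processRecognitionResult` undefined. If the model output is a matrix of per-frame scores over the vocabulary (an assumption; section 3.2 only describes it as a probability distribution), one common way to turn it into token IDs is greedy CTC decoding: take the best token per frame, collapse consecutive repeats, and drop the blank token. The sketch below shows that technique. The class name `CtcGreedyDecoder` and a blank ID of 0 are assumptions, and mapping token IDs to text requires the model's own vocabulary, which is not shown.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: greedy CTC decoding of per-frame scores into token IDs. */
final class CtcGreedyDecoder {

    /**
     * @param logits  scores for one utterance, shape [frames][vocabSize]
     * @param blankId index of the CTC blank token (depends on the model)
     */
    static List<Integer> decode(float[][] logits, int blankId) {
        List<Integer> tokens = new ArrayList<>();
        int previous = -1;
        for (float[] frame : logits) {
            // Argmax over the vocabulary for this frame
            int best = 0;
            for (int v = 1; v < frame.length; v++) {
                if (frame[v] > frame[best]) {
                    best = v;
                }
            }
            // Emit only when the token changes and is not the blank
            if (best != previous && best != blankId) {
                tokens.add(best);
            }
            previous = best;
        }
        return tokens;
    }
}
```

A blank frame between two identical tokens keeps them separate, which is how CTC represents genuinely repeated tokens.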

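5.3 RESTful API Design

import java.io.IOException;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.multipart.MultipartFile;

@RestController
@RequestMapping("/api/speech")
public class SpeechRecognitionController {

    @Autowired
    private SpeechRecognitionService recognitionService;

    @PostMapping("/recognize")
    public ResponseEntity<RecognitionResponse> recognize(
            @RequestParam("audio") MultipartFile audioFile,
            @RequestParam(value = "language", defaultValue = "auto") String language)
            throws IOException {
        // Declared outside the try block so that the finally block can see it
        String tempFilePath = null;
        try {
            // Save the uploaded audio file
            tempFilePath = saveUploadedFile(audioFile);
            // Run speech recognition
            RecognitionResult result = recognitionService.recognizeSpeech(
                    tempFilePath, language);
            return ResponseEntity.ok(new RecognitionResponse(
                    result.getText(),
                    result.getConfidence(),
                    System.currentTimeMillis()));
        } finally {
            // Clean up the temporary file, if one was created
            if (tempFilePath != null) {
                cleanupTempFile(tempFilePath);
            }
        }
    }

    // saveUploadedFile and cleanupTempFile are helper methods that are not shown here
}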
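The controller relies on two helpers, `saveUploadedFile` and `cleanupTempFile`, that the article does not show. A minimal sketch of how they might look using only `java.nio.file` follows; the class name `TempAudioFiles` and the file name prefix are hypothetical. It takes an `InputStream`, which a controller could obtain from `MultipartFile.getInputStream()`.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper for storing uploads in temporary files and removing them afterwards. */
final class TempAudioFiles {

    /** Copies the stream into a new temporary file and returns its path. */
    static Path save(InputStream in, String suffix) throws IOException {
        Path tmp = Files.createTempFile("speech-upload-", suffix);
        try {
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            return tmp;
        } catch (IOException e) {
            // Do not leave a half-written file behind
            Files.deleteIfExists(tmp);
            throw e;
        }
    }

    /** Best-effort deletion; safe to call with null or an already deleted file. */
    static void cleanup(Path tmp) {
        if (tmp == null) {
            return;
        }
        try {
            Files.deleteIfExists(tmp);
        } catch (IOException ignored) {
            // Cleanup failures should not mask the real result of the request
        }
    }
}
```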

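6. Performance Optimization and Best Practices

6.1 Memory Management

import ai.onnxruntime.OnnxTensor;
import ai.onnxruntime.OrtSession;
import java.util.Map;

public class MemoryOptimizedRecognition {

    private final OrtSession session;
    private final AudioPreprocessor audioPreprocessor;

    public MemoryOptimizedRecognition(OrtSession session, AudioPreprocessor audioPreprocessor) {
        this.session = session;
        this.audioPreprocessor = audioPreprocessor;
    }

    // try-with-resources guarantees that tensors and results release their native memory
    public RecognitionResult recognizeWithResourceManagement(String audioPath) throws Exception {
        float[] audioData = audioPreprocessor.loadAndConvertAudio(audioPath);
        try (OnnxTensor features = FeatureExtractor.extractFeatures(audioData);
             OnnxTensor languageTensor = createLanguageTensor()) {
            Map<String, OnnxTensor> inputs = Map.of(
                    "x", features,
                    "language", languageTensor
            );
            try (OrtSession.Result results = session.run(inputs)) {
                // Copy everything needed out of results before it is closed
                return processResults(results);
            }
        }
    }

    // createLanguageTensor and processResults are helper methods that are not shown here
}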

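6.2 Batch Processing

import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;
import org.springframework.stereotype.Service;

@Service
public class BatchRecognitionService {

    private final SpeechRecognitionService recognitionService;
    // A bounded pool caps concurrent inference calls (4 is an arbitrary example value)
    private final ExecutorService executor = Executors.newFixedThreadPool(4);

    public BatchRecognitionService(SpeechRecognitionService recognitionService) {
        this.recognitionService = recognitionService;
    }

    // supplyAsync already runs the task asynchronously, so @Async is not needed here;
    // it would also be bypassed when the method is called from inside this class.
    public CompletableFuture<RecognitionResult> recognizeAsync(String audioPath) {
        return CompletableFuture.supplyAsync(
                () -> recognitionService.recognizeSpeech(audioPath, "auto"), executor);
    }

    public List<RecognitionResult> recognizeBatch(List<String> audioPaths) {
        // Submit every task first, then wait, so that the tasks actually overlap
        List<CompletableFuture<RecognitionResult>> futures = audioPaths.stream()
                .map(this::recognizeAsync)
                .collect(Collectors.toList());
        return futures.stream()
                .map(CompletableFuture::join)
                .collect(Collectors.toList());
    }
}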
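Because model inference is CPU- and memory-intensive, it usually helps to cap how many recognitions run at once rather than letting a batch fan out without limit. The sketch below shows the pattern in isolation, independent of Spring and ONNX Runtime: a generic runner that processes a list of items on a fixed-size pool and returns results in input order. The class name `BoundedBatchRunner` is hypothetical, and creating a pool per call is for brevity only; a long-lived service would share one pool.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

/** Hypothetical helper: runs a task over many items with bounded concurrency. */
final class BoundedBatchRunner {

    static <T, R> List<R> runAll(List<T> items, Function<T, R> task, int maxConcurrency) {
        ExecutorService pool = Executors.newFixedThreadPool(maxConcurrency);
        try {
            // Submit all tasks before waiting on any of them
            List<CompletableFuture<R>> futures = new ArrayList<>(items.size());
            for (T item : items) {
                futures.add(CompletableFuture.supplyAsync(() -> task.apply(item), pool));
            }
            // join() preserves input order and rethrows task failures
            List<R> results = new ArrayList<>(futures.size());
            for (CompletableFuture<R> future : futures) {
                results.add(future.join());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

In the context of this article, the task would be something like `path -> recognitionService.recognizeSpeech(path, "auto")`.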

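6.3 Caching Strategy

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.springframework.cache.annotation.CacheConfig;
import org.springframework.cache.annotation.Cacheable;
import org.springframework.stereotype.Service;
import org.springframework.util.DigestUtils;

// Requires caching to be enabled in the application (for example with @EnableCaching)
@Service
@CacheConfig(cacheNames = "recognitionResults")
public class CachedRecognitionService {

    private final SpeechRecognitionService recognitionService;

    public CachedRecognitionService(SpeechRecognitionService recognitionService) {
        this.recognitionService = recognitionService;
    }

    // A SpEL key can reference method parameters but not local variables,
    // so the content hash is computed inside the key expression itself.
    @Cacheable(key = "#root.target.computeAudioHash(#audioPath) + ':' + #language")
    public RecognitionResult recognizeWithCache(String audioPath, String language) {
        return recognitionService.recognizeSpeech(audioPath, language);
    }

    // Public so that the key expression above can call it
    public String computeAudioHash(String filePath) {
        // Hash the audio file content for use in the cache key
        try {
            byte[] fileContent = Files.readAllBytes(Paths.get(filePath));
            return DigestUtils.md5DigestAsHex(fileContent);
        } catch (IOException e) {
            throw new RuntimeException("Failed to read file", e);
        }
    }
}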
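The idea behind this cache is that the key depends on the audio content and the language, not on the file path. The following sketch shows that idea without Spring, using a `ConcurrentHashMap` and the JDK's `MessageDigest`. The class name `ContentKeyedCache` is hypothetical, SHA-256 is swapped in for MD5 here as a design choice, and unlike a real cache this map never evicts entries.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Hypothetical helper: caches values under a key derived from content and language. */
final class ContentKeyedCache<V> {

    private final Map<String, V> cache = new ConcurrentHashMap<>();

    /** Hex-encoded SHA-256 digest of the given bytes. */
    static String sha256Hex(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(content);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is not available", e);
        }
    }

    /** Returns the cached value, computing and storing it on the first request. */
    V getOrCompute(byte[] audioBytes, String language, Supplier<V> compute) {
        String key = sha256Hex(audioBytes) + ":" + language;
        return cache.computeIfAbsent(key, k -> compute.get());
    }
}
```

Two uploads with identical bytes and the same language share one entry, while the same audio requested in a different language is computed separately.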

7. Error Handling and Monitoring

7.1 Exception Handling Design

import ai.onnxruntime.OrtException;
import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.ControllerAdvice;
import org.springframework.web.bind.annotation.ExceptionHandler;

@ControllerAdvice
public class RecognitionExceptionHandler {

    @ExceptionHandler(RecognitionException.class)
    public ResponseEntity<ErrorResponse> handleRecognitionException(
            RecognitionException ex) {
        ErrorResponse error = new ErrorResponse(
                "RECOGNITION_ERROR",
                ex.getMessage(),
                System.currentTimeMillis());
        return ResponseEntity.status(HttpStatus.INTERNAL_SERVER_ERROR)
                .body(error);
    }

    @ExceptionHandler(OrtException.class)
    public ResponseEntity<ErrorResponse> handleOrtException(OrtException ex) {
        ErrorResponse error = new ErrorResponse(
                "MODEL_ERROR",
                "Model inference error: " + ex.getMessage(),
                System.currentTimeMillis());
        return ResponseEntity.status(HttpStatus.INTERNAL_SERVER_ERROR)
                .body(error);
    }
}

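7.2 Monitoring and Logging

import lombok.extern.slf4j.Slf4j;
import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;
import org.springframework.stereotype.Component;

@Aspect
@Component
@Slf4j
public class RecognitionMonitor {

    @Around("execution(* com.example.service.SpeechRecognitionService.recognizeSpeech(..))")
    public Object monitorRecognition(ProceedingJoinPoint joinPoint) throws Throwable {
        long startTime = System.currentTimeMillis();
        String audioPath = (String) joinPoint.getArgs()[0];
        try {
            Object result = joinPoint.proceed();
            long duration = System.currentTimeMillis() - startTime;
            log.info("Speech recognition finished - audio: {}, duration: {}ms",
                    audioPath, duration);
            // Publish monitoring metrics (Metrics is an application-specific helper, not shown)
            Metrics.recordRecognitionTime(duration);
            return result;
        } catch (Exception e) {
            log.error("Speech recognition failed - audio: {}", audioPath, e);
            Metrics.recordRecognitionError();
            throw e;
        }
    }
}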
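The aspect calls `Metrics.recordRecognitionTime` and `Metrics.recordRecognitionError`, but the `Metrics` class itself is not shown. A minimal in-memory sketch of such a class follows; it only reuses the two method names that appear in the aspect, and everything else (the counters and the `errorCount` and `averageMillis` accessors) is hypothetical. A Spring Boot application would more typically publish such values through a metrics library instead.

```java
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical in-memory metrics holder matching the calls made by the aspect. */
final class Metrics {

    private static final LongAdder COUNT = new LongAdder();
    private static final LongAdder TOTAL_MILLIS = new LongAdder();
    private static final LongAdder ERRORS = new LongAdder();

    private Metrics() {
    }

    /** Records one successful recognition and how long it took. */
    static void recordRecognitionTime(long millis) {
        COUNT.increment();
        TOTAL_MILLIS.add(millis);
    }

    /** Records one failed recognition. */
    static void recordRecognitionError() {
        ERRORS.increment();
    }

    static long errorCount() {
        return ERRORS.sum();
    }

    /** Mean duration of successful recognitions, or 0 if none were recorded. */
    static double averageMillis() {
        long n = COUNT.sum();
        return n == 0 ? 0.0 : (double) TOTAL_MILLIS.sum() / n;
    }
}
```

`LongAdder` is used because many request threads may update the counters concurrently.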

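8. Practical Application Scenarios

8.1 Customer Service System Integration

import java.util.Map;
import org.springframework.stereotype.Service;

@Service
public class CustomerServiceIntegration {

    private final SpeechRecognitionService recognitionService;

    public CustomerServiceIntegration(SpeechRecognitionService recognitionService) {
        this.recognitionService = recognitionService;
    }

    public void processCustomerCall(String callRecordingPath) {
        RecognitionResult result = recognitionService.recognizeSpeech(
                callRecordingPath, "zh");
        // Extract key information
        Map<String, String> extractedInfo = extractCustomerInfo(result.getText());
        // Create a service ticket
        createServiceTicket(extractedInfo);
        // Analyze customer sentiment
        analyzeCustomerSentiment(result.getText());
    }

    // extractCustomerInfo, createServiceTicket and analyzeCustomerSentiment
    // are business-specific methods that are not shown here
}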

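8.2 Automated Meeting Transcription

import java.util.List;
import org.springframework.stereotype.Service;

@Service
public class MeetingTranscriptionService {

    public MeetingSummary transcribeMeeting(String meetingAudioPath) {
        // Split the recording into segments and recognize each one
        List<RecognitionResult> segmentResults = segmentAndRecognize(
                meetingAudioPath);
        MeetingSummary summary = new MeetingSummary();
        summary.setTranscription(combineSegments(segmentResults));
        summary.setActionItems(extractActionItems(segmentResults));
        summary.setParticipants(identifySpeakers(segmentResults));
        return summary;
    }

    // segmentAndRecognize, combineSegments, extractActionItems and identifySpeakers
    // are helper methods that are not shown here
}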
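The segmentation step inside `segmentAndRecognize` is not shown. One simple approach, sketched below, is to cut the decoded samples into fixed-length windows with a small overlap so that words at a boundary appear in both neighbours. This is an assumption about how segmentation could work, not the article's method; the class name `AudioSegmenter` is hypothetical, and cutting at pauses with a voice activity detector would generally give cleaner segments.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: splits mono samples into fixed-length, overlapping segments. */
final class AudioSegmenter {

    static List<float[]> split(float[] samples, int sampleRate,
                               double segmentSeconds, double overlapSeconds) {
        int segmentLength = (int) Math.round(segmentSeconds * sampleRate);
        int hop = segmentLength - (int) Math.round(overlapSeconds * sampleRate);
        if (segmentLength <= 0 || hop <= 0) {
            throw new IllegalArgumentException("segment must be longer than the overlap");
        }
        List<float[]> segments = new ArrayList<>();
        for (int start = 0; start < samples.length; start += hop) {
            int end = Math.min(start + segmentLength, samples.length);
            segments.add(Arrays.copyOfRange(samples, start, end));
            if (end == samples.length) {
                break;   // the last segment reached the end of the audio
            }
        }
        return segments;
    }
}
```

For 16 kHz audio, `AudioSegmenter.split(samples, 16000, 30.0, 1.0)` would produce 30-second segments that each share one second with the next; both durations are example values.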