
VADER Sentiment in Practice: Adding Sentiment Intelligence to Social Media Text

news 2026/7/6 5:09:46


Project: vaderSentiment (VADER Sentiment Analysis). VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media, and works well on texts from other domains. Project address: https://gitcode.com/gh_mirrors/va/vaderSentiment

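Have you ever faced a mountain of user comments, social media posts, or product feedback with no quick way to gauge the sentiment behind them? Sentiment analysis has become a key technique for understanding what users are saying. VADER Sentiment was built for exactly this problem: it is tuned for social media text, and it also handles most other short-text sentiment tasks well.

Why Choose VADER Over Other Approaches?

Before going deeper, let's be clear about what VADER offers. Compared with other sentiment analysis tools, it has several notable advantages: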

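Dimension	VADER	Traditional machine learning	Deep learning models
Deployment speed	Usable immediately, no training	Needs labeled data for training	Needs large datasets and compute resources
Social media fit	Tuned for it; understands slang and emoticons	General-purpose; results are mixed	Needs domain-specific fine-tuning
Computational cost	Roughly O(N) in text length; fast on a CPU	Depends on the features and model; training adds cost	Usually the heaviest; often benefits from a GPU
Rule transparency	Fully transparent and easy to explain	Often opaque and hard to explain	Highly opaque and hard to debug
Special text handling	Handles emoticons, emoji, and abbreviations out of the box	Needs extra preprocessing	Needs large amounts of training data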

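Quick Start: Build Your First Sentiment Analyzer in 5 Minutes

Installation and Basic Usage

Let's start with installation. VADER installs with a single pip command: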

pip install vaderSentiment

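Once it is installed, you can start using it right away:

    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    # Create the analyzer instance
    analyzer = SentimentIntensityAnalyzer()

    # Analyze a single piece of text
    text = "VADER is absolutely amazing! It's incredibly useful for social media analysis."
    scores = analyzer.polarity_scores(text)

    # Map the compound score to a label using the conventional thresholds
    if scores['compound'] >= 0.05:
        label = 'positive'
    elif scores['compound'] <= -0.05:
        label = 'negative'
    else:
        label = 'neutral'

    print(f"Text: {text}")
    print(f"Scores: {scores}")
    print(f"Sentiment: {label}")

This code prints output of the following form (the scores shown are illustrative; the exact values depend on the installed version of the library and its lexicon):

    Text: VADER is absolutely amazing! It's incredibly useful for social media analysis.
    Scores: {'neg': 0.0, 'neu': 0.294, 'pos': 0.706, 'compound': 0.9469}
    Sentiment: positive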

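Understanding the Output

VADER returns four key metrics:

neg: proportion of the text rated negative (between 0 and 1)
neu: proportion of the text rated neutral (between 0 and 1)
pos: proportion of the text rated positive (between 0 and 1)
compound: normalized overall sentiment score (between -1 and 1)

Tip: neg, neu, and pos are proportions, so together they add up to about 1. The compound score is the most commonly used metric. The usual thresholds are: 0.05 or above is positive, -0.05 or below is negative, and anything in between is neutral.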
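To make the compound score less of a black box, here is a minimal sketch of how a raw valence sum can be squashed into the -1 to 1 range and then turned into a label. It assumes the normalization used by the reference implementation, x / sqrt(x² + α) with α = 15; the helper names `normalize` and `label` are chosen for this illustration and are not part of the public API.

```python
import math


def normalize(valence_sum: float, alpha: float = 15.0) -> float:
    """Squash an unbounded valence sum into the range [-1, 1]."""
    score = valence_sum / math.sqrt(valence_sum * valence_sum + alpha)
    return max(-1.0, min(1.0, score))


def label(compound: float, threshold: float = 0.05) -> str:
    """Apply the conventional +/-0.05 thresholds to a compound score."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# A few hand-picked valence sums, only to show the shape of the curve
for total in (0.0, 1.9, -3.1, 8.0):
    compound = normalize(total)
    print(f"{total:+.1f} -> {compound:+.4f} ({label(compound)})")
```

Because the curve flattens as the sum grows, piling on more sentiment words pushes the score toward 1 or -1 without ever exceeding it.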

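Core Concepts: How VADER "Thinks" About Sentiment

The Wisdom of the Sentiment Lexicon

At VADER's core is a sentiment lexicon with more than 7,500 entries. Each entry carries a sentiment intensity from -4 (extremely negative) to +4 (extremely positive). What makes this lexicon special:

Social-media friendly: it includes many slang terms, abbreviations, and emoticons
Graded intensity: it captures how strong a sentiment is, not just whether it is positive or negative
Human validated: every entry was rated by 10 independent human raters

    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    # Look up a few sample entries in the lexicon
    analyzer = SentimentIntensityAnalyzer()

    sample_words = ['excellent', 'good', 'okay', 'bad', 'terrible', 'lol', ':)', 'sucks']
    for word in sample_words:
        # analyzer.lexicon is a plain dict mapping token -> mean valence
        if word in analyzer.lexicon:
            print(f"{word}: {analyzer.lexicon[word]}")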
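For a sense of what sits behind that dictionary, the sketch below parses lexicon rows in the tab-separated layout the project ships (token, mean rating, standard deviation, raw human ratings). The two sample rows are made up for illustration and are not copied from the real lexicon file; `parse_lexicon` is a hypothetical helper name.

```python
import ast
from statistics import mean

# Made-up rows in the layout: TOKEN <tab> MEAN <tab> STD <tab> RAW RATINGS
SAMPLE_LEXICON = (
    "good\t1.9\t0.9\t[2, 1, 1, 3, 2, 4, 2, 2, 1, 1]\n"
    "awful\t-2.0\t0.6\t[-2, -2, -3, -1, -2, -2, -3, -2, -1, -2]\n"
)


def parse_lexicon(raw_text):
    """Return {token: {'mean': float, 'ratings': list}} for each row."""
    lexicon = {}
    for line in raw_text.splitlines():
        if not line.strip():
            continue
        token, mean_rating, _std, raw_ratings = line.split("\t")
        lexicon[token] = {
            "mean": float(mean_rating),
            "ratings": ast.literal_eval(raw_ratings),
        }
    return lexicon


for token, entry in parse_lexicon(SAMPLE_LEXICON).items():
    # The stored mean is simply the average of the ten rater scores
    print(token, entry["mean"], mean(entry["ratings"]))
```

The analyzer itself only needs the token and its mean rating; the raw ratings are kept in the file so the scores can be audited.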

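The Magic of Grammatical Rules

VADER is more than simple dictionary lookup. It applies a set of grammatical rules to capture the nuances of a text:

Negation handling: "not good" is recognized as negative
Degree adverbs: "very good" is more positive than "good"
Capitalization emphasis: "AMAZING" is stronger than "amazing" when the rest of the text is not all caps
Punctuation: "Good!!!" is more positive than "Good."
Contrastive conjunctions: "but" shifts the weight toward the clause that follows it

    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    # Show how the grammatical rules change the score
    test_sentences = [
        "The product is good.",
        "The product is not good.",             # negation
        "The product is very good.",            # degree adverb
        "The product is VERY GOOD!",            # capitalization emphasis
        "The product is good, but expensive.",  # contrastive conjunction
    ]

    analyzer = SentimentIntensityAnalyzer()
    for sentence in test_sentences:
        scores = analyzer.polarity_scores(sentence)
        print(f"{sentence:50} -> compound: {scores['compound']:.4f}")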
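The rules listed above can be approximated in a toy scorer. This is a simplified sketch, not VADER's actual implementation: the tiny lexicon is hypothetical, the punctuation rule is left out, and the constants (0.293 booster increment, 0.733 capitalization increment, -0.74 negation scalar, 0.5/1.5 weighting around "but") are assumed to match the values in the reference source.

```python
BOOSTERS = {"very": 0.293, "extremely": 0.293}
NEGATIONS = {"not", "never", "no"}
CAPS_INCR = 0.733
NEGATION_SCALAR = -0.74

# Hypothetical mini-lexicon for this illustration only
TOY_LEXICON = {"good": 1.9, "great": 3.1, "bad": -2.5, "expensive": -1.5}


def toy_valences(text):
    """Return one rule-adjusted valence per word."""
    words = [w.strip(".,!?") for w in text.split()]
    words = [w for w in words if w]
    # Caps only count as emphasis when some, but not all, words are upper case
    caps = [w.isupper() for w in words]
    mixed_case = any(caps) and not all(caps)

    valences = []
    for i, word in enumerate(words):
        valence = TOY_LEXICON.get(word.lower(), 0.0)
        if valence != 0.0:
            sign = 1 if valence > 0 else -1
            if mixed_case and word.isupper():
                valence += sign * CAPS_INCR
            if i > 0 and words[i - 1].lower() in BOOSTERS:
                valence += sign * BOOSTERS[words[i - 1].lower()]
            # A negation within the three preceding words flips and dampens
            window = [w.lower() for w in words[max(0, i - 3):i]]
            if any(w in NEGATIONS for w in window):
                valence *= NEGATION_SCALAR
        valences.append(valence)

    # "but": dampen what comes before it, amplify what comes after it
    lowered = [w.lower() for w in words]
    if "but" in lowered:
        pivot = lowered.index("but")
        valences = [
            v * 0.5 if i < pivot else v * 1.5 if i > pivot else v
            for i, v in enumerate(valences)
        ]
    return valences


def toy_score(text):
    return sum(toy_valences(text))


for sentence in [
    "The product is good.",
    "The product is not good.",
    "The product is very good.",
    "The product is VERY GOOD!",
    "The product is good, but expensive.",
]:
    print(f"{sentence:40} -> {toy_score(sentence):+.3f}")
```

The real analyzer goes on to normalize the summed valences into the compound score; the toy version stops at the raw sum so each rule's effect stays visible.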

Going Further: Working With Real-World Data

Batch Processing Social Media Data

In practice we usually need to process large volumes of text. Here is a practical batch-processing example:

    import pandas as pd
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer


    def analyze_social_media_data(tweets_df, text_column='text'):
        """
        Run sentiment analysis over a batch of social media posts.

        Args:
            tweets_df: DataFrame containing the text data
            text_column: name of the text column

        Returns:
            A copy of the DataFrame with sentiment columns added
        """
        analyzer = SentimentIntensityAnalyzer()
        result = tweets_df.copy()  # leave the caller's DataFrame untouched

        # Score every row; each score dict becomes one row of `scores`
        scores = pd.DataFrame(
            [analyzer.polarity_scores(str(text)) for text in result[text_column]],
            index=result.index,
        )
        for key in ('neg', 'neu', 'pos', 'compound'):
            result[f'{key}_score'] = scores[key]

        # Add a sentiment label
        result['sentiment'] = result['compound_score'].apply(
            lambda x: 'positive' if x >= 0.05 else 'negative' if x <= -0.05 else 'neutral'
        )
        return result


    # Usage example
    tweets_data = pd.DataFrame({
        'text': [
            "Just tried the new feature, it's awesome! 😍",
            "The update broke my workflow. Very frustrating.",
            "Meh, it's okay I guess.",
            "LOVE the new interface!!! So intuitive!",
            "Not bad, but could be better."
        ],
        'user': ['user1', 'user2', 'user3', 'user4', 'user5'],
        # lowercase 'h' avoids the deprecation warning newer pandas raises for 'H'
        'timestamp': pd.date_range('2024-01-01', periods=5, freq='h')
    })

    result_df = analyze_social_media_data(tweets_data)
    print(result_df[['text', 'compound_score', 'sentiment']])

Sentiment Time Series Analysis

For social media monitoring or product feedback analysis, the time dimension is essential:

    import matplotlib.pyplot as plt
    import pandas as pd
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer


    def analyze_sentiment_trends(data_df, time_column='timestamp', text_column='text'):
        """
        Analyze how sentiment changes over time.

        Args:
            data_df: data containing timestamps and text
            time_column: name of the time column
            text_column: name of the text column

        Returns:
            The matplotlib figure and the hourly aggregate DataFrame
        """
        data_df = data_df.copy()

        # Make sure the timestamps are real datetimes
        data_df[time_column] = pd.to_datetime(data_df[time_column])

        # Run sentiment analysis
        analyzer = SentimentIntensityAnalyzer()
        data_df['sentiment_score'] = data_df[text_column].apply(
            lambda x: analyzer.polarity_scores(str(x))['compound']
        )

        # Group by time bucket (hourly here)
        data_df['hour'] = data_df[time_column].dt.floor('h')
        hourly_sentiment = (
            data_df.groupby('hour')['sentiment_score']
            .agg(['mean', 'count'])
            .reset_index()
        )

        # Build the charts
        fig, axes = plt.subplots(2, 1, figsize=(12, 8))

        # Sentiment score trend
        axes[0].plot(hourly_sentiment['hour'], hourly_sentiment['mean'],
                     marker='o', linewidth=2, color='steelblue')
        axes[0].axhline(y=0.05, color='green', linestyle='--', alpha=0.5,
                        label='Positive Threshold')
        axes[0].axhline(y=-0.05, color='red', linestyle='--', alpha=0.5,
                        label='Negative Threshold')
        axes[0].fill_between(hourly_sentiment['hour'], hourly_sentiment['mean'],
                             alpha=0.3, color='steelblue')
        axes[0].set_title('Sentiment Score Over Time', fontsize=14, fontweight='bold')
        axes[0].set_xlabel('Time')
        axes[0].set_ylabel('Mean sentiment score')
        axes[0].legend()
        axes[0].grid(True, alpha=0.3)

        # Volume distribution (bar width is in days on a date axis; 0.03 is about 43 minutes)
        axes[1].bar(hourly_sentiment['hour'], hourly_sentiment['count'],
                    width=0.03, color='lightcoral', alpha=0.7)
        axes[1].set_title('Text Volume Over Time', fontsize=14, fontweight='bold')
        axes[1].set_xlabel('Time')
        axes[1].set_ylabel('Number of texts')
        axes[1].grid(True, alpha=0.3)

        plt.tight_layout()
        return fig, hourly_sentiment
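If pandas and matplotlib are not available, the grouping step on its own can be sketched with the standard library. This assumes the compound scores have already been computed; the sample records are made up, and `hourly_mean` is a hypothetical helper name.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def hourly_mean(records):
    """records: iterable of (datetime, compound_score) pairs.

    Returns {hour_start: (mean_score, count)} ordered by time.
    """
    buckets = defaultdict(list)
    for timestamp, score in records:
        hour_start = timestamp.replace(minute=0, second=0, microsecond=0)
        buckets[hour_start].append(score)
    return {
        hour: (mean(scores), len(scores))
        for hour, scores in sorted(buckets.items())
    }


# Made-up scores for illustration
sample = [
    (datetime(2024, 1, 1, 9, 5), 0.6),
    (datetime(2024, 1, 1, 9, 40), 0.2),
    (datetime(2024, 1, 1, 10, 15), -0.4),
]

for hour, (avg, count) in hourly_mean(sample).items():
    print(hour.isoformat(), f"mean={avg:+.2f}", f"n={count}")
```

Keeping the count next to the mean matters: an hourly average built from one or two posts is much noisier than one built from hundreds.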

Advanced Techniques: Customization and Optimization

Extending the Sentiment Lexicon

VADER's lexicon is already broad, but a specific domain may call for custom vocabulary:

    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer


    def extend_vader_lexicon(custom_words_dict):
        """
        Extend the VADER sentiment lexicon.

        Args:
            custom_words_dict: dict of the form {'word': valence};
                valences should stay between -4 and 4, and keys should be
                lowercase because tokens are lowercased before lookup

        Returns:
            An analyzer instance with the extended lexicon
        """
        analyzer = SentimentIntensityAnalyzer()
        # Add the custom vocabulary
        analyzer.lexicon.update(custom_words_dict)
        return analyzer


    # Example: vocabulary specific to the tech domain
    tech_lexicon = {
        'buggy': -2.5,
        'responsive': 2.0,
        'scalable': 2.5,
        'bloated': -2.0,
        'intuitive': 3.0,
        'clunky': -2.8,
        'smooth': 2.2,
        'crashes': -3.5,
        'snappy': 2.3,
        'laggy': -2.5
    }

    # Create the customized analyzer
    custom_analyzer = extend_vader_lexicon(tech_lexicon)

    # Try out the custom lexicon
    tech_reviews = [
        "The app is very responsive and intuitive!",
        "It's buggy and crashes frequently.",
        "The interface is smooth but a bit clunky in some areas."
    ]

    for review in tech_reviews:
        scores = custom_analyzer.polarity_scores(review)
        print(f"{review:60} -> compound: {scores['compound']:.4f}")

Strategies for Long Text

VADER works best on short text. For long text, we can split it into sentences and score each one:

    from nltk.tokenize import sent_tokenize
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    # Download the NLTK tokenizer data on first run
    # (newer NLTK releases may ask for 'punkt_tab' instead of 'punkt')
    # import nltk
    # nltk.download('punkt')


    def analyze_long_text(text, analyzer=None):
        """
        Analyze the sentiment of a long text.

        Args:
            text: the long text
            analyzer: a VADER analyzer instance (created if omitted)

        Returns:
            The overall sentiment plus the per-sentence results
        """
        if analyzer is None:
            analyzer = SentimentIntensityAnalyzer()

        # Split into sentences
        sentences = sent_tokenize(text)

        # Score each sentence
        sentence_scores = []
        for sentence in sentences:
            scores = analyzer.polarity_scores(sentence)
            sentence_scores.append({
                'sentence': sentence,
                'scores': scores,
                'sentiment': 'positive' if scores['compound'] >= 0.05
                             else 'negative' if scores['compound'] <= -0.05
                             else 'neutral'
            })

        # Overall sentiment: simple (unweighted) mean of the sentence scores
        total_compound = sum(s['scores']['compound'] for s in sentence_scores)
        avg_compound = total_compound / len(sentence_scores) if sentence_scores else 0

        return {
            'overall_sentiment': 'positive' if avg_compound >= 0.05
                                 else 'negative' if avg_compound <= -0.05
                                 else 'neutral',
            'overall_score': avg_compound,
            'sentence_analysis': sentence_scores,
            'sentence_count': len(sentences)
        }


    # Example: analyzing a product review
    long_review = """
    I've been using this product for three months now. The initial setup was straightforward
    and the interface is quite intuitive. However, I've experienced several crashes during
    important meetings, which was very frustrating. The customer support team was responsive
    and helped me resolve some issues, but the stability problems persist.
    On the positive side, the performance is excellent when it works properly.
    The export features are particularly useful for my workflow.
    """

    result = analyze_long_text(long_review)
    print(f"Overall sentiment: {result['overall_sentiment']} (score: {result['overall_score']:.4f})")
    print(f"Sentence count: {result['sentence_count']}")
    print("\nPer-sentence analysis:")
    for i, analysis in enumerate(result['sentence_analysis'], 1):
        print(f"{i}. {analysis['sentence']}")
        print(f"   sentiment: {analysis['sentiment']}, compound: {analysis['scores']['compound']:.4f}")
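The function above gives every sentence the same weight when it averages. If longer sentences should count for more, one option is a length-weighted mean, sketched below. It assumes per-sentence compound scores are already available; the sample scores are made up, and weighting by word count is just one reasonable choice.

```python
def weighted_compound(sentence_scores):
    """sentence_scores: list of (sentence, compound) pairs.

    Returns the mean compound score weighted by each sentence's word count.
    """
    weights = [max(1, len(sentence.split())) for sentence, _ in sentence_scores]
    total_weight = sum(weights)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(
        weight * compound
        for weight, (_, compound) in zip(weights, sentence_scores)
    )
    return weighted_sum / total_weight


# Made-up scores: a one-word compliment followed by a long complaint
pairs = [
    ("Great.", 0.6),
    ("It crashed again during a very important client meeting today.", -0.5),
]

simple = sum(compound for _, compound in pairs) / len(pairs)
print(f"simple mean:   {simple:+.3f}")
print(f"weighted mean: {weighted_compound(pairs):+.3f}")
```

In this example the simple mean lands slightly positive while the weighted mean is clearly negative, which shows how much the aggregation choice can change the verdict on a long review.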

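Performance Optimization and Best Practices

Batch Processing Optimization

Performance matters when there is a lot of data to process:

    import multiprocessing as mp
    import time

    import numpy as np
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer


    def analyze_batch(text_batch):
        """Score one batch of texts.

        Defined at module level so that multiprocessing can pickle it;
        a function nested inside another function cannot be sent to workers.
        """
        analyzer = SentimentIntensityAnalyzer()
        return [analyzer.polarity_scores(text)['compound'] for text in text_batch]


    def batch_sentiment_analysis(texts, n_workers=None):
        """
        Score texts in parallel using multiple processes.

        Args:
            texts: list of texts
            n_workers: number of processes (defaults to the CPU count)

        Returns:
            List of compound scores, in the same order as the input
        """
        if n_workers is None:
            n_workers = mp.cpu_count()

        # Split into batches
        batch_size = max(1, len(texts) // n_workers)
        batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]

        # Process the batches in parallel; pool.map preserves batch order
        with mp.Pool(processes=n_workers) as pool:
            results = pool.map(analyze_batch, batches)

        # Merge the results
        all_scores = []
        for batch_result in results:
            all_scores.extend(batch_result)
        return all_scores


    def benchmark_performance():
        """Compare single-process and multi-process timings."""
        # Generate test data
        test_texts = ["This is test text number {}".format(i) for i in range(1000)]

        # Single-process run
        start_time = time.time()
        analyzer = SentimentIntensityAnalyzer()
        single_results = [analyzer.polarity_scores(text)['compound'] for text in test_texts]
        single_time = time.time() - start_time

        # Multi-process run
        start_time = time.time()
        multi_results = batch_sentiment_analysis(test_texts)
        multi_time = time.time() - start_time

        print(f"Single-process time: {single_time:.2f}s")
        print(f"Multi-process time: {multi_time:.2f}s")
        print(f"Speedup: {single_time / multi_time:.2f}x")
        print(f"Results match: {np.allclose(single_results, multi_results)}")


    # The guard is required on platforms that spawn worker processes
    if __name__ == '__main__':
        benchmark_performance()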

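Memory Optimization Strategy

For very large datasets, memory management matters:

    import gc
    from itertools import islice

    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer


    def process_large_file(file_path, batch_size=1000):
        """
        Process a large text file without loading it all into memory.

        Args:
            file_path: path to the text file (one text per line)
            batch_size: number of lines per batch

        Returns:
            A generator that yields sentiment results one batch at a time
        """
        analyzer = SentimentIntensityAnalyzer()

        def process_batch(batch_lines):
            """Score one batch of lines."""
            results = []
            for line in batch_lines:
                line = line.strip()
                if line:  # skip blank lines
                    scores = analyzer.polarity_scores(line)
                    results.append({
                        'text': line,
                        'compound': scores['compound'],
                        'sentiment': 'positive' if scores['compound'] >= 0.05
                                     else 'negative' if scores['compound'] <= -0.05
                                     else 'neutral'
                    })
            return results

        with open(file_path, 'r', encoding='utf-8') as f:
            while True:
                # islice reads at most batch_size lines from the open file
                batch = list(islice(f, batch_size))
                if not batch:
                    break
                yield process_batch(batch)
                # Optional: nudge the garbage collector between batches
                gc.collect()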
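The batching idea in the generator above can be tried without a real file or the analyzer. The sketch below uses `io.StringIO` as a stand-in for an open file; `iter_batches` is a hypothetical helper that isolates the `islice` pattern.

```python
import io
from itertools import islice


def iter_batches(lines, batch_size):
    """Yield lists of stripped, non-empty lines, reading batch_size lines at a time."""
    iterator = iter(lines)
    while True:
        batch = [line.strip() for line in islice(iterator, batch_size)]
        if not batch:
            return
        yield [line for line in batch if line]


# StringIO behaves like an open text file for iteration purposes
fake_file = io.StringIO("great app\nkeeps crashing\n\nlove it\nmeh\nso slow\n")

for number, batch in enumerate(iter_batches(fake_file, batch_size=2), 1):
    print(number, batch)
```

Only one batch is held in memory at a time, so the memory footprint depends on `batch_size` rather than on the size of the file.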

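Common Pitfalls and Solutions

Pitfall 1: Relying Too Heavily on the compound Score

Problem: looking only at the compound score and ignoring the other dimensions
Solution: combine the neg, neu, and pos dimensions for a fuller picture

    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer


    def comprehensive_sentiment_analysis(text):
        """Sentiment analysis that takes every dimension into account."""
        analyzer = SentimentIntensityAnalyzer()
        scores = analyzer.polarity_scores(text)

        analysis = {
            'text': text,
            'scores': scores,
            'primary_sentiment': None,
            'confidence': None,
            'mixed_sentiment': False
        }

        # Decide the primary sentiment; the matching proportion serves as
        # a rough confidence proxy, not a calibrated probability
        if scores['compound'] >= 0.05:
            analysis['primary_sentiment'] = 'positive'
            analysis['confidence'] = scores['pos']
        elif scores['compound'] <= -0.05:
            analysis['primary_sentiment'] = 'negative'
            analysis['confidence'] = scores['neg']
        else:
            analysis['primary_sentiment'] = 'neutral'
            analysis['confidence'] = scores['neu']

        # Flag mixed sentiment (noticeable positive and negative content together)
        if scores['pos'] > 0.3 and scores['neg'] > 0.3:
            analysis['mixed_sentiment'] = True

        return analysis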

Pitfall 2: Ignoring Domain-Specific Language

Problem: a general-purpose lexicon cannot handle domain-specific terms
Solution: build a domain-specific extension to the sentiment lexicon

    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer


    class DomainSpecificAnalyzer:
        """Domain-specific sentiment analyzer."""

        def __init__(self, domain_name, custom_lexicon=None):
            self.analyzer = SentimentIntensityAnalyzer()
            self.domain = domain_name

            # Load the domain-specific lexicon
            if custom_lexicon:
                self.analyzer.lexicon.update(custom_lexicon)

            # Domain-specific threshold adjustment
            self.thresholds = self._get_domain_thresholds(domain_name)

        def _get_domain_thresholds(self, domain):
            """Return the sentiment thresholds for a domain.

            The values below are illustrative starting points; tune them
            against labeled data from your own domain.
            """
            thresholds = {
                'product_reviews': {'positive': 0.1, 'negative': -0.1},
                'social_media': {'positive': 0.05, 'negative': -0.05},
                'customer_feedback': {'positive': 0.07, 'negative': -0.07},
                'news_articles': {'positive': 0.03, 'negative': -0.03}
            }
            return thresholds.get(domain, {'positive': 0.05, 'negative': -0.05})

        def analyze(self, text):
            """Run sentiment analysis with the domain's thresholds."""
            scores = self.analyzer.polarity_scores(text)

            if scores['compound'] >= self.thresholds['positive']:
                sentiment = 'positive'
            elif scores['compound'] <= self.thresholds['negative']:
                sentiment = 'negative'
            else:
                sentiment = 'neutral'

            return {
                'domain': self.domain,
                'text': text,
                'scores': scores,
                'sentiment': sentiment,
                'thresholds_used': self.thresholds
            }
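Domain thresholds like the ones above are best chosen from data rather than guessed. Here is a minimal sketch of one way to do that: score a small labeled sample, then pick the symmetric threshold that gives the highest accuracy. The candidate values and the labeled sample are hypothetical, and accuracy is only one possible selection criterion.

```python
def label_for(compound, threshold):
    """Label a compound score with symmetric +/-threshold cutoffs."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def best_threshold(scored, candidates=(0.03, 0.05, 0.07, 0.1, 0.2)):
    """scored: list of (compound, gold_label) pairs from a labeled sample.

    Returns the candidate threshold with the highest accuracy
    (the first one wins on ties).
    """
    if not scored:
        raise ValueError("need at least one labeled example")

    def accuracy(threshold):
        correct = sum(label_for(c, threshold) == gold for c, gold in scored)
        return correct / len(scored)

    return max(candidates, key=accuracy)


# Hypothetical labeled sample: (compound score, human label)
labeled = [
    (0.62, "positive"),
    (0.08, "neutral"),
    (-0.06, "neutral"),
    (-0.45, "negative"),
    (0.12, "positive"),
    (0.04, "neutral"),
]

print("best threshold:", best_threshold(labeled))
```

With a real labeled set, it is worth holding out part of the data to check that the chosen threshold generalizes.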

Ecosystem Integration: Combining VADER With Other Tools

Integrating With Pandas and Scikit-learn

    import pandas as pd
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import TfidfVectorizer
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer


    class SentimentAnalysisPipeline:
        """An end-to-end sentiment analysis pipeline."""

        def __init__(self):
            self.analyzer = SentimentIntensityAnalyzer()
            self.vectorizer = TfidfVectorizer(max_features=1000, stop_words='english')
            self.lda = LatentDirichletAllocation(n_components=5, random_state=42)

        def fit_transform(self, texts):
            """
            The full text-analysis pipeline:
            1. Sentiment analysis
            2. Text vectorization
            3. Topic modeling
            """
            texts = list(texts)

            # Sentiment analysis
            sentiment_results = []
            for text in texts:
                scores = self.analyzer.polarity_scores(text)
                sentiment_results.append({
                    'compound': scores['compound'],
                    'positive': scores['pos'],
                    'negative': scores['neg'],
                    'neutral': scores['neu']
                })

            # Text vectorization
            tfidf_matrix = self.vectorizer.fit_transform(texts)

            # Topic modeling
            topic_distributions = self.lda.fit_transform(tfidf_matrix)

            # Combine the results
            results_df = pd.DataFrame(sentiment_results)
            results_df['text'] = texts
            results_df['dominant_topic'] = topic_distributions.argmax(axis=1)
            return results_df

        def analyze_with_context(self, texts, metadata=None):
            """Run the pipeline and attach metadata columns (one row per text)."""
            results = self.fit_transform(texts)
            if metadata is not None:
                metadata_df = pd.DataFrame(metadata)
                results = pd.concat([results, metadata_df], axis=1)
            return results

Real-Time Sentiment Monitoring System

    import asyncio
    from datetime import datetime

    import aiohttp
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer


    class RealTimeSentimentMonitor:
        """Real-time sentiment monitoring system."""

        def __init__(self, api_endpoints, update_interval=60):
            self.analyzer = SentimentIntensityAnalyzer()
            self.api_endpoints = api_endpoints
            self.update_interval = update_interval
            self.sentiment_history = []

        async def fetch_data(self, session, endpoint):
            """Fetch one endpoint asynchronously."""
            async with session.get(endpoint) as response:
                return await response.json()

        async def monitor_sentiment(self):
            """Poll the endpoints and track sentiment over time."""
            async with aiohttp.ClientSession() as session:
                while True:
                    current_time = datetime.now()

                    # Fetch all data sources concurrently
                    tasks = [self.fetch_data(session, endpoint)
                             for endpoint in self.api_endpoints]
                    results = await asyncio.gather(*tasks, return_exceptions=True)

                    # Collect the texts; failed requests come back as exceptions
                    # and are skipped by the isinstance check
                    all_texts = []
                    for result in results:
                        if isinstance(result, dict) and 'data' in result:
                            texts = [item.get('text', '') for item in result['data']]
                            all_texts.extend(texts)

                    if all_texts:
                        sentiment_scores = [self.analyzer.polarity_scores(text)['compound']
                                            for text in all_texts]
                        avg_sentiment = sum(sentiment_scores) / len(sentiment_scores)

                        # Record history
                        self.sentiment_history.append({
                            'timestamp': current_time,
                            'avg_sentiment': avg_sentiment,
                            'sample_size': len(all_texts),
                            'positive_ratio': sum(1 for s in sentiment_scores if s >= 0.05)
                                              / len(sentiment_scores)
                        })

                        # Keep only the most recent 100 records
                        if len(self.sentiment_history) > 100:
                            self.sentiment_history = self.sentiment_history[-100:]

                        print(f"[{current_time}] mean sentiment: {avg_sentiment:.4f}, "
                              f"samples: {len(all_texts)}, "
                              f"positive ratio: {self.sentiment_history[-1]['positive_ratio']:.2%}")

                    await asyncio.sleep(self.update_interval)

        def get_sentiment_trend(self, window_size=10):
            """Classify the recent sentiment trend."""
            if len(self.sentiment_history) < window_size:
                return None

            recent = self.sentiment_history[-window_size:]
            sentiments = [item['avg_sentiment'] for item in recent]

            # Simple trend analysis: compare the newest and oldest value in the window
            if len(sentiments) >= 2:
                trend = sentiments[-1] - sentiments[0]
                if trend > 0.1:
                    return "strongly_improving"
                elif trend > 0.01:
                    return "improving"
                elif trend < -0.1:
                    return "strongly_declining"
                elif trend < -0.01:
                    return "declining"
                else:
                    return "stable"
            return None
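The history bookkeeping in the monitor can be isolated and tested without any network access. The sketch below swaps the manual list trimming for `collections.deque(maxlen=...)`, which discards the oldest record automatically, and reuses the same trend thresholds. `RollingSentiment` is a hypothetical helper, not part of the monitor class above.

```python
from collections import deque


class RollingSentiment:
    """Keep the last `maxlen` average scores and classify their trend."""

    def __init__(self, maxlen=100):
        # A bounded deque drops the oldest item once it is full
        self.history = deque(maxlen=maxlen)

    def add(self, avg_sentiment):
        self.history.append(avg_sentiment)

    def trend(self, window=10):
        """Compare the newest and oldest value in the last `window` records."""
        if len(self.history) < max(2, window):
            return None
        recent = list(self.history)[-window:]
        delta = recent[-1] - recent[0]
        if delta > 0.1:
            return "strongly_improving"
        if delta > 0.01:
            return "improving"
        if delta < -0.1:
            return "strongly_declining"
        if delta < -0.01:
            return "declining"
        return "stable"


tracker = RollingSentiment(maxlen=3)
for value in (0.0, 0.1, 0.2, 0.5):
    tracker.add(value)

print(list(tracker.history))
print(tracker.trend(window=3))
```

Comparing only the two endpoints of the window is sensitive to a single noisy reading; fitting a slope over the window would be a sturdier alternative.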

Performance Tuning Guide

Memory Usage Optimization

    import gc
    import os

    import psutil
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer


    class MemoryOptimizedAnalyzer:
        """Sentiment analyzer that watches its own memory use."""

        def __init__(self, max_memory_mb=500):
            self.analyzer = SentimentIntensityAnalyzer()
            self.max_memory_mb = max_memory_mb
            self.batch_results = []

        def check_memory_usage(self):
            """Return the resident memory of this process in MB."""
            process = psutil.Process(os.getpid())
            memory_mb = process.memory_info().rss / 1024 / 1024
            return memory_mb

        def analyze_with_memory_limit(self, texts, batch_size=100):
            """
            Batch analysis with a memory check before each batch.

            Args:
                texts: list of texts
                batch_size: number of texts per batch

            Returns:
                List of sentiment results

            Note: `results` still accumulates every row, so for a hard memory
            bound, write each batch out (or yield it) instead of keeping it.
            """
            results = []

            for i in range(0, len(texts), batch_size):
                batch = texts[i:i + batch_size]

                # Check memory use
                current_memory = self.check_memory_usage()
                if current_memory > self.max_memory_mb:
                    print(f"Warning: memory use over the limit ({current_memory:.1f}MB), clearing cache")
                    self.batch_results.clear()
                    gc.collect()

                # Process the current batch
                batch_result = []
                for text in batch:
                    scores = self.analyzer.polarity_scores(text)
                    batch_result.append({
                        'text': text,
                        'compound': scores['compound'],
                        'sentiment': 'positive' if scores['compound'] >= 0.05
                                     else 'negative' if scores['compound'] <= -0.05
                                     else 'neutral'
                    })

                results.extend(batch_result)
                self.batch_results.append(batch_result)

                # Drop older batch results from the recent-batch cache
                if len(self.batch_results) > 5:
                    self.batch_results.pop(0)

            return results

Caching Strategy

    import hashlib
    from functools import lru_cache

    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer


    class CachedSentimentAnalyzer:
        """Sentiment analyzer with a result cache."""

        def __init__(self, max_cache_size=10000):
            self.analyzer = SentimentIntensityAnalyzer()
            self.cache = {}
            self.max_cache_size = max_cache_size
            self.hits = 0
            self.misses = 0

        def _get_text_hash(self, text):
            """Hash the text to get a fixed-size cache key."""
            return hashlib.md5(text.encode('utf-8')).hexdigest()

        # Alternative single-text path; note that lru_cache on a method keys
        # on (self, text) and keeps the instance alive while entries exist
        @lru_cache(maxsize=10000)
        def analyze_cached(self, text):
            """Cached sentiment analysis for one text."""
            return self.analyzer.polarity_scores(text)

        def analyze_batch_cached(self, texts):
            """Batch analysis backed by the manual cache."""
            results = []
            for text in texts:
                text_hash = self._get_text_hash(text)
                if text_hash in self.cache:
                    results.append(self.cache[text_hash])
                    self.hits += 1
                else:
                    scores = self.analyzer.polarity_scores(text)
                    self.cache[text_hash] = scores
                    results.append(scores)
                    self.misses += 1

                    # Eviction: drop the oldest-inserted half of the cache.
                    # This is first-in-first-out, not true LRU, because cache
                    # hits do not refresh an entry's position.
                    if len(self.cache) > self.max_cache_size:
                        keys_to_remove = list(self.cache.keys())[:self.max_cache_size // 2]
                        for key in keys_to_remove:
                            del self.cache[key]

            total = self.hits + self.misses
            cache_hit_rate = self.hits / total if total > 0 else 0
            print(f"Cache hit rate: {cache_hit_rate:.2%}")
            return results
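For comparison, a cache with genuine least-recently-used eviction can be sketched with `collections.OrderedDict`. The class takes the scoring function as a parameter so it runs without the analyzer installed; in practice you would pass something like `analyzer.polarity_scores`. `LRUScoreCache` is a hypothetical name for this illustration.

```python
from collections import OrderedDict


class LRUScoreCache:
    """Cache scoring results, evicting the least recently used text first."""

    def __init__(self, score_fn, max_size=10000):
        self.score_fn = score_fn
        self.max_size = max_size
        self._cache = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, text):
        if text in self._cache:
            # A hit refreshes the entry so it is evicted last
            self._cache.move_to_end(text)
            self.hits += 1
            return self._cache[text]

        self.misses += 1
        result = self.score_fn(text)
        self._cache[text] = result
        if len(self._cache) > self.max_size:
            # popitem(last=False) removes the least recently used entry
            self._cache.popitem(last=False)
        return result


# Stand-in scoring function so the sketch runs on its own
cache = LRUScoreCache(score_fn=len, max_size=2)
for text in ["a", "bb", "a", "ccc", "a", "bb"]:
    cache.get(text)

print(f"hits={cache.hits} misses={cache.misses}")
```

Using the text itself as the key avoids the hashing step; hashing only pays off when the texts are long enough that storing them as keys costs noticeable memory.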

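Looking Ahead: Where VADER Could Go Next

Extending to Multiple Languages

VADER was designed mainly for English, but a translation API can extend it to other languages:

    from deep_translator import GoogleTranslator
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    # Assumption: deep_translator's Google backend expects 'zh-CN' rather than
    # the bare 'zh' code; check the language table of your installed version
    TRANSLATOR_CODES = {'zh': 'zh-CN'}


    class MultilingualSentimentAnalyzer:
        """Multilingual sentiment analyzer."""

        def __init__(self, target_language='en'):
            self.analyzer = SentimentIntensityAnalyzer()
            self.target_language = target_language
            self.supported_languages = ['en', 'es', 'fr', 'de', 'zh', 'ja', 'ko']

        def detect_language(self, text):
            """Rough script-based detection (use a library such as langdetect in practice)."""
            for char in text:
                if '\u3040' <= char <= '\u30ff':  # Hiragana or Katakana
                    return 'ja'
                if '\uac00' <= char <= '\ud7a3':  # Hangul syllables
                    return 'ko'
            if any('\u4e00' <= char <= '\u9fff' for char in text):  # CJK ideographs
                return 'zh'
            return 'en'  # default to English

        def analyze_multilingual(self, text):
            """Analyze text that may not be in the target language."""
            # Detect the language
            source_lang = self.detect_language(text)

            # Translate if needed
            if source_lang != self.target_language:
                try:
                    translated = GoogleTranslator(
                        source=TRANSLATOR_CODES.get(source_lang, source_lang),
                        target=TRANSLATOR_CODES.get(self.target_language, self.target_language)
                    ).translate(text)
                except Exception:
                    translated = text  # fall back to the original text if translation fails
            else:
                translated = text

            # Score the (translated) text
            scores = self.analyzer.polarity_scores(translated)
            return {
                'original_text': text,
                'translated_text': translated,
                'source_language': source_lang,
                'target_language': self.target_language,
                'scores': scores,
                'sentiment': 'positive' if scores['compound'] >= 0.05
                             else 'negative' if scores['compound'] <= -0.05
                             else 'neutral'
            }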

Machine-Learning-Enhanced Version

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer


    class EnhancedSentimentAnalyzer:
        """Enhanced sentiment analyzer that combines VADER with machine learning."""

        def __init__(self):
            self.vader_analyzer = SentimentIntensityAnalyzer()
            self.ml_model = RandomForestClassifier(n_estimators=100, random_state=42)
            self.is_trained = False

        def extract_vader_features(self, text):
            """Build a feature vector from VADER scores and surface statistics."""
            scores = self.vader_analyzer.polarity_scores(text)
            features = [
                scores['compound'],
                scores['pos'],
                scores['neg'],
                scores['neu'],
                len(text.split()),    # text length in words
                text.count('!'),      # number of exclamation marks
                text.count('?'),      # number of question marks
                sum(1 for c in text if c.isupper()) / max(1, len(text)),  # uppercase ratio
            ]
            return np.array(features).reshape(1, -1)

        def train(self, texts, labels):
            """Train the enhanced model on labeled texts."""
            # Extract features
            features = np.array([
                self.extract_vader_features(text).flatten() for text in texts
            ])

            # Train the model on a held-out split
            X_train, X_test, y_train, y_test = train_test_split(
                features, labels, test_size=0.2, random_state=42
            )
            self.ml_model.fit(X_train, y_train)
            self.is_trained = True

            # Evaluate the model
            train_score = self.ml_model.score(X_train, y_train)
            test_score = self.ml_model.score(X_test, y_test)
            print(f"Training accuracy: {train_score:.4f}")
            print(f"Test accuracy: {test_score:.4f}")
            return train_score, test_score

        def predict(self, text):
            """Predict the sentiment label for one text."""
            if not self.is_trained:
                # Fall back to plain VADER
                compound = self.vader_analyzer.polarity_scores(text)['compound']
                return 'positive' if compound >= 0.05 else 'negative' if compound <= -0.05 else 'neutral'

            # Use the enhanced model
            features = self.extract_vader_features(text)
            return self.ml_model.predict(features)[0]