
How Programmers Can Use Python to Scrape Quotes from 《风吹哪页读哪页》 and Build a Personal "Chicken Soup for the Soul" API

Building a Quote API for 《风吹哪页读哪页》 in Python: A Full-Stack Walkthrough from Scraper to Sentiment Analysis

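"The mountains and forests swear no oath to the seasons; they flourish and wither as fate allows." A programmer's sense of romance often hides where code meets poetry. When we type import requests, we are thinking not only about fetching data but about how a machine might grasp the mood of a line like "醉后不知天在水" ("drunk, I no longer know the sky lies in the water"). This article walks through a complete implementation of an API that returns quotes classified by sentiment. It covers anti-scraping countermeasures, sentiment-analysis integration, and other more advanced techniques, so that the tooling genuinely serves the literary content.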

1. Environment Setup and Anti-Scraping Strategy

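Before scraping anything, set up a reliable development environment. Unlike a beginner tutorial, this one aims for a production-grade setup with pinned versions: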

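# Create an isolated environment with pipenv
# (lxml and jieba are included because the parser and sentiment code below import them)
pipenv install requests beautifulsoup4 lxml jieba scrapy flask flask-restx pandas textblob
pipenv shell

# Recommended version pins:
#   requests==2.31.0
#   beautifulsoup4==4.12.0
#   Flask==2.3.2
#   textblob==0.17.1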

To cope with the anti-scraping measures commonly found on literary sites, configure the requests session along these lines:

import random
import time

import requests
from bs4 import BeautifulSoup

session = requests.Session()

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Referer': 'https://book.douban.com/'
}

# Placeholder credentials and address; substitute a real proxy
proxies = {
    'http': 'http://user:pass@proxy_ip:port',
    'https': 'https://user:pass@proxy_ip:port'
}

def delayed_request(url, delay=(1, 3)):
    """Sleep for a random interval, then GET the URL through the shared session."""
    time.sleep(random.uniform(*delay))
    try:
        # A timeout keeps a stalled proxy from hanging the crawler indefinitely
        response = session.get(url, headers=headers, proxies=proxies, timeout=10)
        response.raise_for_status()
        return response
    except requests.RequestException as e:
        print(f"Request failed: {e}")
        return None

Note: in a real deployment, move the proxy configuration into environment variables and manage sensitive values with python-decouple.
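As a minimal sketch of that advice, the proxy settings can be read from the environment instead of being hard-coded. This version uses only the standard library so it runs anywhere; python-decouple's config() could play the same role. The variable names QUOTES_HTTP_PROXY and QUOTES_HTTPS_PROXY and the helper load_proxies are assumptions made for illustration.

```python
import os

def load_proxies(env=None):
    """Build a requests-style proxies dict from environment variables.

    QUOTES_HTTP_PROXY and QUOTES_HTTPS_PROXY are hypothetical variable names.
    """
    env = os.environ if env is None else env
    proxies = {}
    for scheme, var in (("http", "QUOTES_HTTP_PROXY"), ("https", "QUOTES_HTTPS_PROXY")):
        value = env.get(var, "").strip()
        if value:
            proxies[scheme] = value
    # An empty dict means no proxy is passed explicitly for that request
    return proxies
```

The returned dict can be handed to session.get(..., proxies=load_proxies()), which keeps credentials out of the source tree.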

2. Multi-Source Collection and Structuring

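Assuming the target data is spread across several platforms, such as Douban Books (豆瓣读书) and WeChat Reading (微信读书), we need a general-purpose parser: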

class QuoteParser:
    def __init__(self, html):
        self.soup = BeautifulSoup(html, 'lxml')

    def extract_douban(self):
        quotes = []
        for item in self.soup.select('.review-content'):
            text = ''.join(item.stripped_strings)
            if len(text) > 10:  # filter out very short reviews
                quotes.append({'text': text, 'source': 'douban'})
        return quotes

    def extract_wechat(self):
        quotes = []
        for section in self.soup.select('.section-content'):
            # find_previous returns None when no <h2> precedes the section
            heading = section.find_previous('h2')
            title = heading.get_text(strip=True) if heading else ''
            for p in section.select('p'):
                text = p.get_text().strip()
                if text:
                    quotes.append({'text': text, 'chapter': title, 'source': 'wechat'})
        return quotes

The cleaning stage uses regular expressions to handle special characters:

import re

def clean_text(text):
    text = re.sub(r'[\u200b-\u200f\ufeff]', '', text)  # strip zero-width characters
    text = re.sub(r'([。!?…])\1+', r'\1', text)  # collapse repeated punctuation
    text = text.replace('\n', ' ').replace('\r', '')
    return text.strip()
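When quotes come from more than one platform, the same sentence is likely to appear several times with slightly different punctuation or spacing. The following is a minimal de-duplication sketch under that assumption; quote_key and dedupe_quotes are hypothetical helpers, not part of the original pipeline, and they expect the dict shape produced by the parser above.

```python
import hashlib
import re

def quote_key(text):
    """Return a stable fingerprint that ignores whitespace and punctuation."""
    # \W is Unicode-aware for str patterns, so CJK punctuation is removed too
    normalized = re.sub(r'[\W_]+', '', text).lower()
    return hashlib.sha1(normalized.encode('utf-8')).hexdigest()

def dedupe_quotes(quotes):
    """Keep the first occurrence of each quote, preserving input order."""
    seen = set()
    unique = []
    for quote in quotes:
        key = quote_key(quote['text'])
        if key not in seen:
            seen.add(key)
            unique.append(quote)
    return unique
```

Keeping the first occurrence means the order in which sources are concatenated decides which copy (and which source label) survives.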

3. Sentiment Analysis and Automatic Classification

Use TextBlob together with a custom dictionary to analyze the sentiment of Chinese text:

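from textblob import TextBlob
import jieba
import jieba.analyse

# Load the custom dictionary.
# Note: jieba's user dictionary format is "word [frequency] [POS tag]", so the
# signed scores in the example below likely need to live in a separate lexicon file.
jieba.load_userdict('sentiment_dict.txt')

def analyze_sentiment(text):
    # Keyword extraction (not yet used by the rules below)
    tags = jieba.analyse.extract_tags(text, topK=5, withWeight=True)

    # TextBlob scores English text, so the quote is translated first.
    # translate() relies on an online translation service and is deprecated
    # in TextBlob, so it may fail or be missing in newer releases.
    translation = str(TextBlob(text).translate(to='en'))
    blob = TextBlob(translation)

    # Rule-based thresholds on polarity
    polarity = blob.sentiment.polarity
    if polarity > 0.3:
        return 'positive'
    elif polarity < -0.3:
        return 'negative'
    else:
        return 'neutral'

# Example sentiment dictionary format
"""
璀璨 10
绝望 -8
洒脱 5
忧郁 -3
"""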
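Because the translation step depends on an external service, a purely local fallback can be useful. The sketch below is not the article's method: it scores text directly against a word-score lexicon in the format of the example dictionary, using plain substring matching rather than jieba segmentation. The function names, the threshold of 3, and the sample entries in SAMPLE_LEXICON are assumptions.

```python
def load_lexicon(lines):
    """Parse 'word score' lines into a dict of word -> integer score."""
    lexicon = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            word, score = parts
            lexicon[word] = int(score)
    return lexicon

def lexicon_sentiment(text, lexicon, threshold=3):
    """Sum the scores of lexicon words found in the text and bucket the total."""
    total = sum(score * text.count(word) for word, score in lexicon.items())
    if total >= threshold:
        return 'positive'
    if total <= -threshold:
        return 'negative'
    return 'neutral'

# Same entries as the example dictionary in the article
SAMPLE_LEXICON = load_lexicon(["璀璨 10", "绝望 -8", "洒脱 5", "忧郁 -3"])
```

Substring matching is crude (it ignores negation and word boundaries), so this is best treated as a baseline to compare against the TextBlob result, not a replacement for it.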

Store the sentiment label alongside each quote so that queries by sentiment are fast:

import sqlite3

def build_sentiment_index(quotes):
    conn = sqlite3.connect('quotes.db')
    c = conn.cursor()
    c.execute('''CREATE TABLE IF NOT EXISTS quotes
                 (id INTEGER PRIMARY KEY AUTOINCREMENT,
                  text TEXT,
                  chapter TEXT,
                  sentiment TEXT,
                  length INTEGER)''')
    for q in quotes:
        sentiment = analyze_sentiment(q['text'])
        c.execute("INSERT INTO quotes (text, chapter, sentiment, length) VALUES (?,?,?,?)",
                  (q['text'], q.get('chapter', ''), sentiment, len(q['text'])))
    conn.commit()
    conn.close()

4. High-Performance API Design

Build a paginated API with Flask-RESTx; caching is handled by the Redis layer that follows:

import sqlite3

from flask import Flask
from flask_restx import Api, Resource, fields

app = Flask(__name__)
api = Api(app, version='1.0', title='Quotes API')
ns = api.namespace('quotes', description='Quote operations')

quote_model = api.model('Quote', {
    'id': fields.Integer,
    'text': fields.String,
    'sentiment': fields.String,
    'length': fields.Integer
})

@ns.route('/random')
class RandomQuote(Resource):
    # No lru_cache here: memoizing this method would defeat the random selection
    @ns.marshal_with(quote_model)
    def get(self):
        conn = sqlite3.connect('quotes.db')
        conn.row_factory = sqlite3.Row
        c = conn.cursor()
        c.execute("SELECT * FROM quotes ORDER BY RANDOM() LIMIT 1")
        result = c.fetchone()
        conn.close()
        return dict(result) if result else None

@ns.route('/search')
class SearchQuotes(Resource):
    @ns.marshal_list_with(quote_model)
    def get(self):
        parser = api.parser()
        parser.add_argument('q', type=str, default='')  # avoid matching the literal "None"
        parser.add_argument('page', type=int, default=1)
        args = parser.parse_args()
        page = max(args['page'], 1)  # guard against a negative OFFSET

        conn = sqlite3.connect('quotes.db')
        conn.row_factory = sqlite3.Row
        c = conn.cursor()
        query = "SELECT * FROM quotes WHERE text LIKE ? LIMIT 20 OFFSET ?"
        c.execute(query, (f"%{args['q']}%", (page - 1) * 20))
        results = [dict(row) for row in c.fetchall()]
        conn.close()
        return results

Add a Redis cache layer to improve performance:

# The Redis backend of Flask-Caching needs the redis package installed
from flask_caching import Cache

cache = Cache(config={
    'CACHE_TYPE': 'redis',
    'CACHE_REDIS_URL': 'redis://localhost:6379/0',
    'CACHE_DEFAULT_TIMEOUT': 300
})
cache.init_app(app)

@ns.route('/daily')
class DailyQuote(Resource):
    @cache.cached(timeout=86400, key_prefix='daily_quote')
    @ns.marshal_with(quote_model)
    def get(self):
        # Pick the quote of the day with a date-hash scheme (left unimplemented here)
        pass
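The date-hash selection is left as a stub in the endpoint. One possible way to fill it in, sketched under the assumption that the quotes are available as a list (or that the index is used as an OFFSET into the table), is to hash the ISO date and reduce it modulo the number of quotes. The helpers daily_index and pick_daily_quote are hypothetical names.

```python
import hashlib
from datetime import date

def daily_index(day, total):
    """Map a calendar date to a stable index in range(total)."""
    if total <= 0:
        raise ValueError("total must be positive")
    digest = hashlib.sha256(day.isoformat().encode('utf-8')).hexdigest()
    return int(digest, 16) % total

def pick_daily_quote(quotes, day=None):
    """Return the quote selected for the given day (today by default)."""
    day = day or date.today()
    return quotes[daily_index(day, len(quotes))]
```

Because the index depends only on the date, every worker process returns the same quote without coordinating. Note that a fixed 24-hour cache timeout starts when the entry is first stored, so a cached response can straddle midnight; including the date in the cache key is one way to avoid that.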

5. Front-End Integration and Visualization

Visualize the sentiment distribution with ECharts:

// Integrate inside a Vue/React component
fetch('/api/quotes/stats')
  .then(res => res.json())
  .then(data => {
    const chart = echarts.init(document.getElementById('chart'));
    chart.setOption({
      series: [{
        type: 'pie',
        data: [
          {value: data.positive, name: 'Positive'},
          {value: data.neutral, name: 'Neutral'},
          {value: data.negative, name: 'Negative'}
        ]
      }]
    });
  });
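The front-end snippet expects a stats endpoint that returns counts keyed by positive, neutral, and negative, but the article does not show the server side. A minimal sketch of the aggregation such an endpoint could wrap, assuming the quotes table defined earlier, might look like this. The function name sentiment_stats is hypothetical, and the demo runs against an in-memory database instead of quotes.db.

```python
import sqlite3

def sentiment_stats(conn):
    """Count quotes per sentiment label, reporting zero for missing labels."""
    stats = {'positive': 0, 'neutral': 0, 'negative': 0}
    rows = conn.execute(
        "SELECT sentiment, COUNT(*) FROM quotes GROUP BY sentiment"
    )
    for sentiment, count in rows:
        if sentiment in stats:
            stats[sentiment] = count
    return stats

# Demo against an in-memory database with the same columns as quotes.db
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE quotes (id INTEGER PRIMARY KEY AUTOINCREMENT, "
             "text TEXT, chapter TEXT, sentiment TEXT, length INTEGER)")
conn.executemany(
    "INSERT INTO quotes (text, chapter, sentiment, length) VALUES (?, ?, ?, ?)",
    [('a', '', 'positive', 1), ('b', '', 'positive', 1), ('c', '', 'negative', 1)],
)
print(sentiment_stats(conn))
```

Pre-filling the three keys with zero keeps the pie chart from receiving undefined values when a label has no rows yet.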

Generate a word cloud to enrich the display:

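import sqlite3

import jieba
import matplotlib.pyplot as plt
from wordcloud import WordCloud

def generate_wordcloud():
    conn = sqlite3.connect('quotes.db')
    c = conn.cursor()
    c.execute("SELECT text FROM quotes")
    # Segment with jieba first; unsegmented Chinese would be treated as whole phrases
    text = ' '.join(word for row in c.fetchall() for word in jieba.cut(row[0]))
    conn.close()

    wc = WordCloud(
        font_path='msyh.ttc',  # a font with CJK glyphs (Microsoft YaHei here)
        background_color='white',
        max_words=200
    ).generate(text)

    plt.imshow(wc)
    plt.axis("off")
    plt.savefig('static/wordcloud.png')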

6. Deployment Tuning and Monitoring

Deploy with Gunicorn behind Nginx:

# gunicorn.conf.py
workers = 4
worker_class = 'gevent'  # requires the gevent package
bind = '0.0.0.0:8000'
timeout = 120

Add a Prometheus metrics endpoint:

from prometheus_flask_exporter import PrometheusMetrics

metrics = PrometheusMetrics(app)
metrics.info('app_info', 'Quotes API', version='1.0.0')

# Custom metric; the returned object is meant to be applied as a decorator to a view
quotes_counter = metrics.counter(
    'quotes_requests_total', 'Total number of quotes requests',
    labels={'status': lambda r: r.status_code}
)

Set up an automated test pipeline:

# .github/workflows/test.yml
name: CI
on: [push]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.9'
      - run: pip install -r requirements.txt
      - run: pytest tests/

7. Advanced Feature: Personalized Recommendations

Build a recommendation model from the user's history:

import jieba
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

class QuoteRecommender:
    def __init__(self):
        # The default token pattern cannot split unsegmented Chinese,
        # so jieba is used as the tokenizer
        self.vectorizer = TfidfVectorizer(tokenizer=jieba.lcut)

    def train(self, quotes):
        self.corpus = [q['text'] for q in quotes]
        self.tfidf_matrix = self.vectorizer.fit_transform(self.corpus)

    def recommend(self, liked_quotes, n=5):
        input_vec = self.vectorizer.transform(liked_quotes)
        sim_scores = cosine_similarity(input_vec, self.tfidf_matrix)
        # Average similarity to all liked quotes, highest first
        top_indices = sim_scores.mean(axis=0).argsort()[-n:][::-1]
        return [self.corpus[i] for i in top_indices]

Track user behavior:

from datetime import datetime

from flask import request

@app.after_request
def track_usage(response):
    # Assumes the API is served under /api/ (for example via a URL prefix or Nginx)
    if request.path.startswith('/api/'):
        ip = request.remote_addr
        log_entry = f"{datetime.now()},{ip},{request.path},{response.status_code}\n"
        with open('usage.log', 'a') as f:
            f.write(log_entry)
    return response
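The log written above is a four-column, comma-separated file (timestamp, IP, path, status), so it can be summarized with the standard csv module. The sketch below tallies successful requests per path; count_requests_by_path and the sample rows are illustrative assumptions, not output from a real deployment.

```python
import csv
from collections import Counter
from io import StringIO

def count_requests_by_path(log_file):
    """Tally successful (2xx) requests per path from a usage-log stream."""
    counts = Counter()
    for row in csv.reader(log_file):
        if len(row) != 4:
            continue  # skip malformed lines
        _timestamp, _ip, path, status = row
        if status.startswith('2'):
            counts[path] += 1
    return counts

# Made-up sample rows in the same format as usage.log
SAMPLE_LOG = (
    "2024-01-01 10:00:00.000000,127.0.0.1,/api/quotes/random,200\n"
    "2024-01-01 10:00:05.000000,127.0.0.1,/api/quotes/random,200\n"
    "2024-01-01 10:00:09.000000,127.0.0.1,/api/quotes/search,500\n"
)
print(count_requests_by_path(StringIO(SAMPLE_LOG)).most_common(1))
```

Per-path counts like these are only a starting point; recommendations would also need a user identifier, which the current log format does not record.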

8. Security Hardening and Performance Tuning

Protect API endpoints with JWT authentication:

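import os

from flask_jwt_extended import JWTManager, jwt_required

# Read the signing key from the environment; 'super-secret' is a development-only fallback
app.config['JWT_SECRET_KEY'] = os.environ.get('JWT_SECRET_KEY', 'super-secret')
jwt = JWTManager(app)

@ns.route('/protected')
class ProtectedQuote(Resource):
    @jwt_required()
    def get(self):
        return {"msg": "Sensitive operations require authentication"}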

Database query optimization:

-- Create a composite index
CREATE INDEX idx_sentiment_length ON quotes(sentiment, length);

-- Use a CTE for the more complex query
WITH filtered AS (
    SELECT * FROM quotes
    WHERE sentiment = 'positive' AND length BETWEEN 50 AND 100
)
SELECT * FROM filtered ORDER BY RANDOM() LIMIT 1;

Late in the project, we found that enabling SQLite's WAL mode improves concurrent performance:

conn.execute('PRAGMA journal_mode=WAL')
conn.execute('PRAGMA synchronous=NORMAL')
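Since the API opens a fresh connection in each handler, it can help to wrap the pragmas in one helper and check the mode SQLite actually reports. The sketch below does that against a throwaway database file; open_wal_connection and the file name quotes_demo.db are assumptions, and WAL applies only to file-backed databases, not to in-memory ones.

```python
import os
import sqlite3
import tempfile

def open_wal_connection(path):
    """Open a SQLite connection and switch the database file to WAL journaling."""
    conn = sqlite3.connect(path)
    # The PRAGMA returns the journal mode that is actually in effect
    mode = conn.execute('PRAGMA journal_mode=WAL').fetchone()[0]
    conn.execute('PRAGMA synchronous=NORMAL')
    return conn, mode

# Demo against a temporary database file
tmp_dir = tempfile.mkdtemp()
conn, mode = open_wal_connection(os.path.join(tmp_dir, 'quotes_demo.db'))
print(mode)
conn.close()
```

The journal mode is stored in the database file, so it persists across connections, whereas the synchronous setting applies per connection and has to be set each time.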