
Hands-On Deep Dive: A Python Crawler for All Works by a Given Author on Gushiwen, from Encoding Pitfalls to Regex Cleaning

news 2026/7/30 8:39:35

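1. Introduction: Why Use Gushiwen as a Crawler Practice Project?

On the Chinese-language web, Gushiwen (gushiwen.cn) is a high-quality classical literature site that hosts a large body of poetry, ci, fu, and classical prose from the pre-Qin era to modern times. For people learning to write crawlers, the site has several typical traits: GBK encoding, paginated loading, regular URL patterns, and mild anti-scraping measures, which make it a good target for an intermediate-level project.

This article walks through a complete crawler project: scraping every work by a given author (such as Li Bai, Du Fu, or Su Shi) from Gushiwen, including the title, body text, annotations, translation, and appreciation. We use a mainstream stack (Python 3.11+, Requests, BeautifulSoup, regular expressions) and look closely at key techniques: handling Chinese encodings, cleaning HTML entities with regex, retrying on errors, and persisting the data.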

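2. Requirements Analysis and Technology Choices

2.1 Functional requirements

Input: an author name (for example, "李白", Li Bai)
Output: a JSON or CSV file of all of that author's works, where each record contains:
- Title (title)
- Dynasty/author (dynasty_author)
- Body text (content)
- Annotations (annotation; optional, may be missing)
- Translation (translation; optional)
- Appreciation (appreciation; optional)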
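The record layout listed above can be sketched as a small dataclass. This is only an illustration: the class name `PoemRecord` is hypothetical, and the spider code later in the article stores `author` and `dynasty` as two separate fields rather than one `dynasty_author` field.

```python
from dataclasses import dataclass, asdict


@dataclass
class PoemRecord:
    """One scraped work; the three optional sections default to empty strings."""
    title: str
    dynasty_author: str
    content: str
    annotation: str = ""      # may be missing on the site
    translation: str = ""     # optional
    appreciation: str = ""    # optional


record = PoemRecord(
    title="静夜思",
    dynasty_author="〔唐〕李白",
    content="床前明月光，\n疑是地上霜。",
)
row = asdict(record)  # plain dict, ready for json.dump or csv.DictWriter
print(row["title"], repr(row["annotation"]))
```

Giving the optional sections a default keeps every record the same shape, so the JSON and CSV writers do not need special cases for missing fields.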

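2.2 Technology stack

Technology	Purpose	Version/Notes
Python	Main language	3.11+
Requests	HTTP requests	2.31.0+
BeautifulSoup4	HTML parsing	4.12.0+
re	Regex cleaning	Standard library
json	Data storage	Standard library
time	Delay between requests	Standard library
random	Random User-Agent selection	Standard library
fake_useragent	Random User-Agent generation	Optional, not required
logging	Logging	Standard library

2.3 Crawling approach

The URL pattern for Gushiwen's work listings is:

text

https://www.gushiwen.cn/GuShiWenByAuthor.aspx?author={author_code}&page={page_number}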

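A more reliable approach, however, is to start from the author's home page: search for the author's name, open that author's dedicated page, then parse the pagination. We use a two-step strategy:

Step 1, get the author ID: search for the author to obtain an internal ID (something like a4b7c for Li Bai, although in practice the site's author pages use pinyin or a number directly).

Step 2, walk the pages: analysis shows that the author's work-list URL is:

text

https://www.gushiwen.cn/Default.aspx?page=1&value=%e6%9d%8e%e7%99%bd&type=author

In practice, a more stable option is to request this URL directly:

text

https://www.gushiwen.cn/AuthorPieceList.aspx?author=李白&page=1

In the author's testing, the site's paginated endpoint for an author's works is:

text

https://www.gushiwen.cn/AuthorPieceList.aspx?author={author_name}&page={page}

It returns an HTML fragment containing the list of works. Each work has a detail-page link of the form:

text

https://www.gushiwen.cn/ShiWenView.aspx?id=xxxxx

The overall flow is therefore:

Enter author name → request the paginated list in a loop → parse the work IDs and titles on each page → open each detail page and scrape the full information → save the data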
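A minimal sketch of how the list-page URL can be assembled. The endpoint is the one quoted in the article and has not been independently verified here; `build_list_url` is a hypothetical helper name. The point of the example is that a Chinese author name must be percent-encoded in the query string, which `urllib.parse.urlencode` does using UTF-8 by default.

```python
from urllib.parse import urlencode

# Endpoint as given in the article (assumption: it still exists in this form).
LIST_URL = "https://www.gushiwen.cn/AuthorPieceList.aspx"


def build_list_url(author_name, page):
    """Return the list-page URL with the author name percent-encoded as UTF-8."""
    query = urlencode({"author": author_name, "page": page})
    return f"{LIST_URL}?{query}"


url = build_list_url("李白", 1)
print(url)
```

When using Requests, passing `params={'author': ..., 'page': ...}` to `session.get` performs the same encoding, so building the string by hand is only needed for logging or debugging.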

3. Environment Setup and Basic Configuration

3.1 Create a virtual environment (recommended)

bash

python -m venv gushici_env
source gushici_env/bin/activate    # Linux/Mac
# or, on Windows:
gushici_env\Scripts\activate

3.2 Install dependencies

bash

pip install requests beautifulsoup4 fake_useragent lxml

lxml serves as BeautifulSoup's parser backend; it is faster than the default html.parser and generally more tolerant of malformed markup.
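If lxml might not be installed everywhere the crawler runs, the parser name can be chosen at runtime. The helper below is a hypothetical sketch using only the standard library; it returns a string suitable as the second argument to `BeautifulSoup(...)`.

```python
import importlib.util


def pick_parser():
    """Return 'lxml' when the package is importable, else the built-in 'html.parser'."""
    if importlib.util.find_spec("lxml") is not None:
        return "lxml"
    return "html.parser"


parser_name = pick_parser()
print(parser_name)
```

Usage would look like `BeautifulSoup(page_html, pick_parser())`. The two parsers can build slightly different trees from broken markup, so selectors should be tested against whichever one is actually in use.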

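3.3 Project structure

text

gushiwen_spider/
│
├── spider.py        # main crawler
├── config.py        # settings (request headers, timeout, delay, etc.)
├── utils.py         # helper functions (cleaning, encoding conversion)
├── data/            # data output directory
│   └── libai.json
└── logs/            # log directory
    └── spider.log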

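4. A Close Look at Chinese Encoding Problems

4.1 Gushiwen's encoding pitfall

This article treats Gushiwen as a GBK (also called CP936) site rather than a UTF-8 one, which is where many beginners trip up. Calling requests.get().text directly lets requests infer the encoding from the HTTP headers; when the server's Content-Type carries no charset, requests falls back to ISO-8859-1 for text responses and the Chinese text comes out garbled. Because a site's encoding can change over time, check the response's Content-Type header or its meta charset tag before hard-coding a value.

Wrong:

python

import requests

url = "https://www.gushiwen.cn/"
resp = requests.get(url, timeout=10)
print(resp.text)  # may print mojibake such as "���"

Right:

python

import requests

url = "https://www.gushiwen.cn/"
resp = requests.get(url, timeout=10)
resp.encoding = 'gbk'  # force the encoding
# or, more defensively: resp.encoding = resp.apparent_encoding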

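4.2 apparent_encoding vs. setting the encoding manually

apparent_encoding detects the encoding from the response body using chardet or charset_normalizer (whichever your Requests installation provides), which adds overhead on every response. When the site's encoding is known for certain, setting it manually (gbk here) is the best option.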
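The effect of picking the wrong codec can be reproduced offline, without any network request. The sketch below encodes a line of verse as GBK, decodes it incorrectly, and then shows a cheap alternative to statistical detection: try strict UTF-8 first and fall back to GBK. `decode_body` is a hypothetical helper, not part of Requests, and the candidate order is an assumption that suits sites serving either UTF-8 or GBK.

```python
def decode_body(body, candidates=("utf-8", "gbk")):
    """Try each codec strictly, in order; return (text, codec_used)."""
    for enc in candidates:
        try:
            return body.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Nothing decoded cleanly: keep going with replacement characters.
    return body.decode(candidates[-1], errors="replace"), candidates[-1]


raw = "床前明月光".encode("gbk")                  # bytes as a GBK server would send them

wrong = raw.decode("utf-8", errors="replace")     # contains U+FFFD replacement characters
mojibake = raw.decode("iso-8859-1")               # never fails, but yields 10 junk characters
text, used = decode_body(raw)                     # strict UTF-8 fails, GBK succeeds

print(repr(wrong), repr(mojibake), text, used)
```

With Requests, the same helper could be applied to `resp.content` (the raw bytes) instead of reading `resp.text`.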

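4.3 Unicode handling inside Python

Once the GBK byte stream has been read, requests decodes it into a Unicode string (the str type in Python 3). All later regex and BS4 operations work on Unicode, so encoding is no longer a concern until output: when writing to a file, pass encoding='utf-8' for portability.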

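5. Regex Cleaning Techniques in Detail

5.1 Why is regex cleaning needed?

Text scraped from a web page usually contains:

- HTML tags (such as <p>, <br/>, <div>)
- Whitespace characters such as spaces, &nbsp;, and \xa0
- Entity characters (such as &ldquo;, &rdquo;, &amp;)
- JavaScript fragments
- Ads or recommended content

We use regular expressions and string methods to strip these out and keep only the clean text of the work.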

5.2 Common cleaning patterns

5.2.1 Removing HTML tags

python

import re


def remove_html_tags(text):
    """Remove HTML/XML tags."""
    return re.sub(r'<[^>]+>', '', text)
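One caveat with a bare tag-stripping regex: it removes the `<script>` tags themselves but leaves the JavaScript between them in the output. A possible sketch that drops whole script and style blocks before stripping the remaining tags (`strip_markup` is a hypothetical name, and the pattern assumes blocks are not nested):

```python
import re

# Whole <script>...</script> and <style>...</style> blocks, case-insensitive, across lines.
SCRIPT_STYLE_RE = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL
)
TAG_RE = re.compile(r"<[^>]+>")


def strip_markup(fragment):
    """Remove script/style blocks with their contents, then any remaining tags."""
    without_blocks = SCRIPT_STYLE_RE.sub("", fragment)
    return TAG_RE.sub("", without_blocks)


sample = '<p>床前明月光</p><script type="text/javascript">var a = 1;</script>'
naive = TAG_RE.sub("", sample)      # the script body survives
cleaned = strip_markup(sample)      # only the verse remains
print(repr(naive), repr(cleaned))
```

For heavily nested or malformed pages, removing these elements through the HTML parser is more robust than regex; the regex version is adequate for small, well-formed fragments.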

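5.2.2 Handling HTML entities

Entities commonly seen on Gushiwen:

&nbsp; → space (the non-breaking space, \xa0)
&ldquo; → “
&rdquo; → ”
&amp; → &
&lt; → <
&gt; → >

The html standard-library module handles them:

python

import html


def unescape_html_entities(text):
    """Decode HTML entities."""
    return html.unescape(text)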
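A short check of what `html.unescape` actually returns for these entities. Note that `&nbsp;` decodes to the non-breaking space `\xa0`, not an ordinary space, so a separate replacement is still needed afterwards. The sample string is made up for illustration.

```python
import html

escaped = "&ldquo;床前明月光&rdquo;&nbsp;&amp;&nbsp;&lt;李白&gt;"

decoded = html.unescape(escaped)           # entities become real characters
normalized = decoded.replace("\xa0", " ")  # non-breaking spaces become plain spaces

print(repr(decoded))
print(normalized)
```

Order matters when combining this with tag stripping: decode entities after removing tags, otherwise text such as `&lt;李白&gt;` turns into something that looks like a tag and gets deleted by the tag regex.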

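5.2.3 Removing extra blank lines and indentation

python

import re


def clean_whitespace(text):
    """Collapse every run of whitespace (newlines included) into one space and trim both ends."""
    # Note: this also flattens line breaks, so it suits prose, not verse.
    # For poems, convert <br/> to \n first and then compress runs of \n
    # (see clean_poem_content).
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

For poem bodies we want to keep the original line breaks, so a finer-grained cleaning function looks like this:

python

import html
import re


def clean_poem_content(raw_html):
    """Clean the body text of a poem while keeping its line breaks."""
    # 1. Turn <br> tags into newlines
    text = re.sub(r'<br\s*/?>', '\n', raw_html, flags=re.IGNORECASE)
    # 2. Remove all other HTML tags
    text = re.sub(r'<[^>]+>', '', text)
    # 3. Decode HTML entities
    text = html.unescape(text)
    # 4. Replace non-breaking spaces with plain spaces
    #    (the second replace covers double-escaped input)
    text = text.replace('\xa0', ' ').replace('&nbsp;', ' ')
    # 5. Strip leading/trailing spaces on every line, so whitespace-only
    #    lines become empty before newlines are compressed
    text = '\n'.join(line.strip() for line in text.split('\n'))
    # 6. Compress runs of newlines (keep at most two, i.e. one blank line between poems)
    text = re.sub(r'\n{3,}', '\n\n', text)
    return text.strip()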

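5.3 Advanced regex: extracting specific patterns

For example, to extract the "annotations" from a detail page's HTML: annotations are usually wrapped in <div class="contyishang">, but may contain child tags. Combining regex with BeautifulSoup handles this: BeautifulSoup locates the container, and regex cleans the text inside it.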
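The idea of "locate the container with a parser, then clean the text with regex" can be sketched without third-party packages. The example below swaps in the standard library's `html.parser` for BeautifulSoup so that it runs anywhere; the class name `contyishang` comes from the article, while `DivTextExtractor`, `extract_div_text`, and the sample markup are hypothetical.

```python
import re
from html.parser import HTMLParser


class DivTextExtractor(HTMLParser):
    """Collect the text inside the first <div> that carries a given class."""

    def __init__(self, target_class):
        super().__init__(convert_charrefs=True)
        self.target_class = target_class
        self.depth = 0        # nesting depth of <div>s inside the target
        self.done = False     # set once the target div has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            if tag == "div":
                self.depth += 1          # nested div: track it so we close correctly
            elif tag == "br":
                self.parts.append("\n")  # keep line breaks
        elif tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            if self.target_class in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == "div":
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def extract_div_text(page_html, target_class="contyishang"):
    """Return the cleaned text of the first div with the given class."""
    parser = DivTextExtractor(target_class)
    parser.feed(page_html)
    parser.close()
    text = "".join(parser.parts).replace("\xa0", " ")
    text = re.sub(r"[ \t]+", " ", text)                 # regex step: squeeze spaces
    lines = [line.strip() for line in text.split("\n")]
    return "\n".join(line for line in lines if line)    # drop empty lines


sample = (
    '<div class="sons"><div class="contyishang"><div>'
    "<p><strong>注释</strong><br/>床：井栏。<br/>疑：好像。</p>"
    '</div></div><div class="other">x</div></div>'
)
annotation = extract_div_text(sample)
print(annotation)
```

The depth counter is what a plain regex cannot do reliably: it finds the closing tag that actually matches the container even when other divs are nested inside it.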

6. Complete Crawler Implementation

6.1 Configuration file `config.py`

python

# config.py
import random

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36',
]

HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Connection': 'keep-alive',
}

REQUEST_TIMEOUT = 10  # seconds
RETRY_TIMES = 3
REQUEST_DELAY = 1     # seconds


def get_random_headers():
    """Return the base headers plus a randomly chosen User-Agent."""
    headers = HEADERS.copy()
    headers['User-Agent'] = random.choice(USER_AGENTS)
    return headers

6.2 Helper functions `utils.py`

python

# utils.py
import re
import html
import time
import logging
from functools import wraps


def retry(max_attempts=3, delay=1):
    """Retry decorator: re-run the function on any exception, re-raising after the last attempt."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    logging.warning(f"Attempt {attempt+1} failed: {e}")
                    if attempt == max_attempts - 1:
                        raise
                    time.sleep(delay)
        return wrapper
    return decorator


def clean_html_entities(text):
    """Decode all HTML entities."""
    if not text:
        return ""
    return html.unescape(text)


def clean_whitespace(text):
    """Clean whitespace while keeping the basic structure."""
    if not text:
        return ""
    # Collapse runs of spaces and tabs (not newlines) into a single space
    text = re.sub(r'[ \t]+', ' ', text)
    # Reduce blank-line runs to exactly two newlines as a paragraph separator
    text = re.sub(r'\n\s*\n', '\n\n', text)
    return text.strip()


def extract_author_dynasty(text):
    """Extract (dynasty, author) from text such as "〔唐〕李白" or "〔宋〕苏轼"."""
    pattern = r'〔(.*?)〕(.*)'
    match = re.search(pattern, text)
    if match:
        return match.group(1).strip(), match.group(2).strip()
    return "", text


def normalize_poem_content(raw_html):
    """Full cleaning pipeline for the body text of a work."""
    if not raw_html:
        return ""
    # Replace <br> with newlines
    text = re.sub(r'<br\s*/?>', '\n', raw_html)
    # Remove all tags
    text = re.sub(r'<[^>]+>', '', text)
    # Decode entities
    text = html.unescape(text)
    # Handle special spaces
    text = text.replace('\xa0', ' ').replace('&nbsp;', ' ')
    # Compress runs of blank lines
    text = re.sub(r'\n{3,}', '\n\n', text)
    # Strip leading/trailing spaces on every line
    lines = [line.strip() for line in text.splitlines()]
    return '\n'.join(lines).strip()
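A fixed delay between retries is the simplest policy. A common variant is exponential backoff, where the wait grows after each failure so that a struggling server gets more breathing room. The sketch below is an alternative to the fixed-delay decorator, not the article's own code; the name `retry_with_backoff` and its parameters are hypothetical.

```python
import time
from functools import wraps


def retry_with_backoff(max_attempts=3, base_delay=1.0, factor=2.0,
                       exceptions=(Exception,)):
    """Retry on the given exceptions, multiplying the wait by `factor` each time."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            delay = base_delay
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_attempts:
                        raise              # out of attempts: surface the error
                    time.sleep(delay)
                    delay *= factor        # 1s, 2s, 4s, ... with the defaults
        return wrapper
    return decorator


calls = {"n": 0}


@retry_with_backoff(max_attempts=3, base_delay=0)   # zero delay keeps the demo instant
def flaky():
    """Fail twice, then succeed, to exercise the retry path."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = flaky()
print(result, calls["n"])
```

Restricting `exceptions` to network errors (for Requests, `requests.RequestException`) avoids retrying on programming mistakes such as a `KeyError` in the parsing code.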

6.3 Main crawler `spider.py`

python

# spider.py
import csv
import json
import logging
import os
import re
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

from config import get_random_headers, REQUEST_TIMEOUT, RETRY_TIMES, REQUEST_DELAY
from utils import retry, normalize_poem_content, extract_author_dynasty

# Configure logging (logs/ must exist before FileHandler opens the file)
os.makedirs("logs", exist_ok=True)
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler("logs/spider.log", encoding='utf-8'),
        logging.StreamHandler()
    ]
)
logger = logging.getLogger(__name__)

# Section headings on the detail page, mapped to output fields
SECTION_PATTERNS = {
    'annotation': re.compile(r'注释|注解'),
    'translation': re.compile(r'译文|翻译'),
    'appreciation': re.compile(r'赏析|鉴赏'),
}


class GushiwenSpider:
    """Main crawler class for Gushiwen."""

    BASE_URL = "https://www.gushiwen.cn"
    LIST_URL = "https://www.gushiwen.cn/AuthorPieceList.aspx"

    def __init__(self, author_name):
        """
        Initialise the crawler.
        :param author_name: author name, e.g. "李白"
        """
        self.author_name = author_name
        self.session = requests.Session()
        self.session.headers.update(get_random_headers())
        self.poems = []  # all scraped records

    @retry(max_attempts=RETRY_TIMES, delay=2)
    def fetch_html(self, url, params=None):
        """Fetch a page and decode it as GBK."""
        logger.info(f"Fetching: {url}, params={params}")
        resp = self.session.get(url, params=params, timeout=REQUEST_TIMEOUT)
        resp.encoding = 'gbk'  # key step: the article treats Gushiwen as GBK-encoded
        if resp.status_code != 200:
            logger.error(f"HTTP {resp.status_code} for {url}")
            raise Exception(f"HTTP error {resp.status_code}")
        return resp.text

    def parse_list_page(self, html):
        """
        Parse a list page and extract work IDs and titles.
        Returns: list of dict [{'id': '12345', 'title': '静夜思', 'url': ...}, ...]
        """
        soup = BeautifulSoup(html, 'lxml')
        items = []
        # Each work is expected in a container whose class includes "piece"
        # (typically under div.main3); check this against the live page.
        for div in soup.find_all('div', class_='piece'):
            # Find the link to the work's detail page
            link_tag = div.find('a', href=re.compile(r'/ShiWenView\.aspx\?id='))
            if not link_tag:
                continue
            href = link_tag.get('href')
            items.append({
                'id': href.split('id=')[-1],
                'title': link_tag.get_text(strip=True),
                'url': urljoin(self.BASE_URL, href)
            })
        logger.info(f"Found {len(items)} poems on this page")
        return items

    def parse_detail_page(self, html, poem_id, title):
        """
        Parse a detail page and extract the full content.
        Returns: dict with all fields of the work
        """
        soup = BeautifulSoup(html, 'lxml')
        poem_data = {
            'id': poem_id,
            'title': title,
            'author': self.author_name,
            'dynasty': '',
            'content': '',
            'annotation': '',
            'translation': '',
            'appreciation': ''
        }

        # 1. Dynasty and author (the top of the page usually shows e.g. "〔唐〕李白")
        auth_div = soup.find('div', class_='sons', style=True)
        if auth_div:
            dynasty, author = extract_author_dynasty(auth_div.get_text())
            poem_data['dynasty'] = dynasty
            if author:
                poem_data['author'] = author

        # 2. Body text, usually inside <div class="contson">
        content_div = soup.find('div', class_='contson')
        if content_div:
            # Pass the raw HTML so that <br> tags survive until cleaning
            poem_data['content'] = normalize_poem_content(str(content_div))

        # 3. Annotations, translation, appreciation: the site puts these in
        #    several <div class="sons"> blocks, each holding a <div class="contyishang">
        #    and a <span> or <strong> heading such as 注释, 译文, or 赏析
        for div in soup.find_all('div', class_='sons'):
            body_div = div.find('div', class_='contyishang')
            if not body_div:
                continue
            for field, pattern in SECTION_PATTERNS.items():
                if div.find(['span', 'strong'], string=pattern):
                    poem_data[field] = normalize_poem_content(str(body_div))
        return poem_data

    def get_total_pages(self, first_page_html):
        """Parse the total page count from the first list page."""
        soup = BeautifulSoup(first_page_html, 'lxml')
        # The pager is usually in <div class="pagesright">
        page_div = soup.find('div', class_='pagesright')
        if page_div:
            page_links = page_div.find_all('a')
            if page_links:
                # The second-to-last link is taken to be the last page number
                last_page_text = page_links[-2].get_text() if len(page_links) >= 2 else "1"
                try:
                    return int(last_page_text)
                except ValueError:
                    pass
        # No pager found: assume a single page
        return 1

    def crawl_all_poems(self):
        """Driver: walk every list page, then scrape every detail page."""
        logger.info(f"Start crawling all works by author {self.author_name}")

        # Fetch page 1 and determine the page count
        params = {'author': self.author_name, 'page': 1}
        first_page_html = self.fetch_html(self.LIST_URL, params)
        total_pages = self.get_total_pages(first_page_html)
        logger.info(f"Total pages: {total_pages}")

        # Parse page 1, then the remaining pages
        all_poem_items = self.parse_list_page(first_page_html)
        for page in range(2, total_pages + 1):
            logger.info(f"Processing page {page}/{total_pages}")
            params['page'] = page
            html = self.fetch_html(self.LIST_URL, params)
            all_poem_items.extend(self.parse_list_page(html))
            time.sleep(REQUEST_DELAY)  # polite delay

        logger.info(f"Found {len(all_poem_items)} works in total; fetching details...")

        # Scrape the detail page of every work
        for idx, item in enumerate(all_poem_items, 1):
            poem_id, title, url = item['id'], item['title'], item['url']
            logger.info(f"[{idx}/{len(all_poem_items)}] Fetching: {title} ({poem_id})")
            try:
                detail_html = self.fetch_html(url)
                self.poems.append(self.parse_detail_page(detail_html, poem_id, title))
            except Exception as e:
                logger.error(f"Failed to fetch {title}: {e}")
                # Record a placeholder so the work can be retried later
                self.poems.append({'id': poem_id, 'title': title, 'error': str(e)})
            time.sleep(REQUEST_DELAY)  # throttle requests

        succeeded = sum(1 for p in self.poems if 'error' not in p)
        logger.info(f"Crawl finished; {succeeded} poems fetched successfully")
        return self.poems

    def save_to_json(self, filename=None):
        """Save the data as a JSON file."""
        if not filename:
            filename = f"data/{self.author_name}_poems.json"
        os.makedirs(os.path.dirname(filename), exist_ok=True)
        with open(filename, 'w', encoding='utf-8') as f:
            json.dump(self.poems, f, ensure_ascii=False, indent=2)
        logger.info(f"Data saved to {filename}")

    def save_to_csv(self, filename=None):
        """Optional: save the data as CSV."""
        if not filename:
            filename = f"data/{self.author_name}_poems.csv"
        os.makedirs(os.path.dirname(filename), exist_ok=True)
        if not self.poems:
            logger.warning("No data to save")
            return
        fieldnames = ['id', 'title', 'author', 'dynasty', 'content',
                      'annotation', 'translation', 'appreciation']
        # utf-8-sig adds a BOM so that Excel recognises the file as UTF-8
        with open(filename, 'w', encoding='utf-8-sig', newline='') as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            for poem in self.poems:
                # Fill in any missing fields with empty strings
                writer.writerow({k: poem.get(k, '') for k in fieldnames})
        logger.info(f"Data saved to {filename}")


def main():
    """Entry point."""
    # Any classical author's name can be entered here
    author = input("Enter an author name (e.g. 李白, 杜甫, 苏轼): ").strip()
    if not author:
        author = "李白"
    spider = GushiwenSpider(author)
    try:
        spider.crawl_all_poems()
        spider.save_to_json()
        spider.save_to_csv()
        print(f"✅ Done! {len(spider.poems)} records saved under data/")
    except Exception as e:
        logger.exception("Crawler failed")
        print(f"❌ Run failed: {e}")


if __name__ == "__main__":
    main()
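The crawler records a placeholder with an `error` key whenever a detail page fails, so a later pass can retry just those works. A minimal sketch of that second pass's first step, separating good records from failed ones and rebuilding the detail URLs: `split_records` and `detail_url` are hypothetical helpers, the URL form is the one quoted in the article, and the sample IDs are invented.

```python
DETAIL_URL = "https://www.gushiwen.cn/ShiWenView.aspx?id={poem_id}"  # form quoted in the article


def split_records(records):
    """Return (succeeded, failed), where failed records carry an 'error' key."""
    succeeded = [r for r in records if "error" not in r]
    failed = [r for r in records if "error" in r]
    return succeeded, failed


def detail_url(poem_id):
    """Rebuild the detail-page URL for a work ID."""
    return DETAIL_URL.format(poem_id=poem_id)


records = [
    {"id": "12345", "title": "静夜思", "content": "床前明月光，"},
    {"id": "67890", "title": "望庐山瀑布", "error": "HTTP error 503"},
]
succeeded, failed = split_records(records)
retry_urls = [detail_url(r["id"]) for r in failed]
print(len(succeeded), retry_urls)
```

Each URL in `retry_urls` could then be fetched again and, on success, the parsed record swapped in for its placeholder before the file is saved.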

7. Running the Crawler and Viewing Results

7.1 Run command

bash

python spider.py

Enter the author name "李白" and the crawler runs on its own; the console output looks something like:

text

2025-01-15 10:23:45 - INFO - Start crawling all works by author 李白
2025-01-15 10:23:46 - INFO - Fetching: https://www.gushiwen.cn/AuthorPieceList.aspx, params={'author': '李白', 'page': 1}
2025-01-15 10:23:47 - INFO - Total pages: 15
2025-01-15 10:23:47 - INFO - Found 10 poems on this page
...
2025-01-15 10:25:30 - INFO - Crawl finished; 146 poems fetched successfully
2025-01-15 10:25:30 - INFO - Data saved to data/李白_poems.json

7.2 Sample output (JSON excerpt)

json

[
  {
    "id": "12345",
    "title": "静夜思",
    "author": "李白",
    "dynasty": "唐",
    "content": "床前明月光，\n疑是地上霜。\n举头望明月，\n低头思故乡。",
    "annotation": "注释：\n（1）床：...",
    "translation": "译文：...",
    "appreciation": "赏析：这首诗写的是..."
  }
]
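To confirm that the saved file really contains readable Chinese rather than `\uXXXX` escapes, the write and read steps can be exercised on a temporary file. This is a sketch with an invented one-record dataset and a temporary path; it relies on the same two settings the crawler uses, `encoding='utf-8'` and `ensure_ascii=False`.

```python
import json
import os
import tempfile

poems = [{"id": "12345", "title": "静夜思", "content": "床前明月光，\n疑是地上霜。"}]

path = os.path.join(tempfile.mkdtemp(), "libai.json")  # throwaway location

# Write: ensure_ascii=False keeps Chinese characters as-is in the file.
with open(path, "w", encoding="utf-8") as f:
    json.dump(poems, f, ensure_ascii=False, indent=2)

# Read back, both as raw text and as parsed data.
with open(path, encoding="utf-8") as f:
    raw_text = f.read()
loaded = json.loads(raw_text)

line_counts = [(p["title"], len(p["content"].split("\n"))) for p in loaded]
print(line_counts)
```

Omitting `ensure_ascii=False` would still produce valid JSON, but every Chinese character would be written as an escape sequence, which makes the file hard to read and larger on disk.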
