Python + Requests + BeautifulSoup: Build Your First Web Scraper in 10 Minutes

news 2026/6/15 14:59:30

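This article is for anyone who wants to learn web scraping but doesn't know where to start, especially programming beginners and developers who want a quick introduction to data collection.

The problem it solves: you want to pull information from web pages but don't know how to write a scraper, and network requests and HTML parsing feel overwhelming.

Why I wrote it: when I first started learning web scraping I hit plenty of pitfalls myself, from environment setup to the code itself, and took many detours. Here I share that experience so you can skip the detours and have your first scraper running in about 10 minutes.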

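Pain points

Why do so many beginners find web scraping hard?

1. Environment setup is complicated

Not knowing which libraries to install
Version compatibility problems
Tangled dependencies

2. HTTP requests are complicated

Not understanding the basics of the HTTP protocol
Difficulty building request parameters
Trouble parsing the response data

3. HTML parsing is difficult

Not knowing where to start with a complex HTML structure
Not knowing how to locate elements precisely
Messy data extraction logic

All of these problems have mature solutions. Below we take the simplest route and build your first scraper step by step.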

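Environment setup

Before we start, we need the following tools and libraries:

Required Python libraries

    # Install Requests, used to send HTTP requests
    pip install requests

    # Install BeautifulSoup4, used to parse HTML
    pip install beautifulsoup4

    # Install the lxml parser for faster HTML parsing
    pip install lxml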

Verify the installation

    import requests
    from bs4 import BeautifulSoup
    import lxml

    print("All libraries installed successfully!")
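
If one of the imports above fails, it helps to see exactly which package is missing and which versions are installed. The sketch below is one way to check, using only the standard library. It assumes Python 3.8 or newer (for importlib.metadata), and installed_versions is a helper name made up for this example.

```python
from importlib import metadata  # standard library, Python 3.8+


def installed_versions(packages):
    """Return {package name: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # These are the pip package names, not the import names (bs4 -> beautifulsoup4)
    report = installed_versions(["requests", "beautifulsoup4", "lxml"])
    for name, version in report.items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

Any package reported as NOT INSTALLED can be installed with the matching pip command from the previous section.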

Recommended development environment

Python 3.6 or later
Code editor: VS Code or PyCharm
Browser: Chrome (for debugging)

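Step-by-step walkthrough

Step 1: Send your first HTTP request

We start with a simple page: send an HTTP request and fetch the page content.

    import requests

    # Send a GET request
    url = "http://httpbin.org/get"
    response = requests.get(url)

    # Check whether the request succeeded
    if response.status_code == 200:
        print("Request succeeded!")
        print("Response body:")
        print(response.text[:500])  # show only the first 500 characters
    else:
        print(f"Request failed, status code: {response.status_code}")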

Notes:

requests.get() sends a GET request
response.status_code holds the HTTP status code
200 means the request succeeded
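
200 is only one of many status codes you will run into. As a rough illustration, the sketch below turns a code into a readable label using the standard library's HTTPStatus table. describe_status is a helper name invented for this example; it is not part of Requests.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a label such as '404 Not Found (client error)'."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:  # code is not in the standard table
        phrase = "Unknown"

    if 100 <= code < 200:
        kind = "informational"
    elif 200 <= code < 300:
        kind = "success"
    elif 300 <= code < 400:
        kind = "redirect"
    elif 400 <= code < 500:
        kind = "client error"
    elif 500 <= code < 600:
        kind = "server error"
    else:
        kind = "other"
    return f"{code} {phrase} ({kind})"


if __name__ == "__main__":
    for code in (200, 301, 403, 404, 503):
        print(describe_status(code))
```

In a scraper you would call it as describe_status(response.status_code) when logging a failed request.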

Step 2: Set request headers

By default Requests identifies itself with a python-requests User-Agent, which some sites reject. To look like a normal browser, we set suitable request headers:

    import requests

    # Set the request headers
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8'
    }

    # Send a GET request with the headers
    url = "http://httpbin.org/get"
    response = requests.get(url, headers=headers)

    print("Request headers set!")
    print("Headers echoed back in the response:")
    print(response.json()['headers'])

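Notes:

User-Agent makes the request look like it comes from a real browser
Accept tells the server which response types we accept
Accept-Language sets the language preference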
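
To see what these headers actually change, you can build a request without sending it and inspect the result. The sketch below does that with a requests.Session, so it needs no network access. The header values are copied from the step above, and BROWSER_HEADERS is just a name chosen for this example.

```python
import requests

BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
}

session = requests.Session()

# What Requests would send if we set nothing: python-requests/<version>
default_user_agent = session.headers["User-Agent"]

# Headers set on the session apply to every request made through it
session.headers.update(BROWSER_HEADERS)

# Build the request without sending it, to inspect the final headers
prepared = session.prepare_request(requests.Request("GET", "http://httpbin.org/get"))

print("Default User-Agent: ", default_user_agent)
print("User-Agent now sent:", prepared.headers["User-Agent"])
```

Using a session this way also saves passing headers=headers on every call once the scraper makes more than one request.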

Step 3: Parse an HTML page

Now we parse a real HTML page. We use a simple quotes website as the example:

    import requests
    from bs4 import BeautifulSoup

    # Send a request to fetch the page
    url = "http://quotes.toscrape.com/"  # a site built for scraping practice
    response = requests.get(url)

    # Parse the HTML with BeautifulSoup
    soup = BeautifulSoup(response.text, 'lxml')

    # Find all quotes
    quotes = soup.find_all('div', class_='quote')

    print(f"Found {len(quotes)} quotes:")
    print("-" * 50)

    for i, quote in enumerate(quotes[:3], 1):  # show only the first 3
        text = quote.find('span', class_='text').text
        author = quote.find('small', class_='author').text
        print(f"{i}. {text}")
        print(f"   -- {author}")
        print()

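Notes:

BeautifulSoup(response.text, 'lxml') parses the HTML
find_all() returns every matching element
find() returns the first matching element, or None if there is no match
.text returns the text content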
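
The difference between these calls is easier to see on a small piece of HTML you control. The sketch below parses a hand-written snippet (made up for this example, loosely following the practice site's structure) and uses the built-in html.parser, so it runs offline and without lxml.

```python
from bs4 import BeautifulSoup

SAMPLE_HTML = """
<div class="quote">
  <span class="text">First quote</span>
  <small class="author">Author A</small>
  <a class="tag" href="/tag/one/">one</a>
  <a class="tag" href="/tag/two/">two</a>
</div>
<div class="quote">
  <span class="text">Second quote</span>
  <small class="author">Author B</small>
</div>
"""

# html.parser ships with Python; swap in 'lxml' if it is installed
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

all_quotes = soup.find_all("div", class_="quote")   # list of every match
first_quote = soup.find("div", class_="quote")      # first match only
first_text = first_quote.find("span", class_="text").text
first_tags = [a.text for a in first_quote.find_all("a", class_="tag")]
first_href = first_quote.find("a", class_="tag")["href"]  # read an attribute
missing = soup.find("div", class_="no-such-class")  # None when nothing matches

print(len(all_quotes), first_text, first_tags, first_href, missing)
```

Note the last line: find() returning None is what causes the AttributeError covered in the pitfalls section.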

Step 4: Extract the data and save it

We save the extracted data to a CSV file:

    import requests
    from bs4 import BeautifulSoup
    import csv

    # Send a request to fetch the page
    url = "http://quotes.toscrape.com/"
    response = requests.get(url)

    # Parse the HTML with BeautifulSoup
    soup = BeautifulSoup(response.text, 'lxml')

    # Collect the extracted rows here
    quotes_data = []

    # Find all quotes
    quotes = soup.find_all('div', class_='quote')

    for quote in quotes:
        text = quote.find('span', class_='text').text
        author = quote.find('small', class_='author').text
        tags = [tag.text for tag in quote.find_all('a', class_='tag')]

        quotes_data.append({
            'text': text,
            'author': author,
            'tags': ', '.join(tags)
        })

    # Save to a CSV file
    filename = 'quotes.csv'
    with open(filename, 'w', newline='', encoding='utf-8-sig') as csvfile:
        fieldnames = ['text', 'author', 'tags']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

        writer.writeheader()
        writer.writerows(quotes_data)

    print(f"Data saved to {filename}")
    print(f"Saved {len(quotes_data)} quotes")

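Notes:

csv.DictWriter writes dictionaries as CSV rows
encoding='utf-8-sig' writes a UTF-8 byte order mark so that Excel displays Chinese and other non-ASCII text correctly
newline='' prevents blank lines between rows in the CSV file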
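
To check what these three settings do, you can write a small file and read it straight back. The sketch below does a round trip through a temporary file using only the standard library. The sample rows are invented for this example.

```python
import csv
import os
import tempfile

rows = [
    {"text": "Quote, with a comma", "author": "作者", "tags": "a, b"},
    {"text": "Plain quote", "author": "Someone", "tags": ""},
]

path = os.path.join(tempfile.mkdtemp(), "demo.csv")

with open(path, "w", newline="", encoding="utf-8-sig") as f:
    writer = csv.DictWriter(f, fieldnames=["text", "author", "tags"])
    writer.writeheader()
    writer.writerows(rows)

# utf-8-sig puts a 3-byte BOM at the start of the file
with open(path, "rb") as f:
    starts_with_bom = f.read(3) == b"\xef\xbb\xbf"

# Reading with utf-8-sig strips the BOM again, so the first header stays clean
with open(path, newline="", encoding="utf-8-sig") as f:
    loaded = list(csv.DictReader(f))

print("BOM written:", starts_with_bom)
print("Round trip matches:", loaded == rows)
```

Fields that contain commas are quoted automatically by the csv module, which is why the rows survive the round trip unchanged.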

Complete code

    import requests
    from bs4 import BeautifulSoup
    import csv


    def scrape_quotes(url):
        """
        Scrape the quotes from one page of the quotes site.

        Args:
            url (str): URL of the page to scrape

        Returns:
            list: one dict per quote found on the page
        """
        try:
            # Send the HTTP request
            headers = {
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
            }
            response = requests.get(url, headers=headers, timeout=10)

            # Check whether the request succeeded
            if response.status_code != 200:
                print(f"Request failed, status code: {response.status_code}")
                return []

            # Parse the HTML
            soup = BeautifulSoup(response.text, 'lxml')

            # Extract the data
            quotes_data = []
            quotes = soup.find_all('div', class_='quote')

            for quote in quotes:
                text = quote.find('span', class_='text').text.strip()
                author = quote.find('small', class_='author').text.strip()
                tags = [tag.text.strip() for tag in quote.find_all('a', class_='tag')]

                quotes_data.append({
                    'text': text,
                    'author': author,
                    'tags': ', '.join(tags)
                })

            return quotes_data

        except requests.exceptions.RequestException as e:
            print(f"Request error: {e}")
            return []
        except Exception as e:
            print(f"Unexpected error: {e}")
            return []


    def save_to_csv(data, filename):
        """
        Save the data to a CSV file.

        Args:
            data (list): list of dicts to save
            filename (str): output file name
        """
        if not data:
            print("No data to save")
            return

        try:
            with open(filename, 'w', newline='', encoding='utf-8-sig') as csvfile:
                fieldnames = ['text', 'author', 'tags']
                writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

                writer.writeheader()
                writer.writerows(data)

            print(f"Data saved to {filename}")
            print(f"Saved {len(data)} records")

        except Exception as e:
            print(f"Error while saving the file: {e}")


    def main():
        """Entry point."""
        # Target URL
        url = "http://quotes.toscrape.com/"

        print("Starting to scrape quotes...")
        print(f"Target URL: {url}")
        print("-" * 50)

        # Scrape the data
        quotes_data = scrape_quotes(url)

        if quotes_data:
            # Save the data
            save_to_csv(quotes_data, 'quotes.csv')

            # Show the first 5 records
            print("\nSample of scraped data (first 5):")
            print("-" * 50)
            for i, quote in enumerate(quotes_data[:5], 1):
                print(f"{i}. {quote['text']}")
                print(f"   -- {quote['author']}")
                print(f"   Tags: {quote['tags']}")
                print()
        else:
            print("Scraping failed, please check your network connection and the URL")


    if __name__ == "__main__":
        main()

GitHub link: https://github.com/your-username/python-crawler-tutorial
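
The complete code reads only the first page of the site. If you want every page, one possible extension is to follow the "next" link until it disappears. The sketch below is not part of the original script. It assumes the pager is marked up as li.next containing a link (which is how the practice site appeared to be structured at the time of writing), and scrape_all_pages and its fetch parameter are names invented for this example.

```python
import time
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def scrape_all_pages(start_url, fetch, delay=1.0, max_pages=50):
    """Follow "next" links from start_url and collect quotes from every page.

    fetch is any callable that takes a URL and returns that page's HTML.
    """
    results = []
    url = start_url
    for _ in range(max_pages):  # hard cap so a broken pager cannot loop forever
        soup = BeautifulSoup(fetch(url), "html.parser")

        for quote in soup.select("div.quote"):
            text = quote.select_one("span.text")
            author = quote.select_one("small.author")
            if text and author:  # skip incomplete entries
                results.append({
                    "text": text.get_text(strip=True),
                    "author": author.get_text(strip=True),
                })

        next_link = soup.select_one("li.next a")  # assumed pager markup
        if next_link is None:
            break  # last page reached
        url = urljoin(url, next_link["href"])
        time.sleep(delay)  # pause between pages to go easy on the server
    return results
```

With Requests, a call could look like scrape_all_pages("http://quotes.toscrape.com/", lambda u: requests.get(u, timeout=10).text). Passing the fetch function in keeps the parsing logic testable without a network connection.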

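Pitfalls to avoid

Pitfall 1: Chinese text encoding problems

Problem: the program raises UnicodeDecodeError, or Chinese text comes out garbled.

Symptom:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 0

Cause: the page's encoding does not match the encoding the program uses to decode it. With response.text the usual symptom is garbled characters; the exception typically appears when you decode response.content (or a saved file) yourself with the wrong codec.

Solution:

    response = requests.get(url)

    # Set the encoding after receiving the response, before reading response.text
    response.encoding = response.apparent_encoding  # detect the encoding from the body

    # Or set it to UTF-8 explicitly
    response.encoding = 'utf-8'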
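
The mismatch is easy to reproduce without any network access. The sketch below encodes a string as GBK, as an older Chinese site might, and then decodes it three ways. The sample string is invented for this example.

```python
original = "中文名言"
raw = original.encode("gbk")  # the bytes a GBK-encoded page would send

# Strict decoding with the wrong codec raises UnicodeDecodeError
try:
    raw.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Lenient decoding does not raise, but yields replacement characters (garbled text)
replaced = raw.decode("utf-8", errors="replace")

# Decoding with the codec the page was written in gives the original text back
correct = raw.decode("gbk")

print("Strict UTF-8 decode failed:", utf8_failed)
print("Lenient UTF-8 decode:", replaced)
print("GBK decode:", correct)
```

Setting response.encoding to the right value is the Requests equivalent of choosing the correct codec in the last step.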

Pitfall 2: User-Agent detected

Problem: the site detects the scraper and returns a 403 error or a CAPTCHA.

Symptom (raised when you call response.raise_for_status()):

requests.exceptions.HTTPError: 403 Client Error: Forbidden

Cause: the default User-Agent is recognized as a scraper.

Solution:

    import random

    # Use a more realistic User-Agent
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8'
    }

    # Or pick one at random from a pool of User-Agents
    user_agents = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    ]

    headers['User-Agent'] = random.choice(user_agents)

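Pitfall 3: Element lookup fails

Problem: find() cannot locate the HTML element you asked for.

Symptom:

AttributeError: 'NoneType' object has no attribute 'text'

Cause: the page structure changed or the selector is wrong, so find() returned None.

Solution:

    # Check that each element exists before reading it
    quote = soup.find('div', class_='quote')
    text_tag = quote.find('span', class_='text') if quote else None
    if text_tag:
        print(text_tag.text)
    else:
        print("Quote element not found")

    # Or use CSS selectors, which are more flexible
    for quote in soup.select('div.quote'):
        text_tag = quote.select_one('span.text')
        if text_tag:
            print(text_tag.text)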
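
When a scraper reads several fields per item, repeating the None check for each one gets tedious. One common pattern is a small helper that returns a default instead of raising. The sketch below assumes that approach; safe_text is a name made up for this example, and the HTML is a hand-written sample in which the second quote has no author.

```python
from bs4 import BeautifulSoup


def safe_text(parent, selector, default=""):
    """Return the stripped text of the first match for selector, or default."""
    if parent is None:
        return default
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node is not None else default


SAMPLE_HTML = """
<div class="quote"><span class="text"> Complete </span><small class="author">A</small></div>
<div class="quote"><span class="text">No author here</span></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
records = [
    {
        "text": safe_text(quote, "span.text"),
        "author": safe_text(quote, "small.author", default="(unknown)"),
    }
    for quote in soup.select("div.quote")
]
print(records)
```

The trade-off is that missing data no longer stops the program, so it is worth logging or counting the defaults to notice when the page structure changes.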

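Pitfall 4: Network timeouts

Problem: a request gets no response for a long time and the program hangs.

Symptom: the program waits for a long time. Requests has no default timeout, so without the timeout argument it can wait indefinitely; with it, a timeout exception is raised.

Cause: an unstable network or a slow target site.

Solution:

    # Set a timeout
    response = requests.get(url, timeout=10)  # 10-second timeout

    # Handle exceptions with try/except
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # raise HTTPError on 4xx/5xx status codes
    except requests.exceptions.Timeout:
        print("Request timed out")
    except requests.exceptions.ConnectionError:
        print("Connection error")
    except requests.exceptions.RequestException as e:
        print(f"Request error: {e}")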
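
A timeout tells you that a request failed, but on a flaky network you usually also want to try again. One option, sketched below, is to mount an HTTPAdapter with a urllib3 Retry policy on a session. The retry count, backoff factor, and status list are illustrative values rather than recommendations, and build_session is a name chosen for this example.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def build_session(total_retries=3, backoff_factor=0.5):
    """Return a Session that retries failed requests with increasing delays."""
    retry = Retry(
        total=total_retries,
        backoff_factor=backoff_factor,               # wait longer after each failure
        status_forcelist=[429, 500, 502, 503, 504],  # also retry on these statuses
    )
    adapter = HTTPAdapter(max_retries=retry)

    session = requests.Session()
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session


if __name__ == "__main__":
    session = build_session()
    try:
        # timeout=(connect, read) in seconds; it is still needed alongside retries
        response = session.get("http://quotes.toscrape.com/", timeout=(3, 10))
        response.raise_for_status()
        print("Fetched", len(response.text), "characters")
    except requests.exceptions.RequestException as e:
        print(f"Request failed after retries: {e}")
```

Retries do not replace the timeout: each attempt still needs one, otherwise a single stalled connection can block the whole run.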

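Pitfall 5: Saving the data fails

Problem: the CSV file cannot be saved, or the data comes out in the wrong format.

Symptom:

UnicodeEncodeError: 'gbk' codec can't encode character

Cause: the file was opened without an explicit encoding, so Python used the system default (GBK on Chinese-language Windows), which cannot represent every character; or the rows do not match the writer you are using.

Solution:

    import csv

    # data is a list of dicts with the keys text, author and tags
    with open('quotes.csv', 'w', newline='', encoding='utf-8-sig') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=['text', 'author', 'tags'])
        writer.writeheader()    # write the header row
        writer.writerows(data)  # write the data rows

    # Or use pandas
    import pandas as pd
    df = pd.DataFrame(data)
    df.to_csv('quotes.csv', index=False, encoding='utf-8-sig')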

Results

Run the scraper and you will see output similar to this:

    Starting to scrape quotes...
    Target URL: http://quotes.toscrape.com/
    --------------------------------------------------
    Data saved to quotes.csv
    Saved 10 records

    Sample of scraped data (first 5):
    --------------------------------------------------
    1. "The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking."
       -- Albert Einstein
       Tags: change, deep-thoughts, thinking, world

    2. "It is our choices, Harry, that show what we truly are, far more than our abilities."
       -- J.K. Rowling
       Tags: choices

    3. "There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle."
       -- Albert Einstein
       Tags: inspirational, life, live, miracle, miracles

    4. "The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid."
       -- Jane Austen
       Tags: classic, literature

    5. "Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring."
       -- Marilyn Monroe
       Tags: be-yourself, inspirational

Contents of the generated CSV file:

    text,author,tags
    "The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.",Albert Einstein,"change, deep-thoughts, thinking, world"
    "It is our choices, Harry, that show what we truly are, far more than our abilities.",J.K. Rowling,choices
    "There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.",Albert Einstein,"inspirational, life, live, miracle, miracles"
    "The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.",Jane Austen,"classic, literature"
    "Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.",Marilyn Monroe,"be-yourself, inspirational"