
Python Web Crawlers in Practice

news 2026/7/6 2:06:52

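A newbie moving from backend development to Rust, ID "第一程序员" (First Programmer): a big name for someone who is still pretty green (for now). I'm currently wrestling with ownership and lifetimes, and I keep a running log of the pitfalls and "aha moments" on the road to learning Rust, with code snippets that are guaranteed to run. Keep learning, keep writing. Experts, please go easy on me, and fellow learners are welcome to improve together.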

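Preface

While learning Python recently, I started looking into web crawlers. As a newbie moving from backend work to Rust, I think it is well worth understanding Python's crawling tools: they let us pull data from the internet and prepare it for later analysis and applications.

Python offers several libraries and tools for web crawling, such as requests, BeautifulSoup, and Scrapy. Today I'll share what I've learned about Python web crawlers along with some hands-on experience, and I hope it helps fellow newbies like me.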

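Basic Concepts of Web Crawlers

What Is a Web Crawler

A web crawler is an automated program that retrieves web page content from the internet. Following a set of rules, it visits pages automatically and extracts the data it needs.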

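Use Cases for Web Crawlers

Data collection: gather large amounts of data from websites
Information monitoring: watch websites for updates and changes
Search engines: fetch page content for a search engine to index
Price comparison: compare product prices across different sites
Content aggregation: combine content from multiple sites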

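Legal and Ethical Issues

Respect robots.txt: a site's robots.txt file states the access rules for crawlers
Control the request rate: avoid putting too much load on the site's servers
Respect copyright: do not infringe on the site's intellectual property
Protect privacy: do not collect personal or private information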
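To make the robots.txt point concrete, here is a minimal sketch using the standard library's urllib.robotparser. The robots.txt content and the "my-crawler" user agent are made up for illustration, and the rules are parsed from an inline string so the example runs without network access. In a real crawler you would more likely point the parser at the site's actual robots.txt with set_url() and read().

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "my-crawler" is an assumed user agent name for this sketch.
allowed_public = parser.can_fetch("my-crawler", "https://www.example.com/index.html")
allowed_private = parser.can_fetch("my-crawler", "https://www.example.com/private/data.html")
delay = parser.crawl_delay("my-crawler")

print(allowed_public, allowed_private, delay)
```

Checking can_fetch() before every request and honoring crawl_delay() when it is set covers the first two items in the list above.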

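The Basic Workflow of a Web Crawler

Send a request: send an HTTP request to the target site
Receive the response: read the HTML content the site returns
Parse the content: extract the data you need
Store the data: save the extracted data
Post-process: process and analyze the data further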
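The steps above can be sketched end to end in a few lines. This is only an illustration under some assumptions: fetch() is a stand-in that returns a canned HTML string instead of making a real HTTP request, parsing uses the standard library's html.parser rather than BeautifulSoup so that nothing needs installing, and all the names (SAMPLE_HTML, LinkParser, fetch, parse, store) are my own inventions.

```python
import csv
import io
from html.parser import HTMLParser

# Canned page used instead of a real HTTP response (assumption for this sketch).
SAMPLE_HTML = """
<html><body>
  <a href="/a">First</a>
  <a href="/b">Second</a>
</body></html>
"""


def fetch(url: str) -> str:
    """Steps 1-2: stand-in for sending a request and receiving the HTML."""
    return SAMPLE_HTML


class LinkParser(HTMLParser):
    """Step 3: collect (href, text) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only collect text while inside an <a> that has an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def parse(html: str):
    parser = LinkParser()
    parser.feed(html)
    return parser.links


def store(rows) -> str:
    """Step 4: write the rows as CSV text (in memory here, a file in practice)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["href", "text"])
    writer.writerows(rows)
    return buffer.getvalue()


links = parse(fetch("https://www.example.com"))
csv_text = store(links)
print(csv_text)
```

Keeping each step in its own function makes it easy to swap in a real HTTP client or a different parser later without touching the rest.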

Common Web Crawler Libraries

1. requests

requests is a simple, easy-to-use HTTP library for sending HTTP requests.

import requests

# Send a GET request
response = requests.get('https://www.example.com')
print(response.status_code)
print(response.text)

# Send a POST request with form data
data = {'username': 'admin', 'password': '123456'}
response = requests.post('https://www.example.com/login', data=data)
print(response.text)

# Set custom headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
response = requests.get('https://www.example.com', headers=headers)
print(response.text)

2. BeautifulSoup

BeautifulSoup is an HTML and XML parsing library for extracting data from web pages.

from bs4 import BeautifulSoup
import requests

# Fetch the page content
response = requests.get('https://www.example.com')
soup = BeautifulSoup(response.text, 'html.parser')

# Extract the title
title = soup.title.string
print(title)

# Extract all links
links = soup.find_all('a')
for link in links:
    print(link.get('href'))

# Extract elements with a specific class
items = soup.find_all(class_='item')
for item in items:
    print(item.text)
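One thing worth knowing about the href values collected above: they are often relative paths, missing entirely, or not web links at all (mailto: and the like). Below is a small sketch of how they might be cleaned up with the standard library's urllib.parse. The function name normalize_links and the sample hrefs are hypothetical.

```python
from urllib.parse import urljoin, urlparse


def normalize_links(base_url, hrefs):
    """Resolve hrefs against base_url, keeping unique http(s) URLs in order."""
    seen = set()
    result = []
    for href in hrefs:
        if not href:
            continue  # <a> tags without an href yield None
        absolute = urljoin(base_url, href)
        if urlparse(absolute).scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


# Sample values of the kind link.get('href') can return (made up for this sketch).
hrefs = [
    "/about",
    "news/today.html",
    "https://other.example.org/",
    None,
    "mailto:someone@example.com",
    "/about",
]
links = normalize_links("https://www.example.com/blog/", hrefs)
for link in links:
    print(link)
```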

3. Scrapy

Scrapy is a powerful web crawling framework designed for large-scale data collection.

import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://www.example.com']

    def parse(self, response):
        # Extract the title
        title = response.css('title::text').get()
        yield {'title': title}

        # Extract the links
        links = response.css('a::attr(href)').getall()
        for link in links:
            yield {'link': link}

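Hands-On Example: Crawling the GitHub Trending Page

1. Analyze the page structure

First, we need to analyze the structure of the GitHub trending page to decide which data to extract and how to extract it.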
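Besides using the browser's developer tools, one rough way I can think of to find the repeating container element is to count how often each tag and class combination appears in the HTML. The sketch below does this with the standard library only. SAMPLE is a heavily simplified, made-up stand-in for the real page, and TagClassCounter is a hypothetical helper.

```python
from collections import Counter
from html.parser import HTMLParser


class TagClassCounter(HTMLParser):
    """Count occurrences of each (tag, class attribute) pair."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        class_attr = dict(attrs).get("class") or ""
        self.counts[(tag, class_attr)] += 1


# Simplified, hypothetical markup; the real page is much larger.
SAMPLE = """
<main>
  <article class="Box-row"><h2><a href="/o/r1">o / r1</a></h2><p>desc 1</p></article>
  <article class="Box-row"><h2><a href="/o/r2">o / r2</a></h2><p>desc 2</p></article>
  <article class="Box-row"><h2><a href="/o/r3">o / r3</a></h2></article>
</main>
"""

counter = TagClassCounter()
counter.feed(SAMPLE)

# Elements that repeat are candidates for "one item per project".
repeated = [(key, n) for key, n in counter.counts.most_common() if n > 1]
for (tag, class_attr), n in repeated:
    print(f"{n} x <{tag} class={class_attr!r}>")
```

In this sample the article element with class Box-row appears once per project, while the p element appears fewer times, which hints that the description is optional and the extraction code should handle it being absent.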

2. Write the crawler code

import requests
from bs4 import BeautifulSoup

# Send the request
url = 'https://github.com/trending'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
response = requests.get(url, headers=headers)

# Parse the content
soup = BeautifulSoup(response.text, 'html.parser')


def text_of(tag):
    """Return the tag's stripped text, or '' if the tag was not found."""
    return tag.text.strip() if tag else ''


# Extract project information.
# Note: a multi-class string in class_ must match the element's class attribute
# exactly, and these class names may no longer match GitHub's current markup.
# When a lookup fails, the field is left empty instead of raising an error.
projects = []
for project in soup.find_all('article', class_='Box-row'):
    # Project name (newlines and spaces removed)
    name = text_of(project.find('h2', class_='h3 lh-condensed')).replace('\n', '').replace(' ', '')
    # Project description
    description = text_of(project.find('p', class_='col-9 text-gray my-1 pr-4'))
    # Language
    language = text_of(project.find('span', class_='text-gray-dark mr-3'))
    # Stars and forks share the same link class: the first is stars, the second is forks
    counters = project.find_all('a', class_='Link--muted d-inline-block mr-3')
    stars = counters[0].text.strip() if counters else ''
    forks = counters[1].text.strip() if len(counters) > 1 else ''
    # Stars gained today
    today_stars = text_of(project.find('span', class_='d-inline-block float-sm-right'))
    projects.append({
        'name': name,
        'description': description,
        'language': language,
        'stars': stars,
        'forks': forks,
        'today_stars': today_stars
    })

# Print the results
for project in projects:
    print(f"Name: {project['name']}")
    print(f"Description: {project['description']}")
    print(f"Language: {project['language']}")
    print(f"Stars: {project['stars']}")
    print(f"Forks: {project['forks']}")
    print(f"Today Stars: {project['today_stars']}")
    print('-' * 50)

3. Store the data

import csv

# Save as a CSV file (`projects` is the list built in the previous step)
with open('github_trending.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['Name', 'Description', 'Language', 'Stars', 'Forks', 'Today Stars'])
    for project in projects:
        writer.writerow([
            project['name'],
            project['description'],
            project['language'],
            project['stars'],
            project['forks'],
            project['today_stars']
        ])

print('Data saved to github_trending.csv')

Best Practices for Web Crawlers

1. Set reasonable request headers

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1'
}

2. Control the request rate

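import time

import requests

# Example list of pages to fetch; `headers` is the dict defined in the previous section
urls = ['https://www.example.com/page1', 'https://www.example.com/page2']

for url in urls:
    response = requests.get(url, headers=headers)
    # Process the response here
    time.sleep(1)  # Pause for 1 second between requests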
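A fixed time.sleep(1) always waits a full second, even when processing the response already took a while. As a possible refinement, here is a sketch of a small limiter that only sleeps for whatever is left of the minimum interval. RateLimiter is a hypothetical helper of my own, not part of requests, and the clock and sleep parameters exist only so it can be tested without real waiting.

```python
import time


class RateLimiter:
    """Ensure at least `min_interval` seconds between consecutive wait() calls."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# A short interval keeps the demo quick; a real crawler would use 1 second or more.
limiter = RateLimiter(min_interval=0.05)
start = time.monotonic()
for url in ["https://www.example.com/a", "https://www.example.com/b", "https://www.example.com/c"]:
    limiter.wait()
    # requests.get(url, headers=headers) would go here
elapsed = time.monotonic() - start
print(f"3 requests spaced over {elapsed:.2f}s")
```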

3. Use a proxy

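# Route requests through a proxy (a local proxy address is used here as an example);
# `url` and `headers` come from the earlier snippets
proxies = {
    'http': 'http://127.0.0.1:7890',
    'https': 'http://127.0.0.1:7890'
}
response = requests.get(url, headers=headers, proxies=proxies)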

4. Handle exceptions

try:
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # Raise an exception for 4xx/5xx status codes
except requests.exceptions.RequestException as e:
    print(f"Error: {e}")
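Printing the error is a start, but temporary failures are often worth retrying. Below is a minimal sketch of retrying with exponential backoff. fetch_with_retry and flaky_fetch are hypothetical names, and the fake fetch function stands in for a real network call so the example runs offline. With requests, I would expect to pass retry_on=(requests.exceptions.RequestException,) and a function that calls requests.get followed by raise_for_status().

```python
import time


def fetch_with_retry(fetch, url, max_attempts=3, base_delay=1.0,
                     sleep=time.sleep, retry_on=(Exception,)):
    """Call fetch(url); on failure wait base_delay * 2**attempt, then retry.

    The exception from the final attempt is re-raised to the caller.
    """
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except retry_on:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Fake fetch that fails twice and then succeeds (stand-in for a flaky server).
calls = []


def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise ConnectionError("temporary failure")
    return "<html>ok</html>"


# Record the delays instead of really sleeping, to keep the demo fast.
delays = []
result = fetch_with_retry(flaky_fetch, "https://www.example.com", sleep=delays.append)
print(result, delays)
```

Doubling the delay after each failure gives a struggling server more room to recover, which fits with the earlier advice about not overloading the site.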

5. Use session management

# A Session keeps cookies and default headers across requests
session = requests.Session()
session.headers.update(headers)

# Log in
data = {'username': 'admin', 'password': '123456'}
session.post('https://www.example.com/login', data=data)

# Visit a page that requires login
response = session.get('https://www.example.com/dashboard')