
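From Skills-Competition Sample Problems to a Real Project: A Hands-On Guide to Scraping Weather Data with Python and Storing It in MySQL (with Anti-Scraping Strategies)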

news 2026/7/24 8:07:42

From Skills Competition to Real Project: A Complete Walkthrough of Weather Data Collection with Python

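While sorting through past projects recently, I noticed that many beginners finish a skills competition without knowing how to apply what they learned to real work. Web scraping is a good example: it looks simple but hides plenty of pitfalls, and the model answer from a competition often runs into surprises in a real setting. Using weather data collection, one of the most common scraping tasks, as the example, this article walks through a complete project workflow, from basic scraping to anti-scraping countermeasures to data storage, so you can experience the full lifecycle of a real project.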

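1. Project Planning and Environment Setup

Before writing any code, we need to pin down the project's goal and scope. Unlike a competition with a fixed task, a real project has to account for more practical factors. Our goal is to collect historical weather data for 10 cities, including Qingdao and Kaifeng, with these fields: city, date, daily high temperature, daily low temperature, weather condition, and wind direction.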
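Before any scraping code exists, it can help to write the six target fields down as a type so that every later step agrees on names and order. The sketch below is one way to do that; `WeatherRecord` and `as_row` are hypothetical names introduced here for illustration, and the sample values are made up.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class WeatherRecord:
    """One day of weather for one city (hypothetical helper type)."""
    city: str
    day: date
    high_temp: Optional[int] = None       # degrees Celsius
    low_temp: Optional[int] = None        # degrees Celsius
    weather: Optional[str] = None
    wind_direction: Optional[str] = None

    def as_row(self) -> tuple:
        """Return the fields in a fixed order, ready for a parameterised INSERT."""
        return (self.city, self.day.isoformat(), self.high_temp,
                self.low_temp, self.weather, self.wind_direction)


# Made-up sample values, for illustration only
record = WeatherRecord("Qingdao", date(2024, 1, 15), 5, -2, "Sunny", "North")
print(record.as_row())
```

Keeping the column order in a single method means the SQL statement and the scraper cannot silently drift apart.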

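1.1 Choosing and Installing Tools

For a Python scraping project, we typically use the following toolchain:

pip install requests beautifulsoup4 pymysql pandas DBUtils

requests: an HTTP library with a friendlier API than urllib
beautifulsoup4: an HTML parser
pymysql: a pure-Python MySQL client
pandas: data processing and analysis
DBUtils: database connection pooling (provides the PooledDB class used in section 4.1)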

1.2 Database Design

Create a table in MySQL to store the weather data:

CREATE TABLE `weather_data` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `city` varchar(50) NOT NULL,
  `date` date NOT NULL,
  `high_temp` int(11) DEFAULT NULL,
  `low_temp` int(11) DEFAULT NULL,
  `weather` varchar(50) DEFAULT NULL,
  `wind_direction` varchar(50) DEFAULT NULL,
  `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`),
  UNIQUE KEY `idx_city_date` (`city`,`date`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

Note: the composite unique index on (city, date) prevents the same city-day record from being inserted twice.
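To see what the composite unique key buys you without a MySQL server at hand, the following sketch reproduces the behaviour with Python's built-in sqlite3 module. This is a stand-in, not the project's database: SQLite spells the statement INSERT OR IGNORE where MySQL uses INSERT IGNORE, the table is cut down to three columns, and the rows are made-up sample values.

```python
import sqlite3

# In-memory SQLite database used only to demonstrate the constraint
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE weather_data (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        city TEXT NOT NULL,
        date TEXT NOT NULL,
        high_temp INTEGER,
        UNIQUE (city, date)
    )
""")

rows = [
    ("Qingdao", "2024-01-15", 5),
    ("Qingdao", "2024-01-15", 6),   # same (city, date): silently skipped
    ("Kaifeng", "2024-01-15", 3),   # different city: accepted
]
conn.executemany(
    "INSERT OR IGNORE INTO weather_data (city, date, high_temp) VALUES (?, ?, ?)",
    rows,
)
conn.commit()

stored = conn.execute(
    "SELECT city, date, high_temp FROM weather_data ORDER BY id"
).fetchall()
print(stored)  # [('Qingdao', '2024-01-15', 5), ('Kaifeng', '2024-01-15', 3)]
```

The first row for a given city and date wins, so re-running a crawl never duplicates rows, but it also never refreshes an existing one. If corrected values should overwrite old ones, MySQL's INSERT ... ON DUPLICATE KEY UPDATE is the usual alternative.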

2. Basic Scraper Implementation

2.1 Requesting and Parsing Pages

We start with a simple request that fetches the HTML of the target site:

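import requests
from bs4 import BeautifulSoup


def fetch_page(url):
    """Fetch a page and return its HTML, or None if the request fails."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    }
    try:
        # A timeout keeps a stalled connection from hanging the crawler
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()  # turn 4xx/5xx responses into exceptions
        return response.text
    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")
        return None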

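2.2 Data Extraction Techniques

BeautifulSoup offers several ways to locate elements, including find(), find_all(), and CSS selectors through select(). The example here uses find_all() and find():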

def parse_weather_data(html):
    """Parse one page of HTML into a list of weather records."""
    soup = BeautifulSoup(html, 'html.parser')
    weather_list = []
    # Example: assumes each record lives in a <div class="weather-item">
    for item in soup.find_all('div', class_='weather-item'):
        date = item.find('span', class_='date').text.strip()
        high_temp = item.find('span', class_='high-temp').text.replace('℃', '').strip()
        # Extract the other fields the same way...
        weather_data = {
            'date': date,
            'high_temp': int(high_temp),
            # other fields...
        }
        weather_list.append(weather_data)
    return weather_list
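Calling int() directly on scraped text raises ValueError as soon as a cell is empty or uses a different unit symbol. A small helper that pulls out the first signed integer and returns None otherwise is more forgiving. This is a sketch; `parse_temperature` is a hypothetical name, and the accepted formats are assumptions about what a weather page might contain.

```python
import re
from typing import Optional

# First run of digits, optionally preceded by a minus sign
_TEMP_PATTERN = re.compile(r"-?\d+")


def parse_temperature(text: Optional[str]) -> Optional[int]:
    """Return the first integer found in text, or None if there is none."""
    if not text:
        return None
    # Some pages use the Unicode minus sign (U+2212) instead of a hyphen
    normalised = text.replace("\u2212", "-")
    match = _TEMP_PATTERN.search(normalised)
    return int(match.group()) if match else None


print(parse_temperature("25℃"))    # 25
print(parse_temperature("-3°C"))   # -3
print(parse_temperature("N/A"))    # None
```

A None result maps directly onto the nullable high_temp and low_temp columns, so a missing reading does not abort the whole page.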

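3. Handling Anti-Scraping Mechanisms in Practice

Real websites usually deploy a range of anti-scraping measures, and each one needs a targeted response.

3.1 Common Anti-Scraping Techniques and Countermeasures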

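Anti-scraping type	Detection method	Countermeasure
User-Agent check	Inspects the UA in the request headers	Rotate the UA at random
IP rate limiting	Request frequency per IP	Use a pool of proxy IPs
Behavior analysis	Mouse movement and click patterns	Mimic human timing between actions
CAPTCHA	Image or slider challenge	Recognition service or manual handling
Dynamic rendering	Content loaded by JavaScript	Use Selenium/Puppeteer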

3.2 Optimizing Request Headers

A complete set of request headers, resembling what a real browser sends, looks like this:

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
    # requests can only decode 'br' responses if the brotli package is installed;
    # drop 'br' here if response.text comes back garbled
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive',
    'Referer': 'https://www.example.com/',
    'Upgrade-Insecure-Requests': '1',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'same-origin',
    'Sec-Fetch-User': '?1',
    'Cache-Control': 'max-age=0'
}
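Section 3.1 lists random UA rotation as the countermeasure for User-Agent checks. One minimal way to combine it with a fixed header set is sketched below. `USER_AGENTS`, `BASE_HEADERS`, and `build_headers` are hypothetical names, and the UA strings are sample values that you would replace with ones matching browsers you actually want to imitate.

```python
import random

# Sample User-Agent strings (illustrative; keep your own list up to date)
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

# Headers that stay the same on every request
BASE_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.8,en;q=0.2",
    "Connection": "keep-alive",
}


def build_headers() -> dict:
    """Return a fresh header dict with a randomly chosen User-Agent."""
    headers = dict(BASE_HEADERS)           # copy, so the base dict is never mutated
    headers["User-Agent"] = random.choice(USER_AGENTS)
    return headers


print(build_headers()["User-Agent"])
```

Passing `headers=build_headers()` to each `requests.get` call gives every request its own UA. Rotating the UA while keeping the same IP and cookies can itself look suspicious to some sites, so treat this as one tool rather than a complete solution.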

3.3 Controlling Request Frequency

The key to avoiding a ban is to mimic the pace of a human browsing the site:

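import random
import time


def random_delay():
    time.sleep(random.uniform(1, 3))  # random delay of 1-3 seconds


def crawl_city_data(city):
    url = f"https://example.com/weather/{city}"
    html = fetch_page(url)
    random_delay()
    if html is None:
        # fetch_page returns None on failure, so there is nothing to parse
        return []
    records = parse_weather_data(html)
    for record in records:
        record['city'] = city  # the insert step expects a 'city' key
    return records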
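A fixed sleep after every request adds the delay on top of whatever time parsing and storage already took. A throttle that only sleeps for the part of the gap that has not yet elapsed keeps the same politeness with less idle time. The sketch below assumes a hypothetical `Throttle` class; the clock and sleep functions are injectable so the logic can be checked without actually waiting.

```python
import random
import time


class Throttle:
    """Enforce a random minimum gap between consecutive requests."""

    def __init__(self, min_delay=1.0, max_delay=3.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self.max_delay = max_delay
        self._clock = clock
        self._sleep = sleep
        self._last_request = None   # time of the previous request, if any

    def wait(self) -> float:
        """Sleep only as long as needed; return the number of seconds slept."""
        slept = 0.0
        if self._last_request is not None:
            target_gap = random.uniform(self.min_delay, self.max_delay)
            elapsed = self._clock() - self._last_request
            if elapsed < target_gap:
                slept = target_gap - elapsed
                self._sleep(slept)
        self._last_request = self._clock()
        return slept
```

In a crawler you would create one `Throttle()` and call `throttle.wait()` immediately before each HTTP request. The first call returns at once; later calls pause only if the previous request was less than 1 to 3 seconds ago.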

4. Data Storage and Optimization

4.1 Wrapping Database Operations

A connection pool makes database operations more efficient:

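import pymysql
from dbutils.pooled_db import PooledDB


class MySQLHelper:
    def __init__(self):
        self.pool = PooledDB(
            creator=pymysql,
            host='localhost',
            user='root',
            password='password',
            database='weather',
            charset='utf8mb4',   # match the table's character set
            maxconnections=5,
            blocking=True        # wait for a free connection instead of raising
        )

    def _run(self, sql, args, many):
        conn = self.pool.connection()
        cursor = conn.cursor()
        try:
            if many:
                cursor.executemany(sql, args)
            else:
                cursor.execute(sql, args)
            conn.commit()
            return cursor.rowcount
        except Exception:
            conn.rollback()
            raise
        finally:
            cursor.close()
            conn.close()  # returns the connection to the pool

    def execute(self, sql, args=None):
        """Run a single statement and return the affected row count."""
        return self._run(sql, args, many=False)

    def execute_many(self, sql, params):
        """Run one statement against a list of parameter tuples."""
        return self._run(sql, params, many=True)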

4.2 Batch Insert Optimization

Inserting rows one at a time is slow; batching the inserts improves throughput:

def batch_insert_weather_data(data_list):
    # INSERT IGNORE skips rows that collide with the (city, date) unique index
    sql = """INSERT IGNORE INTO weather_data
             (city, date, high_temp, low_temp, weather, wind_direction)
             VALUES (%s, %s, %s, %s, %s, %s)"""
    db = MySQLHelper()
    batch_size = 100  # 100 rows per batch
    for i in range(0, len(data_list), batch_size):
        batch = data_list[i:i+batch_size]
        params = [(d['city'], d['date'], d['high_temp'], d['low_temp'],
                   d['weather'], d['wind_direction']) for d in batch]
        db.execute_many(sql, params)
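Slicing by index works when all records are already in a list. If records arrive from a generator, for example while pages are still being parsed, the same batching idea can be written as a small generator that never holds more than one batch in memory. `chunked` is a hypothetical helper name used for this sketch.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunked(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` items from any iterable."""
    if size <= 0:
        raise ValueError("size must be a positive integer")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:               # final, possibly shorter, batch
        yield batch


print(list(chunked(range(5), 2)))  # [[0, 1], [2, 3], [4]]
```

Each yielded batch can be turned into parameter tuples and handed to a bulk insert in the same way as the sliced batches.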

5. Extending and Optimizing the Project

5.1 Exception Handling and Logging

Thorough exception handling is essential in production-grade code:

import logging

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('weather_crawler.log', encoding='utf-8'),  # log to file
        logging.StreamHandler()                                        # and to the console
    ]
)


def safe_crawl(city):
    """Crawl and store one city, logging the outcome instead of crashing."""
    try:
        data = crawl_city_data(city)
        if data:
            batch_insert_weather_data(data)
            logging.info(f"{city}: collected {len(data)} records")
        else:
            logging.warning(f"{city}: no data retrieved")
    except Exception as e:
        # exc_info=True writes the full traceback to the log
        logging.error(f"{city}: collection failed: {str(e)}", exc_info=True)

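5.2 Distributed Crawler Architecture

As the data volume grows, consider a distributed architecture:

Master node: schedules URLs and assigns tasks
Worker nodes: run the actual scraping tasks
Message queue: Redis or RabbitMQ for communication
Deduplication: a Bloom filter or a Redis set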
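To make the deduplication item concrete, here is a minimal in-process Bloom filter built only on the standard library. It is a teaching sketch under stated assumptions: the class name, bit-array size, and number of hashes are arbitrary choices, and a real distributed setup would keep the bit array in shared storage such as Redis rather than in one worker's memory.

```python
import hashlib


class BloomFilter:
    """Probabilistic set: no false negatives, a small chance of false positives."""

    def __init__(self, size_bits: int = 8192, num_hashes: int = 4):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self._bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive num_hashes bit positions by salting SHA-256 with a seed
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self._bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self._bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


seen = BloomFilter()
url = "https://example.com/weather/qingdao"
if url not in seen:
    seen.add(url)        # first sighting: schedule the URL and remember it
print(url in seen)       # True
```

A Bloom filter trades exactness for memory: a URL it reports as seen may occasionally be new, so a few pages can be skipped, whereas a Redis set is exact but grows with every URL stored.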

5.3 Data Quality Monitoring

Set up checks that validate each record before it is stored:

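import re


def data_quality_check(data):
    # Temperature sanity check
    if data['high_temp'] < data['low_temp']:
        raise ValueError("High temperature is lower than low temperature")
    # Date format validation (YYYY-MM-DD)
    if not re.match(r'^\d{4}-\d{2}-\d{2}$', data['date']):
        raise ValueError("Invalid date format")
    # Other business-rule checks...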
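A pattern such as `\d{4}-\d{2}-\d{2}` only checks the shape of the string, so a value like 2024-13-45 still passes. Pairing the pattern with `datetime.strptime` also confirms that the date exists on the calendar. The helper name `is_valid_date` is hypothetical; the sketch assumes dates are expected in YYYY-MM-DD form.

```python
import re
from datetime import datetime

_DATE_SHAPE = re.compile(r"\d{4}-\d{2}-\d{2}")


def is_valid_date(text) -> bool:
    """True only for zero-padded YYYY-MM-DD strings that name a real date."""
    if not isinstance(text, str) or not _DATE_SHAPE.fullmatch(text):
        return False
    try:
        datetime.strptime(text, "%Y-%m-%d")   # rejects month 13, Feb 30, etc.
    except ValueError:
        return False
    return True


print(is_valid_date("2024-02-29"))  # True  (2024 is a leap year)
print(is_valid_date("2023-02-29"))  # False (2023 is not)
print(is_valid_date("2024-13-45"))  # False (correct shape, impossible date)
```

The shape check is still worth keeping because `strptime` on its own accepts unpadded values such as 2024-1-5.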

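In a real project, the scraper is only the first stage of the data pipeline. A complete data engineering workflow also covers data cleaning, storage, analysis, and visualization. This weather data project may look simple, but it already contains most of the core elements of a real project.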
