
Winter Break Study Notes 2.6

news 2026/3/26 20:32:43

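I. Practice Exercise: Hands-On Project, Douban Movie Top 250 Data Collection and Analysis
Project goals
Scrape the title, year, rating, and number of ratings for each film in the Douban Movie Top 250

Clean the data and run statistical analysis

Visualize the results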

Step 1: Data collection
python
import csv
import time

import requests
from bs4 import BeautifulSoup


def fetch_douban_top250():
    """Scrape the Douban Movie Top 250 list."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    }

    movies = []
    for start in range(0, 250, 25):  # 10 pages, 25 films per page
        url = f'https://movie.douban.com/top250?start={start}'
        try:
            response = requests.get(url, headers=headers, timeout=10)
            response.raise_for_status()
            soup = BeautifulSoup(response.text, 'html.parser')

            # Each film sits in its own <div class="item">
            for item in soup.find_all('div', class_='item'):
                # Film title
                title = item.find('span', class_='title').text

                # Info paragraph: director and cast first, then a final
                # line of the form "year / country / genre"
                info = item.find('p', class_='').text.strip()

                # Year (simplified): first field of the last line.
                # Taking the last '/'-separated field would return the genre.
                last_line = info.splitlines()[-1]
                year = last_line.split('/')[0].strip() if '/' in last_line else 'unknown'

                # Rating
                rating = item.find('span', class_='rating_num').text

                # One-line quote; some films have none
                inq = item.find('span', class_='inq')
                comment = inq.text if inq else ''

                movies.append({
                    'title': title,
                    'year': year,
                    'rating': float(rating),
                    'comment': comment
                })

            print(f"Scraped page {start // 25 + 1}, total so far: {len(movies)}")
            time.sleep(2)  # polite delay to avoid getting blocked
        except Exception as e:
            print(f"Scrape failed: {e}")
            break

    return movies


# Run the scraper
movies = fetch_douban_top250()
print(f"Scraped {len(movies)} films in total")

# Save to CSV
with open('douban_top250.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['title', 'year', 'rating', 'comment'])
    writer.writeheader()
    writer.writerows(movies)
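The project goals mention the number of ratings, but the scraper above stores each film's one-line quote in `comment` and never records the count. Below is a minimal sketch of how the count could be parsed, assuming the list page shows it as text such as `3105678人评价` next to the score. That is an assumption about the page markup and should be checked against the live HTML. `parse_rating_count` is a hypothetical helper, not part of the original script, and the sample strings are made up.

```python
import re


def parse_rating_count(text):
    """Return the integer in a string like '3105678人评价', or None if absent."""
    match = re.search(r'(\d+)\s*人评价', text)
    return int(match.group(1)) if match else None


# Sample strings are made up for illustration
print(parse_rating_count('3105678人评价'))
print(parse_rating_count('no count here'))
```

If the markup matches the assumption, the parsed value could be stored under an extra key in each `movies` entry, with the same key added to `fieldnames` when writing the CSV.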
Step 2: Data analysis
python
import pandas as pd
import numpy as np

# Load the data
df = pd.read_csv('douban_top250.csv')

# Overview
print("Shape:", df.shape)
print("\nFirst 5 rows:")
print(df.head())

print("\nData types:")
print(df.dtypes)

print("\nSummary statistics:")
print(df['rating'].describe())

# Clean the year column: cast to str first so .str works even if pandas
# parsed the column as integers, then pull out the four-digit year
df['year'] = df['year'].astype(str).str.extract(r'(\d{4})', expand=False).astype(float)
df = df.dropna(subset=['year'])  # drop rows with no usable year

# Group by decade
df['decade'] = (df['year'] // 10) * 10  # e.g. 1994 -> 1990
decade_stats = df.groupby('decade').agg({
    'title': 'count',
    'rating': ['mean', 'max', 'min']
}).round(2)

print("\nStats by decade:")
print(decade_stats)
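To make the decade bucketing concrete, here is a plain-Python sketch of the same `(year // 10) * 10` grouping, using only the standard library. The rows are made up for illustration and are not scraped data.

```python
from collections import defaultdict
from statistics import mean

# Made-up (year, rating) pairs, for illustration only
rows = [(1994, 9.7), (1993, 9.6), (1997, 9.5), (2001, 9.4), (2009, 9.4), (2010, 9.4)]

by_decade = defaultdict(list)
for year, rating in rows:
    decade = (year // 10) * 10  # 1994 -> 1990, 2001 -> 2000
    by_decade[decade].append(rating)

# One line per decade: decade, count, mean, max, min
for decade in sorted(by_decade):
    ratings = by_decade[decade]
    print(decade, len(ratings), round(mean(ratings), 2), max(ratings), min(ratings))
```

The pandas `groupby('decade').agg(...)` call computes the same per-bucket count, mean, max, and min in one step.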
Step 3: Data visualization
python
import matplotlib.pyplot as plt

# Configure a font with Chinese glyphs
plt.rcParams['font.sans-serif'] = ['SimHei']  # render Chinese text (e.g. film titles) correctly
plt.rcParams['axes.unicode_minus'] = False  # render the minus sign correctly

# 1. Rating distribution histogram
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.hist(df['rating'], bins=20, edgecolor='black', alpha=0.7)
plt.xlabel('Rating')
plt.ylabel('Number of films')
plt.title('Douban Top 250 rating distribution')
plt.grid(True, alpha=0.3)

# 2. Number of films per decade
plt.subplot(1, 2, 2)
decade_counts = df['decade'].value_counts().sort_index()
# decade is stored as float; cast to int so bars read "1990" rather than "1990.0"
plt.bar(decade_counts.index.astype(int).astype(str), decade_counts.values,
        color='skyblue', edgecolor='black')
plt.xlabel('Decade')
plt.ylabel('Number of films')
plt.title('Number of films per decade')
plt.xticks(rotation=45)

plt.tight_layout()
plt.savefig('movie_analysis1.png', dpi=300)
plt.show()

# 3. Release year vs. rating
plt.figure(figsize=(10, 6))
plt.scatter(df['year'], df['rating'], alpha=0.5)
plt.xlabel('Year')
plt.ylabel('Rating')
plt.title('Release year vs. rating')
z = np.polyfit(df['year'], df['rating'], 1)  # least-squares fit of a straight line
p = np.poly1d(z)
plt.plot(df['year'], p(df['year']), 'r--', alpha=0.8, label='Trend line')
plt.legend()
plt.grid(True, alpha=0.3)
plt.savefig('movie_analysis2.png', dpi=300)
plt.show()

# 4. The 10 highest-rated films
top10 = df.nlargest(10, 'rating')[['title', 'year', 'rating']]

plt.figure(figsize=(10, 6))
plt.barh(range(len(top10)), top10['rating'], color='gold')
plt.yticks(range(len(top10)), top10['title'])
plt.xlabel('Rating')
plt.title('Top 10 highest-rated films in the Douban Top 250')
# Print each rating just inside the right end of its bar
for i, (_, row) in enumerate(top10.iterrows()):
    plt.text(row['rating'] - 0.1, i, f"{row['rating']}",
             ha='right', va='center', fontweight='bold')
plt.tight_layout()
plt.savefig('movie_analysis3.png', dpi=300)
plt.show()

print("Analysis complete. Charts saved.")
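The trend line in chart 3 comes from `np.polyfit(x, y, 1)`, which returns the slope and intercept of an ordinary least-squares line, in that order. As a rough sketch of what that fit computes, here is the same calculation in plain Python. `fit_line` is a hypothetical helper, and the points are made up so that they lie exactly on a known line.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up points lying exactly on y = -0.01 * x + 29
years = [1950, 1975, 2000, 2020]
ratings = [-0.01 * y + 29 for y in years]
slope, intercept = fit_line(years, ratings)
print(round(slope, 4), round(intercept, 2))
```

A negative slope on the real data would mean that older films in the list tend to carry slightly higher ratings. The scatter plot shows how much spread surrounds that line.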
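IV. Problems Encountered and Solutions
Problem: requests come back with 403 Forbidden

Solution: set a realistic User-Agent so the request looks like it comes from a browser

Tip: rotate several User-Agent strings and add a Referer header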
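The tip above can be sketched as a small helper that builds a fresh header dict per request. The User-Agent strings are illustrative examples, the default Referer value is an assumption, and `build_headers` is a hypothetical name.

```python
import random

# Illustrative User-Agent strings; any current browser strings would do
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 '
    '(KHTML, like Gecko) Version/17.0 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0',
]


def build_headers(referer='https://movie.douban.com/'):
    """Pick a random User-Agent and attach a Referer header."""
    return {
        'User-Agent': random.choice(USER_AGENTS),
        'Referer': referer,
    }


headers = build_headers()
print(headers['Referer'])
```

In the scraper, `requests.get(url, headers=build_headers(), timeout=10)` would then send a different User-Agent on some requests. Keeping the delay between pages still matters more than rotation.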

Problem: Chinese text is garbled in charts

Solution: set plt.rcParams['font.sans-serif'] to a font with Chinese glyphs, such as 'SimHei'
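'SimHei' normally ships with Windows and may be missing on macOS or Linux. A minimal sketch of picking whichever Chinese-capable font is installed follows. The candidate list is an assumption, `pick_font` is a hypothetical helper, and the installed-font list in the example is made up.

```python
# Candidate font family names; which ones exist depends on the operating system
CANDIDATES = ['SimHei', 'Microsoft YaHei', 'PingFang SC',
              'Noto Sans CJK SC', 'WenQuanYi Micro Hei']


def pick_font(available, candidates=CANDIDATES):
    """Return the first candidate found in `available`, or None if none match."""
    available = set(available)
    for name in candidates:
        if name in available:
            return name
    return None


# Made-up list of installed fonts, for illustration
print(pick_font(['DejaVu Sans', 'PingFang SC']))
```

With matplotlib, the installed names can be collected as `{f.name for f in matplotlib.font_manager.fontManager.ttflist}`. A non-None result can then be assigned to `plt.rcParams['font.sans-serif']` as a one-element list.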

Problem: matplotlib charts do not appear

Solution: make sure plt.show() is called, or use %matplotlib inline in Jupyter

Problem: BeautifulSoup parsing is slow

Solution: install the lxml parser with pip install lxml, then use soup = BeautifulSoup(html, 'lxml')
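If the script may run on a machine where lxml is not installed, one option is to fall back to the built-in parser. This is a small sketch using only the standard library, and `choose_parser` is a hypothetical helper name.

```python
import importlib.util


def choose_parser():
    """Return 'lxml' if the lxml package is importable, else the built-in 'html.parser'."""
    return 'lxml' if importlib.util.find_spec('lxml') is not None else 'html.parser'


parser = choose_parser()
print(parser)
```

The result can be passed straight through as `BeautifulSoup(html, choose_parser())`.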

Problem: type mismatch in NumPy array operations

Solution: check the dtype and convert with astype()
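A typical case is numeric values that arrive as strings, for example scraped ratings. A short sketch with made-up values shows the check and the conversion.

```python
import numpy as np

# Made-up ratings stored as strings, as they might come out of a scraper
ratings = np.array(['9.7', '9.6', '9.5'])
print(ratings.dtype.kind)  # 'U' means a Unicode string dtype, so arithmetic is not possible yet

# Convert to float before doing any math
ratings_f = ratings.astype(float)
print(ratings_f.dtype)
print(ratings_f.mean())
```

`astype()` returns a new array and leaves the original untouched, so the result has to be assigned to a name.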

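II. Summary
Learned to use pip for package management and to work with virtual environments, which lays the groundwork for project development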