
Learning Python Web Scraping - Wild

news 2026/4/9 1:21:43

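First, download Python from the official website (Python Software Foundation). I installed it to the default path.

I prefer VS Code, with the Python and Jupyter extensions added.

Install the jupyter package: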

pip install jupyter -i https://pypi.tuna.tsinghua.edu.cn/simple

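Install the two packages the scraper needs, BeautifulSoup4 and xlwt (the examples below also import requests; if it is not already present, install it the same way with pip install requests):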

pip install bs4 -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install xlwt -i https://pypi.tuna.tsinghua.edu.cn/simple

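The complete programs other people post, which jump straight to fetching movie ratings, summaries and so on, are too hard for me to follow. After all, I'm just a beginner.

So I'm going to learn module by module.

Module 1: Python basics and HTTP requests (Requests)

Task 1: Fetch the HTML of a given web page (the Baidu homepage)

A basic fetching skeleton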

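import requests  # import the library

# 1. Define the URL to visit
url = "https://www.baidu.com/"
# 2. Send a GET request and get the response
response = requests.get(url)
# 3. Set the encoding; try removing this line and the output becomes garbled
response.encoding = "utf-8"
# 4. Print the response body (HTML)
print(response.text)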

You can see that the basic HTML content was fetched.
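To see why step 3 matters, here is a small offline sketch. It assumes the garbling comes from decoding UTF-8 bytes with the wrong codec (ISO-8859-1), which is a common cause but not necessarily the only one. The sample string is just an illustration, not something taken from the real response.

```python
# Bytes as a server might send them: the text encoded as UTF-8
raw = "百度一下，你就知道".encode("utf-8")

# Decoding with the wrong codec never fails for ISO-8859-1, it just yields nonsense
garbled = raw.decode("iso-8859-1")

# Decoding with the codec that was actually used restores the text
correct = raw.decode("utf-8")

print(garbled)
print(correct)
print(len(raw), len(garbled), len(correct))
```

Setting response.encoding tells requests which codec to use when it builds response.text from the raw bytes.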

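The headers (request headers)

Websites use anti-scraping techniques, one of which is checking the User-Agent field in the request headers to decide whether you are a browser or a scraper. So we disguise ourselves as a normal user visiting with a browser, which raises the chance that the request succeeds.

The User-Agent field is a string that tells the server something like: "I am a Chrome browser on Windows 10, my version is xxx, and this is the kind of content I can handle."

The most basic headers skeleton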

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
}
# Send the headers along with the request
response = requests.get(url, headers=headers)
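A quick way to check what a server would actually see is to build the request without sending it. This is only a sketch: it uses requests.utils.default_user_agent() and requests.Request(...).prepare(), and nothing goes over the network.

```python
import requests

# What requests announces itself as when no User-Agent is supplied
default_ua = requests.utils.default_user_agent()
print(default_ua)

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
}

# Build (but do not send) the request, then inspect the header that would go out
prepared = requests.Request("GET", "https://www.baidu.com/", headers=headers).prepare()
print(prepared.headers["User-Agent"])
```

The default value starts with "python-requests/", which is exactly the kind of string a server-side check can spot.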

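One fairly important construct for scrapers is try...except.

try...except is Python's exception-handling mechanism. It catches errors that may occur while the code runs (the network drops, the site is down, the request times out), so the program does not crash outright and instead prints an error message, which makes problems easier to track down.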
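Before involving the network, the mechanism can be tried on something tiny. The helper name parse_port below is hypothetical and only exists for this sketch.

```python
def parse_port(text):
    """Hypothetical helper: convert text to an int, returning None on bad input."""
    try:
        return int(text)
    except ValueError as e:
        # int() raised, but the program keeps running
        print(f"Caught an exception: {e}")
        return None


print(parse_port("8080"))
print(parse_port("eighty"))
print("still running")
```

The second call fails inside int(), yet the last print still executes because the error was handled.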

Now let's combine the previous points and apply try...except. This is the normal case:

import requests
from bs4 import BeautifulSoup

try:
    # Core logic: request the Baidu homepage and extract the description
    resp = requests.get("https://www.baidu.com", headers={"User-Agent": "Chrome/120"})
    desc = BeautifulSoup(resp.text, "html.parser").find("meta", {"name": "description"})["content"]
    print("Normal result:", desc)
except Exception as e:
    print(f"Caught an exception: {e}")

The Baidu description is fetched normally.

Now let's deliberately use a wrong URL to show what try...except does:

import requests
from bs4 import BeautifulSoup

try:
    resp = requests.get("https://www.baidu.com/error-url", headers={"User-Agent": "Chrome/120"})
    desc = BeautifulSoup(resp.text, "html.parser").find("meta", {"name": "description"})["content"]
    print("Normal result:", desc)
except Exception as e:
    print(f"Caught an exception: {e}")

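Because this URL does not exist, there is no description to extract. The program did not crash on the bad address; it told us an exception was caught and showed the message. (requests.get does not raise for a missing page by itself; the exception here most likely comes from find() returning None, which cannot be indexed with ["content"].)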
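The failure mode can be reproduced offline with two hand-written HTML strings. This is a sketch under the assumption that the error page simply lacks a description meta tag; extract_description is a hypothetical helper name, and the HTML snippets are made up for the demonstration.

```python
from bs4 import BeautifulSoup


def extract_description(html):
    """Hypothetical helper: return the description meta content, or None if missing."""
    tag = BeautifulSoup(html, "html.parser").find("meta", {"name": "description"})
    if tag is None:
        # find() returns None when nothing matches, so check before indexing
        return None
    return tag.get("content")


with_meta = '<html><head><meta name="description" content="demo page"></head></html>'
without_meta = "<html><head><title>404</title></head></html>"

print(extract_description(with_meta))
print(extract_description(without_meta))
```

Checking for None first is an alternative to relying on try...except for this particular case.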

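Suppose we had not used try...except:

The program would crash outright and stop running.

The requests library also comes with many exception classes of its own, which make catching more precise.

You can report exactly which kind of error occurred, instead of a generic "caught an exception".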

import requests

try:
    url = "https://www.baidu.com/error-url"
    # url = "https://www.fake-domain-12345.com"
    headers = {"User-Agent": "Chrome/120"}
    response = requests.get(url, headers=headers, timeout=10)
    # response = requests.get(url, headers=headers, timeout=0.01)
except requests.exceptions.Timeout:
    print("The request timed out!")
except requests.exceptions.ConnectionError:
    print("Cannot reach the network, or the site is down!")
except Exception as e:
    print(f"Other unknown error: {e}")
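The order of the except clauses matters because these exception classes form a hierarchy. The sketch below only inspects the classes in requests.exceptions and sends nothing. The describe helper is hypothetical and mirrors the order of the clauses above.

```python
import requests

exc = requests.exceptions

# Both specific classes derive from the common base RequestException
print(issubclass(exc.Timeout, exc.RequestException))
print(issubclass(exc.ConnectionError, exc.RequestException))
# ConnectTimeout is both a Timeout and a ConnectionError
print(issubclass(exc.ConnectTimeout, exc.Timeout))
print(issubclass(exc.ConnectTimeout, exc.ConnectionError))


def describe(error):
    """Hypothetical helper: label an exception the way an except chain would."""
    if isinstance(error, exc.Timeout):
        return "timeout"
    if isinstance(error, exc.ConnectionError):
        return "connection error"
    if isinstance(error, exc.RequestException):
        return "other requests error"
    return "non-requests error"


print(describe(exc.ConnectTimeout()))
print(describe(exc.HTTPError()))
```

Since a connect timeout matches both specific classes, whichever clause comes first wins, so the most specific message should be listed first.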

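After you run response = requests.get(url), response is an object holding everything the server returned. Commonly used attributes:

Attribute	Purpose
`response.status_code`	HTTP status code (200 = success, 404 = not found, 500 = server error)
`response.text`	Response body as text (for example HTML)
`response.content`	Response body as raw bytes (for example images, files)
`response.encoding`	Encoding used to decode the text (guessed from the HTTP headers; falls back to ISO-8859-1 when a text response declares no charset, which easily causes garbled text)
`response.apparent_encoding`	Encoding detected from the body itself (often more accurate, used to fix garbled text)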
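To poke at these attributes without depending on a real site, one option is a throwaway server on 127.0.0.1. Everything about the server below (the DemoHandler class, its paths and its HTML) is an assumption made for this sketch; only the requests attributes are the ones from the table.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical local handler: 200 for "/", 404 for everything else."""

    def do_GET(self):
        if self.path == "/":
            status, body = 200, "<html><body>你好, scraper</body></html>".encode("utf-8")
        else:
            status, body = 404, b"<html><body>not found</body></html>"
        self.send_response(status)
        # Deliberately no charset, so requests has to guess the encoding
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0 = pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_port}"

session = requests.Session()
session.trust_env = False  # ignore proxy environment variables for this local demo

ok = session.get(base + "/", timeout=5)
missing = session.get(base + "/no-such-page", timeout=5)
server.shutdown()
server.server_close()

print(ok.status_code, missing.status_code)
print(type(ok.content).__name__, type(ok.text).__name__)

declared = ok.encoding  # what requests guessed from the headers alone
print(declared)

ok.encoding = "utf-8"  # override the guess, then .text decodes correctly
print("你好" in ok.text)
```

Note that the 404 response comes back as a normal object with status_code set; no exception is raised unless you ask for one.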

if __name__ == "__main__"

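The code inside the if block runs only when this Python file is executed directly. If the file is imported as a module by another file, the code inside the if block does not run.

Its purpose is to keep code from running by accident, for example when someone imports your module and it immediately starts scraping a site.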
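Both situations can be simulated in one script with the standard-library runpy module. This is a sketch: the module name demo_spider and its contents are made up, and run_name is used to stand in for "imported" versus "run directly".

```python
import pathlib
import runpy
import tempfile

# Source of a tiny throwaway module with a __main__ guard
source = '''
calls = []

def crawl():
    calls.append("crawled")

if __name__ == "__main__":
    crawl()
'''

with tempfile.TemporaryDirectory() as tmp:
    path = pathlib.Path(tmp) / "demo_spider.py"
    path.write_text(source, encoding="utf-8")
    # __name__ is "demo_spider": behaves like an import, the guard stays closed
    as_import = runpy.run_path(str(path), run_name="demo_spider")
    # __name__ is "__main__": behaves like running the file directly
    as_script = runpy.run_path(str(path), run_name="__main__")

print(as_import["calls"])
print(as_script["calls"])
```

Only the second run records a call to crawl(), which is the behavior the guard is meant to provide.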

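import requests

def get_baidu_html():
    """Simplest function that fetches the Baidu homepage HTML and catches exceptions."""
    try:
        # 1. Send the request (minimal header, only to avoid basic blocking)
        response = requests.get("https://www.baidu.com", headers={"User-Agent": "Python"})
        # 2. Verify that the request succeeded (raises on 4xx/5xx status codes)
        response.raise_for_status()
        # 3. Fix garbled text before reading response.text
        response.encoding = response.apparent_encoding
        print("Successfully fetched the Baidu homepage HTML:")
        print(response.text)
    except Exception as e:
        # Catch every exception and print a short message
        print(f"Something went wrong: {e}")

# Call the function only when this file is run directly
if __name__ == "__main__":
    get_baidu_html()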

Now let's combine everything above to scrape the Baidu homepage in a simple way:

import requests

def get_baidu_html():
    # 1. Define the request headers (imitate a browser)
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    }
    # 2. Send the GET request
    url = "https://www.baidu.com"
    try:
        response = requests.get(url, headers=headers)
        # 3. Check whether the request succeeded (status code 200)
        if response.status_code == 200:
            # 4. Detect the encoding automatically to avoid garbled text
            response.encoding = response.apparent_encoding
            # 5. Print the first 500 characters
            print("Baidu homepage HTML:")
            print(response.text[:500])
        else:
            print(f"Request failed, status code: {response.status_code}")
    except Exception as e:
        print(f"Request error: {e}")

if __name__ == "__main__":
    get_baidu_html()