
Selenium for Dynamic Web Scraping from Scratch: A Complete Weibo Hot Search Walkthrough with Anti-Scraping Strategies

news 2026/4/30 9:08:04

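In web scraping, static pages are the easy case: the requests library plus BeautifulSoup is usually enough. Sites such as Weibo, Douyin, and Zhihu, which render much of their content with JavaScript, defeat that approach. The HTML returned by the request does not contain the data you want, because the data is fetched by asynchronous AJAX requests and inserted into the page after the browser loads it.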
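A minimal sketch of that problem, using only the standard library. The HTML below is a hypothetical shell, not a response captured from any real site: it stands in for what a plain HTTP client might receive when the table is filled in later by JavaScript. `STATIC_SHELL`, `RowCounter`, and `count_rows` are illustrative names.

```python
from html.parser import HTMLParser

# Hypothetical HTML shell: the table body stays empty until the script runs,
# which never happens in a plain HTTP client.
STATIC_SHELL = """
<html><body>
  <table><tbody id="list"></tbody></table>
  <script src="/app.js"></script>
</body></html>
"""


class RowCounter(HTMLParser):
    """Counts <tr> start tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.rows = 0

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows += 1


def count_rows(html: str) -> int:
    parser = RowCounter()
    parser.feed(html)
    return parser.rows


print(count_rows(STATIC_SHELL))  # 0: no rows exist before JavaScript runs
```

A static parser can only see what is in the response body, so the row count is zero no matter how the parser is written.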

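This is where Selenium becomes useful. It drives a real browser, loads the page, and executes its JavaScript, so the scraper sees the same content a user sees in the browser. Using the Weibo hot search list as a representative dynamic page, this article covers environment setup, a complete code implementation, and ways to handle anti-scraping measures.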

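1. Why use Selenium to scrape dynamic pages?

Before the hands-on part, one question is worth settling: why reach for Selenium first when facing a dynamic page, and how does it compare with the alternatives?

1.1 Common approaches to scraping dynamic pages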

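Approach	How it works	Advantages	Disadvantages	Best suited for
Calling the AJAX endpoints directly	Inspect the browser's network requests and call the data endpoints yourself	Fast, efficient, low resource usage	Endpoints are hard to analyze, easily caught by anti-scraping checks, and may change often	Sites with clear endpoints and simple encryption
Selenium	Drives a real browser and renders the full page	No endpoint analysis, good compatibility, handles complex JS logic	Slower, high resource usage	Sites with complex endpoints, strict encryption, or complex JS logic
Playwright	A newer browser automation tool from Microsoft	Often faster and more stable than Selenium; drives Chromium, Firefox, and WebKit	Ecosystem less mature than Selenium's, slightly steeper learning curve	New projects and performance-sensitive work
Pyppeteer	A Python port of Puppeteer	Built on the Chrome DevTools Protocol, fast	Not actively maintained, more compatibility problems	Small projects and personal use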

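1.2 Core strengths of Selenium

For most developers, and especially for people new to scraping, Selenium gives the best return on effort:

No endpoint analysis: there is no need to spend hours reverse-engineering a site's AJAX requests and encryption
What you see is what you get: almost anything visible in the browser can be retrieved with Selenium
Rich interaction: it can simulate clicks, typing, scrolling, drag and drop, and other user actions
Mature ecosystem: documentation and community support are extensive, so solutions to common problems are easy to find
Cross-browser support: it works with Chrome, Firefox, Edge, Safari, and other mainstream browsers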

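2. Environment setup: a one-pass configuration guide

Good work starts with good tools, so we begin by getting a working Selenium environment.

2.1 Install Python and the required libraries

Make sure Python 3.7 or later is installed (recent Selenium 4 releases require a newer Python, so check the requirements of the version you install), then install the following libraries with pip: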

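# Install the Selenium core library
pip install selenium
# Manage browser drivers automatically (strongly recommended, no manual download)
pip install webdriver-manager
# Data processing and storage
pip install pandas
# time and random, used for random request intervals, ship with Python and need no installation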

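2.2 Automatic browser driver management

With older Selenium setups you had to download the driver matching your browser version by hand, and repeat the download after every browser update, which was tedious. webdriver-manager removes that chore: it detects the locally installed browser version and downloads the matching driver. Selenium 4.6 and later also bundle Selenium Manager, which can resolve the driver automatically when no driver path is supplied.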

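3. Analyzing the Weibo hot search page

Before writing any code, analyze the target page carefully. This is the most important step in building a scraper.

3.1 Page structure

Open the Weibo hot search page: https://s.weibo.com/top/summary?cate=realtimehot

Press F12 to open the developer tools and switch to the "Elements" tab to see the page's HTML structure. Use the element picker to click a hot search entry: each entry sits in a <tr> tag, and all entries sit inside a single <tbody> tag.

Each entry is structured roughly as follows: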

<tr class="">
  <td class="td-01 ranktop">1</td>
  <td class="td-02">
    <a href="/weibo?q=%23...%23" target="_blank">hot search title</a>
    <span>heat value</span>
  </td>
  <td class="td-03"><i class="icon-txt icon-txt-hot">热</i></td>
</tr>
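To make the mapping from markup to fields concrete, here is a small sketch that parses one row of this shape with the standard library. The title and heat values in `SAMPLE_ROW` are placeholders, and `HotRowParser` and `parse_row` are illustrative names. The Selenium code later in the article reads the same cells through XPath instead.

```python
from html.parser import HTMLParser

# One row in the shape shown above; the title and heat are placeholder values.
SAMPLE_ROW = """
<tr class="">
  <td class="td-01 ranktop">1</td>
  <td class="td-02">
    <a href="/weibo?q=%23...%23" target="_blank">Sample topic</a>
    <span>123456</span>
  </td>
  <td class="td-03"><i class="icon-txt icon-txt-hot">热</i></td>
</tr>
"""


class HotRowParser(HTMLParser):
    # Maps (cell class, tag) to the field whose text that tag carries.
    FIELDS = {
        ("td-01", "td"): "rank",
        ("td-02", "a"): "title",
        ("td-02", "span"): "heat",
        ("td-03", "i"): "tag",
    }

    def __init__(self):
        super().__init__()
        self.row = {"rank": "", "title": "", "link": "", "heat": "", "tag": ""}
        self._cell = None   # first class name of the <td> being read
        self._field = None  # field that the current text belongs to

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "td":
            classes = (attrs.get("class") or "").split()
            self._cell = classes[0] if classes else None
        if tag == "a" and self._cell == "td-02":
            self.row["link"] = attrs.get("href", "")
        self._field = self.FIELDS.get((self._cell, tag))

    def handle_endtag(self, tag):
        self._field = None
        if tag == "td":
            self._cell = None

    def handle_data(self, data):
        if self._field:
            self.row[self._field] += data.strip()


def parse_row(html: str) -> dict:
    parser = HotRowParser()
    parser.feed(html)
    return parser.row


print(parse_row(SAMPLE_ROW))
```

The rank comes from the first cell, the title, link, and heat from the second, and the tag (such as 热, "hot") from the third.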

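3.2 Dynamic loading behavior

The Weibo hot search page has one notable behavior: scrolling to the bottom automatically loads more historical hot searches. The real-time list, however, loads all 50 entries by default, so scroll loading does not need to be handled here. We still need to wait until the page has finished loading; otherwise element lookups may fail.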
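Waiting for the page boils down to polling a condition until it holds or a timeout expires. The sketch below shows that idea in plain Python. `wait_until` and the simulated `rows_loaded` check are illustrative names, not Selenium APIs; Selenium's own WebDriverWait follows the same poll-until-timeout pattern.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Simulated page: the rows only "appear" on the third check.
checks = {"count": 0}


def rows_loaded():
    checks["count"] += 1
    return ["row"] * 50 if checks["count"] >= 3 else []


rows = wait_until(rows_loaded, timeout=2.0, poll_interval=0.01)
print(len(rows), checks["count"])  # 50 3
```

Compared with a fixed sleep, polling returns as soon as the content is ready and fails loudly when it never arrives.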

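3.3 Scraping workflow

Based on this analysis, the complete workflow is:

Launch the browser → open the hot search page → wait for the table to load → locate all <tr> rows → extract rank, title, link, heat, and tag from each row → save to CSV → close the browser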

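4. Core implementation: from basic to refined

Now for the code. We start with the most basic version and improve it step by step, adding exception handling, anti-scraping measures, and more.

4.1 Basic version: the core scraping logic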

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
import pandas as pd
import time


def scrape_weibo_hot_search():
    # Start Chrome
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

    try:
        # Open the Weibo hot search page
        url = "https://s.weibo.com/top/summary?cate=realtimehot"
        driver.get(url)

        # Wait for the page to load (fixed 3-second sleep, improved later)
        time.sleep(3)

        # Locate all hot search rows
        hot_items = driver.find_elements(By.XPATH, '//tbody/tr')

        # Collected results
        hot_search_list = []

        # Walk through each row and extract its fields
        for index, item in enumerate(hot_items):
            try:
                # Skip the pinned entry, if any ("置顶" is the page's label for "pinned")
                if index == 0 and "置顶" in item.text:
                    continue

                # Rank
                rank = item.find_element(By.XPATH, './td[1]').text.strip()

                # Title and link
                title_element = item.find_element(By.XPATH, './td[2]/a')
                title = title_element.text.strip()
                link = title_element.get_attribute('href')

                # Heat value
                heat = item.find_element(By.XPATH, './td[2]/span').text.strip()

                # Tag, such as "热" (hot), "爆" (viral), or "新" (new)
                tag_elements = item.find_elements(By.XPATH, './td[3]/i')
                tag = tag_elements[0].text.strip() if tag_elements else ""

                # Append to the result list
                hot_search_list.append({
                    'rank': rank,
                    'title': title,
                    'heat': heat,
                    'tag': tag,
                    'link': link
                })

                print(f"Scraped: {rank}-{title}")

            except Exception as e:
                print(f"Error while scraping entry {index + 1}: {e}")
                continue

        # Save the data to a CSV file
        df = pd.DataFrame(hot_search_list)
        df.to_csv('weibo_hot_search.csv', index=False, encoding='utf-8-sig')
        print(f"\nDone. Collected {len(hot_search_list)} entries, saved to weibo_hot_search.csv")

        return hot_search_list

    except Exception as e:
        print(f"Error during scraping: {e}")
        return None

    finally:
        # Close the browser whether or not an error occurred
        driver.quit()


if __name__ == "__main__":
    scrape_weibo_hot_search()

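4.2 Improved version: explicit waits and anti-scraping measures

The basic version runs, but it has several problems:

It relies on a fixed time.sleep(3), which is both slow and unreliable
It does not set a realistic User-Agent, so it is easy to flag as a scraper
It uses no request interval, and frequent requests can get the client blocked

The version below addresses these: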

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
import time
import random


def scrape_weibo_hot_search_optimized():
    # Configure Chrome options
    chrome_options = webdriver.ChromeOptions()

    # Present a regular desktop User-Agent
    chrome_options.add_argument('--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36')

    # Disable automation traits to make detection harder
    chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
    chrome_options.add_experimental_option('useAutomationExtension', False)

    # Start the browser
    driver = webdriver.Chrome(
        service=Service(ChromeDriverManager().install()),
        options=chrome_options
    )

    # Inject JavaScript into every new document to hide the webdriver property
    driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
        "source": """
            Object.defineProperty(navigator, 'webdriver', {
                get: () => undefined
            })
        """
    })

    try:
        url = "https://s.weibo.com/top/summary?cate=realtimehot"
        driver.get(url)

        # Explicit wait: block until the tbody element is present, up to 10 seconds
        wait = WebDriverWait(driver, 10)
        wait.until(EC.presence_of_element_located((By.XPATH, '//tbody')))

        # Random 1-2 second pause to mimic human behavior
        time.sleep(random.uniform(1, 2))

        hot_items = driver.find_elements(By.XPATH, '//tbody/tr')
        hot_search_list = []

        for index, item in enumerate(hot_items):
            try:
                # Skip the pinned entry, if any ("置顶" is the page's label for "pinned")
                if index == 0 and "置顶" in item.text:
                    continue

                rank = item.find_element(By.XPATH, './td[1]').text.strip()
                title_element = item.find_element(By.XPATH, './td[2]/a')
                title = title_element.text.strip()
                link = title_element.get_attribute('href')
                heat = item.find_element(By.XPATH, './td[2]/span').text.strip()

                tag_elements = item.find_elements(By.XPATH, './td[3]/i')
                tag = tag_elements[0].text.strip() if tag_elements else ""

                hot_search_list.append({
                    'rank': rank,
                    'title': title,
                    'heat': heat,
                    'tag': tag,
                    'link': link
                })

                print(f"Scraped: {rank}-{title}")

                # Random 0.5-1 second pause between rows (parsing only; no new request is sent here)
                time.sleep(random.uniform(0.5, 1))

            except Exception as e:
                print(f"Error while scraping entry {index + 1}: {e}")
                continue

        df = pd.DataFrame(hot_search_list)
        df.to_csv('weibo_hot_search_optimized.csv', index=False, encoding='utf-8-sig')
        print(f"\nDone. Collected {len(hot_search_list)} entries, saved to weibo_hot_search_optimized.csv")

        return hot_search_list

    except Exception as e:
        print(f"Error during scraping: {e}")
        return None

    finally:
        driver.quit()


if __name__ == "__main__":
    scrape_weibo_hot_search_optimized()
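The pauses in this function happen while rows already in the page are being read. If the scraper loads the page repeatedly, the gap between page loads is what the site actually observes. The sketch below is one way to enforce a randomized minimum gap between loads. `RequestThrottle` is a hypothetical helper, and the 3 to 8 second default range is an assumed value, not a documented Weibo limit.

```python
import random
import time


class RequestThrottle:
    """Enforces a randomized minimum gap between consecutive page loads."""

    def __init__(self, min_gap=3.0, max_gap=8.0, clock=time.monotonic, sleep=time.sleep):
        self.min_gap = min_gap
        self.max_gap = max_gap
        self._clock = clock   # injectable for testing
        self._sleep = sleep   # injectable for testing
        self._last = None     # time of the previous request, if any

    def wait(self):
        """Sleep just long enough to keep a random gap since the last call; return the pause."""
        pause = 0.0
        now = self._clock()
        if self._last is not None:
            gap = random.uniform(self.min_gap, self.max_gap)
            pause = max(0.0, self._last + gap - now)
            if pause:
                self._sleep(pause)
        self._last = self._clock()
        return pause
```

Usage would look like calling `throttle.wait()` immediately before each `driver.get(url)`, so the first load goes out at once and later loads are spaced apart.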

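5. Advanced: Weibo's anti-scraping mechanisms and countermeasures

As one of China's largest social platforms, Weibo has mature anti-scraping defenses. The improved version above handles most situations, but large-scale, high-frequency scraping calls for further strategies.

5.1 Common anti-scraping mechanisms and responses

Automation trait detection
- Symptom: a browser launched by Selenium exposes telltale traits, such as navigator.webdriver being true
- Response: hide these traits through Chrome options and injected JavaScript, or use the undetected-chromedriver library
Request rate limiting
- Symptom: many requests in a short time get the IP blocked, with 403 responses
- Response: add random intervals between requests and use a pool of proxy IPs
Cookie validation
- Symptom: some content requires a logged-in session
- Response: simulate a login to obtain cookies, or reuse a browser profile that is already logged in
CAPTCHAs
- Symptom: slider or image CAPTCHAs appear after frequent requests
- Response: use a CAPTCHA-solving service, or lower the request rate so that CAPTCHAs are not triggered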
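When a request is refused because of rate limiting, retrying immediately tends to make things worse. A common pattern is to retry with exponentially growing, slightly randomized delays. The sketch below shows that pattern in plain Python. `retry_with_backoff` is a hypothetical helper and the delay values are assumptions, not thresholds published by Weibo.

```python
import random
import time


def retry_with_backoff(task, max_attempts=4, base_delay=2.0, max_delay=60.0, sleep=time.sleep):
    """Run `task`; on failure wait base_delay * 2**attempt plus jitter, then try again."""
    for attempt in range(max_attempts):
        try:
            return task()
        except Exception as exc:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * (2 ** attempt))
            delay += random.uniform(0, delay * 0.25)  # jitter of up to 25% on top
            print(f"attempt {attempt + 1} failed ({exc}); retrying in {delay:.1f}s")
            sleep(delay)
```

The helper only retries when `task` raises. The scraping functions in this article catch their own exceptions and return None, so they would need a thin wrapper that raises on a None result before being passed in.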

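5.2 Bypassing detection with undetected-chromedriver

undetected-chromedriver is a Chrome driver tuned specifically to evade anti-scraping detection. It hides nearly all automation traits and is one of the most effective tools currently available for this purpose.

Installation:

pip install undetected-chromedriver

Usage example: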

import undetected_chromedriver as uc
from selenium.webdriver.common.by import By
import pandas as pd
import time
import random


def scrape_weibo_with_undetected():
    driver = uc.Chrome()

    try:
        url = "https://s.weibo.com/top/summary?cate=realtimehot"
        driver.get(url)

        # Random 2-3 second pause while the page loads
        time.sleep(random.uniform(2, 3))

        hot_items = driver.find_elements(By.XPATH, '//tbody/tr')
        hot_search_list = []

        for index, item in enumerate(hot_items):
            try:
                # Skip the pinned entry, if any ("置顶" is the page's label for "pinned")
                if index == 0 and "置顶" in item.text:
                    continue

                rank = item.find_element(By.XPATH, './td[1]').text.strip()
                title_element = item.find_element(By.XPATH, './td[2]/a')
                title = title_element.text.strip()
                link = title_element.get_attribute('href')
                heat = item.find_element(By.XPATH, './td[2]/span').text.strip()

                tag_elements = item.find_elements(By.XPATH, './td[3]/i')
                tag = tag_elements[0].text.strip() if tag_elements else ""

                hot_search_list.append({
                    'rank': rank,
                    'title': title,
                    'heat': heat,
                    'tag': tag,
                    'link': link
                })

                time.sleep(random.uniform(0.3, 0.8))

            except Exception as e:
                print(f"Error while scraping entry {index + 1}: {e}")
                continue

        df = pd.DataFrame(hot_search_list)
        df.to_csv('weibo_hot_search_undetected.csv', index=False, encoding='utf-8-sig')
        print(f"\nDone. Collected {len(hot_search_list)} entries")

        return hot_search_list

    finally:
        driver.quit()


if __name__ == "__main__":
    scrape_weibo_with_undetected()

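6. Performance tuning and extensions

6.1 Running in headless mode

On a server there is no need to display a browser window. Headless mode avoids it and reduces resource usage considerably: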

chrome_options.add_argument('--headless=new')  # for Chrome 112 and later
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--window-size=1920,1080')

6.2 Scheduled scraping

The schedule library can run the scraper on a timer, for example once an hour:

pip install schedule

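import time

import schedule


def job():
    print("Starting Weibo hot search scrape...")
    # scrape_weibo_hot_search_optimized is the function defined in section 4.2
    scrape_weibo_hot_search_optimized()


# Run once every hour
schedule.every().hour.do(job)

while True:
    schedule.run_pending()
    time.sleep(1)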
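The scraping functions in this article write to a fixed filename, so each scheduled run overwrites the previous result. One option is to put a timestamp in the filename. The sketch below assumes a `data` directory and a `weibo_hot_search` prefix; both are illustrative choices, as is the helper name `snapshot_path`.

```python
from datetime import datetime
from pathlib import Path


def snapshot_path(directory="data", prefix="weibo_hot_search", now=None):
    """Build a path such as data/weibo_hot_search_20260430_090000.csv."""
    now = now or datetime.now()
    return Path(directory) / f"{prefix}_{now:%Y%m%d_%H%M%S}.csv"


print(snapshot_path(now=datetime(2026, 4, 30, 9, 0, 0)).name)
# weibo_hot_search_20260430_090000.csv
```

To use it, create the directory once with `Path("data").mkdir(parents=True, exist_ok=True)` and pass `snapshot_path()` to `df.to_csv` in place of the fixed filename.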