
Python Web Scraping in Practice (12): Video Data Collection and Batch Download

news 2026/7/24 1:52:25

Disclaimer: this article is for learning and discussion only. Do not use it for any illegal purpose.

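1. Introduction

Today we move on to the twelfth hands-on project: collecting and batch-downloading video data from Kuaishou. Kuaishou is one of China's leading short-video platforms, and it serves its data differently from a conventional website. This article walks through how to use Kuaishou's GraphQL endpoint to fetch video data precisely, download it in batches, and store it in structured form.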

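2. Requirements Analysis

2.1 Scraping Target

Target site: Kuaishou official website (https://www.kuaishou.com)
Target data: video title, like count, cover image, and video URL for a specified creator
Data storage: Excel spreadsheet + local image/video files
Technical challenges: analyzing the GraphQL endpoint, handling the pagination cursor, refreshing the Cookie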

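2.2 Technology Choices

Tool/Library	Purpose
`requests`	Sending HTTP requests
`pandas`	Structuring data and exporting to Excel
`json`	Parsing JSON data
`os`	Creating and managing folders
`time`	Controlling the interval between requests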

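3. Data Source Analysis

3.1 Locating the Endpoint

Kuaishou's web version uses GraphQL as its data exchange protocol; all data requests go to the single endpoint https://www.kuaishou.com/graphql.

How this differs from a traditional REST API:

REST: each resource has its own URL, so fetching related data takes several requests
GraphQL: a single endpoint; the Query statement names exactly the fields required, so one request returns the complete data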

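3.2 Steps for Analyzing the Endpoint

Open Chrome DevTools (F12) → Network → Fetch/XHR
Visit the target creator's profile page (e.g. https://www.kuaishou.com/profile/3xb4sru7rrgesjm)
Filter for graphql requests and find the visionProfilePhotoList operation
Right-click → Copy → Copy as cURL, then extract the Headers and Payload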
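The last step can be partly automated. The following is a minimal sketch, assuming the command was copied in bash format; `headers_from_curl` and the sample command are illustrative names and values, not something the browser or `requests` provides. Depending on the Chrome version, cookies may appear as a `-H 'cookie: ...'` header or as a `-b` argument, so the sketch handles both.

```python
import shlex


def headers_from_curl(curl_command):
    """Collect the -H/--header (and -b/--cookie) values of a cURL command."""
    # Join lines that end with a shell line-continuation backslash.
    tokens = shlex.split(curl_command.replace("\\\n", " "))
    headers = {}
    for flag, value in zip(tokens, tokens[1:]):
        if flag in ("-H", "--header") and ":" in value:
            name, _, content = value.partition(":")
            headers[name.strip()] = content.strip()
        elif flag in ("-b", "--cookie"):
            headers["Cookie"] = value
    return headers


# Shortened, made-up example of what "Copy as cURL" might produce.
sample_command = r"""curl 'https://www.kuaishou.com/graphql' \
  -H 'Content-Type: application/json' \
  -H 'Referer: https://www.kuaishou.com/profile/3xb4sru7rrgesjm' \
  -b 'kpf=PC_WEB; clientid=3' \
  --data-raw '{"operationName":"visionProfilePhotoList"}'
"""

parsed_headers = headers_from_curl(sample_command)
for name, value in parsed_headers.items():
    print(f"{name}: {value}")
```

The resulting dictionary can be passed to `requests` as the `headers` argument, which avoids retyping long header values by hand.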

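3.3 Request Structure

The GraphQL request body has three core fields: operationName (the name of the operation), variables (its arguments), and query (the query text):

{"operationName":"visionProfilePhotoList","variables":{"userId":"3xb4sru7rrgesjm","pcursor":"","page":"profile"},"query":"..."}

Pagination: the pcursor field in the response is the cursor for the next page. Assign it to variables.pcursor to turn the page, and repeat until the response contains no more data.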
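The pagination rule can be sketched as a small generator. This is an illustration under assumptions: `iter_feeds`, `fetch_page`, and `max_pages` are hypothetical names, and `fetch_page` stands for whatever function sends the request and returns the object found under data.visionProfilePhotoList. A fake, in-memory page table replaces the network so the loop logic can be run on its own.

```python
def iter_feeds(fetch_page, max_pages=100):
    """Yield feeds page by page, following the pcursor of each response.

    fetch_page(pcursor) must return a dict with "feeds" and "pcursor" keys.
    """
    pcursor = ""  # an empty cursor requests the first page
    for _ in range(max_pages):  # hard cap so a repeating cursor cannot loop forever
        page = fetch_page(pcursor)
        feeds = page.get("feeds") or []
        if not feeds:
            break  # no more data: stop paging
        yield from feeds
        next_cursor = page.get("pcursor")
        if not next_cursor or next_cursor == pcursor:
            break  # no usable cursor for a further page
        pcursor = next_cursor


# Fake pages keyed by the cursor that requests them (made-up data).
FAKE_PAGES = {
    "": {"feeds": ["a", "b"], "pcursor": "c1"},
    "c1": {"feeds": ["c"], "pcursor": "c2"},
    "c2": {"feeds": [], "pcursor": ""},
}

collected = list(iter_feeds(lambda cursor: FAKE_PAGES[cursor]))
print(collected)
```

Keeping the cursor handling separate from the HTTP call makes the stop conditions easy to test without touching the real site.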

4. Implementation

4.1 Complete Code

    import json
    import os
    import time

    import pandas as pd
    import requests

    # ============================================
    # 1. Request headers (the Cookie must be refreshed daily)
    # ============================================
    headers = {
        'Accept': '*/*',
        # requests decodes gzip/deflate out of the box; 'br' and 'zstd' need
        # extra packages, so they are not advertised here
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Connection': 'keep-alive',
        'Content-Type': 'application/json',
        'Cookie': 'kpf=PC_WEB; clientid=3; did=web_2d75a480fb3d3ae038d7bd0914204e6b; kpn=KUAISHOU_VISION',
        'Host': 'www.kuaishou.com',
        'Origin': 'https://www.kuaishou.com',
        'Referer': 'https://www.kuaishou.com/profile/3xb4sru7rrgesjm',
        'Sec-Ch-Ua': '"Chromium";v="122", "Not(A:Brand";v="24", "Google Chrome";v="122"',
        'Sec-Ch-Ua-Mobile': '?0',
        'Sec-Ch-Ua-Platform': '"Windows"',
        'Sec-Fetch-Dest': 'empty',
        'Sec-Fetch-Mode': 'cors',
        'Sec-Fetch-Site': 'same-origin',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36',
    }

    # ============================================
    # 2. GraphQL request body
    # ============================================
    # Field list shared by the photoContent and recoPhotoFragment fragments
    PHOTO_FIELDS = (
        "__typename id duration caption originCaption likeCount viewCount "
        "commentCount realLikeCount coverUrl photoUrl photoH265Url manifest "
        "manifestH265 videoResource coverUrls { url __typename } timestamp expTag "
        "animatedCoverUrl distance videoRatio liked stereoType profileUserTopPhoto "
        "musicBlocked riskTagContent riskTagUrl"
    )
    QUERY = (
        "fragment photoContent on PhotoEntity { " + PHOTO_FIELDS + " } "
        "fragment recoPhotoFragment on recoPhotoEntity { " + PHOTO_FIELDS + " } "
        "fragment feedContent on Feed { type author { id name headerUrl following "
        "headerUrls { url __typename } __typename } photo { ...photoContent "
        "...recoPhotoFragment __typename } canAddComment llsid status currentPcursor "
        "tags { type name __typename } __typename } "
        "query visionProfilePhotoList($pcursor: String, $userId: String, "
        "$page: String, $webPageArea: String) { visionProfilePhotoList("
        "pcursor: $pcursor, userId: $userId, page: $page, webPageArea: $webPageArea) "
        "{ result llsid webPageArea feeds { ...feedContent __typename } hostName "
        "pcursor __typename } }"
    )
    json_data = {
        "operationName": "visionProfilePhotoList",
        "variables": {"userId": "", "pcursor": "", "page": "profile"},
        "query": QUERY,
    }

    # Set the target user ID
    user_id = '3xb4sru7rrgesjm'
    json_data['variables']['userId'] = user_id

    # Kuaishou GraphQL endpoint
    url = 'https://www.kuaishou.com/graphql'

    # ============================================
    # 3. Create the output folders
    # ============================================
    os.makedirs('image', exist_ok=True)
    os.makedirs('video', exist_ok=True)

    # ============================================
    # 4. Core functions
    # ============================================
    def send_post():
        """Send one GraphQL request and return the feeds of the current page."""
        res = requests.post(url=url, headers=headers, json=json_data, timeout=10)
        res.raise_for_status()
        text_lados = json.loads(res.text)
        profile = (text_lados.get('data') or {}).get('visionProfilePhotoList') or {}
        # Key step: store the cursor so that the next call fetches the next page
        json_data['variables']['pcursor'] = profile.get('pcursor') or ''
        return profile.get('feeds') or []

    def kuaishou_data(item):
        """Extract the structured fields from a single feed."""
        ky_item = {
            'title': item['photo']['caption'],
            'likes': item['photo']['likeCount'],
            'cover_url': item['photo']['coverUrl'],
            'video_url': item['photo']['photoUrl'],
        }
        print(ky_item)
        return ky_item

    def download_file(file_url, path):
        """Stream a file to disk in chunks so it is never fully held in memory."""
        with requests.get(url=file_url, stream=True, timeout=30) as res:
            res.raise_for_status()
            with open(path, 'wb') as file:
                for chunk in res.iter_content(chunk_size=64 * 1024):
                    file.write(chunk)

    def download_img(i, item):
        """Download the cover image."""
        download_file(item['photo']['coverUrl'], f'image/img_{i}.jpg')

    def download_video(i, item):
        """Download the video."""
        download_file(item['photo']['photoUrl'], f'video/video_{i}.mp4')

    # ============================================
    # 5. Main program: collect and download page by page
    # ============================================
    ky_all_data = []
    while True:
        ky_data = send_post()
        # Empty feeds mean the last page has been reached: leave the loop
        if not ky_data:
            print("All data has been collected!")
            break
        for item in ky_data:
            # Running index across all pages, so files from later pages do not
            # overwrite img_0.jpg, video_0.mp4, ... from earlier pages
            i = len(ky_all_data)
            # 1. Extract the structured data
            kuaishou = kuaishou_data(item)
            ky_all_data.append(kuaishou)
            # 2. Download the cover image (1 s pause to avoid hammering the server)
            time.sleep(1)
            download_img(i, item)
            # 3. Download the video
            time.sleep(1)
            download_video(i, item)
            time.sleep(1)

    # ============================================
    # 6. Persist the data: export to Excel (to_excel needs the openpyxl package)
    # ============================================
    df = pd.DataFrame(ky_all_data)
    df.to_excel('ky.xlsx', index=False)
    print(f"Collected {len(ky_all_data)} records in total, saved to ky.xlsx")

4.2 Code Architecture Diagram

+----------------------+
| Configure headers    |  <- Cookie / User-Agent spoofing
+----------------------+
| Build GraphQL body   |  <- Query + Variables + Fragments
+----------------------+
| Initialize folders   |  <- image/ video/
+----------------------+
| while loop (paging)  |  <- driven by the pcursor cursor
+----------------------+
|  +----------------+  |
|  | Send POST      |  |
|  +----------------+  |
|  | Extract feeds  |  |
|  +----------------+  |
|  | Update cursor  |  |
|  +----------------+  |
|  | Loop, download |  |
|  |  - cover image |  |
|  |  - video       |  |
|  |  - metadata    |  |
|  +----------------+  |
+----------------------+
| Export Excel sheet   |  <- pandas.DataFrame.to_excel()
+----------------------+

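5. Key Techniques in Detail

5.1 Reading the GraphQL Query

Kuaishou uses the Fragment mechanism to reuse field definitions:

photoContent: defines the basic video fields (title, likes, URLs, etc.)
recoPhotoFragment: fields for recommended videos (same structure as photoContent)
feedContent: combines author information with video information

With this design a single query returns the complete video and author information, avoiding the multiple round trips a REST API would need.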
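Because author and photo arrive together in each feed, one feed can be flattened into one record. The sketch below is only an illustration: `flatten_feed` and the output key names are hypothetical, the sample values are made up, and only the input field names (author.id, author.name, photo.caption, and so on) come from the feedContent and photoContent fragments.

```python
def flatten_feed(feed):
    """Merge the author and photo parts of one feed into a flat record."""
    author = feed.get("author") or {}
    photo = feed.get("photo") or {}
    return {
        "author_id": author.get("id"),
        "author_name": author.get("name"),
        "caption": photo.get("caption"),
        "like_count": photo.get("likeCount"),
        "cover_url": photo.get("coverUrl"),
        "video_url": photo.get("photoUrl"),
    }


# Made-up feed that follows the field names of the fragments.
sample_feed = {
    "author": {"id": "3xb4sru7rrgesjm", "name": "example creator"},
    "photo": {
        "caption": "demo video",
        "likeCount": 10,
        "coverUrl": "https://example.com/cover.jpg",
        "photoUrl": "https://example.com/video.mp4",
    },
}

record = flatten_feed(sample_feed)
print(record)
```

Using `.get()` instead of direct indexing means a feed with a missing field produces `None` rather than stopping the whole run with a `KeyError`.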

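5.2 Cursor-Based Pagination

Unlike traditional page=1,2,3 paging, Kuaishou uses cursor-based pagination:

    # First request: pcursor is empty
    json_data['variables']['pcursor'] = ""
    # Read the cursor of the next page from the response
    # (response is the parsed JSON body of the previous request)
    pcursor = response['data']['visionProfilePhotoList']['pcursor']
    # Next request: pass the cursor in
    json_data['variables']['pcursor'] = pcursor

Advantage: items inserted or deleted between requests do not shift the page boundaries, which avoids duplicated or skipped entries and suits fast-changing data streams.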
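The difference can be reproduced with a small simulation. This is a toy model under assumptions: the list of integers stands for video IDs sorted newest first, and `offset_page` and `cursor_page` are hypothetical helpers. How Kuaishou's server interprets pcursor internally is not known, so the cursor here is simply "the last ID already seen".

```python
def offset_page(items, offset, size):
    """Classic page-number paging: slice by position."""
    return items[offset:offset + size]


def cursor_page(items, after_id, size):
    """Cursor paging: return up to `size` items older than after_id."""
    older = [x for x in items if after_id is None or x < after_id]
    return older[:size]


# Offset paging: a new video (6) is published between the two requests.
feed = [5, 4, 3, 2, 1]
first = offset_page(feed, 0, 2)
feed.insert(0, 6)
second = offset_page(feed, 2, 2)
offset_seen = first + second

# Cursor paging under the same conditions.
feed = [5, 4, 3, 2, 1]
first = cursor_page(feed, None, 2)
feed.insert(0, 6)
second = cursor_page(feed, first[-1], 2)
cursor_seen = first + second

print("offset:", offset_seen)  # item 4 shows up twice
print("cursor:", cursor_seen)  # no duplicate
```

With offset paging the inserted item pushes item 4 onto the second page again, while the cursor version continues exactly after the last item it returned.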

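5.3 Streaming Downloads

    res_img = requests.get(url=img_url, stream=True)

With stream=True, requests returns as soon as the headers arrive and defers downloading the body. Memory is only saved if the body is then read in chunks, for example with res_img.iter_content(); reading res_img.content afterwards still loads the whole file into memory at once. Chunked reading is what keeps memory usage low for large files such as videos.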
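A chunked write could look like the sketch below. `stream_to_file` and `FakeResponse` are hypothetical names: the fake class only imitates the `iter_content()` method of a `requests` response so the example runs without network access. Writing to a temporary ".part" file first is an extra design choice of this sketch, not something the original code does.

```python
import os
import tempfile


def stream_to_file(response, path, chunk_size=64 * 1024):
    """Write a streamed response to `path` chunk by chunk; return bytes written."""
    written = 0
    tmp_path = path + ".part"
    with open(tmp_path, "wb") as fh:
        for chunk in response.iter_content(chunk_size=chunk_size):
            if chunk:  # skip empty keep-alive chunks
                fh.write(chunk)
                written += len(chunk)
    os.replace(tmp_path, path)  # expose the file only once it is complete
    return written


class FakeResponse:
    """Stand-in for a streamed requests response (illustration only)."""

    def __init__(self, payload):
        self._payload = payload

    def iter_content(self, chunk_size=1):
        for start in range(0, len(self._payload), chunk_size):
            yield self._payload[start:start + chunk_size]


target = os.path.join(tempfile.mkdtemp(), "video_0.mp4")
size = stream_to_file(FakeResponse(b"x" * 200_000), target)
print(size, os.path.getsize(target))
```

With a real download, the response passed in would come from `requests.get(url, stream=True, timeout=30)`, ideally inside a `with` block so the connection is released afterwards.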

6. Results

6.1 Console Output

    {'title': 'A record of my day today...', 'likes': 15234, 'cover_url': 'https://...', 'video_url': 'https://...'}
    {'title': 'Cooking tutorial...', 'likes': 8921, 'cover_url': 'https://...', 'video_url': 'https://...'}
    ...
    All data has been collected!
    Collected 42 records in total, saved to ky.xlsx

6.2 File Structure

    project/
    ├── ky.xlsx              # structured data spreadsheet
    ├── image/
    │   ├── img_0.jpg
    │   ├── img_1.jpg
    │   └── ...
    └── video/
        ├── video_0.mp4
        ├── video_1.mp4
        └── ...

6.3 Excel Data Preview

title	likes	cover_url	video_url
A record of my day today…	15234	https://…	https://…
Cooking tutorial…	8921	https://…	https://…
…	…	…	…

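7. Common Problems and Solutions

7.1 Expired Cookie

Symptom: the server returns 401 Unauthorized or empty data

Solution:

Log in to the Kuaishou web version again
DevTools → Application → Cookies → copy the latest Cookie
Update the headers['Cookie'] field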
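To avoid editing the script each time, the Cookie can be read from the environment and the two symptoms checked in code. This is a minimal sketch under assumptions: the variable name `KUAISHOU_COOKIE` and both function names are hypothetical, and the check is only a heuristic built from the symptoms listed above.

```python
import os


def load_cookie(env_var="KUAISHOU_COOKIE"):
    """Read the Cookie header value from an environment variable."""
    cookie = os.environ.get(env_var, "").strip()
    if not cookie:
        raise RuntimeError(f"Set {env_var} to the Cookie string copied from the browser")
    return cookie


def cookie_probably_expired(status_code, payload, pcursor=""):
    """Heuristic: HTTP 401, or no feeds on the very first page.

    Empty feeds on later pages are the normal end-of-pagination signal,
    so they are only treated as suspicious while pcursor is still empty.
    """
    if status_code == 401:
        return True
    if pcursor:
        return False
    profile = ((payload or {}).get("data") or {}).get("visionProfilePhotoList") or {}
    return not profile.get("feeds")


empty_first_page = {"data": {"visionProfilePhotoList": {"feeds": [], "pcursor": ""}}}
print(cookie_probably_expired(200, empty_first_page))
```

In the scraper, `headers['Cookie'] = load_cookie()` would replace the hard-coded string, and a `True` result from the check is the cue to copy a fresh Cookie from the browser.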

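7.2 Rate Limiting

Symptom: the server returns 429 Too Many Requests

Solution:

Increase the time.sleep() interval (2-3 seconds is recommended)
Rotate through a pool of proxy IPs
Add a random delay (requires import random): time.sleep(random.uniform(1, 3))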
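The delay ideas above can be combined into a retry helper that waits longer after each rate-limited attempt. The following is a sketch, not part of the original script: `call_with_backoff` and its parameters are hypothetical, and the `sleep` argument exists only so the demo can record delays instead of actually waiting.

```python
import random
import time


def call_with_backoff(func, is_rate_limited, max_attempts=5, base_delay=2.0,
                      sleep=time.sleep):
    """Call func(); while the result is rate limited, wait and retry.

    The wait doubles on every attempt and gets up to one second of jitter.
    """
    for attempt in range(max_attempts):
        result = func()
        if not is_rate_limited(result):
            return result
        if attempt < max_attempts - 1:
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            sleep(delay)
    raise RuntimeError(f"still rate limited after {max_attempts} attempts")


# Demo with fake status codes: two 429 responses, then success.
statuses = iter([429, 429, 200])
delays = []
result = call_with_backoff(
    lambda: next(statuses),
    lambda status: status == 429,
    sleep=delays.append,  # record the delays instead of sleeping
)
print(result, [round(d, 1) for d in delays])
```

In real use, `func` would send the request and `is_rate_limited` would inspect the response's status code; the default `sleep` then pauses for real.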

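7.3 Video Download Fails

Symptom: the video file is 0 KB or cannot be played

Solution:

Check that photoUrl is a valid URL
Some videos may require a Referer request header
Try photoH265Url as a fallback address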
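These checks can be chained into a fallback routine. The sketch below is an assumption-laden illustration: `candidate_video_urls` and `download_with_fallback` are hypothetical helpers, and the actual download (including any Referer header) is delegated to a caller-supplied `download(url, path)` function that returns the number of bytes written. A fake downloader is used here so the logic runs offline.

```python
def candidate_video_urls(photo):
    """Return usable video URLs for one photo entry, preferred source first."""
    urls = []
    for key in ("photoUrl", "photoH265Url"):
        url = photo.get(key)
        if isinstance(url, str) and url.startswith(("http://", "https://")):
            urls.append(url)
    return urls


def download_with_fallback(photo, path, download, min_bytes=1):
    """Try each candidate URL until one yields at least `min_bytes` bytes."""
    for url in candidate_video_urls(photo):
        try:
            if download(url, path) >= min_bytes:
                return url
        except OSError:  # requests' exceptions derive from OSError
            continue
    return None


def fake_download(url, path):
    """Pretend the first source returns an empty file (illustration only)."""
    return 0 if "h264" in url else 1024


photo = {
    "photoUrl": "https://example.com/h264.mp4",
    "photoH265Url": "https://example.com/h265.mp4",
}
used_url = download_with_fallback(photo, "video_0.mp4", fake_download)
print(used_url)
```

Returning the URL that finally worked, or `None`, lets the caller log which videos needed the fallback and which failed entirely.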

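8. Going Further

8.1 Collecting from Multiple Creators

    user_ids = ['3xb4sru7rrgesjm', '3xabc123', '3xdef456']
    for uid in user_ids:
        json_data['variables']['userId'] = uid
        # Restart pagination for each creator; otherwise the cursor left over
        # from the previous creator would be sent with the first request
        json_data['variables']['pcursor'] = ""
        # Run the collection logic...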

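8.2 Storing in a Database Instead of Excel

    import sqlite3

    conn = sqlite3.connect('kuaishou.db')
    # df is the DataFrame built from the collected records;
    # its rows are appended to the videos table
    df.to_sql('videos', conn, if_exists='append', index=False)
    conn.close()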
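One caveat of if_exists='append' is that running the scraper twice stores every row twice. A possible way around this, sketched with the standard sqlite3 module only, is a primary key plus INSERT OR IGNORE. The table layout and `save_videos` are hypothetical, and using the photo's `id` field as a stable key is an assumption that has not been verified against the live API.

```python
import sqlite3


def save_videos(conn, rows):
    """Insert rows, skipping photo_ids that are already stored; return row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS videos ("
        "photo_id TEXT PRIMARY KEY, title TEXT, likes INTEGER, video_url TEXT)"
    )
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO videos (photo_id, title, likes, video_url) "
            "VALUES (:photo_id, :title, :likes, :video_url)",
            rows,
        )
    return conn.execute("SELECT COUNT(*) FROM videos").fetchone()[0]


# Made-up rows; an in-memory database keeps the demo free of side effects.
rows = [
    {"photo_id": "p1", "title": "demo 1", "likes": 10,
     "video_url": "https://example.com/1.mp4"},
    {"photo_id": "p2", "title": "demo 2", "likes": 20,
     "video_url": "https://example.com/2.mp4"},
]
conn = sqlite3.connect(":memory:")
save_videos(conn, rows)
total = save_videos(conn, rows)  # the second run adds nothing
print(total)
conn.close()
```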

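8.3 Faster Downloads with Async I/O

Using aiohttp + asyncio for concurrent downloads can speed things up considerably (the original estimate is 10x or more; the actual gain depends on bandwidth and on the server's rate limits):

    import aiohttp
    import asyncio

    async def download_async(url, path):
        # Download one file; several of these coroutines can run concurrently
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as resp:
                resp.raise_for_status()
                with open(path, 'wb') as f:
                    f.write(await resp.read())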
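Launching every download at once would run straight into the rate limits described in 7.2, so concurrency is usually capped. The sketch below shows one way to do that with asyncio.Semaphore; `download_all` and `max_concurrency` are hypothetical names, and a fake coroutine stands in for the real download so the example runs without aiohttp or network access.

```python
import asyncio


async def download_all(jobs, download_one, max_concurrency=5):
    """Run download_one(url, path) for every job, a few at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(url, path):
        async with semaphore:  # wait here while too many downloads are running
            return await download_one(url, path)

    # return_exceptions=True keeps one failed download from cancelling the rest
    return await asyncio.gather(
        *(guarded(url, path) for url, path in jobs), return_exceptions=True
    )


stats = {"active": 0, "peak": 0}


async def fake_download(url, path):
    """Simulated download that tracks how many run at the same time."""
    stats["active"] += 1
    stats["peak"] = max(stats["peak"], stats["active"])
    await asyncio.sleep(0.01)
    stats["active"] -= 1
    return path


jobs = [(f"https://example.com/{i}.mp4", f"video/video_{i}.mp4") for i in range(10)]
results = asyncio.run(download_all(jobs, fake_download, max_concurrency=3))
print(stats["peak"], len(results))
```

A real `download_one` could be a coroutine like download_async; sharing a single aiohttp.ClientSession across all jobs, rather than opening one per file, is generally the cheaper pattern.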