The Ultimate Guide to Xiaohongshu Data Collection with Python: A Complete xhs Library Tutorial with Practical Examples

news 2026/5/15 7:55:59

Project: xhs, a request wrapper built on the Xiaohongshu web client. Documentation: https://reajason.github.io/xhs/ Repository: https://gitcode.com/gh_mirrors/xh/xhs

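Xiaohongshu is one of China's leading lifestyle-sharing platforms, and it holds a large volume of user-generated content with commercial value. xhs is a Python scraping library built for developers: it wraps the Xiaohongshu web API so that you can retrieve publicly available content efficiently. This guide walks through the whole process, from environment setup to advanced use, so that you can pick up the core techniques of Xiaohongshu data collection.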

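🎯 Why use xhs for Xiaohongshu data collection?

Xiaohongshu data is valuable for market analysis, competitor research, and content creation. The main strengths of xhs are its efficient, stable API wrapper and its request-signing mechanism, which help developers with:

Market trend insight: collect up-to-date data on trending topics and consumer trends
Competitor strategy analysis: monitor competitors' product promotion and marketing campaigns
Content optimization: analyze the common traits of high-performing notes and what users prefer
User profiling: build precise interest tags for users from interaction data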

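Project architecture and core modules

The core modules of xhs live in the xhs/ directory, and xhs/core.py contains the main API wrapper. The modular layout keeps the code maintainable and extensible:

Module	Description	Importance
core.py	Core API wrapper; contains all data retrieval methods	⭐⭐⭐⭐⭐
help.py	Helper module; provides data-processing utilities	⭐⭐⭐⭐
exception.py	Exception module; defines the error types	⭐⭐⭐
example/	Usage examples covering several scenarios	⭐⭐⭐⭐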

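🚀 Quick deployment and basic usage in 5 minutes

Installation and configuration

xhs installs with a single command, and you can also install it from source: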

    # Install the stable release from PyPI
    pip install xhs

    # Or install the latest version from source
    git clone https://gitcode.com/gh_mirrors/xh/xhs
    cd xhs && pip install .

    # Verify the installation
    python -c "import xhs; print(xhs.__version__)"

System requirements:

Python 3.8 or later
requests (installed automatically)
lxml (installed automatically)
playwright (used by the signing service)
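The requirements above can be checked from Python before you run anything else. The following is a minimal sketch using only the standard library; the helper names `python_is_supported` and `missing_packages` are made up for this illustration and are not part of xhs.

```python
import importlib.util
import sys

# Package names taken from the requirements list above.
REQUIRED_PACKAGES = ("requests", "lxml", "playwright")
MIN_PYTHON = (3, 8)


def python_is_supported(version=None, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version meets the minimum."""
    version = sys.version_info if version is None else version
    return tuple(version[:2]) >= tuple(minimum)


def missing_packages(names=REQUIRED_PACKAGES):
    """Return the names that cannot be located as importable modules."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("Python OK:", python_is_supported())
print("Missing packages:", missing_packages() or "none")
```

`find_spec` only locates a module without importing it, so the check stays fast and has no side effects.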

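Getting the required credentials

xhs needs your Xiaohongshu cookie. To obtain it:

Log in from a browser: sign in to Xiaohongshu in Chrome or Edge
Open the developer tools: press F12
Find the cookies: switch to the Application or Storage tab
Copy the key fields: locate and copy a1, web_session, and webId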
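Before passing a copied cookie to the client, it can help to confirm that the three fields named above are actually present. This is a small sketch under the assumption that the cookie is a standard `name=value; name=value` string; `parse_cookie` and `missing_cookie_fields` are hypothetical helper names, not xhs functions.

```python
REQUIRED_FIELDS = ("a1", "web_session", "webId")


def parse_cookie(cookie_string):
    """Split a 'k1=v1; k2=v2' cookie string into a dict."""
    fields = {}
    for part in cookie_string.split(";"):
        name, sep, value = part.strip().partition("=")
        if sep and name:
            fields[name] = value
    return fields


def missing_cookie_fields(cookie_string, required=REQUIRED_FIELDS):
    """Return the required field names that are absent or empty."""
    fields = parse_cookie(cookie_string)
    return [name for name in required if not fields.get(name)]


# Placeholder values, not real credentials.
example = "a1=abc123; web_session=xyz789"
print(missing_cookie_fields(example))
```

An empty list means all three fields were found; anything else names what still has to be copied from the browser.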

Writing your first collection script

Create a file named first_crawl.py to start collecting data:

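    import json

    from xhs import XhsClient

    # Initialize the client (replace with your actual cookie)
    cookie = "your_cookie_here"
    client = XhsClient(cookie=cookie)

    # Search for popular notes.
    # Confirm method and parameter names against your installed xhs version.
    search_results = client.search_note(
        keyword="美食探店",  # "food and restaurant visits"
        page=1,
        page_size=10
    )

    print(f"Found {len(search_results['items'])} matching notes")
    if search_results['items']:
        # Print the first result in readable form
        print(json.dumps(search_results['items'][0], indent=2, ensure_ascii=False))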

📊 Core features in depth and advanced configuration

Search in detail

xhs supports several search parameters and sort orders:

    from xhs import XhsClient

    # Multi-dimensional search example
    def advanced_search_example():
        """Demonstrate the search options."""
        client = XhsClient(cookie="your_cookie")

        # 1. Search sorted by popularity
        hot_notes = client.search_note(
            keyword="旅行攻略",  # "travel guide"
            sort_type="hot",
            page=1,
            page_size=15
        )

        # 2. Search sorted by time
        new_notes = client.search_note(
            keyword="美妆教程",  # "makeup tutorial"
            sort_type="time",
            page=1,
            page_size=15
        )

        # 3. Combined search conditions
        filtered_notes = client.search_note(
            keyword="健身",  # "fitness"
            page=1,
            page_size=20,
            # Other optional parameters
            # note_type="video",  # video notes only
            # sort="general",     # general ranking
        )

        return hot_notes, new_notes, filtered_notes

User data analysis and content retrieval

Fetch a given user's profile details and published content:

    from xhs import XhsClient

    def get_user_insights(user_id):
        """Collect profile data and published notes for one user."""
        client = XhsClient(cookie="your_cookie")

        # Basic user information
        user_info = client.get_user_info(user_id=user_id)

        # Notes published by the user
        user_notes = client.get_user_notes(user_id=user_id, page=1, page_size=20)

        # Notes the user has saved
        user_favorites = client.get_user_favorites(user_id=user_id, page=1, page_size=15)

        # Simple statistics; guard against an empty note list
        notes = user_notes['items']
        total_notes = len(notes)
        avg_likes = sum(note['likes'] for note in notes) / total_notes if total_notes else 0

        return {
            "user_info": user_info,
            "total_notes": total_notes,
            "total_favorites": len(user_favorites['items']),
            "avg_likes": avg_likes,
            "notes_sample": notes[:3]
        }

Note details and media extraction

Fetch the complete information for a single note, including images, video, and other media:

    from xhs import XhsClient, help

    def extract_note_content(note_id, xsec_token):
        """Extract the full content of a note."""
        client = XhsClient(cookie="your_cookie")

        # Fetch the note detail
        note_detail = client.get_note_by_id(note_id=note_id, xsec_token=xsec_token)

        # Extract image and video links
        image_urls = help.get_imgs_url_from_note(note_detail)
        video_urls = help.get_video_url_from_note(note_detail)

        # The text fields are read from the 'note' key; look it up once
        note = note_detail.get('note', {})

        return {
            "title": note.get('title', ''),
            "content": note.get('content', ''),
            "images": image_urls,
            "videos": video_urls,
            "stats": {
                "likes": note.get('likes', 0),
                "collects": note.get('collects', 0),
                "comments": note.get('comments', 0)
            }
        }

🛠️ Signing service configuration and advanced features

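Deploying a standalone signing service

To cope with Xiaohongshu's signature verification, xhs supports delegating signing to a separate service. Use example/basic_sign_server.py as the reference for setting one up: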

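    import requests

    # Signing service configuration example
    def setup_sign_server():
        """Build a sign function that delegates to a standalone signing service."""
        # 1. Install the dependencies:
        #      pip install playwright
        #      playwright install chromium
        # 2. Prepare stealth.min.js (take it from the project or supply your own)
        # 3. Start the signing service (see example/basic_sign_server.py)
        # 4. Point the client at the service
        sign_server_url = "http://localhost:5000/sign"

        def custom_sign(uri, data=None, a1="", web_session=""):
            response = requests.post(
                sign_server_url,
                json={"uri": uri, "data": data, "a1": a1, "web_session": web_session},
                timeout=10,  # avoid hanging if the service is down
            )
            response.raise_for_status()
            return response.json()

        return custom_sign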

Error handling and retries

Thorough error handling is key to stable data collection:

    import random
    import time

    from xhs import XhsClient
    from xhs.exception import DataFetchError, IPBlockError


    class SafeXhsClient:
        """XhsClient wrapper with retry handling."""

        def __init__(self, cookie, max_retries=3):
            self.client = XhsClient(cookie=cookie)
            self.max_retries = max_retries

        def safe_api_call(self, api_func, *args, **kwargs):
            """Call an API method, retrying on DataFetchError; return None on failure."""
            for attempt in range(self.max_retries):
                try:
                    return api_func(*args, **kwargs)
                except DataFetchError as e:
                    print(f"Data fetch failed (attempt {attempt + 1}/{self.max_retries}): {e}")
                    if attempt < self.max_retries - 1:
                        wait_time = random.uniform(2, 5)
                        print(f"Waiting {wait_time:.1f} s before retrying...")
                        time.sleep(wait_time)
                except IPBlockError:
                    print("The IP may be blocked; switch IP or try again later")
                    break
                except Exception as e:
                    print(f"Unexpected error: {e}")
                    break
            return None

        def search_with_retry(self, keyword, **kwargs):
            """Search with retries."""
            return self.safe_api_call(self.client.search_note, keyword, **kwargs)
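The retry loop above waits a random 2 to 5 seconds on every attempt. A common alternative is exponential backoff with jitter, where the upper bound of the wait doubles after each failure. The sketch below is a generic illustration, not part of xhs; `backoff_delay` and `retry_with_backoff` are hypothetical names, and the `sleep` parameter exists only so the behaviour can be tested without real waiting.

```python
import random
import time


def backoff_delay(attempt, base=1.0, cap=30.0):
    """Upper bound for the wait before retry number `attempt` (0-based)."""
    return min(cap, base * (2 ** attempt))


def retry_with_backoff(func, max_retries=3, base=1.0, cap=30.0,
                       retry_on=(Exception,), sleep=time.sleep):
    """Call func(); on a listed exception, wait a jittered delay and retry."""
    for attempt in range(max_retries):
        try:
            return func()
        except retry_on:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the last error
            # "Full jitter": a random wait between 0 and the current bound
            sleep(random.uniform(0, backoff_delay(attempt, base, cap)))
```

With the class above, a call could look like `retry_with_backoff(lambda: client.search_note("健身"), retry_on=(DataFetchError,))`, assuming those names are in scope.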

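📈 Practical scenarios and best practices

Scenario 1: Market research and competitor analysis

Suppose you are a market analyst at a beauty brand and need to monitor how competitors promote their products: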

    def competitive_analysis():
        """Competitor analysis walkthrough."""
        # SafeXhsClient is the retry wrapper defined in the error-handling section
        client = SafeXhsClient(cookie="your_cookie")

        # 1. Competitor brands and product keywords
        competitors = ["品牌A", "品牌B", "品牌C"]  # placeholder brand names
        keywords = ["口红", "眼影", "粉底液", "护肤品"]  # lipstick, eyeshadow, foundation, skincare

        # 2. Collect data for every brand/keyword pair
        all_data = []
        for brand in competitors:
            for keyword in keywords:
                search_term = f"{brand} {keyword}"
                results = client.search_with_retry(
                    keyword=search_term,
                    page=1,
                    page_size=20,
                    sort_type="hot"
                )
                if results:
                    all_data.append({
                        "brand": brand,
                        "keyword": keyword,
                        "notes": results['items'],
                        "total": len(results['items'])
                    })

        # 3. Analyze the collected data
        return analyze_competition_data(all_data)


    def analyze_competition_data(data):
        """Summarize the collected competitor data."""
        # Exposure per brand, measured as the number of notes returned
        brand_exposure = {}
        for item in data:
            brand = item['brand']
            brand_exposure[brand] = brand_exposure.get(brand, 0) + item['total']

        # Distribution of note types across all results
        content_types = {}
        for item in data:
            for note in item['notes']:
                note_type = note.get('type', 'unknown')
                content_types[note_type] = content_types.get(note_type, 0) + 1

        return {
            "brand_exposure": brand_exposure,
            "content_types": content_types,
            "total_notes": sum(item['total'] for item in data)
        }

Scenario 2: Content creation and trend tracking

Content creators can use xhs to refine their content strategy:

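    from collections import Counter


    class ContentStrategyAnalyzer:
        """Content strategy analyzer."""

        def __init__(self, cookie):
            # SafeXhsClient is the retry wrapper defined in the error-handling section
            self.client = SafeXhsClient(cookie)

        def analyze_trending_topics(self, category="美妆"):  # "beauty"
            """Analyze trending topics in a category."""
            # Fetch popular notes
            hot_notes = self.client.search_with_retry(
                keyword=category,
                sort_type="hot",
                page=1,
                page_size=50
            )
            if not hot_notes:
                return None  # the search failed after all retries

            keywords = self.extract_keywords(hot_notes)
            post_times = self.analyze_post_time(hot_notes)
            content_formats = self.analyze_content_format(hot_notes)

            return {
                "trending_keywords": keywords[:10],
                "optimal_post_times": post_times,
                "preferred_formats": content_formats
            }

        def extract_keywords(self, notes):
            """Extract high-frequency keywords from notes."""
            # Whitespace splitting is only a simple placeholder: Chinese text has
            # no spaces between words, so real use needs a proper tokenizer
            all_words = []
            for note in notes['items']:
                all_words.extend(note.get('title', '').split())
                all_words.extend(note.get('content', '').split())
            return [word for word, _ in Counter(all_words).most_common(20)]

        def analyze_post_time(self, notes):
            """Placeholder: posting-time analysis is not implemented here."""
            return {}

        def analyze_content_format(self, notes):
            """Count notes by their 'type' field."""
            return dict(Counter(n.get('type', 'unknown') for n in notes['items']))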
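Splitting on whitespace finds few keywords in Chinese titles, because Chinese is written without spaces between words. One dependency-free approximation is to count overlapping two-character slices (bigrams). The sketch below is a rough stand-in for a real tokenizer, shown only to illustrate the idea; the function names and sample titles are made up.

```python
import re
from collections import Counter

# Matches runs of CJK unified ideographs (the common Chinese character range).
CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")


def char_bigrams(text):
    """Yield overlapping two-character slices from each run of Chinese characters."""
    for run in CJK_RUN.findall(text):
        for i in range(len(run) - 1):
            yield run[i:i + 2]


def top_bigrams(titles, n=10):
    """Return the n most common bigrams across the titles."""
    counts = Counter()
    for title in titles:
        counts.update(char_bigrams(title))
    return counts.most_common(n)


titles = ["平价口红推荐", "口红试色合集", "秋冬口红推荐"]  # made-up sample titles
print(top_bigrams(titles, n=3))
```

Bigrams also produce fragments that straddle word boundaries, so the output needs filtering; a dedicated Chinese segmentation library gives cleaner results.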

Scenario 3: Academic research and data analysis

Researchers can use xhs data for social network analysis:

    from collections import deque

    import networkx as nx
    from xhs import XhsClient


    class SocialNetworkAnalyzer:
        """Social network analyzer."""

        def __init__(self, cookie):
            self.client = XhsClient(cookie=cookie)
            self.graph = nx.Graph()

        def build_user_network(self, seed_user_id, depth=2):
            """Build a follow graph by breadth-first search from a seed user."""
            visited = set()
            queue = deque([(seed_user_id, 0)])

            while queue:
                user_id, current_depth = queue.popleft()
                if user_id in visited or current_depth > depth:
                    continue
                visited.add(user_id)

                # Fetch the accounts this user follows
                try:
                    following = self.client.get_user_following(user_id=user_id)
                    for followed_user in following['items']:
                        followed_id = followed_user['user_id']
                        self.graph.add_edge(user_id, followed_id)
                        if followed_id not in visited:
                            queue.append((followed_id, current_depth + 1))
                except Exception as e:
                    print(f"Failed to fetch the following list for user {user_id}: {e}")

            return self.graph

        def analyze_network_metrics(self):
            """Compute basic graph metrics."""
            node_count = self.graph.number_of_nodes()
            if node_count == 0:
                return {}  # nothing collected; the averages below are undefined
            connected = nx.is_connected(self.graph)
            return {
                "nodes": node_count,
                "edges": self.graph.number_of_edges(),
                "average_degree": sum(dict(self.graph.degree()).values()) / node_count,
                "clustering_coefficient": nx.average_clustering(self.graph),
                "diameter": nx.diameter(self.graph) if connected else "disconnected"
            }

⚡ Performance optimization and scaling

Concurrency and batch collection

For large-scale collection, concurrent requests improve throughput:

    import concurrent.futures
    from typing import List

    from xhs import XhsClient


    class BatchCollector:
        """Batch data collector."""

        def __init__(self, cookie, max_workers=5):
            self.client = XhsClient(cookie=cookie)
            self.max_workers = max_workers

        def collect_notes_batch(self, note_ids: List[str]) -> List[dict]:
            """Fetch note details concurrently."""
            results = []
            with concurrent.futures.ThreadPoolExecutor(max_workers=self.max_workers) as executor:
                # Submit every task; the empty string is a placeholder xsec_token
                future_to_note = {
                    executor.submit(self.client.get_note_by_id, note_id, ""): note_id
                    for note_id in note_ids
                }

                # Handle tasks as they complete
                for future in concurrent.futures.as_completed(future_to_note):
                    note_id = future_to_note[future]
                    try:
                        results.append(future.result())
                        print(f"Collected note: {note_id}")
                    except Exception as e:
                        print(f"Failed to collect note {note_id}: {e}")

            return results

        def search_multiple_keywords(self, keywords: List[str]) -> dict:
            """Search several keywords concurrently."""
            keyword_results = {}
            with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
                future_to_keyword = {
                    executor.submit(self.client.search_note, keyword, page=1, page_size=20): keyword
                    for keyword in keywords
                }

                for future in concurrent.futures.as_completed(future_to_keyword):
                    keyword = future_to_keyword[future]
                    try:
                        keyword_results[keyword] = future.result()['items']
                    except Exception as e:
                        print(f"Search for keyword '{keyword}' failed: {e}")

            return keyword_results
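When the list of note IDs is long, submitting everything at once can trigger rate limits even with a small thread pool. One option is to split the work into fixed-size chunks and pause between them. The following sketch is independent of xhs; `chunked` and `collect_in_batches` are hypothetical helpers, and `fetch_batch` stands for whatever function collects one chunk (for example, a `collect_notes_batch` method like the one above).

```python
import time


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def collect_in_batches(note_ids, fetch_batch, batch_size=20, pause=2.0, sleep=time.sleep):
    """Run fetch_batch on each chunk, pausing between chunks (not after the last)."""
    results = []
    batches = list(chunked(note_ids, batch_size))
    for index, batch in enumerate(batches):
        results.extend(fetch_batch(batch))
        if index < len(batches) - 1:
            sleep(pause)
    return results
```

The `sleep` argument defaults to `time.sleep`; passing a different callable makes the pacing easy to test or to replace with a randomized delay.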

Caching and persistent storage

Caching cuts down on repeated requests and speeds up collection:

    import hashlib
    import json
    import os
    from datetime import datetime, timedelta


    class DataCache:
        """File-based cache with an expiry time."""

        def __init__(self, cache_dir="data_cache", expiry_hours=24):
            self.cache_dir = cache_dir
            self.expiry_hours = expiry_hours
            os.makedirs(cache_dir, exist_ok=True)

        def get_cache_key(self, func_name, *args, **kwargs):
            """Build a cache key from the function name and its arguments."""
            # Sort kwargs so the key does not depend on keyword order
            key_str = f"{func_name}_{args}_{sorted(kwargs.items())}"
            return hashlib.md5(key_str.encode()).hexdigest()

        def is_cache_valid(self, cache_file):
            """Return True if the cache file exists and has not expired."""
            if not os.path.exists(cache_file):
                return False
            with open(cache_file, 'r', encoding='utf-8') as f:
                cache_data = json.load(f)
            cache_time = datetime.fromisoformat(cache_data['timestamp'])
            return datetime.now() - cache_time < timedelta(hours=self.expiry_hours)

        def cached_call(self, func, *args, **kwargs):
            """Call func, reusing a cached result when a fresh one exists."""
            cache_key = self.get_cache_key(func.__name__, *args, **kwargs)
            cache_file = os.path.join(self.cache_dir, f"{cache_key}.json")

            # Serve from the cache when possible
            if self.is_cache_valid(cache_file):
                with open(cache_file, 'r', encoding='utf-8') as f:
                    cache_data = json.load(f)
                print(f"Using cached data: {cache_key}")
                return cache_data['data']

            # Otherwise call the function and cache the result
            result = func(*args, **kwargs)
            cache_data = {
                'timestamp': datetime.now().isoformat(),
                'data': result
            }
            with open(cache_file, 'w', encoding='utf-8') as f:
                json.dump(cache_data, f, ensure_ascii=False, indent=2)
            print(f"Cached new data: {cache_key}")
            return result

Database storage

For large-scale collection, store the data in a database:

    import json
    import sqlite3
    from contextlib import contextmanager
    from datetime import datetime


    class DatabaseManager:
        """SQLite storage for notes and users."""

        def __init__(self, db_path="xhs_data.db"):
            self.db_path = db_path
            self.init_database()

        def init_database(self):
            """Create the tables if they do not exist."""
            with self.get_connection() as conn:
                conn.execute('''
                    CREATE TABLE IF NOT EXISTS notes (
                        id TEXT PRIMARY KEY,
                        title TEXT,
                        content TEXT,
                        user_id TEXT,
                        likes INTEGER,
                        collects INTEGER,
                        comments INTEGER,
                        create_time TIMESTAMP,
                        update_time TIMESTAMP,
                        raw_data TEXT
                    )
                ''')
                conn.execute('''
                    CREATE TABLE IF NOT EXISTS users (
                        user_id TEXT PRIMARY KEY,
                        nickname TEXT,
                        avatar TEXT,
                        notes_count INTEGER,
                        fans_count INTEGER,
                        following_count INTEGER,
                        update_time TIMESTAMP
                    )
                ''')

        @contextmanager
        def get_connection(self):
            """Yield a connection; commit on success and always close."""
            conn = sqlite3.connect(self.db_path)
            try:
                yield conn
                conn.commit()
            finally:
                conn.close()

        def save_note(self, note_data):
            """Insert or update one note."""
            with self.get_connection() as conn:
                conn.execute('''
                    INSERT OR REPLACE INTO notes
                    (id, title, content, user_id, likes, collects, comments,
                     create_time, update_time, raw_data)
                    VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
                ''', (
                    note_data.get('id'),
                    note_data.get('title'),
                    note_data.get('content'),
                    note_data.get('user_id'),
                    note_data.get('likes', 0),
                    note_data.get('collects', 0),
                    note_data.get('comments', 0),
                    note_data.get('create_time'),
                    datetime.now().isoformat(),
                    json.dumps(note_data, ensure_ascii=False)
                ))
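Once notes are stored, they can be queried with plain SQL. The sketch below runs against an in-memory SQLite database with a reduced version of the notes table, so it does not touch any file; the rows are made-up sample data and `top_notes_by_likes` is a hypothetical helper.

```python
import sqlite3


def top_notes_by_likes(conn, limit=5):
    """Return (id, title, likes) rows ordered by likes, highest first."""
    cursor = conn.execute(
        "SELECT id, title, likes FROM notes ORDER BY likes DESC, id ASC LIMIT ?",
        (limit,),
    )
    return cursor.fetchall()


# In-memory database with a reduced notes table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (id TEXT PRIMARY KEY, title TEXT, likes INTEGER)")
conn.executemany(
    "INSERT INTO notes (id, title, likes) VALUES (?, ?, ?)",
    [("n1", "first", 120), ("n2", "second", 45), ("n3", "third", 300)],  # made-up rows
)
conn.commit()
print(top_notes_by_likes(conn, limit=2))
```

The `?` placeholders keep the query parameterized, which matters as soon as values come from scraped text rather than constants.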

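🔧 Troubleshooting and common questions

Q1: What should I do when signing fails?

Solutions:

Check that the cookie is valid: make sure it has not expired, and fetch a fresh one if needed
Configure a signing service: set up a standalone service based on example/basic_sign_server.py
Adjust the wait time: increase the sleep duration in the sign function
Update stealth.min.js: make sure you are using the latest version of the stealth script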

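Q2: What if requests are rate-limited for being too frequent?

Preventive measures:

Add delays: insert a random delay of 2 to 5 seconds between requests
Use a proxy pool: rotate across multiple IP addresses
Limit concurrency: cap the number of simultaneous requests
Monitor responses: detect abnormal responses and adjust your strategy promptly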
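The first and third measures above (random delays and limited concurrency) can be combined in a small throttle shared by all worker threads. This is a sketch using only the standard library; the `Throttle` class is hypothetical and not part of xhs, and the `clock` and `sleep` parameters exist so that it can be exercised without real waiting.

```python
import random
import threading
import time


class Throttle:
    """Enforce a randomized minimum gap between calls, across threads."""

    def __init__(self, min_delay=2.0, max_delay=5.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self.max_delay = max_delay
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def wait(self):
        """Block until the next request slot, then reserve the following one."""
        with self._lock:  # holding the lock while sleeping serializes callers
            now = self._clock()
            delay = max(0.0, self._next_allowed - now)
            if delay:
                self._sleep(delay)
            gap = random.uniform(self.min_delay, self.max_delay)
            self._next_allowed = max(now, self._next_allowed) + gap
            return delay
```

Calling `throttle.wait()` immediately before each request spaces requests by 2 to 5 seconds regardless of how many threads are running.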

Q3: How can I make data collection more stable?

Best practice:

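    import random
    import time


    # Example of a stable collection strategy
    class StableCollector:
        """Try progressively more defensive collection strategies."""

        def __init__(self, max_retries=5):
            self.max_retries = max_retries

        def collect_with_strategy(self, collect_func):
            """Run each strategy in order and return the first non-empty result."""
            strategies = [
                self._try_direct_collect,
                self._try_with_delay,
                self._try_with_proxy,
                self._try_with_rotated_cookie
            ]
            for strategy in strategies:
                result = strategy(collect_func)
                if result:
                    return result
            return None

        def _try_direct_collect(self, collect_func):
            """Call the function once; treat any exception as a failed attempt."""
            try:
                return collect_func()
            except Exception as e:
                print(f"Direct collection failed: {e}")
                return None

        def _try_with_delay(self, collect_func):
            """Retry after random delays, up to max_retries times."""
            for _ in range(self.max_retries):
                time.sleep(random.uniform(2, 5))
                result = self._try_direct_collect(collect_func)
                if result:
                    return result
            return None

        def _try_with_proxy(self, collect_func):
            """Placeholder: proxy switching depends on your own proxy pool."""
            return None

        def _try_with_rotated_cookie(self, collect_func):
            """Placeholder: cookie rotation depends on your own cookie pool."""
            return None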

Q4: How do I handle inconsistent data formats?

A data-cleaning approach:

    def clean_note_data(raw_note):
        """Normalize a raw note dict into a consistent shape."""
        user = raw_note.get('user') or {}
        return {
            'id': raw_note.get('id', ''),
            # `or ''` also covers fields that are present but set to None
            'title': (raw_note.get('title') or '').strip(),
            'content': (raw_note.get('content') or '').strip(),
            'user_id': user.get('user_id', ''),
            'stats': {
                'likes': int(raw_note.get('likes', 0) or 0),
                'collects': int(raw_note.get('collects', 0) or 0),
                'comments': int(raw_note.get('comments', 0) or 0)
            },
            'timestamp': raw_note.get('time', ''),
            'images': raw_note.get('images') or [],
            'videos': raw_note.get('videos') or []
        }
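Calling `int()` directly fails on counts that arrive as abbreviated strings. The sketch below assumes, without confirmation from the xhs documentation, that counts may appear in forms such as "1.2万" (万 = 10,000) or "10万+"; `parse_count` is a hypothetical helper that normalizes these to integers and falls back to a default when the value cannot be parsed.

```python
from decimal import Decimal, InvalidOperation

# Assumed suffixes: 万 = 10,000 and 亿 = 100,000,000.
UNIT_MULTIPLIERS = {"万": 10_000, "亿": 100_000_000}


def parse_count(value, default=0):
    """Convert values such as 358, '358', '1.2万' or None to an int."""
    if value is None:
        return default
    text = str(value).strip().replace(",", "").rstrip("+")
    multiplier = 1
    if text and text[-1] in UNIT_MULTIPLIERS:
        multiplier = UNIT_MULTIPLIERS[text[-1]]
        text = text[:-1]
    try:
        # Decimal avoids binary floating-point rounding in values like 1.2 * 10000
        return int(Decimal(text) * multiplier)
    except (InvalidOperation, ValueError, OverflowError):
        return default
```

In `clean_note_data`, a call such as `parse_count(raw_note.get('likes'))` could replace the `int(...)` expressions if the data turns out to contain such strings.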

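📊 Performance comparison and optimization advice

Performance characteristics of xhs

Feature	Performance	Optimization advice
Search	⭐⭐⭐⭐ Fast response	Cache results to avoid repeated searches
User data retrieval	⭐⭐⭐⭐ Stable	Fetch in batches to reduce the number of requests
Note details	⭐⭐⭐ Affected by signing	Configure a standalone signing service
Concurrency	⭐⭐⭐⭐ Well supported	Cap concurrency to avoid rate limits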

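Performance optimization tips

Request merging: combine several small requests into one batch request
Data caching: keep frequently accessed data in Redis or a local cache
Connection pooling: reuse HTTP connections to cut connection setup overhead
Asynchronous processing: use asyncio to speed up I/O-bound tasks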

    import asyncio

    import aiohttp


    class AsyncXhsClient:
        """Asynchronous client sketch."""

        async def fetch_note_async(self, session, note_id):
            """Fetch a note page asynchronously."""
            # This requests the public page HTML directly, without cookie or signing
            url = f"https://www.xiaohongshu.com/explore/{note_id}"
            async with session.get(url) as response:
                return await response.text()

        async def batch_fetch_notes(self, note_ids):
            """Fetch several notes concurrently over one shared session."""
            async with aiohttp.ClientSession() as session:
                tasks = [self.fetch_note_async(session, note_id) for note_id in note_ids]
                # return_exceptions=True returns failures as values instead of raising
                return await asyncio.gather(*tasks, return_exceptions=True)
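`asyncio.gather` starts every task at once, which conflicts with the advice to cap concurrency. A semaphore can bound the number of requests in flight. The sketch below uses only the standard library and a stand-in coroutine instead of real network calls; `gather_limited` and `fake_fetch` are hypothetical names.

```python
import asyncio


async def gather_limited(coroutine_factories, limit=3):
    """Run the coroutines with at most `limit` in flight; results keep input order."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(factory):
        async with semaphore:
            return await factory()

    return await asyncio.gather(
        *(run_one(factory) for factory in coroutine_factories),
        return_exceptions=True,
    )


async def fake_fetch(note_id):
    """Stand-in for a real request; sleeps briefly instead of using the network."""
    await asyncio.sleep(0.01)
    return f"note:{note_id}"


async def main():
    # Each factory creates its coroutine only when a slot is available
    factories = [lambda i=i: fake_fetch(i) for i in range(5)]
    return await gather_limited(factories, limit=2)


print(asyncio.run(main()))
```

Passing factories rather than coroutine objects means no coroutine is created before the semaphore admits it, which keeps pending work cheap when the input list is long.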

🚀 Extension and custom development

Custom data processors

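    class CustomDataProcessor:
        """Dispatch table of per-type data processors."""

        def __init__(self):
            self.processors = {
                'text': self.process_text,
                'image': self.process_image,
                'video': self.process_video,
                'user': self.process_user
            }

        def process_text(self, text_data):
            """Process text: cleaning, word count, sentiment."""
            return {
                'cleaned_text': text_data.strip(),
                'word_count': len(text_data.split()),
                'sentiment': self.analyze_sentiment(text_data)
            }

        def process_image(self, image_urls):
            """Process images: download, feature extraction, and so on."""
            return {
                'urls': image_urls,
                'count': len(image_urls),
                'downloaded': self.download_images(image_urls)
            }

        def process_video(self, video_urls):
            """Placeholder: return the URLs with a count."""
            return {'urls': video_urls, 'count': len(video_urls)}

        def process_user(self, user_data):
            """Placeholder: return the user record unchanged."""
            return user_data

        def analyze_sentiment(self, text):
            """Placeholder: plug in a sentiment model here."""
            return None

        def download_images(self, image_urls):
            """Placeholder: plug in a downloader here; nothing is downloaded."""
            return []

        def add_custom_processor(self, data_type, processor_func):
            """Register a custom processor for a data type."""
            self.processors[data_type] = processor_func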

Plugin system design

    class PluginSystem:
        """Plugin registry."""

        def __init__(self):
            self.plugins = {}

        def register_plugin(self, name, plugin):
            """Register a plugin under a data-type name."""
            self.plugins[name] = plugin

        def process_data(self, data_type, data):
            """Process data with the matching plugin, or return it unchanged."""
            if data_type in self.plugins:
                return self.plugins[data_type].process(data)
            return data


    # Example plugins (the return values are hard-coded placeholders)
    class SentimentAnalysisPlugin:
        def process(self, text):
            # Sentiment analysis logic goes here
            return {"sentiment": "positive", "score": 0.85}


    class ImageAnalysisPlugin:
        def process(self, images):
            # Image analysis logic goes here
            return {"has_faces": True, "dominant_color": "#FF5733"}

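🎯 Summary and best practices

Key points

Environment: make sure the Python environment is set up correctly and all dependencies are installed
Credentials: keep your cookie safe and refresh it regularly
Signing service: for production use, deploy a standalone signing service
Error handling: implement thorough error handling and retries
Performance: use caching, concurrency, and batch processing to improve throughput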

Deployment architecture

For enterprise-scale applications, the following architecture is recommended:

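┌────────────────────────┐    ┌────────────────────────┐    ┌────────────────────────┐
│ Collection layer       │    │ Processing layer       │    │ Storage layer          │
│                        │    │                        │    │                        │
│ • Distributed crawlers │───▶│ • Data cleaning        │───▶│ • Relational database  │
│ • Proxy pool           │    │ • Feature extraction   │    │ • Time-series database │
│ • Signing cluster      │    │ • Sentiment analysis   │    │ • Object storage       │
└────────────────────────┘    └────────────────────────┘    └────────────────────────┘
            │                             │                             │
            ▼                             ▼                             ▼
┌────────────────────────┐    ┌────────────────────────┐    ┌────────────────────────┐
│ Monitoring layer       │    │ Analysis layer         │    │ API layer              │
│                        │    │                        │    │                        │
│ • Runtime monitoring   │    │ • Trend analysis       │    │ • REST API             │
│ • Anomaly detection    │    │ • Report generation    │    │ • WebSocket            │
│ • Automatic recovery   │    │ • Predictive models    │    │ • GraphQL              │
└────────────────────────┘    └────────────────────────┘    └────────────────────────┘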
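The top row of the diagram (collection, then processing, then storage) can be prototyped in a single process before any distributed components are introduced. The sketch below is a simplified illustration, not a production design; the `Pipeline` class, the two processing steps, and the in-memory `storage` dict are all made up for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


@dataclass
class Pipeline:
    """Chain a collection step, processing steps, and a storage step."""
    collect: Callable[[], Iterable[dict]]
    processors: List[Callable[[dict], dict]] = field(default_factory=list)
    store: Callable[[dict], None] = print

    def run(self) -> int:
        """Push every collected record through all processors, then store it."""
        count = 0
        for record in self.collect():
            for processor in self.processors:
                record = processor(record)
            self.store(record)
            count += 1
        return count


def strip_title(record: dict) -> dict:
    """Cleaning step: trim whitespace from the title."""
    return {**record, "title": record.get("title", "").strip()}


def add_title_length(record: dict) -> dict:
    """Feature step: attach the title length."""
    return {**record, "title_length": len(record["title"])}


storage: Dict[str, dict] = {}  # in-memory stand-in for a database

pipeline = Pipeline(
    collect=lambda: [{"id": "n1", "title": "  hello  "}],  # made-up record
    processors=[strip_title, add_title_length],
    store=lambda record: storage.__setitem__(record["id"], record),
)
print(pipeline.run(), storage)
```

Because each layer is just a callable, the collection step can later be swapped for a real crawler and the storage step for a database writer without changing `Pipeline.run`.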