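Yingdao RPA (影刀RPA) Beginner Tutorial: A Complete Guide to Batch Image Processing, Covering Download, Renaming, Compression, and Format Conversion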

news 2026/6/29 13:41:24

Author: 林焱 | Please credit the source when reposting

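Opening case: 5,000 products scraped, and every file name was gibberish

Last year I worked on an e-commerce data collection project that had to download every product's images and name them "<product ID>.jpg".

I built the flow and ran it over 5,000 products, which produced more than 8,000 images (some products have several).

When I opened the folder, every file was named like spu_123456_1.jpg, which did not match the product team's naming convention ("brand-model-color.jpg").

Worse, some downloads had failed (dead URLs), leaving corrupt 0 KB files in the folder.

On top of that, some images were in WebP format. The product team said their system did not support WebP, so those had to be converted to JPG.

The experience taught me that batch image processing is more than "downloading": it also covers renaming, format conversion, compression, and cleaning out corrupt files.

Every example in this article follows one real business scenario: batch downloading and processing e-commerce product images.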

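Module 1: Installation and preparation

Batch image processing needs the following Python libraries:

    pip install requests Pillow opencv-python

requests: downloads images
Pillow: format conversion, compression, reading image dimensions
opencv-python: advanced image processing (optional)

If installing opencv-python fails, install only Pillow. None of the code below uses OpenCV, and Pillow handles roughly 90% of everyday image processing needs.

Detailed environment setup steps are available as an illustrated tutorial at home.linyan.cloud.

Create a new flow and name it "Image Batch Processing Demo".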

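Module 2: Element location (extracting image URLs from a web page)

Extracting the image URLs is the first step of batch image processing.

Extracting image URLs with XPath

    //img[@class='product-img']/@src

This XPath selects the src attribute of every img tag whose class attribute is exactly product-img. If the tag carries additional classes, match with contains(@class, 'product-img') instead.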
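Outside Yingdao, the same extraction can be sketched in plain Python. The example below is a minimal sketch, not the XPath engine itself: it uses the standard library's html.parser, and the names ProductImageParser and extract_image_urls are made up for illustration. Unlike the exact-match XPath above, it matches product-img as one class among several.

```python
from html.parser import HTMLParser
from typing import List


class ProductImageParser(HTMLParser):
    """Collect the src of every <img> that carries a given CSS class."""

    def __init__(self, css_class: str = "product-img") -> None:
        super().__init__()
        self.css_class = css_class
        self.urls: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        # The class attribute may hold several space-separated class names
        classes = (attr_map.get("class") or "").split()
        src = attr_map.get("src")
        if self.css_class in classes and src:
            self.urls.append(src)


def extract_image_urls(html: str, css_class: str = "product-img") -> List[str]:
    """Return the src values of matching <img> tags, in document order."""
    parser = ProductImageParser(css_class)
    parser.feed(html)
    return parser.urls
```

Images without a src attribute are skipped, so the result only contains values that can be passed on to the download step.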

Handling relative URLs

On some pages the image URL is a relative path, such as /img/123.jpg.

It has to be converted to an absolute URL before it can be downloaded:

    import urllib.parse

    def make_absolute_url(relative_url, base_url):
        """Convert a relative URL to an absolute URL."""
        return urllib.parse.urljoin(base_url, relative_url)

    # Example
    base = "https://www.example.com/products/123.html"
    relative = "/img/123.jpg"
    absolute = make_absolute_url(relative, base)
    print(absolute)  # https://www.example.com/img/123.jpg

Note: a URL that starts with // (for example //cdn.example.com/img/123.jpg) is a protocol-relative URL. Only the scheme is missing, and the text right after // is the host name, not a path. urljoin resolves it with the scheme of the base URL; when no base URL is at hand, prepend a scheme yourself:

    def fix_protocol_relative_url(url, default_scheme="https"):
        if url.startswith("//"):
            return default_scheme + ":" + url
        return url

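Module 3: Variables and data types (managing variables for image processing)

The core variables of the image processing flow:

    # Configuration variables (kept in Yingdao's variable panel)
    image_save_dir = "C:/ProductImages"  # directory where images are saved
    image_format = "jpg"                 # target format
    max_width = 800                      # maximum width after compression
    quality = 85                         # JPG quality (1-100)

    # Runtime variables
    download_count = 0  # images downloaded so far
    fail_count = 0      # failures
    skip_count = 0      # skipped (file already exists)

In Yingdao these variables are defined in the "Variables" panel and can be read and changed anywhere in the flow.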
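When the same settings are used from a Python code block, grouping them keeps the configuration and the counters from drifting apart. The following is only a sketch under that assumption: ImageJobConfig and ImageJobStats are hypothetical names, and the defaults simply mirror the values listed above.

```python
from dataclasses import dataclass


@dataclass
class ImageJobConfig:
    """Settings that stay fixed during one run."""
    image_save_dir: str = "C:/ProductImages"
    image_format: str = "jpg"
    max_width: int = 800
    quality: int = 85

    def __post_init__(self) -> None:
        # Fail early on values that would only break later, inside the save call
        if not 1 <= self.quality <= 100:
            raise ValueError(f"quality must be between 1 and 100, got {self.quality}")
        if self.max_width <= 0:
            raise ValueError(f"max_width must be positive, got {self.max_width}")


@dataclass
class ImageJobStats:
    """Counters that change while the flow runs."""
    download_count: int = 0
    fail_count: int = 0
    skip_count: int = 0

    @property
    def total(self) -> int:
        return self.download_count + self.fail_count + self.skip_count
```

Validating in __post_init__ means a typo such as quality=850 is reported before the first image is touched.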

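Module 4: Flow control (the main download loop)

The main loop of the batch download:

For each product:
1. Extract the list of image URLs
2. For each image:
   2.1 Check whether it already exists locally (by file name)
   2.2 If it exists, skip it
   2.3 If it does not exist, download it
   2.4 After the download, verify that the file is valid (not 0 KB, not corrupt)
   2.5 If it is invalid, delete it and mark it as failed
3. Rename the images
4. Convert the format (if needed)
5. Compress (if needed)

In Yingdao this is built with two nested loops: the outer loop walks the products, and the inner loop walks each product's images.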
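Steps 1 and 2 of the loop can be sketched in Python as follows. This is an illustration, not the flow itself: process_products is a hypothetical name, the download step is passed in as a function so the sketch makes no network calls, and the spu_<id>_<index>.jpg file name pattern is assumed from the opening case.

```python
import os
from typing import Callable, Dict, List


def process_products(
    products: Dict[str, List[str]],
    save_dir: str,
    download: Callable[[str, str], bool],
) -> Dict[str, int]:
    """products maps a product ID to its image URLs.

    download(url, save_path) must return True on success.
    """
    stats = {"downloaded": 0, "failed": 0, "skipped": 0}
    os.makedirs(save_dir, exist_ok=True)

    for product_id, urls in products.items():           # outer loop: products
        for index, url in enumerate(urls, start=1):      # inner loop: images
            save_path = os.path.join(save_dir, f"spu_{product_id}_{index}.jpg")

            # Steps 2.1 and 2.2: skip files that already exist
            if os.path.exists(save_path):
                stats["skipped"] += 1
                continue

            # Step 2.3: download
            ok = download(url, save_path)

            # Steps 2.4 and 2.5: a missing or empty file counts as a failure
            if ok and os.path.exists(save_path) and os.path.getsize(save_path) > 0:
                stats["downloaded"] += 1
            else:
                if os.path.exists(save_path):
                    os.remove(save_path)
                stats["failed"] += 1
    return stats
```

Because existing files are skipped and broken files are deleted, running the function a second time retries only the images that failed.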

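Module 5: Web automation (scrolling to load images)

Many pages lazy-load their images: an image loads only once it scrolls into the viewport, and only then does its src change from a placeholder to the real URL.

Handling lazy loading

    # In Yingdao's "Execute JavaScript" instruction
    js_scroll = """
    // Scroll to the bottom of the page
    window.scrollTo(0, document.body.scrollHeight);
    return document.body.scrollHeight;
    """

    # Use it inside a loop:
    # 1. Run the JS above to scroll to the bottom
    # 2. Wait 2 seconds so the images can load
    # 3. Extract the image URLs again
    # 4. If the height after scrolling equals the previous height, the bottom has been reached: exit the loop

I fell into this trap myself: without handling lazy loading, every image URL I extracted was a placeholder (data:image/gif;base64,...).

A placeholder like this is not a real image URL. The real URL becomes available only after scrolling.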
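A cheap safeguard is to filter placeholders out before they reach the download step. The sketch below assumes the img attributes have already been collected into a dictionary. The attribute names in LAZY_ATTRS are an assumption: some lazy-loading scripts keep the real URL in an attribute such as data-src, but the name varies from site to site, so check the page source first.

```python
from typing import Mapping, Optional

# Assumed attribute names; inspect the target page to see which one it uses
LAZY_ATTRS = ("data-src", "data-original", "data-lazy-src")


def is_placeholder(url: Optional[str]) -> bool:
    """True for a missing or empty value, or an inline data: URI."""
    return not url or url.strip().lower().startswith("data:")


def pick_real_url(attrs: Mapping[str, str]) -> Optional[str]:
    """Return a usable image URL from an <img>'s attributes, or None."""
    src = attrs.get("src")
    if not is_placeholder(src):
        return src
    for name in LAZY_ATTRS:
        candidate = attrs.get(name)
        if not is_placeholder(candidate):
            return candidate
    return None
```

A None result means the image has not loaded yet, which is the signal to scroll and extract again.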

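Module 6: Data processing: downloading images

Downloading images with requests means dealing with timeouts, retries, and the User-Agent.

    import os
    import time

    import requests

    def download_image(url, save_path, max_retry=3):
        """Download an image, retrying on timeout."""
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
            "Referer": url,  # some sites check Referer for hotlink protection
        }
        for i in range(max_retry):
            try:
                resp = requests.get(url, headers=headers, timeout=30, stream=True)
                resp.raise_for_status()

                # Check Content-Type to make sure the response is an image
                content_type = resp.headers.get("Content-Type", "")
                if not content_type.startswith("image"):
                    print(f"Not an image: {url}, Content-Type: {content_type}")
                    return False

                # Save the file (stream mode suits large files)
                with open(save_path, "wb") as f:
                    for chunk in resp.iter_content(chunk_size=8192):
                        f.write(chunk)

                # Validate the file size
                file_size = os.path.getsize(save_path)
                if file_size < 1024:  # under 1 KB: probably a broken file
                    print(f"File too small, possibly corrupt: {save_path} ({file_size} bytes)")
                    os.remove(save_path)
                    return False

                print(f"Download succeeded: {url} -> {save_path}")
                return True
            except requests.exceptions.Timeout:
                print(f"Download timed out on attempt {i + 1} of {max_retry}: {url}")
                # Remove any partial file so a failed run never leaves one behind
                if os.path.exists(save_path):
                    os.remove(save_path)
                time.sleep(2)
            except Exception as e:
                print(f"Download failed: {url}, error: {e}")
                if os.path.exists(save_path):
                    os.remove(save_path)
                return False
        return False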
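A size check catches 0 KB files, but not an HTML error page saved under a .jpg name, nor a WebP file served with a .jpg extension. One inexpensive extra check, sketched here with only the standard library, is to look at the file's first bytes. The function names are hypothetical, and the check covers only the four formats listed; it confirms the header, not that the whole file decodes.

```python
from typing import Optional


def sniff_image_format(header: bytes) -> Optional[str]:
    """Guess the image format from the first 12 bytes of a file."""
    if header.startswith(b"\xff\xd8\xff"):
        return "JPEG"
    if header.startswith(b"\x89PNG\r\n\x1a\n"):
        return "PNG"
    if header[:6] in (b"GIF87a", b"GIF89a"):
        return "GIF"
    # WebP is a RIFF container: "RIFF", 4 size bytes, then "WEBP"
    if header[:4] == b"RIFF" and header[8:12] == b"WEBP":
        return "WEBP"
    return None


def looks_like_image(path: str) -> bool:
    """True if the file starts with a known image signature."""
    with open(path, "rb") as f:
        return sniff_image_format(f.read(12)) is not None
```

The returned format name also tells the later steps which files really need the WebP to JPG conversion, regardless of their extension.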

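Module 7: Data processing: renaming images

After the download, rename the images according to the business rules.

    import os
    import re

    def sanitize_filename(name):
        r"""
        Replace characters that are illegal in Windows file names.
        Windows does not allow: / \ : * ? " < > |
        """
        illegal_chars = r'[\\/*?:"<>|]'
        return re.sub(illegal_chars, "_", name)

    def rename_images(image_dir, naming_rule=None):
        """
        Batch-rename images.
        naming_rule: optional function that takes (product_id, index) and returns
        the new base name without extension; defaults to "<product_id>_<index>"
        """
        for filename in os.listdir(image_dir):
            if not filename.lower().endswith((".jpg", ".png", ".webp", ".jpeg")):
                continue
            old_path = os.path.join(image_dir, filename)

            # Parse the file name to get the product data (this assumes names like
            # spu_<product_id>_<index>.<ext>). In a real project the product data
            # may come from a database or an Excel file instead.
            stem, ext = os.path.splitext(filename)
            parts = stem.split("_")
            if len(parts) < 3:
                continue  # not in the expected format
            product_id, index = parts[1], parts[2]  # spu_123456_1.jpg -> 123456, 1

            # Build the new name. The index is kept so that several images of one
            # product do not collide, and the extension is kept because renaming
            # does not convert the format.
            base = naming_rule(product_id, index) if naming_rule else f"{product_id}_{index}"
            new_name = sanitize_filename(base) + ext.lower()
            new_path = os.path.join(image_dir, new_name)

            # Rename
            if old_path != new_path:
                os.rename(old_path, new_path)
                print(f"Renamed: {filename} -> {new_name}")

    # A more practical version: name the file from a product data dictionary
    def rename_by_product_info(image_path, product_info):
        """
        Rename an image from product information.
        product_info: {"id": "123456", "brand": "Apple", "model": "iPhone15", "color": "Black"}
        """
        brand = product_info.get("brand", "unknown")
        model = product_info.get("model", "unknown")
        color = product_info.get("color", "unknown")

        # Replace illegal characters in the file name
        safe_name = f"{brand}-{model}-{color}.jpg"
        safe_name = sanitize_filename(safe_name)

        new_path = os.path.join(os.path.dirname(image_path), safe_name)
        os.rename(image_path, new_path)
        return new_path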
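Naming by brand, model, and color gives every image of the same product the same target name. On Windows os.rename raises FileExistsError when the target exists, and on Linux or macOS it silently replaces the existing file. A small helper can avoid both outcomes; unique_path and safe_rename below are hypothetical names, and the _2, _3 suffix scheme is just one possible convention.

```python
import os


def unique_path(path: str) -> str:
    """Return path if it is free, else add _2, _3, ... before the extension."""
    if not os.path.exists(path):
        return path
    root, ext = os.path.splitext(path)
    counter = 2
    while os.path.exists(f"{root}_{counter}{ext}"):
        counter += 1
    return f"{root}_{counter}{ext}"


def safe_rename(old_path: str, new_path: str) -> str:
    """Rename without overwriting; return the path that was actually used."""
    target = unique_path(new_path)
    os.rename(old_path, target)
    return target
```

With this helper, three images of one product end up as Apple-iPhone15-Black.jpg, Apple-iPhone15-Black_2.jpg, and Apple-iPhone15-Black_3.jpg.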

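Module 8: Data processing: compressing images

E-commerce images are often large (2-3 MB each), and some platforms limit the upload size, so they need compressing.

Compressing with Pillow:

    import os

    from PIL import Image

    def compress_image(input_path, output_path=None, max_width=800, quality=85):
        """
        Compress an image: scale it down and lower the quality.
        max_width: maximum width in pixels; wider images are scaled proportionally
        quality: JPG quality (1-100); 85 balances quality against size
        """
        if output_path is None:
            output_path = input_path
        try:
            # Read the original size first: output_path may overwrite input_path
            old_size = os.path.getsize(input_path)
            img = Image.open(input_path)

            # Convert to RGB (handles RGBA images such as PNGs with an alpha channel)
            if img.mode in ("RGBA", "P"):
                img = img.convert("RGB")

            # Scale
            width, height = img.size
            if width > max_width:
                ratio = max_width / width
                new_height = int(height * ratio)
                img = img.resize((max_width, new_height), Image.LANCZOS)
                print(f"Resized: {width}x{height} -> {max_width}x{new_height}")

            # Save (compress)
            img.save(output_path, "JPEG", quality=quality, optimize=True)

            # Report the result
            new_size = os.path.getsize(output_path)
            saved = (1 - new_size / old_size) * 100
            print(f"Compressed: {old_size / 1024:.1f}KB -> {new_size / 1024:.1f}KB ({saved:.1f}% smaller)")
            return True
        except Exception as e:
            print(f"Compression failed: {input_path}, error: {e}")
            return False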
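The proportional scaling is the only arithmetic in the function, so it helps to see it on its own. The sketch below isolates it in a hypothetical helper, scaled_size, which needs no image library. It truncates the height with int() as the function above does, and additionally keeps the height at 1 pixel or more.

```python
from typing import Tuple


def scaled_size(width: int, height: int, max_width: int = 800) -> Tuple[int, int]:
    """Return the (width, height) after limiting the width to max_width."""
    if width <= max_width:
        return width, height          # small enough: leave untouched
    ratio = max_width / width         # e.g. 800 / 1600 = 0.5
    return max_width, max(1, int(height * ratio))


print(scaled_size(1600, 1200))  # (800, 600)
print(scaled_size(640, 480))    # (640, 480)
```

A 1600x1200 image has ratio 800 / 1600 = 0.5, so the height becomes 1200 x 0.5 = 600. An image that is already narrower than max_width is never enlarged.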

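Module 9: Mouse, keyboard, and image operations (handling CAPTCHA images)

If a CAPTCHA appears while downloading images, it has to be handled.

If the CAPTCHA is an image, a screenshot plus OCR works; see the earlier articles in this series.

One more scenario is worth covering here: some sites generate image URLs dynamically with JS, so XPath cannot extract them directly.

Capturing image URLs with the "Network" panel of the browser developer tools

Once you know where the URLs come from, use Yingdao's "Execute JavaScript" instruction to read them from the page's global variables:

    // Run inside Yingdao's "Execute JavaScript" instruction
    // Some sites keep the image URLs in a property of the window object
    var urls = [];
    if (window.__INITIAL_STATE__ && window.__INITIAL_STATE__.product) {
        var product = window.__INITIAL_STATE__.product;
        if (product.images) {
            urls = product.images.map(function (img) {
                return img.url;
            });
        }
    }
    return JSON.stringify(urls);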
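The script hands back a JSON string, so the Python side has to decode it before downloading. A defensive sketch, assuming the string arrives in a flow variable and using the hypothetical name parse_js_url_list:

```python
import json
from typing import List


def parse_js_url_list(raw: str) -> List[str]:
    """Decode the JSON array returned by the script into a clean URL list."""
    try:
        data = json.loads(raw)
    except (TypeError, ValueError):
        return []                     # not JSON at all (or not a string)
    if not isinstance(data, list):
        return []                     # valid JSON, but not an array

    seen = set()
    urls: List[str] = []
    for item in data:
        # Drop null or empty entries and duplicates, keeping the original order
        if isinstance(item, str) and item and item not in seen:
            seen.add(item)
            urls.append(item)
    return urls
```

An empty result is a useful signal on its own: it usually means the page changed the structure of window.__INITIAL_STATE__ and the script needs another look.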

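Module 10: Advanced skills

Skill 1: WebP to JPG

Some sites (e-commerce sites in particular) serve WebP, because the files are typically around 30% smaller than comparable JPGs.

Some older systems do not support WebP, though, so the images have to be converted to JPG.

    import os

    from PIL import Image

    def webp_to_jpg(webp_path, jpg_path=None):
        """Convert WebP to JPG."""
        if jpg_path is None:
            jpg_path = os.path.splitext(webp_path)[0] + ".jpg"
        try:
            img = Image.open(webp_path)
            # WebP may be RGBA (with transparency); JPG needs RGB
            if img.mode == "RGBA":
                # Create a white background
                background = Image.new("RGB", img.size, (255, 255, 255))
                background.paste(img, mask=img.split()[3])  # use the alpha channel as the mask
                background.save(jpg_path, "JPEG", quality=90)
            else:
                img.convert("RGB").save(jpg_path, "JPEG", quality=90)
            print(f"WebP to JPG done: {webp_path} -> {jpg_path}")
            return True
        except Exception as e:
            print(f"Conversion failed: {webp_path}, error: {e}")
            return False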

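Skill 2: Batch format conversion

    import os

    from PIL import Image

    def batch_convert_format(input_dir, output_dir, target_format="JPG"):
        """
        Batch-convert image formats.
        target_format: JPG, PNG, WEBP, etc.
        """
        os.makedirs(output_dir, exist_ok=True)
        # Pillow's format name is "JPEG"; passing "JPG" to save() raises KeyError
        target_format = target_format.upper()
        pil_format = "JPEG" if target_format == "JPG" else target_format
        converted = 0
        failed = 0
        for filename in os.listdir(input_dir):
            input_path = os.path.join(input_dir, filename)
            if not os.path.isfile(input_path):
                continue

            # Build the output file name
            name, ext = os.path.splitext(filename)
            output_path = os.path.join(output_dir, f"{name}.{target_format.lower()}")
            try:
                img = Image.open(input_path)
                if img.mode in ("RGBA", "P") and pil_format == "JPEG":
                    img = img.convert("RGB")
                img.save(output_path, pil_format)
                converted += 1
            except Exception as e:
                print(f"Conversion failed: {filename}, error: {e}")
                failed += 1
        print(f"Batch conversion finished: {converted} succeeded, {failed} failed")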

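Skill 3: Generating thumbnails

    from PIL import Image

    def create_thumbnail(image_path, thumbnail_path, size=(200, 200)):
        """Create a thumbnail that keeps the aspect ratio."""
        img = Image.open(image_path)
        img.thumbnail(size)  # resizes in place, keeps the aspect ratio, never enlarges
        img.save(thumbnail_path)
        print(f"Thumbnail created: {thumbnail_path}")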

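Module 11: Platform practice

When deploying the image processing flow to the Yingdao console, keep the following points in mind.

Point 1: Use absolute paths for saved images

When the Yingdao console runs a flow, the current directory may not be the one you expect.

Use absolute paths for every image save location, or take the location from a configuration variable.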
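If a configured path might still be relative, it can be pinned to a known folder instead of whatever the current directory happens to be. A minimal sketch with pathlib; resolve_save_dir is a hypothetical helper, and the choice of anchor folder (for example the folder that holds the flow's data) is an assumption to adapt.

```python
from pathlib import Path


def resolve_save_dir(configured: str, anchor: Path) -> Path:
    """Return an absolute save directory.

    An absolute configured path is used as is; a relative one is
    interpreted relative to anchor, never to the current directory.
    """
    path = Path(configured)
    if not path.is_absolute():
        path = anchor / path
    return path.resolve()
```

Logging the resolved path once at the start of a run makes it obvious where the console actually wrote the files.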

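Point 2: Image processing is CPU-intensive, so limit concurrency

Compression and format conversion use a lot of CPU.

If several robots process images at the same time, the server's CPU usage can hit 100%.

The fix: add a "system CPU check" step to the flow that waits while CPU usage is above 80%. It relies on the third-party psutil library (pip install psutil):

    import time

    import psutil

    def wait_for_cpu_free(threshold=80):
        """Wait until CPU usage drops below the threshold."""
        while psutil.cpu_percent(interval=1) > threshold:
            print("CPU usage too high, waiting...")
            time.sleep(5)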

Point 3: Use task monitoring to follow image processing progress

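Write a progress log at regular intervals in the flow so it can be checked in the console:

    def log_progress(current, total):
        percent = int(current / total * 100) if total else 0  # avoid dividing by zero
        print(f"Image processing progress: {current}/{total} ({percent}%)")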

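Module 12: System integration and engineering conventions

Convention 1: Store image files in subdirectories by date or by ID

Do not keep every image in a single folder; too many files in one folder hurts performance.

Subdirectories by date:

    C:/ProductImages/
        2024-06-01/
            ****
        2024-06-02/
            ****

Subdirectories by ID range:

    C:/ProductImages/
        100000-199999/
            ****
        200000-299999/
            ****

    import os

    def get_image_save_path(product_id, base_dir="C:/ProductImages"):
        """Compute the save path from the product ID (one directory per ID range)."""
        product_id = int(product_id)  # IDs parsed from file names are strings
        start = (product_id // 100000) * 100000
        id_range = f"{start}-{start + 99999}"
        save_dir = os.path.join(base_dir, id_range)
        os.makedirs(save_dir, exist_ok=True)
        return os.path.join(save_dir, f"{product_id}.jpg")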
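The date-based layout can be handled the same way. The function below is a sketch with a hypothetical name, get_dated_save_dir; it names each folder with the ISO date (YYYY-MM-DD), matching the tree shown above.

```python
import os
from datetime import date
from typing import Optional


def get_dated_save_dir(base_dir: str, day: Optional[date] = None) -> str:
    """Return (and create) the folder for one day, e.g. <base_dir>/2024-06-01."""
    day = day or date.today()
    save_dir = os.path.join(base_dir, day.isoformat())  # isoformat gives YYYY-MM-DD
    os.makedirs(save_dir, exist_ok=True)
    return save_dir
```

ISO dates sort correctly as plain text, so the folders appear in chronological order in any file manager.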

Convention 2: Record image processing results in a database

Each time an image is processed, store the result in a database so it can be traced and reprocessed later:

    import sqlite3
    from datetime import datetime

    def log_image_process(product_id, image_url, status, save_path=None, error_msg=None):
        """Record one image processing result in SQLite."""
        # The C:/RPA_Data directory must already exist
        conn = sqlite3.connect("C:/RPA_Data/image_log.db")
        try:
            conn.execute("""
                CREATE TABLE IF NOT EXISTS image_log (
                    id INTEGER PRIMARY KEY AUTOINCREMENT,
                    product_id TEXT,
                    image_url TEXT,
                    status TEXT,
                    save_path TEXT,
                    error_msg TEXT,
                    log_time TEXT
                )
            """)
            now = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
            conn.execute("""
                INSERT INTO image_log
                    (product_id, image_url, status, save_path, error_msg, log_time)
                VALUES (?, ?, ?, ?, ?, ?)
            """, (product_id, image_url, status, save_path, error_msg, now))
            conn.commit()
        finally:
            conn.close()  # close the connection even if a statement fails
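The log pays off when something needs to be rerun. As a sketch of the "reprocess" half, the query below lists the images whose most recent log entry is a failure, so an image that failed once and later succeeded is not retried. The function name fetch_failed_images and the status value "failed" are assumptions; use whatever status strings your flow actually writes.

```python
import sqlite3
from typing import List, Tuple


def fetch_failed_images(
    conn: sqlite3.Connection, status: str = "failed"
) -> List[Tuple[str, str]]:
    """Return (product_id, image_url) pairs whose latest entry has this status."""
    rows = conn.execute(
        """
        SELECT a.product_id, a.image_url
        FROM image_log AS a
        WHERE a.status = ?
          AND a.id = (
              SELECT MAX(b.id)
              FROM image_log AS b
              WHERE b.product_id = a.product_id
                AND b.image_url = a.image_url
          )
        ORDER BY a.id
        """,
        (status,),
    ).fetchall()
    return [(row[0], row[1]) for row in rows]
```

Feeding the returned pairs back into the download step turns the log into a retry queue.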

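Cheat sheet: common image processing operations

Operation	Pillow function	Notes
Open an image	`Image.open(path)`	Supports jpg/png/webp and more
Save an image	`img.save(path, format, quality=...)`	quality applies to lossy formats such as JPEG and WebP; PNG ignores it
Resize	`img.resize((w,h))`	Or use img.thumbnail() to keep the aspect ratio
Convert format	`img.convert("RGB").save(...)`	RGBA must be converted to RGB before saving as JPG
Get dimensions	`img.size`	Returns a (width, height) tuple
Get format	`img.format`	Returns "JPEG", "PNG", etc.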