
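"AI Large Model Application Development in Practice: From Beginner to Expert (60 Parts)", Part 032. Image Understanding in Practice: Analyzing Image Content with LLaVA or Qwen-VL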

news 2026/6/24 14:37:39

032 Image Understanding in Practice: Analyzing Image Content with LLaVA or Qwen-VL

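A Bug That Kept Me Up All Night

Last Friday night a client tossed me a blurry surveillance screenshot and asked, "Can you make out the product labels on this shelf?" I tweaked the prompt a bit, fed the image to Qwen-VL, and the model told me with a straight face: "The image shows a person in red clothing running." It was a perfectly still shelf, and the red was a promotional label.

This kind of hallucination is all too common in image understanding tasks. It's not that the model can't read the image; it "overthinks" it. I later switched to LLaVA, added a few tricks, and pulled the results back in line. These notes cover how to get both models to do their job without embellishing.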

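Environment Setup: Don't Waste Time on pip

Pitfalls first. The dependency version conflicts between LLaVA and Qwen-VL can make you question your life choices. I recommend going straight to a conda environment; Python 3.10 has been the most stable for me.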

    conda create -n vl_env python=3.10
    conda activate vl_env

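LLaVA officially recommends transformers 4.36.0 or later, but avoid 4.38.0: in my experience that version has an attention-mask bug that makes VRAM usage explode during inference. I hit it myself and went straight to OOM.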

    pip install transformers==4.37.2 torch torchvision accelerate
    pip install git+https://github.com/haotian-liu/LLaVA.git

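Qwen-VL is simpler, but watch the tiktoken version it depends on. If you have other LLM libraries installed alongside it, a tiktoken version conflict will make the tokenizer fail to load. The fix: give it a separate environment.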

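    pip install qwen-vl-utils transformers==4.37.2
    # Qwen-VL-Chat's remote code also imports these at load time
    pip install tiktoken einops transformers_stream_generator matplotlib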
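Since version clashes are the main hazard here, it can help to print what is actually installed before loading any model. Below is a minimal sketch using only the standard library. The helper name `installed_versions` and the watch list are my own choices for illustration, not something either project prescribes.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Hypothetical watch list: the packages this article pins or warns about.
    watch_list = ["transformers", "torch", "tiktoken"]
    for name, version in installed_versions(watch_list).items():
        print(f"{name}: {version or 'not installed'}")
```

Running this once in each environment makes it easy to see whether the pins above actually took effect.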

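LLaVA in Practice: From Loading to Inference

One detail about loading LLaVA: it has to load the vision tower and the language model together. If you only have 16 GB of VRAM, the 7B model in fp16 is a tight fit (the weights alone take roughly 14 GB), so play it safe with a smaller distilled variant or quantized loading.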

    from llava.model.builder import load_pretrained_model
    from llava.mm_utils import get_model_name_from_path

    model_path = "liuhaotian/llava-v1.5-7b"
    # Gotcha: model_name must match the path, otherwise loading the vision tower fails
    model_name = get_model_name_from_path(model_path)

    tokenizer, model, image_processor, context_len = load_pretrained_model(
        model_path=model_path,
        model_base=None,
        model_name=model_name,
        device_map="auto",  # don't hard-code "cuda:0"; it breaks on multi-GPU machines
    )

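At inference time, prompt design is what matters most. Don't write "please describe this image," or the model will hand you an essay. Pin down the output format precisely.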

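    import torch
    from PIL import Image
    from llava.constants import IMAGE_TOKEN_INDEX
    from llava.mm_utils import tokenizer_image_token

    image = Image.open("shelf.jpg").convert("RGB")
    # Don't pass the PIL object straight in: LLaVA resizes internally but does not keep the aspect ratio.
    # Do this instead: let image_processor handle it
    image_tensor = image_processor.preprocess(image, return_tensors="pt")["pixel_values"].half().cuda()

    # Prompt (zh): "List the names of all products in the image, comma-separated, no extra commentary."
    prompt = "USER: <image>\n请列出图片中所有商品的名称，用逗号分隔，不要额外说明。\nASSISTANT:"
    # Note: keep the <image> placeholder on its own line, and tokenize with tokenizer_image_token;
    # the plain tokenizer does not map <image> to the image slot, so the image would be ignored
    input_ids = tokenizer_image_token(
        prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt"
    ).unsqueeze(0).cuda()

    with torch.inference_mode():
        output_ids = model.generate(
            input_ids,
            images=image_tensor,
            do_sample=False,  # greedy decoding, so repeated runs give the same answer
            max_new_tokens=128,
            # temperature / top_p are left out: they only apply when do_sample=True
        )

    # Recent LLaVA releases return only the newly generated tokens; if your version echoes
    # the prompt, slice with output_ids[0][input_ids.shape[1]:] before decoding
    response = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
    print(response)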

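One gotcha here: with do_sample=False, generation is greedy, so the sampling parameters temperature and top_p are both ignored (transformers only warns about them). If you want deterministic output, do_sample=False is enough; leave temperature and top_p out entirely.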
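Asking for a comma-separated list only pays off if the reply is then parsed defensively. The sketch below pairs a prompt builder with a parser; both function names are hypothetical helpers of mine, and the separator set (ASCII comma, full-width comma, enumeration comma, newline) is an assumption about what the model tends to emit for Chinese prompts.

```python
import re


def build_list_prompt(instruction):
    """Wrap an instruction in the USER/ASSISTANT layout used above, with the image placeholder."""
    return f"USER: <image>\n{instruction}\nASSISTANT:"


def parse_item_list(response):
    """Split a comma-separated reply into clean item names, dropping blanks and duplicates."""
    # Accept ASCII and full-width commas, the Chinese enumeration comma, and newlines.
    parts = re.split(r"[,，、\n]+", response)
    items = []
    for part in parts:
        item = part.strip().strip("。.")  # trim whitespace and a trailing full stop
        if item and item not in items:
            items.append(item)
    return items
```

For example, `parse_item_list("Coke, Sprite, chips.")` returns `["Coke", "Sprite", "chips"]`. If the model ignores the format and writes a sentence, the result is a single long item, which is an easy signal to retry with a stricter prompt.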

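Qwen-VL in Practice: Lighter but More Sensitive

Qwen-VL's API design is more modern, but its vision encoder is sensitive to image size. In my tests 224x224 inputs gave the best results; going much larger or much smaller lost detail. Keep in mind that the model rescales every image to its own fixed input resolution anyway, so treat this as an empirical observation and check it on your own data.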

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Qwen-VL-Chat ships its own modeling code, so trust_remote_code=True is required.
    # (The messages + process_vision_info workflow belongs to the later Qwen2-VL series.)
    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        "Qwen/Qwen-VL-Chat",
        device_map="auto",
        trust_remote_code=True,
        bf16=True,
    ).eval()

    # Deterministic, short answers
    model.generation_config.do_sample = False
    model.generation_config.max_new_tokens = 128

    # Gotcha: the input to Qwen-VL must be a list, even when there is only one image
    # Prompt (zh): "Identify the names of all products in the image; output only the list of names, no explanation."
    query = tokenizer.from_list_format([
        {"image": "shelf.jpg"},  # local path or URL; the tokenizer and model load it for you
        {"text": "请识别图片中所有商品的名称，只输出名称列表，不要解释。"},
    ])

    with torch.inference_mode():
        response, history = model.chat(tokenizer, query=query, history=None)
    print(response)

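One trait of Qwen-VL: it understands Chinese prompts better than LLaVA does, while English prompts are more likely to drift off course. If you're working on a Chinese-language scenario, just write the prompt in Chinese.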

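Image Preprocessing: The Details That Make or Break It

The two models have different preprocessing requirements, but they share one rule: don't feed them heavily compressed JPEGs. I once had a JPEG saved at quality factor 60 where LLaVA read the "Coke" on the shelf as "Sprite", because compression had turned the red on the label into orange.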

    from PIL import Image, ImageOps

    # Generic preprocessing: keep the aspect ratio, pad to a square
    def preprocess_image(image_path, target_size=224):
        img = Image.open(image_path).convert("RGB")
        # Don't resize directly: that distorts the image.
        # Do this instead: compute the scale factor and pad with black bars
        ratio = min(target_size / img.width, target_size / img.height)
        new_size = (int(img.width * ratio), int(img.height * ratio))
        img = img.resize(new_size, Image.LANCZOS)
        # Pad to a square
        delta_w = target_size - new_size[0]
        delta_h = target_size - new_size[1]
        padding = (delta_w // 2, delta_h // 2, delta_w - delta_w // 2, delta_h - delta_h // 2)
        img = ImageOps.expand(img, padding, fill=(0, 0, 0))
        return img

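I've used this function for half a year and it has been reliable. Note that the padding is black, not white; with white, the model treats the padded area as background.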
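To make the scale-and-pad arithmetic easier to check by hand, here is the geometry on its own, without Pillow. This is a sketch under my own assumptions: the name `letterbox_geometry` is hypothetical, and it uses `round()` where the function above uses `int()`, so floating-point error cannot shave a pixel off the long side.

```python
def letterbox_geometry(width, height, target_size=224):
    """Return (new_size, padding) for fitting a width x height image into a square.

    padding is (left, top, right, bottom), the order ImageOps.expand expects.
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    longest = max(width, height)
    # Scale so the longer side lands exactly on target_size.
    new_w = max(1, round(width * target_size / longest))
    new_h = max(1, round(height * target_size / longest))
    delta_w = target_size - new_w
    delta_h = target_size - new_h
    # Split the leftover evenly; any odd pixel goes to the right/bottom edge.
    padding = (delta_w // 2, delta_h // 2, delta_w - delta_w // 2, delta_h - delta_h // 2)
    return (new_w, new_h), padding


print(letterbox_geometry(1920, 1080))  # ((224, 126), (0, 49, 0, 49))
```

A 1920x1080 frame shrinks to 224x126 and gets 49 pixels of padding above and below, which also shows how little vertical detail survives at this target size.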

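Multi-Turn Dialogue and Context Management

Image understanding is not always a one-shot query. Say you're analyzing a photo of a circuit board: first you ask "which chips are there," then you follow up with "what's the model of chip U3?" That requires maintaining the conversation history.

LLaVA's conversation handling is fairly primitive; you have to concatenate the history by hand: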

    def build_llava_conversation(history, new_image=None, new_text=""):
        """Concatenate earlier turns and the new question into one LLaVA prompt."""
        conv = []
        for turn in history:
            conv.append(f"USER: {turn['user']}\nASSISTANT: {turn['assistant']}")
        if new_image:
            conv.append(f"USER: <image>\n{new_text}\nASSISTANT:")
        else:
            conv.append(f"USER: {new_text}\nASSISTANT:")
        return "\n".join(conv)
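Hand-built history grows without bound, and a naive "keep the last N turns" rule can silently drop the turn that carries the image. The sketch below is one way to trim safely. The helper `trim_history` is hypothetical, and it assumes each turn is a dict with "user" and "assistant" keys where the image turn keeps the `<image>` placeholder in its user text.

```python
def trim_history(history, max_turns=4, image_token="<image>"):
    """Keep the most recent turns, but never drop the turn whose user text holds the image token."""
    if max_turns < 1:
        raise ValueError("max_turns must be at least 1")
    if len(history) <= max_turns:
        return list(history)
    recent = history[-max_turns:]
    # Look for an image turn among the turns that are about to be cut.
    image_turns = [turn for turn in history[:-max_turns] if image_token in turn["user"]]
    if image_turns:
        # Pin the image turn at the front and give up the oldest of the recent turns.
        return [image_turns[0]] + recent[1:]
    return recent
```

The trimmed list can then be passed to the history-concatenation function above in place of the full history. Dropping middle turns does lose some context, so this is a trade-off between prompt length and recall, not a free win.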

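Qwen-VL supports multi-turn dialogue natively, but the image has to stay in the context that goes to the model on every call, even if it hasn't changed. Drop the turn that carried it and the model forgets the visual information it saw earlier.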

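    # Turn 1 (zh): "Which chips are there?"
    query = tokenizer.from_list_format([
        {"image": "board.jpg"},
        {"text": "有哪些芯片？"},
    ])
    response, history = model.chat(tokenizer, query=query, history=None)
    # e.g. "有U1、U2、U3三个芯片。" ("There are three chips: U1, U2 and U3.")

    # Turn 2 (zh): "What is the model of U3?"
    # history carries the image reference forward, so it is re-sent to the model on this call too
    response, history = model.chat(tokenizer, query="U3的型号是什么？", history=history)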

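Performance Optimization: From 30 Seconds to 3

The first time I ran LLaVA 7B, inference on a single image took 30 seconds. Three changes brought that down to 3 seconds:

Use Flash Attention 2: install the flash-attn library and pass attn_implementation="flash_attention_2" when loading the model. In my runs VRAM usage dropped by 40% and speed doubled.
Cache preprocessed images: when the same image goes through inference several times, cache the image_tensor instead of reprocessing it every time.
Batch inference: with several images to analyze, run them together with batch_size=4. Throughput roughly tripled for me.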

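    # Batch inference example
    images = [preprocess_image(f"img_{i}.jpg") for i in range(4)]
    # The processor accepts a list and returns one [4, 3, H, W] tensor;
    # torch.stack over per-image outputs would add a stray dimension
    image_tensors = image_processor.preprocess(images, return_tensors="pt")["pixel_values"].half().cuda()
    # Note: in batch inference every prompt must be identical, otherwise the input_ids need padding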
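The preprocessing cache mentioned above can be as small as a dictionary keyed by a hash of the file contents. The following is a minimal sketch, not the exact code I run: the class name `PreprocessCache` is made up for illustration, and `preprocess_fn` stands for whatever turns a file path into a ready-to-use tensor.

```python
import hashlib


class PreprocessCache:
    """Memoize an expensive per-image preprocessing step, keyed by the file's content hash."""

    def __init__(self, preprocess_fn):
        self._preprocess_fn = preprocess_fn  # callable: path -> processed result
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(path):
        # Hash the bytes rather than the path, so an overwritten file is not served stale.
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()

    def get(self, path):
        key = self._key(path)
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._preprocess_fn(path)
        return self._store[key]
```

Wrapped around the image-processor call, `cache.get("shelf.jpg")` pays the preprocessing cost only on the first request. The store is unbounded, so if it holds GPU tensors for many distinct images, add an eviction policy before using it in a long-running service.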