
[vLLM Learning] Vision Language Embedding

vLLM is a framework designed to accelerate large language model inference. It achieves near-zero waste of KV cache memory, addressing the memory-management bottleneck.

More vLLM documentation and tutorials (in Chinese) are available at → https://go.hyper.ai/Wa62f

* Run the vLLM getting-started tutorial online: a step-by-step guide for beginners

Source: examples/offline_inference/vision_language_embedding.py

The script below defines one query type per input modality (text, image, and text+image), builds model-specific prompts for two embedding models (e5-v and VLM2Vec), and calls LLM.embed to produce multimodal embeddings.

""" 本示例展示如何使用 vLLM 执行离线推理,并在视觉语言模型上 使用正确的提示格式生成多模态嵌入。 对于大多数模型,提示格式应遵循 HuggingFace 模型库中 相应的示例格式。 """ from argparse import Namespace from dataclasses import asdict from typing import Literal, NamedTuple, Optional, TypedDict, Union, get_args from PIL.Image import Image from vllm import LLM, EngineArgs from vllm.multimodal.utils import fetch_image from vllm.utils import FlexibleArgumentParser class TextQuery(TypedDict): modality: Literal["text"] text: str class ImageQuery(TypedDict): modality: Literal["image"] image: Image class TextImageQuery(TypedDict): modality: Literal["text+image"] text: str image: Image QueryModality = Literal["text", "image", "text+image"] Query = Union[TextQuery, ImageQuery, TextImageQuery] class ModelRequestData(NamedTuple): engine_args: EngineArgs prompt: str image: Optional[Image] def run_e5_v(query: Query) -> ModelRequestData: llama3_template = '<|start_header_id|>user<|end_header_id|>\n\n{}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n \n' # noqa: E501 if query["modality"] == "text": text = query["text"] prompt = llama3_template.format( f"{text}\nSummary above sentence in one word: ") image = None elif query["modality"] == "image": prompt = llama3_template.format( "<image>\nSummary above image in one word: ") image = query["image"] else: modality = query['modality'] raise ValueError(f"Unsupported query modality: '{modality}'") engine_args = EngineArgs( model="royokong/e5-v", task="embed", max_model_len=4096, ) return ModelRequestData( engine_args=engine_args, prompt=prompt, image=image, ) def run_vlm2vec(query: Query) -> ModelRequestData: if query["modality"] == "text": text = query["text"] prompt = f"Find me an everyday image that matches the given caption: {text}" # noqa: E501 image = None elif query["modality"] == "image": prompt = "<|image_1|> Find a day-to-day image that looks similar to the provided image." # noqa: E501 image = query["image"] elif query["modality"] == "text+image": text = query["text"] prompt = f"<|image_1|> Represent the given image with the following question: {text}" # noqa: E501 image = query["image"] else: modality = query['modality'] raise ValueError(f"Unsupported query modality: '{modality}'") engine_args = EngineArgs( model="TIGER-Lab/VLM2Vec-Full", task="embed", trust_remote_code=True, mm_processor_kwargs={"num_crops": 4}, ) return ModelRequestData( engine_args=engine_args, prompt=prompt, image=image, ) def get_query(modality: QueryModality): if modality == "text": return TextQuery(modality="text", text="A dog sitting in the grass") if modality == "image": return ImageQuery( modality="image", image=fetch_image( "https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/American_Eskimo_Dog.jpg/360px-American_Eskimo_Dog.jpg" # noqa: E501 ), ) if modality == "text+image": return TextImageQuery( modality="text+image", text="A cat standing in the snow.", image=fetch_image( "https://upload.wikimedia.org/wikipedia/commons/thumb/b/b6/Felis_catus-cat_on_snow.jpg/179px-Felis_catus-cat_on_snow.jpg" # noqa: E501 ), ) msg = f"Modality {modality} is not supported." 
raise ValueError(msg) def run_encode(model: str, modality: QueryModality, seed: Optional[int]): query = get_query(modality) req_data = model_example_map[model](query) engine_args = asdict(req_data.engine_args) | {"seed": seed} llm = LLM(**engine_args) mm_data = {} if req_data.image is not None: mm_data["image"] = req_data.image outputs = llm.embed({ "prompt": req_data.prompt, "multi_modal_data": mm_data, }) for output in outputs: print(output.outputs.embedding) def main(args: Namespace): run_encode(args.model_name, args.modality, args.seed) model_example_map = { "e5_v": run_e5_v, "vlm2vec": run_vlm2vec, } if __name__ == "__main__": parser = FlexibleArgumentParser( description='Demo on using vLLM for offline inference with ' 'vision language models for multimodal embedding') parser.add_argument('--model-name', '-m', type=str, default="vlm2vec", choices=model_example_map.keys(), help='The name of the embedding model.') parser.add_argument('--modality', type=str, default="image", choices=get_args(QueryModality), help='Modality of the input.') parser.add_argument("--seed", type=int, default=None, help="Set the seed when initializing `vllm.LLM`.") args = parser.parse_args() main(args)
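
Stripped of the argparse scaffolding, the core call path is short. The sketch below is a minimal illustration rather than part of the original script: it reuses the e5-v checkpoint, prompt template, and image URL from the example above and embeds a single image directly.

# Minimal sketch: single-image embedding with e5-v. Model name, template,
# and image URL come from the script above; the rest is illustrative.
from vllm import LLM
from vllm.multimodal.utils import fetch_image

llm = LLM(model="royokong/e5-v", task="embed", max_model_len=4096)

template = ('<|start_header_id|>user<|end_header_id|>\n\n{}<|eot_id|>'
            '<|start_header_id|>assistant<|end_header_id|>\n\n \n')
prompt = template.format("<image>\nSummary above image in one word: ")

image = fetch_image(
    "https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/American_Eskimo_Dog.jpg/360px-American_Eskimo_Dog.jpg"
)

# One request in, one embedding out; the vector lives in output.outputs.embedding.
outputs = llm.embed({"prompt": prompt, "multi_modal_data": {"image": image}})
print(len(outputs[0].outputs.embedding))  # dimensionality of the embedding vector

To run the full script instead, select the model and modality on the command line, for example: python vision_language_embedding.py -m e5_v --modality image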