SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models
2023-11-13
Status: Published
Abstract

We present SPHINX, a versatile multi-modal large language model (MLLM) with a joint mixing of model weights, tuning tasks, and visual embeddings. First, for stronger vision-language alignment, we unfreeze the large language model (LLM) during pre-training, and introduce a weight mix strategy between LLMs trained by real-world and synthetic data. By directly integrating the weights from two domains, the mixed LLM can efficiently incorporate diverse semantics with favorable robustness. Then, to enable multi-purpose capabilities, we mix a variety of tasks for joint visual instruction tuning, and design task-specific instructions to avoid inter-task conflict. In addition to the basic visual question answering, we include more challenging tasks such as region-level understanding, caption grounding, document layout detection, and human pose estimation, contributing to mutual enhancement over different scenarios. Additionally, we propose to extract comprehensive visual embeddings from various network architectures, pre-training paradigms, and information granularity, providing language models with more robust image representations. Based on our proposed joint mixing, SPHINX exhibits superior multi-modal understanding capabilities on a wide range of applications. On top of this, we further propose an efficient strategy aiming to better capture fine-grained appearances of high-resolution images. With a mixing of different scales and high-resolution sub-images, SPHINX attains exceptional visual parsing and reasoning performance on existing evaluation benchmarks. We hope our work may cast a light on the exploration of joint mixing in future MLLM research. 
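
The weight mix strategy described in the abstract can be illustrated with a minimal sketch. The abstract states only that the weights from the two domains are integrated directly; the sketch below assumes this is a linear interpolation between two checkpoints with identical parameter names. The function name mix_weights, the coefficient beta, and the toy checkpoints are hypothetical and are not taken from the paper.

```python
from typing import Dict, List

# A checkpoint is modeled here as parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def mix_weights(real: Weights, synthetic: Weights, beta: float = 0.5) -> Weights:
    """Linearly interpolate two checkpoints that share the same parameter names.

    Assumption: mixed = beta * real + (1 - beta) * synthetic. The abstract
    does not specify the exact mixing rule.
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    if real.keys() != synthetic.keys():
        raise ValueError("checkpoints must have identical parameter names")
    mixed: Weights = {}
    for name, real_values in real.items():
        synth_values = synthetic[name]
        if len(real_values) != len(synth_values):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        mixed[name] = [
            beta * r + (1.0 - beta) * s for r, s in zip(real_values, synth_values)
        ]
    return mixed


# Toy checkpoints (hypothetical values, for illustration only).
real_ckpt = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
synth_ckpt = {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]}
mixed_ckpt = mix_weights(real_ckpt, synth_ckpt, beta=0.5)
```

With beta = 1.0 the result equals the checkpoint trained on real-world data, and with beta = 0.0 it equals the one trained on synthetic data; intermediate values blend the two without any further training.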

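DOI: arXiv:2311.07575
Related URL: View original
Source: arXiv
WOS accession number: PPRN:86136116
WOS categories: Computer Science, Artificial Intelligence; Computer Science, Interdisciplinary Applications; Computer Science, Software Engineering
Document type: Preprint
Item identifier: https://kms.shanghaitech.edu.cn/handle/2MSLDSTB/348081
Collection: School of Information Science and Technology_Master's Students
Author affiliations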
1.Shanghai AI Lab, Shanghai, Peoples R China
2.CUHK, MMLab, Hong Kong, Peoples R China
3.ShanghaiTech Univ, Shanghai, Peoples R China
Recommended citation
GB/T 7714
Lin, Ziyi,Liu, Chris,Zhang, Renrui,et al. SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models. 2023.