ShanghaiTech University Knowledge Management System
Title | HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models |
Date | 2024-03-20 |
Status | Published |
Abstract | Recent advancements indicate that scaling up Multimodal Large Language Models (MLLMs) effectively enhances performance on downstream multimodal tasks. The prevailing MLLM paradigm, e.g., LLaVA, transforms visual features into text-like tokens using a static vision-language mapper, thereby enabling static LLMs to develop the capability to comprehend visual information through visual instruction tuning. Although promising, the static tuning strategy that shares the same parameters may constrain performance across different downstream multimodal tasks. In light of this, we introduce HyperLLaVA, which involves adaptive tuning of the projector and LLM parameters, in conjunction with a dynamic visual expert and a dynamic language expert, respectively. These experts are derived from HyperNetworks, which generate adaptive parameter shifts through visual and language guidance, enabling dynamic projector and LLM modeling in two-stage training. Our experiments demonstrate that our solution significantly surpasses LLaVA on existing MLLM benchmarks, including MME, MMBench, SEED-Bench, and LLaVA-Bench. (An illustrative code sketch of the hypernetwork-driven projector idea follows this record.) |
arXiv ID | arXiv:2403.13447 |
Related URL | View Full Text |
Source | arXiv |
WOS Accession Number | PPRN:88213171 |
WOS Categories | Computer Science, Artificial Intelligence; Computer Science, Interdisciplinary Applications; Computer Science, Software Engineering |
Document Type | Preprint |
Item Identifier | https://kms.shanghaitech.edu.cn/handle/2MSLDSTB/372954 |
Collection | School of Information Science and Technology_Undergraduates |
Corresponding Author | Zhang, Wenqiao |
Author Affiliations | 1. Zhejiang Univ, Hangzhou, Peoples R China; 2. Shanghai Tech Univ, Shanghai, Peoples R China; 3. Chongqing Univ, Chongqing, Peoples R China; 4. Alibaba Grp, Hangzhou, Peoples R China; 5. Harbin Inst Technol, Harbin, Peoples R China |
Recommended Citation (GB/T 7714) | Zhang, Wenqiao, Lin, Tianwei, Liu, Jiang, et al. HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models. 2024. |
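Illustrative note (not part of the original record): the abstract describes a hypernetwork that produces adaptive parameter shifts for the vision-language projector, conditioned on visual guidance. Below is a minimal, hypothetical PyTorch sketch of that general idea, not the authors' implementation; the class name HyperProjector, the dimensions (1024-d visual features, 4096-d LLM hidden size), the mean-pooled guidance signal, and the low-rank factorization of the parameter shift are all assumptions made for illustration.

```python
# Hypothetical sketch of a hypernetwork-driven projector (not the authors' code).
# A small "guidance" network reads the visual features and emits a low-rank
# parameter shift delta_W that is added to a static base projector, so the
# projection adapts per input sample.
import torch
import torch.nn as nn

class HyperProjector(nn.Module):
    def __init__(self, vis_dim=1024, txt_dim=4096, guide_dim=64, rank=8):
        super().__init__()
        self.base = nn.Linear(vis_dim, txt_dim)   # static LLaVA-style projector
        # Hypernetwork: compress visual guidance, then emit factors A, B of the shift.
        self.guide = nn.Sequential(nn.Linear(vis_dim, guide_dim), nn.ReLU())
        self.to_A = nn.Linear(guide_dim, txt_dim * rank)
        self.to_B = nn.Linear(guide_dim, rank * vis_dim)
        self.rank = rank

    def forward(self, vis_tokens):                           # (batch, n_tokens, vis_dim)
        g = self.guide(vis_tokens.mean(dim=1))               # pooled visual guidance
        A = self.to_A(g).view(-1, self.base.out_features, self.rank)
        B = self.to_B(g).view(-1, self.rank, self.base.in_features)
        delta_W = A @ B                                       # (batch, txt_dim, vis_dim)
        # Base projection plus the sample-specific parameter shift.
        shift = torch.einsum("bov,bnv->bno", delta_W, vis_tokens)
        return self.base(vis_tokens) + shift                  # (batch, n_tokens, txt_dim)

if __name__ == "__main__":
    proj = HyperProjector()
    tokens = torch.randn(2, 576, 1024)    # e.g. 576 patch features per image
    print(proj(tokens).shape)             # torch.Size([2, 576, 4096])
```

The low-rank factorization here is only a design choice to keep the hypernetwork's output small; the same conditioning idea could instead emit bias shifts or full weight deltas, and an analogous "language expert" would condition parameter shifts for the LLM layers on language guidance during the instruction-tuning stage.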