ShanghaiTech University Knowledge Management System
Title | HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models |
Date | 2024-03-20 |
Status | Published |
Abstract | Recent advancements indicate that scaling up Multimodal Large Language Models (MLLMs) effectively enhances performance on downstream multimodal tasks. The prevailing MLLM paradigm, e.g., LLaVA, transforms visual features into text-like tokens using a static vision-language mapper, thereby enabling static LLMs to develop the capability to comprehend visual information through visual instruction tuning. Although promising, the static tuning strategy that shares the same parameters may constrain performance across different downstream multimodal tasks. In light of this, we introduce HyperLLaVA, which involves adaptive tuning of the projector and LLM parameters, in conjunction with a dynamic visual expert and a dynamic language expert, respectively. These experts are derived from HyperNetworks, which generate adaptive parameter shifts through visual and language guidance, enabling dynamic projector and LLM modeling in two-stage training. Our experiments demonstrate that our solution significantly surpasses LLaVA on existing MLLM benchmarks, including MME, MMBench, SEED-Bench, and LLaVA-Bench. (An illustrative code sketch of the hypernetwork-driven projector idea follows this record.) |
arXiv ID | arXiv:2403.13447 |
Related URL | View Full Text |
Source | arXiv |
WOS Accession Number | PPRN:88213171 |
WOS Categories | Computer Science, Artificial Intelligence; Computer Science, Interdisciplinary Applications; Computer Science, Software Engineering |
Document Type | Preprint |
Item Identifier | https://kms.shanghaitech.edu.cn/handle/2MSLDSTB/372954 |
Collection | School of Information Science and Technology_Undergraduates |
Corresponding Author | Zhang, Wenqiao |
Author Affiliations | 1. Zhejiang Univ, Hangzhou, Peoples R China; 2. Shanghai Tech Univ, Shanghai, Peoples R China; 3. Chongqing Univ, Chongqing, Peoples R China; 4. Alibaba Grp, Hangzhou, Peoples R China; 5. Harbin Inst Technol, Harbin, Peoples R China |
Recommended Citation (GB/T 7714) | Zhang, Wenqiao, Lin, Tianwei, Liu, Jiang, et al. HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models. 2024. |
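Illustrative note (not part of the original record): the abstract describes a hypernetwork that produces adaptive parameter shifts for the vision-language projector, conditioned on visual guidance. Below is a minimal, hypothetical PyTorch sketch of that general idea, not the authors' implementation; the class name HyperProjector, the dimensions (1024-d visual features, 4096-d LLM hidden size), the mean-pooled guidance signal, and the low-rank factorization of the parameter shift are all assumptions made for illustration.

```python
# Hypothetical sketch of a hypernetwork-driven projector (not the authors' code).
# A small "guidance" network reads the visual features and emits a low-rank
# parameter shift delta_W that is added to a static base projector, so the
# projection adapts per input sample.
import torch
import torch.nn as nn

class HyperProjector(nn.Module):
    def __init__(self, vis_dim=1024, txt_dim=4096, guide_dim=64, rank=8):
        super().__init__()
        self.base = nn.Linear(vis_dim, txt_dim)   # static LLaVA-style projector
        # Hypernetwork: compress visual guidance, then emit factors A, B of the shift.
        self.guide = nn.Sequential(nn.Linear(vis_dim, guide_dim), nn.ReLU())
        self.to_A = nn.Linear(guide_dim, txt_dim * rank)
        self.to_B = nn.Linear(guide_dim, rank * vis_dim)
        self.rank = rank

    def forward(self, vis_tokens):                           # (batch, n_tokens, vis_dim)
        g = self.guide(vis_tokens.mean(dim=1))               # pooled visual guidance
        A = self.to_A(g).view(-1, self.base.out_features, self.rank)
        B = self.to_B(g).view(-1, self.rank, self.base.in_features)
        delta_W = A @ B                                       # (batch, txt_dim, vis_dim)
        # Base projection plus the sample-specific parameter shift.
        shift = torch.einsum("bov,bnv->bno", delta_W, vis_tokens)
        return self.base(vis_tokens) + shift                  # (batch, n_tokens, txt_dim)

if __name__ == "__main__":
    proj = HyperProjector()
    tokens = torch.randn(2, 576, 1024)    # e.g. 576 patch features per image
    print(proj(tokens).shape)             # torch.Size([2, 576, 4096])
```

The low-rank factorization here is only a design choice to keep the hypernetwork's output small; the same conditioning idea could instead emit bias shifts or full weight deltas, and an analogous "language expert" would condition parameter shifts for the LLM layers on language guidance during the instruction-tuning stage.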