Make LLM Inference Affordable to Everyone: Augmenting GPU Memory with NDP-DIMM
2025-03-05
Conference Proceedings: 2025 IEEE International Symposium on High Performance Computer Architecture (HPCA)
ISSN: 1530-0897
Publication Status: Published
DOI: 10.1109/HPCA61900.2025.00129
Abstract: Billion-scale Large Language Models (LLMs) typically require deployment on expensive server-grade GPUs with large-capacity HBM and abundant compute capability. As LLM-assisted services become popular, achieving cost-effective LLM inference on budget-friendly hardware has become the current trend. This has sparked extensive research into relocating LLM parameters from expensive GPUs to external host memory. However, the restricted bandwidth between host and GPU memory limits the inference performance of existing solutions. This work introduces Hermes, a budget-friendly system that leverages near-data processing (NDP) units within commodity DRAM DIMMs to enhance the performance of a single consumer-grade GPU, achieving efficient LLM inference. We recognize that the inherent activation sparsity in LLMs naturally divides weight parameters into two categories, termed "hot" and "cold" neurons. Hot neurons comprise only approximately 20% of all weight parameters yet account for 80% of the total computational load; cold neurons make up the remaining 80% of parameters but are responsible for just 20% of the computational workload. Leveraging this observation, we propose a heterogeneous computing strategy: mapping hot neurons to a single computation-efficient GPU without large-capacity HBM, while offloading cold neurons to NDP-DIMMs, which offer large memory capacity but limited compute capability. In addition, the dynamic nature of activation sparsity necessitates real-time partitioning of hot and cold neurons and adaptive remapping of cold neurons across multiple NDP-DIMM modules. To tackle these issues, we introduce a lightweight predictor that ensures optimal real-time neuron partitioning and adjustment between the GPU and NDP-DIMMs. Furthermore, we employ a window-based online scheduling mechanism to maintain load balance among multiple NDP-DIMM modules. In summary, Hermes enables deployment of LLaMA2-70B on consumer-grade hardware at 13.75 tokens/s and achieves an average 75.24× speedup over the state-of-the-art offloading-based inference system on popular LLMs.
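The abstract outlines two mechanisms that lend themselves to a compact illustration: the activation-sparsity-driven split into hot and cold neurons, and window-based load balancing of cold neurons across NDP-DIMM modules. Below is a minimal Python sketch of both ideas under stated assumptions; every function name, the use of raw activation counts as the partitioning signal, and the greedy least-loaded placement policy are illustrative assumptions, not the paper's actual predictor or scheduler.

import numpy as np

def partition_neurons(activation_counts, hot_fraction=0.2):
    # Split neuron indices into "hot" (frequently activated) and "cold" sets.
    # The 20% threshold mirrors the 20/80 split reported in the abstract;
    # using profiled counts directly stands in for the paper's lightweight
    # predictor (an assumption for illustration).
    order = np.argsort(activation_counts)[::-1]  # most-activated neurons first
    n_hot = int(len(order) * hot_fraction)
    hot = order[:n_hot]    # would be mapped to the compute-efficient GPU
    cold = order[n_hot:]   # would be offloaded to capacity-rich NDP-DIMMs
    return hot, cold

def assign_cold_to_dimms(cold, est_load, n_dimms=4):
    # Within one scheduling window, greedily assign each cold neuron to the
    # currently least-loaded NDP-DIMM module, approximating the window-based
    # online load balancing described in the abstract.
    loads = np.zeros(n_dimms)
    placement = {}
    for neuron in cold:
        dimm = int(np.argmin(loads))   # pick the least-loaded module
        placement[neuron] = dimm
        loads[dimm] += est_load[neuron]
    return placement

# Hypothetical usage with synthetic per-neuron activation statistics:
counts = np.random.poisson(lam=3.0, size=1024)
hot, cold = partition_neurons(counts)
placement = assign_cold_to_dimms(cold, est_load=counts, n_dimms=4)

The greedy least-loaded rule is the simplest policy that keeps per-module work roughly even within a window; the real system must also account for remapping costs when the hot/cold partition shifts between windows.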
Conference Location: Las Vegas, NV, USA
Conference Dates: 1-5 March 2025
URL: View Full Text
Source Database: IEEE
Document Type: Conference Paper
Record Identifier: https://kms.shanghaitech.edu.cn/handle/2MSLDSTB/514104
Collection: School of Information Science and Technology
School of Information Science and Technology_Master's Students
Author Affiliations:
1. Chinese Academy of Sciences, Institute of Computing Technology
2. University of Chinese Academy of Sciences
3. Zhongguancun Laboratory
4. Chinese Academy of Sciences, Institute of Microelectronics
5. School of Information Science and Technology, ShanghaiTech University
Recommended Citation
GB/T 7714
Lian Liu, Shixin Zhao, Bing Li, et al. Make LLM Inference Affordable to Everyone: Augmenting GPU Memory with NDP-DIMM[C], 2025.
Files in This Item:
No files associated with this item.