Make LLM Inference Affordable to Everyone: Augmenting GPU Memory with NDP-DIMM
2025-03-05
Conference Proceedings: 2025 IEEE International Symposium on High Performance Computer Architecture (HPCA)
ISSN: 1530-0897
Publication Status: Published
DOI: 10.1109/HPCA61900.2025.00129
Abstract: Billion-scale Large Language Models (LLMs) typically require deployment on expensive server-grade GPUs with large-capacity HBM and abundant compute capability. As LLM-assisted services become popular, achieving cost-effective LLM inference on budget-friendly hardware has become the current trend. This has sparked extensive research into relocating LLM parameters from expensive GPUs to external host memory. However, the restricted bandwidth between host and GPU memory limits the inference performance of existing solutions. This work introduces Hermes, a budget-friendly system that leverages near-data processing (NDP) units within commodity DRAM DIMMs to enhance the performance of a single consumer-grade GPU, achieving efficient LLM inference. We recognize that the inherent activation sparsity in LLMs naturally divides weight parameters into two categories, termed "hot" and "cold" neurons. Hot neurons comprise only approximately 20% of all weight parameters yet account for 80% of the total computational load; cold neurons make up the remaining 80% of parameters but are responsible for just 20% of the computational workload. Leveraging this observation, we propose a heterogeneous computing strategy: mapping hot neurons to a single computation-efficient GPU without large-capacity HBM, while offloading cold neurons to NDP-DIMMs, which offer large memory capacity but limited compute capability. In addition, the dynamic nature of activation sparsity necessitates real-time partitioning of hot and cold neurons and adaptive remapping of cold neurons across multiple NDP-DIMM modules. To tackle these issues, we introduce a lightweight predictor that ensures optimal real-time neuron partitioning and adjustment between the GPU and NDP-DIMMs. Furthermore, we employ a window-based online scheduling mechanism to maintain load balance among multiple NDP-DIMM modules. In summary, Hermes enables deployment of LLaMA2-70B on consumer-grade hardware at 13.75 tokens/s and achieves an average 75.24× speedup over the state-of-the-art offloading-based inference system on popular LLMs.
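The abstract outlines two mechanisms that lend themselves to a compact illustration: the activation-sparsity-driven split into hot and cold neurons, and window-based load balancing of cold neurons across NDP-DIMM modules. Below is a minimal Python sketch of both ideas under stated assumptions; every function name, the use of raw activation counts as the partitioning signal, and the greedy least-loaded placement policy are illustrative assumptions, not the paper's actual predictor or scheduler.

import numpy as np

def partition_neurons(activation_counts, hot_fraction=0.2):
    # Split neuron indices into "hot" (frequently activated) and "cold" sets.
    # The 20% threshold mirrors the 20/80 split reported in the abstract;
    # using profiled counts directly stands in for the paper's lightweight
    # predictor (an assumption for illustration).
    order = np.argsort(activation_counts)[::-1]  # most-activated neurons first
    n_hot = int(len(order) * hot_fraction)
    hot = order[:n_hot]    # would be mapped to the compute-efficient GPU
    cold = order[n_hot:]   # would be offloaded to capacity-rich NDP-DIMMs
    return hot, cold

def assign_cold_to_dimms(cold, est_load, n_dimms=4):
    # Within one scheduling window, greedily assign each cold neuron to the
    # currently least-loaded NDP-DIMM module, approximating the window-based
    # online load balancing described in the abstract.
    loads = np.zeros(n_dimms)
    placement = {}
    for neuron in cold:
        dimm = int(np.argmin(loads))   # pick the least-loaded module
        placement[neuron] = dimm
        loads[dimm] += est_load[neuron]
    return placement

# Hypothetical usage with synthetic per-neuron activation statistics:
counts = np.random.poisson(lam=3.0, size=1024)
hot, cold = partition_neurons(counts)
placement = assign_cold_to_dimms(cold, est_load=counts, n_dimms=4)

The greedy least-loaded rule is the simplest policy that keeps per-module work roughly even within a window; the real system must also account for remapping costs when the hot/cold partition shifts between windows.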
Conference Location: Las Vegas, NV, USA
Conference Dates: 1-5 March 2025
URL: View Full Text
Source Database: IEEE
Document Type: Conference Paper
Record Identifier: https://kms.shanghaitech.edu.cn/handle/2MSLDSTB/514104
Collection: School of Information Science and Technology
School of Information Science and Technology_Master's Students
Author Affiliations:
1. Chinese Academy of Sciences, Institute of Computing Technology
2. University of Chinese Academy of Sciences
3. Zhongguancun Laboratory
4. Chinese Academy of Sciences, Institute of Microelectronics
5. School of Information Science and Technology, ShanghaiTech University
Recommended Citation
GB/T 7714
Lian Liu, Shixin Zhao, Bing Li, et al. Make LLM Inference Affordable to Everyone: Augmenting GPU Memory with NDP-DIMM[C], 2025.
Files in This Item:
No files associated with this item.