ShanghaiTech University Knowledge Management System
Portus: Efficient DNN Checkpointing to Persistent Memory with Zero-Copy | |
2024-07-26 | |
Proceedings title | 2024 IEEE 44th International Conference on Distributed Computing Systems (ICDCS)
ISSN | 1063-6927 |
Pages | 59-70 |
Publication status | Published |
DOI | 10.1109/ICDCS60910.2024.00015 |
Abstract | We introduce Portus, an efficient checkpointing system for DNN models. The core of Portus is a three-level index structure and a direct RDMA datapath that enables fast checkpoints between GPUs and persistent memory in a serialization-free way. Portus offers a zero-copy approach between GPU and persistent memory without involving main memory and kernel crossings to underlying file systems. Portus also applies an asynchronous mechanism to hide the checkpointing overhead in the model training procedures. We integrated a Portus prototype into a high-performance AI cluster with NVIDIA® V100 and A40 GPUs and Intel® Optane™ persistent memory, then evaluated its performance in both single-GPU and multi-GPU large model training scenarios. Experimental results show that, compared to a state-of-the-art checkpointing system, Portus achieves up to 9.23× and 7.0× speedup in checkpointing and restoring, respectively. Portus achieves up to 2.6× higher throughput and 8× faster checkpointing operation on a large language model, GPT-22B. |
Keywords | Graphics processing unit; Memory architecture; Problem oriented languages; Static random access storage; Check pointing; Index structure; Model training; Performance; Persistence memory; Persistent memory; RDMA; System for AI; Three-level; Zero copy |
Conference name | 44th IEEE International Conference on Distributed Computing Systems, ICDCS 2024 |
Conference location | Jersey City, NJ, USA |
Conference date | 23-26 July 2024 |
URL | View original |
Indexed by | EI |
Language | English |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
EI accession number | 20243717014778 |
EI main heading | Computer graphics equipment |
EISSN | 2575-8411 |
EI classification codes | 1102.3.1 ; 1103 ; 1103.1 ; 1103.2 ; 1104 ; 1106.1.1 ; 714.2 Semiconductor Devices and Integrated Circuits |
Original document type | Conference article (CA) |
Source database | IEEE |
Document type | Conference paper |
Item identifier | https://kms.shanghaitech.edu.cn/handle/2MSLDSTB/414241 |
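Collections | School of Information Science and Technology_Master's Students; School of Information Science and Technology_PI Research Groups_Shu Yin (殷树) Group; School of Information Science and Technology_PhD Students |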
Author affiliations | 1. ShanghaiTech University, China 2. Shanghai Engineering Research Center of Intelligent Vision and Imaging, China |
First author affiliation | ShanghaiTech University |
First author's primary affiliation | ShanghaiTech University |
Recommended citation (GB/T 7714) | Yuanhao Li, Tianyuan Wu, Guancheng Li, et al. Portus: Efficient DNN Checkpointing to Persistent Memory with Zero-Copy[C]: Institute of Electrical and Electronics Engineers Inc., 2024: 59-70. |