Portus: Efficient DNN Checkpointing to Persistent Memory with Zero-Copy
2024-07-26
Proceedings title: 2024 IEEE 44th International Conference on Distributed Computing Systems (ICDCS)
ISSN: 1063-6927
Pages: 59-70
Publication status: Published
DOI: 10.1109/ICDCS60910.2024.00015
Abstract

We introduce Portus, an efficient checkpointing system for DNN models. The core of Portus is a three-level index structure and a direct RDMA datapath that enables fast checkpoints between GPUs and persistent memory in a serialization-free way. Portus offers a zero-copy approach between GPU and persistent memory without involving main memory or kernel crossings to underlying file systems. Portus also applies an asynchronous mechanism to hide the checkpointing overhead in the model training procedures. We integrated a Portus prototype into a high-performance AI cluster with NVIDIA® V100 and A40 GPUs and Intel® Optane™ persistent memory, then evaluated its performance in both single-GPU and multi-GPU large model training scenarios. Experimental results show that, compared to a state-of-the-art checkpointing system, Portus achieves up to 9.23× and 7.0× speedup in checkpointing and restoring, respectively. Portus achieves up to 2.6× higher throughput and 8× faster checkpointing operations on a large language model, GPT-22B.

Keywords: Graphics processing unit; Memory architecture; Problem oriented languages; Static random access storage; Check pointing; Index structure; Model training; Performance; Persistence memory; Persistent memory; RDMA; System for AI; Three-level; Zero copy
Conference name: 44th IEEE International Conference on Distributed Computing Systems, ICDCS 2024
Conference location: Jersey City, NJ, USA
Conference dates: 23-26 July 2024
Indexed by: EI
Language: English
Publisher: Institute of Electrical and Electronics Engineers Inc.
EI accession number: 20243717014778
EI main heading: Computer graphics equipment
EISSN: 2575-8411
EI classification codes: 1102.3.1; 1103; 1103.1; 1103.2; 1104; 1106.1.1; 714.2 Semiconductor Devices and Integrated Circuits
Original document type: Conference article (CA)
Source database: IEEE
Document type: Conference paper
Item identifier: https://kms.shanghaitech.edu.cn/handle/2MSLDSTB/414241
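Collections: School of Information Science and Technology_Master's Students
School of Information Science and Technology_PI Research Groups_殷树 (Shu Yin) Group
School of Information Science and Technology_PhD Students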
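Author affiliations
1. ShanghaiTech University, China
2. Shanghai Engineering Research Center of Intelligent Vision and Imaging, China
First author's affiliation: ShanghaiTech University
First author's primary affiliation: ShanghaiTech University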
Recommended citation
GB/T 7714
Yuanhao Li, Tianyuan Wu, Guancheng Li, et al. Portus: Efficient DNN Checkpointing to Persistent Memory with Zero-Copy[C]: Institute of Electrical and Electronics Engineers Inc., 2024: 59-70.
 
