ShanghaiTech University Knowledge Management System
Accelerating Mini-batch HGNN Training by Reducing CUDA Kernels
2024-08-16
Status | Published
Abstract | Heterogeneous graph neural networks (HGNNs) are essential for capturing the structural and semantic information in heterogeneous graphs. However, existing GPU-based solutions, such as PyTorch Geometric, suffer from low GPU utilization because HGNN training launches numerous short-running, memory-bound CUDA kernels. To address this issue, we introduce HiFuse, an enhancement to PyTorch Geometric designed to accelerate mini-batch HGNN training on CPU-GPU systems. From the data perspective, we reorganize and merge multiple small vertex feature matrices into larger ones, enabling a single kernel to process larger data chunks. This exploits data locality, reduces kernel launch overhead, and improves overall GPU utilization. From the workflow perspective, we strategically offload the construction of semantic graphs from the GPU to the CPU to reduce the number of CUDA kernels. To meet the parallelism requirements on the CPU and ensure seamless execution between CPU and GPU, we employ parallelization techniques including multi-threading and asynchronous pipelining. This allows different stages of the process to overlap, enhancing GPU utilization and reducing end-to-end execution latency, leading to a more efficient and balanced use of computational resources. In extensive experiments, HiFuse achieves an average 2.38x speedup over a state-of-the-art solution. (A minimal code sketch of the feature-matrix merging and CPU-GPU pipelining ideas follows the citation below.)
Keywords | HGNNs; GPU; Acceleration
DOI | arXiv:2408.08490 |
Related URL | View original
Source | arXiv
WOS Record No. | PPRN:91462333
WOS Category | Computer Science, Hardware & Architecture
Document Type | Preprint
Item Identifier | https://kms.shanghaitech.edu.cn/handle/2MSLDSTB/415563
Collection | School of Information Science and Technology_Master's Students
Corresponding Author | Yan, Mingyu
Author Affiliations | 1. Chinese Acad Sci, Inst Comp Technol, SKLP, Beijing, Peoples R China; 2. Univ Chinese Acad Sci, Beijing, Peoples R China; 3. ShanghaiTech Univ, Shanghai, Peoples R China; 4. Yancheng Zhongke High Throughput Comp Res Inst Co Ltd, Jiangsu, Peoples R China
Recommended Citation (GB/T 7714) | Wu, Meng, Qiu, Jingkai, Yan, Mingyu, et al. Accelerating Mini-batch HGNN Training by Reducing CUDA Kernels. 2024.
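
The abstract above describes two optimizations. Below is a minimal, self-contained Python sketch of the data-side idea: merging several small per-type vertex feature matrices into one larger matrix so that a single kernel (here, one matmul) replaces several short ones. The type names, matrix sizes, and the shared projection weight are illustrative assumptions, not HiFuse's actual data layout or code; a real HGNN typically uses a distinct weight per vertex type, which would call for a grouped or batched matmul instead.

```python
import torch

# Per-type vertex feature matrices for a toy heterogeneous graph.
# Type names and sizes are illustrative assumptions only.
features = {
    "author": torch.randn(128, 64),
    "paper":  torch.randn(256, 64),
    "venue":  torch.randn(32, 64),
}
weight = torch.randn(64, 32)  # shared projection weight (a simplification)

# Baseline: one small matmul kernel per vertex type -> many short launches.
out_per_type = {ntype: x @ weight for ntype, x in features.items()}

# Merged: concatenate the small matrices into one larger matrix so a single
# kernel processes the whole chunk, then split the result back per type.
order = list(features)
merged = torch.cat([features[t] for t in order], dim=0)
merged_out = merged @ weight                      # one kernel launch
sizes = [features[t].shape[0] for t in order]
split_out = dict(zip(order, torch.split(merged_out, sizes, dim=0)))

# Both paths produce the same per-type results.
for t in order:
    assert torch.allclose(out_per_type[t], split_out[t], atol=1e-5)
```

The workflow-side idea, offloading semantic-graph construction to the CPU and overlapping it with GPU training through an asynchronous pipeline, can be sketched as a simple prefetching loop. The helper functions below are hypothetical stand-ins (sleeps in place of real work), not HiFuse or PyTorch Geometric APIs.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def build_semantic_graphs(batch_id):
    # Stand-in for CPU-side semantic-graph construction for one mini-batch.
    time.sleep(0.01)
    return f"semantic graphs for batch {batch_id}"

def gpu_train_step(graphs):
    # Stand-in for the GPU forward/backward pass over one mini-batch.
    time.sleep(0.01)

num_batches = 8
with ThreadPoolExecutor(max_workers=1) as pool:
    pending = pool.submit(build_semantic_graphs, 0)   # prefetch first batch
    for i in range(num_batches):
        graphs = pending.result()                     # wait for CPU stage
        if i + 1 < num_batches:
            pending = pool.submit(build_semantic_graphs, i + 1)  # prefetch next
        gpu_train_step(graphs)                        # overlaps with CPU prefetch
```

In both sketches the point is structural: fewer, larger kernel launches on the GPU, and CPU work that runs concurrently with GPU work rather than serially before it.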
Unless otherwise stated, all content in this system is protected by copyright, with all rights reserved.