TY - GEN
T1 - AI-Enabling Workloads on Large-Scale GPU-Accelerated System
T2 - 28th Annual IEEE International Symposium on High-Performance Computer Architecture, HPCA 2022
AU - Li, Baolin
AU - Arora, Rohin
AU - Samsi, Siddharth
AU - Patel, Tirthak
AU - Arcand, William
AU - Bestor, David
AU - Byun, Chansup
AU - Roy, Rohan Basu
AU - Bergeron, Bill
AU - Holodnak, John
AU - Houle, Michael
AU - Hubbell, Matthew
AU - Jones, Michael
AU - Kepner, Jeremy
AU - Klein, Anna
AU - Michaleas, Peter
AU - Mcdonald, Joseph
AU - Milechin, Lauren
AU - Mullen, Julie
AU - Prout, Andrew
AU - Price, Benjamin
AU - Reuther, Albert
AU - Rosa, Antonio
AU - Weiss, Matthew
AU - Yee, Charles
AU - Edelman, Daniel
AU - Vanterpool, Allan
AU - Cheng, Anson
AU - Gadepally, Vijay
AU - Tiwari, Devesh
N1 - Funding Information:
Research was sponsored by the United States Air Force Research Laboratory and the United States Air Force Artificial Intelligence Accelerator and was accomplished under Cooperative Agreement Number FA8750-19-2-1000. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the United States Air Force or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.
Publisher Copyright:
© 2022 IEEE.
PY - 2022
Y1 - 2022
N2 - Production high-performance computing (HPC) systems are adopting and integrating GPUs into their design to accommodate artificial intelligence (AI), machine learning, and data visualization workloads. To aid with the design and operation of new and existing GPU-based large-scale systems, we provide a detailed characterization of system operations, job characteristics, user behavior, and trends on a contemporary GPU-accelerated production HPC system. Our insights indicate that the premature phases in modern AI workflows take up significant GPU hours while underutilizing GPUs, which opens up the opportunity for a multi-tier system. Finally, we provide various potential recommendations and areas for future investment for system architects, operators, and users.
AB - Production high-performance computing (HPC) systems are adopting and integrating GPUs into their design to accommodate artificial intelligence (AI), machine learning, and data visualization workloads. To aid with the design and operation of new and existing GPU-based large-scale systems, we provide a detailed characterization of system operations, job characteristics, user behavior, and trends on a contemporary GPU-accelerated production HPC system. Our insights indicate that the premature phases in modern AI workflows take up significant GPU hours while underutilizing GPUs, which opens up the opportunity for a multi-tier system. Finally, we provide various potential recommendations and areas for future investment for system architects, operators, and users.
KW - Cluster Characterization
KW - Deep Learning
KW - GPU Datacenter
KW - HPC
KW - System Operation
UR - http://www.scopus.com/inward/record.url?scp=85129515260&partnerID=8YFLogxK
U2 - 10.1109/HPCA53966.2022.00093
DO - 10.1109/HPCA53966.2022.00093
M3 - Conference contribution
AN - SCOPUS:85129515260
T3 - Proceedings - International Symposium on High-Performance Computer Architecture
SP - 1224
EP - 1237
BT - Proceedings - 2022 IEEE International Symposium on High-Performance Computer Architecture, HPCA 2022
PB - IEEE Computer Society
Y2 - 2 April 2022 through 6 April 2022
ER -