AI-Enabling Workloads on Large-Scale GPU-Accelerated System: Characterization, Opportunities, and Implications

Baolin Li, Rohin Arora, Siddharth Samsi, Tirthak Patel, William Arcand, David Bestor, Chansup Byun, Rohan Basu Roy, Bill Bergeron, John Holodnak, Michael Houle, Matthew Hubbell, Michael Jones, Jeremy Kepner, Anna Klein, Peter Michaleas, Joseph Mcdonald, Lauren Milechin, Julie Mullen, Andrew Prout, Benjamin Price, Albert Reuther, Antonio Rosa, Matthew Weiss, Charles Yee, Daniel Edelman, Allan Vanterpool, Anson Cheng, Vijay Gadepally, Devesh Tiwari

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Production high-performance computing (HPC) systems are adopting and integrating GPUs into their design to accommodate artificial intelligence (AI), machine learning, and data visualization workloads. To aid with the design and operation of new and existing GPU-based large-scale systems, we provide a detailed characterization of system operations, job characteristics, user behavior, and trends on a contemporary GPU-accelerated production HPC system. Our insights indicate that the premature phases of modern AI workflows take up significant GPU hours while underutilizing GPUs, which opens up the opportunity for a multi-tier system. Finally, we provide recommendations and areas for future investment for system architects, operators, and users.

Original language: English
Title of host publication: Proceedings - 2022 IEEE International Symposium on High-Performance Computer Architecture, HPCA 2022
Publisher: IEEE Computer Society
Pages: 1224-1237
Number of pages: 14
ISBN (Electronic): 9781665420273
State: Published - 2022
Externally published: Yes
Event: 28th Annual IEEE International Symposium on High-Performance Computer Architecture, HPCA 2022 - Virtual, Online, Korea, Republic of
Duration: Apr 2 2022 to Apr 6 2022

Publication series

Name: Proceedings - International Symposium on High-Performance Computer Architecture
Volume: 2022-April
ISSN (Print): 1530-0897

Conference

Conference: 28th Annual IEEE International Symposium on High-Performance Computer Architecture, HPCA 2022
Country/Territory: Korea, Republic of
City: Virtual, Online
Period: 04/2/22 to 04/6/22

Keywords

  • Cluster Characterization
  • Deep Learning
  • GPU Datacenter
  • HPC
  • System Operation
