The MIT Supercloud Workload Classification Challenge

Benny J. Tang, Qiqi Chen, Matthew L. Weiss, Nathan C. Frey, Joseph McDonald, David Bestor, Charles Yee, William Arcand, William Bergeron, Chansup Byun, Daniel Edelman, Michael Houle, Matthew Hubbell, Michael Jones, Jeremy Kepner, Anna Klein, Adam Michaleas, Peter Michaleas, Lauren Milechin, Julia Mullen, Andrew Prout, Albert Reuther, Antonio Rosa, Andrew Bowne, Lindsey McEvoy, Baolin Li, Devesh Tiwari, Vijay Gadepally, Siddharth Samsi

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

Abstract

High-Performance Computing (HPC) centers and cloud providers support an increasingly diverse set of applications on heterogeneous hardware. As Artificial Intelligence (AI) and Machine Learning (ML) workloads have become an increasingly large share of the compute workloads, new approaches to optimized resource usage, allocation, and deployment of new AI frameworks are needed. By identifying compute workloads and their utilization characteristics, HPC systems may be able to better match available resources with the application demand. By leveraging datacenter instrumentation, it may be possible to develop AI-based approaches that can identify workloads and provide feedback to researchers and datacenter operators for improving operational efficiency. To enable this research, we released the MIT Supercloud Dataset, which provides detailed monitoring logs from the MIT Supercloud cluster. This dataset includes CPU and GPU usage by jobs, memory usage, and file system logs. In this paper, we present a workload classification challenge based on this dataset. We introduce a labelled dataset that can be used to develop new approaches to workload classification and present initial results based on existing approaches. The goal of this challenge is to foster algorithmic innovations in the analysis of compute workloads that can achieve higher accuracy than existing methods. Data and code will be made publicly available via the Datacenter Challenge website: https://dcc.mit.edu.

Original language: English
Title of host publication: Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2022
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 708-714
Number of pages: 7
ISBN (Electronic): 9781665497473
State: Published - 2022
Externally published: Yes
Event: 36th IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2022 - Virtual, Online, France
Duration: May 30 2022 to Jun 3 2022

Publication series

Name: Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2022

Conference

Conference: 36th IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2022
Country/Territory: France
City: Virtual, Online
Period: 05/30/22 to 06/3/22
