The MIT Supercloud Dataset

Siddharth Samsi, Matthew L. Weiss, David Bestor, Baolin Li, Michael Jones, Albert Reuther, Daniel Edelman, William Arcand, Chansup Byun, John Holodnack, Matthew Hubbell, Jeremy Kepner, Anna Klein, Joseph McDonald, Adam Michaleas, Peter Michaleas, Lauren Milechin, Julia Mullen, Charles Yee, Benjamin Price, Andrew Prout, Antonio Rosa, Allan Vanterpool, Lindsey McEvoy, Anson Cheng, Devesh Tiwari, Vijay Gadepally

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

6 Scopus citations

Abstract

Artificial intelligence (AI) and machine learning (ML) workloads are an increasingly large share of the compute workloads in traditional High-Performance Computing (HPC) centers and commercial cloud systems. This has led to changes in deployment approaches of HPC clusters and the commercial cloud, as well as a new focus on approaches to optimized resource usage, allocations and deployment of new AI frameworks, and capabilities such as Jupyter notebooks to enable rapid prototyping and deployment. With these changes, there is a need to better understand cluster/datacenter operations with the goal of developing improved scheduling policies, identifying inefficiencies in resource utilization and energy/power consumption, predicting failures, and identifying policy violations. In this paper we introduce the MIT Supercloud Dataset, which aims to foster innovative AI/ML approaches to the analysis of large-scale HPC and datacenter/cloud operations. We provide detailed monitoring logs from the MIT Supercloud system, which include CPU and GPU usage by jobs, memory usage, file system logs, and physical monitoring data. This paper discusses the details of the dataset, collection methodology, and data availability, as well as potential challenge problems being developed using this data. Datasets and future challenge announcements will be available via https://dcc.mit.edu.

Original language: English
Title of host publication: 2021 IEEE High Performance Extreme Computing Conference, HPEC 2021
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781665423694
State: Published - 2021
Externally published: Yes
Event: 2021 IEEE High Performance Extreme Computing Conference, HPEC 2021 - Virtual, Online, United States
Duration: Sep 20 2021 to Sep 24 2021

Publication series

Name: 2021 IEEE High Performance Extreme Computing Conference, HPEC 2021

Conference

Conference: 2021 IEEE High Performance Extreme Computing Conference, HPEC 2021
Country/Territory: United States
City: Virtual, Online
Period: 09/20/21 to 09/24/21
