TY - GEN
T1 - The MIT Supercloud Dataset
AU - Samsi, Siddharth
AU - Weiss, Matthew L.
AU - Bestor, David
AU - Li, Baolin
AU - Jones, Michael
AU - Reuther, Albert
AU - Edelman, Daniel
AU - Arcand, William
AU - Byun, Chansup
AU - Holodnack, John
AU - Hubbell, Matthew
AU - Kepner, Jeremy
AU - Klein, Anna
AU - McDonald, Joseph
AU - Michaleas, Adam
AU - Michaleas, Peter
AU - Milechin, Lauren
AU - Mullen, Julia
AU - Yee, Charles
AU - Price, Benjamin
AU - Prout, Andrew
AU - Rosa, Antonio
AU - Vanterpool, Allan
AU - McEvoy, Lindsey
AU - Cheng, Anson
AU - Tiwari, Devesh
AU - Gadepally, Vijay
N1 - Funding Information:
Research was sponsored by the United States Air Force Research Laboratory and the United States Air Force Artificial Intelligence Accelerator and was accomplished under Cooperative Agreement Number FA8750-19-2-1000. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the United States Air Force or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein. Corresponding author email: sid@ll.mit.edu
Publisher Copyright:
© 2021 IEEE.
PY - 2021
Y1 - 2021
AB - Artificial intelligence (AI) and machine learning (ML) workloads are an increasingly large share of the compute workloads in traditional High-Performance Computing (HPC) centers and commercial cloud systems. This has led to changes in deployment approaches of HPC clusters and the commercial cloud, as well as a new focus on approaches to optimizing resource usage, allocations and deployment of new AI frameworks, and capabilities such as Jupyter notebooks to enable rapid prototyping and deployment. With these changes, there is a need to better understand cluster/datacenter operations with the goal of developing improved scheduling policies, identifying inefficiencies in resource utilization and energy/power consumption, predicting failures, and identifying policy violations. In this paper we introduce the MIT Supercloud Dataset, which aims to foster innovative AI/ML approaches to the analysis of large-scale HPC and datacenter/cloud operations. We provide detailed monitoring logs from the MIT Supercloud system, which include CPU and GPU usage by jobs, memory usage, file system logs, and physical monitoring data. This paper describes the dataset, collection methodology, and data availability, and discusses potential challenge problems being developed using this data. Datasets and future challenge announcements will be available via https://dcc.mit.edu.
UR - http://www.scopus.com/inward/record.url?scp=85123464595&partnerID=8YFLogxK
U2 - 10.1109/HPEC49654.2021.9622850
DO - 10.1109/HPEC49654.2021.9622850
M3 - Conference contribution
AN - SCOPUS:85123464595
T3 - 2021 IEEE High Performance Extreme Computing Conference, HPEC 2021
BT - 2021 IEEE High Performance Extreme Computing Conference, HPEC 2021
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2021 IEEE High Performance Extreme Computing Conference, HPEC 2021
Y2 - 20 September 2021 through 24 September 2021
ER -