TY - GEN
T1 - Interactive Supercomputing on 40,000 Cores for Machine Learning and Data Analysis
AU - Reuther, Albert
AU - Kepner, Jeremy
AU - Byun, Chansup
AU - Samsi, Siddharth
AU - Arcand, William
AU - Bestor, David
AU - Bergeron, Bill
AU - Gadepally, Vijay
AU - Houle, Michael
AU - Hubbell, Matthew
AU - Jones, Michael
AU - Klein, Anna
AU - Milechin, Lauren
AU - Mullen, Julia
AU - Prout, Andrew
AU - Rosa, Antonio
AU - Yee, Charles
AU - Michaleas, Peter
N1 - Funding Information:
This material is based upon work supported by the Assistant Secretary of Defense for Research and Engineering under Air Force Contract No. FA8721-05-C-0002 and/or FA8702-15-D-0001. Any opinions, findings, conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Assistant Secretary of Defense for Research and Engineering.
Publisher Copyright:
© 2018 IEEE.
PY - 2018/11/26
Y1 - 2018/11/26
N2 - Interactive massively parallel computations are critical for machine learning and data analysis. These computations are a staple of the MIT Lincoln Laboratory Supercomputing Center (LLSC) and have required the LLSC to develop unique interactive supercomputing capabilities. Scaling interactive machine learning frameworks, such as TensorFlow, and data analysis environments, such as MATLAB/Octave, to tens of thousands of cores presents many technical challenges: in particular, rapidly dispatching many tasks through a scheduler, such as Slurm, and starting many instances of applications with thousands of dependencies. Careful tuning of launches and prepositioning of applications overcome these challenges and allow the launching of thousands of tasks in seconds on a 40,000-core supercomputer. Specifically, this work demonstrates launching 32,000 TensorFlow processes in 4 seconds and launching 262,000 Octave processes in 40 seconds. These capabilities allow researchers to rapidly explore novel machine learning architectures and data analysis algorithms.
AB - Interactive massively parallel computations are critical for machine learning and data analysis. These computations are a staple of the MIT Lincoln Laboratory Supercomputing Center (LLSC) and have required the LLSC to develop unique interactive supercomputing capabilities. Scaling interactive machine learning frameworks, such as TensorFlow, and data analysis environments, such as MATLAB/Octave, to tens of thousands of cores presents many technical challenges: in particular, rapidly dispatching many tasks through a scheduler, such as Slurm, and starting many instances of applications with thousands of dependencies. Careful tuning of launches and prepositioning of applications overcome these challenges and allow the launching of thousands of tasks in seconds on a 40,000-core supercomputer. Specifically, this work demonstrates launching 32,000 TensorFlow processes in 4 seconds and launching 262,000 Octave processes in 40 seconds. These capabilities allow researchers to rapidly explore novel machine learning architectures and data analysis algorithms.
KW - Data analytics
KW - High performance computing
KW - Interactive
KW - Machine learning
KW - Manycore
KW - Scheduler
UR - http://www.scopus.com/inward/record.url?scp=85060095828&partnerID=8YFLogxK
U2 - 10.1109/HPEC.2018.8547629
DO - 10.1109/HPEC.2018.8547629
M3 - Conference contribution
AN - SCOPUS:85060095828
T3 - 2018 IEEE High Performance Extreme Computing Conference, HPEC 2018
BT - 2018 IEEE High Performance Extreme Computing Conference, HPEC 2018
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 25 September 2018 through 27 September 2018
ER -