TY - GEN
T1 - Benchmarking network fabrics for data distributed training of deep neural networks
AU - Samsi, Siddharth
AU - Prout, Andrew
AU - Jones, Michael
AU - Kirby, Andrew
AU - Arcand, Bill
AU - Bergeron, Bill
AU - Bestor, David
AU - Byun, Chansup
AU - Gadepally, Vijay
AU - Houle, Michael
AU - Hubbell, Matthew
AU - Klein, Anna
AU - Michaleas, Peter
AU - Milechin, Lauren
AU - Mullen, Julie
AU - Rosa, Antonio
AU - Yee, Charles
AU - Reuther, Albert
AU - Kepner, Jeremy
N1 - Funding Information:
This material is based upon work supported by the Assistant Secretary of Defense for Research and Engineering under Air Force Contract No. FA8721-05-C-0002 and/or FA8702-15-D-0001. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Assistant Secretary of Defense for Research and Engineering.
Publisher Copyright:
© 2020 IEEE.
PY - 2020/9/22
Y1 - 2020/9/22
N2 - Artificial Intelligence/Machine Learning applications require the training of complex models on large amounts of labelled data. The large computational requirements for training deep models have necessitated the development of new methods for faster training. One such approach is data parallel training, in which the training data is distributed across multiple compute nodes. This approach is simple to implement and is supported by most commonly used machine learning frameworks. The data parallel approach leverages MPI to communicate gradients across all nodes. In this paper, we examine the effects of using different physical hardware interconnects and network-related software primitives for enabling data distributed deep learning. We compare the effect of using GPUDirect and NCCL on Ethernet and OmniPath fabrics. Our results show that using Ethernet-based networking in shared HPC systems does not have a significant effect on the training times for commonly used deep neural network architectures or traditional HPC applications such as Computational Fluid Dynamics.
AB - Artificial Intelligence/Machine Learning applications require the training of complex models on large amounts of labelled data. The large computational requirements for training deep models have necessitated the development of new methods for faster training. One such approach is data parallel training, in which the training data is distributed across multiple compute nodes. This approach is simple to implement and is supported by most commonly used machine learning frameworks. The data parallel approach leverages MPI to communicate gradients across all nodes. In this paper, we examine the effects of using different physical hardware interconnects and network-related software primitives for enabling data distributed deep learning. We compare the effect of using GPUDirect and NCCL on Ethernet and OmniPath fabrics. Our results show that using Ethernet-based networking in shared HPC systems does not have a significant effect on the training times for commonly used deep neural network architectures or traditional HPC applications such as Computational Fluid Dynamics.
UR - http://www.scopus.com/inward/record.url?scp=85099336450&partnerID=8YFLogxK
U2 - 10.1109/HPEC43674.2020.9286232
DO - 10.1109/HPEC43674.2020.9286232
M3 - Conference contribution
AN - SCOPUS:85099336450
T3 - 2020 IEEE High Performance Extreme Computing Conference, HPEC 2020
BT - 2020 IEEE High Performance Extreme Computing Conference, HPEC 2020
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 21 September 2020 through 25 September 2020
ER -