Benchmarking network fabrics for data distributed training of deep neural networks

Siddharth Samsi, Andrew Prout, Michael Jones, Andrew Kirby, Bill Arcand, Bill Bergeron, David Bestor, Chansup Byun, Vijay Gadepally, Michael Houle, Matthew Hubbell, Anna Klein, Peter Michaleas, Lauren Milechin, Julie Mullen, Antonio Rosa, Charles Yee, Albert Reuther, Jeremy Kepner

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

1 Scopus citations

Abstract

Artificial Intelligence/Machine Learning applications require the training of complex models on large amounts of labelled data. The large computational requirements for training deep models have necessitated the development of new methods for faster training. One such approach is the data parallel approach, where the training data is distributed across multiple compute nodes. This approach is simple to implement and supported by most of the commonly used machine learning frameworks. The data parallel approach leverages MPI for communicating gradients across all nodes. In this paper, we examine the effects of using different physical hardware interconnects and network-related software primitives for enabling data distributed deep learning. We compare the effect of using GPUDirect and NCCL on Ethernet and OmniPath fabrics. Our results show that using Ethernet-based networking in shared HPC systems does not have a significant effect on the training times for commonly used deep neural network architectures or traditional HPC applications such as Computational Fluid Dynamics.

Original languageEnglish
Title of host publication2020 IEEE High Performance Extreme Computing Conference, HPEC 2020
PublisherInstitute of Electrical and Electronics Engineers Inc.
ISBN (Electronic)9781728192192
DOIs
StatePublished - Sep 22 2020
Externally publishedYes
Event2020 IEEE High Performance Extreme Computing Conference, HPEC 2020 - Virtual, Waltham, United States
Duration: Sep 21 2020Sep 25 2020

Publication series

Name2020 IEEE High Performance Extreme Computing Conference, HPEC 2020

Conference

Conference2020 IEEE High Performance Extreme Computing Conference, HPEC 2020
Country/TerritoryUnited States
CityVirtual, Waltham
Period09/21/2009/25/20

Fingerprint

Dive into the research topics of 'Benchmarking network fabrics for data distributed training of deep neural networks'. Together they form a unique fingerprint.

Cite this