Large scale parallelization using file-based communications

Chansup Byun, Anna Klein, Peter Michaleas, Julie Mullen, Andrew Prout, Antonio Rosa, Siddharth Samsi, Charles Yee, Albert Reuther, Jeremy Kepner, William Arcand, David Bestor, Bill Bergeron, Vijay Gadepally, Michael Houle, Matthew Hubbell, Michael Jones

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

3 Scopus citations

Abstract

In this paper, we present a novel and new file-based communication architecture using the local filesystem for large scale parallelization. This new approach eliminates the issues with filesystem overload and resource contention when using the central filesystem for large parallel jobs. The new approach incurs additional overhead due to inter-node message file transfers when both the sending and receiving processes are not on the same node. However, even with this additional overhead cost, its benefits are far greater for the overall cluster operation in addition to the performance enhancement in message communications for large scale parallel jobs. For example, when running a 2048-process parallel job, it achieved about 34 times better performance with MPI-Bcast() when using the local filesystem. Furthermore, since the security for transferring message files is handled entirely by using the secure copy protocol (scp) and the file system permissions, no additional security measures or ports are required other than those that are typically required on an HPC system.

Original languageEnglish
Title of host publication2019 IEEE High Performance Extreme Computing Conference, HPEC 2019
PublisherInstitute of Electrical and Electronics Engineers Inc.
ISBN (Electronic)9781728150208
DOIs
StatePublished - Sep 2019
Externally publishedYes
Event2019 IEEE High Performance Extreme Computing Conference, HPEC 2019 - Waltham, United States
Duration: Sep 24 2019Sep 26 2019

Publication series

Name2019 IEEE High Performance Extreme Computing Conference, HPEC 2019

Conference

Conference2019 IEEE High Performance Extreme Computing Conference, HPEC 2019
Country/TerritoryUnited States
CityWaltham
Period09/24/1909/26/19

Keywords

  • File-based communication
  • Filesystem permission
  • Large scale
  • Parallelization
  • Scp
  • Security

Fingerprint

Dive into the research topics of 'Large scale parallelization using file-based communications'. Together they form a unique fingerprint.

Cite this