TY - GEN
T1 - Large scale parallelization using file-based communications
AU - Byun, Chansup
AU - Klein, Anna
AU - Michaleas, Peter
AU - Mullen, Julie
AU - Prout, Andrew
AU - Rosa, Antonio
AU - Samsi, Siddharth
AU - Yee, Charles
AU - Reuther, Albert
AU - Kepner, Jeremy
AU - Arcand, William
AU - Bestor, David
AU - Bergeron, Bill
AU - Gadepally, Vijay
AU - Houle, Michael
AU - Hubbell, Matthew
AU - Jones, Michael
N1 - Funding Information:
This material is based upon work supported by the Assistant Secretary of Defense for Research and Engineering under Air Force Contract No. FA8721-05-C-0002 and/or FA8702-15-D-0001. Any opinions, findings, conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Assistant Secretary of Defense for Research and Engineering.
Publisher Copyright:
© 2019 IEEE.
PY - 2019/9
Y1 - 2019/9
N2 - In this paper, we present a novel file-based communication architecture that uses the local filesystem for large scale parallelization. This approach eliminates the filesystem overload and resource contention that arise when the central filesystem is used for large parallel jobs. The new approach incurs additional overhead from inter-node message file transfers when the sending and receiving processes are not on the same node. However, even with this additional overhead, the approach benefits overall cluster operation and improves message communication performance for large scale parallel jobs. For example, when running a 2048-process parallel job, it achieved about 34 times better performance with MPI_Bcast() when using the local filesystem. Furthermore, since the security for transferring message files is handled entirely by the secure copy protocol (scp) and filesystem permissions, no additional security measures or ports are required beyond those typically required on an HPC system.
AB - In this paper, we present a novel file-based communication architecture that uses the local filesystem for large scale parallelization. This approach eliminates the filesystem overload and resource contention that arise when the central filesystem is used for large parallel jobs. The new approach incurs additional overhead from inter-node message file transfers when the sending and receiving processes are not on the same node. However, even with this additional overhead, the approach benefits overall cluster operation and improves message communication performance for large scale parallel jobs. For example, when running a 2048-process parallel job, it achieved about 34 times better performance with MPI_Bcast() when using the local filesystem. Furthermore, since the security for transferring message files is handled entirely by the secure copy protocol (scp) and filesystem permissions, no additional security measures or ports are required beyond those typically required on an HPC system.
KW - File-based communication
KW - Filesystem permission
KW - Large scale
KW - Parallelization
KW - Scp
KW - Security
UR - http://www.scopus.com/inward/record.url?scp=85076714208&partnerID=8YFLogxK
U2 - 10.1109/HPEC.2019.8916221
DO - 10.1109/HPEC.2019.8916221
M3 - Conference contribution
AN - SCOPUS:85076714208
T3 - 2019 IEEE High Performance Extreme Computing Conference, HPEC 2019
BT - 2019 IEEE High Performance Extreme Computing Conference, HPEC 2019
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 24 September 2019 through 26 September 2019
ER -