TY - GEN
T1 - Lustre, Hadoop, Accumulo
AU - Kepner, Jeremy
AU - Arcand, William
AU - Bestor, David
AU - Bergeron, Bill
AU - Byun, Chansup
AU - Edwards, Lauren
AU - Gadepally, Vijay
AU - Hubbell, Matthew
AU - Michaleas, Peter
AU - Mullen, Julie
AU - Prout, Andrew
AU - Rosa, Antonio
AU - Yee, Charles
AU - Reuther, Albert
N1 - Funding Information:
This material is based upon work supported by the National Science Foundation under Grant No. DMS-1312831. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
Publisher Copyright:
© 2015 IEEE.
PY - 2015/11/9
Y1 - 2015/11/9
N2 - Data processing systems impose multiple views on data as it is processed by the system. These views include spreadsheets, databases, matrices, and graphs. There are a wide variety of technologies that can be used to store and process data through these different steps. The Lustre parallel file system, the Hadoop distributed file system, and the Accumulo database are all designed to address the largest and the most challenging data storage problems. There have been many ad-hoc comparisons of these technologies. This paper describes the foundational principles of each technology, provides simple models for assessing their capabilities, and compares the various technologies on a hypothetical common cluster. These comparisons indicate that Lustre provides 2x more storage capacity, is less likely to lose data during 3 simultaneous drive failures, and provides higher bandwidth on general purpose workloads. Hadoop can provide 4x greater read bandwidth on special purpose workloads. Accumulo provides 10^5x lower latency on random lookups than either Lustre or Hadoop but Accumulo's bulk bandwidth is 10x less. Significant recent work has been done to enable mix-and-match solutions that allow Lustre, Hadoop, and Accumulo to be combined in different ways.
AB - Data processing systems impose multiple views on data as it is processed by the system. These views include spreadsheets, databases, matrices, and graphs. There are a wide variety of technologies that can be used to store and process data through these different steps. The Lustre parallel file system, the Hadoop distributed file system, and the Accumulo database are all designed to address the largest and the most challenging data storage problems. There have been many ad-hoc comparisons of these technologies. This paper describes the foundational principles of each technology, provides simple models for assessing their capabilities, and compares the various technologies on a hypothetical common cluster. These comparisons indicate that Lustre provides 2x more storage capacity, is less likely to lose data during 3 simultaneous drive failures, and provides higher bandwidth on general purpose workloads. Hadoop can provide 4x greater read bandwidth on special purpose workloads. Accumulo provides 10^5x lower latency on random lookups than either Lustre or Hadoop but Accumulo's bulk bandwidth is 10x less. Significant recent work has been done to enable mix-and-match solutions that allow Lustre, Hadoop, and Accumulo to be combined in different ways.
KW - Accumulo
KW - Big Data
KW - Hadoop
KW - Insider
KW - Lustre
KW - Parallel Performance
UR - http://www.scopus.com/inward/record.url?scp=84964879031&partnerID=8YFLogxK
U2 - 10.1109/HPEC.2015.7322476
DO - 10.1109/HPEC.2015.7322476
M3 - Conference contribution
AN - SCOPUS:84964879031
T3 - 2015 IEEE High Performance Extreme Computing Conference, HPEC 2015
BT - 2015 IEEE High Performance Extreme Computing Conference, HPEC 2015
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - IEEE High Performance Extreme Computing Conference, HPEC 2015
Y2 - 15 September 2015 through 17 September 2015
ER -