Adopting the MapReduce framework to pre-train 1-D and 2-D protein structure predictors with large protein datasets

Jesse Eickholt, Suman Karki

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Sequence-based machine learning approaches for 1-D and 2-D protein structure prediction tasks have long been limited by relatively small datasets, namely proteins with experimentally determined structures. Recent advances in machine learning provide a means of using unlabeled data, which opens up access to a much larger sequence space in the context of protein structure prediction. Here we present a 3-stage pipeline to construct a representative protein sequence dataset, generate training data, and pre-train deep network models for 1-D and 2-D protein structure prediction tasks. To handle the complexities of managing the large dataset, we implemented our pipeline using the MapReduce framework, which allowed us to leverage existing tools such as Hadoop. The result is the ability to apply large amounts of novel protein sequence data to 1-D and 2-D protein structure prediction. We also used our pipeline to curate a non-redundant protein sequence dataset, which we have made available with accompanying data.
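
The abstract does not describe how the pipeline stages are implemented, so the following is only a minimal in-memory sketch of the general MapReduce pattern (map, shuffle, reduce) applied to a toy protein sequence set. It assumes exact-match deduplication as a stand-in for redundancy removal; this is not the authors' redundancy criterion, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `deduplicate`) and the toy records are hypothetical. A real deployment would run the map and reduce steps as distributed jobs on a framework such as Hadoop rather than in a single process.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (sequence identifier, amino-acid sequence)


def map_phase(records: Iterable[Record]) -> Iterator[Tuple[str, str]]:
    """Map: emit (normalized sequence, identifier) so identical sequences share a key."""
    for seq_id, seq in records:
        yield seq.strip().upper(), seq_id


def shuffle(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Shuffle: group all emitted values by key, as the framework would between phases."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups: Dict[str, List[str]]) -> Iterator[Record]:
    """Reduce: keep one representative identifier per distinct sequence."""
    for seq, ids in groups.items():
        yield min(ids), seq


def deduplicate(records: Iterable[Record]) -> List[Record]:
    """Run map, shuffle, and reduce in sequence and return sorted unique records."""
    return sorted(reduce_phase(shuffle(map_phase(records))))


# Toy input (hypothetical identifiers and sequences, not data from the paper).
toy_records = [("P1", "MKTAYIAK"), ("P2", "mktayiak"), ("P3", "GSHMLEDP")]
unique_records = deduplicate(toy_records)
```

Here `P1` and `P2` carry the same sequence once case is normalized, so only one of them survives the reduce step. Keying on the sequence itself is the simplest possible choice; similarity-based redundancy reduction would need a different key or an external clustering tool.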

Original language: English
Title of host publication: Proceedings - 2014 IEEE International Conference on Bioinformatics and Biomedicine, IEEE BIBM 2014
Editors: Huiru Zheng, Xiaohua Tony Hu, Daniel Berrar, Yadong Wang, Werner Dubitzky, Jin-Kao Hao, Kwang-Hyun Cho, David Gilbert
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 23-29
Number of pages: 7
ISBN (Electronic): 9781479956692
State: Published - Dec 29 2014
Event: 2014 IEEE International Conference on Bioinformatics and Biomedicine, IEEE BIBM 2014 - Belfast, United Kingdom
Duration: Nov 2 2014 → Nov 5 2014

Publication series

Name: Proceedings - 2014 IEEE International Conference on Bioinformatics and Biomedicine, IEEE BIBM 2014

Conference

Conference: 2014 IEEE International Conference on Bioinformatics and Biomedicine, IEEE BIBM 2014
Country/Territory: United Kingdom
City: Belfast
Period: 11/2/14 → 11/5/14

Keywords

  • MapReduce
  • deep networks
  • protein structure prediction
