Data Integration Using Model-Based Boosting

Bin Li, Somsubhra Chakraborty, David C. Weindorf, Qingzhao Yu

Research output: Contribution to journal › Article › peer-review


The need for data integration is becoming ubiquitous across many disciplines, driven by technological developments in instrumentation. Combining information from distinct data sources in modeling, so as to improve prediction accuracy and gain a holistic view of the problem, is a challenge for statisticians. In this paper, we present a flexible statistical framework for integrating various types of data from distinct sources through model-based boosting (IMBoost) with two types of base models: regression trees and penalized splines. The performance of IMBoost is illustrated through two recent studies in environmental soil science, where multiple sensors were used to quantify several soil parameters. Empirical results are promising and show that the proposed algorithms substantially improve prediction performance by combining the strengths of distinct data sources. We also propose a surrogate model approach, which allows IMBoost to handle situations in which some samples are missing from some of the sources.
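
The general idea of boosting over several data sources can be illustrated with a minimal sketch. This is not the authors' IMBoost algorithm, whose details are not given in this abstract. It is generic component-wise L2 boosting under stated assumptions: each source is a separate feature matrix measured on the same samples, the base learner is a one-split regression stump (a stand-in for the regression trees mentioned above), and at every iteration the source whose stump best fits the current residuals is added with a small step size. The names fit_stump, predict_stump and boost_sources, the synthetic data, and the values n_iter=50 and nu=0.1 are all hypothetical choices made for illustration.

```python
import numpy as np

def fit_stump(X, r):
    """Least-squares one-split stump fitted to residuals r.
    Returns (sse, feature, threshold, left_mean, right_mean)."""
    best = (np.inf, 0, 0.0, r.mean(), r.mean())
    for j in range(X.shape[1]):
        # Dropping the largest value keeps the right-hand side non-empty.
        for t in np.unique(X[:, j])[:-1]:
            left = X[:, j] <= t
            lm, rm = r[left].mean(), r[~left].mean()
            sse = ((r[left] - lm) ** 2).sum() + ((r[~left] - rm) ** 2).sum()
            if sse < best[0]:
                best = (sse, j, t, lm, rm)
    return best

def predict_stump(stump, X):
    _, j, t, lm, rm = stump
    return np.where(X[:, j] <= t, lm, rm)

def boost_sources(sources, y, n_iter=50, nu=0.1):
    """Component-wise L2 boosting: one candidate stump per source per
    iteration; only the best-fitting source updates the prediction."""
    offset = float(y.mean())
    pred = np.full(len(y), offset)
    ensemble = []
    for _ in range(n_iter):
        r = y - pred                                # current residuals
        fits = [fit_stump(X, r) for X in sources]   # one stump per source
        k = int(np.argmin([f[0] for f in fits]))    # best source this round
        pred += nu * predict_stump(fits[k], sources[k])
        ensemble.append((k, fits[k]))
    return offset, ensemble, pred

# Synthetic illustration (assumed data, not from the paper): the response
# depends on one feature from each of two "sensors".
rng = np.random.default_rng(0)
n = 100
XA = rng.normal(size=(n, 2))   # hypothetical source A
XB = rng.normal(size=(n, 2))   # hypothetical source B
y = 2.0 * (XA[:, 0] > 0) + 2.0 * (XB[:, 1] > 0) + 0.1 * rng.normal(size=n)

_, ensemble, pred_both = boost_sources([XA, XB], y)
_, _, pred_a = boost_sources([XA], y)
mse_both = float(np.mean((y - pred_both) ** 2))
mse_a = float(np.mean((y - pred_a) ** 2))
print("training MSE, both sources:", mse_both)
print("training MSE, source A only:", mse_a)
```

In this toy setting the fit that can draw on both sources reaches a lower training error than the fit restricted to one source, which is the kind of gain the abstract describes. The comparison here uses training error only; a real evaluation would use held-out data.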

Original language: English
Article number: 400
Journal: SN Computer Science
Issue number: 5
State: Published - Sep 2021


  • Boosting
  • Data integration
  • Missing data
  • Penalized splines
  • Regression tree

