A robust framework for classifying evolving document streams in an expert-machine-crowd setting

Muhammad Imran, Sanjay Chawla, Carlos Castillo

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

5 Scopus citations

Abstract

An emerging challenge in the online classification of social media data streams is to keep the categories used for classification up-to-date. In this paper, we propose an innovative framework based on an Expert-Machine-Crowd (EMC) triad to help categorize items by continuously identifying novel concepts in heterogeneous data streams often riddled with outliers. We unify constrained clustering and outlier detection by formulating a novel optimization problem: COD-Means. We design an algorithm to solve the COD-Means problem and show that COD-Means will not only help detect novel categories but also seamlessly discover human annotation errors and improve the overall quality of the categorization process. Experiments on diverse real data sets demonstrate that our approach is both effective and efficient.
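
The abstract does not give the COD-Means formulation, so the following is only a minimal illustrative sketch of the general idea of combining constrained clustering with outlier detection. It is not the authors' algorithm. It assumes a seeded k-means variant in which expert-labeled items are pinned to their cluster, and the unlabeled points farthest from any center are trimmed as outliers at each iteration (candidates for review as noise or as a novel category). The function name `seeded_trimmed_kmeans`, its parameters, and the toy data are all hypothetical. Because expert labels are treated here as hard constraints, the sketch does not attempt the annotation-error discovery that the abstract mentions.

```python
def dist2(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def seeded_trimmed_kmeans(points, seeds, n_outliers, n_iter=20):
    """Illustrative only (assumption, not COD-Means itself).

    points     : list of numeric tuples
    seeds      : dict {point index: cluster id} of expert labels,
                 treated as hard constraints; every cluster needs >= 1 seed
    n_outliers : number of unlabeled points to trim per iteration
    Returns (labels, outliers, centers).
    """
    clusters = sorted(set(seeds.values()))
    # Initialise each center from the expert-labeled items of that cluster.
    centers = {
        c: centroid([points[i] for i, s in seeds.items() if s == c])
        for c in clusters
    }
    labels, outliers = {}, set()
    for _ in range(n_iter):
        # Assignment: seeds keep their label, other points go to the nearest center.
        nearest = {}
        for i, p in enumerate(points):
            if i in seeds:
                c = seeds[i]
            else:
                c = min(clusters, key=lambda k: dist2(p, centers[k]))
            nearest[i] = (c, dist2(p, centers[c]))
        # Trimming: the n_outliers unlabeled points farthest from their center.
        free = [i for i in range(len(points)) if i not in seeds]
        free.sort(key=lambda i: nearest[i][1], reverse=True)
        new_outliers = set(free[:n_outliers])
        new_labels = {i: c for i, (c, _) in nearest.items() if i not in new_outliers}
        if new_labels == labels and new_outliers == outliers:
            break  # assignments are stable, so the centers would not change
        labels, outliers = new_labels, new_outliers
        # Update: recompute centers from non-outlier members only.
        for c in clusters:
            centers[c] = centroid([points[i] for i, l in labels.items() if l == c])
    return labels, outliers, centers


# Toy data: two tight groups plus one far-away point.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
seeds = {0: "a", 3: "b"}  # hypothetical expert labels
labels, outliers, centers = seeded_trimmed_kmeans(points, seeds, n_outliers=1)
print(sorted(outliers))  # [6]
```

In this toy run the far-away point is excluded from both clusters and does not pull either center toward it, which is the practical motivation for handling outliers inside the clustering loop rather than as a separate preprocessing step.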

Original language: English
Title of host publication: Proceedings - 16th IEEE International Conference on Data Mining, ICDM 2016
Editors: Francesco Bonchi, Xindong Wu, Ricardo Baeza-Yates, Josep Domingo-Ferrer, Zhi-Hua Zhou
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 961-966
Number of pages: 6
ISBN (Electronic): 9781509054725
State: Published - Jan 31 2017
Externally published: Yes
Event: 16th IEEE International Conference on Data Mining, ICDM 2016 - Barcelona, Catalonia, Spain
Duration: Dec 12 2016 to Dec 15 2016

Publication series

Name: Proceedings - IEEE International Conference on Data Mining, ICDM
ISSN (Print): 1550-4786

Conference

Conference: 16th IEEE International Conference on Data Mining, ICDM 2016
Country/Territory: Spain
City: Barcelona, Catalonia
Period: 12/12/16 to 12/15/16

Keywords

  • Novel concept detection
  • Outlier detection
  • Social media
  • Stream classification
  • Text classification
