Robust record linkage blocking using suffix arrays

Timothy De Vries, Hui Ke, Sanjay Chawla, Peter Christen

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

40 Scopus citations

Abstract

Record linkage is an important data integration task that has many practical uses for matching, merging and duplicate removal in large and diverse databases. However, a quadratic scalability for the brute force approach necessitates the design of appropriate indexing or blocking techniques. We design and evaluate an efficient and highly scalable blocking approach based on suffix arrays. Our suffix grouping technique exploits the ordering used by the index to merge similar blocks at marginal extra cost, resulting in a much higher accuracy while retaining the high scalability of the base suffix array method. Efficiently grouping similar suffixes is carried out with the use of a sliding window technique. We carry out an in-depth analysis of our method and show results from experiments using real and synthetic data, which highlights the importance of using efficient indexing and blocking in real world applications where data sets contain millions of records.

Original languageEnglish
Title of host publicationACM 18th International Conference on Information and Knowledge Management, CIKM 2009
Pages305-314
Number of pages10
DOIs
StatePublished - 2009
Externally publishedYes
EventACM 18th International Conference on Information and Knowledge Management, CIKM 2009 - Hong Kong, China
Duration: Nov 2 2009Nov 6 2009

Publication series

NameInternational Conference on Information and Knowledge Management, Proceedings

Conference

ConferenceACM 18th International Conference on Information and Knowledge Management, CIKM 2009
Country/TerritoryChina
CityHong Kong
Period11/2/0911/6/09

Keywords

  • Blocking
  • Record linkage
  • Suffix arrays

Fingerprint

Dive into the research topics of 'Robust record linkage blocking using suffix arrays'. Together they form a unique fingerprint.

Cite this