Identification of deliberately doctored text documents using Frequent Keyword Chain (FKC) model

Siddharth Kaza, S. N. Jayaram Murthy, Gongzhu Hu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

7 Scopus citations

Abstract

Text documents have always been the dominant source of available data. A number of classification techniques are used to organize these documents, and most of these algorithms rely on keywords to categorize them. Such algorithms can be misled by inserting keywords that belong to a class different from that of the document ('deliberate doctoring'). This kind of intentional deception is used to rank web pages higher in searches. Because text classification is also used to classify e-mails, deliberate doctoring serves as a spam-filter-busting measure as well. In addition, it may be practiced to avoid detection by security agencies. The cost of such misclassification can be high, which makes it a serious problem in many scenarios. In this paper we exhaustively examine the possible methods of doctoring a document that may lead to its misclassification. We conclude that a majority of these methods involve inserting a number of misleading keywords in close proximity. We propose the Frequent Keyword Chain (FKC) model to identify such local concentrations of keywords. A tool called the FKCLocater, designed around the model, identifies and highlights FKCs in a document and alerts the user to the possibility of misclassification. The tool can also be used to specify various parameters that fine-tune the Frequent Keyword Chain model. Experiments on Newsgroup data sets show that the model is effective.
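
The abstract does not spell out how an FKC is computed, so the following is only a minimal sketch of the general idea of flagging keywords inserted in close proximity. It assumes a chain is a run of keyword hits in which consecutive hits are at most `max_gap` tokens apart; the function name `find_keyword_chains`, the parameters `max_gap` and `min_chain_len`, and the tokenization are all hypothetical and are not taken from the paper or from the FKCLocater tool.

```python
import re
from typing import List, Set, Tuple


def find_keyword_chains(
    text: str,
    keywords: Set[str],
    max_gap: int = 5,        # hypothetical: max token distance between consecutive hits
    min_chain_len: int = 3,  # hypothetical: fewest hits that count as a chain
) -> List[Tuple[int, int, List[str]]]:
    """Return (first_token_index, last_token_index, hits) for each chain found."""
    # Lowercase and split into simple word tokens (an assumed tokenization).
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    hits = [(i, tok) for i, tok in enumerate(tokens) if tok in keywords]

    chains: List[Tuple[int, int, List[str]]] = []
    current: List[Tuple[int, str]] = []

    def flush() -> None:
        # Keep the current run only if it is long enough to look suspicious.
        if len(current) >= min_chain_len:
            chains.append((current[0][0], current[-1][0], [t for _, t in current]))

    for pos, tok in hits:
        # A gap larger than max_gap ends the current run of nearby keywords.
        if current and pos - current[-1][0] > max_gap:
            flush()
            current = []
        current.append((pos, tok))
    flush()
    return chains


if __name__ == "__main__":
    doc = (
        "The committee met on Tuesday to discuss the budget. "
        "mortgage loan refinance credit rates cheap. "
        "The next meeting is in March."
    )
    finance = {"mortgage", "loan", "refinance", "credit", "rates"}
    for start, end, words in find_keyword_chains(doc, finance):
        print(f"tokens {start}-{end}: {' '.join(words)}")
```

In this sketch, keywords scattered naturally through a document rarely form a chain, while a block of inserted off-topic keywords does. The two thresholds stand in for the kind of tunable parameters the abstract mentions, without claiming to match them.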

Original language: English
Title of host publication: Proceedings of the 2003 IEEE International Conference on Information Reuse and Integration, IRI 2003
Editors: Waleed W. Smari, Atif M. Memon
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 398-405
Number of pages: 8
ISBN (Electronic): 0780382420, 9780780382428
State: Published - 2003
Event: IEEE International Conference on Information Reuse and Integration, IRI 2003 - Las Vegas, United States
Duration: Oct 27 2003 to Oct 29 2003

Keywords

  • Doctored document detection
  • Frequent keywords
  • Text document classification
