Mining a corpus of biographical texts using keywords

Mike Conway

Research output: Contribution to journal › Article › peer-review

8 Scopus citations

Abstract

Using statistically derived keywords to characterize texts has become an important research method for digital humanists and corpus linguists in areas such as literary analysis and the exploration of genre difference. Keywords, and the associated concepts of 'keyness' and 'key-keyness', have inspired conferences and workshops, many and varied research papers, and are central to several modern corpus processing tools. In this article, we present evidence that (at least for the task of biographical sentence classification) frequent words characterize texts better than keywords or key-keywords. Using the naïve Bayes learning algorithm in conjunction with frequency-, keyword-, and key-keyword-based text representations to classify a corpus of biographical sentences, we discovered that the use of frequent words alone provided a classification accuracy better than either the keyword or key-keyword representations at a statistically significant level. This result suggests that (for the biographical sentence classification task at least) frequent words characterize texts better than keywords derived using more computationally intensive methods.
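
The comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the article's implementation. The toy sentences, the function names, and the vocabulary size are hypothetical. The keyness statistic used here (Dunning's log-likelihood against a reference corpus) is an assumption, since the abstract does not say which statistic was used. Key-keywords are omitted for brevity. The sketch builds a frequency-based and a keyword-based vocabulary, then trains a multinomial naïve Bayes classifier with add-one smoothing on a chosen vocabulary.

```python
import math
from collections import Counter

# Hypothetical toy data, for illustration only (not from the article's corpus).
BIO = ["he was born in london in 1850",
       "she was born in paris and studied law",
       "he died in 1920 in rome"]
NON_BIO = ["the weather in london is mild",
           "the train leaves at noon",
           "prices rose sharply in march"]

def tokenize(sentence):
    return sentence.lower().split()

def frequent_words(sentences, k):
    """Frequency representation: the k most frequent word types."""
    counts = Counter(w for s in sentences for w in tokenize(s))
    return [w for w, _ in counts.most_common(k)]

def log_likelihood(a, b, n1, n2):
    """Dunning log-likelihood for a word seen a times in a study corpus
    of n1 tokens and b times in a reference corpus of n2 tokens."""
    e1 = n1 * (a + b) / (n1 + n2)
    e2 = n2 * (a + b) / (n1 + n2)
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll

def keywords(study, reference, k):
    """Keyword representation (assumed statistic): the k words that are
    most unusually frequent in study relative to reference."""
    sc = Counter(w for s in study for w in tokenize(s))
    rc = Counter(w for s in reference for w in tokenize(s))
    n1, n2 = sum(sc.values()), sum(rc.values())
    scored = [(log_likelihood(c, rc[w], n1, n2), w)
              for w, c in sc.items() if c / n1 > rc[w] / n2]
    scored.sort(key=lambda t: (-t[0], t[1]))  # ties broken alphabetically
    return [w for _, w in scored[:k]]

def train_nb(labeled, vocab):
    """Collect class and word counts, ignoring words outside vocab."""
    vocab = set(vocab)
    class_counts, word_counts = Counter(), {}
    for sentence, label in labeled:
        class_counts[label] += 1
        wc = word_counts.setdefault(label, Counter())
        wc.update(w for w in tokenize(sentence) if w in vocab)
    return vocab, class_counts, word_counts

def classify(model, sentence):
    """Return the label with the highest smoothed log posterior."""
    vocab, class_counts, word_counts = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in tokenize(sentence):
            if w in vocab:
                score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

LABELED = [(s, "bio") for s in BIO] + [(s, "non-bio") for s in NON_BIO]
freq_model = train_nb(LABELED, frequent_words(BIO + NON_BIO, 10))
kw_model = train_nb(LABELED, keywords(BIO, NON_BIO, 10))
print(classify(freq_model, "she was born in rome"))
print(classify(kw_model, "she was born in rome"))
```

Swapping the vocabulary passed to `train_nb` is the only difference between the two representations, so classification accuracy can be compared on held-out sentences without changing the classifier.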

Original language: English
Article number: fqp035
Pages (from-to): 23-35
Number of pages: 13
Journal: Literary and Linguistic Computing
Volume: 25
Issue number: 1
State: Published - Oct 6 2009
Externally published: Yes