JSM Preliminary Online Program
This is the preliminary program for the 2009 Joint Statistical Meetings in Washington, DC.

The views expressed here are those of the individual authors
and not necessarily those of the ASA or its board, officers, or staff.






Activity Number: 319
Type: Contributed
Date/Time: Tuesday, August 4, 2009 : 10:30 AM to 12:20 PM
Sponsor: Section on Statistical Computing
Abstract - #304472
Title: Detecting Anomalous Documents in a Corpus-Driven Language Model
Author(s): Kristin Yancey*+ and Elizabeth L. Hohman
Companies: Naval Surface Warfare Center and Naval Surface Warfare Center
Keywords: text processing ; language identification ; anomaly detection
Abstract:

Given a corpus of documents purporting to be in a single language, we develop a methodology for detecting documents written in a different language. Unlike previous work in language identification, the methodology does not assume any prior knowledge specific to either language. Information-theoretic methods are used to determine key terms that are highly predictive of language identity. A language model is then built from these key terms and compared to models developed from each document using rank-order statistics, cosine dissimilarity, and other approaches. We evaluate the different methods of constructing and ranking the models on a corpus of multilingual Wikipedia articles.
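
As a rough illustration of the comparison step described in the abstract, the sketch below ranks documents by cosine dissimilarity between each document's term-frequency model and a model built from the whole corpus. It is not the authors' implementation: the information-theoretic key-term selection is omitted and plain word frequencies over the full vocabulary stand in for it, and the function names (`term_model`, `cosine_dissimilarity`, `rank_anomalies`) are hypothetical.

```python
import math
import re
from collections import Counter


def term_model(text):
    """Relative word frequencies of a text (assumed stand-in for a key-term model)."""
    counts = Counter(re.findall(r"\w+", text.lower()))
    total = sum(counts.values())
    return {term: c / total for term, c in counts.items()} if total else {}


def cosine_dissimilarity(p, q):
    """1 - cosine similarity of two sparse frequency vectors stored as dicts."""
    dot = sum(v * q.get(term, 0.0) for term, v in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 1.0  # treat an empty model as maximally dissimilar
    return 1.0 - dot / (norm_p * norm_q)


def rank_anomalies(docs):
    """Return (dissimilarity, index) pairs, most anomalous document first."""
    corpus_model = term_model(" ".join(docs))
    scores = [
        (cosine_dissimilarity(term_model(doc), corpus_model), i)
        for i, doc in enumerate(docs)
    ]
    return sorted(scores, reverse=True)


if __name__ == "__main__":
    # Toy corpus (invented for illustration): three English texts, one German.
    docs = [
        "the cat sat on the mat and the dog sat on the rug",
        "the dog and the cat are on the mat in the house",
        "the house is on the hill and the cat is in the house",
        "der hund und die katze sind in dem haus auf dem berg",
    ]
    for score, index in rank_anomalies(docs):
        print(index, round(score, 3))
```

In this sketch the corpus model includes the anomalous document itself, which is acceptable when anomalies are rare; a rank-order comparison of term lists could be substituted for the cosine measure without changing the overall structure.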


  • The address information is for the authors who have a + after their name.
  • Authors who are presenting talks have a * after their name.



Revised September, 2008