Science 10 February 1995:
Vol. 267, no. 5199, pp. 843-848
DOI: 10.1126/science.267.5199.843

Articles

Gauging Similarity with n-Grams: Language-Independent Categorization of Text

Marc Damashek 1

1 Department of Defense, Fort George G. Meade, MD 20755-6000, USA.

A language-independent means of gauging topical similarity in unrestricted text is described. The method combines information derived from n-grams (consecutive sequences of n characters) with a simple vector-space technique that makes sorting, categorization, and retrieval feasible in a large multilingual collection of documents. No prior information about document content or language is required. Context, as it applies to document similarity, can be accommodated by a well-defined procedure. When an existing document is used as an exemplar, the completeness and accuracy with which topically related documents are retrieved are comparable to those of the best existing systems. The results of a formal evaluation are discussed, and examples are given using documents in English and Japanese.
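
The general idea of pairing character n-grams with a vector-space comparison can be illustrated with a short Python sketch. It is not the article's implementation: the function names, the default n = 5, the lowercasing and whitespace normalization, and the use of plain cosine similarity are all assumptions made for illustration, and the abstract's procedure for accommodating context is not reproduced here.

```python
from collections import Counter
from math import sqrt


def ngram_profile(text, n=5):
    """Return relative frequencies of overlapping character n-grams.

    Assumption: text is lowercased and runs of whitespace are collapsed
    to a single space before the n-grams are counted.
    """
    normalized = " ".join(text.lower().split())
    grams = [normalized[i:i + n] for i in range(len(normalized) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}


def cosine_similarity(p, q):
    """Cosine of the angle between two sparse n-gram frequency vectors."""
    dot = sum(weight * q.get(gram, 0.0) for gram, weight in p.items())
    norm_p = sqrt(sum(weight * weight for weight in p.values()))
    norm_q = sqrt(sum(weight * weight for weight in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)


def rank_by_similarity(exemplar, documents, n=5):
    """Rank documents against an exemplar text.

    Returns (score, index) pairs, highest score first.
    """
    reference = ngram_profile(exemplar, n)
    scored = [
        (cosine_similarity(reference, ngram_profile(doc, n)), index)
        for index, doc in enumerate(documents)
    ]
    return sorted(scored, reverse=True)
```

Because the profile is built from raw character sequences, the sketch needs no tokenizer, stemmer, or dictionary, which is what lets the same code run on text in any language or script.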


THIS ARTICLE HAS BEEN CITED BY OTHER ARTICLES:
Stemming and n-grams in Spanish: an evaluation of their impact on information retrieval.
C. G. Figuerola, R. Gomez, and E. L. de San Roman (2000)
Journal of Information Science 26, 461-467
Computer-Supported Content Analysis: Trends, Tools, and Techniques.
W. Evans (1996)
Social Science Computer Review 14, 269-279
Schoolbook Simplification and Its Relation to the Decline in SAT-Verbal Scores.
D. P. Hayes, L. T. Wolfer, and M. F. Wolfe (1996)
American Educational Research Journal 33, 489-508





Science. ISSN 0036-8075 (print), 1095-9203 (online)