LHNCBC: Document Abstract


Year: 2000
LHNCBC-2000-060
Extracting the Lowest-Frequency Words: Pitfalls and Possibilities
Weeber M, Vos R
Computational Linguistics. 2000;26(3):301-17.
In a medical information extraction system, we use common word association techniques to extract side-effect-related terms. Many of these terms have a frequency of less than five. Standard word-association-based applications disregard the lowest-frequency words, and hence disregard useful information. We therefore devised an extraction system for the full word frequency range. This system computes the significance of association by the log-likelihood ratio and Fisher's exact test. The output of the system shows a recurrent, corpus-independent pattern in both recall and the number of significant words. We will explain these patterns by the statistical behavior of the lowest-frequency words. We used Dutch verb-particle combinations as a second and independent collocation extraction application to illustrate the generality of the observed phenomena. We will conclude that a) word-association-based extraction systems can be enhanced by also considering the lowest-frequency words, b) significance levels should not be fixed but adjusted for the optimal window size, c) hapax legomena, words occurring only once, should be disregarded a priori in the statistical analysis, and d) the distribution of the targets to extract should be considered in combination with the extraction method.
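To illustrate the two association measures named in the abstract, the following is a minimal Python sketch, assuming co-occurrence counts arranged in a 2x2 contingency table (k11 = both words present, k12 and k21 = only one present, k22 = neither). The function names, the cell naming, and the choice of a one-sided Fisher test are assumptions made for this illustration; the abstract does not specify the authors' implementation.

```python
import math


def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 log-likelihood ratio statistic for a 2x2 contingency table.

    Cell naming (k11..k22) is an assumption of this sketch.
    """
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    total = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing (0 * log 0 -> 0)
                expected = rows[i] * cols[j] / n
                total += o * math.log(o / expected)
    return 2.0 * total


def fisher_exact_right_tail(k11, k12, k21, k22):
    """One-sided Fisher's exact test: P(X >= k11) with all margins fixed.

    Uses the hypergeometric distribution directly via math.comb.
    """
    n = k11 + k12 + k21 + k22
    row1 = k11 + k12
    col1 = k11 + k21
    upper = min(row1, col1)
    favourable = sum(
        math.comb(row1, x) * math.comb(n - row1, col1 - x)
        for x in range(k11, upper + 1)
    )
    return favourable / math.comb(n, col1)


if __name__ == "__main__":
    # Toy table: two words that each occur once, together, in 10 windows.
    table = (1, 0, 0, 9)
    print(log_likelihood_ratio(*table))
    print(fisher_exact_right_tail(*table))
```

For this toy table the G^2 statistic is about 6.50, while the one-sided Fisher p-value is exactly 0.1. The table is made up for illustration and is not data from the paper; it only shows how the two measures can be computed for words that occur once.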

Lister Hill National Center for Biomedical Communications
U.S. National Library of Medicine, 8600 Rockville Pike, Bethesda, MD 20894
National Institutes of Health, Department of Health & Human Services
Site last updated: 30 January 2009