The records included in each version of the Test Collections
represent a static view of the data at the time each Test Collection
was created. For example, the "Original 1999 Indexing and Format
(January 20, 1999) Version" Test Collection represents a static view of
PubMed/Medline as of January 20, 1999. The text of these records has not
been reformatted, and their MeSH Indexing has not been updated.
200 MEDLINE Citations Test Collection
This Test Collection was used in our original experiments and in the
parameter-tuning phase, and is now used to track improvements to MTI.
We have included
the original test collection from January 20, 1999 to allow comparison
based on the actual data from that time. We have also included an
updated version of the Test Collection to allow for more current
comparison studies involving the use of PMIDs and current MeSH Indexing.
Original
PubMed Central ASCII MEDLINE Format (March 22, 2005) Version (1.4 MB)
This is a text file containing abstracts in ASCII MEDLINE
format from PubMed Entrez for 498 of the 500 articles. This
file contains the MeSH Indexing used for comparison purposes
in the above-mentioned paper. Two of the articles in the test
collection have a PMID of "0" and have no indexing in this
file.
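
As a rough illustration of how such a file could be read, the sketch
below splits MEDLINE-style text into records and lists those with no
MeSH headings. It assumes the usual tagged layout (a tag padded to four
characters, then "- ", indented continuation lines, and a blank line
between records); the function name parse_medline and the sample titles
are made up for this example and are not taken from the Test Collection.

```python
import re

# Assumed layout: "PMID- 123" or "AB  - text" (tag, optional padding, "- ").
TAG_LINE = re.compile(r"^([A-Z]{2,4}) {0,2}- (.*)$")


def parse_medline(text):
    """Split MEDLINE-style text into records mapping tag -> list of values."""
    records = []
    current = None
    last_tag = None
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current record.
            current, last_tag = None, None
            continue
        match = TAG_LINE.match(line)
        if match:
            tag, value = match.groups()
            if current is None:
                current = {}
                records.append(current)
            current.setdefault(tag, []).append(value.strip())
            last_tag = tag
        elif current is not None and last_tag is not None:
            # An indented line continues the previous field.
            current[last_tag][-1] += " " + line.strip()
    return records


# Hypothetical sample data; only the tag layout matters here.
SAMPLE = """PMID- 11884248
TI  - A made-up title
MH  - Humans
MH  - Animals

PMID- 0
TI  - Another made-up title
      that wraps onto a second line
"""

records = parse_medline(SAMPLE)
# Records without any MH (MeSH heading) field have no indexing.
unindexed = [r for r in records if not r.get("MH")]
print(len(records), len(unindexed))
```

Records whose PMID is "0" would show up in the unindexed list, so they
can be skipped when comparing against the MeSH Indexing.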
Pseudo-ASCII MEDLINE Formatted Label Break-out (February 6, 2004) Version (15.4 MB)
This is a single file containing all 500 articles in a
pseudo-ASCII MEDLINE format, which MTI requires. It differs
from the file above in that the "important" sections (which
may have their own sub-sections) have been separated out of
the article, and a new "citation" has been created for each
one, identified by the article PMID and the section label,
for example "Background", "Methods", "Results", "Discussion",
or "Conclusions". The base abstract and title are listed
separately and first for each article.
Example I (PMID 11884248): the PMC file shows
<abs><sec><st>
<p>Abstract</p></st>
<p>Background</p></st>
is translated to "AB - Abstract | Background | " in the
pseudo-ASCII MEDLINE Break-out version, because the main section
"Abstract" contains a sub-section "Background".
Example II (PMID 11884248): the PMC file shows
</abs></fm><bdy><sec><st>
<p>Background</p></st>
is translated to
"PMID- 11884248_Background" in the
pseudo-ASCII MEDLINE Break-out version: a new section of
the article is identified as "Background", so we create a
new "citation" using the same PMID and the new section name
as the identifier.
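
The two translations in the examples above can be sketched as a pair of
small string-building helpers. This is only an illustration of the
naming pattern quoted in the examples, not the conversion program that
produced the file; the function names abstract_label and section_pmid
are hypothetical.

```python
def abstract_label(path):
    """Join nested heading names, e.g. 'AB - Abstract | Background | '."""
    # Each heading name in the path is followed by " | ".
    return "AB - " + "".join(name + " | " for name in path)


def section_pmid(pmid, section):
    """Build the identifier line for a body section's new 'citation'."""
    # The article PMID and the section name are joined with an underscore.
    return f"PMID- {pmid}_{section}"


# Example I: the main section "Abstract" contains "Background".
print(abstract_label(["Abstract", "Background"]))

# Example II: a new body section named "Background".
print(section_pmid("11884248", "Background"))
```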
Word Sense Disambiguation (WSD) Test Collection
The test collection consists of 50 highly frequent ambiguous UMLS concepts
from 1998 MEDLINE. Each of the 50 ambiguous cases has 100 ambiguous
instances randomly selected from the 1998 MEDLINE citations, for a total
of 5,000 instances. We had a total of 11 evaluators, of whom 8 completed
100% of the 5,000 instances, 1 completed 56%, 1 completed 44%, and the final
evaluator completed 12%. An evaluator's judgments were used only when that
evaluator had completed all 100 instances for a given ambiguity.
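
The instance count and the inclusion rule described above can be
sketched as follows. The data layout (a mapping from evaluator to the
number of instances completed per ambiguity) and the names evaluator_1,
term_a and so on are hypothetical, chosen only to show the rule; they do
not reflect how the evaluations are actually stored.

```python
AMBIGUITIES = 50
INSTANCES_PER_AMBIGUITY = 100
TOTAL_INSTANCES = AMBIGUITIES * INSTANCES_PER_AMBIGUITY  # 50 * 100


def usable_evaluations(completed):
    """Keep (evaluator, ambiguity) pairs where all 100 instances were judged."""
    return sorted(
        (evaluator, ambiguity)
        for evaluator, counts in completed.items()
        for ambiguity, n in counts.items()
        if n == INSTANCES_PER_AMBIGUITY
    )


# Hypothetical completion counts for two evaluators and two ambiguities.
completed = {
    "evaluator_1": {"term_a": 100, "term_b": 100},
    "evaluator_2": {"term_a": 100, "term_b": 37},
}

print(TOTAL_INSTANCES)
print(usable_evaluations(completed))
```

Here evaluator_2's partial work on term_b is dropped, while the same
evaluator's complete work on term_a is kept.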
The following paper describes the development of the test collection in
more detail: