NIST Scientific and Technical Databases

NIST Special Database 22

NIST TREC Document Database: Disk 4

  

PRICE: $90.00 for disk 4

NIST TREC Document Databases are distributed for the development and testing of information retrieval (IR) systems and related natural language processing research. The document collections consist of the full text of various newspaper and newswire articles plus government proceedings. The documents have been used to develop a series of large IR test collections known as the TREC collections (http://trec.nist.gov). An IR test collection consists of three parts: a set of documents, a set of questions (called topics in TREC) that can be answered by some of the documents, and the right answers (called relevance judgments), which list the documents that are relevant to each question. The topics and relevance judgments for the TREC collections are available separately at http://trec.nist.gov/data.html. Three other disks of documents, known as the "TIPSTER" disks, were used in early TREC workshops. The TIPSTER disks may be purchased from the Linguistic Data Consortium (LDC) (www.ldc.upenn.edu).
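The three-part structure of a test collection described above can be sketched as a small data model. This is an illustration only: the class name `IRTestCollection`, its field names, the `precision` helper, and the sample identifiers are all hypothetical and do not reflect the actual TREC topic or relevance judgment file formats.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class IRTestCollection:
    """Hypothetical in-memory model of an IR test collection."""

    documents: Dict[str, str] = field(default_factory=dict)       # document id -> text
    topics: Dict[str, str] = field(default_factory=dict)          # topic id -> question
    judgments: Dict[str, Set[str]] = field(default_factory=dict)  # topic id -> relevant document ids

    def precision(self, topic_id: str, retrieved: List[str]) -> float:
        """Fraction of the retrieved documents judged relevant to the topic."""
        if not retrieved:
            return 0.0
        relevant = self.judgments.get(topic_id, set())
        hits = sum(1 for doc_id in retrieved if doc_id in relevant)
        return hits / len(retrieved)


# Made-up sample data, not taken from any TREC disk.
collection = IRTestCollection(
    documents={"DOC-1": "text about trade", "DOC-2": "text about weather"},
    topics={"T1": "Which documents discuss trade?"},
    judgments={"T1": {"DOC-1"}},
)
score = collection.precision("T1", ["DOC-1", "DOC-2"])
```

Precision is used here only as one simple way to show how relevance judgments let a system's output be scored against the right answers.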

The documents on the TREC disks are formatted with a labeled bracketing expressed in the style of SGML (Standard Generalized Markup Language). SGML DTDs are included on each disk. The datasets on the disks share the same major structure but differ in their minor structures. Every document is bracketed by <DOC></DOC> tags and has a unique document identifier, bracketed by <DOCNO></DOCNO> tags. The datasets have all been compressed using the UNIX compress utility and are stored in chunks of about 1 megabyte each (uncompressed size).
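Because every document sits between <DOC></DOC> tags and carries its identifier between <DOCNO></DOCNO> tags, one uncompressed chunk can be split into documents with a simple pattern match. The following is a minimal sketch, assuming the chunk has already been read into a string; the function name `iter_documents`, the sample identifiers, and the <TEXT> tag in the sample are hypothetical, since the minor structure varies by dataset.

```python
import re
from typing import Iterator, Tuple

# Non-greedy matches so that each <DOC> ... </DOC> pair is taken separately.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
DOCNO_RE = re.compile(r"<DOCNO>(.*?)</DOCNO>", re.DOTALL)


def iter_documents(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (document identifier, raw document body) pairs from one chunk."""
    for match in DOC_RE.finditer(text):
        body = match.group(1)
        docno = DOCNO_RE.search(body)
        if docno is None:
            continue  # skip a malformed record that has no identifier
        yield docno.group(1).strip(), body


# Made-up sample in the bracketing style described above.
sample = """<DOC>
<DOCNO> EXAMPLE-0001 </DOCNO>
<TEXT>First made-up document.</TEXT>
</DOC>
<DOC>
<DOCNO> EXAMPLE-0002 </DOCNO>
<TEXT>Second made-up document.</TEXT>
</DOC>
"""

doc_ids = [doc_id for doc_id, _ in iter_documents(sample)]
```

A regular expression is used rather than a full SGML parser because only the two tags guaranteed by the description are needed; a DTD-aware parser would be the better choice for the dataset-specific fields.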

The contents of the disk are as follows:

TREC Disk 4

    Congressional Record of the 103rd Congress
    approx. 30,000 documents
    approx. 235 MB

    Federal Register (1994)
    approx. 55,000 documents
    approx. 395 MB

    Financial Times (1992-1994)
    approx. 210,000 documents
    approx. 565 MB

Some of the data on the disks is copyrighted by the data providers. The data providers have granted permission to use the data for research purposes only.

A signed hard copy of the data use permission form must be received by NIST before a disk can be shipped. Please fax the completed form to the number below.

SYSTEM REQUIREMENTS: CD-ROM drive with software to read ISO-9660 format. Software to uncompress files (e.g., UNIX "uncompress", gunzip).
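As a rough illustration of the uncompress step, the sketch below calls an external `gzip -dc`, which writes the uncompressed data to standard output and also reads files produced by UNIX compress. It assumes a `gzip` executable is available on the PATH; the function names and the file name `chunk.Z` are hypothetical and are not taken from the disk layout.

```python
import subprocess
from pathlib import Path
from typing import List


def decompress_command(path: Path) -> List[str]:
    """Build a command that writes the uncompressed data to stdout."""
    # Assumption: a gzip binary is installed; "-d" decompresses, "-c" writes to stdout.
    return ["gzip", "-dc", str(path)]


def read_chunk(path: Path) -> bytes:
    """Return the uncompressed bytes of one chunk file."""
    result = subprocess.run(decompress_command(path), check=True, capture_output=True)
    return result.stdout


# Hypothetical file name, used only to show the command that would be run.
command = decompress_command(Path("chunk.Z"))
```

The function returns bytes rather than text because the character encoding of the files is not stated here; decoding is left to the caller.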

Further questions, please contact:

Standard Reference Data Program
National Institute of Standards and Technology
Bldg. 820/Room 113
Gaithersburg, MD 20899

(301) 975-2008 (VOICE)/(301) 926-0416 (FAX)
http://www.nist.gov/srd

Scientific Contact:

    Ellen Voorhees
    National Institute of Standards and Technology
    Gaithersburg, MD 20899
    Ph: (301) 975-3761
    ellen.voorhees@nist.gov

Keywords: Information retrieval, text processing, test collections.



Create Date: 6/02
Last Update: Thursday, 15-Mar-07 13:06:11