National Institute of Standards and Technology
NIST Special Database 25
NIST Federal Register Document Image Database: Volume 1
Price: $90.00
NIST has produced a new document image database for evaluating document
analysis and recognition technologies and information retrieval systems.
NIST Special Database 25 contains page images from the 1994 Federal Register,
together with ground truth text, commercial OCR results, and image quality
assessment results.
A new, fully automated process developed at NIST was used to derive ground
truth for the document images. The method matches the optical character
recognition (OCR) results from a page against the typesetting files for an
entire book. Public domain software for deriving ground truth is provided in
the form of Perl scripts and C source code, and includes new, more efficient
string alignment technology and a word-level scoring package. The
documentation includes a complete software reference guide with online manual
pages. With this ground truthing technology, it is now feasible to produce
much larger data sets, at much lower cost, than was possible with previous
labor-intensive, manual data collection projects.
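To illustrate what word-level alignment and scoring mean in this context, the following is a minimal sketch in Python. It is not the NIST software (which is distributed as Perl scripts and C source code) and does not reproduce its string alignment method; it assumes the alignment in the standard library module difflib as a stand-in. The function name word_level_score and the sample strings are hypothetical.

```python
from difflib import SequenceMatcher


def word_level_score(truth_text, ocr_text):
    """Align OCR words against ground-truth words and count exact matches.

    Illustrative only: uses difflib's sequence matching, not the
    alignment technology shipped with the database.
    """
    truth_words = truth_text.split()
    ocr_words = ocr_text.split()
    # autojunk=False disables difflib's popularity heuristic, which could
    # otherwise ignore frequent words on long pages.
    matcher = SequenceMatcher(None, truth_words, ocr_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    total = len(truth_words)
    accuracy = matched / total if total else 0.0
    return matched, total, accuracy


# Hypothetical example: two OCR errors in a six-word line.
truth = "The Federal Register is published daily"
ocr = "The Federa1 Register is publlshed daily"
matched, total, accuracy = word_level_score(truth, ocr)
print(f"{matched} of {total} ground-truth words matched ({accuracy:.1%})")
```

A score of this kind counts a word as correct only when it matches the ground truth exactly, so a single misrecognized character costs the whole word.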
Roughly 250 issues of the Federal Register, comprising nearly 69,000 pages,
were published in 1994. This volume of the database contains the pages of 20
books published in January of that year. The database includes scanned
images, SGML-tagged ground truth text, commercial OCR results, and image
quality assessment results. These data files are useful in a wide variety
of experiments and research. Future volumes may be released, depending
on the level of interest.
This volume of the database contains 4711 page images, scanned as binary
images at 15.75 pixels per millimeter (400 pixels per inch). The images are
stored in the NIST IHead format and are compressed using CCITT Group 4
compression. Documentation for this format and source code for reading and
writing IHead images are provided. Of these page images, 4519 have
corresponding ground truth.
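As a quick check on the figures above, the short Python sketch below converts the scan resolution from pixels per millimeter to pixels per inch and works out how many page images carry ground truth. The variable names are illustrative; the only value not taken from this page is the standard conversion factor of 25.4 millimeters per inch.

```python
MM_PER_INCH = 25.4  # exact by definition

# Scan resolution quoted for the page images.
pixels_per_mm = 15.75
pixels_per_inch = pixels_per_mm * MM_PER_INCH  # about 400 pixels per inch

# Page counts quoted for this volume.
total_pages = 4711
pages_with_truth = 4519
pages_without_truth = total_pages - pages_with_truth
coverage = pages_with_truth / total_pages

print(f"Resolution: {pixels_per_inch:.2f} pixels per inch")
print(f"Pages without ground truth: {pages_without_truth}")
print(f"Ground truth coverage: {coverage:.1%}")
```

The conversion gives 400.05 pixels per inch, which the page rounds to 400, and just under 96 percent of the page images have corresponding ground truth.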
This volume is distributed on two ISO-9660 CD-ROMs and occupies 1.27
gigabytes of storage.
Source code used to create this data is available in sd25_src.tar.Z
Examples from this database are available from the anonymous FTP site
sequoyah.nist.gov in the file sd25.tar
A PDF version of the Users' Guide is available.
For ordering information contact:
Standard Reference Data
National Institute of Standards and Technology
100 Bureau Dr., Stop 2300
Gaithersburg, MD 20899-2310
Voice: (301)975-2008
FAX: 301-926-0416
Technical contact:
Michael D. Garris
100 Bureau Drive, Stop 8940
Building 225, Room A216
Gaithersburg, MD 20899-8940
Email: mgarris@nist.gov
Voice: (301)975-2928
Keywords: document image database; OCR; optical character recognition technology