NIST Scientific and Technical Databases

NIST Special Database 11

NIST Census Miniform Training Database 1: Binary Images from Microfilm

Price: $90.00

NIST Special Database 11 is a set of 1990 Census Miniform images. A Miniform is a non-sensitive portion of the Industry and Occupation section of an actual Census Long Form with handwritten responses to three questions.

The database is available on CD-ROM and contains images of 13,500 miniforms (40,500 fields) and files containing ASCII transcriptions of the strings that were written in the miniform fields. This database is designed for the evaluation of optical character recognition (OCR) systems in a difficult but realistic form-based task on binary images from microfilm.

Each miniform image contains three fields with handwritten answers to the following questions (Long Form Questions 28b, 29a, and 29b, respectively):

  • Describe the activity performed at location where employed.
  • What kind of work was this person doing?
  • What were this person's most important activities or duties?
A possible set of responses would therefore be:
  • hospital
  • registered nurse
  • patient care
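
As an illustration of how the ASCII transcriptions could be used to score a recognizer, the following minimal sketch compares recognized field strings against reference strings and reports the fraction of fields that match. The function names, the normalization (lowercasing and collapsing whitespace), and the exact-match criterion are assumptions made for this example; they are not the scoring method used at the Conference, and the sketch does not read the file formats on the CD-ROM.

```python
def normalize(text):
    """Lowercase and collapse runs of whitespace (an assumed normalization)."""
    return " ".join(text.lower().split())


def field_accuracy(references, hypotheses):
    """Return the fraction of fields whose normalized strings match exactly."""
    if len(references) != len(hypotheses):
        raise ValueError("reference and hypothesis lists differ in length")
    if not references:
        return 0.0
    correct = sum(
        normalize(ref) == normalize(hyp)
        for ref, hyp in zip(references, hypotheses)
    )
    return correct / len(references)


# The three example responses above serve as the reference transcription.
reference = ["hospital", "registered nurse", "patient care"]

# Hypothetical recognizer output: the third field contains a character error.
recognized = ["Hospital", "registered  nurse", "patient cave"]

accuracy = field_accuracy(reference, recognized)
print(f"field accuracy: {accuracy:.3f}")
```

Exact matching at the field level is a strict criterion; a character-level measure would give partial credit for the third field.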

The forms were scanned from microfilm, yielding images of much lower quality than forms scanned from paper.

The images are 624 by 744 pixels, sampled at 78.74 pixels/cm (200 pixels/inch). They are packed five to a file and are CCITT Group 4 compressed.
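
The figures above can be cross-checked with a short calculation. The sketch below derives the physical size of one miniform image, the size of one image after decompression, and the number of image files. It assumes one bit per pixel with each row padded to a whole byte, and it assumes that every file holds exactly five images; neither assumption is stated in this description.

```python
WIDTH_PX = 624
HEIGHT_PX = 744
PIXELS_PER_INCH = 200
CM_PER_INCH = 2.54
IMAGES_PER_FILE = 5
TOTAL_MINIFORMS = 13_500


def physical_size_cm(pixels, pixels_per_inch=PIXELS_PER_INCH):
    """Convert a pixel count to centimeters at the given sampling rate."""
    return pixels / pixels_per_inch * CM_PER_INCH


def unpacked_bytes(width_px, height_px):
    """Bytes for one binary image, assuming 1 bit/pixel and byte-padded rows."""
    row_bytes = (width_px + 7) // 8
    return row_bytes * height_px


pixels_per_cm = PIXELS_PER_INCH / CM_PER_INCH    # about 78.74
width_cm = physical_size_cm(WIDTH_PX)            # about 7.92 cm
height_cm = physical_size_cm(HEIGHT_PX)          # about 9.45 cm
raw_bytes = unpacked_bytes(WIDTH_PX, HEIGHT_PX)  # 58,032 bytes per image
file_count = TOTAL_MINIFORMS // IMAGES_PER_FILE  # 2,700 files if all are full

print(f"{pixels_per_cm:.2f} pixels/cm")
print(f"{width_cm:.2f} cm x {height_cm:.2f} cm")
print(f"{raw_bytes} bytes per uncompressed image, {file_count} files")
```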

Source code for image manipulation, including programs to uncompress and unpack the images, is included on the CD-ROM. The code is written in the C programming language and was developed on Sun workstations running SunOS 4.1.1.*

Special Database 11 was the first of three databases produced in conjunction with The Second Census Optical Character Recognition Systems Conference, and was intended for system training. (The second, Special Database 12, contained paper and microfilm training data. The third, Special Database 13, contained the paper and microfilm data used for the actual system testing.)

NIST and the Bureau of the Census sponsored the Conference, in which the participants sought to determine the state of the art of the OCR industry on a challenging, realistic task. The results of the Conference were published in NIST Internal Report (IR) 5452. That report is available on the Internet in PostScript form via anonymous FTP from the server sequoyah.ncsl.nist.gov, maintained by NIST's Visual Image Processing Group. It is also available in hardcopy form on request.

Special Database 11 comes with a 29-page guide that presents an overview of the Conference and its results and documents the file formats and software.

*Specific hardware and software products identified were used in order to adequately support the development of the technology described in this document. In no case does such identification imply recommendation or endorsement by the National Institute of Standards and Technology, nor does it imply that the equipment identified is necessarily the best available for the purpose.


Special pricing is available for multiple copies. Call for details.

A PDF version of the Users' Guide is also available.

For more information please contact:

Standard Reference Data Program
National Institute of Standards and Technology
100 Bureau Dr., Stop 2300
Gaithersburg, MD 20899-2310


(301) 975-2008 (VOICE) / (301) 926-0416 (FAX)

The scientific contact for this database is:
Stanley Janet
National Institute of Standards and Technology
100 Bureau Drive, Stop 8940
Building 225, Room A216
Gaithersburg, MD 20899-8940
(301) 975-2916
e-mail: stan.janet@nist.gov

Keywords: Automated character recognition; automated data capture; character recognition; forms recognition; handwriting recognition; OCR; optical character recognition; software recognition.



Create Date: 6/02
Last Update: Thursday, 15-Mar-07 13:06:10