NIST Special Database 11
NIST Census Miniform Training Database 1: Binary Images from Microfilm
NIST Special Database 11 is a set of 1990 Census
Miniform images. A Miniform is a non-sensitive portion of the Industry
and Occupation section of an actual Census Long Form with handwritten
responses to three questions.
The database is
available on CD-ROM and contains images of 13,500 miniforms (40,500
fields) and files containing ASCII transcriptions of the strings
that were written in the miniform fields. This database is designed for
the evaluation of optical character recognition (OCR) systems in a difficult
but realistic form-based task on binary images from microfilm.
Each miniform image contains three fields with handwritten answers to the
following questions (Long Form Questions 28b, 29a, and 29b, respectively):
- Describe the activity performed at location where employed.
- What kind of work was this person doing?
- What were this person's most important activities or duties?
A possible set of responses
would therefore be:
- hospital
- registered nurse
- patient care
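To make the evaluation task concrete, the sketch below scores recognized field strings against reference transcriptions by exact match after simple normalization. It is only an illustration: the function names, the normalization rule, and the sample recognizer output are assumptions, and it does not reproduce the scoring procedure used at the Conference or the transcription file format on the CD-ROM.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse runs of whitespace (an assumed normalization)."""
    return " ".join(text.lower().split())


def field_accuracy(references, hypotheses):
    """Return the fraction of fields whose normalized strings match exactly."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    if not references:
        return 0.0
    correct = sum(
        normalize(ref) == normalize(hyp)
        for ref, hyp in zip(references, hypotheses)
    )
    return correct / len(references)


# Reference transcriptions for one miniform (the example responses above).
reference = ["hospital", "registered nurse", "patient care"]
# Hypothetical recognizer output: one case difference, one misread character.
hypothesis = ["hospital", "Registered  Nurse", "patient cave"]

accuracy = field_accuracy(reference, hypothesis)
print(f"field accuracy: {accuracy:.3f}")
```

Exact matching is the strictest choice; a character-level edit distance would give partial credit for near misses such as the third field.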
The forms were scanned from microfilm, yielding images of much lower
quality than forms scanned from paper.
The images are 624 by 744 pixels, sampled at 78.74 pixels/cm
(200 pixels/inch). They are packed five to a file and are CCITT Group 4
compressed.
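The stated dimensions, resolution, and packing imply a few derived quantities, worked through in the short sketch below. The variable names are illustrative only, and the uncompressed size assumes one bit per pixel with no row padding, which is an assumption based on the images being binary rather than a documented property of the files.

```python
# Figures stated in the database description.
WIDTH_PX = 624
HEIGHT_PX = 744
PIXELS_PER_INCH = 200
MINIFORMS = 13_500
FIELDS_PER_MINIFORM = 3
IMAGES_PER_FILE = 5
CM_PER_INCH = 2.54

# Physical size of the scanned region.
width_in = WIDTH_PX / PIXELS_PER_INCH    # 3.12 inches
height_in = HEIGHT_PX / PIXELS_PER_INCH  # 3.72 inches

# Resolution in metric units; rounds to the stated 78.74 pixels/cm.
pixels_per_cm = PIXELS_PER_INCH / CM_PER_INCH

# Uncompressed size, ASSUMING 1 bit per pixel and no row padding
# (624 pixels is exactly 78 bytes per row).
raw_bytes_per_image = (WIDTH_PX // 8) * HEIGHT_PX

# Totals implied by the packing described above.
total_fields = MINIFORMS * FIELDS_PER_MINIFORM
image_files = MINIFORMS // IMAGES_PER_FILE

print(f"{width_in:.2f} in x {height_in:.2f} in at {pixels_per_cm:.2f} px/cm")
print(f"{raw_bytes_per_image} bytes per uncompressed image")
print(f"{total_fields} fields in {image_files} image files")
```

The field total agrees with the 40,500 fields quoted in the description.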
Source code for image manipulation, including programs to uncompress and
unpack the images, is included on the CD-ROM. The code is written in the C
programming language and was developed on Sun workstations running
SunOS 4.1.1.*
Special Database 11 was the first of three databases produced in conjunction
with the Second Census Optical Character Recognition Systems Conference and
was intended for system training. (The second, Special Database 12, contained
paper and microfilm training data. The third, Special Database 13, contained
the paper and microfilm data used for the actual system testing.)
NIST and the Bureau of the Census sponsored the Conference, in which
participants sought to determine the state of the art of the OCR industry on
a challenging, realistic task. The results of the Conference were published
in NIST Internal Report (IR) 5452. That report is available on the Internet
in PostScript form via anonymous FTP from the server sequoyah.ncsl.nist.gov,
maintained by NIST's Visual Image Processing Group. It is also available
in hardcopy form on request.
Special Database
11 comes with a 29-page guide that presents an overview of the Conference
and its results and documents the file formats and software.
*Specific hardware
and software products identified were used in order to adequately support
the development of the technology described in this document. In no case
does such identification imply recommendation or endorsement by the National
Institute of Standards and Technology, nor does it imply that the equipment
identified is necessarily the best available for the purpose.
Price: $90.00
Special pricing for
multiple copies available. Call for details.
A PDF version of the Users' Guide is also available.
For more information please contact:
- Standard Reference Data Program
National Institute of Standards and Technology
100 Bureau Dr., Stop 2300
Gaithersburg, MD 20899-2310
(301) 975-2008 (VOICE) / (301) 926-0416 (FAX)
The scientific contact for this database is:
- Stanley Janet
National Institute of Standards and Technology
100 Bureau Drive, Stop 8940
Building 225, Room A216
Gaithersburg, MD 20899-8940
(301) 975-2916
e-mail: stan.janet@nist.gov
Keywords:
Automated character recognition; automated data capture; character recognition;
forms recognition; handwriting recognition; OCR; optical character recognition;
software recognition.