The Nineteenth Century in Print: Periodicals

About the Text Generated by Optical Character Recognition without Correction

The periodicals in this collection were scanned at 600 dots per inch and captured as bitonal (black and white, not greyscale) TIFF images. The text for the twenty-two periodicals converted by Cornell University Library was generated by a fully automated process of optical character recognition, with no human intervention beyond initial calibration. The OCR process was implemented by Cornell University Library staff. A similar process was used by the University of Michigan Digital Library Production Service to prepare searchable text for Garden and Forest for the Library of Congress.

Why are there strange characters and odd spacing in the text?
The text was generated automatically by optical character recognition (OCR), which works best on uniform, clear print. Expect problems with small print, special fonts, and decorated text. Strange characters, particularly ~, appear where the OCR process determined that a character was present but could not identify it. Strange spacing, such as a run of blank lines, may occur where pages have illustrations. Decorative blocks, such as the title block on the front page of each issue of Garden and Forest, also cause problems for OCR. In some cases, the columns on a page were not recognized as separate sequences of text, so text from adjacent columns may run together.
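
For readers who work with the uncorrected text directly, the two symptoms described above (the ~ placeholder and runs of blank lines) can be measured with a short script. This is only a rough sketch: the function name and the sample string are invented for illustration, and a tilde that genuinely appears in the printed original would be counted as well.

```python
def ocr_noise_report(text, placeholder="~"):
    """Summarize two OCR symptoms: placeholder characters and blank-line runs.

    `placeholder` is assumed to be the character the OCR process emitted
    when it detected a character but could not identify it.
    """
    placeholder_count = text.count(placeholder)
    # Count non-whitespace characters so the share is not diluted by spacing.
    non_space = sum(1 for ch in text if not ch.isspace())

    longest_blank_run = 0
    current_run = 0
    for line in text.splitlines():
        if line.strip():
            current_run = 0          # a line with text ends the blank run
        else:
            current_run += 1
            longest_blank_run = max(longest_blank_run, current_run)

    return {
        "placeholders": placeholder_count,
        "placeholder_share": placeholder_count / non_space if non_space else 0.0,
        "longest_blank_run": longest_blank_run,
    }


# Hypothetical fragment of uncorrected text, not taken from the collection.
sample = "The Ches~peake Bay\n\n\n\noyster tr~de"
report = ocr_noise_report(sample)
print(report)
```

A high placeholder share or a long blank run suggests a page where the image is likely to be a better guide than the text.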

Why was the text not re-keyed?
The advantage of OCR over re-keying is cost. Fully automated OCR costs around 15 cents per page, whereas the cost of re-keying is determined by the number of characters. These pages are dense with words, and re-keying to the 99.95% accuracy rate required for most Library of Congress projects to date would certainly cost a dollar or two per page. Human correction of OCR would probably cost at least 4 or 5 times as much as the OCR itself. The per-page difference in cost becomes significant for a collection of around 750,000 pages.
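
To see how the per-page figures above scale, the arithmetic can be worked through for the whole collection. The sketch below uses only the rough estimates quoted in the paragraph (15 cents, "a dollar or two," "4 or 5 times," about 750,000 pages); the variable and function names are illustrative, and the results are order-of-magnitude estimates, not actual project costs.

```python
PAGES = 750_000                  # approximate size of the collection
OCR_CENTS = 15                   # automated OCR, cents per page
REKEY_CENTS = (100, 200)         # re-keying, "a dollar or two" per page
CORRECTION_MULTIPLIER = (4, 5)   # human correction, relative to the OCR cost


def total_dollars(cents_per_page, pages=PAGES):
    """Whole-collection cost in dollars for a per-page cost given in cents."""
    return cents_per_page * pages // 100


ocr_total = total_dollars(OCR_CENTS)
rekey_totals = tuple(total_dollars(c) for c in REKEY_CENTS)
# Cost of the correction step alone (it would come on top of the OCR cost).
correction_totals = tuple(
    total_dollars(OCR_CENTS * m) for m in CORRECTION_MULTIPLIER
)

print(f"Automated OCR:     ${ocr_total:,}")
print(f"Re-keying:         ${rekey_totals[0]:,} to ${rekey_totals[1]:,}")
print(f"Correction of OCR: ${correction_totals[0]:,} to ${correction_totals[1]:,}")
```

Under these assumptions, automated OCR for the collection comes to about $112,500, against $750,000 to $1,500,000 for re-keying and a further $450,000 to $562,500 or more for human correction of the OCR.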

Why not simply use page images?
Most users will want to view the page images for reading. The converted text is primarily to support searching. Try searching the full text of a periodical for a place, say Chesapeake Bay. [Choose "match this exact phrase" as an option, and "match words exactly" rather than "include word variants."] Now imagine looking for the same references on microfilm, or even by browsing the original paper issues. Would you find these items? When you search the full text, links from result lists take you to the uncorrected text. Often there is a Best Match button at the top (beside the American Memory logo), which takes you to the section of the retrieved text with the best or most matches to the words you entered. View page links take you to the image of the selected page, from which you can browse backwards and forwards through the volume.
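
The idea of an exact-phrase search with a "best match" can be illustrated with a small sketch. It is not the search engine used by this site: the function names, the page dictionary, and the sample text are all hypothetical, and "best match" is assumed here to mean simply the page with the most occurrences of the phrase.

```python
import re


def search_phrase(pages, phrase):
    """Count exact, whole-word occurrences of `phrase` on each page.

    `pages` maps a page number to its uncorrected text. Matching ignores
    case and allows the phrase to wrap across a line break.
    """
    words = [re.escape(word) for word in phrase.split()]
    pattern = re.compile(r"\b" + r"\s+".join(words) + r"\b", re.IGNORECASE)
    hits = {}
    for number, text in pages.items():
        count = len(pattern.findall(text))
        if count:
            hits[number] = count
    return hits


def best_match(hits):
    """Return the page number with the most matches, or None if no hits."""
    return max(hits, key=hits.get) if hits else None


# Hypothetical pages of uncorrected text.
pages = {
    1: "Oysters of the Chesapeake Bay. The Chesapeake\nBay trade grew.",
    2: "Shipping on the Ches~peake Bay.",   # OCR placeholder hides the match
    3: "The bays of the Chesapeake region.",
}
hits = search_phrase(pages, "Chesapeake Bay")
print(hits, best_match(hits))
```

Page 2 in the sample is missed because an unrecognized character interrupts the word, which is one reason a search of uncorrected text will not find every occurrence.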

