Sustainability of Digital Formats
 Planning for Library of Congress Collections

Text >> Curator's View

Introduction
To some, it may seem incongruous to see the term "curator's view" in relation to the digital equivalent of the print volumes on most library shelves, but it serves to emphasize that the Library of Congress needs to consider all stages in the life cycle for digital content as it plans to take custody of the output of current publication processes. [For a view of the processes involved in caring for digital content, see Digital Life Cycle Management.] In the Library of Congress, and probably in most libraries, the process of selection or recommendation for print books and journals is independent of the processes for sustaining and making available printed volumes. Because processes for shelving, binding, and circulation are well-understood, routine, and part of the infrastructure, the requirements associated with a particular book or group of journals or newspapers do not usually have to be considered explicitly at the point of selection. Nor does the "functionality" of a printed encyclopedia or directory warrant separate consideration from that of a printed novel. For digital content, the processes necessary for responsible custodianship and support of user expectations are still being shaped. Hence, it is important that the implications for the future of receiving content in a particular digital format are taken into account at the time of selection: what will be necessary to make this content available to users in a convenient and useful way, and what will be necessary to ensure that this content can be sustained so that future generations of users can have access despite the rapidly changing technological environment. In other words, the institution must take a curatorial view of decisions about adding content to its collections, a view that is familiar to the divisions that maintain custody of rare or non-print materials.

The practicalities of publishing mean that many textual works are produced in workflows that do not necessarily permit variation in outcome. Publishers may often be unable to supply content in one of the preferred formats for a given category of textual content. The practicalities of work carried out by individuals and small organizations may also limit the range of formats for such things as writings within personal papers ("manuscript") collections or transcripts of oral history projects, and the formats in which that content is available may not be sustainable. For example, binary word-processing or desktop-publishing files are unlikely ever to be considered preferred or acceptable formats. For the content of such files to be sustainable, the Library may be best served by requesting that content be transformed into a preferred format before transmission or that the Library be given permission to take steps necessary to sustain the content, including transforming to more sustainable formats. Since text files are usually small (in comparison with audio, video, and even still images), it may often be appropriate to plan to retain the files both in their source format and in a format or formats for which the probability of sustainability in the face of technological obsolescence is higher.

This discussion focuses on digital formats appropriate for individual works consisting primarily of text that originated in digital form, including digital text intended as the basis for a printed document. Not discussed here are the important additional metadata requirements for serial publications or supplementary non-text materials. The format sustainability analysis also focuses on digital formats themselves and not on resource needs associated with receiving digital content. The website will be extended to indicate tools that can be used to validate incoming content or transcode it to a more sustainable format.

For certain textual content, the Library may be offered digital files that contain bitmapped page images, where a sequence of digital page images reproduces a sequence of original paper pages. Examples that have occurred include output from another institution's reformatting program (scanned from paper) and oral history transcripts (scanned from the printed output of a word processor). Offers of content in bitmapped form demand careful scrutiny. Bitmapped page images cannot support normal rendering for textual content, which includes searchability of the text and excerption for quotation. If the offered work is a bitmapped output from a word-processing or other machine-readable text file, the Library should make every effort to acquire the source format instead. If the work is reformatted from a paper original, the Library prefers formats that allow individual page images to be managed, including such actions as applying OCR and transcoding in support of format migration. PDF files created from bitmapped page images are not acceptable unless it is technically and legally feasible for LC to extract the individual images, and any OCR text associated with those pages. The images should be at a quality that will yield reasonable text through OCR. In all cases, the acceptability of bitmapped page images depends on the availability of structural metadata or markup that reflects the sequencing and numbering of pages and relates page images to any corresponding text (OCR or transcribed).
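The structural metadata described above can be pictured as one record per page that ties the page's position and printed number to its image and any corresponding text. The following is a minimal sketch only; the record layout, field names, and file names are assumptions made for illustration and do not represent a Library of Congress schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PageRecord:
    """One scanned page; all field names are illustrative assumptions."""
    sequence: int              # 1-based position in the scanned sequence
    page_label: str            # number as printed on the page, e.g. "iv" or "12"
    image_file: str            # bitmapped page image
    text_file: Optional[str]   # OCR or transcribed text, if any


def check_pages(pages: List[PageRecord]) -> List[str]:
    """Return a list of problems; an empty list means none were found."""
    problems = []
    ordered = sorted(pages, key=lambda p: p.sequence)
    # The sequence should run 1, 2, 3, ... with no gaps or duplicates.
    for expected, page in enumerate(ordered, start=1):
        if page.sequence != expected:
            problems.append(f"sequence gap or duplicate near position {expected}")
            break
    # Every page image should be related to corresponding text.
    for page in ordered:
        if page.text_file is None:
            problems.append(f"page {page.page_label!r} has no associated text")
    return problems
```

A check of this kind addresses only sequencing and the image-to-text relationship; it says nothing about image quality or whether the OCR output is reasonable.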

Aspects to consider in digital formats in use today
The various candidate formats currently used for representing textual materials tend to emphasize different characteristics and support different functionalities. When selecting the appropriate format, the relative importance of different aspects should be considered.

How important is it that the format represent the explicit logical document structure to permit a variety of rendered views or to support navigation that exploits the logical structure?
In general, documents represented through platform-independent structural markup (e.g., in XML or SGML using a well-known DTD or Schema) will be more easily re-purposed and migrated to new technical environments than documents represented in a form intended for a particular output device. For reference works, such as directories, dictionaries, or encyclopedias, the logical structure is paramount; for most other textual genres (articles, text-books, etc.) structured markup would be desirable if available. However, relatively few documents are currently created, stored, or disseminated by authors or publishers with structural markup and there is no reliable, general way to convert documents from a representation that specifies layout to one that makes the logical structure explicit. If text is available in SGML, that probably indicates structural markup; if available in XML, the DTD or Schema in use should reveal whether the tagging indicates the document structure or page layout. HTML markup for an individual page usually represents layout rather than document structure, although a cluster of linked HTML pages may convey logical structure implicitly. PDF files can be created to represent the logical structure of a document but, in practice, seldom are.
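As a rough illustration of how tagging can reveal whether an XML file records document structure or page layout, the sketch below counts element names against two small word lists. Both lists are hypothetical and chosen only for this example; in practice the documentation of the DTD or Schema in use is the authoritative source, and a word list is at best a first-pass heuristic.

```python
import xml.etree.ElementTree as ET

# Hypothetical vocabularies, for illustration only; a real review would
# consult the DTD or Schema documentation rather than a fixed word list.
STRUCTURAL = {"book", "chapter", "section", "title", "para", "entry", "headword"}
LAYOUT = {"page", "column", "line", "font", "bold", "italic", "indent"}


def tag_profile(xml_text: str) -> dict:
    """Count element names that suggest structure versus page layout."""
    counts = {"structural": 0, "layout": 0, "other": 0}
    for element in ET.fromstring(xml_text).iter():
        # Drop any "{namespace}" qualifier and compare case-insensitively.
        name = element.tag.split("}")[-1].lower()
        if name in STRUCTURAL:
            counts["structural"] += 1
        elif name in LAYOUT:
            counts["layout"] += 1
        else:
            counts["other"] += 1
    return counts
```

A file dominated by names from the first list is more likely to carry structural markup; one dominated by the second probably describes pages for a particular output device.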

How important is it to retain the visual integrity of how the document originally appeared, including layout, font, etc.?
This aspect may be deemed predominantly significant for works where the visual layout conveys non-linear relationships among textual elements or a document is explicitly intended for one-time publication on paper (e.g. for a poster or advertisement). However, for many documents intended for multiple renderings (on paper, on screen, etc.), the exact layout, font, margins and page-breaks, etc., or whether there are footnotes or endnotes, are not important to either author or reader. Many rendered variants would be considered equivalent in all significant ways. For this reason, visual integrity may be less important than document structure for many scholarly articles and technical reports. If visual integrity is the most significant aspect of a work, PDF (which essentially prints the document to pages) may be an appropriate format, preferably the emergent PDF/A or the ISO PDF/X standard for pre-press interchange. Visual integrity for a document marked up in XML or SGML is achieved by having an appropriate stylesheet to accompany the documents. Visual integrity for a document in HTML requires that any stylesheet referred to in the HTML file is available.
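The requirement that stylesheets referred to in an HTML file be available can be checked mechanically when content is received. The sketch below, which uses only the Python standard library, lists the stylesheets named in `link` elements and reports any that are absent from a set of delivered file names. The function name and the idea of a delivered-file set are assumptions for illustration; stylesheets pulled in by other means (for example, CSS `@import` rules) are not detected.

```python
from html.parser import HTMLParser
from typing import List, Set


class StylesheetLinks(HTMLParser):
    """Collect href values of <link rel="stylesheet"> elements."""

    def __init__(self):
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        # rel may hold several space-separated, case-insensitive tokens.
        rel = (attributes.get("rel") or "").lower().split()
        if "stylesheet" in rel and attributes.get("href"):
            self.hrefs.append(attributes["href"])


def missing_stylesheets(html_text: str, delivered_files: Set[str]) -> List[str]:
    """Return stylesheet hrefs that are not among the delivered files."""
    parser = StylesheetLinks()
    parser.feed(html_text)
    return [href for href in parser.hrefs if href not in delivered_files]
```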

Does the document's meaning depend upon the integrity of the information content in non-textual visual elements, such as mathematical equations, diagrams, and illustrations?
For scientific communication, the rendering of equations and diagrams is often more important to authors and readers than any other visual aspect of a work. Although the use of XML as a markup language is growing, it is not currently possible to represent equations using XML-based markup and render them to paper or screen in ways that are acceptable to professional scientists. Complex tables can also be problematic, particularly if they are too large to fit on a single screen or page. Diagrams may need higher resolution than today's computer displays to be fully legible; this argues for a format that can support scalable images that can be viewed online or printed at higher resolution.

The tables presented below suggest how format preferences might be determined for a body of text content. Based on the relative importance of the aspects discussed above, four primary subcategories of content have been tentatively identified as useful when format choices are being made. These are listed, with examples, in the first table. The second table illustrates how analysis into subcategories would be combined with technical information about formats to produce a set of format-preference statements for the various content subcategories. The second table includes a few special subcategories: talking books, e-mail, and book-like works where the focus is the illustrations.

Table 1: Significant characteristics and textual content subcategories

T1. Textual works where document structure and navigation features are of primary significance.
Document structure: Very important
Layout: Less important
Rendering of mathematics, etc.: Not important
Examples:
• dictionaries, encyclopedias, directories
• most books, technical reports
• dramatic works, play scripts, film scripts
• all text works in subcategories for which a structural markup schema for industry exchange exists, e.g., newsfeeds
• oral history transcripts

T2. Short textual works with simple document structure.
Note: Articles within serials present substantial additional requirements for accompanying metadata to permit linking of articles to parent issues and volumes, and search or navigation through articles within a particular serial title.
Document structure: Important
Layout: Less important
Rendering of mathematics, etc.: Not important
Examples:
• articles
• essays
• transcriptions of speeches

T3. Works with textual content in which layout and design are of primary significance.
Document structure: Less important
Layout: Very important
Rendering of mathematics, etc.: Not important
Examples:
• brochures
• posters
• advertisements
• children's books
• snapshots of web pages as examples of promotional materials

T4. Works in which information content in non-text visual elements (e.g. equations, diagrams) is of primary significance. This characteristic (when known to be significant for a particular body of content) may override others.
Document structure: Important
Layout: Important
Rendering of mathematics, etc.: Very important
Examples:
• articles and technical reports in scientific disciplines where equations are used, e.g., physics, mathematics, chemistry
• technical reports with engineering diagrams

Table 2: Format preferences for textual content subcategories

T1. Works where document structure and navigation features are of primary significance. Includes works prepared in digital form for ultimate rendering on paper and works intended for online access. (See also type T5 below.)
Preferred encoding type: Structural markup (with supporting image files for illustrations, etc.)
Preferred file type, subtype:
• XML using standard DTD or Schema (e.g. Open eBook)
• XML or SGML using DTD or Schema agreed by LC as acceptable [2]
Acceptable encoding type: Page-layout/rendering
Acceptable file type, subtype:
• HTML (hierarchy or network of linked pages) or
• PDF [1] (if linear structure is appropriate)

T2. Short works with simple document structure. Includes works prepared in digital form for ultimate rendering on paper and works intended for online access. (See also type T5 below.)
Preferred encoding type: Structural markup or Page-layout/rendering
Preferred file type, subtype:
• XML using standard DTD or Schema (e.g. Open eBook)
• XML or SGML using DTD or Schema agreed by LC as acceptable [2]
• PDF [1]
Acceptable encoding type: Page-layout/rendering
Acceptable file type, subtype:
• HTML

T3. Works in which layout and design are of primary significance. Includes works prepared in digital form for ultimate rendering on paper and works intended for online access.
Preferred encoding type: Page-layout/rendering with underlying text accessible to search engines
Preferred file type, subtype:
• HTML (hierarchy or network of linked pages)
• PDF [1] (if linear structure is appropriate)
Acceptable encoding type: Structural markup (with supporting image files for illustrations, etc.)
Acceptable file type, subtype:
• XML using standard DTD or Schema (e.g. Open eBook [2] or Digital Talking Book [3]) and supplied with XSL stylesheet for rendering for online access
• XML using DTD or Schema agreed by LC as acceptable [2]; XSL stylesheet for rendering for online access required

T4. Works in which information content in non-text visual elements (e.g. equations, diagrams) is of primary significance.
Preferred encoding type: Page-layout with underlying text accessible to search engines
Preferred file type, subtype:
• PDF [1]
Acceptable encoding type: Structured markup with subsidiary graphical elements
Acceptable file type, subtype:
• XML using DTD or Schema and mechanisms for rendering equations, diagrams, etc. agreed by LC as acceptable [2]

T5. Talking books for the blind. Similar to T1 but with added features [3].
Preferred encoding type: Structural markup (with optional SMIL support for sound and image files)
Preferred file type, subtype:
• XML using ANSI/NISO Z39.86
Acceptable formats: none listed

T6. "Source" email, i.e., Internet email as transmitted rather than as displayed in any particular mail client.
Preferred encoding type: ASCII text
Preferred file type, subtype:
• RFC 2822
Acceptable formats: none listed

T7. Textual works where illustrations are as significant as or more significant than the text. Examples: art books.
Use categories above in conjunction with table for Image Content Categories and Format Preferences.

Notes:

1. With the aim of simplicity, this table does not distinguish between versions of PDF, PDF/X (adopted as an ISO standard for pre-press interchange), or PDF/A (proposed as a standard set of restrictions on PDF files to make them suitable for archiving). The Library of Congress is participating in the development of the PDF/A standard and expects to indicate that, in general, the PDF/A format will be preferred over other PDF variants.

2. Standard XML DTDs include the Open eBook (see http://www.openebook.org/specs.htm) and the Digital Talking Book (see http://www.loc.gov/nls/z3986/). An example of an XML DTD that might be agreed as acceptable while not being a formal standard is one that follows the Archiving and Interchange DTD model described at http://dtd.nlm.nih.gov/

3. A NISO Digital Talking Book (DTB) is envisioned to be, in its fullest implementation, a group of digitally-encoded files containing an audio portion recorded in human speech; the full text of the work in electronic form, marked with the tags of a descriptive markup language; and a linking file that synchronizes the text and audio portions. Although the standard was developed for the explicit purpose of supporting the blind, visually impaired, and physically handicapped, the resulting format ranks highly on sustainability factors and for any text where integrity of document structure and navigation is paramount. The specification is available at http://www.loc.gov/nls/z3986/
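To show how format-preference statements of the kind in Table 2 might be applied when content is offered, the sketch below encodes rows T1 through T4 as a lookup. It is an illustration under simplifying assumptions: the short format labels are chosen for this example, and the table's qualifications (the PDF variants in note 1, DTDs "agreed by LC" in note 2, required XSL stylesheets, linear structure) are deliberately left out and would still need human review.

```python
# Rows T1-T4 of Table 2, reduced to short labels chosen for this sketch.
# The table's qualifications and notes are intentionally omitted.
PREFERENCES = {
    "T1": {"preferred": {"XML", "SGML"}, "acceptable": {"HTML", "PDF"}},
    "T2": {"preferred": {"XML", "SGML", "PDF"}, "acceptable": {"HTML"}},
    "T3": {"preferred": {"HTML", "PDF"}, "acceptable": {"XML"}},
    "T4": {"preferred": {"PDF"}, "acceptable": {"XML"}},
}


def preference(subcategory: str, file_type: str) -> str:
    """Return 'preferred', 'acceptable', or 'not listed' for an offered format."""
    row = PREFERENCES[subcategory]
    label = file_type.upper()
    if label in row["preferred"]:
        return "preferred"
    if label in row["acceptable"]:
        return "acceptable"
    return "not listed"
```

For example, `preference("T2", "pdf")` returns `"preferred"`, while `preference("T1", "pdf")` returns `"acceptable"`, reflecting the different weight given to document structure in the two subcategories.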

Last Updated: 03/07/2007