Sustainability of Digital Formats
 Planning for Library of Congress Collections


Formats, Evaluation Factors, and Relationships


What is a format?

Working definition
This Web site defines formats as packages of information that can be stored as data files or sent via network as data streams (aka bitstreams, byte streams). For reference, the working definition from the proposed Global Registry of Digital Formats is "A format is a fixed, byte-serialized encoding of an information model."


This site's use of the term format is very broad and embraces the following overlapping sets of entities:
File formats
• at the level indicated by file extensions, e.g., .mp3
• as indicated by Internet MediaType (aka MIME type), e.g., text/html
• versions that develop over time, e.g., the Aldus Corporation TIFF format version 5.0 was supplanted by Adobe's version 6.0
• refinements tailored to narrow, specific purposes, e.g., TIFF/EP for electronic photography and TIFF/IT for publishers' prepress requirements, both established as ISO standards
• instances with optional features significant to sustainability, e.g., the programs downloaded from Audible, Inc., (http://www.audible.com/), are copyright-secure audio files in formats that prevent "a customer from passing along duplicate digital audio files to another listener" (http://audible.custhelp.com/cgi-bin/audible.cfg/php/enduser/std_alp.php).
Bitstream encodings that underlie certain file formats, e.g., the linear pulse code modulated (LPCM) waveforms that may be found in WAVE or AIFF files, or H.264 video that may be found in QuickTime or MPEG-4 files. Encodings like LPCM and H.264 are specific to content category (in these cases, audio and video), while others are generic, e.g., UTF-8 and IEEE 754-1985.
Wrappers and bundling formats that include examples ranging from TIFF to ZIP to MXF to METS. This broad, multifaceted category is described in the sidebar on this page.
Classes of related formats whose familial characteristics are important, e.g., the WAVE audio format is an instance of the RIFF format class.
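
As a rough illustration of the first two levels in the list above (file extension and Internet media type), the following Python sketch looks up the media type suggested by a file name, using the standard library's mimetypes module. The file names and the helper name media_type_for are made up for this example. The lookup says nothing about version or refinement: any file named with a .tif extension is reported as image/tiff, whatever kind of TIFF it actually is.

```python
import mimetypes

# A private lookup table built from Python's built-in defaults, so results
# do not depend on the local operating system's configuration files.
_TABLE = mimetypes.MimeTypes()


def media_type_for(filename):
    """Return the Internet media type suggested by the extension, or None."""
    media_type, _encoding = _TABLE.guess_type(filename)
    return media_type


# Hypothetical file names, used only for illustration.
for name in ["lecture.mp3", "index.html", "scan.tif", "report.pdf"]:
    print(name, "->", media_type_for(name))
```

A lookup of this kind is a starting point only; identifying versions, refinements, or underlying encodings requires inspecting the content of the file.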

Relationships Between Formats

Format name alone insufficient for identification
The preceding sections of this document supply some of the reasons why format preferences will often have to specify more than just the name of a file format.  Preference statements may also benefit from specific recommendations relating to quality, or may call for descriptive or technical metadata.  In the case of digital photography, for example, the Library may prefer uncompressed or losslessly compressed images to ones whose compression has reduced their clarity.  Files of textual material that do not support searching of the text would be discouraged.  Content that is well bundled, or that contains embedded or accompanying metadata that follows standards or published guidelines, will be less costly to prepare for sustaining in a digital repository and to integrate into systems that provide user access.

Format relationships
Thus format descriptions developed to serve digital librarians must document relationships among (and within) formats and how they are used in practice, as well as acknowledging the fact of versioning.  This Web site includes a number of Format Descriptions for the Library of Congress that begin to offer information of this kind.  A team will be needed to remain abreast of new developments and update the format descriptions over time.  Meanwhile, it is at this level of specific format description that this Library of Congress activity will have a high level of synergy with the proposed Global Registry of Digital Formats.

What is meant by format relationships?  Here are two examples: the relatively simple example of the WAVE format for sound, and the more complex example of the Portable Document Format (PDF), used for texts and more.

WAVE
• Wrapper for different bitstreams
• Simple, but extensible method for embedding metadata
• Relationship examples
  • is subtype of RIFF
  • may contain Linear PCM, µ-law, A-law (bitstreams)
  • has subtype Broadcast WAVE (Linear PCM + EBU metadata)
  • has subtype AES46-2002 (BWF + cart metadata)
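
The "is subtype of" and "may contain" relationships for WAVE can be observed in the bytes of a file. The Python sketch below is an illustration only, not a validator. It assumes the usual RIFF layout (a 12-byte header followed by word-aligned chunks) and the WAVE format codes 1, 6, and 7 for Linear PCM, A-law, and µ-law. The function names are invented for this example, and the sample file is generated in memory with the standard library's wave module.

```python
import io
import struct
import wave

# WAVE format codes for the three bitstream encodings listed above.
ENCODING_NAMES = {0x0001: "Linear PCM", 0x0006: "A-law", 0x0007: "µ-law"}


def describe_wave(data):
    """Report the format class, wrapper, and bitstream encoding of WAVE bytes."""
    if len(data) < 12:
        raise ValueError("too short to be a RIFF file")
    riff_id, _riff_size, form_type = struct.unpack("<4sI4s", data[:12])
    if riff_id != b"RIFF" or form_type != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    offset = 12
    # Walk the chunks until the 'fmt ' chunk, which names the encoding.
    while offset + 8 <= len(data):
        chunk_id, chunk_size = struct.unpack("<4sI", data[offset:offset + 8])
        if chunk_id == b"fmt ":
            (code,) = struct.unpack("<H", data[offset + 8:offset + 10])
            encoding = ENCODING_NAMES.get(code, "other (code %d)" % code)
            return {"format_class": "RIFF", "wrapper": "WAVE", "encoding": encoding}
        # Chunks are padded to an even number of bytes.
        offset += 8 + chunk_size + (chunk_size % 2)
    raise ValueError("no 'fmt ' chunk found")


def make_sample_wave():
    """Build a tiny 16-bit mono Linear PCM WAVE file in memory."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as writer:
        writer.setnchannels(1)
        writer.setsampwidth(2)
        writer.setframerate(8000)
        writer.writeframes(b"\x00\x00" * 8)
    return buffer.getvalue()


print(describe_wave(make_sample_wave()))
```

The same file name and extension could wrap any of the three encodings, which is one reason the relationships need to be documented alongside the format name.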

PDF
• A file format, a wrapper, a bundling format, all in one
• Relationship examples
  • has version 1.3  (July 2000, 696 pages)
  • has version 1.4  (December 2001, 978 pages)
  • has version 1.5  (August 2003, 1172 pages)
  • has version 1.6  (November 2004, 1213 pages)
  • has version 1.7  (October 2006, 1310 pages)
  • may contain TIFF, JPEG, JPEG2000, etc., etc., etc. (all at once)
  • has subtype Tagged PDF (can represent logical document structure)
  • has subtype Accessible PDF (tagged + further constraints)
  • has subtype PDF/X (ISO standard, for pre-press use, e.g., submission of graphics to magazine publishers)
  • has subtype PDF/A (Under development as ISO standard 19005, for archiving)

The compilers of this Web site believe that there is value in classifying format subtypes.  As the PDF examples indicate, subtypes may express different kinds of features, while in some instances, a file may be a member of more than one subtype class.
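
One possible way to record relationships of this kind is as simple (subject, relation, object) statements. The Python sketch below is an assumption about how such a record might look, not a description of how this Web site or any registry stores its data. The statements are taken from the WAVE and PDF lists above; the sample file description and the function names are hypothetical.

```python
# (subject, relation, object) statements taken from the WAVE and PDF lists.
RELATIONSHIPS = [
    ("WAVE", "is subtype of", "RIFF"),
    ("WAVE", "may contain", "Linear PCM"),
    ("WAVE", "may contain", "µ-law"),
    ("WAVE", "may contain", "A-law"),
    ("WAVE", "has subtype", "Broadcast WAVE"),
    ("WAVE", "has subtype", "AES46-2002"),
    ("PDF", "has version", "1.3"),
    ("PDF", "has version", "1.4"),
    ("PDF", "has version", "1.5"),
    ("PDF", "has version", "1.6"),
    ("PDF", "has version", "1.7"),
    ("PDF", "may contain", "TIFF"),
    ("PDF", "may contain", "JPEG"),
    ("PDF", "may contain", "JPEG2000"),
    ("PDF", "has subtype", "Tagged PDF"),
    ("PDF", "has subtype", "Accessible PDF"),
    ("PDF", "has subtype", "PDF/X"),
    ("PDF", "has subtype", "PDF/A"),
]


def related(subject, relation):
    """List every object linked to the subject by the given relation."""
    return [o for s, r, o in RELATIONSHIPS if s == subject and r == relation]


def unknown_subtypes(file_description):
    """Return claimed subtypes that are not recorded for the file's format."""
    known = set(related(file_description["format"], "has subtype"))
    return sorted(set(file_description["subtypes"]) - known)


# Hypothetical file: a single file that belongs to two subtype classes.
sample_file = {"name": "report.pdf", "format": "PDF", "version": "1.4",
               "subtypes": ["Tagged PDF", "PDF/A"]}

print(related("WAVE", "is subtype of"))
print(unknown_subtypes(sample_file))
```

Because subtypes are stored as a list per file rather than a single value, membership in more than one subtype class is represented directly.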


Factors to Consider when Evaluating a Digital Format

Sustainability factors
Sustainability factors apply across digital formats for all categories of information. The seven factors listed below influence the feasibility and cost of preserving content in the face of future changes to the technological environment in which users and archiving institutions operate. These factors are significant whatever strategy is adopted as the basis for future preservation actions: migration to new formats, emulation of current software on future computers, or a hybrid approach. Some important additional considerations, e.g., matters pertaining to the authenticity of a digital item, are attributes of the systems used to manage digital content and not of the content format itself.

Seven sustainability factors
1. Disclosure.  Degree to which complete specifications and tools for validating technical integrity exist and are accessible to those creating and sustaining digital content. A spectrum of disclosure levels can be observed for digital formats.  What is most significant is not approval by a recognized standards body, but the existence of complete documentation.
2. Adoption.  Degree to which the format is already used by the primary creators, disseminators, or users of information resources.  This includes use as a master format, for delivery to end users, and as a means of interchange between systems.
3. Transparency.  Degree to which the digital representation is open to direct analysis with basic tools, such as human readability using a text-only editor.
4. Self-documentation.  Self-documenting digital objects contain basic descriptive, technical, and other administrative metadata.
5. External Dependencies.  Degree to which a particular format depends on particular hardware, operating system, or software for rendering or use and the predicted complexity of dealing with those dependencies in future technical environments.
6. Impact of Patents.  Degree to which the ability of archival institutions to sustain content in a format will be inhibited by patents.
7. Technical Protection Mechanisms.  Implementation of mechanisms such as encryption that prevent the preservation of content by a trusted repository.

Quality and functionality factors
Quality and functionality factors pertain to the ability of a format to represent the significant characteristics of a given content item required by current and future users.  These factors will vary for particular genres or forms of expression for content.  For example, significant characteristics of sound are different from those of still pictures, whether digital or not, and not all digital formats for images are appropriate for all genres of still pictures. 

User expectations establish what might be called normal rendering for a given genre or form of expression for content.  Normal rendering is a baseline for the behavior of content when presented to a user, e.g., images that permit zooming or sounds that can be played, stopped, and restarted.  The detailed discussions of content types listed below (with links to other Web pages) include discussion of the normal rendering expected for that type.


Examples of quality and functionality factors for selected content types
Still Images
  • Clarity (support for high image resolution)
  • Color maintenance (support for color management)
  • Support for graphic effects and typography
Sound
  • Fidelity (support for high audio resolution)
  • Support for multiple channels (including note-based, e.g., MIDI)
  • Support for downloadable or user-defined sounds, samples, and patches
Text
  • Support for integrity of document structure and navigation
  • Support for integrity of layout, font, and other design features
  • Support for rendering for mathematics, formulae, diagrams, etc.
Moving Images
  • Clarity (support for high image resolution)
  • Fidelity (support for high audio resolution)
  • Support for multiple sound channels

Beyond normal rendering
Certain formats offer functionality beyond normal rendering, and these will serve the needs of users with special interests in certain content types.  For example, some users will prefer that vector-based images like those used for architectural drawings remain malleable (editable) so that they can be modified after being copied from a library collection.  Or other users may require that music notation formats, e.g., MIDI, permit the use of a variety of sounds or tone sets to mimic actual instruments or create new tones and timbres.

An additional functional aspect that lies beyond normal rendering arises in the case of what may be called rich-data content.  This occurs when a given item is intended for service as a master, i.e., for use as a source for repurposing.  For example, a "normal" RGB still image carries 8 bits of data per color channel per pixel (24 bits per pixel total).  A rich-data RGB image (sometimes called extended data range, or EDR) would carry additional bit depth, e.g., 16 bits per color per pixel (48 bits per pixel total), and may represent brightness in linear rather than the more common logarithmic manner.  The full extent of data in such a "rich" image will not be displayed in normal software and display devices.  Special display devices for such images are likely to permit the examination of the data rather than presenting an image with an aesthetically pleasing effect.  But the additional data means that the rich-data image can be manipulated for aesthetic effects and for a variety of output devices.  The manipulated versions of the rich-data image can be saved as "normal" RGB images with 8 bits of data per color per pixel, and these derivative images will have a complete 24-bit spectrum of colors.  Had the master image contained only the normal 8 bits per color, some color values would be missing from the manipulated derivatives.  Therefore the rich-data version, with this added functionality, makes a good choice for long-term preservation.
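
The arithmetic behind this argument can be sketched in a few lines of Python. The example below is a simplified model for one color channel, not a description of any real imaging software. It assumes a single manipulation (doubling the brightness, then clipping and rounding to 8 bits) and counts how many of the 256 possible output values can actually occur. The function names are invented for this example.

```python
def bits_per_pixel(bits_per_channel, channels=3):
    """Total bits per pixel for an image with the given channel depth."""
    return bits_per_channel * channels


def brighten_to_8bit(value, source_max, gain=2):
    """Multiply by an integer gain, round to the 0..255 scale, and clip."""
    scaled = (value * gain * 255 + source_max // 2) // source_max
    return min(255, scaled)


def output_levels(source_bits):
    """Set of 8-bit values reachable from every possible source value."""
    source_max = 2 ** source_bits - 1
    return {brighten_to_8bit(v, source_max) for v in range(source_max + 1)}


print(bits_per_pixel(8), bits_per_pixel(16))
print(len(output_levels(8)), len(output_levels(16)))
```

Under these assumptions, the derivative made from an 8-bit master can use only 129 of the 256 possible values in the channel, because doubling skips every odd value below the clipping point, while the derivative made from a 16-bit master can use all 256.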

Similar rich-data items can be produced for sound recordings, moving images, and possibly for other content types.  The full extent of the information in a rich-data sound recording may not be rendered in audible form by normal playback software and hardware but, like the rich-data image, such a recording can be enhanced or modified with fewer ill effects than a recording made at "normal" fidelity.  In the case of moving image content, a rich-data file might contain an uncompressed video stream sampled in what is called the 4:4:4 mode (equal sampling of luminance and two chrominance elements).  The extremely high data rate for such a file (about 270 megabits per second) precludes playing it in most contemporary computer systems.  In a data storage system, such rich-data files would be managed in slower-than-real-time modes, and would have to be written to videotape or saved as compressed files before they could be played.  But their value as preservation masters would be superlative.

Balancing the factors
In practice, preferences among digital formats will be based on a balance among the factors listed above: disclosure, adoption, transparency, self-documentation, external dependencies, impact of patents, technical protection, quality, and functionality. Sometimes these factors compete. For example, some formats adopted widely for delivery of content to end users are proprietary or apply lossy compression for transmission over low-bandwidth networks. Disclosure can substitute for transparency; for example, the developers of the JPEG 2000 format based on wavelet compression are said to have tested the published specification by giving it to several programmers independently and asking them to program a compliant viewer based only on the specification. For content of high cultural value and for which a special functionality has particular significance, the ability of a format to support that functionality may outweigh the sustainability factors.
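
The idea that one factor may outweigh others can be illustrated with a weighted average. This Web site does not describe a numeric scoring scheme, so the Python sketch below is purely an illustration: the 0 to 5 scale, the weights, and the two formats ("format_a" and "format_b") are all hypothetical. A higher score means a better outcome for the archive; for external dependencies, patents, and technical protection, a high score means few obstacles.

```python
SUSTAINABILITY_FACTORS = [
    "disclosure", "adoption", "transparency", "self_documentation",
    "external_dependencies", "impact_of_patents", "technical_protection",
]


def weighted_score(scores, weights):
    """Weighted average of factor scores; a factor without a weight counts as 1."""
    total_weight = sum(weights.get(factor, 1.0) for factor in scores)
    total = sum(score * weights.get(factor, 1.0) for factor, score in scores.items())
    return total / total_weight


# Hypothetical scores: format_a is stronger on sustainability,
# format_b is stronger on quality and functionality.
format_a = {**dict.fromkeys(SUSTAINABILITY_FACTORS, 4), "quality": 3, "functionality": 1}
format_b = {**dict.fromkeys(SUSTAINABILITY_FACTORS, 3), "quality": 4, "functionality": 5}

even_weights = {}
functionality_matters_most = {"functionality": 5.0}

for label, weights in [("even", even_weights), ("functionality x5", functionality_matters_most)]:
    print(label, round(weighted_score(format_a, weights), 2),
          round(weighted_score(format_b, weights), 2))
```

With even weights the hypothetical format_a scores higher; when functionality is weighted heavily, as might be the case for content where a special functionality has particular significance, format_b scores higher. Real decisions involve curatorial judgment that a single number cannot capture.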

Also important to the selection of acceptable formats is the channel by which digital content may be received. For content that will be received through the Copyright Office at the Library of Congress, it is important that the list of acceptable formats include formats that can be conveniently provided by those wishing to register material for copyright or from whom the Library will expect deposit. For this channel, adoption may be the key factor, leading to acceptance of content in formats that provide less quality or functionality than would be sought in direct negotiations with a source of digital content. For example, for visual materials registered for copyright as digital images, the formats supported by digital cameras aimed at both professional and consumer markets must be considered. Similarly, for recorded sound, the formats used for widespread online distribution through downloading must be acceptable.


Framework for Decision-making

Policy implications
The Library of Congress acquires content in several ways, e.g., through the workings of the copyright law; via purchases, exchanges, or licensing; and by donation.  Included in this consideration are special projects like the Veterans History Project, which bring to the Library documentary materials produced by organizations across the nation, and the Minerva project, which harvests sites from the World Wide Web. 

Acquisition in each mode is guided by collection policies.  Although this Web site is primarily technical, policy matters are implied at many turns in these pages.  For example, the following section of this document notes that the Library frequently collects finished works and occasionally collects works that document the creative process itself, a distinction that suggests the values in play when acquiring digital works. The documents devoted to specific types of content include tables that suggest how curators may categorize works in terms of their essential or significant characteristics, and then proceed to indicate how these categorizations may influence the selection of preferred formats.  These illustrative examples highlight the importance of engaging Library curators to develop collections policies that answer questions like these: "Is this digital image one for which color values are so significant that its acquisition format ought to support color management?" and "Is this music recording one for which future researchers will require surround sound?"

Initial, middle, and final-state formats
In the analog realm, much of what the Library collects is published, the final manifestation of a creative process. The acquisition of works in this final state will continue in the digital realm. The institution's special collections divisions, however, also collect works in other states. First are exemplars of the creative process, e.g., manuscripts and other draft documents or musical scores, i.e., work in its initial state. Although collected only in rare instances, this category may also include raw materials used in the creative process, e.g., the outtakes or leftover footage in a video production or the recorded music tracks that include a musician's mistakes, later expunged from the published manifestation. The Library may also collect works in what might be called a middle state, the form that content takes in the hands of a publisher. In some cases, the middle-state form is what is delivered to the publisher, as exemplified by the PDF/X or TIFF/IT files that a designer may employ when submitting digital art, or the proposed Delivery Recommendations from the NARAS Producers and Engineers Wing, for multi-track sound recordings fresh from the studio, with associated metadata used to produce the final mix. Middle-state formats are likely to be used by publishers for their own archiving.

The best formats for Library of Congress collections for the long term may well be those in the middle state.  These are likely to have higher quality than final-state formats, may be easier to manage for preservation, and may also be the focus of developing archiving approaches by industry.  However, to seek middle-state digital formats would represent a change in the Library's most widespread current practice, which is the selection of best editions as authorized by copyright law.  Best editions are generally considered to be works in their final state.

The value of holding multiple versions of a work
For certain categories or subcategories of content, the Library will want multiple versions in different formats in order to manage items through the content life cycle, and this may create a certain tension when preferred formats are identified. For example, the inspection of arriving content, even in a Copyright Office examination activity, requires ease of access and viewing.  Similarly, easily accessible formats make possible the provision of digital content to readers in the Library's reading rooms.  At the same time, other formats, often richer and with larger files, provide the best option for long-term preservation.  For example, compressed versions of images or sound recordings may be the easiest to access, while their uncompressed counterparts are the most sustainable.  In some instances, in order to meet the need for multiple versions, the Library may have to produce more easily accessible versions itself.

Another tension in identifying preferred formats may arise in the context of textual items. Here, the Library may be offered formats like PDF that fit the document creators' wish to control the details of layout, font, or other matters of appearance. This desire must be weighed against the long-term needs of researchers, who will be best served by formats that express the structure of the document, e.g., chapter headings and section breaks, in ways that make this structure available for future automated analysis. A document with this type of structure (for example, one that uses XML to identify the structural elements) can be processed to support future discovery, links from references to the associated documents and, more important, research studies carried out by, say, a social scientist looking for paragraphs or chapters that bear on a certain topic.
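
The kind of automated analysis that structural markup makes possible can be sketched briefly in Python. The element names (book, chapter, title, para) and the sample text are invented for this example and do not follow any particular schema; the sketch uses the standard library's xml.etree.ElementTree module to find the chapters in which a topic is mentioned.

```python
import xml.etree.ElementTree as ET

# Hypothetical structured document; the element names are assumptions.
SAMPLE = """<book>
  <chapter>
    <title>Harvest</title>
    <para>Wheat prices fell sharply that year.</para>
  </chapter>
  <chapter>
    <title>Migration</title>
    <para>Families moved west.</para>
    <para>Many sought farm labor.</para>
  </chapter>
</book>"""


def chapters_mentioning(xml_text, term):
    """Return the titles of chapters with a paragraph that mentions the term."""
    root = ET.fromstring(xml_text)
    wanted = term.lower()
    titles = []
    for chapter in root.iter("chapter"):
        paragraphs = [para.text or "" for para in chapter.iter("para")]
        if any(wanted in text.lower() for text in paragraphs):
            titles.append(chapter.findtext("title"))
    return titles


print(chapters_mentioning(SAMPLE, "farm"))
```

A format that records only the appearance of the page would not let a program tell where one chapter ends and the next begins without additional analysis.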


Project Scope

Preferred and non-preferred formats
When mature, this Web site will identify and provide information about dozens of digital formats that are preferred or acceptable for the Library of Congress. For reference, a recent description of Web harvesting in Sweden and Finland reported 440 distinct formats held by those digital archives, although the report found that 44 formats in eight categories were the "most common" (DAVID: Archiving Websites, page 37, report available from the publications menu at http://www.antwerpen.be/david). New formats are constantly being created and existing ones continue to evolve, so the Library must be prepared to update its format preferences continually and to provide technical support for the preferred and acceptable formats.

The Web site will also identify and provide information about formats other than those deemed preferred or acceptable. To the degree possible, the Library must be prepared to transcode or normalize digital content that it wishes to acquire when this content is offered exclusively in formats other than the preferred or acceptable. A future extension of this Web site will concern the identification of software tools for format transcoding.  Even with automated tools, the acquisition of new works (what is called the Get process in the Library's digital content life cycle) will require intense activity.

The information here is explanatory and is intended to support human decision-makers. The compilers, however, have been working closely with those planning for a Global Digital Format Registry (GDFR), a collaborative activity initially stimulated by recognition of common interests at Harvard and MIT. The GDFR effort aims for an active registry that will support the execution of operations on files, to identify, validate, and even transform them. Beginning in 2006, GDFR development is proceeding at OCLC, using funding from the Andrew W. Mellon Foundation. The GDFR is related to the development of the JHOVE toolset for format characterization and validation, and we intend for this Web site to complement that effort.

Omitted media-independent types
Some intangible formats are being omitted from this analysis, at least in its early manifestations, since the types of content that use these formats are unlikely to find their way into Library of Congress collections in the foreseeable future.  One example is the family of specialized formats that represent numeric data. Another omitted example is the family of sound and moving image formats in which sound and pictures are heavily compressed in order to serve wireless communication, mobile telephony, teleconferencing, and the like. Content from these communication modes is rarely if ever added to the Library's collections.  It is true that some heavily compressed sound formats are employed for digital recorded books and may be used in emerging forms of moving image presentations, and these may indeed find their way to the Library.  However, the example of Audible, Inc., (http://www.audible.com) indicates that publishers in these fields are likely to offer multiple formats for a given title, including ones that are moderately rather than heavily compressed, and these higher quality versions would generally be preferred for the Library's collections.

Media dependency (tangible and intangible content)
This Web site focuses on digital content formats that are independent of the physical medium on which they are stored or transported. Content in such formats has been dubbed "media-independent," "intangible," or "remote" (a cataloger's term), and it exists as data files or data streams. Media-independent digital content is stored and can be transported on media, e.g., CD-R, portable hard disk, and data tape, but this use of media is incidental to the content. In contrast, media-dependent formats are inextricably linked to their physical forms, e.g., audio CDs, DVDs, and digital videotape formats like DigiBeta. The development of preferences for these media-dependent or tangible formats by the Library inevitably raises issues of workflow, management of physical inventory, and media life. Since these issues, while important, are outside the scope of this analysis, media-dependent formats are excluded from this consideration of format preference.


Last Updated: Wednesday, 07-Mar-2007 12:40:30 EST