CHIPS Articles: The Lazy Person's Guide to Controlling Technologies: Document Formats - an Open and Shut case

By Dale Long - April-June 2006
Welcome to the continuing saga of how technology controls our lives. This installment of the Lazy Person's Guide will address digital documents: what they are, how they function and what leverage they exert in our work environment. We will examine text, tags and the march of formats from basic text to desktop publishing capabilities that put the modern equivalent of a printing press on everyone's desktop. But first, as is often our custom, we start with a visit with Zippy.

Zipped Archives

We received our annual invitation to Casa Zippy for their New Year's Day Football Finger Food Fiesta. The party was not for Zippy and me, but for our wives. Zippy and I will watch football as an excuse to tweak a surround sound system to recreate stadium crowd noise.

Our wives, however, are die-hard football fans: Zippette is a Pittsburgh Steelers fan and my wife is a lifelong Cleveland Browns fan. Since those two teams were playing each other on New Year's Day, there was no way Zippy and I were going to come near them during the game. Even Zippy's twins, now three years old, know to make themselves scarce on Sunday afternoons when the "Moms" are watching football on television.

Zippy, the children and I found ourselves in the safest room in the house: the basement. Zippy's basement is not your average hole in the ground. Most basements range from cement floors and cinder blocks to a finished space with carpet. Zippy's basement resembles nothing less than the North American Aerospace Defense Command's operation center in Cheyenne Mountain, with enough electronics hardware to run a large multinational corporation.

There is a room in Zippy's basement that should be in the Smithsonian Institution. It contains a working version of almost every personal computer (PC) and operating system (OS) produced since the original Apple I back in 1976. Some people play with model trains in their basements; Zippy plays with old computers. Despite being the computer equivalent of a guy who cuts a hole in the floorboard of his sports utility vehicle so he can drive it onto a frozen lake and use it as an ice fishing shanty, Zippy has managed to keep all his relics operational.

Zippy has been keeping digital records as a matter of obsession for more than 25 years. He has financial records in VisiCalc in Apple and PC formats, and both old Mac and new Windows versions of Excel. He has a huge collection of old UseNet files, original short fiction and research papers in AppleWorks, WordStar, WordPerfect (WP), Microsoft Word (four versions), Encapsulated PostScript (EPS), and WriteNow. His favorites are game programs and saved games on Commodores, Amigas and Apple IIs. If there is a "nerdvanna" in the afterlife, it probably looks just like Zippy's basement.

However, Zippy's collection illustrates one of the major issues in computing: proprietary applications and formats. The reason he has these old computers is to maintain the use of information and applications despite changes in technology over the last 30 years. Old applications and formats never die; they just become obsolete.

As more people and organizations migrate to newer systems and software, the pressure increases on the holdouts to join them. We are given the choice of replacing or upgrading to keep up with the technological Joneses, or risk ending up stuck in a computing backwater, unable to exchange documents with the rest of the world. To understand why formats matter, let's take a brief look at how they work.

Tagging Text

In the beginning, there was the American Standard Code for Information Interchange (ASCII), and ASCII was all anyone had in the early days of the Advanced Research Projects Agency Network (ARPANET). First codified in 1963 by the American Standards Association, the predecessor of today's American National Standards Institute, ASCII was derived from telegraphic codes and first entered commercial use as a seven-bit teleprinter code promoted by Bell data services. Originally, it only had uppercase letters and a few odd substitutions for some special characters.

ASCII was upgraded in 1967 to include lowercase letters and enhanced control codes like ACKnowledge, ESCape and DELete. Other than upper- and lowercase lettering, ASCII did not allow for anything in the way of formatting. Early teleprinters worked much like typewriters. ASCII mirrored whatever the printer could produce based on typewriter-style output.
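
For readers who want to see the seven-bit code in action, here is a minimal Python sketch. The function names and the small table of control codes are my own illustration, not part of any standard library; the numeric values themselves are the standard ASCII assignments.

```python
# Seven-bit ASCII: every character maps to a code between 0 and 127.
# The three control codes named in the text, with their standard values.
CONTROL_CODES = {"ACK": 0x06, "ESC": 0x1B, "DEL": 0x7F}


def ascii_codes(text):
    """Return (character, code, seven-bit binary string) for each character."""
    return [(ch, ord(ch), format(ord(ch), "07b")) for ch in text]


def is_seven_bit(text):
    """True when every character fits in seven bits (code below 128)."""
    return all(ord(ch) < 128 for ch in text)


if __name__ == "__main__":
    for ch, code, bits in ascii_codes("Aa"):
        print(ch, code, bits)
    print(is_seven_bit("Zippy"))
```

Uppercase "A" and lowercase "a" differ by a single bit, which is one reason adding lowercase letters in 1967 fit so neatly into the existing code.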

The ASCII text format enabled the development of text editors, computer software capable of editing plain ASCII text. Much like the keypunch machines they eventually replaced, the first editors worked on one line at a time because mainframes of that era generally did not have display screens, just line-feed printers that printed plain text on paper.

The revolution began in earnest when computer monitors became less expensive and more common, facilitating the development of full screen editors that let users see and work on pages of text instead of individual lines. One of these early editors was “vi,” which is still a standard application on Unix and Linux systems today. Other well-known text editors include EMACS (Editor MACroS), Microsoft Notepad and SimpleText. Inevitably, computer users wanted more complex text output from computers.

Text Markup

I have written in CHIPS about the earliest forms of complex text markup (http://www.chips.navy.mil/archives/94_apr/file12.html): Generalized Markup Language (GML) and Standard Generalized Markup Language (SGML). While they are important as the ancestors of modern formatting, most people using computers today have no contact with them. Therefore, we will start the discussion with two formats present on almost every modern computer: Rich Text Format (RTF) and Hypertext Markup Language (HTML).

The RTF document file format was originally developed by Microsoft in 1987 for sharing documents across computing platforms. Most word processors can read and write RTF documents. RTF allows simple text formatting, including: boldface, italics, underline or some combination of the three. These are probably the most common text attributes. RTF is normally limited to applying formatting to 7-bit ASCII text, though it is possible to produce other characters in Arabic or Cyrillic through the use of special codes.

Accompanying RTF were changes in printers from typewriter-like keys and “print balls” to dot matrix printers that could reproduce fonts and characters in different sizes and shapes without requiring the operator to physically replace parts of the printer for every new font. Computer monitors also progressed, from displaying simple monochrome text to rendering millions of colors. Advances in all three (printers, displays and formatting) have been inextricably linked over the last 25 years, with advances in one area enabling or driving improvements in the others.

HTML is arguably the most pervasive text formatting system in the world due to its presence on millions of Web pages. HTML is a simple application of SGML and probably the simplest tagging schema ever. For example, the formatting attributes listed in the RTF paragraph above would be tagged as follows:
<b>boldface</b> = boldface
<i>italics</i> = italics
<u>underline</u> = underline
<b><i><u>combination</u></i></b> = combination

Tags are like on and off switches. In HTML the “hairpin” brackets identify the text within them as a tag, not content. The tag inside the brackets preceding the text turns the desired formatting on and the tag with the “/” following the text turns the formatting off. Web browsers interpret HTML tags and present the text in various sizes, styles, colors and fonts based on what the tags instruct.
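
The on-and-off behavior of tags can be sketched in a few lines of Python using the standard library's HTMLParser. This is only an illustration of the switch idea, not how a real browser renders pages; the class name FormatTracker and the function formatting_runs are hypothetical names of my own.

```python
from html.parser import HTMLParser


class FormatTracker(HTMLParser):
    """Record each run of text with the formatting tags switched on at that point."""

    def __init__(self):
        super().__init__()
        self.active = []  # tags currently "on", in the order they were opened
        self.runs = []    # list of (text, tuple of active tags)

    def handle_starttag(self, tag, attrs):
        # An opening tag turns the formatting on.
        self.active.append(tag)

    def handle_endtag(self, tag):
        # A closing tag (the one with the "/") turns the most recent match off.
        for i in range(len(self.active) - 1, -1, -1):
            if self.active[i] == tag:
                del self.active[i]
                break

    def handle_data(self, data):
        if data:
            self.runs.append((data, tuple(self.active)))


def formatting_runs(markup):
    """Return the text runs in markup, each paired with its active tags."""
    tracker = FormatTracker()
    tracker.feed(markup)
    tracker.close()  # flush any text still buffered at the end
    return tracker.runs


if __name__ == "__main__":
    for text, tags in formatting_runs("plain <b>bold <i>both</i></b> plain"):
        print(repr(text), tags)
```

Each run of text simply inherits whatever switches happen to be on when the parser reaches it, which is all a browser needs to know to pick a typeface.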

While the various schemas used by word processing and other applications that use text vary in complexity, they all follow the same basic process of enclosing the formatted text within coded tags. In RTF, for example, boldface text would be tagged like this: {\b boldface} = boldface

As you can see, in RTF the affected text is enclosed within the curly brackets along with its tag.
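
To make the curly-bracket grouping concrete, here is a small Python sketch that builds RTF text by hand. The helper names are hypothetical, and the header in rtf_document is an assumed minimal one with a single font; real word processors write far more elaborate headers. The control words \b, \i and \ul are the standard RTF codes for bold, italic and underline.

```python
def rtf_escape(text):
    """Escape the three characters RTF treats as syntax: backslash and both braces."""
    return text.replace("\\", "\\\\").replace("{", "\\{").replace("}", "\\}")


def rtf_group(text, *controls):
    """Enclose text in curly brackets together with its control words."""
    if not controls:
        return rtf_escape(text)
    prefix = "".join("\\" + word for word in controls)
    return "{" + prefix + " " + rtf_escape(text) + "}"


def rtf_document(body):
    """Wrap already-tagged body text in a minimal RTF header (assumed font table)."""
    header = "{\\rtf1\\ansi\\deff0{\\fonttbl{\\f0 Times New Roman;}}\\f0 "
    return header + body + "}"


if __name__ == "__main__":
    body = ("Plain, " + rtf_group("boldface", "b") + " and "
            + rtf_group("combination", "b", "i", "ul"))
    print(rtf_document(body))
```

Because the formatting ends when the group's closing bracket arrives, RTF needs no separate "off" tag the way HTML does.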

One more format that deserves mention is Extensible Markup Language (XML). XML is a simplified subset of SGML and, unlike HTML, is a general-purpose markup language from which special-purpose markup languages like Geography Markup Language (GML), Really Simple Syndication (RSS), Mathematical Markup Language (MathML), Physical Markup Language (PML) and MusicXML have been created. XML-based systems facilitate data sharing across different systems, particularly systems connected via the Internet.
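
As a rough illustration of why XML eases data sharing, the Python sketch below reads a tiny document with the standard library's ElementTree parser. The vocabulary (library, document, format, year) and the sample values are invented for this example and do not belong to any real markup language.

```python
import xml.etree.ElementTree as ET

# A made-up special-purpose vocabulary describing a few archived files.
SAMPLE = """<library>
  <document format="WordStar" year="1984">Research paper</document>
  <document format="VisiCalc" year="1981">Household budget</document>
</library>"""


def list_documents(xml_text):
    """Return (format, year, title) for every <document> element."""
    root = ET.fromstring(xml_text)
    return [
        (doc.get("format"), int(doc.get("year")), doc.text)
        for doc in root.findall("document")
    ]


if __name__ == "__main__":
    for fmt, year, title in list_documents(SAMPLE):
        print(year, fmt, title)
```

Any program on any platform with an XML parser can read the same file, without needing the application that wrote it.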

The differences in formatting structure and codes between HTML, RTF and every scheme that tags and formats text illustrate a crucial facet of text formatting. While all formatting schemes tell applications how to display or reproduce text, they differ from each other in ways both large and small. Therein lies the crux of the issue.

Open and Shut Case

As the people who write software applications come up with new, innovative ways to format content they add new tags to their formats.

As formatting evolves, the people developing these applications have two choices: open their formats so everyone can use them, or keep them closed so they only work well with their own software.

Let us look at Microsoft as an example of how formats help influence behavior. In the late 1980s when Microsoft was trying to facilitate the spread of Microsoft Disk Operating System (MS-DOS) and Windows PCs, Microsoft released RTF as an open standard to facilitate the transfer of documents between computers. However, RTF had fairly limited formatting capability.

The king of IBM PC word processing software in 1990 was WordPerfect, which had extensive formatting capability even as a DOS application and a 90 percent market share.

Soon after its release of the Windows shell for MS-DOS, Microsoft released Word, a more sophisticated word processor with “what you see is what you get” (WYSIWYG) display capability and the ability to import WordPerfect files into Word’s document (.doc) format. Unlike WordPerfect, which allowed you to “reveal” the codes formatting the text in a separate window, MS Word kept the formatting codes hidden to shield the user from needless complexity.

Aside from the obvious advantages of the WYSIWYG interface, and that Microsoft bundled more than just a word processor into MS Office, there was one technical detail that greatly facilitated the migration of users from WordPerfect to MS Word. MS Word could read and convert WP documents much more cleanly than WP could do in the other direction.

Over the next five years, MS Word assumed a dominant position in the marketplace, and Microsoft Office application files became default standards for most commercial and government agencies. Even if organizations did not use Microsoft applications, they still had to find a way to work with Microsoft file formats, if they wanted to exchange information with other groups that did.

MS Office file formats have evolved every time the associated application evolved. When MS Word went from version 2.0 to version 6.0, the file format changed. The result was that people still using Word 2.0 could not read documents saved in the version 6.0 format, which was the default for Word 6.0. Every organization that migrated to Word 6.0 put pressure on any holdouts to upgrade to stay compatible.

I submit that most organizations, public and private, create the vast majority of their electronic files in some closed, proprietary format. Word processing documents, spreadsheets, presentation slides, databases, e-mail, network directories — virtually all of the electronic files in the government and commercial world are in some proprietary, closed format.

This reliance on closed formats raises a few issues. First, the more documents an organization has in a proprietary format, the less likely it is to migrate from the applications that can read those documents. Thus, the need to access documents stored in these formats can create an organizational dependency. When the company that produces that application upgrades to a new version, the organizations that use it are compelled to buy the upgrade — whether or not it makes a functional difference to operations.

On another level, closed document formats interfere to a great extent with enterprise electronic records management. Aside from the issues associated with accessing documents produced by different versions of an application, managing collections of documents across an agency becomes very problematic when the application bundles multiple documents into files that can only be accessed by the parent application.

This is generally the case with both database applications and e-mail programs. And while you can build retention and disposition rules into individual databases, e-mail records stored in one or more files that contain multiple e-mails defy any management other than manual intervention by the e-mail account user.

Alternatives

If you are happy with proprietary document formats, you can keep on using whatever software you are committed to currently. Most users just want to get their work done with the least amount of fuss possible. If, however, you are tired of the inability to manage documents and records independently of the applications that created them, you may wish to consider using applications compatible with the Organization for the Advancement of Structured Information Standards (OASIS) Open Document Format (ODF) for Office Applications standard.

OASIS is a global consortium working on the development, convergence and adoption of e-business and Web service standards. At present, OASIS initiatives include open standards for Web services, e-commerce, security, supply chains, computing management, applications, documents, XML and interoperability.

ODF is an open file format for saving and exchanging editable office documents such as text documents, spreadsheets, charts and presentations. It is XML-based and derived from the file format originally created for OpenOffice.org software applications.
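
Here is a hedged Python sketch of what an open, XML-based format buys you: the text can be pulled out with ordinary tools and no office suite. An ODF text document is a ZIP package whose main text lives in a file named content.xml. The toy package built below is a deliberately simplified stand-in that I made up for the demonstration, not a complete, valid ODF file, and the function names are hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace names used by ODF for document structure and text content.
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def build_toy_odt(paragraphs):
    """Build a simplified in-memory stand-in for an .odt package (not full ODF)."""
    ET.register_namespace("office", OFFICE_NS)
    ET.register_namespace("text", TEXT_NS)
    root = ET.Element(f"{{{OFFICE_NS}}}document-content")
    body = ET.SubElement(root, f"{{{OFFICE_NS}}}body")
    text = ET.SubElement(body, f"{{{OFFICE_NS}}}text")
    for paragraph in paragraphs:
        ET.SubElement(text, f"{{{TEXT_NS}}}p").text = paragraph
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("mimetype", "application/vnd.oasis.opendocument.text")
        package.writestr("content.xml", ET.tostring(root, encoding="utf-8"))
    return buffer.getvalue()


def read_paragraphs(odt_bytes):
    """Extract paragraph text from content.xml without any office application."""
    with zipfile.ZipFile(io.BytesIO(odt_bytes)) as package:
        root = ET.fromstring(package.read("content.xml"))
    return ["".join(p.itertext()) for p in root.iter(f"{{{TEXT_NS}}}p")]


if __name__ == "__main__":
    package_bytes = build_toy_odt(["Open formats", "outlive applications."])
    print(read_paragraphs(package_bytes))
```

A records manager could run something like read_paragraphs across a whole file share, which is exactly the kind of application-independent access that closed formats make difficult.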

OpenOffice.org is a free, open source software suite that includes word processor, spreadsheet, presentation, vector drawing and database components. It is available for most current computing platforms, including Microsoft Windows, Unix, Linux and Mac OS X, and supports the OpenDocument standard for data interchange. Sun Microsystems released the source code in July 2000, but the suite has not yet caught on widely with users.

However, because the open source Linux OS has spread successfully, IT managers are becoming more receptive to open source in general. The chief information officer of Massachusetts announced last year that beginning in 2007 all official electronic documents in the commonwealth must comply with the OASIS ODF standard.

On another front, open source databases like MySQL (Structured Query Language), INteractive Graphics REtrieval System (Ingres) and PostgreSQL are becoming more popular. Version 5 of MySQL, the current open source leader, was downloaded 4 million times in the first three months after its release in October 2005. If only one of every thousand downloads becomes an active business application, that means 4,000 organizations would be using an open source database.

Oracle, IBM and Microsoft are countering by bundling more and more applications with their databases. Oracle, for example, has in recent years acquired Siebel, PeopleSoft and J.D. Edwards. All three offer high-end application functionality that open source databases cannot match.

References and Closing

For more information on OASIS standards, visit the OASIS Web site at: http://www.oasis-open.org/specs/index.php. For more information on open source databases, see: http://www.businessweek.com/technology/content/feb2006/tc20060206_918648.htm.

Will open source eventually displace proprietary software in the market?

As with any IT purchase there will be costs, benefits, break-even points and functionality issues. But I believe that with Linux giving open source credibility and the ease of using open document standards, we will see some movement toward open source standards.

Open source true believers are a little more optimistic. They have adopted a quote from Mohandas K. Gandhi to refer to the proprietary standards they seek to supplant: “First they ignore you, then they laugh at you, then they fight you, then you win.”

Until next time, Happy Networking!

Long is a retired Air Force communications officer who has written regularly for CHIPS since 1993. He holds a Master of Science degree in Information Resource Management from the Air Force Institute of Technology. He is currently serving as a telecommunications manager in the U.S. Department of Homeland Security.

The views expressed here are solely those of the author, and do not necessarily reflect those of the Department of the Navy, Department of Defense or the United States government.

CHIPS is an official U.S. Navy website sponsored by the Department of the Navy (DON) Chief Information Officer, the Department of Defense Enterprise Software Initiative (ESI) and the DON's ESI Software Product Manager Team at Space and Naval Warfare Systems Center Pacific.

Online ISSN 2154-1779; Print ISSN 1047-9988