Library of Congress Guidelines for HTML 4.01

What is HTML?

Hypertext Markup Language, or HTML, is used to structure and format documents for presentation on the World Wide Web. Strictly speaking, HTML is an application of Standard Generalized Markup Language (SGML): its syntax is defined by an SGML Document Type Definition (DTD). HTML enhances plain text (ASCII) files with markup tags that:

  1. permit the display of images and text in a variety of heading styles and formats (e.g., bold, italics, quoted text, emphasis, citation);
  2. designate structural elements such as headings, lists, and paragraphs; and
  3. provide hypertext links, which make a word, phrase or an image into a pointer to an internal location within the same document, to other documents or multimedia objects on the same server, or to documents or multimedia objects anywhere on the Internet.

Hypertext links may be made to other HTML pages or to Gopher menus/files, FTP directories/files, USENET news archives, Telnet sessions, images, and many other formatted documents or files.
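
The kinds of link targets described above can be sketched as follows. This is only an illustration: the anchor name, file names, and host names are hypothetical and are not taken from any Library of Congress page.

```html
<!-- Link to a named location within the same document -->
<A HREF="#section2">Jump to Section 2</A>
<A NAME="section2">Section 2</A>

<!-- Link to another document on the same server (relative URL) -->
<A HREF="overview.html">HTML Overview</A>

<!-- Link to a document elsewhere on the Internet (absolute URL) -->
<A HREF="http://www.example.org/docs/guide.html">External guide</A>

<!-- Link to a file on an FTP server -->
<A HREF="ftp://ftp.example.org/pub/readme.txt">FTP file</A>
```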

Flavors of HTML 4.01

HTML 4.01 is specified in three "flavors". You specify which of these variants you are using by inserting a line at the beginning of the document. For example, the HTML for this document starts with a line which says that it is using HTML 4.01 Transitional. Thus, if you want to validate the document, the validation tool knows which variant you are using. Each variant has its own DTD (Document Type Definition), which sets out the rules for using HTML in a succinct and definitive manner.

HTML 4.01 specifies three DTDs, so authors must include one of the following document type declarations in their documents. The DTDs vary in the elements they support.

The HTML 4.01 Transitional DTD includes everything in the strict DTD plus deprecated elements and attributes (most of which concern visual presentation). For documents that use this DTD, use this document type declaration:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">

The HTML 4.01 Strict DTD includes all elements and attributes that have not been deprecated or do not appear in frameset documents. For documents that use this DTD, use this document type declaration:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">

The HTML 4.01 Frameset DTD includes everything in the transitional DTD plus frames as well. For documents that use this DTD, use this document type declaration:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN" "http://www.w3.org/TR/html4/frameset.dtd">

The Library of Congress will use the HTML 4.01 Transitional DTD in its initial HTML implementation.

HTML Tags or Codes

Most HTML tags are used in pairs -- the "start tag" and "end tag" -- although some are used individually. Tags are enclosed in angle brackets (< >); start and end tags are the same except for the addition of a forward slash (/) in the end tag. The general format is:


                           <TAG>text</TAG>
                             |   |    |
                         Start   |    End
                           Tag   |    Tag
                              Content

HTML tags are not case sensitive -- <TITLE> is equivalent to <title> -- although uppercase tags may be easier to distinguish visually from the actual text of the HTML document.

Structure of HTML Tags

Each tag begins with an opening angle bracket ( < ) and ends with a closing angle bracket ( > ). Between the delimiters is a name consisting of a letter followed by up to 72 letters, digits, periods, or hyphens. HTML tag names are not case sensitive ( <H1> is equivalent to <h1> ). The table of tags attached to this document presents a list of currently conformant HTML elements.

Basic Structure of an HTML Document

An HTML 4.01 document begins with a <!DOCTYPE ...> declaration (also called the "prologue") that declares the version of HTML to which the document conforms. Next, the entire content of the document is enclosed in the <HTML>...</HTML> container. Within that container are two sections called "head" and "body"; each of these sections is similarly enclosed in container tags, <HEAD>...</HEAD> and <BODY>...</BODY>, respectively. The "head" section contains information about the document, such as its title and keywords, while the "body" section contains the actual content of the document.

Therefore, an HTML 4.01 document is composed of three parts:

  1. <!DOCTYPE ...> - a line containing HTML version information
  2. <HEAD>...</HEAD> - a declarative header section
  3. <BODY>...</BODY> - a section containing the document's actual content (the body may be implemented using the <BODY> container or, when using "frames", the <FRAMESET> container)

White space (spaces, newlines, tabs, and comments) may appear before or after each section. Sections 2 and 3 should be enclosed within the HTML element.

Here's an example of a simple HTML document:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
   "http://www.w3.org/TR/html4/strict.dtd">
<HTML>
   <HEAD>
      <TITLE>My first HTML document</TITLE>
   </HEAD>
   <BODY>
      <P>Hello world!
   </BODY>
</HTML>
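
For comparison, a document that uses frames replaces the <BODY> container with a <FRAMESET> container, as noted in the list of document parts above. The following is a minimal sketch rather than a Library of Congress template: the frame names and file names are hypothetical, and the system identifier is the one given in the W3C HTML 4.01 specification.

```html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN"
   "http://www.w3.org/TR/html4/frameset.dtd">
<HTML>
   <HEAD>
      <TITLE>A simple frameset document</TITLE>
   </HEAD>
   <!-- FRAMESET takes the place of BODY and splits the window into two columns -->
   <FRAMESET COLS="25%,75%">
      <FRAME SRC="contents.html" NAME="contents">
      <FRAME SRC="main.html" NAME="main">
      <!-- Fallback content for browsers that do not display frames -->
      <NOFRAMES>
         <BODY>
            <P>This page uses frames. See the <A HREF="main.html">main page</A>.
         </BODY>
      </NOFRAMES>
   </FRAMESET>
</HTML>
```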

Attributes Used With Tags

An opening HTML tag can also contain a qualifying attribute -- attributes modify the behavior of a tag. An attribute typically consists of an attribute name, an equal sign, and a value (in some instances attributes may not include the equal sign and value). A space is allowed before and after the equal sign but it is not required.

The value of an attribute may be either:

  1. a string of alphabetical or numerical characters, normally enclosed in double quotes; or
  2. a string of alphabetical or numerical characters combined with characters like hyphens, punctuation marks other than the double quote (! ? , .), tilde (~), ampersand (&), at sign (@), number sign (#), dollar sign ($), plus sign (+), equal sign (=), percent (%), circumflex (^), asterisk (*), or parentheses, enclosed in double quotes.

In the following example, A is the tag name, HREF= is the attribute, and http://host.loc.gov/directory/file.html is the value for the attribute.

Example:

<A HREF="http://host.loc.gov/directory/file.html">HTML FILE</A>
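
The other attribute forms mentioned above (optional spaces around the equal sign, and attributes written without an equal sign or value) might look like the following sketch. The file name, ALT text, and form field name are hypothetical.

```html
<!-- Attribute with a name, an equal sign, and a quoted value -->
<IMG SRC="photo.gif" ALT="Reading room">

<!-- Spaces around the equal sign are permitted but not required -->
<IMG SRC = "photo.gif" ALT = "Reading room">

<!-- Attribute written without an equal sign or value -->
<INPUT TYPE="checkbox" NAME="subscribe" CHECKED>
```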

Use of Quotes with HTML Attributes

Although double quotes ( " ) are not required around every attribute value in HTML, they may be used in every case. However, single quotes ( ' ) should never be used. To determine when quotes are required, examine the data following the = sign for the attribute. If that data contains a combination of letters and numbers, letters and other characters, or numbers and other characters, it should be enclosed in quotes.

Examples:

Quotes required:

Quotes not required:
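
As an illustration of this rule, using hypothetical attribute values, start tags like those in the first group below would need quotes, while those in the second group could be written without them:

```html
<!-- Quotes required: the values mix letters, digits, and other characters -->
<A HREF="http://www.loc.gov/">Library of Congress</A>
<IMG SRC="image1.gif" ALT="Figure 1">

<!-- Quotes not required: each value is letters only or digits only
     (start tags only; the rest of each element is omitted) -->
<TABLE BORDER=1>
<P ALIGN=center>
```

Since double quotes may be used in every case, quoting all values is the simpler habit.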

Library of Congress Required HTML 4.01 Tags

Every HTML 4.01 document begins with the prologue, or document type declaration (in order to identify the version of the DTD being conformed to), followed by the HTML, HEAD and BODY sections or containers. The following is an HTML skeleton document containing the Library of Congress required tags. Each tag is described following this example.

Example:


<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
   "http://www.w3.org/TR/html4/loose.dtd">
<HTML>
   <HEAD>
      <TITLE>Library of Congress title here</TITLE>
      <BASE HREF="http://www.loc.gov/etc/filename.html">
      <META NAME="description" CONTENT="...description...">
      <META NAME="keywords" CONTENT="keyword keyword keyword">
   </HEAD>
   <BODY>
      This is where the text and other content goes.
   </BODY>
</HTML>

Prologue Tag

The prologue appears at the beginning of every HTML page and identifies what follows as an HTML document, allowing browsers and other special software to distinguish HTML documents from other types (DTDs) of SGML. All HTML documents written according to the current HTML specification (Version 4.01) should use the prologue tag displayed above.

The following descriptions of the pieces of the prologue tag were prepared by Murray Altheim (murray@spyglass.com).

            1     2     3   4   5         6                 7
 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" >

  1. HTML - The SGML document type being declared: <HTML> ... </HTML>
  2. PUBLIC - Identifies the information in quotes as a Formal PUBLIC Identifier.
  3. "-" - The minus sign designates an unregistered organization. ISO, registered (+), or unregistered (-) are the possibilities here. W3C is not currently registered with ISO; therefore the (-) is used.
  4. W3C - identifies the party responsible for creation/maintenance of the DTD. If the DTD comes from IETF, W3C, etc. you'll see their ID here.
  5. DTD - describes the type of object, called a Public Text Class. In this case, it is a DTD.
  6. HTML 4.01 Transitional - is the Public Text Description. Here you'll find the DTD's name, plus flavors such as version numbers, "strict", "draft," "transitional," etc.
  7. EN - identifies the Public Text Language, describing the natural language in which the public text is written, represented by two uppercase characters from ISO 639. "EN" = English.

<HTML> Section

The <HTML>...</HTML> container encloses the entire document itself, but not the prologue; it acts as a container for every element in the document. The PROLOGUE is always presented outside and above the opening <HTML> tag as illustrated in the skeleton document. Within the <HTML>...</HTML> container, there should be two sections: the "head" section and the "body" section.

<HEAD> Section

The <HEAD>...</HEAD> container encloses items and information that generally are not displayed in the browser's display window. In the skeleton document, notice that the <TITLE>, <BASE>, and <META> tags are included in the <HEAD>...</HEAD> container.

<TITLE> Tag

The HTML title is enclosed in the <TITLE>...</TITLE> container tag. Although not displayed in the browser display window, it is normally displayed in the title bar of the browser (in the case of LYNX the HTML title is listed at the very top of the page as a non-scrolling element). The HTML title is also used extensively for indexing and retrieval. Most search engines, like AltaVista, give the text listed in the title some priority during indexing, and use the title text as the "title" of the item as listed on the screen of search results. Careful attention should always be taken when constructing the HTML title: it should be unique and the most important words should come first. The title text should be limited to 72 characters when possible; however, it is possible to extend that length if needed.

<BASE> Tag

The <BASE> tag identifies the URL of the current HTML document and should be used to ensure proper resolution of any "relative" URLs used in the document. The URL is given in the HREF= attribute. The following example gives the <BASE> tag for the Library of Congress Home Page.

Example:

<BASE HREF="http://www.loc.gov/homepage/lchp.html">
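
As a sketch of how relative URLs resolve against the base, assume the <BASE> tag above and two relative links. The file names in the links are hypothetical and serve only to illustrate the resolution.

```html
<!-- In the HEAD section -->
<BASE HREF="http://www.loc.gov/homepage/lchp.html">

<!-- In the BODY section -->

<!-- Resolves to http://www.loc.gov/homepage/about.html -->
<A HREF="about.html">About</A>

<!-- Resolves to http://www.loc.gov/help/index.html -->
<A HREF="../help/index.html">Help</A>
```

Because the links are resolved against the HREF= value rather than the location the file was fetched from, they keep working if a copy of the page is saved or served from another address.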

<META> Tag

The <META> tag is used to convey meta-information about the document (descriptive words and phrases used extensively in indexing and retrieval), but can also be used to specify HTTP header equivalents (language, character encoding, etc.) for the document. You can use either the NAME= or the HTTP-EQUIV= attribute to name the meta-information, but the CONTENT= attribute must be used in both cases.
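
The two naming attributes might be used as in the following sketch. The "author" name and both CONTENT= values are illustrative assumptions, not Library of Congress requirements.

```html
<!-- NAME= names a piece of meta-information about the document -->
<META NAME="author" CONTENT="Name of the page author">

<!-- HTTP-EQUIV= names an HTTP header that the content stands in for -->
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1">
```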

Library of Congress pages should always contain two specific <META> tags for the purpose of providing better information to the indexing and retrieval services (including our own) that index our pages for discovery on the Web. The NAME= attribute is used to define the name of the meta element and the CONTENT= attribute is used to define the content of the named meta element.

The following examples illustrate the required uses of the <META> tags for Library of Congress pages.

Examples:
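
Judging from the skeleton document shown earlier, the two required tags appear to be the "description" and "keywords" tags. A sketch with placeholder content follows; the wording of the CONTENT= values is hypothetical.

```html
<META NAME="description" CONTENT="Guidelines for writing HTML 4.01 pages at the Library of Congress">
<META NAME="keywords" CONTENT="HTML guidelines markup Library of Congress">
```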

The following <META> tag is also recommended but not required.

Example:

An example of a <META> tag using the HTTP-EQUIV= attribute is the

<BODY> Section

The <BODY>...</BODY> container tags enclose all of the text and other HTML tags that are used to display the actual document and multimedia content to the user.

HTML Elements Used in the Document Body

There are now two types of HTML elements occurring in the BODY of an HTML document:

  1. Block-level Elements (structural elements like headings, paragraphs, table cells, etc.)
    Block-level elements typically contain inline elements and other block-level elements.
  2. Inline Elements (emphasis, strong, citation, hypertext anchor, etc.)
    Inline elements typically may only contain text and other inline elements. When rendered visually, inline elements do not usually begin on a new line.

Generally, block-level elements begin on new lines, inline elements do not.
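
The distinction can be seen in a short fragment. In this sketch (the text and the file name are hypothetical), H2 and P are block-level elements, while EM, A, and STRONG are inline elements nested inside the paragraph.

```html
<!-- Block-level elements: each begins on a new line when rendered -->
<H2>Reading Rooms</H2>
<P>
   <!-- Inline elements: rendered within the flow of the paragraph text -->
   The <EM>Main Reading Room</EM> is described in the
   <A HREF="guide.html">visitor guide</A>, which
   <STRONG>all</STRONG> researchers should consult.
</P>
```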

Special Characters

The text of HTML documents is typically in plain or ASCII text, but some non-ASCII characters can also be displayed by using a series of defined special character codes or "entities."

The basic format of a special character code is ampersand ( & ) followed by the designated character string and ending with a semi-colon ( ; ). The ampersand instructs the Web browser to ignore the regular meaning of the following letters or numbers and insert the indicated special character instead.

In HTML 4.01, there are three methods of expressing the "character string" (the data between the & and the ; ):

Using alphabetical strings:

&aacute;  produces  á
&egrave;  produces  è

Using numeric (decimal) strings:

&#225;  produces  á
&#232;  produces  è

Using numeric (hexadecimal) strings:

&#xE1;  produces  á
&#xE8;  produces  è

By far, the easiest ones to remember are the alphabetical strings; however, the numeric (decimal) strings also work well. Currently, browser support for the numeric (hexadecimal) strings is very poor. In addition, although many new special character entities have been added as part of HTML 4.01, most of them are still not supported by the browsers. A complete list of these codes (also called entities) is available for use in HTML 4.01 via Entities (Web Design Group). This table shows the character entity references in HTML 4.01, along with the numeric character reference in decimal and hexadecimal. A rendering of each character reference is provided so that users may check their browsers' compliance.
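
In running text, the three forms might be used as in the sketch below; the sample phrases are hypothetical. The last line shows the entities for characters that have a special meaning in HTML itself.

```html
<!-- All three paragraphs should render the same text: café à la carte -->
<P>caf&eacute; &agrave; la carte</P>
<P>caf&#233; &#224; la carte</P>
<P>caf&#xE9; &#xE0; la carte</P>

<!-- &lt; &gt; and &amp; stand for the characters <, >, and & -->
<P>Use &lt;P&gt; for paragraphs &amp; &lt;H1&gt; for headings.</P>
```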

Comments

It is possible to include "commented" information within the HTML coding for a document. Comments can be helpful for recording who last updated a document, what a section of HTML coding is for, etc. HTML comments are still expressed within the opening and closing angle brackets; however, a comment begins with the "<!--" string and ends with the "-->" string. Because "--" delimits the comment, do not use "--" (the double dash) within the comment.

<!-- An example comment -->
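
A few further uses, sketched with hypothetical wording: a comment that records maintenance information, and one that spans several lines. Neither contains a double dash inside the comment text.

```html
<!-- Last updated by the Web team; navigation links begin here -->
<P>Visible paragraph text.</P>

<!--
   A comment may span several lines.
   Nothing inside it is displayed by the browser.
-->
```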


Library of Congress
Library of Congress Help Desk (January 25, 2001)

Maintained by the Network Development and MARC Standards Office
Links to detailed documentation on Web Design Group site are provided.