Hypertext Markup Language, or HTML, is used to structure and format documents for presentation on the World Wide Web. In the strictest sense, HTML is not just a markup language but an application of Standard Generalized Markup Language (SGML), defined by a Document Type Definition (DTD). HTML enhances plain text (ASCII) files with markup tags that define the document's structure and formatting and that create hypertext links.
Hypertext links may be made to other HTML pages or to Gopher menus/files, FTP directories/files, USENET news archives, Telnet sessions, images, and many other formatted documents or files.
HTML 4.01 is specified in three "flavors". You specify which of these variants you are using by inserting a line at the beginning of the document. For example, the HTML for this document starts with a line which says that it is using HTML 4.01 Transitional. Thus, if you want to validate the document, the tool used knows which variant you are using. Each variant has its own DTD (Document Type Definition), which sets out the rules and regulations for using HTML in a succinct and definitive manner.
HTML 4.01 specifies three DTDs, so authors must include one of the following document type declarations in their documents. The DTDs vary in the elements they support.
The HTML 4.01 Transitional DTD includes everything in the strict DTD plus deprecated elements and attributes (most of which concern visual presentation). For documents that use this DTD, use this document type declaration:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
The HTML 4.01 Strict DTD includes all elements and attributes that have not been deprecated or do not appear in frameset documents. For documents that use this DTD, use this document type declaration:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
The HTML 4.01 Frameset DTD includes everything in the transitional DTD plus frames as well. For documents that use this DTD, use this document type declaration:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN"
"http://www.w3.org/TR/html4/frameset.dtd">
The Library of Congress will use the HTML 4.01 Transitional DTD in its initial HTML implementation.
Most HTML tags are used in pairs -- the "start tag" and "end tag" -- although some are used individually. Tags are enclosed in angle brackets (< >); start and end tags are the same except for the addition of a forward slash (/) in the end tag. The general format is:
<TAG>text</TAG>
<TAG> - start tag
text - content
</TAG> - end tag
HTML tags are not case sensitive -- <TITLE> is equivalent to <title> -- although uppercase tags may be easier to distinguish visually from the actual text of the HTML document.
Each tag begins with an opening angle bracket ( < ) and ends with a closing angle bracket ( > ). Between the delimiters is a name consisting of a letter followed by up to 72 letters, digits, periods, or hyphens. HTML tag names are not case sensitive ( <H1> is equivalent to <h1> ). The table of tags attached to this document presents a list of currently conformant HTML elements.
An HTML 4.01 document begins with a <!DOCTYPE ...> declaration (also called the "prologue") that declares the version of HTML to which the document conforms. Next, the entire content of the document is enclosed in the <HTML>...</HTML> container. Within that container are two sections called "head" and "body"; each of these sections is similarly enclosed in container tags, <HEAD>...</HEAD> and <BODY>...</BODY>, respectively. The "head" section contains information about the document, such as its title and keywords, while the "body" section contains the actual content of the document.
Therefore, an HTML 4.01 document is composed of three parts:
1. <!DOCTYPE ...> - a line containing HTML version information
2. <HEAD>...</HEAD> - a declarative header section
3. <BODY>...</BODY> - a section containing the document's actual content (the body may be implemented using the <BODY> container or, when using "frames", the <FRAMESET> container)
White space (spaces, newlines, tabs, and comments) may appear before or after each section. Sections 2 and 3 should be enclosed within the HTML element.
Here's an example of a simple HTML document:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<HTML>
<HEAD>
<TITLE>My first HTML document</TITLE>
</HEAD>
<BODY>
<P>Hello world!
</BODY>
</HTML>
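The frames case mentioned above follows the same three-part pattern. The following is a minimal sketch only, assuming two hypothetical files, nav.html and main.html, that are not part of this guide. It uses the Frameset document type declaration and puts a <FRAMESET> container where the <BODY> container would otherwise go.

```html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN"
    "http://www.w3.org/TR/html4/frameset.dtd">
<HTML>
<HEAD>
<TITLE>A simple frameset document</TITLE>
</HEAD>
<!-- FRAMESET takes the place of BODY; nav.html and main.html are hypothetical files -->
<FRAMESET COLS="25%,75%">
  <FRAME SRC="nav.html" NAME="nav">
  <FRAME SRC="main.html" NAME="main">
  <!-- Fallback content for browsers that do not display frames -->
  <NOFRAMES>
    <BODY>
    <P>This document uses frames. <A HREF="main.html">View the main content</A>.
    </BODY>
  </NOFRAMES>
</FRAMESET>
</HTML>
```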
An opening HTML tag can also contain a qualifying attribute -- attributes modify the behavior of a tag. An attribute typically consists of an attribute name, an equal sign, and a value (in some instances attributes may not include the equal sign and value). A space is allowed before and after the equal sign but it is not required.
The value of an attribute may be either enclosed in quotation marks or, in some cases, left unquoted.
Example:
<A HREF="http://host.loc.gov/directory/file.html">HTML FILE</A>
In this example, A is the tag name, HREF= is the attribute, and http://host.loc.gov/directory/file.html is the value for the attribute.
Although double quotes ( " ) are not required around every attribute value in HTML, they may be used in every case. However, single quotes ( ' ) should never be used. To determine when quotes are required, examine the data following the = sign for the attribute. If that data contains a combination of letters and numbers, letters and other characters, or numbers and other characters, it should be enclosed in quotes.
Examples:
Quotes required:
HREF="http://www.loc.gov/"
ALT="Library of Congress Home Page" (quotes always required with the ALT= attribute)
SIZE="+1"
NAME="top" (quotes always required with the NAME= attribute)
Quotes not required:
SIZE=1
ALIGN=CENTER
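The quoting rules above can be seen together in a short fragment. This is only an illustrative sketch: the image file name lclogo.gif and the sample text are assumptions, not actual Library of Congress resources.

```html
<!-- Quoted values: URLs, text with spaces, signed numbers, and
     (by the rule above) every ALT= and NAME= value -->
<A NAME="top"></A>
<A HREF="http://www.loc.gov/"><IMG SRC="lclogo.gif" ALT="Library of Congress Home Page"></A>
<FONT SIZE="+1">Slightly larger text</FONT>

<!-- Unquoted values: each is a single run of digits or of letters -->
<FONT SIZE=1>Small text</FONT>
<P ALIGN=CENTER>Centered paragraph</P>
```

When in doubt, quoting the value is always safe, since double quotes may be used in every case.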
Every HTML 4.01 document begins with the prologue, or document type declaration (in order to identify the version of the DTD being conformed to), followed by the HTML, HEAD, and BODY sections or containers. The following is an HTML skeleton document containing the tags required by the Library of Congress. Each tag is described following this example.
Example:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<HTML>
<HEAD>
<TITLE>Library of Congress title here</TITLE>
<BASE HREF="http://www.loc.gov/etc/filename.html">
<META NAME="description" CONTENT="...description...">
<META NAME="keywords" CONTENT="keyword keyword keyword">
</HEAD>
<BODY>
This is where the text and other content goes.
</BODY>
</HTML>
The prologue appears at the beginning of every HTML page and identifies what follows as an HTML document, allowing browsers and other special software to distinguish HTML documents from other types (DTDs) of SGML. All HTML documents written according to the current HTML specification (Version 4.01) should use the prologue tag displayed above.
The following descriptions of the pieces of the prologue tag were prepared by Murray Altheim (murray@spyglass.com).
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
1. HTML - The SGML document type being declared: <HTML> ... </HTML>
2. PUBLIC - Identifies the information in quotes as a Formal Public Identifier.
3. - (the minus sign) - Designates an unregistered organization. The possible values here are ISO, registered (+), or unregistered (-). W3C is not currently registered with ISO, therefore the (-) is used.
4. W3C - Identifies the party responsible for creation/maintenance of the DTD. If the DTD comes from IETF, W3C, etc., you'll see their ID here.
5. DTD - Describes the type of object, called a Public Text Class. In this case, it is a DTD.
6. HTML 4.01 Transitional - The Public Text Description. Here you'll find the DTD's name, plus flavors such as version numbers, "strict," "draft," "transitional," etc.
7. EN - Identifies the Public Text Language, describing the natural language in which the public text is written, represented by a two-character, uppercase-only code from ISO 639. "EN" = English.
<HTML> Section
The <HTML>...</HTML> container encloses the entire document itself, but not the prologue; it acts as a container for every element in the document. The prologue is always presented outside and above the opening <HTML> tag, as illustrated in the skeleton document. Within the <HTML>...</HTML> container, there should be two sections: the "head" section and the "body" section.
<HEAD> Section
The <HEAD>...</HEAD> container encloses items and information that generally are not displayed in the browser's display window. In the skeleton document, notice that the <TITLE>, <BASE>, and <META> tags are included in the <HEAD>...</HEAD> container.
<TITLE> Tag
The HTML title is enclosed in the <TITLE>...</TITLE> container tag. Although not displayed in the browser display window, it is normally displayed in the title bar of the browser (in the case of LYNX, the HTML title is listed at the very top of the page as a non-scrolling element). The HTML title is also used extensively for indexing and retrieval. Most search engines, like AltaVista, give the text listed in the title some priority during indexing, and use the title text as the "title" of the item as listed on the screen of search results. Care should always be taken when constructing the HTML title: it should be unique, and the most important words should come first. The title text should be limited to 72 characters when possible; however, it is possible to extend that length if needed.
<BASE> Tag
The <BASE> tag includes an attribute identifying the URL for the current HTML document. The <BASE> tag should be used to ensure proper resolution of any "relative" URLs used in the document. The <BASE> tag includes the attribute HREF= to identify the URL. The following example gives the <BASE> tag for the Library of Congress Home Page.
Example:
<BASE HREF="http://www.loc.gov/homepage/lchp.html">
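To illustrate how the <BASE> tag affects relative URLs, here is a small sketch. The relative paths images/logo.gif and ../about.html are hypothetical, chosen only for illustration; the comments show how a browser would be expected to resolve them against the base URL given above.

```html
<HEAD>
<TITLE>Library of Congress Home Page</TITLE>
<BASE HREF="http://www.loc.gov/homepage/lchp.html">
</HEAD>
<BODY>
<!-- Resolves to http://www.loc.gov/homepage/images/logo.gif -->
<IMG SRC="images/logo.gif" ALT="Logo">
<!-- Resolves to http://www.loc.gov/about.html -->
<A HREF="../about.html">About</A>
</BODY>
```

Because resolution starts from the <BASE> URL rather than from wherever the file happens to be stored, the relative links should keep working if a copy of the page is saved or served from another location.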
<META> Tag
The <META> tag is used to convey meta-information about the document (descriptive words and phrases used extensively in indexing and retrieval), but can also be used to specify file headers (specifying language encoding, etc.) for the document. You can use either the NAME= or the HTTP-EQUIV= attribute to name the meta-information, but the CONTENT= attribute must be used in both cases.
Library of Congress pages should always contain two specific <META> tags for the purpose of providing better information to the indexing and retrieval services (including our own) that index our pages for discovery on the Web. The NAME= attribute is used to define the name of the meta element and the CONTENT= attribute is used to define the content of the named meta element.
The following examples illustrate the required uses of the <META> tags for Library of Congress pages.
Examples:
<META NAME="keywords" CONTENT="keyword keyword keyword">
<META NAME="description" CONTENT="This is a site...">
The following <META> tag is also recommended but not required.
Example:
<META NAME="author" CONTENT="Name">
An example of a <META> tag using the HTTP-EQUIV= attribute is the following:
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=windows-1251">
<BODY> Section
The <BODY>...</BODY> container tags enclose all of the text and other HTML tags that are used to display the actual document and multimedia content to the user.
There are now two types of HTML elements occurring in the BODY of an HTML document: block-level elements and inline elements.
Generally, block-level elements begin on new lines, inline elements do not.
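The difference can be seen in a short fragment. This is an illustrative sketch only; the heading text, the wording, and the link target file.html are assumptions made for the example.

```html
<!-- Block-level elements (H1, P, UL): each begins on a new line -->
<H1>Collection Overview</H1>
<P>This paragraph contains <EM>emphasized</EM> and <STRONG>strong</STRONG>
text, plus a <A HREF="file.html">link</A>; these inline elements
stay within the flow of the paragraph.</P>
<UL>
  <LI>A list is also a block-level element.
</UL>
```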
The text of HTML documents is typically in plain or ASCII text, but some non-ASCII characters can also be displayed by using a series of defined special character codes or "entities."
The basic format of a special character code is an ampersand ( & ) followed by the designated character string and ending with a semi-colon ( ; ). The ampersand instructs the Web browser to ignore the regular meaning of the following letters or numbers and insert the indicated special character instead.
In HTML 4.01, there are three methods of expressing the "character string" (the data between the & and the ;):
Using character strings:
&aacute; produces á
&egrave; produces è
Using numeric (decimal) strings:
&#225; produces á
&#232; produces è
Using numeric (hexadecimal) strings:
&#xE1; produces á
&#xE8; produces è
By far, the easiest ones to remember are the "character strings"; however, the "numeric (decimal) strings" also work well. Currently, the browser support for the "numeric (hexadecimal) strings" is very poor. In addition, although many new special character entities have been added as part of HTML 4.01, most of them are still not supported by the browsers. A complete list of these codes (also called entities) is available for use in HTML 4.01 via Entities (Web Design Group). This table shows the character entity references in HTML 4.01, along with the numeric character reference in decimal and hexadecimal. A rendering of each character reference is provided so that users may check their browsers' compliance.
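As a quick check of a browser's support, the three notations can be placed side by side. The fragment below is a minimal sketch, assuming a browser that supports all three forms; the sample sentence in the last paragraph is invented for illustration.

```html
<!-- Named, decimal, and hexadecimal forms of the same characters -->
<P>&aacute; &#225; &#xE1;</P>  <!-- all three should display á -->
<P>&egrave; &#232; &#xE8;</P>  <!-- all three should display è -->

<!-- Entities also let markup characters appear as ordinary text -->
<P>Write &lt;TITLE&gt; to show a tag name literally, and AT&amp;T for an ampersand.</P>
```

If any of the three forms on a line displays differently from the others, that form is not supported by the browser being tested.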
It is possible to include "commented" information within the HTML coding for a document. Comments can be helpful for providing information about who last updated a document, what a section of HTML coding is for, etc. HTML comments are still expressed within the opening and closing angle brackets; however, a comment begins with the "<!--" string and ends with the "-->" string. Consequently, do not use "--" (the double dash) within the comment.
<!-- An example comment -->
Maintained by the Network Development and MARC Standards Office
Links to detailed documentation on Web Design Group site are provided.