Why should you read this?

If a browser cannot detect the character encoding used in a page, the content may be unreadable. The information in this tutorial is particularly important for those maintaining and extending a multilingual site, but declaring the character encoding of the document is important for anyone producing (X)HTML or CSS that uses non-ASCII characters: although the page may look fine to you, other people's browser settings can affect its readability. This tutorial will give you an understanding of the topic that will help you make the right choices.

Objectives

When you have finished this tutorial you should:

have a clear idea of the factors relating to the choice of encoding for (X)HTML documents, and appreciate the value of using Unicode
know when and how to declare the character encoding (charset) for documents using (X)HTML and CSS
be aware of certain problematic aspects of serving and coding (X)HTML files on older browsers that affect the above
understand what the terms byte-order mark and normalization mean, how they can affect you, and how to deal with them
understand when and how to use escapes to represent characters

About this article

This article has been reviewed by the W3C Internationalization Working Group and has gone through public review to make it as accurate as possible. If there are things that need addressing, please send us feedback using the link near the bottom of the page.

Handling character encodings in HTML and CSS

Intended audience: HTML/XHTML and CSS content authors. This material is applicable whether you create documents in an editor, or via scripting.

This tutorial gathers together and organizes pointers to articles that, taken together, help you understand how to handle the essential aspects of authoring (X)HTML and CSS related to characters and character encodings.

In a nutshell

This section is for people in a hurry who just want to know the key recommendations from the tutorial. If you don't understand something, or if you want more detail, read the rest of the tutorial.

Save your pages as UTF-8, whenever you can.

Always declare the encoding of your document. Use the HTTP header if you can. Always use an in-document declaration too. This table tells you how, depending on what format you are authoring. Use encoding names from the IANA registry.

Use the @charset rule for external style sheets (but not CSS in your HTML page) if you have non-ASCII content, such as font names, ids or class names, etc.

Try to avoid using the byte-order mark in UTF-8, and ensure that your HTML code is saved in Unicode normalization form C (NFC).

Avoid using character escapes, except for invisible or ambiguous characters. And don't use Unicode control characters when you can use markup instead.
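Two of the recommendations above, normalization form C and the equivalence of escapes and literal characters, can be illustrated in a few lines. This is a minimal sketch using only the Python standard library; the helper name `is_nfc` and the sample strings are illustrative assumptions, not part of the tutorial.

```python
import html
import unicodedata


def is_nfc(text):
    """Return True if the text is already in Unicode normalization form C."""
    return unicodedata.normalize("NFC", text) == text


# "e" followed by COMBINING ACUTE ACCENT (U+0301): looks like "é", but is two code points.
decomposed = "Cafe\u0301"
# NFC composes the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

# A numeric character escape and the literal character denote the same character.
escaped = "caf&#xE9; &amp; bar"
unescaped = html.unescape(escaped)
```

The two spellings of "Café" look identical on screen but do not compare equal until one of them is normalized, which is why saving in a single normalization form matters for things like matching ids and class names.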

The articles pointed to describe the latest thinking with respect to the HTML5 specification. It is important to note, however, that the HTML5 specification is still not stable, so you should approach that information with care.

Essential background information

If you are a newcomer to this topic, there are certain foundational concepts you need to understand if you are to follow various parts of the tutorial. If you are familiar with these concepts, you can skip to the next section.

Choosing and applying a character encoding

Content is composed of a sequence of characters. Characters represent letters of the alphabet, punctuation, etc. But content is stored in a computer as a sequence of bytes, which are numeric values. Sometimes more than one byte is used to represent a single character. Like codes used in espionage, the way that the sequence of bytes is converted to characters depends on what key was used to encode the text. In this context, that key is called a character encoding.
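The relationship between characters, bytes, and the "key" can be shown with a short sketch. The sample character and the two encodings are chosen only for illustration.

```python
# One character, two encodings, two different byte sequences.
text = "é"
utf8_bytes = text.encode("utf-8")          # two bytes: 0xC3 0xA9
latin1_bytes = text.encode("iso-8859-1")   # one byte: 0xE9

# Decoding with the key that was used to encode gives back the original character.
roundtrip = utf8_bytes.decode("utf-8")

# Decoding the same bytes with the wrong key gives different characters.
misread = utf8_bytes.decode("iso-8859-1")  # "Ã©"
```

This is the kind of garbling a reader sees when a browser guesses the wrong encoding for a page.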

There are many character encodings to choose from. This part of the tutorial offers simple advice on which character encoding to use for your content, and how to apply it.

Choosing & applying a character encoding includes the following:

How to declare a character encoding

You should always specify the encoding used for an HTML or XML page. If you don't, you risk that characters in your content are incorrectly interpreted. This is not just an issue of human readability: increasingly, machines need to understand your data too. You should also check that you are not specifying different encodings in different places.
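One way to check that the HTTP header and the in-document declaration do not disagree is sketched below. The function names and the regular expression are assumptions made for this example; the pattern is a simplification and not a full implementation of how browsers scan for a declaration. It looks only at the first 1024 bytes, which is where HTML5 expects the meta declaration to appear.

```python
import re
from email.message import Message

# Simplified pattern: matches both <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=..."> form.
META_RE = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9._:-]+)', re.IGNORECASE
)


def charset_from_header(content_type):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def charset_from_meta(html_bytes):
    """Return the lowercased charset declared in a meta element, or None."""
    match = META_RE.search(html_bytes[:1024])
    return match.group(1).decode("ascii").lower() if match else None


def declarations_conflict(content_type, html_bytes):
    """True only when both declarations are present and name different encodings."""
    header = charset_from_header(content_type)
    meta = charset_from_meta(html_bytes)
    return header is not None and meta is not None and header != meta
```

For example, `declarations_conflict("text/html; charset=ISO-8859-1", b'<meta charset="utf-8">')` reports a conflict, while a page served and declared as UTF-8 does not.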

Declaring character encodings in HTML provides quick recommendations for those who just want to be told what to do, and more detailed information for those who need it.

Declaring character encodings in CSS provides information for CSS.
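For an external style sheet, the @charset rule is only recognized when it is the very first thing in the file, written exactly as `@charset "name";`. A minimal sketch of a check for this follows; the function name is hypothetical, and the sketch does not handle a style sheet that begins with a byte-order mark.

```python
import re

# The rule must start at byte 0, with one space and double quotes around the name.
CHARSET_RULE = re.compile(rb'@charset "([^"]*)";')


def css_declared_charset(css_bytes):
    """Return the lowercased encoding named by a leading @charset rule, or None."""
    match = CHARSET_RULE.match(css_bytes)  # match() anchors at the start of the data
    if match is None:
        return None
    return match.group(1).decode("ascii", "replace").lower()
```

A rule preceded by whitespace or a comment returns `None` here, which mirrors the fact that such a rule does not act as an encoding declaration.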

The byte-order mark (BOM)

The byte-order mark, or BOM, is something you will come across when using a Unicode-based character encoding, such as UTF-8 or UTF-16. In some cases you will need to remove the BOM; in others you will need to ensure that it is there.
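Detecting and removing a UTF-8 BOM can be sketched as follows, using the constants in Python's standard `codecs` module. The helper names and the sample markup are assumptions made for this example.

```python
import codecs


def has_utf8_bom(data):
    """True if the byte string starts with the UTF-8 BOM (0xEF 0xBB 0xBF)."""
    return data.startswith(codecs.BOM_UTF8)


def strip_utf8_bom(data):
    """Return the bytes without a leading UTF-8 BOM, if one is present."""
    return data[len(codecs.BOM_UTF8):] if has_utf8_bom(data) else data


page = "<p>hi</p>".encode("utf-8")
page_with_bom = codecs.BOM_UTF8 + page

# Decoding as plain UTF-8 keeps the BOM as an invisible U+FEFF character;
# the "utf-8-sig" codec drops it while decoding.
kept = page_with_bom.decode("utf-8")
dropped = page_with_bom.decode("utf-8-sig")
```

The leftover U+FEFF in `kept` is the kind of invisible character that can cause unexpected display or processing problems when files are concatenated or included in one another.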

The byte-order mark (BOM) in HTML covers:
