
Creating HTML Pages in Arabic, Hebrew and Other Right-to-left Scripts

Intended audience: HTML/XHTML and CSS content authors implementing pages in right-to-left scripts such as Arabic and Hebrew, or having to deal with embedded right-to-left script text. This material is applicable whether you create documents in an editor, or via scripting.

Updated 2009-07-10 7:07

Why should you read this?

Getting bidirectional text to display correctly can sometimes appear baffling and frustrating, but it need not be so. If you have struggled with this or have yet to start, this tutorial should help you adopt the best approach to marking up your content, and explain enough of how the bidirectional algorithm works that you will understand much better the root causes of most of your problems. We will also address some common misconceptions about ways to deal with markup for bidirectional content.

Objectives

By following this tutorial you should be able to:

Right-to-left scripts are used by numerous languages, including Arabic, Hebrew, Pashto, Persian, Sindhi, Syriac, Thaana, Urdu, Yiddish, etc.

Setting document direction

Base direction

Before we start, we need to introduce an important concept.

In order for text to look right when an HTML page is displayed, we need to establish the directional context of that text. We will refer to that context as the base direction for the text.

It is fundamentally important to establish the appropriate base direction so that the bidirectional algorithm can produce the expected ordering and alignment of the displayed text.

In HTML the base direction is either set explicitly by the nearest parent element that uses the dir attribute, or, in the absence of such an attribute, the base direction is inherited from the default direction of the document, which is left-to-right (LTR).

Setting up a right-to-left page

Add dir="rtl" to the html tag any time the overall document direction is right-to-left. This sets the base direction for the whole document.

No dir attribute is needed for documents that have a base direction of left-to-right, since this is the default.

Illustration of dir attribute set in the html tag.
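As a rough sketch of the rule above, the skeleton below sets the base direction for a whole page. The doctype, the encoding declaration and the elided title and paragraph text are placeholders chosen for illustration, not taken from the original illustration.

```html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<!-- dir="rtl" on the html element sets the base direction of the whole document -->
<html dir="rtl">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>...</title>
</head>
<body>
<!-- Inherits the right-to-left base direction; no dir attribute is needed here -->
<p>...</p>
</body>
</html>
```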

Adding dir="rtl" to the html element will cause block elements and table columns to start on the right and flow from right to left. All block elements in the document will inherit this setting unless the direction is explicitly overridden.

Note, however, that this does not affect the directionality of the text in the title bar. (Depending on the browser and the platform, adding the Unicode characters U+202B (RLE) and U+202C (PDF) around the title text may order the text from right to left.)

An example page before (left) and after (right) the dir attribute is added to the html tag.

Language tags.

While you are declaring the directionality of the document in the html tag, don't forget to declare the language of the document using the lang and/or xml:lang attributes (see Specifying the language of content). However, do not make the mistake of assuming that language declarations indicate directionality, or vice versa! Even if a script subtag is used in the language attribute value, this has no implication with regard to the directionality of the text in the user agent. You must always declare the directionality using the dir attribute.
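A minimal sketch of declaring language and direction side by side, assuming an Arabic page written as XHTML. The language value ar and the elided content are illustrative assumptions.

```html
<!-- lang and xml:lang declare the language; dir declares the direction.
     Neither implies the other, so both are stated explicitly. -->
<html xmlns="http://www.w3.org/1999/xhtml" lang="ar" xml:lang="ar" dir="rtl">
<head>
<title>...</title>
</head>
<body>
<p>...</p>
</body>
</html>
```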

Behaviour in Internet Explorer.

Note that in Internet Explorer, applying a right-to-left direction in the html or body tag will affect the user interface, too.

The scroll bar will appear to the left side of the window, and JavaScript alert message boxes such as the one shown in the picture below will be mirror imaged (see the tests). (Note how the yellow icon on the JavaScript dialog box appears on the right, and the logical order of the text, <arabic> W3C <hebrew>, is displayed from right to left.) This behavior does not occur in other browsers.

Using the dir attribute in the html tag in Internet Explorer causes changes to the browser chrome.

Some speakers of languages that use right-to-left scripts prefer the directionality of the user interface to be associated with the desktop environment, not with the content of a particular document. Because of this, they may prefer not to declare the document directionality on the html or body tag. To avoid this without tagging every block element in the document, you could add a div element immediately inside the body element that surrounds all the other content in the document, and apply the dir attribute to that. The directionality will then be inherited by all other block elements in the body of the document, but will not set off the changes to the browser. If you do this, you must ensure that you add a dir attribute to the head element also, to cover its title element, attribute values, etc.
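The workaround described above might look roughly like this. The language value he and the elided content are assumptions made for the sketch.

```html
<!-- No dir on html or body, so the browser chrome is left alone -->
<html lang="he">
<!-- dir on head covers the title element and other head content -->
<head dir="rtl">
<title>...</title>
</head>
<body>
<!-- A single wrapper div carries the base direction for all body content -->
<div dir="rtl">
<p>...</p>
<p>...</p>
</div>
</body>
</html>
```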

Be logical, not visual

Visual ordering of text was common for old user agents that didn't support the Unicode bidirectional algorithm. Text was stored in the source code in the same order you would expect to see it displayed.

With logical ordering, text is stored in memory in the order in which it would normally be typed (and usually pronounced). The Unicode bidirectional algorithm is then applied by the browser to render the correct visual display.

Visually ordered bidirectional HTML does not conform to the HTML specification.

If you are lucky, you will not have to deal with this. If you are not, you should read this section.

(Visual ordering isn't really seen much for Arabic. Since the Arabic letters are all joined up there was a stronger motivation on the part of Arabic implementers to enable the logical ordering approach.)

Logical and visual storage order contrasted.

In the picture above, the phrase פעילות הבינאום, W3C is shown, at the top in blue, as it would normally appear when displayed in a right-to-left paragraph. The numbered arrows show the reading direction. You read the sequences in the order of the numbers.

The 2nd line shows the order of characters in memory in logical encoding order (assuming that the first character in memory is on the left, the next to its right, and so on). The 3rd line shows the order of characters in memory in visual encoding order (with the same assumptions about order in memory).

To make visual ordering work, in addition to writing the text backwards, you must also do such things as disabling any line wrapping, explicitly right-aligning text in paragraphs and table cells, adding explicit line breaks, and, when translating from a language that uses a left-to-right script, manually reversing the order of table columns.

Take note that, if you want to add a few words in the middle of a paragraph of visually ordered text, you would have to move text to and from every line that followed it in the paragraph.

(Note that ISO-8859-8 is used for visually encoded Hebrew. We will mention alternative logical encodings shortly.)

An example of visual code, with extra markup highlighted.

In addition, all the extra tags needed to manage the text would bloat your code and impact not only authoring time, but also bandwidth.

Note, too, that if you have in-line markup, such as emphasis or link text, that spans more than one line, you will need to mark up the text runs on both lines separately. Again, adding text before such markup in a paragraph would mean that you have to carefully change this markup to reflect the new position of the text.

The result of all this is very fragile code that is difficult to maintain.

Using logically ordered text, on the other hand, makes it almost trivial to create long paragraphs of flowing text that automatically wrap to the width of the block element. It also makes it much easier to address accessibility, using such things as screen readers.

Visual character encodings.

One last point: We always recommend that you use UTF-8 as the character encoding of your page, but if you choose to use an ISO 8859 encoding instead, you need to take some care in declaring the encoding. You declare the encoding of your content either in the HTTP header or in the meta Content-Type statement inside the document.

There are special conventions with regard to the encoding declarations used for Hebrew text that relate to the visual vs. logical ordering question. A declaration of ISO-8859-8 would indicate that the text is visually encoded. For logically-ordered content you must label ISO-encoded text as ISO-8859-8-i.
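For illustration, the in-document declarations discussed here could be written as follows. These are alternatives: only one of them would appear in a given page, and it must match the encoding the file is actually saved in.

```html
<!-- Recommended: UTF-8, logically ordered text -->
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

<!-- ISO 8859-8 Hebrew stored in logical order -->
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-8-I">

<!-- ISO 8859-8 Hebrew stored in visual order (legacy content only) -->
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-8">
```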

Changing block direction

How to mark up content

Use the dir attribute on a block element to change the base direction of content in that block. Do not use CSS (the reasons will be given later).

The picture here shows two paragraphs in a right-to-left document. Both paragraphs are identical except for the addition of dir="ltr" in the second.

The effect of using the dir attribute on a block container.

The most obvious difference is that the second paragraph is now left-aligned. However, note, in particular, that the positions of the items on each line flow in opposite directions, because the base direction has been changed. On the other hand, the characters within each word displayed are still read in the same direction. Their sequence is determined by the Unicode bidirectional algorithm, not by the dir attribute. (This will be explained more fully later.)

The following is an example of how to mark up a block element in a right-to-left document with a left-to-right base direction.

<blockquote dir="ltr" xml:lang="en" lang="en" cite="Romeo and Juliet (II, ii, 1-2)">But, soft! What light through yonder window breaks? It is the east, and Juliet is the sun.</blockquote>

Tables.

The dir attribute setting also affects the flow of columns in a table. The following picture shows instances of the same table in a right-to-left document (ie. the html tag includes dir="rtl"). The table, like other block elements, is right-aligned.

A table in a context with a right-to-left base direction.

In the table just below, the code dir="ltr" has been added to the table element, like this:

<table dir="ltr"> … </table>

Note how the order of columns has changed, how the contents of the cells are now left aligned (look at the numbers), and how the flow of words within each cell is now left-to-right (although the words themselves are still read, character by character, in the same direction).

A table in a context with a right-to-left base direction, but with dir="ltr".

What hasn't changed, however, is the alignment of the table itself within its containing block. It is still over to the right.

If, for some reason, you wanted the table to appear over on the left, you need to wrap it in something like a div element, and add the dir="ltr" to that element. This is illustrated in the third instance of the table in the picture, which is now left-aligned.

A table in a context with a left-to-right base direction.

The markup looks like this:

<div dir="ltr"><table> … </table></div>

Note that we don't have to repeat the dir attribute on the table itself.

Don't go markup crazy

Having established the base direction at the html tag level, you should not use the dir attribute on other elements unless you want to change the base direction for that element. The same applies for inline markup. Do not use inline bidi markup unless the Unicode bidi algorithm is insufficient on its own (such cases will be explained shortly).

As noted in the section How to mark up the document, occasionally you may choose not to set the dir attribute on the html element. If this is the case, you should apply the direction to another high level element, from which the direction can be inherited (see above).

Unnecessary use of the dir attribute impacts bandwidth and potentially creates unnecessary additional work for page maintenance.

The Arabic example in the following picture shows bad usage. None of the dir attributes are needed if dir="rtl" has been added to the html element.

Improper block markup.

Picture of code with dir attributes on every element.

Removing superfluous markup will significantly simplify the document, and reduce bandwidth requirements.
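The contrast can be sketched as below. The element choice and the elided content are invented for illustration and are not the code from the original picture; the head is omitted for brevity.

```html
<!-- Superfluous: every dir attribute below the html element repeats the inherited value -->
<html dir="rtl" lang="ar">
<body dir="rtl">
<h1 dir="rtl">...</h1>
<p dir="rtl">...</p>
<ul dir="rtl">
<li dir="rtl">...</li>
</ul>
</body>
</html>

<!-- Equivalent and simpler: the base direction is set once and inherited -->
<html dir="rtl" lang="ar">
<body>
<h1>...</h1>
<p>...</p>
<ul>
<li>...</li>
</ul>
</body>
</html>
```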

Bidi algorithm basics

This section covers the basics of the bidirectional algorithm in sufficient detail to clarify the recommendations that follow. It is by no means exhaustive, but at the same time, understanding these concepts will probably help you understand and deal with most problems you face.

Base direction (directional context)

Before we go any further, let's repeat the description of a fundamentally important concept for all that follows. The result of the bidirectional algorithm will depend on the overall base direction of the paragraph, block or page in which it is applied. It establishes a directional context which the bidi algorithm refers to at various points to decide how to handle the text.

In HTML the base direction is either set explicitly by the nearest parent element that uses the dir attribute, or, in the absence of such an attribute, is inherited from the default direction of the document, which is left-to-right (LTR).

To set the default direction of the whole HTML document to right-to-left, add dir="rtl" to the html tag. This will mean that all elements in the document will inherit a base direction of RTL, unless the dir attribute is used on an element to change the base direction within that element's scope.

Characters and directional typing

We already know that a sequence of Latin characters is rendered (ie. displayed) one after the other from left to right (we can see that in this paragraph). On the other hand, the bidi algorithm will render a sequence of strongly typed RTL (right-to-left) characters one after the other from right to left.

Directional typing.

Examples of directionally typed words.

This is independent of the current base direction, and works because each character in Unicode has an associated directional property. Most letters are strongly typed as LTR. Letters from right-to-left scripts are strongly typed as RTL.

Directional runs

When text with different directionality is mixed inline, the bidi algorithm makes a separate directional run out of each sequence of contiguous characters with the same directionality.

So in the following example there are three directional runs:

Directional runs.

Left-to-right ordered directional runs: bahrain مصر kuwait.

Another way of looking at this is that changes in direction mark the boundaries of directional runs.

Note that you don't need any markup or styling to make this happen.

Here's the important bit: the order in which directional runs are displayed across the page depends on the current base direction.

In the example above, which has an overall context (ie. base direction) of LTR, you would read 'bahrain', then 'مصر', then 'kuwait'.

Directional runs with LTR base direction.

Left-to-right ordered directional runs: bahrain مصر kuwait.

If you change the directional context of the example above by specifying that the html element or a parent element is RTL, you will change the order of the directional runs.

Directional runs with RTL base direction.

Left-to-right ordered directional runs: bahrain مصر kuwait.

The characters in both cases are stored in memory in exactly the same order, but the visual ordering of the directional runs, when displayed, is reversed.
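A minimal sketch of this effect: the two paragraphs below contain identical characters in identical order, and only the dir attribute differs. The comments describe the rendering that the bidi algorithm is expected to produce.

```html
<!-- LTR base direction: the runs appear left to right as bahrain, مصر, kuwait -->
<p dir="ltr">bahrain مصر kuwait</p>

<!-- RTL base direction: the same runs are laid out right to left,
     so 'bahrain' appears on the right and 'kuwait' on the left -->
<p dir="rtl">bahrain مصر kuwait</p>
```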

Neutral characters

Spaces and punctuation are not strongly typed as either LTR or RTL in Unicode, because they may be used in either type of script. They are therefore classed as neutral or weak characters. Characters are usually classified as 'weak' when they are associated with numbers. A small number of punctuation characters are initially classed as weak, but in a non-numeric context are treated like neutrals. In consequence, in this tutorial we will refer to all punctuation as neutral characters.

This is where things begin to get interesting. When the bidi algorithm encounters characters with neutral directional properties (such as spaces and punctuation) it works out how to handle them by looking at the surrounding characters.

A neutral character between two strongly typed RTL characters will be treated as a RTL character itself, and will have the effect of extending the directional run. This is why the three Arabic words in the following LTR phrase (including the intervening spaces, which as neutrals take on the direction of the surrounding characters) are read from right to left as a single directional run. (The arrows show the reading order.)

Basic behaviour of neutral characters.

Arabic words in an English sentence: The title is مفتاح معايير الويب in Arabic.

Note that you still don't need any markup or styling for this. And that there are still only three directional runs here.

The really interesting part comes when a space or punctuation falls between two strongly typed characters with different directionality, ie. at the boundary between directional runs. In such a case the neutral character (or characters) will be treated as if they have the same directionality as the base direction.

Even if there are several neutral characters between the two different strongly typed characters, they will all be treated in the same way.

The implications of all this will become clearer as we work through the examples in the next section.

Summary.

Between 2 characters with same strong typing:
same directionality.

Between 2 characters with different strong typing:
directionality of base direction (ie. context).

Numbers

Numbers in RTL scripts run left-to-right within the right-to-left flow, and the bidi algorithm handles them a little differently from words. They are said to have weak directionality. The two examples in the picture illustrate this difference.

Numbers.

one two ثلاثة 1234 خمسة  AND  one two ثلاثة ١٢٣٤ خمسة

The first example uses European digits, '1234', the second expresses the same number using Arabic-Indic digits, ١٢٣٤. In both cases, the digits in the number are read left-to-right.

Because it is weakly typed, the number is seen as part of the Arabic text, so the two Arabic words that surround the number are treated as part of the same directional run - even though the sequence of digits runs LTR on screen.

Note also that, alongside a number, certain otherwise neutral characters, such as currency symbols, will be treated as part of the number rather than a neutral. There are some other slight differences in the way numbers are handled that we don't need to discuss here.
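The two number examples can be reproduced with plain paragraphs and no bidi markup. This is a sketch built from the text of the figure description, in an assumed LTR document.

```html
<!-- European digits: the number is read left to right, yet stays inside
     the single RTL run formed with the Arabic words around it -->
<p>one two ثلاثة 1234 خمسة</p>

<!-- Arabic-Indic digits: the digits are likewise read left to right -->
<p>one two ثلاثة ١٢٣٤ خمسة</p>
```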

Mixing text direction inline

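There are three main scenarios that cause problems when dealing with bidirectional inline text. These are:

neutrals that appear at the wrong side of a directional run

nested base directions

adjacent, same-direction directional runs that are incorrectly ordered

We look at these scenarios here and propose some solutions.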

Neutrals that appear at the wrong side of a directional run

We have seen that the bidirectional algorithm can cope well with a single level of bidirectional text, and that you could produce the result below without any additional markup or intervention:

Directional runs.

Arabic words in an English sentence: The title is مفتاح معايير الويب in Arabic.

Unfortunately, neutrals between different directional runs can sometimes be misinterpreted. Let's type some punctuation at the end of the Arabic phrase in the last example. By default we will see the following:

Incorrectly placed exclamation mark.

An exclamation mark appearing to the right of Arabic text.

The quotation marks look OK, but the exclamation mark is in the wrong position. It should appear at the end of the Arabic text, ie. to the left, like this:

Correctly placed exclamation mark.

An exclamation mark appearing to the left of Arabic text.

Given our understanding of the bidi algorithm we can easily understand why this happened. Because the exclamation mark was typed in between the last RTL letter 'ب' (on the left)‌ and the LTR letter 'i' (of the word 'in') its directionality is determined by the base direction of the paragraph, ie. LTR in this case. (Note that it makes no difference that there are actually two punctuation characters and a space in this position - they are all neutrals and so are all affected the same way.)

Because the exclamation mark is seen as LTR it joins the directional run that includes the text 'in Arabic'.
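The problem case can be reproduced with markup along these lines. This is a sketch reusing the Arabic phrase from the earlier example; in the source the exclamation mark is typed immediately after the last Arabic letter, though an editor may display it elsewhere.

```html
<!-- No bidi markup: the exclamation mark and closing quote sit between
     an RTL letter and the LTR word 'in', so they take the LTR base
     direction and are displayed to the right of the Arabic phrase -->
<p>The title is "مفتاح معايير الويب!" in Arabic.</p>
```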

So how do we get the punctuation in the right place? We'll explain in a moment, but first let's take a look at another common problem.

Nesting base directions

If you have a situation where embedded text, such as a quotation, is also bidirectional, then you will need help. The next picture shows a Latin sentence that contains a Hebrew quote which, in turn, contains both Hebrew and Latin text. This is how it would appear if you rely solely on the bidirectional algorithm.

Incorrect ordering in embedded text.

Incorrectly ordered directional runs, because no embedding.

The order of the two Hebrew words is correct, but because the text 'W3C' is part of the Hebrew phrase, it should appear on the left hand side of the quotation and the comma should appear between the Hebrew text and 'W3C'. In other words, the desired result is:

Correct ordering of embedded text.

Correctly ordered directional runs, via embedding.

The problem arises because the directional flows are being ordered according to the LTR base direction of the paragraph. Inside the Hebrew quotation, however, the correct default ordering should be RTL.

To resolve this problem we need to explicitly change the base direction of the embedded phrase (ie. open a new embedding level).

Note: The examples shown here are fairly simple. Such sentences could commonly have more than two directional runs, in which case the issue is more obvious. Take, for instance, the following example where the top line shows the expected rendering, but the second line shows the default treatment using just the bidi algorithm.

A more complicated example of embedded text.

More complicated embedded text.

A simple solution

A simple way to resolve both of the problems we just mentioned is to explicitly change the base direction of the embedded phrase. In HTML this would be done by enclosing the quotation in markup and assigning it a directionality of RTL using the dir attribute.

<p>The title is "<span dir="rtl" lang="ar"> ... !</span>" in Arabic.</p>

<p>The title says "<span dir="rtl" lang="he">...</span>" in Hebrew</p>

The editing environment you use may not show the exclamation mark in the right place in the code source, but it should look right when displayed.

Note carefully how the span tag falls inside the quote marks - these are part of the surrounding English text.

Note also that this is likely to be a simple solution because it is likely that there is already markup around the embedded phrase. Such markup may be used to semantically label the text, to add language markup, or perhaps add a class attribute to apply appropriate styling. In the example above, a span element was used to declare the language, so adding the dir attribute is simple.

What if I can't use markup?

There are some situations where you may not be able to use the markup described in the previous section. In HTML these include the title element and any attribute value.

In these situations you can use invisible Unicode characters that produce the same results.

To replicate the effect of the markup described in the example above related to nested base directions, we can use pairs of characters to surround the embedded text. The first character is one of U+202B RIGHT-TO-LEFT EMBEDDING (RLE) or U+202A LEFT-TO-RIGHT EMBEDDING (LRE). This corresponds to the markup <span dir="rtl"> or <span dir="ltr">, respectively. The second character is U+202C POP DIRECTIONAL FORMATTING (PDF). This corresponds to the </span> in the markup. Below you can see how to apply this to the previous example.

<p>The title says "&#x202B;...&#x202C;" in Hebrew</p>

Because the characters are invisible you may prefer to actually type in a numeric character reference, as we have here.

These control characters should only be used for inline phrases, not for block elements such as paragraphs. In general, it is recommended that you use markup where it is available, rather than these character pairs, because it is easier to see and therefore manage the markup, and it is consistent with the approach used for block elements. Where markup is not available, of course, this is the only option.
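In the places where markup is unavailable, the same pair of characters might be applied as sketched below. The elided text, the image file name and the attribute choice are illustrative assumptions, and, as noted earlier for the title bar, how the result is displayed depends on the browser and platform.

```html
<!-- title cannot contain a span, so RLE ... PDF wraps the embedded text -->
<title>&#x202B;...&#x202C;</title>

<!-- The same applies to attribute values; 'logo.png' is a hypothetical file name -->
<img src="logo.png" alt="&#x202B;...&#x202C;">
```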

When it comes to dealing with the misplaced neutrals described earlier, you can use the same approach. There is, however, a simpler alternative that works for cases such as the one shown, and is in fact recommended by the Unicode Standard rather than a pair of controls for these simple cases.

This involves placing an invisible, strongly-typed RTL Unicode character, after the exclamation mark. This puts our neutral punctuation between two strongly typed RTL characters, which results in the neutral becoming RTL too, and therefore the exclamation mark becomes a continuation of the right-to-left directional run.

The character designed for this purpose is the Unicode character U+200F RIGHT-TO-LEFT MARK (RLM). There is also a similar character, U+200E LEFT-TO-RIGHT MARK (LRM).

<p>The title is " ... !&#x200F;" in Arabic.</p>

Note that in the example just shown the Arabic text is no longer marked up for language or styling. Also, because the character is invisible you may prefer to actually type in a numeric character reference (&#x200F;) as we did here, or a character entity (such as &rlm;).

Adjacent, same-direction directional runs that are incorrectly ordered

Neutrals between directional runs of the same direction can also sometimes be misinterpreted. In our next example the list order is incorrect. The first two Arabic words should be reversed and the intervening comma, which is part of the English text, should appear immediately to the right of the first word.

Neutrals between same direction text may be incorrectly interpreted as part of a single run.

View code.Bahrain appears to the left of Egypt in this list.

What was wanted was:

The way the sentence should have looked.

View code.Egypt appears to the left of Bahrain in this list.

The reason for the failure is that, with a strongly typed right-to-left (RTL) character on either side, the bidirectional algorithm sees the neutral comma as part of the Arabic text. It is interpreting the first two Arabic words and the comma as a list in Arabic. In fact it is part of the English text, and should mark the boundary of two directional runs in Arabic.

In the previous section the neutral character thought it was part of the directional context established by the base direction, but wasn't; in this section the neutral character thinks it is part of the directional run, when it is really part of the overall context! No-one said life was simple...

Putting markup around the comma is a bit like cracking an egg with a hammer in this case.

A simple solution is to use an invisible, strongly-typed LTR Unicode character, the LRM, next to the comma. This puts our neutral punctuation between strongly typed RTL and LTR characters and forces it to take on the directionality of the base direction, which is the left-to-right of the English text. That breaks the Arabic words into two separate directional runs, which are then ordered LTR in accordance with the base direction of the paragraph.

In the following example an escaped version of the character has been added after the comma and the result looks fine:

<p>The names of these states in Arabic are ...,&#x200E; ... and ... respectively.</p>

More examples

The examples we have used so far have been English and LTR based. The same principles apply for RTL text in languages such as Hebrew and Arabic.

Example: Unfortunately, on its own the bidirectional algorithm creates a real mess of the following text, which is in a right-to-left paragraph. (The red superscript numbers are just part of the diagram, not the text, and are there to identify the parentheses.)

Parentheses and Latin text incorrectly ordered.

Here's what we ought to see.

Parentheses and Latin text correctly ordered.

This problem was about adjacent, same-direction directional runs that are incorrectly ordered.

Although it may not be immediately obvious, actually the solution is trivial. Just insert an RLM after 'W3C' and you're done. It's really that simple!

If you're not convinced, here's the explanation. Unfortunately this will take a little longer to write than the fix.

Initially the parenthesis labelled 1 was between two LTR-typed characters, so its directionality was also LTR. This makes 'W3C (World Wide Web Consortium' a single directional run. (Don't worry about the shape of the parentheses for now - this will be explained shortly.)

The insertion of the RLM after 'W3C' changes the directionality of the parenthesis. Now it is between strongly-typed LTR and RTL characters, and so it takes on the directionality of the base direction, ie. the RTL direction of the paragraph as a whole. This makes sense if you see the parentheses as part of the Hebrew sentence syntax. The other parenthesis is also RTL, since it already appears between Latin and Hebrew characters.

This means we now have three directional runs at the beginning of the text. In memory they are ordered as follows: 'W3C', a RTL parenthesis, and the text "World Wide Web Consortium". Since the base direction for this paragraph is right-to-left, those runs are ordered from right to left - giving us the order we expect.
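In markup, the fix might look like the sketch below. The Hebrew remainder of the sentence is not available in this text, so it is left elided; the paragraph attributes are assumptions.

```html
<!-- &rlm; (U+200F) directly after 'W3C' puts the opening parenthesis
     between an RTL and an LTR character, so it takes the RTL base direction -->
<p dir="rtl" lang="he">W3C&rlm; (World Wide Web Consortium) ...</p>
```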

Example: The picture below shows what you are likely to see when relying solely on the bidirectional algorithm to display a MAC address number in a right-to-left context.

MAC address incorrectly ordered.

The next picture shows the expected result.

MAC address correctly ordered.

This is particularly worrisome, since it's not obvious that the non-hinted order is incorrect.

Although there are more characters involved, this problem is about neutrals that appear at the wrong side of a directional run. The same approach can be used to fix this. You can either put markup around the MAC address and use dir to set a different base direction, or you can put an LRM at the beginning of the number.
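A sketch of the markup option, using an invented MAC address and elided right-to-left text:

```html
<!-- The span gives the address its own LTR base direction, so the
     colon-separated groups keep their left-to-right order -->
<p dir="rtl" lang="he">... <span dir="ltr">00:0A:95:9D:68:16</span></p>
```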

Example: The picture below shows the unexpected result of displaying a telephone number in a right-to-left context, where the area code is surrounded by parentheses, and where the number appears at the beginning of a line or after some right-to-left text.

Telephone number area code incorrectly ordered.

The next picture shows what you expect to see.

Telephone number area code correctly ordered.

Because these are numbers, the order applied by the bidirectional algorithm is slightly different from what we've seen before, but the fix is essentially the same. The correct rendering can be produced by adding a LRM character or escape just before the first parenthesis, or by using markup around the number and setting the base direction to LTR.
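A sketch of the character-based option, with an invented telephone number and elided right-to-left text. The &lrm; entity stands for U+200E LEFT-TO-RIGHT MARK.

```html
<!-- &lrm; just before the first parenthesis gives the parentheses and
     digits a strongly typed LTR character to attach to -->
<p dir="rtl" lang="he">... &lrm;(123) 456-7890</p>
```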

Mirrored characters

You may have noticed that, in addition to changing position, one of the parentheses in the previous example actually changed shape, too. This was completely automatic and happens because these characters are what are known as mirrored characters in Unicode.

Mirrored characters.

Illustration of matching parentheses.

The shape of a mirrored character when displayed depends on whether it is displayed in an LTR or RTL context. You do not have to change the character.

This means that, whether inputting content in Arabic/Hebrew or Latin script, you would use the same LEFT PARENTHESIS character at the beginning of the parenthesized text. In other words, read the word 'left' in the name of a mirrored character as meaning 'opening', and 'right' as meaning 'closing'.
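
To make this concrete, here is a small sketch in which the parentheses are written as numeric character references so that the code points are visible. The "..." in the second paragraph stands for right-to-left text and is a placeholder, not real content.

```html
<!-- U+0028 LEFT PARENTHESIS opens and U+0029 RIGHT PARENTHESIS closes in both paragraphs -->
<p dir="ltr" lang="en">W3C &#x28;World Wide Web Consortium&#x29;</p>
<p dir="rtl" lang="he">... &#x28;...&#x29;</p>
```

The stored characters are the same in both paragraphs; only the displayed glyphs are mirrored in the right-to-left one.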

The 'missing space' phenomenon

Spaces between directional runs of text may appear to collapse at the boundary of an embedding if there is a space just before the end tag of the inline element that surrounds the embedded text. Here is an example.

An example of text that is apparently missing a space.

Here is the source text that produced that result.

<p>The title is <span dir="rtl" lang="he">... W3C </span> in Hebrew.</p>

Note carefully the space between the C of W3C and the < of the following </span>. This is what causes the effect. If you simply eliminate that space, you get what you expected, which is what is shown next.

Text looking normal.

Although this seems paradoxical, that an extra space can cause a missing space, it is not a bug. For a detailed explanation of why this happens see the article Bidi space loss.

The solution to this problem is to remove all whitespace from before the end tag of an inline element that changes the base direction.
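
Applied to the source text shown earlier, the fix looks like this. The "..." is the same elided Hebrew text as in the original example.

```html
<!-- Before: the space ahead of the end tag causes the apparent missing space -->
<p>The title is <span dir="rtl" lang="he">... W3C </span> in Hebrew.</p>

<!-- After: no whitespace before the end tag; the space after it is kept -->
<p>The title is <span dir="rtl" lang="he">... W3C</span> in Hebrew.</p>
```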

Overriding the bidi algorithm

There may be occasions where you don't want the bidi algorithm to do its reordering work at all. In these cases you need some additional markup to surround the text that you do not want reordered.

In HTML and XHTML 1.0 this is achieved using the inline bdo element. Again, there are Unicode control characters you could use to achieve the same result, but because they create states with invisible boundaries this is not recommended.

Using the bdo element.

Shows Hebrew text in the order stored in memory.

The example in the picture shows Hebrew text in the order in which it is stored in memory, and uses the bdo element to achieve that effect, ie.

<p><bdo dir="ltr"> ... </bdo></p>

What about CSS or Unicode control characters?

HTML or XHTML-served-as-text/html

CSS 2.0 provides properties to specify bidirectional behaviour. These include:

unicode-bidi: embed/bidi-override

direction: ltr/rtl

Often people writing Arabic or Hebrew pages are inclined to rely on CSS associated with an element such as p or span to achieve the desired effect.

This is wrong on several counts:

  1. You should use the dedicated markup provided, ie. the dir attribute and the bdo element, in order to preserve the information even if stylesheets are not supported by a particular user agent.

  2. The HTML specification specifies the expected behaviour of user agents dealing with bidi markup. In other words, the user agent should know how to deal with the bidi markup defined by HTML without the need for any expression using CSS syntax.

  3. The CSS specification actually recommends the use of dedicated bidi markup, and says that conforming HTML user agents may ignore CSS bidi properties. This implies that if you rely on CSS for bidirectional behaviour in HTML or XHTML, you may not always get the results you expected.

The conclusion is that for content that is treated as HTML by the user agent, you should use the markup provided and not use CSS at all for producing correct bidirectional display.
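
The contrast can be sketched as follows, with "..." standing for the text content. The pairings assume the usual correspondence between the bidi markup and the CSS properties listed above.

```html
<!-- Not recommended in HTML: directionality is lost if the style sheet is not applied -->
<span style="direction: rtl; unicode-bidi: embed">...</span>
<span style="direction: ltr; unicode-bidi: bidi-override">...</span>

<!-- Recommended: the dedicated markup carries the same information -->
<span dir="rtl">...</span>
<bdo dir="ltr">...</bdo>
```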

For more information on this topic see CSS vs. markup for bidi support on the W3C Internationalization site.

Unicode characters or markup?

Unicode provides special, invisible formatting codes to build on or override the outcome of the bidirectional algorithm in plain text, in the same way as the HTML markup described in this tutorial.

The control characters that create the same effect as markup for bidirectional text are listed in the following table, alongside the markup each one corresponds to:

Character   Code     Equivalent markup
RLE         U+202B   dir="rtl"
LRE         U+202A   dir="ltr"
RLO         U+202E   <bdo dir="rtl">
LRO         U+202D   <bdo dir="ltr">
PDF         U+202C   end tag of the element carrying dir, or </bdo>

Both Unicode in Markup Languages and the HTML 4.01 specification advise against using these when markup is available, and they particularly advise against mixing control codes and markup. The main reason for this is that in most editors and source text views the codes are invisible, so it is difficult to tell where they are if you need to change the text, and it is easy to inadvertently overlap controls or end up with an odd number.

For more information on this topic see (X)HTML & bidi formatting codes vs. markup.

There are, however, some situations where Unicode control characters provide the only means to express directionality. One such is in the title element at the top of the page. This element is defined to support only characters, no markup. It is therefore not possible to use the dir attribute or the bdo element on a part of the title text.

Attribute text, too, cannot be marked up for directionality, so Unicode control characters have to be used to indicate directionality.

Note that other things, such as language, cannot be marked up in these constructs either.
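
As an illustration of those two cases, here is a minimal sketch that uses numeric character references for the control characters so that they remain visible in the source. The file name logo.png is hypothetical, and "..." stands for right-to-left text.

```html
<!-- RLE (U+202B) opens a right-to-left embedding; PDF (U+202C) closes it -->
<title>&#x202B;... W3C&#x202C; in Hebrew</title>

<!-- The same pair of control characters inside attribute text -->
<img src="logo.png" alt="&#x202B;... W3C&#x202C; in Hebrew">
```

Each RLE must be matched by a PDF; with escapes rather than raw characters it is easier to check that the pairs are balanced.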

Author: Richard Ishida, W3C.

Content first published 2005-03-22. Last substantive update 2009-07-10 7:07 GMT. This version 2011-05-04 7:34 GMT

For the history of document changes, search for tutorial-bidi-xhtml in the i18n blog.