Holy cow, I wrote a book!
In Windows, we spend a good amount of time with the
And one of the things that you notice is that pseudo-mirrored
text comes out looking really weird.
For example, the string
comes out pseudo-mirrored as
Just for fun, here's
here's how your browser renders it:
the IPv6 address
comes out as
(The IPv6 address was the string that
prompted this article.)
The result of the RTL IPv6 address is even weirder if you force
a line break at a particular point.
If your browser follows the Unicode Bidi algorithm,
you can resize the box below to see how the line break position
affects the rendering.
If your browser doesn't follow the Unicode Bidi algorithm,
or if you can't resize the window, here's what you get:
"Is this a bug?"
Well, maybe yes.
But mostly yes.
Windows is following the
Unicode Bidirectional Algorithm.
So the part that's not a bug is
"Windows is correctly following an international standard."
The weirdness you're seeing is just a consequence of
following the standard.
Let's look at what's going on.
When you render text in RTL context, what you're saying is
"Render this text in the form you would see it if it appeared
in a newspaper printed in an RTL language."
For illustration, we follow the convention that uppercase
characters are considered to be in an RTL script, lowercase
characters are considered to be in an LTR script,
and non-letters stand for themselves.
Say you want to render the string
"NEXT COMES john smith."
A newspaper would say,
"Well, my readership expects things to be laid out right to left.
The string 'john smith'
is a foreign name inserted into a paragraph that otherwise is written
my readers' native language.
If the name were in my readers' native language, I would render it as
.HTIMS NHOJ SEMOC TXEN
Since the name is in a foreign language, I will treat it as an
opaque 'name blob' that got inserted into
my otherwise beautiful RTL sentence."
.john smith SEMOC TXEN
(The black outline is not part of the actual output.
I am using it to highlight that the phrase john smith is
being treated as a single unit.)
This also explains why "hello." comes out as
The LTR text is treated as a blob inside an otherwise RTL sentence.
Things get weirder once parentheses and digits and more complex
punctuation marks are thrown into the mix.
the Unicode Bidirectional Algorithm has to figure out that
in the text
"IT IS A bmw 500, OK." the "500" is attached to the LTR text "bmw",
.KO ,bmw 500 A SI TI
And it also needs to work out the correct text rendering order
when you have
RTL text embedded inside LTR text,
all of which is embedded inside other RTL text,
as illustrated by the brain-teaser
"DID YOU SAY ’he said “car MEANS CAR”‘?"
But maybe the standard is buggy.
The problem is that the Unicode Bidirectional Algorithm
is designed for text,
so when you ask it to render things that aren't text
(such as IPv6 addresses and URLs),
the results can be nonsensical.
At least for the IPv6 case, you can work around the problem
by explicitly marking the IPv6 address as LTR,
so that the Unicode Bidirectional Algorithm doesn't get involved,
and the characters are rendered left-to-right in the order they
Study the Unicode Bidirectional Algorithm and explain why
comes out as
What you need to know about the bidi algorithm and inline markup.