What is the difference between Big Endian and Little Endian Unicode?

Sorting it all Out
Michael Kaplan's random stuff of dubious value

A very common question that comes up has much to do with the meaning of the suffixes in UTF-16LE and UTF-16BE.

It all comes back to the way processors work. When you look at a byte (like 0x41) it is easy to say you know what it is. But when looking at two bytes in a row (like 0x41 0x00) as if it were a single 16-bit WORD you have to decide if you are looking at the number 0x4100 or the number 0x0041.
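
To make that choice concrete, here is a minimal sketch in Python (an illustrative choice of language, not something this post originally used). It reads the same two bytes both ways with the standard `int.from_bytes`; the variable names are made up for the example.

```python
# The two bytes from the text, in the order they sit in memory or in a file.
raw = bytes([0x41, 0x00])

# Little endian: the first byte is the low-order byte of the WORD.
as_little = int.from_bytes(raw, byteorder="little")

# Big endian: the first byte is the high-order byte of the WORD.
as_big = int.from_bytes(raw, byteorder="big")

print(f"little endian: 0x{as_little:04X}")  # little endian: 0x0041
print(f"big endian:    0x{as_big:04X}")     # big endian:    0x4100
```

Nothing in the bytes themselves says which reading is right; the reader has to be told, or has to guess.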

I always found the clearest description came from Bruce McKinney's Hardcore Visual Basic:

Endian refers to the order in which bytes are stored. The term is taken from a story in Gulliver’s Travels by Jonathan Swift about wars fought between those who thought eggs should be cracked on the Big End and those who insisted on the Little End. With chips, as with eggs, it doesn’t really matter as long as you know which end is up.

And indeed, it is pretty crucial to know which end is up. This is especially interesting for UTF-16, which in the end is a bunch of arrays of WORDs that happen to correspond to characters in Unicode. The difference between U+0041 ("A", a.k.a. LATIN CAPITAL LETTER A) and U+4100 ("䄀", a.k.a. an ideograph in CJK Extension A that refers to calamity, disaster, evil, or misfortune) is quite striking!
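
The same thing can be sketched for text. Assuming Python's built-in `utf-16-le` and `utf-16-be` codecs (again, just an illustration), the two bytes 0x41 0x00 decode to two very different characters:

```python
# The same two bytes, decoded under each byte order.
raw = b"\x41\x00"

as_le = raw.decode("utf-16-le")  # first byte is the low-order byte
as_be = raw.decode("utf-16-be")  # first byte is the high-order byte

# Print code points rather than glyphs so the result does not depend
# on what the console can display.
print(f"UTF-16LE: U+{ord(as_le):04X}")  # UTF-16LE: U+0041
print(f"UTF-16BE: U+{ord(as_be):04X}")  # UTF-16BE: U+4100

# Going the other way, LATIN CAPITAL LETTER A encodes to swapped bytes.
le_bytes = "A".encode("utf-16-le")  # b'A\x00'
be_bytes = "A".encode("utf-16-be")  # b'\x00A'
```

So a file written as UTF-16LE and read as UTF-16BE does not fail loudly; it just turns into different characters.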

On Windows platforms, which are mostly little endian, UTF-16LE is just called "Unicode" and UTF-16BE is just called "Unicode (Big Endian)", which is much less confusing for the majority of people who do not work cross-platform.

(Speaking frankly, this does not bother me much -- anyone smart enough to be annoyed by the terminology is smart enough to know that not everyone is as smart as they are in these matters)

For more information, simple web searches with the following search string:

"big endian" "little endian"

will return enough results to keep one busy for some time...

 

This post brought to you by "䄀" (U+4100, a.k.a. an ideograph in CJK Extension A that refers to calamity, disaster, evil, or misfortune)

Blog - Comment List
  • Put me down, you Brobdingnagian blunderbuss!

    (before you delete this, this is at least tangentially on topic!)
  • I posted it, Stewie. :-)
  • Byte Order Mark time! Unfortunately, there's absolutely no standardization on when to use it, just the convention that if you don't encounter it you should assume the host endianness. And it's even funnier when you encounter BOMs in mid-text, such as when you've used cat to combine two files produced on machines of different endianness... oh, and if you're transcoding, under what conditions do you prefix, or remove, the BOM?
  • Luckily, reports of problems are fairly overblown. :-)

    If you concatenate two files then sure you *ought* to remove it, but if you proceed without removing it then all that happens is that an invisible character with zero width is there -- which does not matter.

    If they are of different endianness and a tool combines them then that is a bug in the tool -- as you should never combine two such files.
  • No one ever accused the Universal Character Set of being simple.
    Just short of 100,000 characters, many...
  • This is an issue that has been around for a long time.
    Back in February (geez, I really have been blogging...
  • We’ve seen more activity on our MSDN Forum over the past couple weeks (yay!) and there have been a few

  • "Little Endian" means that the lower-order byte of the number is stored in memory at the lowest address, and the high-order byte at the highest address. For example, a 4 byte Integer

    Byte3 Byte2 Byte1 Byte0

    will be arranged in memory as follows:

    Base Address+0 Byte0
    Base Address+1 Byte1
    Base Address+2 Byte2
    Base Address+3 Byte3

    Intel processors (those used in PCs) use "Little Endian" byte order.

    "Big Endian" means that the high-order byte of the number is stored in memory at the lowest address, and the low-order byte at the highest address. The same 4 byte integer would be stored as:

    Base Address+0 Byte3
    Base Address+1 Byte2
    Base Address+2 Byte1
    Base Address+3 Byte0

    Motorola processors (those used in Macs) use "Big Endian" byte order.

  • A little late, no? :)

    Also note that Macs are moving to Intel, not Motorola, these days...

  • Which is greater in number, little or big endian?

  • Neither is "bigger" -- they are the same number, just stored with its bytes in a different order.

  • How can I open a file once I have saved it as unicode big endian?

  • Diane, see the answer here.
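
The 4-byte layout described in the comments above can be sketched with Python's standard `struct` module. This is only an illustration: the sample value 0x0A0B0C0D and the helper name `show` are made up for the example, with Byte3 = 0x0A down to Byte0 = 0x0D.

```python
import struct
import sys

# Sample 4-byte integer: Byte3=0x0A, Byte2=0x0B, Byte1=0x0C, Byte0=0x0D.
value = 0x0A0B0C0D

# "<I" packs an unsigned 32-bit integer little endian, ">I" big endian.
little = struct.pack("<I", value)
big = struct.pack(">I", value)


def show(label, data):
    """Print each byte next to its offset from the base address."""
    print(label)
    for offset, byte in enumerate(data):
        print(f"    Base Address+{offset} 0x{byte:02X}")


show("Little Endian:", little)  # offsets 0..3 hold 0x0D 0x0C 0x0B 0x0A
show("Big Endian:", big)        # offsets 0..3 hold 0x0A 0x0B 0x0C 0x0D

# The byte order of the machine running this script.
print("This machine is", sys.byteorder, "endian")
```

Unpacking each byte string with the matching format code gives back the same number, which is the point made above: neither order is "bigger", they are two layouts of one value.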
