100% roundtrip ASCII? 100% roundtrip ANSI?

Sorting it all Out
Michael Kaplan's random stuff of dubious value
Be sure to read the disclaimer here first!


Back in January I was talking about the new compiler error C4819 and how the compiler detected invalid characters.

And anyone who has been reading here knows that the reverse solidus is always the path separator, even when it looks like a yen or a won.

So among the so-called 'ANSI' code pages, ASCII (0x00 - 0x7f) will roundtrip 100% of the time.

How many "invalid" slots are there in the 'ANSI' code pages in the 0x80 - 0xff range, exactly?
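One rough way to count those slots is to ask a set of codec tables which high bytes they refuse to decode. The sketch below uses Python's built-in codecs, which are generated from the published Unicode mapping files; that is an assumption standing in for Windows' own conversion tables, which may treat the undefined slots differently. The function name and the list of code pages are illustrative only.

```python
# Sketch: count the bytes in 0x80-0xFF that a codec table leaves undefined.
# Assumption: Python's codec tables are used as a stand-in for the Windows
# 'ANSI' code pages; the Windows conversion APIs may behave differently.

SINGLE_BYTE_CODE_PAGES = [
    "cp874", "cp1250", "cp1251", "cp1252", "cp1253",
    "cp1254", "cp1255", "cp1256", "cp1257", "cp1258",
]


def undefined_high_bytes(codec):
    """Return the byte values in 0x80-0xFF that fail a strict decode."""
    missing = []
    for value in range(0x80, 0x100):
        try:
            bytes([value]).decode(codec)
        except UnicodeDecodeError:
            missing.append(value)
    return missing


for codec in SINGLE_BYTE_CODE_PAGES:
    missing = undefined_high_bytes(codec)
    print(codec, len(missing), [hex(b) for b in missing])
```

A slot that shows up in the list is one whose byte cannot be converted to Unicode and back under these tables.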

Let's take a look at the Windows code pages:

There you have it. Code page 1256 is the only one that is guaranteed to be able to roundtrip every single code point without losing any of the bytes....
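To make "roundtrip" concrete: take every possible byte value, convert it to Unicode through a code page, convert it back, and compare. The following is a minimal sketch using Python's codec tables as an assumed stand-in for the Windows conversions; `roundtrip` is a hypothetical helper, not anything from the compiler or the OS.

```python
# Sketch: push all 256 byte values through a code page and back.
# Assumption: Python's cp1256/cp1252 tables approximate the Windows ones.


def roundtrip(data, codec, errors="strict"):
    """Decode bytes to text with the codec, then encode the text back."""
    return data.decode(codec, errors).encode(codec, errors)


payload = bytes(range(256))

# Every byte has a slot in cp1256, so a strict round trip succeeds.
via_1256 = roundtrip(payload, "cp1256")

# cp1252 has undefined slots; with errors="replace" those bytes are
# silently turned into substitutes and cannot be recovered.
via_1252 = roundtrip(payload, "cp1252", errors="replace")
changed = [hex(a) for a, b in zip(payload, via_1252) if a != b]

print("cp1256 lossless:", via_1256 == payload)
print("cp1252 lossless:", via_1252 == payload, "changed bytes:", changed)
```

The lossy case is the dangerous one, because nothing fails: the data just comes back different.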

 

This post brought to you by "¿" (U+00bf, INVERTED QUESTION MARK)

Blog - Comment List
  • 100% roundtrip ANSI is a novel concept -- I am not sure of the purpose. Anyway, I noticed you glossed over the fact that the double-byte sets are hard to quantify in the same way as the single-byte ones: you said "up to" rather than trying to get into how the lead bytes depend on the trail bytes, and how a byte like 0x30 that looks like ASCII can fail the round-trip if it is a trail byte. (geeks!)
  • The 100% roundtrip question comes up in ways to encode bytes without having to worry about what someone may or may not consider invalid (UTF-8 has rules as does UTF-16, and they get stricter all the time). Nice that the Arabic code page stands before us as a safe haven from all that meddling. :-)

    I suppose I could have written a program to go through each of the trail bytes for those lead bytes and gotten the % of invalids per lead byte and maybe averaged them, but it seemed like a strange exercise that may not yield much additional info.

    In the end, I guess I was mostly being lazy. Vacation does that....
  • You say that 1252 has 1 invalid character. I count 5.
  • Whoops, you are correct -- sorry, small transcription error!
  • I am not sure I understand.
    Roundtrip to what?
  • So, to use UTF-8 with the current/next Microsoft compiler you need to tell the OS that your locale uses Arabic code page 1256?

    And that's a "cool feature" ?
  • Nick, huh? That is not what I said.

    What I am saying is that there are times when a person may not be sure of the code page. If you are not, then assuming it is UTF-8 and converting is guaranteed to cause problems -- because illegal sequences cannot be emitted. But cp1256 is a code page you can safely roundtrip any byte sequence through without losing any bytes because all bytes are legal there.

    Obviously if you know the code page you do not need this. So the times that this is needed will hopefully be rare. But most people roundtrip data through code page 1252 when they have to do this sort of thing, which is incredibly dangerous since you can actually lose information!
  • Maybe I'm jumping ahead here...

    1. C4819 is generated for input which contains "invalid characters"

    2. Windows isn't really UTF-8 capable because, well basically because Windows wasn't very well designed twenty years ago.

    3. So to avoid C4819 you need a locale where all your 8-bit data, which Windows can't conceive of as Unicode, is "valid" even if meaningless.

    4. In this post we find out that the locale needed is cp1256, Arabic.
  • Some people may recall when I talked about how It does not always pay to be compatible. In that post...
  • Riffing on Raymond here, and his post On the fuzzy definition of a "Unicode application"....
    The points...
  • Francisco Moraes asked in the Suggestion Box:

    Are there any code pages (exception EBCDIC) where the...
  • Julien asks via the Contact link: Dear Mr Kaplan, I would like to display Japanese Characters in the

  • So it all started in a conversation with some of the folks from the SQL Server team when I was at PASS.

  • The short internal links are dead...

  • Yes, six years later, they moved everything. You have to go to the new site to find them now....
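One of the comments above mentions writing a program to go through the trail bytes for each lead byte in the double-byte code pages. A minimal sketch of that idea follows. It assumes Python's `cp932` codec as an example double-byte table (the thread does not name a specific code page), and `invalid_trail_counts` is a hypothetical helper; Windows' own tables may not match these results exactly.

```python
# Sketch: for each lead byte, count the trail bytes that do not complete
# a valid double-byte character.
# Assumption: Python's cp932 codec is used as the example table.


def invalid_trail_counts(codec="cp932"):
    """Map each lead byte to its number of invalid trail bytes (out of 256)."""
    counts = {}
    for lead in range(0x80, 0x100):
        valid = 0
        invalid = 0
        for trail in range(0x100):
            try:
                text = bytes([lead, trail]).decode(codec)
            except UnicodeDecodeError:
                invalid += 1
                continue
            if len(text) == 1:
                valid += 1      # the pair formed one double-byte character
            else:
                invalid += 1    # the two bytes decoded as separate characters
        # Treat a byte as a lead byte only if it starts at least one pair.
        if valid:
            counts[lead] = invalid
    return counts


counts = invalid_trail_counts("cp932")
for lead, bad in sorted(counts.items()):
    print(f"lead 0x{lead:02X}: {bad} of 256 trail bytes invalid ({bad / 256:.0%})")
```

Counting over all 256 trail values is a deliberate simplification; restricting the loop to the documented trail-byte ranges would give a different percentage.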
