Sometimes, we don't break for spaces...

Sorting it all Out
Michael Kaplan's random stuff of dubious value

Today's blog is about a character in Unicode.

U+00a0, aka NO-BREAK SPACE, specifically.

I could have made it an Every Character Has a Story blog, almost.

Except it is really going to be about locales on Microsoft platforms, rather than a history and/or story of the character itself.

So I won't talk about the suggestion to Sri Lanka to use it in their standards, or the role Unicode gives it in displaying lone combining characters, or any of the other interesting stories about it.

Sorry!

To start, there is the regular space, U+0020, which anyone rendering text may treat as a line breaking opportunity.

If you have more characters than will fit on a line, the text will break at one of those opportunities -- perhaps on that space!

But if you put a NO-BREAK SPACE there, then it will not be used as a line breaking opportunity -- the text on either side will act as if it is just another letter or something.
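To make the difference concrete, here is a minimal sketch in Python. It leans on the standard textwrap module, which in current Python 3 releases only treats ASCII whitespace as a place to break. That is an assumption about this one wrapping engine, not a statement about how any particular Windows renderer behaves, and the wrap helper and sample strings are made up for illustration.

```python
import textwrap

NBSP = "\u00a0"  # NO-BREAK SPACE


def wrap(text, width=12):
    # break_long_words=False: only break at break opportunities,
    # never in the middle of a chunk that is too long for the line.
    return textwrap.wrap(text, width=width, break_long_words=False)


# Same text twice: once grouped with U+0020, once grouped with U+00A0.
with_spaces = "Total: " + " ".join(["100", "000", "000.00"])
with_nbsp = "Total: " + NBSP.join(["100", "000", "000.00"])

for label, text in (("U+0020", with_spaces), ("U+00A0", with_nbsp)):
    print(label)
    for line in wrap(text):
        print("   ", repr(line))
```

With the regular space the number ends up split across two lines; with the NO-BREAK SPACE the whole number moves to the next line as one unit, even though it is wider than the requested width.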

I endeavored to explain to my girlfriend what U+00a0 does, and she suggested maybe it was like how she and I were connected. That'll work. :-)

Anyhow, if you look at all of the LOCALE data in Windows, there are ~185 instances of the NO-BREAK SPACE, U+00a0.

The ~185 instances fall into two categories:

  • In ~100 of them, it is used by itself for the 50 locales that have it as their LOCALE_STHOUSAND and LOCALE_SMONTHOUSANDSEP;
  • In the other ~85 of them, it is used to act as spaces within various language names, month names, calendar, country, currency, and day names.

Now that second category makes sense.

If one has a month name of كانون الثاني, one may genuinely want to not let it span lines.

And so on.

The first category also makes sense -- one may want to make sure that the number $100 000 000.00 or 45 678.00 doesn't get split up either.

In fact, one may wonder about the ~9 cases that are similar to category #1 that use U+0020 for their LOCALE_STHOUSAND or LOCALE_SMONTHOUSANDSEP, right? :-)

You have to wonder if some or all of those ~9, and of the other ~214 cases where U+0020 is used in a category #2 fashion, are mistakes that would have been U+00a0 had someone had a chance to think about it!
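If you wanted to hunt for those suspects yourself, the check is simple enough to sketch. The snippet below is only an illustration: the locale names and separator values in SAMPLE_LOCALES are invented placeholders, not the actual Windows data, and the helper names are mine.

```python
import unicodedata

NBSP = "\u00a0"

# Placeholder data: the locale names and values below are invented for
# illustration and are NOT the actual Windows locale data.
SAMPLE_LOCALES = {
    "xx-XX": {"LOCALE_STHOUSAND": NBSP, "LOCALE_SMONTHOUSANDSEP": NBSP},
    "yy-YY": {"LOCALE_STHOUSAND": " ", "LOCALE_SMONTHOUSANDSEP": " "},
    "zz-ZZ": {"LOCALE_STHOUSAND": ",", "LOCALE_SMONTHOUSANDSEP": ","},
}


def describe(text):
    """Spell out each character as 'U+XXXX NAME' so look-alikes are visible."""
    return ", ".join(
        f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text
    )


def plain_space_separators(locales):
    """Return (locale, field) pairs whose separator is a breakable U+0020."""
    return [
        (name, field)
        for name, fields in locales.items()
        for field, value in fields.items()
        if value == " "
    ]


for name, field in plain_space_separators(SAMPLE_LOCALES):
    print(name, field, describe(SAMPLE_LOCALES[name][field]))
```

Spelling out the code point matters here, since U+0020 and U+00a0 look identical when printed.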

And then there are a few other interesting cases:

  • The ~125 cases where LOCALE_IPOSITIVEPERCENT is 0 -- Number, space, percent; for example, # %
  • The ~3 cases where LOCALE_IPOSITIVEPERCENT is 3 -- Percent, space, number; for example, % #
  • The ~124 cases where LOCALE_INEGATIVEPERCENT is 0 -- Negative sign, number, space, percent; for example, -# %
  • The ~1 case where LOCALE_INEGATIVEPERCENT is 9 -- Percent, space, number, negative sign; for example, % #-
  • The ~1 case where LOCALE_INEGATIVEPERCENT is 10 -- Percent, space, negative sign, number; for example, % -#
  • The ~77 cases where LOCALE_ICURRENCY is 2 -- Prefix, 1-character separation, for example, $ 1.1
  • The ~80 cases where LOCALE_ICURRENCY is 3 -- Suffix, 1-character separation, for example, 1.1 $
  • The ~78 cases where LOCALE_INEGCURR is 8 -- Negative sign, number, space, monetary symbol (like #5, but with a space before the monetary symbol); for example, -1.1 $
  • The ~7 cases where LOCALE_INEGCURR is 9 -- Negative sign, monetary symbol, space, number (like #1, but with a space after the monetary symbol); for example, -$ 1.1
  • The ~1 case where LOCALE_INEGCURR is 10 -- Number, space, monetary symbol, negative sign (like #7, but with a space before the monetary symbol); for example, 1.1 $-
  • The ~33 cases where LOCALE_INEGCURR is 12 -- Monetary symbol, space, negative sign, number (like #2, but with a space after the monetary symbol); for example, $ -1.1
  • The ~6 cases where LOCALE_INEGCURR is 14 -- Left parenthesis, monetary symbol, space, number, right parenthesis (like #0, but with a space after the monetary symbol); for example, ($ 1.1)
  • The ~1 case where LOCALE_INEGCURR is 15 -- Left parenthesis, number, space, monetary symbol, right parenthesis (like #4, but with a space before the monetary symbol); for example, (1.1 $)
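A few of the layouts in the list above can be written down as templates, which makes the "space" in each one easy to see. This is only a sketch of the documented layouts, not of what the Windows formatting code actually does; the dictionaries, the render helper, and the {sp} placeholder are my own names.

```python
NBSP = "\u00a0"

# A handful of the layouts listed above; {sp} marks the documented space.
POSITIVE_PERCENT = {0: "{n}{sp}%", 3: "%{sp}{n}"}
NEGATIVE_PERCENT = {0: "-{n}{sp}%", 9: "%{sp}{n}-", 10: "%{sp}-{n}"}
CURRENCY = {2: "{sym}{sp}{n}", 3: "{n}{sp}{sym}"}


def render(pattern, n, sym="$", sp=" "):
    """Fill in a layout; sp is whichever space character gets inserted."""
    return pattern.format(n=n, sym=sym, sp=sp)


# The same layout with the documented U+0020 and with U+00A0 instead.
print(repr(render(CURRENCY[3], "1.1")))           # regular space
print(repr(render(CURRENCY[3], "1.1", sp=NBSP)))  # no-break space
print(repr(render(NEGATIVE_PERCENT[10], "45")))
```

The only thing that changes between the two currency renderings is which code point sits in the {sp} slot.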

All of these cases have one thing in common.

According to the docs, they insert a SPACE (LOCALE_ICURRENCY calls it a "separation") in all of these cases, even if LOCALE_STHOUSAND or LOCALE_SMONTHOUSANDSEP contain U+00a0.

Obviously either the docs are wrong or the code creates formatted strings that can be broken across lines even though the separators clearly try to avoid this.
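Here is what the second possibility would look like, as a sketch under a few assumptions: the formatted string really does contain U+0020 before the currency symbol, the grouping separator is U+00a0, and the text is wrapped by Python's textwrap module (which only breaks at ASCII whitespace) rather than by a real Windows text layout engine. The sample strings are made up.

```python
import textwrap

NBSP = "\u00a0"


def wrap(text, width=9):
    # Only break at break opportunities, never inside a long chunk.
    return textwrap.wrap(text, width=width, break_long_words=False)


number = "45" + NBSP + "678.00"  # grouping separator is U+00A0

# LOCALE_ICURRENCY 3 layout (number, separation, symbol) built two ways.
as_documented = "Total " + number + " " + "$"   # separation is U+0020
as_hoped_for = "Total " + number + NBSP + "$"   # separation is U+00A0

for label, text in (("U+0020", as_documented), ("U+00A0", as_hoped_for)):
    print(label, wrap(text))
```

In the first case the currency symbol is stranded on a line of its own, despite the care taken inside the number; in the second the number and its symbol stay together.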

I don't know about you, but neither idea sits entirely well with me.

How about you?

I'm almost afraid to try. Almost....

Blog - Comment List
  • So what appears to be a gap between you is actually an unbreakable bond?  How extremely romantic!

  • Correct, no mere space! :-)

  • I don't know... I mostly do web programming, so I need to turn them into   before displaying anyway.

  • Oops, not realizing the blog software doesn't escape it to &nbsp; for me. :P

  • As someone who writes software that generates XSL Transforms, I'm more familiar with it as &#A0;
