For those who enjoy mathematics (or, 'Also new in Vista')

Sorting it all Out
Michael Kaplan's random stuff of dubious value


Another one of those "new in Vista" posts. :-)

Unicode has added a great deal to support mathematics, from Unicode Technical Report 25 (Unicode Support for Mathematics) to the various Mathematical subranges in Unicode (see the Mathematical Symbols column in the Code Charts for Symbols and Punctuation).

My favorite range is the Mathematical Alphanumeric Symbols block in Unicode, which spans U+1d400 to U+1d7ff (almost 1000 characters in all, with some holes that were left in, as you can see from the code chart).
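
The unassigned spots in that range are easy to see for yourself. As a rough sketch (not part of any Unicode or Windows tooling), the following Python walks the block with the standard `unicodedata` module and collects the code points that have no character name. The helper name `is_assigned` is made up for illustration, and the exact number of holes depends on the Unicode version bundled with your Python, so no count is claimed here.

```python
import unicodedata

# Block boundaries as given in the post.
BLOCK_START, BLOCK_END = 0x1D400, 0x1D7FF


def is_assigned(code_point):
    """Hypothetical helper: treat a code point as assigned if it has a name."""
    return unicodedata.name(chr(code_point), None) is not None


block_size = BLOCK_END - BLOCK_START + 1
holes = [cp for cp in range(BLOCK_START, BLOCK_END + 1) if not is_assigned(cp)]

print(f"block size: {block_size}")
print(f"unassigned: {len(holes)}")
print("first hole: U+%04X" % holes[0])
```

The first hole reported should be U+1D455, the slot where an italic small h would go; that letter already existed elsewhere in Unicode as U+210E PLANCK CONSTANT.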

Why is it my favorite?

Well, I was having a conversation a few years back with Murray Sargent of Microsoft (one of the representatives of MS at Unicode Technical Committee meetings and a co-author of UTR #25). He was explaining why Unicode, which is generally speaking a plain text standard, was going to approve a block of characters that included many different letters and numbers with bold, italic, and other variations usually reserved for "rich text" outside the scope of Unicode.

"It is all about mathematics, and representing it in plain text," he explained. And he has a point; while I may use bold or italicized text for emphasis, in mathematics there is actual semantic meaning that is expressed in symbols and variables that have such attributes.

At that point, thinking about collation, I asked him if there was ever a time that it would be interesting or important to fold those differences together, for all of the following:

  • U+0041 ("A", LATIN CAPITAL LETTER A)
  • U+0061 ("a", LATIN SMALL LETTER A)
  • U+1d400 ("𝐀", MATHEMATICAL BOLD CAPITAL A)
  • U+1d41a ("𝐚", MATHEMATICAL BOLD SMALL A)
  • U+1d434 ("𝐴", MATHEMATICAL ITALIC CAPITAL A)
  • U+1d44e ("𝑎", MATHEMATICAL ITALIC SMALL A)
  • U+1d468 ("𝑨", MATHEMATICAL BOLD ITALIC CAPITAL A)
  • U+1d482 ("𝒂", MATHEMATICAL BOLD ITALIC SMALL A)
  • U+1d49c ("𝒜", MATHEMATICAL SCRIPT CAPITAL A)
  • U+1d4b6 ("𝒶", MATHEMATICAL SCRIPT SMALL A)
  • U+1d4d0 ("𝓐", MATHEMATICAL BOLD SCRIPT CAPITAL A)
  • U+1d4ea ("𝓪", MATHEMATICAL BOLD SCRIPT SMALL A)
  • U+1d504 ("𝔄", MATHEMATICAL FRAKTUR CAPITAL A)
  • U+1d51e ("𝔞", MATHEMATICAL FRAKTUR SMALL A)
  • U+1d538 ("𝔸", MATHEMATICAL DOUBLE-STRUCK CAPITAL A)
  • U+1d552 ("𝕒", MATHEMATICAL DOUBLE-STRUCK SMALL A)
  • U+1d56c ("𝕬", MATHEMATICAL BOLD FRAKTUR CAPITAL A)
  • U+1d586 ("𝖆", MATHEMATICAL BOLD FRAKTUR SMALL A)
  • U+1d5a0 ("𝖠", MATHEMATICAL SANS-SERIF CAPITAL A)
  • U+1d5ba ("𝖺", MATHEMATICAL SANS-SERIF SMALL A)
  • U+1d5d4 ("𝗔", MATHEMATICAL SANS-SERIF BOLD CAPITAL A)
  • U+1d5ee ("𝗮", MATHEMATICAL SANS-SERIF BOLD SMALL A)
  • U+1d608 ("𝘈", MATHEMATICAL SANS-SERIF ITALIC CAPITAL A)
  • U+1d622 ("𝘢", MATHEMATICAL SANS-SERIF ITALIC SMALL A)
  • U+1d63c ("𝘼", MATHEMATICAL SANS-SERIF BOLD ITALIC CAPITAL A)
  • U+1d656 ("𝙖", MATHEMATICAL SANS-SERIF BOLD ITALIC SMALL A)
  • U+1d670 ("𝙰", MATHEMATICAL MONOSPACE CAPITAL A)
  • U+1d68a ("𝚊", MATHEMATICAL MONOSPACE SMALL A)

At first, Murray thought I was trying to make them all equal, and objected strenuously to that; luckily I had something different in mind. I pointed out some scenarios:

  • Creating an index or list in a paper and wanting to sort these mathematical elements in the same order as the letters
  • Trying to search for one of the elements without needing to type the specific code point

And he definitely saw the benefit to such a collation.

So, after this conversation (and a few others with various math experts), a special LCID is being added in Vista:

0x0001007f (MAKELCID(MAKELANGID(LANG_INVARIANT, SUBLANG_NEUTRAL), SORT_INVARIANT_MATH))

It is an alternate sort for the invariant locale, because mathematics is independent of specific locale (kind of like invariant is!).
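
For anyone curious how that number is put together, here is a small sketch of the arithmetic. The constant values and bit layout are restated in Python on the assumption that they match the Windows SDK headers (language ID in the low 16 bits with the sublanguage shifted left by 10, sort ID starting at bit 16). The lower-case helper names are stand-ins for the real C macros, not an actual API.

```python
# Values assumed to match the Windows SDK definitions of the same names.
LANG_INVARIANT = 0x7F
SUBLANG_NEUTRAL = 0x00
SORT_INVARIANT_MATH = 0x1


def make_lang_id(primary, sub):
    """Stand-in for MAKELANGID: sublanguage in bits 10-15, primary in bits 0-9."""
    return (sub << 10) | primary


def make_lcid(lang_id, sort_id):
    """Stand-in for MAKELCID: sort ID in bits 16-19, language ID in bits 0-15."""
    return (sort_id << 16) | lang_id


lcid = make_lcid(make_lang_id(LANG_INVARIANT, SUBLANG_NEUTRAL), SORT_INVARIANT_MATH)
print(f"{lcid:#010x}")  # 0x0001007f
```

The low word (0x007f) is the invariant locale itself; the 1 in the next position is what selects the alternate math sort.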

This locale causes each of the above letters to be a mere secondary and/or tertiary difference away from everything else on the list. The same principles were applied to all of the Greek letters and numbers in the block.
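
To get a feel for what "a mere secondary and/or tertiary difference" means in practice, here is a rough Python sketch. It is not the Windows collation algorithm. It swaps in Unicode NFKC normalization, which maps each mathematical letter to its plain counterpart, to build a three-level key: the folded letter decides first, then case, then the exact code point. The function name `math_fold_key` is hypothetical, and the ordering within a letter will not necessarily match what Vista produces.

```python
import unicodedata


def math_fold_key(text):
    """Hypothetical three-level key: base letters, then case, then code points."""
    plain = unicodedata.normalize("NFKC", text)  # math bold A -> "A", etc.
    return (plain.casefold(), plain, text)


# The "A"/"a" variants from the list above: 13 styles, 52 letters per style,
# with the small letter 26 positions after the capital.
a_variants = [
    chr(0x1D400 + 52 * style + offset)
    for style in range(13)
    for offset in (0, 26)
]
all_fold_to_a = all(math_fold_key(ch)[0] == "a" for ch in a_variants)

# U+1D41B is bold small b, U+1D400 is bold capital A, U+1D434 is italic capital A.
words = ["\U0001D41B", "b", "\U0001D400", "a", "B", "\U0001D434", "A"]
sorted_words = sorted(words, key=math_fold_key)

print(all_fold_to_a)
print(" ".join(f"U+{ord(w):04X}" for w in sorted_words))
```

The mathematical letters end up next to the ordinary letters they resemble instead of being pushed to the end of the list, and none of them compare as equal, which is the distinction Murray cared about. Comparing only the first element of the key gives the loose match you would want for searching.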

Please note that this is not something that can be selected in Regional and Language Options as a locale (neither can invariant, so obviously an alternate sort of invariant cannot be chosen). But it can be used in any programmatic situation where one is looking to compare strings, find within strings, or create sort keys.

And it is right there in Vista, for those who are mathematically inclined....

 

This post brought to you by "𝐀" (U+1d400, a.k.a. MATHEMATICAL BOLD CAPITAL A)

Blog - Comment List
  • Could these new characters have implications for international domain names? "MATHEMATICAL SANS-SERIF CAPITAL A" could look rather similar to "LATIN CAPITAL LETTER A" in a URL. Also, a URL with a mix of mathematical and Latin characters might not be flagged up as a URL containing characters from multiple languages because, in a sense, it doesn't.
  • Excellent question, Jon!

    For most people, I would say no (since you have to have one of those math-specific fonts and it would be unlikely to display with glyphs). But the math symbols would definitely give cross-script errors since they are from different scripts (one Latin, one Common), unless you pass the flag to ignore the Common script range....
  • IDN is a per-registry issue. Each registry must write and enforce its own policies on acceptable names, because obviously allowing the entire Unicode range is simultaneously pointless and dangerous. I could write a long rant here about Network Solutions, but I think it would be redundant. Suffice to say that in a well-run registry IDN fraud attempts are not likely to be a huge problem and that the public gTLD registries are not well run. Look to a European ccTLD registry like Nominet for a contrasting example.

    Of course it would probably have been better to leave IDN as an experiment and put up with uninformed Totoro fans whining about why they can't register トトロ.com forever but it's too late for that now. We regret these mistakes at our leisure.
  • Great news! Math is getting its own LCID. Hopefully it won't take long, and there will be a math IME, too, so we can type equations with those characters.
  • What meaning does Bold/etc exactly have in Math? I've never heard of this before...
  • Hi Robert -- I hear you. I know of people who have used MSKLC to create keyboards for them. Although you cannot cover all characters, few mathematicians would need to use all of them at once, so 3 keyboards do the trick....
  • Hi Jonathan --

    I am going to try to get someone more qualified to come and discuss that point more fully; my knowledge is limited to a few well-known math constants. :-)
  • I believe you can also look at the text in UTR25 for some examples of when they are used....
  • Jonathan: Vectors are one thing that would commonly be written in bold.

    Now we just have to wait for the huge braces that can contain several lines to use for matrices...
  • Andreas -- there are formatting programs that will properly create such huge braces etc. based on metadata that describes how to best display things.
  • Hi Nick -- We cannot just make it a registry issue, we need this covered at all levels. Certainly it must start there, but it cannot end there. As for "would probably have been better to leave IDN as an experiment" I am forced to disagree. We call it WWW - WORLD WIDE web. So we need to support the entire world, something we were not doing previously, and really needed to. Is the problem harder? Sure. But we cannot refuse a problem that must be solved just because it is challenging.
  • As Andreas points out, bold is commonly used for vectors, although some authors prefer to represent vectors by letters with arrows above them. The mathematical alphanumerics, particularly the serifed italic, bold, bold italic, script, and Fraktur sets, are proving to be very useful in a math display and editing system some of us are working on. Such work is complicated a bit by the holes Michael refers to, which were introduced because some of the math alphabetics already existed in the Letterlike Symbols block and the Unicode Technical Committee doesn't like to exacerbate the multiple-character-same-glyph problem. Large braces are nicely handled via glyph variants, along with other special characters like superscripted primes and sub/superscript glyphs in general. Note that pieces to make large braces, brackets, etc., exist in Unicode, such as U+239B - U+23B1.

    Re IDN, hopefully the math alphanumerics will be illegal in domain names; there are already plenty of spoofing opportunities without adding any more (although at least some of the math alphas look quite different, e.g., the Fraktur and script symbols).

    One way to enter Unicode's vast math symbol set is as in TeX: \alpha inserts an alpha (actually a math italic alpha), \int inserts an integral sign, \fH inserts a Fraktur H, etc. If your editor has an autocorrect facility, you can define your own combinations for keyboard entry.
  • Way off-topic but Michael brought it on himself - (deleted question) Put it in the Suggestion Box, Jerry!
  • Re: IDNs

    Unicode has certainly recognised the issue; the table of confusables (found alongside the other character data tables on http://www.unicode.org/) certainly references the mathematical letters pretty heavily.

    As to what they mean... depends on the branch of maths you're dealing with; these symbols are pretty heavily overloaded. A notable exception is the blackboard-bold (or double-struck) N, Z, Q, R, C, which are used for the sets of natural, integral, rational, real and complex numbers. (That is also why they appear in the BMP and leave holes in the Mathematical Alphanumeric Symbols block.)
  • Hey Richard -- I don't disagree, I was just pointing out that if you use the mitigation tools we provide, then the situation is detected and handle-able.