January, 2006

Sorting it all Out
Michael Kaplan's random stuff of dubious value
Be sure to read the disclaimer here first!
  • Sorting it all Out

    New in Vista: What's your name? Who's your daddy?

    It is probable bordering on certainty that the Zombies were not singing about cultures in the .NET Framework in their song Time of the Season. Even in some kind of prophetic sense.

    And while I'm sure there are parents out there who may not be keen on the resource model that the .NET Framework uses (i.e. if resources do not exist for a particular language, check what its parents have!), the resource fallback model has some important customization abilities over simple string parsing of RFC-1766 or RFC-3066 names.

    But this post is not about any of that. This post is about Windows.

    I mean, it is easy to claim in the .NET Framework that CultureInfo names are where it is at, and that LCIDs are just there for backwards compatibility.

    But meanwhile back in Windows, the NLS API and all of its functions pivot on LCID.

    Well, they used to, I mean.

    Because Windows Vista supports new functions that take locale names rather than LCIDs!

    For now, I'll just list a bunch of the ones that are on the MSDN site online:

    And more!

    And of course there is the easy moving between names and LCIDs with LCIDToLocaleName and LocaleNameToLCID.

    Future posts will talk about various details of many of these new functions and the special things that some of them can do that go above and beyond their LCID-based versions.

    And just so you don't think the post title was entirely gratuitous:

    • There is an LCTYPE added for LOCALE_SNAME that will return the name of the locale, and
    • There is also an LCTYPE added for LOCALE_SPARENT that will return the parent name of a locale (its value is 0x0000006d and it is coming soon to a winnls.h near you!).

    So now any developer running Vista can ask of a locale those immortal questions first asked by The Zombies -- What's your name? Who's your daddy?

    And parents throughout the world will shudder at the precedent since those often neutral names that are the parents can be used for resource fallback, along with the new MUI functions for UI language that I will talk about another day..... :-)
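
    To make the parent idea concrete, here is a minimal Python sketch of resource fallback that walks parent names. The parent table, the resource strings, and the function names are all made up for illustration (nothing here is read from Windows or the .NET Framework); the only point is that the chain comes from data rather than from trimming the name string.

```python
# Hypothetical parent table: each locale name maps to its parent name.
# The empty string stands for the invariant root. Note that "zh-HK" has a
# parent that simple string trimming ("zh") would never produce.
PARENT_OF = {
    "en-US": "en",
    "fr-CA": "fr",
    "zh-HK": "zh-Hant",
    "zh-Hant": "zh",
    "en": "",
    "fr": "",
    "zh": "",
}

# Hypothetical resources, keyed by locale name.
RESOURCES = {
    "": {"greeting": "Hello", "farewell": "Goodbye"},
    "fr": {"greeting": "Bonjour"},
    "fr-CA": {"farewell": "Salut"},
}


def fallback_chain(name):
    """Return the locale name followed by each of its parents, root last."""
    chain = [name]
    while name:
        # Unknown names fall straight back to the root in this sketch.
        name = PARENT_OF.get(name, "")
        chain.append(name)
    return chain


def lookup(name, key):
    """Return the first resource found while walking up the parent chain."""
    for candidate in fallback_chain(name):
        table = RESOURCES.get(candidate, {})
        if key in table:
            return table[key]
    return None


print(fallback_chain("zh-HK"))
print(lookup("fr-CA", "greeting"))
```

    On Vista the parent name would presumably come from asking the locale itself (LOCALE_SPARENT) rather than from a hand-written table like this one.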

     

    This post brought to you by "𝅘𝅥" (U+1d15f, a.k.a. MUSICAL SYMBOL QUARTER NOTE)

  • Sorting it all Out

    And while I'm on the subject, there is the rest of the world

    Regular reader Maurits asked in the Suggestion Box:

    OK, no Tengwar or Klingon Unicode code points...

    What about an Esperanto localization of Windows? The Unicode Standard is available in Esperanto:

    http://www.unicode.org/standard/translations/esperanto.html

    And Esperanto is mentioned in the Microsoft Knowledge base (if only in passing...)

    http://support.microsoft.com/kb/95473/en-us

    He was actually responding to the post entitled Fictional could make things less functional, where I explained some of my thoughts on why it is best to let sleeping dogs lie for some proposals.

    The current question about Esperanto in Windows is obviously a separate issue from Unicode, since it mainly relates to Windows, and to Microsoft.

    It also poses a problem, a limitation in the locale model, where a locale can for present purposes be defined as

    • a unique combination of language, and a(n)
    • [occasionally implied] region, and a(n)
    • [often implied] script.

    But where is the location for Esperanto? Or for Yiddish?

    And to look at another, very different, category --

    What is the currency formatting for Egyptian Hieroglyphics? Or for any number of other languages for which a localized version of Windows will never make sense for any purpose other than as a novelty?

    And then what about the over 60 languages of the Congo or the over 50 dialects of Quechua?

    Some may recall I brought up some of this issue in A subkultur iz a shprakh mit an armey un a flot -- and talked further about the limitations in the model that affect these and other genuinely interesting scenarios.

    The truth is that the current locale model, which has proven to be a very powerful one for the last 10+ years, will not be able to handle every new scenario over the next 10+ or 20+ years.

    Will it stay strong for those original scenarios? Definitely. But the next stage is how to fit in these other scenarios, as they come up.

    Which is a lot of what both opening it all up and getting out of the way are all about. Because no matter how good of a job we can do, those language experts can do a better one. If only we give them a way to do so....

    Every time I don't like my job I think about this aspect of it and suddenly I find myself liking it again. :-)

     

    This post brought to you by "༺" (U+0f3a, a.k.a. TIBETAN MARK GUG RTAGS GYON)

  • Sorting it all Out

    One reason you should clean up warnings

    (The following post is reprinted from a blog that is no longer around. Done so with the permission of the author -- special thanks to Mike Dolenga, for that permission and for having a cynical side that I find quite comforting since it makes me look like an optimist!)

    Compiler warnings can sometimes be more aggravating than useful.  We've all encountered them, and learn which ones to ignore in our code.  I think it's best to clean them up, either by fixing the code or suppressing the warning itself.  Suppression should be done with caution and should be clearly documented, of course.

    I was finally convinced of the need to clean up warnings at a previous employer a few years ago.  This company has since gone bankrupt, and what you're about to read is one of the reasons why.  It provided a database application to a specific industry, let's just imagine it was widget manufacturers.  The application had a lot of flexibility to it, too much in fact, and was sold to specific widget makers with customizations defined by them.

    One day, one of the widget makers called and said they were hitting a bug in the program which caused some of their data to be lost.  We investigated and found the bug.  The developer who owned the code checked in a fix, and we eagerly told the widget maker we would soon have an update ready for them, which we would courier overnight on a CD.

    The client received the CD, installed our program and called right away to tell us the desktop shortcut didn't work.  Luckily there were a few technically savvy people at the client's office who pointed out that our program was nowhere to be found on the system where they had just installed it.  Our used car salesman turned CEO was red faced and reflexively told the client the disc we sent must have been defective.  I found out later the client knew better, but it did buy us another day.

    We did some digging, and found out what happened.  First, the checkin made by the developer caused a build break.  This also meant he never tested his fix, but that's a separate issue.  The builder kicked off a build and, as was his habit, ignored the error, because there were so many warnings in the code base.  He never noticed that our main application failed to build.

    The next guy in the chain, who owned the installer package, also ignored the warning that the application did not exist when building his package.  Why?  Again, there were too many warnings and he just assumed all those lines flying by were the usual noise.  And finally, after burning the CD, the test team installed the build on a machine which already had an installation.  They didn't test the fix; instead, they simply logged in and made sure the application launched.  Why?  Management routinely refused to buy them the tools to quickly re-image an OS and have a clean system to install, and insisted this CD get out the door before the FedEx office closed for the day.  The same management which refused to buy additional build hardware so that we could build on something other than a Pentium 90 (400s were standard at the time).

    While the failures here were numerous, a key component was all the warnings in the code.  They made everyone sloppy and complacent.  If the builder had noticed the application never built, that could have avoided all the problems.  If you go from zero errors and warnings to some, that's a lot easier to notice than an extra message out of thousands.
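
    As a rough illustration of the kind of gate that would have caught this, here is a small Python sketch that refuses to go on unless the build log is free of warnings and errors and the expected outputs actually exist. The function names, file names, and the "any line mentioning warning or error" rule are assumptions made up for this example, not anything from the build system in the story.

```python
from pathlib import Path


def missing_artifacts(build_dir, expected_names):
    """Return the expected output files that are absent or empty."""
    root = Path(build_dir)
    missing = []
    for name in expected_names:
        candidate = root / name
        if not candidate.is_file() or candidate.stat().st_size == 0:
            missing.append(name)
    return missing


def log_is_clean(log_lines):
    """True only when no log line mentions a warning or an error."""
    for line in log_lines:
        lowered = line.lower()
        if "warning" in lowered or "error" in lowered:
            return False
    return True


def ready_to_package(build_dir, expected_names, log_lines):
    """Package only from a clean log with every expected output present."""
    return log_is_clean(log_lines) and not missing_artifacts(build_dir, expected_names)
```

    With a zero-warning baseline the log check is a plain yes/no answer; with thousands of tolerated warnings it cannot be one, which is the complacency described above.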

    As a footnote, I have no idea whether the fix worked.  Not only did I quit the next week (I already had an offer elsewhere), but the client cancelled the contract.  The ensuing litigation was one of the things which finally put an end to the software company.

  • Sorting it all Out

    Ā was unexpected at this time.

    (Special thanks to Dave Poole for pointing this one out!)

    It was not too long ago that I got mail from Dave about a strange error he was getting in some automated tests of various SQL Server command line tools. An error that occurred with a command line including LATIN CAPITAL LETTER A WITH MACRON (Ā).

    I am used to getting random mails of a particular class from people (such as Unicode characters not working in the console), so I started him down the troubleshooting road with chcp and default system locale and so on.

    But he persevered and when he proved that he was entering things correctly and I looked at the problem, I saw that he was right. Typing an 'Ā' always had this particular problem:

    while typing anything else did not:

    Since Dave is on the SQL Server team and I am on Windows, I figured I should go find out who owns CMD.EXE.

    It turns out that is none other than Mark Zbikowski, Architect and man whose initials appear in every binary that is not one of those .COM files.

    By many accounts, he is the third most "senior" employee working at Microsoft today.

    Even though I had actually conversed with him before (about those casing table and NTFS issues!), I did not want to waste his time or take advantage of the fact that he is by all reports and in my personal experience a really nice guy.

    So there was a brief delay while I tried to look into the bug a bit before sending mail to someone who had been here approximately ten times longer than me. :-)

    I found a constant with the value 0x0100 in the code but could not really tell how the lexer and parser work enough to see a problem with an overlap between them. So I finally sent Mark some mail and asked him.

    As it turns out, Mark and I have at least one thing in common -- he is also used to getting random mails of a particular class from people (such as specific characters not working in the console). So he started me down the road of troubleshooting when text is munged before it ever gets to CMD's parser.

    But I persevered and when I proved that I was entering things correctly and he looked at the problem, he saw that I was right. Typing an 'Ā' always had this particular problem.

    (to Mark's credit he realized all of this much more quickly than I did, but then he is much smarter than I am and has been here almost ten times longer than me so I think that would be expected!)

    This is not a regression, and has been around even longer than Unicode support in the console has been (perhaps even as far back as the OS/2 days) -- back when 0x0100 would have been a very sensible mark that could be used to indicate something that is not a character (since no characters were above 0xFF!).
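
    To show the general shape of this kind of bug (and only the general shape -- this is not CMD's code, which is not shown here), here is a toy Python sketch of a scanner that reserves 0x0100 as a "not a character" marker. The constant name and the functions are invented for the example.

```python
# Hypothetical sentinel, picked back when every real character fit in 0x00-0xFF.
NOT_A_CHAR = 0x0100


def to_units(text):
    """Turn text into a list of numeric code units (BMP characters only here)."""
    return [ord(ch) for ch in text]


def classify(units):
    """Label each unit as an ordinary character or as the special marker."""
    labels = []
    for unit in units:
        if unit == NOT_A_CHAR:
            # Once input can contain U+0100, this branch also fires for a real letter.
            labels.append("special")
        else:
            labels.append("char")
    return labels


print(classify(to_units("abc")))
print(classify(to_units("a\u0100c")))  # the middle unit is LATIN CAPITAL LETTER A WITH MACRON
```

    The marker is perfectly safe until the character repertoire grows past 0xFF, at which point one legitimate letter becomes indistinguishable from it.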

    Good proof that there is always a small but measurable difference to be had between a nearly total rewrite and an actual rewrite. A difference that testers could make their living on if they needed to!

    But in any case it was a rather cool bug, if you ask me. I have no idea why it has never been reported before, but it is reported now!

     

    This post brought to you by "Ā" (U+0100, a.k.a. LATIN CAPITAL LETTER A WITH MACRON)

  • Sorting it all Out

    Are multilingual sorts integrated?

    Someone using the handle 'And' asked the following in the Suggestion Box:

    Sometimes (e.g. in journals) Cyrillic/Greek letters are used in proper names. The sorting order then becomes somewhat complicated. Are systems giving results like the following in use? Are there any (relatively) widespread standards?

    • Aachen
    • Ἀλέξανδρος
    • Amsterdam
    • David
    • Дмитрій
    • Felix
    • Ἑλλάς
    • Patrick
    • Пётръ
    • Σπύρου
    • Thames
    • Жарковъ.

    This type of 'integrated multilingual sort' is not something that is directly supported by Windows and would require a massive reassignment of all weights to work properly.

    It is also not generally expected by users who I have spoken with, though. And has not really come up as a frequent request.

    Of course if one is trying to support such a sort, the easiest way to do it would be to keep for each entry a separate transliterated name, perhaps using the new Microsoft Transliteration Tool! :-) , and then with all of the strings on the level playing field of the same script you could simply sort by those entries to integrate the sort.

    But it is a specialty usage, not one that would be expected in everyday sorting....
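
    A minimal Python sketch of the "keep a separate transliterated name" idea, using the names from the question. The transliterations are supplied by hand for this example (they are not the output of any particular tool), and the plain casefolded string comparison stands in for whatever real collation one would use on the Latin keys.

```python
# Each entry pairs the name as displayed with a hand-supplied, hypothetical
# Latin transliteration that is used only as the sort key.
ENTRIES = [
    ("Жарковъ", "Zharkov"),
    ("Thames", "Thames"),
    ("Σπύρου", "Spyrou"),
    ("Пётръ", "Petr"),
    ("Patrick", "Patrick"),
    ("Ἑλλάς", "Hellas"),
    ("Felix", "Felix"),
    ("Дмитрій", "Dmitrij"),
    ("David", "David"),
    ("Amsterdam", "Amsterdam"),
    ("Ἀλέξανδρος", "Alexandros"),
    ("Aachen", "Aachen"),
]


def integrated_sort(entries):
    """Sort display names by their transliterated key, ignoring case."""
    ordered = sorted(entries, key=lambda entry: entry[1].casefold())
    return [display for display, _key in ordered]


for name in integrated_sort(ENTRIES):
    print(name)
```

    The displayed strings are never compared with each other; all of the ordering decisions happen in the single-script key column.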

     

    This post brought to you by "Ж" (U+0416, a.k.a. CYRILLIC CAPITAL LETTER ZHE)

  • Sorting it all Out

    It can't be all about the money

    Someone with the handle of *g* asked in the Suggestion Box:

    Hi Michael,

    I love your blog and read it regularly, even though my primary work interests are in device automation: simply because yours is one of the most interesting blogs to be had ;)

    I am working with devices that store money, and send information on amounts with currency information in isocode. Some tasks here are:

    1. given an isocode, display an amount e.g. "deposit 123,45 €".
    2. given an isocode, display a note value: note that commonly (where common is less-than-well-defined as in: western europe and north america), for note values there is no fraction. You don't have a $1,00 bill, you have a $1 bill.

    Now, there is always GetCurrencyFormat(). However, I have some problems with it:

    • it assumes that I know the lcid; however, the devices I'm working with only send me the isocode and assume (incorrectly, I admit) that e.g. all EUR formatting is the same across all countries
    • there is no sufficient discrimination between countries; e.g. if I have CAD and USD, both will show as $.
    • for situation (b) above I can only do custom formatting with a CURRENCYFMT structure; plus I need additional "language knowledge" on how bill values are presented to the user.

    Your thoughts?

    I hope my answer does not cause *g* to find my blog less interesting!

    There really is not a mechanism to support what is being asked for here -- because an ISO currency code without the context of a locale does not give enough information to do the formatting. (There is also no automatable rule about the common name for currency notes in any country; though there are patterns, there is no way to definitively know what to call the currency short of the actual localization process.) This is the problem in the request of the first bullet point -- the formats in different locales are often different, so we can't ever assume that they are not.

    The second bullet point is not really a problem if you do know the locale, unless you are trying to equate the currencies of two different locales based on the currency symbol -- which is again a bug in what the software is assuming (a bug which should be fixed right away before it costs people a lot of money!).

    When you get down to it, the third bullet point really is talking about a localization issue -- an interesting one, to be sure, but one that is way beyond what the NLS API is designed to handle.

    For better or worse, the context of a locale is indeed at the core of NLS support on Microsoft products, and without that context there is no real hope of giving the right results. This particular dependency can lead to other problems which I will talk about soon. :-)
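
    Here is a small Python sketch of why the ISO code alone is not enough. The conventions table is hand-written for two locales purely as an illustration (it is not data pulled from the NLS API), and the function and table names are invented for the example.

```python
from decimal import Decimal

# Hypothetical, hand-written formatting conventions for illustration only.
CONVENTIONS = {
    "de-DE": {"decimal": ",", "symbol_first": False, "space": True},
    "en-IE": {"decimal": ".", "symbol_first": True, "space": False},
}

# Symbols keyed by ISO currency code. Note that two codes share "$".
SYMBOLS = {"EUR": "€", "USD": "$", "CAD": "$"}


def format_amount(amount, iso_code, locale_name):
    """Format an amount using the symbol for iso_code and the locale's conventions."""
    conv = CONVENTIONS[locale_name]
    whole, frac = f"{amount:.2f}".split(".")
    number = whole + conv["decimal"] + frac
    symbol = SYMBOLS[iso_code]
    gap = " " if conv["space"] else ""
    if conv["symbol_first"]:
        return symbol + gap + number
    return number + gap + symbol


amount = Decimal("123.45")
print(format_amount(amount, "EUR", "de-DE"))
print(format_amount(amount, "EUR", "en-IE"))
```

    The same EUR amount comes out differently depending on the locale argument, and the shared "$" entry shows why equating currencies by their symbol is a bug waiting to happen.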

     

    This post brought to you by "ᢨ" (U+18a8, a.k.a. MONGOLIAN LETTER MANCHU ALI GALI BHA)

  • Sorting it all Out

    Handling multilingual data in SQL Server

    Seth Siegal asked in the Suggestion Box:

    What is the best practice for searching large table in mixed languages. Context is SQL Server 2000 or SQL Server 2005. The problem is storing in a single column character data from mixed languages and then providing a search capability to find the best match given a search string in some arbitrary language. The column, of course, is Unicode data type with some collation. Is there an optimal default collation? What is recommendation for table design for efficiency -- collation, indexes?

    As an example, consider an international directory of business names and a stored procedure to search for a name "like" <some string> where <some string> is user input in any language. Since collation determines comparison rules it seems the appropriate collation is one that best matches the language of the search string. Further, to facilitate matching it seems appropriate to relax restrictions such as case and accent sensitivity. That is a strict binary comparison is not "user friendly." However there are serious performance implications when the collation of the search string does not match the collation of the database column and dynamic casting is used to make them match.

    Example (approximate SQL):

    MyBigTable (ID int not null, SearchString nvarchar(1000) not null collate <Optimal Collation For Mixed Languages>)
    create procedure UserSearch @SearchString nvarchar(1000), @SearchCollation nvarchar(128) as
    select * from MyBigTable
    where SearchString collate @SearchCollation like @SearchString

    The answer to this question is buried in some important implementation details surrounding the way collations work....

    The first issue is that every "Windows" (which is to say, not SQL compatibility) collation is a view of everything in the table, according to a particular language or set of languages.

    Now this helps the first part of Seth's question -- any collation can be used if you are sticking with Unicode data (though how you want non-Unicode clients to behave when querying the data may have an influence on your final choice).

    However, because of this design, it is easy to find situations where weights in one collation will not distinguish as well as weights in another collation, depending on how important the differences are for a given language. As I point out in You can't ignore diacritics when a language does not give them diacritic weight, one person's primary distinction can easily be another person's secondary distinction. Especially in a case where one is planning to ignore case or accent differences, this means that one person's equality is another's inequality.

    In order to properly support the querying of multilingual data well, you really do have to do both seek and ordering operations using the collation that will match the queryer's expectations.
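
    The "one person's equality is another's inequality" point can be shown with a toy comparison in Python. This is emphatically not how SQL Server or Windows collations are implemented; the two key functions and the "Swedish-like" letter set are assumptions invented for the example.

```python
import unicodedata


def strip_marks(text):
    """Remove combining marks after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def generic_key(text):
    """Toy 'ignore case and accents' key: every diacritic is dropped."""
    return strip_marks(text).casefold()


# In this toy "Swedish-like" view these are letters in their own right,
# so ignoring accents must not fold them into a or o.
DISTINCT_LETTERS = set("åäöÅÄÖ")


def swedish_like_key(text):
    """Toy key that ignores case and accents except on the distinct letters."""
    composed = unicodedata.normalize("NFC", text)
    kept = [ch if ch in DISTINCT_LETTERS else strip_marks(ch) for ch in composed]
    return "".join(kept).casefold()


print(generic_key("Mälar") == generic_key("malar"))
print(swedish_like_key("Mälar") == swedish_like_key("malar"))
```

    Both keys "ignore accents", yet they disagree about whether the two strings match, which is why the seek has to run under the collation the person querying expects.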

    But as Seth points out this can lead to serious performance issues.....

    Luckily, I have posted about how to work around that issue. :-)

    In the post Making SQL Server index usage a bit more deterministic I explain how to make sure that performance will not suffer with such operations due to non-indexed queries being run. And I highly recommend this technique be used any time you do have to commonly deal with multiple languages....

     

    This post brought to you by "ޝ" (U+079d, a.k.a. THAANA LETTER SHEENU)

  • Sorting it all Out

    The parts of AP

    There is an old joke you have probably heard some version of:

    One day a blonde decided that she was sick of all the "Dumb Blonde" jokes people were telling. She decided she would show everyone that blondes really were smart and set out to learn the capitol of each state in the USA. A few days later, she overheard the folks at the watercooler, again telling blonde jokes. Having had her fill of it all, our blonde hero interrupted the group and advised she could prove to them that all blondes were not dumb. She said she could give the capitol of any state, and taken aback by her confidence, a gentleman asked her to name the capitol of the State of Ohio. She thought for a moment, then proudly proclaimed "O"!

    After reading Geoffrey K. Pullum's post The parts of speech on Language Log, I find myself feeling the same sense of frustration that Associated Press could get such a fundamental fact wrong.

    I mean, didn't any of them go to elementary school and learn enough about grammar to know that the parts of speech hardly require a lifetime of study? Or if they had not been paying attention, hadn't they ever seen Schoolhouse Rock?

    So I guess AP gets to be a step below our blonde heroine, who made an honest mistake between capitols and capitals. And at least she put a bit of effort into her blunder.

    Addendum 5:00pm: I was seeding in a mistake above for the sake of a post I was going to do next week about word choices, but someone found it fast enough that I need to rethink the theory. Capitols are indeed in Capitals. :-)

    But to use words stupidly in an obit for a linguist? Now that is chutzpah.

    I won't say what part Associated Press was showing off this time, because it's one of those "words you aren't supposed to say in public" parts. If you know what I mean. :-)

  • Sorting it all Out

    Localisation via Wiki?

    (UK spelling of localization in the title a nod to our friends on the other side of the puddle!)

    It was as long as six months ago and as recently as yesterday that people have asked me how I would feel about specific posts or perhaps even the whole blog here localized into one or more other target languages. I have usually been a bit nervous about anything beyond small attempts since my informal style is not the best target for a localizer.

    But Joel Spolsky is trying out a bold experiment, described in his post entitled Translations. He is trying a Wiki to support what would amount to a community translation project. Quite similar to what Wikipedia is doing with its content (though the latter is much scarier from a synchronization standpoint since the original source text could theoretically be any language!).

    I have some experience in managing large volunteer translation projects, and I agree completely with Joel about the problems with maintaining them over time (I would add to his list the quality issues and the variability of methods people were wanting to use to provide the text!).

    Now the Wiki idea here is undeniably a very cool thing to keep an eye on, and I think it is a lot more likely to "succeed" (where success is defined as reasonable synchronization of content and consistent quality levels) than Wikipedia's model due to the impossibility of synchronization with source and target in constant flux. It is much more comfortable and indeed possible to stay in sync if you treat one as the master original and then any clouded or faulty meaning in the translations as a bug in localization for another language speaker to correct.

    It will also be an interesting place for people who are interested in language differences to keep an eye on changes that are made.

    Kudos, Joel!

  • Sorting it all Out

    Approaching linguiticalishnessality?

    I took most of Friday off (I ended up putting in a few hours for a small keyboard snafu and answered a question or two while I was there, but otherwise it was a day off).

    I decided to go and see a linguistics talk being given over at the University of Washington at their weekly colloquium.

    The talk was about Cuzco Quechua, a language that has interested me since we added the following locales to Windows XP SP2 (which I first mentioned a year ago in Lions and tigers and bearsELKs, Oh my!):

    • 0x046b      Quechua - Bolivia
    • 0x086b      Quechua - Ecuador
    • 0x0c6b      Quechua - Peru
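
    As an aside on those three values: they differ only in the sublanguage bits, which a few lines of Python can show. This sketch assumes the usual LCID layout (a 16-bit language identifier whose low 10 bits are the primary language and whose upper 6 bits are the sublanguage, with a sort identifier above it); the function name is made up.

```python
def split_lcid(lcid):
    """Split an LCID into its primary language, sublanguage, and sort ID parts."""
    lang_id = lcid & 0xFFFF
    return {
        "primary": lang_id & 0x3FF,  # low 10 bits of the language identifier
        "sub": lang_id >> 10,        # upper 6 bits of the language identifier
        "sort": (lcid >> 16) & 0xF,  # sort identifier
    }


for lcid in (0x046B, 0x086B, 0x0C6B):
    parts = split_lcid(lcid)
    print(f"{lcid:#06x}: primary={parts['primary']:#04x} sub={parts['sub']}")
```

    All three share the primary language value 0x6b and are told apart by sublanguage values 1, 2 and 3.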

    The talk (given by Rachel Hastings, whose PhD is in fact also about Quechua!) was a very interesting report about investigating the Definiteness Effect in Cuzco Quechua, looking especially at existential and possessive sentences. Aside from a little bit of linguistic jargon in a few of the questions that people asked afterwards, I pretty much understood all of the things that were being said. I'm still not a linguist but I might have to upgrade that whole 'linguistic aptitude' thing from delusions to notions soon!

    I'll probably talk more about Quechua some time soon. It is a very interesting language, something I thought even before Rachel proved it is true in more ways than I realized.

    Anyway, I was talking to her after the presentation and was explaining about my interest in Quechua and other languages, getting into some of the language issues I had been working on lately. She was interested in the fact that we were supporting several locales covering the Quechua language, and I'll definitely try to get her on the Vista beta so she can see it right alongside all of the other languages (and maybe report if she sees any bugs!). It was cool making a language contact who is a linguist -- because no matter how helpful native speakers can be, someone who has done as much work as she has on the language? She has so much explicit knowledge that can be so helpful, it is amazing to meet such people....

    Then listening to people after the talk, like the one who pointed out how much Quechua has in common with Turkish. How does stuff like that happen, anyway?

    (A part of me wonders if I should think about going back to school, but I think I lack the discipline that it would take to do that. Perhaps I will look into taking a class next semester or something....)

    Then on my way out an undergrad stopped me and asked me if my name was Michael Kaplan. It turned out she had seen my Channel 9 video! I am not sure why exactly, but that is still fairly cool when I think about it. I mean, I know that people who read here might know about it, some may even watch it. But total strangers who are studying linguistics wasting 30 minutes of their lives to see me blather? That is awesome for reasons that I probably won't analyze further.... :-)

     

     

    This post brought to you by "Q" (U+0051, a.k.a. LATIN CAPITAL LETTER Q)

  • Sorting it all Out

    Some scenarios are too big to cover

    As you might imagine, these days I get a lot of email from people in different groups at Microsoft about all kinds of internationalization issues.

    Usually they are polite, but some of them can be demanding at times!

    Now while it is true that I have a lot of different things that I know something about, it is not like I know everything about every single issue.

    Unfortunately, I usually do know the answer often enough that people keep on asking. :-)

    Anyway, yesterday someone was looking for a way to convert between Locale Identifiers (LCIDs) and culture names.

    So I immediately suggested the NLS API functions LocaleNameToLCID and LCIDToLocaleName.

    "I'm sorry," they said. "We need something that works in managed code."

    Ah, that is even easier since the CultureInfo class has both Name and LCID properties.

    "Hmmm," they said. "We need something that will work downlevel."

    No problem, since the .NET Framework can run as far downlevel as Windows 98 if needed.

    "Ah, but we have to handle custom locales, too," they offered.

    As you can imagine, I was getting a little exasperated by then. I pointed out that custom locales are a new feature in Vista, so there was no downlevel requirement for them.

    "No, we need to support all of these scenarios" was the response.

    Hmmm.... so basically they need:

    • both names and LCIDs
    • in both managed and unmanaged code
    • on both Vista and downlevel
    • for both built-in and custom locales

    But they wanted to be "good citizens" and start using names and getting rid of their LCID dependency.

    (of course in most societies the good citizens would not be expected to be quite this demanding, but that is another story!)

    Well, I told them, they will probably need to store both names and LCIDs.

    We simply were not able to know through special psychic powers that all these things would be needed to build such functionality into prior versions. Not that I would complain if I had this ability, but for the present it is simply beyond my power....

    Anyway, I think when the extreme nature of their request was so baldly laid out, they kind of understood they were looking for a bit more than was possible. Which is not to say they won't be emailing me tomorrow to ask another question. :-)
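
    For what "store both names and LCIDs" might look like in practice, here is a hedged Python sketch. The record type, the tiny lookup table, and the comparison rule are all invented for illustration; the one outside fact it leans on is my understanding that custom locales can share a single placeholder LCID (LOCALE_CUSTOM_UNSPECIFIED, 0x1000), which is treated here as an assumption.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed placeholder LCID shared by custom locales (see the note above).
LOCALE_CUSTOM_UNSPECIFIED = 0x1000

# Tiny hypothetical table standing in for a real name/LCID conversion.
KNOWN = {"en-US": 0x0409, "de-DE": 0x0407, "ja-JP": 0x0411}


@dataclass(frozen=True)
class LocaleRef:
    """Stored locale reference that keeps whichever identifiers are available."""
    name: Optional[str]
    lcid: Optional[int]


def make_ref(name=None, lcid=None):
    """Build a reference, filling in the missing half when the table knows it."""
    if name is not None and lcid is None:
        lcid = KNOWN.get(name)
    elif lcid is not None and name is None:
        name = next((n for n, value in KNOWN.items() if value == lcid), None)
    return LocaleRef(name, lcid)


def same_locale(a, b):
    """Prefer names; use LCIDs only when they can actually identify a locale."""
    if a.name is not None and b.name is not None:
        return a.name.lower() == b.name.lower()
    if a.lcid is None or b.lcid is None:
        return False
    if LOCALE_CUSTOM_UNSPECIFIED in (a.lcid, b.lcid):
        return False  # a shared placeholder cannot tell two custom locales apart
    return a.lcid == b.lcid
```

    The name is the preferred key wherever it exists, and the LCID is kept around only for the downlevel paths that cannot do without it.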

     

    This post brought to you by "ፏ" (U+134f, a.k.a. ETHIOPIC SYLLABLE FWA)

  • Sorting it all Out

    The font known as 'MS Sans Serif'

    Loyal SIAO reader Serge Wautier just posted about MS Sans Serif. In the post he talked about this font a bit, saying:

    The problem is that this font supports the Western European code page only. When one adds an Eastern European language such as Polish, appTranslator realizes that the font will not display correctly in Polish hence replaces the default dialog font by Tahoma. Now Tahoma is a _little_ bit wider than MS Sans Serif, resulting in long source text items being cropped, especially if the controls were really just wide enough to display the text.

    I figured it might be good to set the record straight. :-)

    Now if you look in the fonts folder:

    You will see mostly TrueType fonts, and then you will see some or most of the following bitmap fonts:

    • Courier
    • MS Dialog
    • MS Sans Serif
    • MS Serif
    • Roman
    • Script
    • Small Fonts
    • Symbol (Symbol)
    • perhaps Terminal
    • those WST* fonts

    (Your exact list may vary, of course!)

    Note the total number of fonts I have there -- 283, including a visible Marlett.

    Now, let's go to the DOS prompt and look to see what is there:

    Hmmm.... looks like there are a few more there than the font folder is admitting to, aren't there? :-)

    The fact is that what you see as MS Sans Serif is actually one of many different files covering different sizes and code pages. When you change your default system locale, one of the dances that happens is that the fonts registered with Windows will be changed to match your choice and your font size and resolution, so that you will be able to see text in the console (more on dealing with this issue another day!).

    So what you see as one entry can actually be one of a dozen or more font files. And it won't always be Western European, though it will work to match your default system locale....

    Which does not change the validity of the other things that Serge said, including the fact that Microsoft Sans Serif is a much nicer font! In fact this whole post is using Microsoft Sans Serif, if you have it installed!

    (I will talk more about Microsoft Sans Serif another day....)

     

    This post brought to you by "Ｍ" (U+ff2d, a.k.a. FULLWIDTH LATIN CAPITAL LETTER M)

  • Sorting it all Out

    Custom encodings in Word?

    Ivan Petrov asked in the Suggestion Box:

    Hi Michael.

    When I open a text file in Microsoft Office Word, Word attempts to detect the encoding standard used for text in the file. Word can automatically detect most encoding standards. When the file's encoding standard matches the default encoding standard used to save files as plain text in the version of Microsoft Windows I'm running, Word opens the file directly.

    If Word cannot detect the encoding standard, or if it detects an encoding standard that doesn't match the default standard used by Windows, I must verify or choose the encoding standard from a list in the File Conversion dialog box. Word then uses the encoding standard I choose to convert the file to Unicode. I can preview the text to check whether it is readable before I open the file.

    So my question is:
    "If I have tons of TXT files encoded in custom (not supported by Microsoft!) encoding standard (for example: Bulgarian MIK encoding standard ) so, is it possible (in Word) and to be created and added a custom encoding standard (which to be displayed in the list of encoding standards in the File Conversion dialog box) to open and view correctly all this custom encoded TXT files? And if "YES", How To?"

    Regards,
    Ivan.

    The answer is probably not going to be one that Ivan is going to like very much (it is unfortunate, but I have reason to believe that he is the person most often unhappy with answers he gets from this blog. Sorry Ivan!).

    I believe that Microsoft Word may actually be using MLang's code page detection code to guess at what the code page may be, and I know they also do some other extra steps beyond that. But there is no mechanism to add code pages to the detection list in Word (or the list of code pages in Windows).

    Now of course this is another area where we have gotten feedback. And the message has been clear that in some cases, providing data migration methods can be important in order to move to Unicode (or even to Windows). But it is really not possible to say much at this point about what will be taken from that list and put into actual product. This is the sort of thing that I will certainly comment about when I can, though....
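
    In the meantime, one way to handle the data migration side outside of Word is to convert such files to Unicode before opening them. The Python sketch below is deliberately partial and rests on an assumption worth checking against a real MIK table: that bytes 0x80-0xBF hold the Cyrillic letters А through я in alphabetical order. Everything above that range is replaced with U+FFFD rather than guessed at, and the function name is made up.

```python
def decode_mik_partial(data):
    """Decode ASCII plus the assumed Cyrillic range of MIK; flag everything else."""
    out = []
    for byte in data:
        if byte < 0x80:
            out.append(chr(byte))  # plain ASCII
        elif byte <= 0xBF:
            # Assumed layout: 0x80 is U+0410 (А) and 0xBF is U+044F (я).
            out.append(chr(0x0410 + (byte - 0x80)))
        else:
            out.append("\ufffd")  # not covered by this sketch
    return "".join(out)


sample = bytes([0x80, 0xA1, 0xA2, 0x20, 0x31])
print(decode_mik_partial(sample))
```

    The decoded text can then be saved as Unicode, which Word will open without needing to know anything about the original encoding.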

     

    This post brought to you by "д" (U+0434, a.k.a. CYRILLIC SMALL LETTER DE)

  • Sorting it all Out

    A bit about Marlett

    Did you know that some of the graphics used in Windows are not actually graphics but are instead represented as symbols in a font?

    It is true. And that font's name is Marlett. From its own description, as found in the TrueType Font Properties Extension tool:

    Looking at it in Character Map, you may see a few familiar UI elements from your Windows experience:

    And although the file sometimes tries to hide itself, it is always there helping Windows have the right look and feel....

    I find this font to be pretty cool, myself. Maybe we could get the Invisible Jet added! :-)

     

    This post brought to you by "ቚ" (U+125a, a.k.a. ETHIOPIC SYLLABLE QHWI)

  • Sorting it all Out

    ISO 8601 redux

    The other day, colleague Shawn Steele posted in his blog about the ISO 8601 Week of Year format in Microsoft .Net, which explains how to work around the fact that we do not exactly support the standard in our implementation. And some readers may recall when I linked to Isaac K. Kunen's post about how to work around this in SQL Server using SQLCLR integration which uses a slightly different method to get at the answer.

    Now this does not mean I do not still think that ISO 8601 is asinine -- because I do. But I am not blind to the fact that not all of the content in the standard is (like I said!), and that if people are then it is likely not for the same reasons. :-)

    If they just did not focus on human readability while being so insensitive to human preferences, I'd probably change my mind here....
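
    For reference, the ISO 8601 week rule itself (weeks start on Monday, and week 1 is the week containing the year's first Thursday) is easy to see in a few lines. This is a Python sketch using the standard library's date.isocalendar(), not the .NET or SQLCLR workarounds from the posts mentioned above; the helper name iso_week is made up.

```python
from datetime import date


def iso_week(d):
    """Return (ISO year, ISO week number, ISO weekday) for a date."""
    iso_year, week, weekday = d.isocalendar()
    return (iso_year, week, weekday)


# A Saturday that ISO 8601 still counts as part of the previous year's last week.
print(iso_week(date(2005, 1, 1)))    # (2004, 53, 6)

# A Monday that ISO 8601 already counts as week 1 of the following year.
print(iso_week(date(2007, 12, 31)))  # (2008, 1, 1)
```

    The year boundaries are exactly where a "week of year" calculation based on January 1 tends to disagree with the standard.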

     

    This post brought to you by "ඖ" (U+0d96, a.k.a. SINHALA LETTER AUYANNA)
