I'm not a Klingon ( )

Shawn Steele's thoughts about Windows and .Net Framework globalization APIs

  • Blog Post: Converting text file code pages

    I've said "use Unicode" a lot, but sometimes there are programs that aren't doing what you'd expect, and outputting stuff in a different code page. Additionally, you might sometimes encounter a text file that was created using the system code page of a different machine. (Like if someone emailed me a...
  • Blog Post: Unicode 6.0 has a new Indian Rupee Symbol, how do I get it?

    Well you can't, not yet anyway. Unicode 6.0 adds the new Indian Rupee Symbol at U+20B9 (see http://www.unicode.org/charts/PDF/U20A0.pdf ) so how do you get it to work? Unfortunately you can’t get it to work immediately ;-(. The problem’s actually really complicated as there are lots of...
  • Blog Post: Thoughts About Email Addresses with EAI (Email Address Internationalization)

    The EAI Working Group ( http://datatracker.ietf.org/wg/eai/charter/ ) is making rapid progress toward standardizing Unicode email addresses. Unicode email addresses are a terrific feature for people in many countries that don't use Latin/ASCII as a native script. Ironically, in the US it's easy to miss...
  • Blog Post: The Square Boxes in My Blog's Title

    Someone pointed out the boxes in my blog's title. That's a script some fans use for Klingon, but since it's not in Unicode, you need a pIqaD font to see it correctly. If you really want to see the square boxes, then grab the pIqaD.ttf font from the .zip in my earlier post: http://blogs.msdn.com/shawnste...
  • Blog Post: UTF-8 usage on web approaching 50%

    Google posted an interesting chart: http://googleblog.blogspot.com/2010/01/unicode-nearing-50-of-web.html I'm sure Bing has similar data, but since Mark already built the chart it was easier for me to link there :) Hopefully this will mean less code page confusion in the browser/server space, which...
  • Blog Post: Most combining characters in a Unicode glyph/character/whatever

    Recently on the Unicode list someone asked, basically, what the largest number of combining characters in a sequence could be. It's as many as someone wants to use, though the normalization UTS15 adds a limit, and the font rendering problem gets weird. An interesting example appeared on the list...
  • Blog Post: Alternate encoding names recognized by .Net / IE

    If you run the sample from http://msdn.microsoft.com/en-us/library/system.text.encoding.getencodings.aspx then you can get a list of what Microsoft .Net thinks each Encoding/Code Page's name is. (WebName is more consistent with what's used in charset). eg: using System; using System.Text; public class...
  • Blog Post: Unicode, IDN (IDNA), EAI (IMA) and Homograph Security

    I wrote about IDN & Security before http://blogs.msdn.com/shawnste/archive/2005/03/03/384692.aspx but thought I'd share some of my more updated views about security of URLs/IDN/Unicode/Email addresses. People haven't really bothered much with DNS or character based security when it was limited...
  • Blog Post: Writing "fields" of data to an encoded file.

    The moral here is "Use Unicode," so you can skip the details below if you want :) A common problem when storing string data in various fields is how to encode it. Obviously you can store the Unicode as Unicode, which is a good choice for an XML file or text file. However, sometimes data gets mixed...
  • Blog Post: Don't use MB_COMPOSITE, MB_PRECOMPOSED or WC_COMPOSITECHECK

    This pretty much demonstrates another reason to Use Unicode, but if you do need to use some non-Unicode encoding until you can convert to Unicode, please don't use these flags. MultiByteToWideChar() and WideCharToMultiByte() provide some interesting sounding flags that are actually useless, slow,...
  • Blog Post: Front page uses windows-1252, shouldn't it be iso-8859-1?

    I received this question: I use Frontpage for my webpage design and FP automatically inserts the meta tag "<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">". Should I have reference to ISO-8859-1 ? I'm not a front page expert, and I can't answer all questions...
  • Blog Post: Where to Look Up Information About Microsoft Code Pages?

    First of all, remember to Use Unicode when practical :) Sometimes older applications don't allow Unicode, although they usually then don't allow Microsoft code pages as well (usually being ASCII or Latin-1, which are different). But when you do have a question about how Microsoft's "ANSI" (They're...
  • Blog Post: Unicode use on the web

    Google posted a blog about unicode use on the web http://googleblog.blogspot.com/2008/05/moving-to-unicode-51.html They also announced that they now support Unicode 5.1, which is probably a good thing, but I found the graph most interesting http://bp1.blogger.com/_Ap14FtNN91w/SBzrtHJfLnI/AAAAAAAAA5U...
  • Blog Post: Server 2008 U+FFFD behavior for unknown or illegal UTF-8 sequences.

    In my post Change to Unicode Encoding for Unicode 5.0 conformance I mentioned that the behavior of illegal characters has changed for Unicode 5 conformance in Windows Vista / .Net 2.0+. Those changes have also been inherited by Server 2008. Also check out my collection of code page related articles...
  • Blog Post: Code pages and security issues

    One of the reasons I always suggest "Use Unicode!" is that there are security problems converting between code pages. In short if data is going to be converted between code pages after...
  • Blog Post: Michael has a blog about converting apps from ANSI to Unicode

    Lots of apps are now Unicode, but some need to make the shift from ANSI (like Japanese shift-jis) to Unicode. Michael has a series of blog posts about a project conversion. http://blogs.msdn.com/michkap/archive/2007/01/05/1413001.aspx I recently had a customer question about removing a shift_jis dependency...
  • Blog Post: Are we going to update or maintain the best fit &/or code page mappings?

    People wonder if we're going to update our best fit code page mappings, or even our code page mappings. The answer is no. Changing character mappings causes difficulties for applications and our experience has been that doing so breaks as much as it "fixes". We'd prefer applications move to Unicode,...
  • Blog Post: UTF-16, UTF-8 & UTF-32 update to conform with Unicode 5.0's security concerns.

    My post Change to Unicode Encoding for Unicode 5.0 conformance now applies to .Net 2.0 with MS07-040 applied. Updates include a list of known issues; please see the list of known issues for MS07-040 described in KB 931212 for more information. KB 940521 describes this behavior in particular. This fix...
  • Blog Post: I see my favorite Ansi function has the behavior I want.

    Occasionally I am asked about the A version of a W function. Ie: GetLocaleInfoA does something that appears more convenient to some user than GetLocaleInfoW. The implied thought is that maybe they should just use the A version. For the most part our A functions are just wrappers for the W functions...
  • Blog Post: Why can't we strip the diacritics?

    We have some "best-fit" behavior which we generally consider to be "bad". Any loss of data is generally a bad thing, so we recommend storing data in Unicode (so you don't lose anything). Assuming you can't use Unicode, why is it so bad to just make everything ASCII-like? Maybe you have a published house...
  • Blog Post: Encoder/Decoder Encoding fallbacks fail after 2GB of data has been converted

    We have an unfortunate bug in .Net v2.0+ that causes encoding or decoding of more than 2GB of data to fail. That's a lot of data, but it still shouldn't do that. This is a problem with our built-in fallbacks. Ironically, if you encounter bad bytes then the bug is reset and you're "good" for another...
  • Blog Post: MLang & MSXML6 doesn't like UTF-7

    In some cases MLang (on which MSXML6 depends) can add extra ? to decoded UTF-7 data, which can cause UTF-7 encoded XML to fail to parse. UTF-7 isn't a great encoding anyway, so this is just another reason to Please Avoid UTF-7. In particular there doesn't seem to me to be much reason to use...
  • Blog Post: How do I get HKSCS 2004 characters from Big-5 in .Net?

    Well, that's pretty tricky. We provide the Microsoft Character Code Conversion Routines For HKSCS-2004 functions, but those are intended for use with unmanaged code. The fundamental problem is that these "HKSCS" characters were in use prior to the assignment of a code point for them in Unicode. In...
  • Blog Post: How do I get my ANSI based application to run correctly?

    A common question is "how do I get my ANSI based code page application to run on a system that has a different code page?" The most obvious solution is to use Unicode :) Then you won't have the code page messiness that leads to this kind of problem. For some legacy applications you may need a stop...
  • Blog Post: Please avoid UTF-7

    UTF-7 inherently has some of the security issues that concern people about encodings. For example, by shifting in & out of the base64 mode one can create multiple representations of the same string, enabling spoofing and other problems. UTF-7 is primarily interesting for legacy mail and NNTP applications...
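
The "multiple representations of the same string" point in the UTF-7 excerpt above is easy to see in a few lines. This is only a rough sketch, not something from the original post: it uses Python's built-in "utf-7" codec, and the sample bytes and variable names are made up for illustration.

```python
# Illustrative sketch: two different UTF-7 byte sequences decode to the same text.
# Uses only Python's built-in "utf-7" codec; the sample data is hypothetical.

direct = b"<script>"             # every character written directly as ASCII
shifted = b"+ADw-script+AD4-"    # '<' and '>' written in the shifted base64 mode

decoded_direct = direct.decode("utf-7")
decoded_shifted = shifted.decode("utf-7")

# Both byte sequences produce identical text after decoding.
same_text = decoded_direct == decoded_shifted

# A check that only scans the raw bytes for "<" does not see the shifted form.
filter_sees_bracket = b"<" in shifted

print(decoded_direct, decoded_shifted, same_text, filter_sees_bracket)
```

So anything that inspects the bytes before decoding and anything that inspects the text after decoding can disagree about what the string contains, which is one way the spoofing problem shows up.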