Unicode? Zip don't need no stinking Unicode!

Sorting it all Out
Michael Kaplan's random stuff of dubious value
Be sure to read the disclaimer here first!


I have talked about the limitations in ZIP before in the post Zipping up Unicode file names, but Heath has pointed out a new and interesting wrinkle in the problem in his post Update for the Palm Treo 700w Available, with Problems.

Now Heath may seem to some to be some kind of lightning rod for Unicode Lame List stories, but he isn't -- he is just a smart developer who is finding himself thrown into bad software situations that he did not design....

In this case we see the biggest problem with not using Unicode -- the basic problem of deciding what code page to use. It is probably not so much that zipfldr.dll is specifically using cp437 and cp1252; it is that it is using CP_OEMCP and CP_ACP (the system's default OEM and ANSI code pages, which happen to be 437 and 1252 on a US English machine).
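To make the code page mismatch concrete, here is a minimal Python sketch. It is not what zipfldr.dll does internally; it only assumes that one side writes the name in cp1252 and the other side reads the same bytes as cp437. The file name is made up for illustration.

```python
# Hypothetical file name containing REGISTERED SIGN and TRADE MARK SIGN.
name = "Update\u00ae\u2122.exe"

# Assumption: the writer stores the name using the ANSI code page (cp1252).
raw = name.encode("cp1252")

# Reader A interprets the bytes with the same ANSI code page: no damage.
as_ansi = raw.decode("cp1252")

# Reader B interprets the very same bytes with the OEM code page (cp437).
as_oem = raw.decode("cp437")

# cp437 has no slot for either character, so encoding directly fails.
try:
    name.encode("cp437")
    oem_can_encode = True
except UnicodeEncodeError:
    oem_can_encode = False

print(repr(as_ansi))
print(repr(as_oem))
print(oem_can_encode)
```

The bytes never change; only the guess about which code page they are in does, and that guess is exactly what CP_ACP versus CP_OEMCP amounts to.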

What causes such a mistake to not get noticed, though? I mean, it is pretty unnatural to be using both constants, isn't it?

As luck (or unluck) would have it, they are not. The problem starts with the Shell folks, who are using funky macros wrapped around funky shlwapi wrappers like SHAnsiToUnicode and SHUnicodeToAnsi. I call them funky because they are. They are also quite consistent in their underlying use of CP_ACP, always.

And as for the rest of the problem, it looks like the CP_OEMCP is coming from the fact that it is a console app that is running things so that some of the translations are happening in this different context....

How smart is Palm feeling for putting ™ and ® in the filename, at this point? No wonder they took the update down. :-)

Clearly we'll need to see people using ASCII file names until people move up to Unicode. Code pages are just too damn confusing!
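As a rough illustration of why ASCII names are the safe choice, the sketch below checks whether a name survives being decoded under a few different code pages. The helper names, the list of code pages, and the sample file names are all hypothetical; the only assumption is that each code page agrees with ASCII in the 7-bit range, which holds for the ones listed.

```python
# Code pages a reader might plausibly guess (an illustrative list, not exhaustive).
CODE_PAGES = ("cp437", "cp850", "cp1252", "utf-8")


def is_ascii_name(file_name: str) -> bool:
    """Return True if every character is in the 7-bit ASCII range."""
    return file_name.isascii()


def round_trips_everywhere(file_name: str) -> bool:
    """Return True if the name's bytes decode identically under every code page."""
    if not is_ascii_name(file_name):
        return False
    raw = file_name.encode("ascii")
    return all(raw.decode(code_page) == file_name for code_page in CODE_PAGES)


print(round_trips_everywhere("treo700w_update.zip"))
print(round_trips_everywhere("Treo\u00ae.zip"))
```

A check like this could run before a package is zipped up, so that a stray ® is caught on the build machine instead of on the customer's.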

 

This post brought to you by "®" and "™" (U+00ae and U+2122, a.k.a. REGISTERED SIGN and TRADE MARK SIGN)

Comment on the blather
  • I don't think it's that there's any console app involved - there shouldn't be, since zipfldr.dll is just a shell extension server DLL; I think it's just that they are interpreting the file names as DOS file names and using the OEM code page.

    What else is interesting regarding code pages is that when I came to your page the (R) and (TM) glyphs weren't displaying correctly. I actually filed a bug last night on Community Server because the new post page uses ISO-8859-1 while the page content sends the Content-Type HTTP header with UTF-8. In this case, Internet Explorer apparently ignored the Content-Type header and automatically chose Windows 1252. I changed the encoding to UTF-8 and it appears correctly. How lame.

    For my post you linked to I actually used the appropriate HTML entities where defined, and coded the entities myself otherwise. It's a big hassle to have to fix-up my HTML but hopefully I won't have to with a future update to Community Server.

    PS: When I was younger I was also a lightning rod for major accidents. Before you know it, I'll have a white stripe of hair and go running every time a storm brews. (Anyone know the reference?)
  • Oh yeah, and don't forget to add to the lame Unicode support you mentioned about my blog from http://blogs.msdn.com/michkap/archive/2005/10/08/478479.aspx. You should add that to the "Unicode Lame List" category, too.
  • Good idea on the post category for that other post. :-)

    All I know is that the DLL does not ever use the OEMCP -- so someone is putting that particular interpretation on it....

    I can't repro the encoding problem of the pages, though -- everything seems to display fine here, for me at least. I wonder why?
  • <<In this case, Internet Explorer apparently ignored the Content-Type header and automatically chose Windows 1252.>>

    A good reason to always add the meta "text/html; charset=utf-8" to web pages.
  • Just over a week ago I was posting about Unicode? Zip don't need no stinking Unicode!
    Well, as Heath...
  • I have talked about WinZip and Zip in the past, in particular their odd relationship with Unicode, in

  • "All I know is that the DLL does not ever use the OEMCP"

    Except it does: I looked at the zipfldr.dll imports, and it imports OemToCharBuffA and CharToOemA.

  • Anyway, now there is this hotfix:

    support.microsoft.com/.../2704299
