MSI Databases and Code Pages

A Windows Installer database is full of strings. Most of the time those strings cause no problems, because they use only the standard, printable characters found in all code pages. These are the ASCII characters, which occupy the 7-bit range (0x00 through 0x7F) and are the same in all code pages except for a few rare ones that exist for legacy support. If a Windows Installer database requires extended characters, meaning characters where the 8th bit is set (0x80 through 0xFF), then a code page is necessary to define how those characters are displayed. For example, decimal character 255 is ÿ in ANSI code page 1252 (ANSI - Latin1) but я in ANSI code page 1251 (ANSI - Cyrillic). The database code page is used to display strings on Windows 9x/Me and to convert strings to Unicode on Windows NT when calling the W functions.
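To make the ÿ versus я example concrete, here is a minimal Python sketch. It assumes Python's `cp1252` and `cp1251` codecs are reasonable stand-ins for the Windows ANSI code pages of the same numbers; it has nothing to do with how Windows Installer itself performs the conversion.

```python
# Byte 0xFF (decimal 255) decoded under two ANSI code pages.
# Assumption: Python's cp1252/cp1251 codecs stand in for the Windows code pages.
extended = bytes([0xFF])
latin1 = extended.decode("cp1252")    # ANSI - Latin1
cyrillic = extended.decode("cp1251")  # ANSI - Cyrillic
print(latin1, cyrillic)

# Printable bytes in the 7-bit range decode identically under both code pages.
ascii_bytes = bytes(range(0x20, 0x7F))
same = ascii_bytes.decode("cp1252") == ascii_bytes.decode("cp1251")
print(same)
```

The same stored byte yields two different characters, which is why a database containing extended characters needs to declare its code page.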

It is recommended to use only ASCII characters, because you can then author a database with a neutral code page (0). Such a database can be used with any language. If you must include extended characters, set the code page for the database before importing any strings, or you risk corrupting those characters. This is common for localized product installation databases, since many languages require extended characters. Once you set the code page for a database, all imported text files must specify the same code page or the import will fail. A file to be imported, commonly referred to as an IDT archive file, would look like the following example:

Property	Value
s72	l0
1252	Property	Property
ProductLanguage	1033
ProductName	Microsoft Visual Studio 2005 Team Suite — ENU

The first row contains the column names and the second row contains their respective types. The third row contains the optional code page, followed by the required table name and an optional list of tab-delimited primary key column names. The example above is part of the Property table for Visual Studio 2005. I inserted 1252 as the code page purely for this example: the English SKU uses only ASCII characters and so does not need one.
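The three header rows described above can be picked apart with a few lines of code. This is only a sketch: `IdtHeader` and `parse_idt_header` are hypothetical names I made up for illustration, and the parser assumes a well-formed file with tab-delimited fields and a numeric first field on the third row whenever a code page is present.

```python
from typing import NamedTuple, Optional


class IdtHeader(NamedTuple):
    columns: list[str]
    types: list[str]
    code_page: Optional[int]
    table: str
    primary_keys: list[str]


def parse_idt_header(text: str) -> IdtHeader:
    """Parse the first three rows of a tab-delimited IDT archive file."""
    rows = text.splitlines()
    columns = rows[0].split("\t")
    types = rows[1].split("\t")
    third = rows[2].split("\t")
    # The code page is optional; when present it is a leading numeric field.
    code_page = int(third.pop(0)) if third[0].isdigit() else None
    # What remains is the table name followed by any primary key columns.
    table, *primary_keys = third
    return IdtHeader(columns, types, code_page, table, primary_keys)


sample = "Property\tValue\ns72\tl0\n1252\tProperty\tProperty\nProductLanguage\t1033\n"
header = parse_idt_header(sample)
print(header.code_page, header.table, header.primary_keys)
```

Dropping the leading `1252` field from the third row makes `code_page` come back as `None`, which corresponds to a file that does not declare a code page.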

You can easily display or change the code page for the database using WiLangId.vbs from the Windows Installer SDK, part of the Platform SDK. The same script also displays or changes the supported package languages and the product language used for strings not authored into the MSI database (such as Windows Installer error messages not in the Error table).
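If you would rather not run the script, the Windows Installer documentation describes setting the database code page by importing a special text archive: two blank lines followed by a line with the numeric code page, a tab, and the exact string `_ForceCodepage`. The Python sketch below only builds that text. The helper name `force_codepage_idt` is hypothetical, the CRLF line endings are my assumption, and the actual import still has to be done with your usual tooling.

```python
def force_codepage_idt(code_page: int) -> str:
    """Build the text of an archive file that sets the database code page.

    Layout: two blank lines, then "<code page><TAB>_ForceCodepage".
    CRLF line endings are an assumption, not something the article states.
    """
    if not 0 <= code_page <= 0xFFFF:
        raise ValueError("code page must fit in 16 bits")
    return f"\r\n\r\n{code_page}\t_ForceCodepage\r\n"


content = force_codepage_idt(1252)
print(repr(content))
```

Passing 0 produces the archive for a neutral code page. The text is pure ASCII, so it can be written to disk with any ASCII-compatible encoding before importing.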

Unofficially, MSI databases do support UTF-7 and UTF-8 by specifying code pages 65000 and 65001, respectively. Encoded strings will store correctly and will be converted correctly when the W functions are called, but they may not display properly because the correct font for wide characters is not chosen.
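The appeal of code page 65001 is that a single database can hold characters that no single ANSI code page covers. The sketch below illustrates this with Python's codecs, assuming `utf-8` corresponds to code page 65001 and `cp1252`/`cp1251` to the ANSI code pages. The `fits` helper is a hypothetical name, and none of this exercises Windows Installer itself.

```python
def fits(text: str, codec: str) -> bool:
    """Return True if every character of text can be encoded with codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


text = "\u00ff\u044f"  # Latin small y with diaeresis, then Cyrillic small ya

# Neither ANSI code page holds both characters, but UTF-8 does.
print(fits(text, "cp1252"), fits(text, "cp1251"), fits(text, "utf-8"))

# Each character takes two bytes in UTF-8, and the bytes convert back losslessly.
utf8 = text.encode("utf-8")
round_trip = utf8.decode("utf-8")
print(len(utf8), round_trip == text)
```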

With this in mind, don't be surprised if you open a database with a code page different from your current system code page in Orca and find that some characters are not displayed correctly (they will most likely appear as boxes or simply the wrong character). The strings are being displayed or converted to Unicode according to the database code page.

It's also important to note that the database code page is different from the Summary Information stream code page, which is property ID PID_CODEPAGE (1). This is the code page in which the summary information properties are encoded.

Comments
  • > These are called ASCII characters and are
    > the same for the first 7 bits (0x00 through
    > 0x7F) for all code pages except for a few
    > rare code pages in existence for legacy
    > support.

    That isn't true. Some code pages are rarely used in some countries but that's not what defines them as legacy. If you want to say that ALL code pages are legacy then you could have a point. But some that are in daily use on tens of millions of machines are not more rare or more legacy than your favourite code page is.

    Now, most Microsoft software treats codepoints 0x00 through 0x7F the same way in all code pages, treating them as if they were ASCII. In this sense it is pretty much OK for user programs to do what Windows does. That still doesn't make them the same as ASCII.
  • Norman, I wasn't referring to code pages that some languages don't use as "legacy", but code pages like EBCDIC from IBM, but granted that's not legacy on their mainframe peripherals and operating systems.

    The Windows code pages are the same for ASCII but there are other code pages where the first 128 characters may be different.
  • I forgot that EBCDIC also had code pages but I guess I see why you thought I was talking about EBCDIC. So please let me clarify.

    A subset of ASCII characters is common to most national standards so most of the first 128 characters are identical across them, but that doesn't mean that all 128 are. In code page 932, among the first 128 characters, only 2 differ from ASCII. In code pages for [*] some European languages, among the first 128 characters there are more than 2 differences from ASCII. Most Microsoft coding treats all of the first 128 characters as if they were ASCII, despite the fact that Microsoft fonts display them properly. So it's pretty much OK for user programs to treat them that way too. That doesn't really mean that the number of differences from ASCII among the first 128 has suddenly dropped to 0.

    [* I hate to say "legacy" after my previous posting, but it's a bit safer to say it for countries that have moved to newer national or international standards such as ISO Latin 1.]
  • How to specify the code page used for tables in the patch database.
  • Yes, WiX can produce a Unicode MSI.
