#
# Some notes from someone smarter than me about Perl and Unicode ...
# ----
#
# Which encoding do you want to use? UTF16-LE is the standard on Windows (nearly
# all characters are encoded as 2 bytes), UTF8 is the standard everywhere else
# (characters are variable length and all ASCII characters are a single byte).
#
# Here's what I've figured out after lots of experimentation. To get UTF16-LE
# output you need to play a few games with perl...
#
#   open my $FH, ">:raw:encoding(UTF16-LE):crlf:utf8", "e:\\test.txt";
#   print $FH "\x{FEFF}";
#   print $FH "hello unicode world!\nThis is a test.\n";
#   close $FH;
#
# Reading the IO layers from right to left (the order that they will be applied
# as they pass from perl to the file) ...
#
# Apply the :utf8 layer first. This doesn't do much except tell perl that we're
# going to pass "characters" to this file handle instead of bytes so that it
# doesn't give us "Wide character in print ..." warnings.
#
# Next, apply the :crlf layer as text goes from perl out to the file. This
# transforms \n (0x0A) into \r\n (0x0D 0x0A) giving you DOS line endings. Perl
# normally applies this by default on Windows, but it would do it at the wrong
# stage of the pipeline so we removed it (see below); this is where it ought to
# be.
#
# Next apply the UTF16-LE (little endian) encoding. This takes the characters
# and transforms them to that encoding. So 0x0A turns into 0x0A 0x00. Note that
# if you just say 'UTF16' the default endianness is big endian, which is
# backwards from how Windows likes it. However, because we're explicitly
# specifying the endianness, perl will not write a BOM (byte order mark) at the
# beginning of the file. We have to make up for that later.
#
# Finally, the :raw pseudo layer just removes the default (on Windows) :crlf
# layer that transforms \n into \r\n for DOS style line endings. This is
# necessary because otherwise it would be applied at the wrong place in the
# pipeline. Without this the encoding layer would turn 0x0A into 0x0A 0x00 and
# then the crlf layer would turn that into 0x0D 0x0A 0x00 and that's just goofy.
#
# Now that we've got the file opened with the right IO layers in place we can
# almost write to it. First we need to manually write the BOM that will tell
# readers of this file what endianness it is in. That's what the
# print $FH "\x{FEFF}" does.
#
# Finally we can just print text out.
#
# If you want UTF8, I'm pretty sure it's a lot easier. This is also a lot
# easier on unix. The CRLF ordering problem is definitely a bug, but the default
# to big endian (and ensuing games to get the BOM to output without a warning)
# is by design. I'm pretty sure that none of the core perl maintainers use perl
# on Windows (even though at least one keeps perl on VMS working...).
#
#
# Until Exchange decides it wants a Unicode eseperf.ini, we're going to generate
# the old ASCII one. Also, if Exchange wants one, it will have to update its
# version of Perl to understand the open modes we're using below. Currently we
# get this error:
# 1>Unknown open() mode '>:raw:encoding(UTF16-LE):crlf:utf8' at .\perfdata.pl line 325, line 6189.
#
if ( $ESENT ){ #ifdef ESENT
    open( INIFILE, ">:raw:encoding(UTF16-LE):crlf:utf8", "$INIFILE" ) || die "Cannot open $INIFILE: $!";
    print INIFILE "\x{FEFF}"; # print BOM (Byte Order Mark) for the unicode file
} else { #else
    open( INIFILE, ">$INIFILE" ) || die "Cannot open $INIFILE: $!";
} #endif
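
#
# A minimal sketch (not part of the original script) for checking the layer
# stack described in the notes above. It writes a short string through the same
# layers to a scratch file, reads the file back as raw bytes and returns them
# as a hex string. The sub name utf16le_hex_dump and its $path argument are
# hypothetical, made up for illustration. It assumes a perl with PerlIO layers
# and the Encode module, so it will not work on the older perl that reports
# "Unknown open() mode".
#
sub utf16le_hex_dump {
    my ( $text, $path ) = @_;

    # Same layer stack as the ESENT branch above.
    open( my $out, ">:raw:encoding(UTF16-LE):crlf:utf8", $path )
        || die "Cannot open $path: $!";
    print $out "\x{FEFF}";    # BOM, written by hand as the notes explain
    print $out $text;
    close($out) || die "Cannot close $path: $!";

    # Read the file back with no translation at all.
    open( my $in, "<:raw", $path ) || die "Cannot open $path: $!";
    my $bytes = do { local $/; <$in> };    # slurp the whole file
    close($in);
    unlink($path);                         # remove the scratch file

    return join( " ", map { sprintf( "%02X", ord($_) ) } split( //, $bytes ) );
}
#
# Example (the scratch file name is arbitrary):
#   print utf16le_hex_dump( "hi\n", "layers_test.tmp" ), "\n";
# If the layers behave as the notes describe, the dump should begin with the
# BOM (FF FE) and the "\n" should come out as 0D 00 0A 00.
#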