File redirection corruption?

Sorting it all Out
Michael Kaplan's random stuff of dubious value

A question I received in email:

In the FRA and ESN OSes, when I type some word on the command prompt with an acute-accented e like génération and redirect it to a file (eg: “echo génération > abcd.txt”) then the file contains a comma instead of the é. (The file has g,n,ration). But when I don’t redirect, I can see the character properly in the command prompt. I am also able to copy-paste to that file with the characters intact. Only the redirection is causing trouble.

Can you advise as to why this could happen? I would have thought that it has the wrong code page but the fact that I can see the characters properly on screen seems to preclude that guess. Any help would be very appreciated.

Of course, any time the difference is between a console application and a regular Windows application, the first guess as to the problem is one of those OEMCP vs. ACP issues.

So, looking at some code pages (1252 and 437):

On code page 437, 0x82 is U+00E9 (é -- LATIN SMALL LETTER E WITH ACUTE), while on code page 1252, 0x82 is U+201A (‚ -- SINGLE LOW-9 QUOTATION MARK), which looks very much like a comma.

So the output was never different at all -- but the way that the underlying byte was being interpreted was....
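
The same effect can be reproduced outside the console. The following is a minimal sketch in Python, assuming its built-in `cp437` and `cp1252` codecs as stand-ins for the OEM and ANSI code pages of the machine in the question; the variable names are illustrative only.

```python
# The word from the question, encoded the way a console using OEM code page 437 would.
word = "génération"
oem_bytes = word.encode("cp437")

# Read the very same bytes back under each code page.
as_oem = oem_bytes.decode("cp437")    # what the console sees
as_ansi = oem_bytes.decode("cp1252")  # what an ANSI application such as Notepad sees

print(oem_bytes)       # b'g\x82n\x82ration'
print(ascii(as_oem))   # 'g\xe9n\xe9ration'      (U+00E9, e with acute)
print(ascii(as_ansi))  # 'g\u201an\u201aration'  (U+201A, the comma look-alike)
```

The bytes in the file never change; only the code page used to read them does.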

 

This post brought to you by "‚" (U+201A, SINGLE LOW-9 QUOTATION MARK)

Blog - Comment List
  • > I am also able to copy-paste to that file with the characters intact

    So the copy/paste is switching code pages automatically?  How does that work?
  • Hi Maurits --

    Well, usually each application that is not smart enough to use Unicode (such as the console) is smart enough to properly pivot from the code page it is using TO Unicode (either converting and putting CF_UNICODETEXT on the clipboard or just putting up the code page and letting the clipboard map and convert through synthetic clipboard formats)....
  • I see... copying from the console gets you to Unicode (through WM_COPY, presumably) but output redirection is a naked string of bytes.

    And for some reason (?) the console is using a different code page than Notepad.

    So "type abcd.txt" shows the accents, and "notepad abcd.txt" shows the commas. (Verified)
  • Ah, a fix!

    "The command processor has an option (/U) to generate all piped and redirected output in Unicode rather than the OEM code page."
    http://blogs.msdn.com/oldnewthing/archive/2005/03/08/389527.aspx
  • I would just run "chcp 1252" so that the console code page was the same as the system code page.
  • Why is it that when I run chcp 1252 and paste é (U+00E9) from Character Map, it displays Θ, which is 0xE9 in cp 437? What is the conversion that happens here and on what basis?
  • Hi Srivatsn,

    Well, chcp affects the output code page -- but what you enter in the console is input, not output. So the OEMCP is used....
  • So if I type:

    chcp 1252
    echo génération > abcd.txt
    notepad abcd.txt

    Notepad will show an eacute.

    Now, by default the console uses the Terminal font, which has a theta at code point 0xE9. However, Lucida Console is a Unicode font and things show up as I expect them to.

    I would just recommend that the original email user set his console font to Lucida Console and use chcp 1252, and he should get what he expects.
  • Or try the /U option and have some of those other scenarios work, too.... :-)
  • When I write batch files that require "international" characters, I put "chcp 1252" at the beginning of them because I can't guarantee that they'll be run by a Unicode cmd.exe.
  • Um, "international" is at a minimum worthy of a "chcp 65001", isn't it?

    I mean, with 1252 being such a far cry from "international" ? :-)
  • I think that's why he put "international" in quotes... at least it's "more" international than US-ASCII.

    Anyway, for serious international stuff, I'd say switch to Monad or something (if possible anyway)... it's much more consistent, being .NET and all Unicode internally.
  • There doesn't seem to be a code page that will allow "type" to print a UTF-16 text file.  chcp 1200 and chcp 1201 both return "Invalid code page."  This could be worked around with some kind of utf16le_to_utf8.exe, which would read UTF-16LE and spit it out as UTF-8:

    rem make a utf16le file
    cmd /c /u echo génération > utf16le.txt

    rem switch console to the UTF8 code page
    chcp 65001

    rem type the file back to the console with the utf16le_to_utf8 shim
    type utf16le.txt | utf16le_to_utf8
  • Er, switch /c and /u in that cmd call:
    cmd /u /c echo génération > utf16le.txt
  • Yup, that works.

    C:\>chcp
    Active code page: 437

    C:\>cmd /u /c echo génération > utf16le.txt

    (Opening utf16le.txt in Notepad and a hex editor confirms the UTF16-LE-ness of the file.)

    C:\>type utf16le.txt
     Θ n Θ r a t i o n

    C:\>chcp 65001
    Active code page: 65001

    C:\>type utf16le.txt
    ???n?r?a?t?i?o?n? ?
    ?

    (type'ing a UTF16-LE-encoded file in a UTF8 code page doesn't work...)

    C:\>type utf16le.txt | perl utf16le_to_utf8.pl
    génération

    (... but piping it through a converter does.)

    For the sake of completeness, here's the code for the converter:

    C:\>type utf16le_to_utf8.pl
    use strict;
    use Encode;

    # slurp whole files to avoid spurious line break issues with 0d 00 0a 00 etc.
    undef $/;

    # read text
    my $text = <>;

    # convert text
    Encode::from_to($text, 'UTF-16LE', 'UTF-8');

    # output converted text
    print $text;

    (It's probably reasonably trivial to write a simple .exe to convert from UTF-16LE on wcin to UTF-8 on cout... that would obviate the need for Perl.)
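
For readers without Perl, the same conversion can be sketched in Python. This is not the commenter's code, only a rough equivalent under stated assumptions: the function name `utf16le_to_utf8` is borrowed from the thread, and the sample bytes stand in for what `cmd /u /c echo` writes, assumed here to be UTF-16LE with no byte order mark.

```python
def utf16le_to_utf8(data: bytes) -> bytes:
    """Decode UTF-16LE bytes and re-encode the text as UTF-8."""
    # Work on the whole buffer at once, as the Perl script does, so that
    # two-byte sequences such as 0d 00 0a 00 are never split apart.
    return data.decode("utf-16-le").encode("utf-8")


# Stand-in for the contents of utf16le.txt (assumption: no byte order mark).
sample = "génération\r\n".encode("utf-16-le")

converted = utf16le_to_utf8(sample)
print(converted)  # b'g\xc3\xa9n\xc3\xa9ration\r\n'
```

To use this as a filter in a pipe, one option is to read the raw bytes from `sys.stdin.buffer` and write the result to `sys.stdout.buffer`, which keeps Python from applying its own text encoding or newline translation. If a byte order mark were present in the input, this sketch would pass it through as U+FEFF rather than strip it.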