Behind 'How to break Windows Notepad'

Sorting it all Out
Michael Kaplan's random stuff of dubious value

Larry Osterman pointed me at an article entitled How to break Windows Notepad that makes for an interesting experiment:

Here's how to do it:
1. Open up Notepad (not Wordpad, not Word or any other word processor)
2. Type in this sentence exactly (without quotes): "this app can break"
3. Save the file to your hard drive.
4. Close Notepad
5. Open the saved file by double clicking it.

Instead of seeing your sentence, you should see a series of squares. For whatever reason, Notepad can't figure out what to do with that series of characters and breaks.

Now if you have East Asian language support installed, instead of seeing squares (NULL glyphs), you will see:

桴獩愠灰挠湡戠敲歡

And if you look at the code points under those characters, you will likely see what happened:

6874 7369 6120 7070 6320 6e61 6220 6572 6b61

Ah, each pair of bytes (that is, each pair of letters) just so happens to combine into the code point of a CJK ideograph!
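Here is a minimal sketch of that misreading in Python (not anything Notepad itself runs). It assumes the file was saved as plain single-byte text, and that the encoding wrongly guessed on load is little-endian UTF-16, which is what the code points above suggest. The variable names are just for illustration.

```python
# Assumption: the saved file holds one byte per character (ANSI, which is
# identical to ASCII for these letters and spaces).
text = "this app can break"
raw = text.encode("ascii")          # 18 bytes on disk

# Assumption: the bytes are then misread as little-endian UTF-16, so every
# two bytes become one 16-bit code unit.
misread = raw.decode("utf-16-le")   # 9 code units
code_points = [format(ord(ch), "04x") for ch in misread]

print(misread)
print(" ".join(code_points))
# 6874 7369 6120 7070 6320 6e61 6220 6572 6b61
```

Going the other way, `misread.encode("utf-16-le")` gives back the original 18 bytes, so nothing is lost on disk; only the interpretation changes.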

I have talked recently about the encoding detection mechanisms that Notepad uses, and this is another example of the problem, one that is more entertaining since the repro steps are so much fun (in fact, the only improvement would be text insulting Microsoft or one of its rivals, which Notepad would then appear to censor in an example of a big bad monopoly, etc.!).

Now, I have pointed out in the past that I do not like the IsTextUnicode function, and I suppose this could be considered a good reason why (IsTextUnicode returns TRUE here, which is why Notepad guesses as it does).

 

This post brought to you by "桴" (U+6874, a CJK ideograph)

Blog - Comment List
  • Cool post man. I like it...
  • Maybe Notepad should offer an option to bypass encoding detection?
  • Micheal Kaplan (who writes a blog about internationalization and localizability of software here) has...
  • select cast(0x687473696120707063206e61622065726b61 as varchar)

    htsia ppc nab erka
  • Note that the lengths of the words are reversed:

    Original: 4 3 3 5
    Changed: 5 3 3 4

    Ignoring spaces, note that the characters within each word are scrambled, but the order of the words themselves remains the same:

    thisappcanbreak
    htsiappcnaberka

    A question to ponder... did the two "p"s switch places?
  • That's just endian-ness, I have talked about that before. :-)
  • Hi Lionel -- well, I'd personally prefer if they added "BOM-free UTF-8 support" as a save option prior to "no detection" as a load option. :-)
  • > did the two "ps" switch places?

    Yup, I think they did.  The scrambling is simple:

    Scramble each word individually

    To scramble a word, start at the end.
    Switch the last two letters of the word.
    Switch the previous two letters of the word.
    Keep switching letter pairs until you have either scrambled the whole word (if there were an even number of letters) or there's a single letter left.
  • > That's just endian-ness

    Oh, duh.

    Switch byte-pairs:

    select cast(0x74686973206170702063616e20627265616b as varchar(20))

    this app can break
  • This still leaves open the question of why IsTextUnicode("this app can break") == TRUE -- looks like ASCII to me.  Maybe some of the component tests will reveal a clue.
  • Huh... I don't have a IS_TEXT_UNICODE_BUFFER_TOO_SMALL test on my W2K system.  Do I have to include something special to get it?  I've got all the others.

    Barring that, the only tests that fire for that string are:
    IS_TEXT_UNICODE_STATISTICS
    IS_TEXT_UNICODE_UNICODE_MASK
  • Looks like Wine couldn't find the IS_TEXT_UNICODE_BUFFER_TOO_SMALL test either!

    http://cvs.winehq.org/patch.py?id=17837
    /* FIXME: MSDN documents IS_TEXT_UNICODE_BUFFER_TOO_SMALL but there is no such thing... */

    Those wacky docs guys, always making up flags ;)
  • This will never be a fixable problem...

    If you made IsTextUnicode("this app can break") return FALSE, then you'd just have some Chinese guy saying "when I type '桴獩愠灰挠湡戠敲歡' into notepad, save it and reopen it, it just displays some funny English characters!"

    Actually, I think it would save with a BOM in that case, so it probably wouldn't do that. But you get the idea :-)
  • I think I know why IS_TEXT_UNICODE_BUFFER_TOO_SMALL is missing.

    Looking at winnt.h, the four "masks" are defined as 0x000f, 0x00f0, 0x0f00, and 0xf000.  There seems to be an unwillingness to break the 17th bit for some reason; and IS_TEXT_UNICODE_BUFFER_TOO_SMALL doesn't fit any of the masks; so it was dropped.  But it's still in the documentation, which is a documentation error.
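
The byte-pair swap worked out in the comments above (the "endian-ness" explanation) can be sketched in a few lines of Python. The helper name `swap_byte_pairs` is made up for illustration; it simply swaps each adjacent pair of bytes across the whole buffer, which is the same effect the two SQL casts show.

```python
def swap_byte_pairs(data: bytes) -> bytes:
    """Swap each adjacent pair of bytes; a trailing odd byte is left alone."""
    out = bytearray(data)
    for i in range(0, len(out) - 1, 2):
        out[i], out[i + 1] = out[i + 1], out[i]
    return bytes(out)


scrambled = swap_byte_pairs(b"this app can break")
print(scrambled)                    # b'htsia ppc nab erka'
print(swap_byte_pairs(scrambled))   # b'this app can break'
```

Since "pp" is its own byte pair, the two "p"s do trade places, though the result looks unchanged.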