Sorting It All Out: Michael Kaplan's random stuff of dubious value
The presented scenario is simple (even simpler as I will present it here than it was originally!):
Now this is an easy scenario, and anyone can write this up, in their spare time. I just wrote it up myself in multiple programming languages using WinForms, because I was bored and had never tried it before. And with text in multiple actual languages because I am wired that way and have more keyboard layouts than possibly anyone in the entire freaking universe.
I even named the form Magic Carpet Ride, to help ameliorate the boredom.
This did not work, for what it's worth.
So instead, I entered the following 20 characters into my Magic Carpet Ride form:
That last character is U+20000, the first CJK Extension B ideograph of Unicode (aka U+D840 U+DC00, to its close friends who he is not ashamed to be disrobed, as it were, in front of)....
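To make that arithmetic concrete, here is a minimal C# sketch (mine, not from the original post) showing how .NET counts that final character:

```csharp
using System;

class SurrogateDemo
{
    static void Main()
    {
        // U+20000 lies outside the BMP, so .NET stores it as a
        // surrogate pair: U+D840 followed by U+DC00.
        string ideograph = char.ConvertFromUtf32(0x20000);

        Console.WriteLine(ideograph.Length);                  // 2 UTF-16 code units
        Console.WriteLine((int)ideograph[0]);                 // 55360 (0xD840)
        Console.WriteLine((int)ideograph[1]);                 // 56320 (0xDC00)
        Console.WriteLine(char.ConvertToUtf32(ideograph, 0)); // 131072 (0x20000)
    }
}
```

So the 20 "characters" typed into the form occupy 21 UTF-16 code units, which is where the trouble starts.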
And now we have a ball game.
Because when TextBox.MaxLength talks about
Gets or sets the maximum number of characters that can be manually entered into the text box.
what it really means is
Gets or sets the maximum number of UTF-16 LE code units that can be manually entered into the text box and will mercilessly truncate the living crap out of any string that tries to play cutesy games with the linguistic character notion that only someone as obsessed as that Kaplan fellow will find offensive (geez he needs to get out more!).
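The effect of that code-unit counting is easy to reproduce without a form at all. A hedged sketch (my own illustration, not the control's actual source) of what a MaxLength of 20 effectively does to that string:

```csharp
using System;

class TruncationDemo
{
    static void Main()
    {
        // 19 BMP characters followed by U+20000: 21 UTF-16 code units total.
        string input = new string('A', 19) + char.ConvertFromUtf32(0x20000);

        // Cutting at 20 code units splits the surrogate pair,
        // leaving a lone high surrogate at the end.
        string truncated = input.Substring(0, 20);

        Console.WriteLine(truncated.Length);                    // 20
        Console.WriteLine(char.IsHighSurrogate(truncated[19])); // True: ill-formed UTF-16
    }
}
```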
I'll try and see about getting the document updated....
Regular readers who remember my UCS-2 to UTF-16 series will note my unhappiness with the simplistic notion of TextBox.MaxLength and how it should handle at a minimum this case where its draconian behavior creates an illegal sequence, one that other parts of the .Net Framework may throw a
System.Text.EncoderFallbackException: Unable to translate Unicode character \uD850 at index 0 to specified code page.
exception if you pass this string elsewhere in the .Net Framework (as my colleague Dan Thompson was doing).
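That exception is easy to trigger once a lone surrogate exists. One way to see it (a sketch of the general failure mode, not necessarily the exact call Dan made) is any encoder configured with an exception fallback:

```csharp
using System;
using System.Text;

class EncodeDemo
{
    static void Main()
    {
        // A lone high surrogate, as left behind by the truncation.
        string broken = "\uD840";

        // throwOnInvalidBytes: true installs EncoderExceptionFallback,
        // so encoding the ill-formed string throws instead of substituting.
        var strictUtf8 = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false,
            throwOnInvalidBytes: true);

        try
        {
            strictUtf8.GetBytes(broken);
        }
        catch (EncoderFallbackException e)
        {
            Console.WriteLine(e.Message); // "Unable to translate Unicode character..."
        }
    }
}
```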
Now okay, perhaps the full UCS-2 to UTF-16 series is out of the reach of many.
But isn't it reasonable to expect that TextBox.Text will not produce a System.String that causes another piece of the .Net Framework to throw?
I mean, it isn't as if there is even a chance, in the form of some event on the control warning you of the upcoming truncation, to easily add the smarter validation yourself -- validation that the control itself declines to do.
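Absent such an event, the workaround is to repair the string yourself before handing it to anything that encodes. A minimal sketch; `TrimDanglingSurrogate` is a name I made up, and a real application would call it from a TextChanged handler or before reading TextBox.Text:

```csharp
using System;

static class SurrogateSafe
{
    // Trim a trailing lone high surrogate so that code-unit truncation
    // never leaves an ill-formed UTF-16 string behind.
    public static string TrimDanglingSurrogate(string s)
    {
        if (s.Length > 0 && char.IsHighSurrogate(s[s.Length - 1]))
            return s.Substring(0, s.Length - 1);
        return s;
    }

    static void Main()
    {
        string truncated = "0123456789012345678\uD840"; // 20 code units, ill-formed
        Console.WriteLine(TrimDanglingSurrogate(truncated).Length); // 19, well-formed
    }
}
```

Dropping the whole character is cruder than what the control should do (refuse the input up front), but at least the result is a legal UTF-16 sequence.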
I would go so far as to say that this punk control is breaking a safety contract, which could even lead to security problems if you count causing unexpected exceptions that terminate an application as a crude sort of denial of service.
Why should any WinForms process or method or algorithm or technique produce invalid results?
The problem seems to extend beyond just WinForms. I just made a simple WPF application with a TextBox with MaxLength="20". When I pasted "0123𠀀" into the text box, it displayed "0123𠀀", but when I pasted "0123401234012340123𠀀", it displayed "0123401234012340123��", just like the WinForms app you demonstrated. It seems no .NET GUI is safe.
(I tried posting this comment from Opera, but I don't think it worked.)
I also find that in IE8, Firefox, and possibly other browsers on the Windows platform, web pages with a textbox that specifies the maxlength attribute inherit(?) this behaviour. The final character shows fine if the limit is not exceeded, but breaks if it lands right "on" the limit. (This is amazing because controls on browser pages are not real Windows controls, just something that emulates the controls' behaviour. I thought there was a good chance that IE would have worked around this.)
Fortunately most web application runtime code (.NET, ASP, JSP, PHP, etc.) is prepared to deal with something like this (who needs a textbox to send a broken string when you can post it directly?), or this might have opened a new vector of vulnerabilities.
The whole notion that characters = UTF-16 code units has been broken since the mid-1990s. Usually this is explained away by saying that nobody except character geeks ever needs to worry about non-BMP characters, and that is just about as hollow an excuse as they come. It's like saying it's OK for MaxLength to fail if the string consists of all A's.
Doug -- if you take the full linguistic character notion from the series, then even UTF-32 code points are insufficient, but the minimum bar in UTF-16 should be code points, at least.
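Those layered notions of "character" can be made concrete. A sketch of my own (not from the thread) counting one string three ways -- code units, code points, and text elements, the last being .NET's approximation of linguistic characters:

```csharp
using System;
using System.Globalization;

class CountingDemo
{
    static void Main()
    {
        // "e" + COMBINING ACUTE ACCENT + U+20000:
        // 4 UTF-16 code units, 3 code points, 2 text elements.
        string s = "e\u0301" + char.ConvertFromUtf32(0x20000);

        Console.WriteLine(s.Length); // 4 code units

        int codePoints = 0;
        for (int i = 0; i < s.Length; i += char.IsSurrogatePair(s, i) ? 2 : 1)
            codePoints++;
        Console.WriteLine(codePoints); // 3 code points

        int textElements = 0;
        var e = StringInfo.GetTextElementEnumerator(s);
        while (e.MoveNext()) textElements++;
        Console.WriteLine(textElements); // 2 "linguistic characters"
    }
}
```

A MaxLength that counted code points would fix the broken-surrogate bug; only one that counted text elements would respect the full linguistic character notion.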
Aaron, Cheong -- yes, the problem exists elsewhere. And it is probably a separate bug to be fixed in each place it pops up. :-(
"And it is probably a separate bug to be fixed in each place it pops up"
Yep, the WinForms TextBox uses the Win32 EDIT control. Its EM_LIMITTEXT has the same issue.
And I wonder what SQL Server does if you put this string in a nvarchar(20) column. Too lazy to check :)
Depends on which of the thousand-odd ways to insert you use; they can be just as bad in SQL Server.
> full linguistic character notion
Cutting off the low surrogate and leaving the high surrogate is much, much worse than cutting off combining diacritics and leaving the base character, from the TextBox standpoint. At least you still end up with a valid character string.
I think I made it clear I thought that was worse. :-)
Who decided that cutting off your string and beeping at you was good UI, anyway? Surely it'd be better to just not let the user submit the form until they delete enough stuff to fit [and provide interactive feedback about the limit, à la Twitter].
Well, I'm only half-serious -- I'm sure this was easier given the limitations programmers had to work with in 1983 -- but surely no new programs should be using this "feature".
I remember this:
I find that article to be rather naive, alarmist, and biased, myself.