Why does char convert implicitly to ushort but not vice versa?

Another good question from StackOverflow. Why is there an implicit conversion from char to ushort, but only an explicit conversion from ushort to char? Why did the designers of the language believe that these asymmetrical rules were sensible rules to add to the language?

Well, first off, the obvious things which would prevent either conversion from being implicit do not apply. A char is implemented as an unsigned 16 bit integer that represents a character in a UTF-16 encoding, so it can be converted to or from a ushort without loss of precision, or, for that matter, without change of representation. The runtime simply goes from treating this bit pattern as a char to treating the same bit pattern as a ushort, or vice versa.

It is therefore possible to allow either implicit conversion. Now, just because something is possible does not mean it is a good idea. Clearly the designers of the language thought that implicitly converting char to ushort was a good idea, but that implicitly converting ushort to char was not. (And since char to ushort is a good idea, char-to-anything-that-ushort-converts-to also seems reasonable; hence char to int is good as well.)
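
To make the asymmetry concrete, here is a minimal sketch (inside some method body) of how these conversion rules play out; the variable names are mine, purely for illustration:

    char c = 'A';
    ushort u = c;        // fine: char converts implicitly to ushort
    int i = c;           // also fine: char converts implicitly to int
    // char bad = u;     // does not compile: no implicit conversion from ushort to char
    char c2 = (char)u;   // fine once the conversion is spelled out explicitly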

Unlike you guys, I have the original notes from the language design team at my disposal. Digging through those, we discover some interesting facts.

The conversion from ushort to char is covered in the notes from April 14th, 1999, where the question of whether it should be legal to convert from byte to char arises. In the original pre-release version of C#, this was legal for a brief time. I've lightly edited the notes to make them clear without an understanding of 1999-era pre-release Microsoft code names. I've also added emphasis on important points:

[The language design committee] has chosen to provide an implicit conversion from bytes to chars, since the domain of one is completely contained by the other. Right now, however, [the runtime library authors] only provide Write methods which take chars and ints, which means that bytes print out as characters since that ends up being the best method. We can solve this either by providing more methods on the writer class or by removing the implicit conversion.

There is an argument for why the latter is the correct thing to do. After all, bytes really aren't characters. True, there may be a useful mapping from bytes to chars, but ultimately, 23 does not denote the same thing as the character with ASCII value 23, in the same way that the byte 23 denotes the same thing as the long 23. Asking [the library authors] to provide this additional method simply because of how a quirk in our type system works out seems rather weak.

The notes then conclude with the decision that byte-to-char should be an explicit conversion, and integer-literal-in-range-of-char should also be an explicit conversion.

Note that the language design notes do not call out why ushort-to-char was also made explicit at the same time, but you can see that the same logic applies. When passing a ushort to a method overloaded as M(int) and M(char), odds are good that you want to treat the ushort as a number, not as a character. And a ushort is fundamentally a numeric representation, not a character representation, so it seems reasonable to make that conversion explicit as well.
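
To illustrate that overload-resolution point, here is a hypothetical sketch; the method names and messages are mine:

    using System;

    class Sketch
    {
        static void M(int n)  { Console.WriteLine("number"); }
        static void M(char c) { Console.WriteLine("character"); }

        static void Main()
        {
            ushort us = 65;
            M(us);        // prints "number": only the implicit ushort-to-int conversion applies
            M((char)us);  // prints "character": the cast says that 65 is meant to be 'A'
        }
    }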

The decision to make char go to ushort implicitly was made on the 17th of September, 1999; the design notes from that day on this topic simply state "char to ushort is also a legal implicit conversion", and that's it. No further exposition of what was going on in the language designers' heads that day is evident in the notes.

However, we can make educated guesses as to why implicit char-to-ushort was considered a good idea. The key idea here is that the conversion from number to character is a "possibly dodgy" conversion. It's taking something that you do not know is intended to be a character, and choosing to treat it as one. That seems like the sort of thing you want to call out that you are doing explicitly, rather than accidentally allowing it. But the reverse is much less dodgy. There is a long tradition in C programming of treating characters as integers -- to obtain their underlying values, or to do mathematics on them.
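
For example, all of these familiar idioms depend on treating a character as its numeric value, and they compile today precisely because char converts implicitly to int (again, the variable names are mine):

    char letter = 'q';
    int codeUnit = letter;             // 113, the UTF-16 code unit value
    int alphabetIndex = letter - 'a';  // 16: arithmetic on characters
    char next = (char)(letter + 1);    // 'r': going back to char requires a cast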

In short: it seems reasonable that using a number as a character could be an accident and a bug, but it also seems reasonable that using a character as a number is deliberate and desirable. This asymmetry is therefore reflected in the rules of the language.

  • One thing I've always wondered is: why is there no implicit char->string conversion? It is also lossless, it has definite semantic meaning that is intuitive, and it is generally handy. There's also precedent for it (e.g. Pascal/Delphi). If you allow char->ushort, then it seems like a no-brainer.

    Of course, I realize that this is one of those "why doesn't it have X?" questions that should really be "why should it have X", and I strongly suspect that originally it simply wasn't considered in 1.0, and changing it later can potentially break existing overloads. But still, perhaps there is some specific design rationale for it?

    It was considered in v1.0. The language design notes from June 6th 1999 say "We discussed whether such a conversion should exist, and decided that it would be odd to provide a third way to do this conversion. [The language] already supports both c.ToString() and new String(c)". -- Eric
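
    (For reference, a quick sketch of how the conversion can be spelled today; note that in current versions of the framework the closest string constructor takes a char plus a repeat count:)

        char c = 'x';
        string s1 = c.ToString();      // "x"
        string s2 = new string(c, 1);  // "x", via the (char, count) constructor
        string s3 = "" + c;            // "x", via concatenation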

  • More on the subject at hand - it is somewhat unfortunate that the extra conversion makes some overloads ambiguous. For example:

       static void Foo(char c, object o);

       static void Foo(int n, string s);

       Foo('a', "b"); // ambiguous

    Of course this is nothing new, it's just that it may be somewhat confusing in this case because a programmer might not expect there to be an implicit char->int conversion (especially if he already discovered that there's no implicit {any-integral-type}->char conversion before). And the compiler doesn't spell out conversions that it considers when performing resolution - it just lists all method signatures that it ends up with.
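
    (For what it's worth, in a sketch like the one above a cast on either argument is enough to break the tie:)

        Foo((int)'a', "b");     // resolves to Foo(int, string)
        Foo('a', (object)"b");  // resolves to Foo(char, object)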

  • @Eric:

    Hm, but neither `char.ToString()` nor `new String(char)` is implicit, and the biggest benefit of such a conversion is to do things like calling Foo(string) while only having a char, with no additional syntax for a cast or conversion.

  • @Pavel

    Then again, conversion of char to string requires allocating an object on the managed heap - something that should, IMO, never happen as a result of an implicit conversion. Wordy though it is, I prefer to have the explicit .ToString() or new string() there to make the existence of that new object clearly visible.

  • @Pavel, I'm pretty happy there's no implicit char -> string conversion.

    As you say, you've got to think about why it should be so, and the cases where programmers can make hard-to-track errors because of all this implicitness.

    It's almost like asking for there to be an implicit conversion from int to int[], double to double[], etc.

    As far as I'm aware, only the params keyword does anything like that, and it's a very special case of an implicit conversion.

  • @Carl, re implicit allocation on the heap: with boxing, that happens all the time.

  • @AC, the problem with an implicit int->int[] conversion (for example) is that the created array would have distinct object identity. This matters for arrays because they're mutable. So making it clear when a new instance of array is created is very important - which is why e.g. properties returning arrays are frowned upon, and StyleCop will shoot them down and ask you to rewrite them as GetXxx() methods instead, to emphasize that a new array is returned every time.

    There's no such problem with String - it's immutable, so a new instance being created is no big deal - no-one is going to try to write something into it, and then be surprised that the char lvalue from which it was implicitly created didn't mutate.

    XQuery, for example, treats all atomic values as sequences of length 1 (so xs:int can be substituted where xs:int* or xs:int+ is expected), and I never recall it being a problem there.

  • @AC - boxing happens, I wouldn't say all the time.  I'd expect that in modern .NET (2.0+) code boxing happens rather infrequently - calls to String.Format (and similar) likely being the most common case.
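
    (A small sketch of the kind of implicit heap allocation being discussed; the variable names are mine:)

        int n = 42;
        object boxed = n;                     // implicit conversion that allocates a box on the heap
        string s = string.Format("{0}", n);   // n is boxed on its way into the object parameter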

  • I'm actually surprised the conversion from char to ushort is implicit. Exposing the number a character happens to be represented by breaks encapsulation. A character is not a number.

    Consider C as an example. It chiseled the equivalence between bytes and characters right into the language. But characters aren't always a byte wide. Oops!

  • @Pavel,

    Since we're on the subject ...

    I fail to see what immutability or 'distinct identity' have to do with my drawing a comparison between char->string and int->int[]. A string is a heap-managed object that also has a distinct identity. By distinct I'm guessing you mean it has a separate reference in memory, but every object has a distinct identity. 'Immutability' means some objects are carefully crafted to return new objects rather than mutate in place, but that's because they were designed that way.

    In the int implicit to int[] example, the caller of foo(int[]) doesn't have an array to start with, so mutating said array won't change anything. The temporary int[] is discarded after foo() is called.

    Also, to say there is no overhead, or that an implicit conversion is OK because the result is immutable, is a little weak.

    Anyways, I'm not arguing for it. I don't like implicit conversions.

    PS languages that treat everything as a list are fine, but C# isn't one of them. I like jQuery as well. They both have their uses.

    @Carl I agree with you. I am just pointing out that there's implicit heap object creation (boxing) all the time. Generics and current framework aside, it's still there.

    :-)

  • @Strilanc the situation in C# (or, rather, the CLI) is a bit different from C. In C it's not specified at all what encoding strings and chars use (or, for that matter, that char is in fact eight bits, though since it is required to be the smallest addressable unit it would be rather inconvenient for it not to be), whereas C# explicitly uses 16-bit Unicode. Since it's well-defined, 'a' doesn't just "happen to" be 97; it absolutely _is_ U+0061 = 0x61 for all time. So it's perfectly viable to provide a fast way to get the numeric UTF-16 code unit that corresponds to a char, though an implicit conversion isn't the only way to do it. For example, VB doesn't provide a conversion (explicit or implicit); instead it provides the AscW/ChrW functions, which always optimize down to the same thing as the conversion.
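
    (A small C# fragment showing that guarantee; it assumes the usual using System; directive:)

        Console.WriteLine((ushort)'a' == 0x61);  // True: the mapping is fixed by the specification
        Console.WriteLine((int)'é');             // 233, i.e. U+00E9, the UTF-16 code unit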

  • That was a bit incoherent since I was going back and forth between two windows checking to make sure the standards actually said what I assumed they said. "16-bit unicode" is what the CLI standard says (though it mentions UTF-16 in one place), the C# standard specifically says UTF-16.

  • @AC

    Converting a character to a string has obvious and unambiguous semantics, but I have no idea what converting an int to an int[] would even be supposed to do. Would it give the decimal digits of the int? Would it give the binary digits? Would it just give an int[] of one element, that is the original int? Which one, or what else, and for what reason?

    What is even the use of such a conversion?

    I agree that even with char to string being unambiguous and obvious, the conversion should not be implicit, because while it's easy to understand exactly what's happening, it's also unexpected. But I don't think your int[] example is a convincing argument, because I don't think it makes nearly as much sense as the string conversion.

  • @Joren, I don't see what is confusing. A string is conceptually an array of characters, so the conversion from a single char to a string is exactly the same as converting a single int (or any other type) into a single-element array of that type.

    Other languages DO implement this.

    For the record, I would be dead-set against this being implemented for C#!

  • @Joren and @TheCPUWizard I'm dead set against this as well. It was meant as a counterpoint to implicit char -> string conversions.

    There's already a perfectly good, simple, and explicit way to do this:

    foo( new[] { myInt } );
