Larry Osterman's WebLog

Confessions of an Old Fogey
Blog - Title

What's wrong with this code, part 3

What's wrong with this code, part 3

  • Comments 44

This time, let’s consider the following routine used to determine if two strings are equal (case insensitively).  The code’s written in C# if it’s not obvious.

            static bool CompareStrings(String string1, String string2)
            {
                  //
                  //    Quick check to see if the strings length is different.  If the length is different, they are different.
                  //
                  if (string1.Length != string2.Length)
                  {
                        return false;
                  }

                  //
                  //    Since we're going to be doing a case insensitive comparison, let's upper case the strings.
                  //
                  string upperString1 = string1.ToUpper();
                  string upperString2 = string2.ToUpper();
                  //
                  //    And now walk through the strings comparing the characters to see if they match.
                  //
                  for (int i = 0 ; i < string1.Length ; i += 1)
                  {
                        if (upperString1[i] != upperString2[i])
                        {
                              return false;
                        }
                  }
                  return true;
            }

Yes, the code is less efficient than it could be, but there’s a far more fundamental issue with the code.  Your challenge is to determine what is incorrect about the code.

Answers (and of course kudos to those who found the issues) tomorrow.

 

  • Well, first things first... You're not checking to see if the two passed in strings are null.
  • I don't know if this is what you mean, but one or both of the objects coming in could be null, causing a NullReferenceException. a better way would be checking for null and either throwing an ArgumentException, although throwing exceptions from this type of code is not recommended from Microsoft practice. We could also choose to return false, but that would cause false negatives, which would impact our business code.
  • Well, for starters, the code assumes that a character by character comparison is valid. I'm no globalization expert, but I don't think this applies to certain languages (I'm thinking Asian languages like Chinese). You'd want to use the StringInfo class to do culture-aware string breaking of each string, then compare the results.

    For that matter, does ToUpper() make sense for these languages? I imagine there is no upper case equivalent to a Chinese character. :-)

    Jason
  • Changing the case of a string is always fishy when it involves a language without the concept of upper/lower case.

    It's also suspicious that the for loop checks string1.Length instead of upperString1.Length.

    It is possible that string1.Length != upperString1.Length or string2.Length != upperString2.Length? (I don't program in .Net languages so the inner workings of .Net strings are a mystery to me, I'm just going on my knowledge of C-style strings and all the "fun" things you run into with localization.)
  • Steve/Martijn: You're right, I didn't think about null parameters. My personal thinking is that just letting the null reference exception be thrown isn't that bad an idea, since it's a programming error on the caller.

    Jason: For languages like Chinese that don't contain cases, it's my understanding that ToUpper works like "123".ToUpper() - in other words it doesn't change the string.

    Mike (and Jason): You nailed about 3/4ths of the answer: You can't rely on length comparisons when comparing strings case insensitively.

    There's still another huge issue in the code, but you're on the right track.
  • If upperString2.Length < upperString1.Length, then upperString2[i] inside the for loop will blow up. But something tells me that this is not what you are looking for :)
  • I'm guessing trailing spaces don't matter.

    If (string1.Trim.Length != string2.Trim.Length)
    ....

    incase you compare: "asdf" and "asdf ",
    but then again, I guess those strings are not equal.
  • In Unicode, you can represent an "e" with an acute accent as either a single character or as two characters, the "e" and the acute accent combining-character. It doesn't cope with this kind of thing.
  • Those strings need to be trimmed, getting rid of any excess whitespace, control characters, etc. How reliable is toUpper when dealing with special characters? Why use the original string1 variable as the for loop control when you are clearly validating the UpperStrings?

    And lastly, if doing a comparison, I would think one should never alter the original variable storing the data which is being compared. It almost defeats the purpose.
  • Prasanna: Right, because the code has an implicit assumption that string1.Length == string1.ToUpper().Length.

    But that's not the remaining issue.
  • absolutely no clue, but I guess when the upper strings aren't the same length, you might get IndexOutOfRangeException.
  • Greg: Why do the strings need to be trimmed? I never said that the check was for tokens.

    In this routine, "asdf" != "asdf ".

    carlos: Yup, thats the essence if Mike Dunn's comment.

    There's still another issue though (although the suggestions above are on the right track).
  • Actually, this code is wrong on more than one count...

    For starters, checking the input values for null for the two strings is required! Passing a null value as one of the strings to the current implementation generates an exception.

    Besides, this implementation issue, there is a more fundemental flaw to this function: this function does not compare strings but only code points.

    The .NET Framework uses Unicode for string handling and one of the properties of Unicode is that some characters can be represented in different ways (e.g. é (e acute) can be represented as a single 16-bit code point in UTF-16 or as two code points (e + an acute combining character which is implemented as e\u0301.

    Checking the length of the strings and using a naive implementation as above will actually return false when comparing the following two strings "abcé" and "abce\u301" which is plain wrong when you are dealing with strings as these are two equally valid representations of the same string. In comparison, the implementation of String.Compare in the .NET FX handles this case very nicely and returns the appropriate result.

    Therefore the function should be renamed to CompareCodePoints or CompareOrdinal or something similar as it is really what it does.

    However, even in this case, there is still another (albeit minor in this case as the two uppercasing transforms are done with the same culture) issue. It should really use a culture invariant uppercasing (pass CultureInfo.InvariantCulture as a param to the ToUpper method.
  • I think I read some where in Raymond Chen's blog that there is wide byte UNICODE characters. That is a language alphabet (letter) takes more than one string characters.

    If the above one is true, then there may be possibility that a letter can be represented in more than one way without involving lowercase/uppercase.

  • I'd say that it may be possible that upperString1.Length != upperString2.Length and upperString1.Length != string1.Length that you can get Index out of bounds exceptions in the code and you don't have try catch logic in the code to handle this possibility.
Page 1 of 3 (44 items) 123