# Spot the defect: Bad comparisons, part three


Did you notice how last time my length comparison on strings was unnecessarily verbose? I could have written it like this:

static int ByLength(string x, string y)
{
    if (x == null && y == null) return 0;
    if (x == null) return -1;
    if (y == null) return 1;
    return CompareInts(x.Length, y.Length);
}

static int CompareInts(int x, int y)
{
    // Positive if x is larger, negative if y is larger, zero if equal
    return x - y;
}

static Comparison<T> ThenBy<T>(this Comparison<T> firstBy, Comparison<T> thenBy)
{
    return (x, y) =>
    {
        int result = firstBy(x, y);
        return result != 0 ? result : thenBy(x, y);
    };
}

Much nicer! My string length comparison method is greatly simplified, I can compose comparisons easily with the ThenBy extension method, and I can reuse CompareInts as a helper function in other comparisons I might want to write.
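To make the composition concrete, here is a hypothetical usage sketch; the sample strings and the ordinal tiebreaker are my own choices, not from the post, and the length comparison is inlined so the example is self-contained:

```csharp
using System;

static class ComparisonDemo
{
    static int ByLength(string x, string y)
    {
        if (x == null && y == null) return 0;
        if (x == null) return -1;
        if (y == null) return 1;
        return x.Length.CompareTo(y.Length);
    }

    static Comparison<T> ThenBy<T>(this Comparison<T> firstBy, Comparison<T> thenBy)
    {
        return (x, y) =>
        {
            int result = firstBy(x, y);
            return result != 0 ? result : thenBy(x, y);
        };
    }

    static void Main()
    {
        string[] words = { "pear", "fig", "plum", "kiwi" };
        // Sort by length first; break ties with an ordinal comparison.
        Comparison<string> comparison =
            ((Comparison<string>)ByLength).ThenBy<string>(string.CompareOrdinal);
        Array.Sort(words, comparison);
        Console.WriteLine(string.Join(", ", words)); // fig, kiwi, pear, plum
    }
}
```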

What's the defect now?

.

.

.

.

.

.

This one should be easy after last time. The lengths of strings are always non-negative integers, and in practice they never go above a few hundred million; the CLR does not allow you to allocate truly enormous multi-gigabyte strings. But though CompareInts is safe for inputs which are string lengths, it is not safe in general. In particular, for the inputs Int32.MinValue and Int32.MaxValue, the difference is 1. Clearly the smallest possible integer is smaller than the largest possible integer, but this method gives the opposite result! CompareInts should read:

static int CompareInts(int x, int y)
{
    if (x > y) return 1;
    if (x < y) return -1;
    return 0;
}

The moral of the story here is that a comparison function that doesn't compare something is probably wrong. Subtraction is not comparison.
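To see the failure concretely, here is a minimal sketch; C# arithmetic is unchecked by default, so the subtraction silently wraps around:

```csharp
using System;

static class SubtractionDefectDemo
{
    // The defective version: comparison by subtraction.
    static int CompareIntsBySubtraction(int x, int y) => x - y;

    static void Main()
    {
        // int.MinValue - int.MaxValue overflows and wraps around to 1,
        // so the smallest possible int compares as *greater* than the largest.
        Console.WriteLine(CompareIntsBySubtraction(int.MinValue, int.MaxValue)); // 1

        // The corrected comparison gets the sign right.
        Console.WriteLine(int.MinValue.CompareTo(int.MaxValue)); // -1
    }
}
```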

• Sometimes I think that instead of returning an int, Compare functions should return an enumeration:

enum CompareType // Built in. With less sucky naming.
{
    LeftLarger = 1,
    Equal = 0,
    LeftSmaller = -1
}

static CompareType CompareInts(int x, int y)
{
    if (x > y) return CompareType.LeftLarger;
    if (x < y) return CompareType.LeftSmaller;
    return CompareType.Equal;
}

The meaning of the int returned by a comparison function strikes me as non-obvious, despite the popularity of the convention.

• I would have used x.CompareTo(y)... It also shows the intent much better than x - y

• But subtraction _is_ comparison, if you take the overflow and carry flags into account!

• Brian, the value returned by the function could be used by the sorting function as an optimization hint.

It goes something like this:

The further away from zero the returned value is, the more 'unequal' both items are.

An optimized sorting function could use this knowledge in an attempt to skip a few items.

Diff :  A - B = 1

Diff :  E - F = 1

Diff :  A - E = 5

Therefore comparing A to F could be skipped since:

A and B are real close,

E and F are real close

A and E are really far apart,

therefore A and F could never be close.

• I had the same thought as Thomas.  Why would you not use x.CompareTo(y)?

• @Bas:

But then your spec for a comparison function would have to say that the value is not only of a certain sign, but also proportional to the difference.

The comparison functions most of us use are only spec'ed as returning positive, negative, or zero, with no importance given to the magnitude of the positive or negative return values. If your sorting function assumes something about the magnitude of those values without the spec requiring it, you are wandering into very dangerous territory.
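The point about magnitudes can be made concrete: two comparers that induce exactly the same ordering may return wildly different magnitudes, and both satisfy the Comparison&lt;T&gt; contract. A small sketch (the multiplier is arbitrary, my own choice):

```csharp
using System;

static class MagnitudeDemo
{
    static void Main()
    {
        // Both comparers order ints identically; only the sign is specified.
        Comparison<int> bySign = (x, y) => x.CompareTo(y);
        Comparison<int> scaled = (x, y) => 1000 * x.CompareTo(y);

        Console.WriteLine(bySign(3, 7)); // -1
        Console.WriteLine(scaled(3, 7)); // -1000

        // A sort that read meaning into the magnitude would treat these
        // two equivalent comparers differently, which is why it must not.
    }
}
```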

• @Austin, to steal something Eric has said before: if CompareTo did not exist, how would you write it? Somebody had to write all of those implementations of CompareTo. And people are still writing comparison functions, and they need to write good ones. No doubt many if not all of the examples presented in this blog series are from bug reports, either internally at Microsoft or from customers.

• Subtraction *is* a comparison for unsigned integers. For example, this is a widespread technique in embedded development.

• Hmm, I'm not sure how meaningful all this is anymore - if you're defining custom comparers, that's generally because you've a specific usage in mind: one in which you generally will not deal with all possible values the type system permits.  In fact, it's generally unavoidable to have errors for values that are legal by the type system but semantically illegal (though indeed a wrong value is generally a worse type of error than some form of exceptional termination).

So, as I see it, these comparison functions aren't so much invalid, they simply have a smaller domain than the type system indicates.  The alternative - verifying that inputs *are* in range or using comparisons - is significantly slower; so it's not easy to dismiss out of hand.

And in general, it's not feasible for functions to always verify their input anyhow - so the focus on *these* border cases is nice, but hardly critical.  Having undefined behavior when preconditions are violated is common in all kinds of software - probably almost all software.

I'd call these spec-bugs: the functions are in practice perhaps even superior to the "correct" versions - but their preconditions should be documented.

• @Bas @Mwf: The optimization hint scenario is feasible, though compare does not promise any meaning on the int, so it is only practical under two situations:

1.  The algorithm (e.g., sort) will still run successfully (and within any guarantees of speed it makes) even if the distance from 0 is arbitrary.

2.  Only specialized libraries should use this type of optimization, since people will probably not be expecting this kind of optimization, and it may slow things down.

Anyhow, this kind of optimization strikes me as the kind of thing that should *NOT* be built into the framework.  The people who need specialized comparisons can do some combination of using their own comparison libraries and just casting the enumeration to an int.

• @Sergey: No, it isn't.  The range of the result of a subtraction of values constrained only by type is always twice the range of the type, even if the type is unsigned, so wraparound is still problematic.  Of course, as Eric alluded to with the size of strings argument, if the values are constrained to a subset of the values allowed by the type, it may work.
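The unsigned wraparound described above is easy to demonstrate with a minimal sketch:

```csharp
using System;

static class UnsignedWraparoundDemo
{
    static void Main()
    {
        uint small = 1;
        uint large = uint.MaxValue;

        // 1 - 4294967295 wraps around to 2 in unsigned arithmetic,
        // so a subtraction-based "comparison" claims 1 > uint.MaxValue.
        uint diff = small - large;
        Console.WriteLine(diff); // 2

        // Reinterpreting the bits as signed doesn't rescue it either:
        Console.WriteLine((int)diff); // 2, still the wrong sign
    }
}
```

The embedded trick works only when the operands are known to be within half the range of the type, which is exactly the "constrained subset" caveat above.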

• I've seen the x - y and -cmp idioms so many times -- even in prominent examples of "well-written" code -- that I've never thought to question them. Needless to say, I've grepped all my source code to check my comparisons now. Found one bug (-cmp to reverse), and one x - y which was okay because of restricted range, but I've added a comment to it now.

So thanks, Eric, for these posts! :-)

@Bas: That "optimization" sounds pretty ridiculous. :-)  I don't think you can safely glean information from any general comparison function. (Maybe you can make some assumptions if you're only sorting integers.) From your example, it does seem clear that A need never be compared to F, but for a different reason:

E - F = 1 (implies E > F)

A - E = 1 (implies A > E)

A > E > F... so A > F.

• The origin is probably strcmp: strcmp did this in the original Unix implementation, and this is what the "may return any value of the appropriate sign" contract was invented for. It was safe then, since what was being subtracted was chars, which are smaller than ints, but it got into people's heads that this is for some reason a terribly clever thing to do. (It avoids one or two branches, which was relevant on the PDP-11 but not so much today; never mind that strcmp requires three branches _per character_ up to the point where they differ.)

@Bas Mommenhof: What kind of sorting algorithm could make use of this and be faster [including the extra work for the comparison function to decide whether the inputs are "close" or not] than a standard sorting algorithm? Also, this is an extra contract on top of the usual one, and it doesn't fit with current usage [with traditional strcmp implementations, "aa" vs. "ab" results in 1, while "aaaa" vs. "aaaz" results in 25].

• I'd chalk this problem up not to (the use of) comparisons but to the implementation of numbers in programming languages like C#. Mathematically speaking it's rather strange to have several different, bounded types to represent numbers, and these problems highlight the trouble that comes with them. After all, Int32.MinValue - Int32.MaxValue == 1, WTF??? Of course, in the olden days the representation of numbers had to be constrained to a few bytes at most, because of the inherent speed and memory limitations we had back then. But I'd argue that Int32 (and the other Int## types) are dinosaurs from those early days.

Progress has been made to eradicate many of the other early unwieldy/'dangerous' language constructs (however useful they were back then, e.g. the use of pointers, the use of 0/1 for true/false, etc.), and it is high time we took another good look at the representation of numbers. I'm pretty sure there are programming languages out there that have only one notion of number and do everything with it. (Well, maybe two: one discrete, one continuous.) Int32 and its friends may be retained for backwards compatibility and/or speed for programs that require them, but I say let's have a new all-encompassing Number type.

• @Everyone: I am not actually advocating this.

I'm only stating what I think I remember reading a long time ago.

Brian asked where the int was coming from; this was a possibility.

I am not too happy about everyone disagreeing with it, because that could be an indication that my memory is failing.

I must be getting too old. (I still remember floppy disks.)
