Type inference woes, part four

Last time in this series I discussed how we are probably going to identify a "best" type from a set of expressions. Clearly we need to solve this problem for features such as implicitly typed arrays. We are also considering extending this algorithm to affect type inference of generic methods.

As always, the reason we're considering this is eminently practical; we try hard not to add new language semantics just for the heck of it, but rather in response to specific user problems. Before I get into the motivating problem, let me describe a simpler case that might look familiar:

void M<T>(T t1, T t2){}
...
M(myShort, 0);

In our currently shipped C# compiler, the type inference algorithm insists that all inferences be 100% consistent. T cannot be both int and short, so inference fails in this case.

It might be kind of nice to say, you know what, the "best type" algorithm would tell us that short is best, so why don't we use it?

Of course, there are some situations where that stops working. We'd still need this to fail, for instance:

void M<T>(List<T> t1, List<T> t2){}
...
M(myListOfShorts, myListOfInts);

Generic types are neither covariant nor contravariant, so neither choice works in this case. However, it does work in this case:

void M<T>(Func<T> t1, Func<T> t2){}
...
M(()=>myInt, ()=>myShort);

because a lambda which returns a short is convertible to a delegate which returns an int, so inference to int could succeed.
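To make that concrete, here is a small compilable sketch. The body given to M is hypothetical, just enough to make the sketch runnable; the point is that explicitly choosing T = int compiles today, precisely because a lambda whose body returns a short converts to Func&lt;int&gt;:

```csharp
using System;

class Demo
{
    // Hypothetical body for M, just enough to make the sketch runnable.
    public static T M<T>(Func<T> t1, Func<T> t2) => t2();

    static void Main()
    {
        int myInt = 10;
        short myShort = 5;

        // Explicitly picking T = int compiles today: the second lambda's
        // body returns a short, which implicitly converts to int, so the
        // lambda converts to Func<int>. The proposal under discussion is
        // for inference to pick int on its own.
        int picked = M<int>(() => myInt, () => myShort);
        Console.WriteLine(picked); // prints 5
    }
}
```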

The motivating example for all of this is the join scenario. For those of you who are not database wonks, joining is one of the fundamental operations on tabular data. You typically have two tables with some relationship between them. For instance, you might have a list of customers, where each customer has a customer id, and a list of orders where each order has an associated customer id. You "join" the tables by producing a new table of records associating each customer with their orders. Maybe the new table is the customer name and the amount of each of their orders.

To represent a generic join operation as a method we need two tables of records, a key extractor for each table, and a projector from the joined records to the output records. The caller might look something like:

var results = Join(
    myCustomers,
    myOrders,
    customer=>customer.Id,
    order=>order.CustomerId,
    (customer, order)=>new {customer.Name, order.Price});

The fully generic method might look something like:

IEnumerable<RES> Join<REC1, REC2, KEY, RES>(
    IEnumerable<REC1> rec1s,
    IEnumerable<REC2> rec2s,
    Func<REC1, KEY> rec1KeyExtractor,
    Func<REC2, KEY> rec2KeyExtractor,
    Func<REC1, REC2, RES> projector) { ... }
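For intuition about what such a method does with those five arguments, here is a minimal hash-based sketch. This is not the actual LINQ implementation (which uses an internal lookup structure), but it shows the shape of the algorithm: one pass over the first table to build a hash table of buckets, one pass over the second table to probe it.

```csharp
using System;
using System.Collections.Generic;

static class JoinDemo
{
    // Hedged sketch of a hash-based equijoin. Build a dictionary keyed by
    // the first table's keys, then stream the second table through it,
    // projecting a result for every key match.
    public static IEnumerable<RES> Join<REC1, REC2, KEY, RES>(
        IEnumerable<REC1> rec1s,
        IEnumerable<REC2> rec2s,
        Func<REC1, KEY> rec1KeyExtractor,
        Func<REC2, KEY> rec2KeyExtractor,
        Func<REC1, REC2, RES> projector)
    {
        var buckets = new Dictionary<KEY, List<REC1>>();
        foreach (var r1 in rec1s)
        {
            var key = rec1KeyExtractor(r1);
            if (!buckets.TryGetValue(key, out var bucket))
                buckets[key] = bucket = new List<REC1>();
            bucket.Add(r1);
        }
        foreach (var r2 in rec2s)
        {
            if (buckets.TryGetValue(rec2KeyExtractor(r2), out var bucket))
                foreach (var r1 in bucket)
                    yield return projector(r1, r2);
        }
    }
}
```

Because each table is traversed once, the join runs in roughly linear time, which is what makes an equijoin cheap compared to filtering the full cross product.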

So now perhaps you see the problem. What if the customer key type is int but the order table's customer key type is Nullable<int>? We would like to infer that Nullable<int> is the better choice for KEY but today the type inference algorithm insists on 100% consistency.
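One workaround that does compile today is to nudge both key extractors to the same type with a cast. In this sketch, Customer, Order, and the deliberately naive Join stub are hypothetical stand-ins, just enough to demonstrate the inference:

```csharp
using System;
using System.Collections.Generic;

class Customer { public int Id; public string Name; }
class Order { public int? CustomerId; public decimal Price; }

static class Workaround
{
    // Deliberately naive O(n^2) stub: compare every pair of rows.
    // It exists only so the generic signature is available to infer against.
    public static IEnumerable<RES> Join<REC1, REC2, KEY, RES>(
        IEnumerable<REC1> rec1s,
        IEnumerable<REC2> rec2s,
        Func<REC1, KEY> rec1KeyExtractor,
        Func<REC2, KEY> rec2KeyExtractor,
        Func<REC1, REC2, RES> projector)
    {
        foreach (var r1 in rec1s)
            foreach (var r2 in rec2s)
                if (EqualityComparer<KEY>.Default.Equals(
                        rec1KeyExtractor(r1), rec2KeyExtractor(r2)))
                    yield return projector(r1, r2);
    }

    static void Main()
    {
        var myCustomers = new[] { new Customer { Id = 1, Name = "Alice" } };
        var myOrders = new[] { new Order { CustomerId = 1, Price = 9.99m } };

        // Casting the int key to int? makes both extractors yield
        // Nullable<int>, so KEY is inferred consistently even under
        // today's strict algorithm.
        var results = Join(
            myCustomers,
            myOrders,
            customer => (int?)customer.Id,
            order => order.CustomerId,
            (customer, order) => new { customer.Name, order.Price });

        foreach (var r in results)
            Console.WriteLine(r.Name + ": " + r.Price);
    }
}
```

The cast is the whole trick: it makes the two key-extractor lambdas agree on Nullable&lt;int&gt;, so no "best type" reasoning is needed. The goal of the proposed algorithm is to make inference do this for you.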

As you can see, this is rapidly becoming complex. We would like to come up with an inference algorithm that solves this problem, but at the same time can be clearly described in the standard and understood by mortals. (Where by "understood by mortals" I mean "such that Eric has some chance of implementing the algorithm correctly"!) Any thoughts or comments you might have would be appreciated; the language design committee will be debating these issues this week and so this is the time to give feedback about these things.

  • I'm probably missing something important here, but isn't an int implicitly convertible to a Nullable<int>?  In which case, couldn't the KEY in this example be inferred using one of the algorithms in part 3?

    I think I remember this example coming up earlier and you asked if the algorithm should infer a Nullable<int>, and I'd say that makes perfect sense.  If you were trying to join a Nullable<short> and an int, then it would have to produce an error, but in the case of a Nullable<int> and anything implicitly convertible to an int, it makes sense that Nullable<int> should be the inferred type.

    Further to that, since this example really applies primarily (I'd assume) to LINQ and is really a DB-oriented concept, it's probably going to end up getting used most by the database wonks like myself, who will expect it to work the same as a SQL join, which could include null values if it were an outer join.  I think the C# team has also made a few other decisions which favour SQL behaviour although I can't remember offhand what they are.

    Does inferring a Nullable<int> from the example here conflict with any of the inference algorithms you put forth before?
  • No, there's no conflict.  The problem lies in specifying exactly what the inference algorithm should be.

    In my next post I'll describe some of the weird issues you run into when you start applying unification algorithms during type inference on lambdas.
  • I think there's a straightforward answer to this question but it's a little fuzzy in my head at this point, and naturally you'll have thought about this more than I ;)

    Your example:

    var results = Join(
       myCustomers,
       myOrders,
       customer=>customer.Id,
       order=>order.CustomerId,
       (customer, order)=>new {customer.Name, order.Price});

    Could this not be specified instead as:

    var results = Join(
       myCustomers,
       myOrders,
       (customer, order) => customer.Id == order.CustomerId,
       (customer, order)=>new {customer.Name, order.Price});

    and doesn't that eliminate the problem for this case because you no longer have to infer anything about that particular lambda, and the == operator already has its behavior very well defined?
  • @Stuart Ballard: That would work, but it would execute incredibly slowly.  Without knowledge of the keys, the join algorithm would have to compare every row in the first table with every row in the second table to see if they matched or not.


    Eric Lippert: "In my next post I'll describe some of the weird issues you run into when you start applying unification algorithms during type inference on lambdas."

    This may be why you haven't had many comments; without going through the entire design exercise ourselves, it's not possible to offer any constructive criticism.  Or maybe everyone's on holiday.  Anyway, I'm looking forward to the next post.
  • Right -- Stuart, what you are describing is just a filter on the cross product of the two tables.  Running such a filter naively on two tables of size n is typically an n squared operation.

    The benefit of doing an equijoin is that because you know that there is an equality relation you can use hash tables to massively speed up the join.  An equijoin can typically be done in linear time.
  • Hmm... this doesn't really help with the type inference problem, but now I'm curious... SQL allows any kind of join operator using a syntax similar to Stuart's example, and chooses the best query plan depending on that operator.  For equality, it's a hash match, and for most others, it has to be an inner loop.  But the point is that it's "smart enough" to figure that out based on the syntax, somehow (maybe it's because the type system is a lot less complicated in an RDBMS).

    Point being, people might want to use the "naive" join in some cases (I can't think of any examples offhand, but SQL syntax wouldn't support it if there was no reason to ever use it).  So in that regard, it seems kind of silly to have a special syntax for an equijoin.  I think that the syntax actually should look like Stuart's example and that the compiler should determine, based on the operator and the types, whether or not it's possible to use a hash match.  Of course, I don't have .1% of the knowledge that you do on this subject, so that could very well be a ridiculous expectation.

    But if C# were able to determine the best matching algorithm based on the syntax, like SQL does, then in cases where a consistent type can't be inferred, the join can still be done the "lazy" way if there's an appropriate comparison operator.  And in some cases that might be fine.  An n-squared operation isn't the end of the world if there are only a few hundred records.

    Also, this might be way out in left field, but do you really even need a consistent type for hashing?  Technically, I would think that the only requirement here is that the type be able to generate a hash value, which every type in .NET does.  In the non-generic Hashtable you can throw any old object in as a key.  There's probably an issue with hashing quality and collisions if you use different types, but I'd assume that Nullable<int> would generate identical hashes to int for non-null values...

    I kind of get the feeling that I'm in over my head and that there's a whole lot of background one has to understand before being able to make a useful comment.  But hey, I tried. :-)
  • My version isn't a filter on the cross product if the type of the parameter in question is Expression<Func<Customer,Order,bool>> (or whatever the precise syntax is)...