Baby Names

I recently finished reading Freakonomics.  It is a fascinating book about a number of strange phenomena.  Its topics range from the economics of dealing crack to cheating in sumo wrestling.  Among the sundry topics is a discussion of the psychology and sociology behind baby names.

This topic has interested me ever since I found out that my wife and I gave our first two children some of the most popular names for their birth years.  We did not intentionally look for popular names, but we picked them anyway.  I always wondered how it was that I picked with the crowd.  I had theories, but I didn't have any data.

So when I heard Levitt and Dubner's theory about why some baby names are so popular, I was naturally very curious.  Their hypothesis is that affluent and successful families set the trend (not celebrities).  Some baby names begin to take hold in affluent circles.  Then, when less successful parents are looking for a baby name, they choose the names of children of privilege and opportunity.  Thus the name continues to extend its influence from one family to another.  Eventually, everyone is giving their child the name (often misspelling it).  Finally, as more and more of the common folk use the name, the elitists stop using it.

Their theory seemed probable enough and they had some good data to back it up, but I had my doubts.  I certainly didn't feel like I picked my son's name because I thought some other child's opportunities would rub off on him.

Later on, I was in need of an app to test LINQ to Objects query performance over a large dataset.  Unfortunately, we didn't have a suitable app and I didn't have a lot of time.  Furthermore, I didn't have a large, easily accessible in-memory dataset to work with.  So I decided to write a quick little app to screen-scrape the Social Security Administration's Popular Baby Names site.  The app pulled down the one hundred most popular names for every year since 1960, in every state, by gender.  I ended up with 40 megabytes of XML where each element looked something like this:

<PopularName state="Alaska" year="1960" rank="5" gender="Female" name="Susan" count="50" />

I then wrote an app that loaded all of the data into memory.  Each XML element became a PopularName object which has a property for each attribute in the XML.  These names were stored in a local variable called names of type List<PopularName>.
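The loading step might look something like the following.  This is only a minimal sketch, not the original app: the root element name (PopularNames), the choice of string for Gender, and the PopularNameLoader helper are all assumptions made for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// One property per attribute of the <PopularName /> element.
public class PopularName
{
  public string State { get; set; }
  public int Year { get; set; }
  public int Rank { get; set; }
  public string Gender { get; set; }  // assumption: kept as a plain string
  public string Name { get; set; }
  public int Count { get; set; }
}

// Hypothetical helper; the name and shape are illustrative only.
public static class PopularNameLoader
{
  // Parses XML text containing <PopularName ... /> elements into a list.
  public static List<PopularName> Parse(string xml)
  {
    return XDocument.Parse(xml)
      .Descendants("PopularName")
      .Select(e => new PopularName
      {
        // XAttribute defines explicit conversions to string and int.
        State = (string)e.Attribute("state"),
        Year = (int)e.Attribute("year"),
        Rank = (int)e.Attribute("rank"),
        Gender = (string)e.Attribute("gender"),
        Name = (string)e.Attribute("name"),
        Count = (int)e.Attribute("count")
      })
      .ToList();
  }
}
```

Calling PopularNameLoader.Parse on the contents of the scraped file would then produce the list that gets stored in names.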

I then wrote a number of queries against the baby name dataset.  One of the queries shows, for a given name, the number of children given that name each year.  This query is run by calling NameUsage and passing in names and the name to search for.

NameUsage(names, "Wesley");

Where the body of the method looks like:

static void NameUsage(IEnumerable<PopularName> names, string searchName)
{
  Console.WriteLine("{0} Usage", searchName);
  // Keep only the entries for searchName, then total the counts
  // across all states and genders for each year.
  var q = from name in names
          where name.Name == searchName
          group new { name.Name, name.Count } by name.Year
            into g
            orderby g.Key ascending
            select new { Year = g.Key, TotalCount = g.Sum(x => x.Count) };

  // Anonymous types print as "{ Year = ..., TotalCount = ... }".
  foreach (var item in q) Console.WriteLine(item);
}

This particular query displays:

Wesley Usage
{ Year = 1960, TotalCount = 107 }
...
{ Year = 2005, TotalCount = 159 }

Here is a graph of the data for the usage of the name "Wesley".

So it seems that my parents were victims of their time as well.  But is it only me and my children?

Apparently not.  My wife was also given her name during its period of popularity.  Note that these names are not necessarily really popular names.  Even so, the giving of various names seems to ebb and flow.  It is fascinating to think about how this behavior emerges from the vast array of parents seeking the best name for their child.

 

Nameless Keys

Another one of the queries that I wrote lists the most popular names overall.  I wanted to distinguish names by gender usage (Terry, Pam, ...), but how can we do that with queries?

What I really want is to make the equality of the names based on the name itself and the gender usage.  In C# 3.0, we added anonymous types.  These neat little guys are very useful and one of the ways that they are the most useful is as composite keys.

Anonymous types have structural equality semantics within an assembly.  If two anonymous type expressions in the same assembly declare the same property names, in the same order, with the same property types, then they produce the same type.  If two instances of that type also hold equal property values, then the instances are equal and have the same hash code.
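A small sketch makes these rules concrete.  The class and method names here are made up for illustration; only the behavior of the anonymous types themselves is the point.

```csharp
using System;

// Hypothetical demo class; the names are illustrative only.
public static class AnonymousEquality
{
  // Same property names, order, types, and values: returns true.
  public static bool SameValuesAreEqual()
  {
    var first = new { Name = "Terry", Gender = "Male" };
    var second = new { Name = "Terry", Gender = "Male" };
    return first.GetType() == second.GetType()
        && first.Equals(second)
        && first.GetHashCode() == second.GetHashCode();
  }

  // Same type, but one property value differs: returns false.
  public static bool DifferentValuesAreEqual()
  {
    var first = new { Name = "Terry", Gender = "Male" };
    var second = new { Name = "Terry", Gender = "Female" };
    return first.Equals(second);
  }
}
```

Note that the comparison goes through Equals and GetHashCode.  The == operator is not overloaded for anonymous types, so it still compares references.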

We can use these facts to write queries which define equality between objects on several values.  In this case equality depends on the name and the gender.  So in the group...by clause we will group not on the name but on an anonymous type with members corresponding to the name and to the gender of the item.

static void TopNames(List<PopularName> names, int count)
{
  Console.WriteLine("Top {0} Names", count);
  // Group on the composite key (Name, Gender) so that a name used for
  // both genders is totaled separately for each.
  var q = (from name in names
           group name.Count by new { name.Name, name.Gender }
             into g
             let TotalCount = g.Sum()
             orderby TotalCount descending
             select new { g.Key.Name, g.Key.Gender, TotalCount })
          .Take(count)
          // This Select overload supplies the zero-based position, which becomes the rank.
          .Select((x, rank) => new { Rank = rank + 1, x.Name, x.Gender, x.TotalCount });

  foreach (var item in q) Console.WriteLine(item);
}

Anonymous types can be used as composite keys in other query clauses such as join.  (For orderby, list the keys separated by commas instead; anonymous types define equality and hashing but do not implement IComparable.)  We can also use them to add multi-argument support to our memoize function.  We will use them as multi-argument keys in the map contained in the memoization function.

public static Func<A, B, R> Memoize<A, B, R>(this Func<A, B, R> f)
{
  var map = new Dictionary<???,R>();  // ??? is the anonymous key type, which we cannot name
  return (a, b) =>
  {
    var tuple = new { a, b };
    R value;
    if (map.TryGetValue(tuple, out value))
      return value;
    value = f(a, b);
    map.Add(tuple, value);
    return value;
  };
}

Memoize takes a function of two arguments of types A and B and returns a function with the same signature.  The difference is that the returned lambda first packs its two arguments into a tuple (an instance of an anonymous type) and looks that tuple up in the map.  If the tuple is already there, the cached result is returned; otherwise f is called and its result is stored under the tuple.  Essentially, we are using the anonymous type to form a composite key of the two arguments passed to the lambda.

Mumbling

But how can we create a Dictionary from an anonymous type to type R?  While it is easy to specify the type of map using the contextual keyword var even though the type doesn't have a speakable name, it isn't obvious how to specify the type parameters to the Dictionary constructor when we want to instantiate the type to an anonymous type.

We can get around this problem by introducing a new helper class.

static class DictionaryHelper<Value>
{
  public static Dictionary<Key, Value> Create<Key>(Key prototype)
  {
    return new Dictionary<Key, Value>();
  }
}

Here we put the type parameters that we can name (Value in this case) on the helper class.  Then we create a static method in the helper class that takes the remaining type parameters (Key in this case) and also takes one ordinary parameter of the corresponding type for each of them.  The arguments passed for these parameters let type inference work out the unspeakable type arguments, so we do not need to write them.

This means that we can replace the map creation in Memoize with the following code.

var map = DictionaryHelper<R>.Create(new { a = default(A), b = default(B) });

We specify one of the type parameters (R) of the Dictionary explicitly, but we specify the other (an anonymous type) implicitly by providing an example of the type.
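Putting the pieces together, a complete version might look like the sketch below.  The enclosing class name MemoizeExtensions is an assumption (extension methods need some static class to live in), and the local is called key rather than tuple; everything else follows the fragments shown earlier.

```csharp
using System;
using System.Collections.Generic;

public static class DictionaryHelper<Value>
{
  // The prototype argument is never read; it exists only so that
  // type inference can work out Key.
  public static Dictionary<Key, Value> Create<Key>(Key prototype)
  {
    return new Dictionary<Key, Value>();
  }
}

// Assumed container class for the extension method.
public static class MemoizeExtensions
{
  public static Func<A, B, R> Memoize<A, B, R>(this Func<A, B, R> f)
  {
    // The prototype has members a and b of types A and B, in that order,
    // so it is the same anonymous type as the key built in the lambda.
    var map = DictionaryHelper<R>.Create(new { a = default(A), b = default(B) });
    return (a, b) =>
    {
      var key = new { a, b };
      R value;
      if (map.TryGetValue(key, out value))
        return value;
      value = f(a, b);
      map.Add(key, value);
      return value;
    };
  }
}
```

To use it, assign the lambda to a typed delegate first, for example Func<int, int, int> add = (x, y) => x + y; and then call add.Memoize().  Repeated calls with the same pair of arguments are answered from the map without calling the original function again.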

I love using anonymous types as composite keys because they define equality and hashing semantics in terms of their members.  So next time you need a composite key, try using anonymous types.

In any case, now that my wife and I are expecting our third child, I have been writing a number of queries against this dataset to understand the ebb and flow of baby names.