Fabulous Adventures In Coding
Eric Lippert is a principal developer on the C# compiler team. Learn more about Eric.
My next book project is ready for copyediting, the wedding invitations are sent out, my bug count is under control, I've got a layer of levelling compound poured in the basement, there's no more ivy on my roof, and I’m rigging my sailboat some time this week (about six weeks later than I would have liked to, but let’s not be picky.) Life is getting back to normal in my little corner of Seattle. So what better time to continue with my little exploration of high-dimensional spaces?
Last time we noticed a few interesting facts:
Suppose you have a bunch of points in space scattered randomly around some average. You would expect that if the points are more or less evenly distributed then the number of points in a given volume of the space would be proportional to the volume. (In fact, that’s probably how we’d define "more or less evenly distributed.")
Let’s think of a low-dimensional example – points on a line. For example, suppose you've got a thousand 30-year-old women and you measure their heights. If the average is 150 cm, then you'd expect that the number of women that fall into the 2-cm range from 145-147 is roughly the same as from 150-152. There will probably be more in the range closer to the mean, but you'd expect that the order of magnitude would be roughly the same. It's not like there's going to be 100 in one range and 10 in the other. I'm assuming here that we're considering regions close to the mean so that the distribution is pretty even throughout.
Now suppose you take a 2-cm range and a 1-cm range, both close to the mean. You would similarly expect that the 1-cm range would have roughly half as many points in it as the 2-cm range, right? The number of points in a region is roughly proportional to the size of the region.
What if the distribution is 2-dimensional? Suppose now you plot both the heights and hair lengths of these 1000 women. Again, there will be some average around which the points cluster, say (150 cm, 30 cm). Now imagine that you draw two circles of the same area close to the mean. Again, you'd expect that two circles of the same area would have roughly the same number of points inside them. The circle with its center closer to the mean probably has a few more, but you'd expect roughly the same number of points in each circular region.
Now what if you make one of those circles half the radius of the other? Then it will have only one quarter of the area, and therefore only on average one quarter as many points.
Now suppose that you have a bunch of points scattered around a 100-dimensional space, all clustered around a particular average point. Pick a hyperspherical region centered on the average, with radius large enough to contain most of the points.
We expect to find more points in regions of the space with more volume. We know that 99% of the volume of this n-sphere is concentrated in a thin shell near its surface. The conclusion is inescapable: points clustered around a mean in high dimensions are far more likely to be found in the high-volume narrow shell far from the mean, not in the low-volume region near the mean itself!
This again seems really counterintuitive at first but its not if you think about it for a while. In our example above, most of the women are of near average height. And most of the women are of near average hair length. But clearly the number of women that are near-average height AND near-average hair length is considerably smaller than either taken separately.
As we add more and more dimensions to the mix, the number of points where all measurements are near the mean point gets very small indeed. Of the six billion people in the world, how many of them are within 1% of average height and average hair length and average age and average income and average cholesterol level and average IQ? Probably a very, very small number. Most everyone is slightly abnormal somehow! And in a high-dimensional space, many small deviations from the mean adds up to a large distance from the mean. If you plotted 100 characteristics of 6 billion people in a 100-dimensional space, the number of people very near the middle of every axis is quite small; there's hardly any volume there.
What the heck does this have to do with computer science? Suppose you have all that personal data in a database, you have a dead guy in the morgue with no ID, and you want to find a match in your database, if there is one? What if you have 100 variables about a song and you want to identify whether a given scrap of music is in your database? There are lots of real-world search problems that have sparse, large data sets spread out over a high-dimensional space.
Next time we'll dig into that a little bit more.