Fabulous Adventures In Coding

Eric Lippert's Blog

High-Dimensional Spaces Are Counterintuitive, Part Three

My next book project is ready for copyediting, the wedding invitations are sent out, my bug count is under control, I've got a layer of levelling compound poured in the basement, there's no more ivy on my roof, and I’m rigging my sailboat some time this week (about six weeks later than I would have liked to, but let’s not be picky.) Life is getting back to normal in my little corner of Seattle. So what better time to continue with my little exploration of high-dimensional spaces?

Last time we noticed a few interesting facts:

  • most of the volume of a high-dimensional object is very near its surface.
  • small movements away from a point in many dimensions add up to a long total distance.

Suppose you have a bunch of points in space scattered randomly around some average. You would expect that if the points are more or less evenly distributed then the number of points in a given volume of the space would be proportional to the volume.  (In fact, that’s probably how we’d define "more or less evenly distributed.")

 Let’s think of a low-dimensional example – points on a line. For example, suppose you've got a thousand 30-year-old women and you measure their heights. If the average is 150 cm, then you'd expect that the number of women that fall into the 2-cm range from 145-147 is roughly the same as from 150-152.  There will probably be more in the range closer to the mean, but you'd expect that the order of magnitude would be roughly the same. It's not like there's going to be 100 in one range and 10 in the other. I'm assuming here that we're considering regions close to the mean so that the distribution is pretty even throughout.

Now suppose you take a 2-cm range and a 1-cm range, both close to the mean.  You would similarly expect that the 1-cm range would have roughly half as many points in it as the 2-cm range, right?  The number of points in a region is roughly proportional to the size of the region.

What if the distribution is 2-dimensional?  Suppose now you plot both the heights and hair lengths of these 1000 women.  Again, there will be some average around which the points cluster, say (150 cm, 30 cm).  Now imagine that you draw two circles of the same area close to the mean.  Again, you'd expect that two circles of the same area would have roughly the same number of points inside them.  The circle with its center closer to the mean probably has a few more, but you'd expect roughly the same number of points in each circular region.

Now what if you make one of those circles half the radius of the other?  Then it will have only one quarter of the area, and therefore only on average one quarter as many points.

Now suppose that you have a bunch of points scattered around a 100-dimensional space, all clustered around a particular average point. Pick a hyperspherical region centered on the average, with radius large enough to contain most of the points.

We expect to find more points in regions of the space with more volume. We know that 99% of the volume of this n-sphere is concentrated in a thin shell near its surface. The conclusion is inescapable: points clustered around a mean in high dimensions are far more likely to be found in the high-volume narrow shell far from the mean, not in the low-volume region near the mean itself!

This again seems really counterintuitive at first but its not if you think about it for a while.  In our example above, most of the women are of near average height.  And most of the women are of near average hair length.  But clearly the number of women that are near-average height AND near-average hair length is considerably smaller than either taken separately. 

As we add more and more dimensions to the mix, the number of points where all measurements are near the mean point gets very small indeed.  Of the six billion people in the world, how many of them are within 1% of average height and average hair length and average age and average income and average cholesterol level and average IQ?  Probably a very, very small number. Most everyone is slightly abnormal somehow!  And in a high-dimensional space, many small deviations from the mean adds up to a large distance from the mean. If you plotted 100 characteristics of 6 billion people in a 100-dimensional space, the number of people very near the middle of every axis is quite small; there's hardly any volume there.  

What the heck does this have to do with computer science?  Suppose you have all that personal data in a database, you have a dead guy in the morgue with no ID, and you want to find a match in your database, if there is one? What if you have 100 variables about a song and you want to identify whether a given scrap of music is in your database?  There are lots of real-world search problems that have sparse, large data sets spread out over a high-dimensional space.

Next time we'll dig into that a little bit more.

Published Monday, June 06, 2005 3:57 PM by Eric Lippert

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Rajiv said:

Does it make sense to even consider one-dimensional measurements in n-dimensional space? I would think that in 3-d space (for example), you should be looking at a space that is half the volume, not half the radius.
June 7, 2005 8:55 AM
 

Sobolev said:

Real geeks do it with cohomology.
June 7, 2005 9:56 AM
 

Eric Lippert said:

But in 100-d space, a sphere that has half the volume has 99.3% of the radius! Is the missing 0.7% going to be relevant in any of those dimensions?
June 7, 2005 11:15 AM
 

lollipop said:

Cohomology? Bah. You're a commutative wannabe.

REAL geeks do it with K-theory.
June 7, 2005 4:27 PM
 

grad student said:

lollipop:

As a symplectic geometer with not much on the algebraic side, can you recommend a nice easy intro text to K-theory?
June 7, 2005 5:51 PM
 

jwz said:

Ubergeeks do it over projective varieties!
June 8, 2005 4:36 PM

Leave a Comment

(required) 
(optional)
(required) 
Submit

About Eric Lippert

Eric Lippert is a senior developer on the Microsoft C# compiler team. Before that he worked on the framework of Visual Studio Tools For Office. Before that, he worked on the compilers, runtimes and tools for VBScript, JScript, Windows Script Host and other Microsoft Scripting technologies. He lives in Seattle and spends his free time editing books about programming languages, playing the piano, and trying to keep his tiny sailboat upright in Puget Sound.

This Blog

Syndication


© 2009 Microsoft Corporation. All rights reserved. Terms of Use  |  Trademarks  |  Privacy Statement
Microsoft
Page view tracker