Nearest Neighbors

Nearest Neighbors

  • Comments 16

Hi Folks,

Spatial users often want to find the object nearest a given point.  This operation, usually referred to as nearest neighbor search, is remarkably common in many areas of computer science.  In general, we may wish to find not only the nearest, but the k-nearest neighbors.

How can we accomplish this with SQL Server?  Here we’ll look at finding the single nearest neighbor; the extension to k-nearest neighbors is relatively straight forward.

First, let's examine the naive method for accomplishing this: simply order the table by distance and restrict the results.  For all of these examples, we’ll assume a table T with a spatial column g, as well as a parameter @x containing the search point:

SELECT TOP(1) *
FROM T 
ORDER BY g.STDistance(@x) ASC

This solution certainly has simplicity on its side, but consider the work that needs to be done.  The entire table must be scanned, and the distance of each to the search point must be calculated.  Ouch.

We could conceivably improve on this by restricting our search space to to the immediate region around the target point:

DECLARE @region geography = @x.STBuffer(10000)
 
SELECT TOP(1) *
FROM T 
WHERE g.Filter(@region) = 1
ORDER BY g.STDistance(@x) ASC

But this solution requires that we know our data very well: if there are no rows in the region, then we will fail to find the nearest neighbor; if there are too many, then we will again be left with a rather inefficient query.

How do we escape this morass?  We can do so by starting with a very small region---so small that we can be certain not to encounter too many results---and then keep enlarging it until we find something.  Doing this with a loop is not hard, but Steven Hemingray showed me how to do this with entirely declarative syntax:

DECLARE @start FLOAT = 1000; 
 
WITH NearestPoints AS
(
   SELECT TOP(1) WITH TIES *,  T.g.STDistance(@x) AS dist
   FROM Numbers JOIN T WITH(INDEX(spatial_index)) 
   ON T.g.STDistance(@x) < @start*POWER(2,Numbers.n)
   ORDER BY n
)
SELECT TOP(1) * FROM NearestPoints
ORDER BY n, dist

This requires some explanation.  First, the @start parameter gives the initial region to search.  I’ve chosen one kilometer, but this can be adjusted downward if your data is very dense.  Second, you’ll notice that we make use of a Numbers table, which just contains the numbers 1 through n.  This  just contains a long list of integers, which is is useful in many situations.

The inner query examines a set of exponentially-expanding regions.  The ORDER BY clause along with the TOP(1) allows the query to stop as soon as it finds the smallest non-empty region.  The WITH TIES statement makes sure that all of the objects in that region will be in the result set.

Once the inner query returns a list of potential results, the outer query examines them to find which is actually nearest.  With this approach, we can select a start area small enough to keep the cost low in dense data, but also be guaranteed to find a distant nearest neighbor.

Incidentally, if you don’t already have a numbers table, you can create one quite quickly with some mildly-black magic like this:

SELECT TOP 100000 IDENTITY(int,1,1) AS n 
INTO numbers 
FROM MASTER..spt_values a, MASTER..spt_values b
 
CREATE UNIQUE CLUSTERED INDEX idx_1 ON numbers(n)

This isn’t a particularly pretty solution, but to proactively answer a question, we didn’t add a method for this primarily because we ran out of time.  Look for something more built-in the next go around.

Cheers,
-Isaac

  • Very innovative! (and I trust, more efficient that STDistance() - if I find the time I might do some timing tests of the three methods)

    But, ouch, that code is not very accessible unless you've got fairly advanced T-SQL knowledge.... I suspect many users for that reason will still end up using options 1) or 2). (Or perhaps, copying your code for 3) verbatim.)

    Would definitely be good to have this sort of thing built-in. Keep it on the wish-list! (But, do something to address the lack of spatial features in SSIS and SSRS first.)

    Keep up the good work!

  • Funnily enough I came to tackle the exact same problem myself this morning, in SQL Server 2005. I've written a quick blog entry of my own: http://blog.webnet.org.uk/2008/10/nearest-neighbour-in-sql-server-2005.html

    If I were doing the expanding-box model (my personal favourite) I think I'd probably do what I've done before with other search algorithms, which is to grow the bounding box exponentially. (I don't remember why I was doing it that way instead of a binary tree, but there we go!)

  • Hmmm... interestingly, I'm not finding conclusive empiricial evidence to suggest that this approach is necessarily faster than the other two - I'm only doing some quick tests so my trials are not necessarily random or fair - has anybody else tried this?

    It will be hard to prove conclusively, since I guess the speed of each approach depends heavily on the actual geographic spread of data, the properties of the index(s) applied, and the values chosen for the Buffer Size in option 2) and the starting value @start in option 3) (which, like the buffer size, must be set to a fairly arbitary value)...

  • Interesting.  Let me try some simple tests.  For this, let's call the three methods above "naive", "limited", and "fancy".

    My a priori expectation would be that the time for the last two methods---limited and fancy---would be comparable, assuming the initial search size were the same.  But keep in mind that the limited solution is not guaranteed to produce a result.

    I did a quick comparison of the three methods on three data sets: a table of 11,427,230 US businesses, a table of the 212,942 census block groups, and a table of 32,108 zip code polygons.  For both, I looked for the nearest neighbor of the point 44 N 123 W  (POINT (-123 44)).  The results are from my dual-core laptop, and all elapsed times in milliseconds:

    Businesses

    ----------

    Naive: 357235

    Limited: 7

    Fancy: 18

    I should note that the first time around, the limited solution failed to produce a result, since the nearest point was ~30 km away.  To make the results more fair, I moved my search point so there was an object within a kilometer, and dropped the search radius to 1 km.  This illustrates the fragility of the method.

    Also, these timings of exactly one run---after warming the caches, of course.  

    I kept the same search parameters for the census and zip-code data:

    Census

    ------

    Naive: 65367

    Limited: 16

    Fancy: 24

    Zip Codes

    ---------

    Native: 18921

    Limited:  62

    Fancy: 94

    So, at least on some of my data, the limited solution seems the fastest (when it works), but both the limited and fancy solutions vastly outperform the naive implementation.

    Cheers,

    -Isaac

  • The efficiency does greatly depend on the density of the data. Imagine if we were trying to find the nearest gas station in, say, America: We could check to see the longest distance between two fuel depots and divide that by two, then use that for a bounding box. But while that's likely to return us three or four points from a long freeway, it's going to return huge numbers of points within a city such as New York.

    Meanwhile, if we grow the box gradually, we'll find our point quite quickly within New York but we'll be crawling outwards for quite some time when we're on that freeway.

    The exponentially expanding box may slow down a search in a densely populated area, but it's also very likely to speed it up in a sparse region.

    At the end of the day, it all depends on the data you're looking through. If you have fewer than 500 points, the Brute-force/Naïve method is probably as much work as needs to be put in. For an all-over dense or all-over sparse search, the Limited method would probably do fine with appropriate parameters. But for a mixed data set, I'm still convinced the "Fancy" method will provide the most pleasing overall result.

  • Interesting approach, however how could this be extended for a result of n nearest neighbors?

  • To Craig - you mean if you want n nearest neighbors for each record in table a.  I think you would have to introduce a CROSS APPLY in there.

    Isaac,

    Interesting. This is similar to the approach I came up with a while back for PostGIS that I was hoping to port to SQL Server 2008 and fix some bugs in the current implementation and upgrade to the new PostGIS syntax.

    http://www.bostongis.com/PrinterFriendly.aspx?content_name=postgis_nearest_neighbor_generic

    It is a pity that SQL Server doesn't have a built in generate_series function like PostgreSQL does, though I guess one could create one, just not sure how efficient it would be. The PostgreSQL one seems pretty efficient.  The dummy number /generate_series hack is probably my favorite SQL trick of all time.  The use-cases are limitless.

    http://www.bostongis.com/blog/index.php?serendipity[action]=search&serendipity[searchTerm]=generate_series

  • Hi Isaac,

    Can this be modified to find all neighbors within x radius? For example, I want to find all of Microsoft's neighbors within 100 km...

    Even with a spatial index, my query is slow because half the time, according to the execution plan, is spent on the where clause.

    This is my current query:

    SELECT count(*)

    FROM dbo.Company WITH(INDEX(six_Company_geography))

    WHERE g.STDistance(0xE6100000010CEE7C3F355ED24740D578E92631885EC0) <= 100000

    Thanks,

    Bob

  • Craig,

    I'll try to post an n-nearest neighbor extension soon, but yes: this can be extended.

    Bob,

    Your problem is easier, and the query you post is the right one.  I'm curious how slow your query is, and how many results you're finding.  

    Keep in mind that if there are many results, we may not be able to do it terribly fast, since we have to actually calculate the distance for many of them.  The index will help, both in eliminating the results that are way out of bounds, as well as reducing the distance calculations for those that are very close to the center point, but anything near the boundary will have to be calculated.

    Cheers,

    -Isaac

  • Thanks for the quick reply Isaac. My query is taking 6 seconds and returns approx. 170K companies when searching for companies within 100km of Microsoft.

  • Hi Bob,

    I'm not terribly surprised by these results for the reasons I mention above.  Out of curiosity, though...

    First, for these queries, declare the following:

    declare @point geography = 0xE6100000010CEE7C3F355ED24740D578E92631885EC0

    declare @circle geography = @point.STBuffer(100000)

    With these definitions (really, with @circle) first try:

    SELECT count(*)

    FROM dbo.Company WITH(INDEX(six_Company_geography))

    WHERE g.STIntersects(@circle) = 1

    This should essentially do the same thing your query does.  I'd expect very similar performance.

    Then try:

    SELECT count(*)

    FROM dbo.Company WITH(INDEX(six_Company_geography))

    WHERE g.Filter(@circle) = 1

    This query will only use the index and skip the actual spatial calculations.  I'd expect it to give a greater count than your original query, but run significantly faster because it won't have to compute any actual distances.

    Assuming I'm right, you're kind of stuck with the expensive query.  But at least you know where your time is going.  :)

    Cheers,

    -Isaac

  • One more question....my first test was with all geography values populated in my table. In my actual data, however, there are a non-trivial number of NULL values.

    This seems to affect the number of logical and physical reads significantly. The spatial index seems to be returning the NULL values before the filter removes them?

    Does that make sense?

  • Why is the filter method expensive?

  • Doh! Forget about my first comment re: performance - this method *is* quicker than the brute force approach - I had something funny going on with the index that this approach was using (don't even ask). Still think the Filter()/Buffer() method is useful in several situations though (particularly in applications where nearest-neighbor is combined with an additional distance limit - "Show me all the three closest petrol stations to my house, so long as none of them are further than 25km away", for example)

    I think I might have a slight correction to the code though. Currently, the example is not extendable to include the general case of k-nearest neighbors. If you try the following, for example:

    DECLARE @start FLOAT = 1000;

    WITH NearestPoints AS

    (

      SELECT TOP(5) WITH TIES *,  T.g.STDistance(@x) AS dist

      FROM Numbers JOIN T WITH(INDEX(spatial_index))

      ON T.g.STDistance(@x) < @start*POWER(2,Numbers.n)

      ORDER BY n

    )

    SELECT TOP(5) * FROM NearestPoints

    ORDER BY n, dist

    Then you (potentially) get duplicate rows in the output. This is because you might find only one feature that lies within the search area T.g.STDistance(@x) < @start*POWER(2,1). In order to select the TOP 5 features, the query carries on to select those features where T.g.STDistance(@x) < @start*POWER(2,2), which, obviously, includes that same feature again. To correct this, you need to sort the set of candidate features by the largest zone in which they were located (ORDER BY n *DESC*), then by the distance which they were away:

    DECLARE @start FLOAT = 1000;

    WITH NearestPoints AS

    (

      SELECT TOP(1) WITH TIES *,  T.g.STDistance(@x) AS dist

      FROM Numbers JOIN T WITH(INDEX(spatial_index))

      ON T.g.STDistance(@x) < @start*POWER(2,Numbers.n)

      ORDER BY n

    )

    SELECT TOP(1) * FROM NearestPoints

    ORDER BY n DESC, dist

  • Your statement - "The entire table must be scanned" doesn't follow, no more than a table scan is necessary to do "SELECT TOP 1" from a table with a conventional index.

    The problem here is that the query optimiser doesn't recognise that it can use the spatial index to eliminate the table scan.

Page 1 of 2 (16 items) 12