A common general SQL question just popped its head up in the Spatial Forum: when presented with a SQL query, in which order are the parts executed? To be concrete, consider the following table and query:
T:
SELECT x
FROM T
WHERE x <> 0 AND (8 / x) > 5
What should the result of this query be? Focus on the fact that we're chancing a divide-by-zero problem with the second clause of the AND.
If this were C, C++, C#, or Java---languages that short-circuit boolean expressions---then the result depends on the order of execution. Left-to-right seems to be the universal choice, and so the result would be the singleton 1. We avoid the the divide-by-zero: since 0 fails the first clause, the second is never executed.
Some languages don't short-circuit these expressions, and so they will raise a divide-by-zero. Pascal is one example. The order is still left-to-right, but it doesn't really matter since they all get executed.
So what actually happens in SQL? SQL is that rare language that does short-circuit these expressions, but in which the order is not guaranteed. As a result, we know this query will either yield a singleton 1 or an error, but we don't know which one. This is done so that the optimizer can be clever about reordering operations to improve performance, but the result can often be undesirable.
What's a poor developer to do? All that's possible is to avoid the situation. Keep in mind that in many cases there isn't an issue, but if the order is important, then the key is usually to use the CASE statement, for which execution order is guaranteed. This can be a bit clumsy:
SELECT x
FROM T
WHERE 8 / (CASE WHEN x = 0 THEN null ELSE x END) > 5
This is strange, but it is how it is.
Cheers,
-Isaac
Hi Folks,
In previous posts about our new geography type, I've discussed ring orientation and too-large polygons, but there is much more information to mine. I'm going to continue here with a discussion of edges. This post may be a little dense for those uninterested in how the underlying spatial sausage gets made, but it can have important consequences. I also think it's just interesting.
By an edge, we mean the curve that connects two consecutive points in a line string or polygon. Agreeing on this definition is important when we try to find out, for example, where two paths intersect, whether a point is on or near a path, or what the area of a region is. If two systems do not agree on this definition, they may have significantly different answers to such problems. That's why standards that specify these definitions are important; for geographic systems these standards are notably absent.
You may think of edges on the plane and wonder how there could possibly be a problem. On the plane there simply isn't an issue: we all naturally define an edge as a line segment, and we all agree on what this means. When we move to a curved surface things get a bit trickier, since we clearly cannot continue to use a line segment. What is the right choice?
Spheres have a natural choice as well: a great circle. A great circle is a circle defined by the intersection of a plane with the sphere so that the plane cuts through the center of the sphere. If we choose two points that are not antipodal---that do not lie exactly opposite themselves on the sphere---then there is only one great circle that contains them. (This is a corollary of the fact that it takes three non-collinear points define a plane.)
If we choose the shorter path along this great circle, we arrive at the great circle path between the two points, and this is the typical curve used as an edge on the sphere.
These great circles are related to line segments on the plane, in that both are instances of geodesics for their respective domains. Geodesic is a fancy (and concise) way of saying a locally path-minimizing curve. This, in turn, is just a fancy (and mathematical) way of saying that a taught string on a surface between two points will fall into such a path.
A key question is whether such curves are unique. If they are, and we use them to specify an edge, then the definition of the edge will be unique. If not, then we need some way to disambiguate between them, and this can quickly get ugly: Which should we pick? Is there a clear and consistent way to pick? Should the user have a choice? How should the user specify the choice?
On the plane these geodesics are unique: there is only one line segment between any two points. On a sphere they are also unique, so long as the points are not antipodal. I.e, It's unambiguous which direction to go if I ask you to walk directly from Beijing to London---even if you end up walking more north than you might expect---but you can go any which way you want if I ask you to walk from the north pole to south pole. To eliminate this problem we can simply prohibit any case in which two successive points are antipodal.
This, however, doesn't fully solve our problem with geography: we don't work on a sphere, we work on an ellipsoid, and the properties of an ellipsoid are surprisingly different than the sphere. E.g., consider two points exactly opposite each other on the equator. As we've discussed, the edge between these two points is ambiguous on a sphere: there are an infinite number of them, each running in a different direction around the sphere. On an Earth-like ellipsoid that is somewhat fatter than it is tall, there are only two: one that runs through the north pole, and one that runs through the south pole. Any other path will be slightly longer.
Initially, our problems seem a bit less severe, as our ambiguity has been reduced from an infinite number of paths to two. Of course, this post wouldn't be so long if the situation were better: they are indeed are far worse. On the sphere, this ambiguity only occurs when points are exactly opposite each other---move them an iota away and the path is unique again. With the ellipsoid, if we move our two points somewhat towards each other, but leave them on the equator, we still have two paths: one that arcs north and one that arcs south.
It's not just points on the equator that have this problem: the class of such points is much larger, and it is much harder to see how to avoid these cases.
So, what's to be done? Our solution, which is not unique to us, is to define our edges as great elliptic arcs. These are derived just like great circles, but by intersecting a plane with the ellipsoid instead of a sphere. Using a great elliptic arc has a number of benefits: as with great circles, they are unambiguous except between antipodal points; they match users expectations; and they lead to some very interesting computational benefits (which I'll discuss in a future post). One drawback is that they are not true geodesics, but it is unclear how to develop a simple interface using geodesics.
To illustrate how vastly different various edge differences can be, again consider two points on the equator and almost opposite each other. The great elliptic arc between these two will, as with the great circle, follow along the equator, but a true geodesic will go north (or south, its ambiguous) near the pole. In most cases the difference between the two will not be so great.
We certainly think that this definition is a very good one, but there does not appear to be any standard in this area. This is unfortunate, but not completely surprising given the fact that most systems work primarily on flat coordinate systems. Those systems that do work with geographic coordinates seem to vary in their approach, and while some appear to use the same definition we do, some do not. (Worse, of course, are those that aren't even clear about what they do.)
We'd love to build consensus on this issue, and our chief expert on everything geographic, Michael Kallay, published a very well-received argument argument for this definition in last year's ACM GIS conference.
Cheers,
-Isaac
Hi Folks,
I've been asked a few times how to find out what spatial columns are defined in a database. We don't have any special table for this, but you can easily find out by looking at the usual system views:
SELECT ta.name as table_name, co.name as column_name
FROM sys.tables ta JOIN sys.columns co
ON ta.object_id = co.object_id
JOIN sys.types ty
ON co.user_type_id = ty.user_type_id
WHERE ty.name = 'geography' OR ty.name = 'geometry'
There's nothing special about spatial here: you can replace the type names in the WHERE clause of the query with any other type you'd like to find as well. For example, a simple change finds all integer columns:
SELECT ta.name as table_name, co.name as column_name
FROM sys.tables ta JOIN sys.columns co
ON ta.object_id = co.object_id
JOIN sys.types ty
ON co.user_type_id = ty.user_type_id
WHERE ty.name = 'int'
Cheers,
-Isaac
[16 April 2008]: Updated to correct a typo in the first query.
Hi Folks,
Perhaps it is surprising, based on the content here, that I'm not a full-time spatial head here in SQL Server land. (We leave that job to Ed.) I have, of course, been spending a lot of time on spatial, but it turns out I have other responsibilities as well, particularly around the broader type system. One of my jobs is to take a look at issues that come in through Connect that have to do with the type system and make sure they get handled appropriately.
(If you don't use Connect, please do! Issues filed there funnel directly into our bug-tracking system, giving you a direct line to the engineering team. It is the place to go with that pesky bug or feature request.)
These Connect issues are an interesting mix. some good, solid bugs, and some not bugs at all. It's these latter ones I find most interesting, since the majority of them point out quirks in our behavior that, while correct (and often even documented) are still somewhat confusing. I'm going to try to start writing a bit more about common issues here in the hopes of clearing at least a little of that confusion.
Which brings us to implicit casts. An implicit cast is a cast that happens without any explicit direction from the user. If you use CAST or CONVERT, you're using an explicit cast, and that's not what we're talking about here. We're talking about code like this:
SELECT '1'
UNION
SELECT 37
Since the UNION operator has to produce a consistent result, the types of each of its inputs must be the same. Of course, that's not the case here: the first SELECT yields a varchar, the second an int. Faced with this, the system inserts an implicit conversion, attempting to coerce the varchar to an int.
But why convert the varchar to an int and not the other way around? After all, it could convert 37 to '37' instead. First, we have to understand that while this case is simple---the values are literals---this need not be the case. and so it may be very hard for the system to determine up-front what the actual values are. E.g.:
SELECT T.a -- a varchar column
UNION
SELECT U.b -- an int column
Or worse:
SELECT M.a FROM -- some horribly complex query returning a varchar
UNION
SELECT N.b FROM -- another equally horrible query returning an int
In order to base this decision on the actual data, SQL Server would have to accumulate each side of the UNION, inspect all of the values, and hope to find a type that each of the values could be coerced to. Perhaps worse, if the type depended on the actual data retrieved, then it could be very hard to determine what the result type of the query would be; it could even change if the data were updated.
So SQL Server doesn't do this. Instead of basing the conversion on the data itself, it bases it on the type of the data. So, when the server looks at any of these examples, all it really sees is:
SELECT <varchar>
UNION
SELECT <int>
The server then decides which one to convert, in this case it converts the varchar to an int. How does it make this choice? It simply has a precedence list of types, which we can see in the Data Type Precedence topic of Books Online. The abbreviated version is:
1. user-defined data types (highest)
2. sql_variant
3. xml
...
16. int
...
27. varchar
...
Int lands higher than varchar, and so the varchar is converted to an int. The actual data returned by each select is not used at all in making this decision, only the types. The system can make this decision with relatively simple logic, and the type of the result is consistent as well.
This brings us to our pop-quiz. What will this do?
SELECT 'a'
UNION
SELECT 37
I'll give you a minute...
Got it?
Okay, here's the result:
Msg 245, Level 16, State 1, Line 1
Conversion failed when converting the varchar value 'a' to data type int.
Make sense?
Cheers,
-Isaac
Hi Folks,
In my last indexing post, I filled in most of the details about our multi-level grid index. Let me clean up a few lingering questions about our planar grid. We'll do this Q&A style:
Q:
What happens if I set the maximum number of cells-per-object so small that I cannot obtain a covering of my object?
A:
To make this concrete, let's take a look at one of the pictures from last time:
In this schematic, there are four top-level cells, and we can see that we need at least three cells to tile the object. In this picture, we are using 13 cells, so we've set the maximum number of cells-per-object to at least that. But what if we set it to two?
The answer is that in the case where the number of first-level cells needed to tile the object exceeds the max cells-per-object, we violate cells-per-object constraint. This is really the only situation in which we'll do that, but it's necessary in order to ensure that we tile the object, and this is required in order to guarantee correct query results.
Keep in mind that in a real index, the number of top level cells is either 16, 64, or 256, and this will only happen if the max cells-per-object exceeds that value.
Q:
What if I have an object that falls outside, or partially outside the bounding box I have defined? Will I miss results?
A:
There is an implicit extra cell in the tessellation, what we call the root cell. The root cell covers everything outside of the bounding box you've defined, and will be used to cover any object that touches it. We can use this to guarantee correct results even if your object isn't well contained by your bounding box. What will suffer is performance, and so it is still important to get the bounding box correct.
Q:
Are there times where you can answer the query using solely the index?
A:
Yes. Consider a cell that is touched by one object, and completely covered by another. We can tell that these two objects must intersect based solely on this fact: there is no need to actually run the intersection predicate. We call this "intermediate filtering" (not the best name, but we needed something to call it) and we do implement it. To do this, we store whether the object touches or covers the cell in the index row.
Q:
Which predicates are supported on an index?
A:
It depends on the type. For geometry, we support STIntersects, STEquals, STOverlaps, STTouches, STWithin, STContains, STDistance, and STFilter. On geography we only support STIntersects, STEquals, STDistance, and Filter. We don't support the others simply because they don't exist on the type.
Q:
What is the difference between STIntersects and Filter?
A:
If there isn't an index, they do exactly the same thing. This can happen on the client, or on the server if you haven't defined an index (or if the index isn't chosen for some reason). When there is an index involved, Filter will not do any secondary filtering. I.e., it is only guaranteed to return a superset of the results an STIntersects predicate will return: it may return extras.
This is useful for applications that are simply displaying results, are not sensitive to some extras, and want to avoid the cost of the secondary filter. If you need precise results, use STIntersects.
If there are more questions, please ask, and I'll answer them here. So far we've only talked about geometry, so next time I'll either fill in the gaps around geography or answer more questions.
Cheers,
-Isaac
Hi Folks,
I recently got contacted via email with the following problem:
...
I have 2 complex polygons, representing district boundaries. The polygons look correct, but I'm getting exceptions when I try to create the type.
I've attached the wkt polygons to the email.
These polygons are stored in a table, but I've tried it without the table and get the same result.
Here's the sql I'm trying to execute:
declare @g geography;
set @g = geography::STPolyFromText(@wkt, 4268)
select @g.ToString()
where @wkt is replaced by the correct polygon string.
...
Now, I should point out that this is a very nice error report. They've told me what their data is, they included their data, and they told me very clearly what they're trying to do. (Perhaps knowing what version they're using would good, but I'm getting picky.)
But here's the kicker that saved me any debugging:
...
Here is the error message I receive:
A .NET Framework error occurred during execution of user defined routine or aggregate 'geography':
Microsoft.SqlServer.Types.GLArgumentException: 24205: The specified input does not represent a valid geography instance because it exceeds a single hemisphere. Each geography instance must fit inside a single hemisphere. A common reason for this error is that a polygon has the wrong ring orientation.
Microsoft.SqlServer.Types.GLArgumentException:
at Microsoft.SqlServer.Types.GLNativeMethods.ThrowExceptionForHr(GL_HResult errorCode)
at Microsoft.SqlServer.Types.GLNativeMethods.GeodeticIsValid(GeometryData g)
at Microsoft.SqlServer.Types.SqlGeography.IsValidExpensive()
at Microsoft.SqlServer.Types.SqlGeography.ConstructGeographyFromUserInput(GeometryData g, Int32 srid)
at Microsoft.SqlServer.Types.SqlGeography.STPolyFromText(SqlChars polygonTaggedText, Int32 srid)
I looked at all the nodes, and they are all in the same hemisphere, so I'm a little confused with what might be going on.
...
So, what's going on? Ring ordering.
I've posted on this before, but let's recap. On a round Earth, defining a polygon by a ring is ambiguous. Consider the polygon defined by a ring around the equator: does this describe the northern or southern hemisphere? A small ring could describe either, say, a small island in a large ocean, or the large ocean itself.
To get out of this pickle, we disambiguate with a standard trick: we take ring orientation into account. We simply define the left side of the ring as it is drawn to be inside the polygon.
If we put this together with our current (unfortunate) limitation that we do not deal with objects that exceed a hemisphere, and it becomes clear what's going on. While the polygon looks like it's nice and small, the rings are inverted, and so it's actually quite large. We can't handle it: cue the exception.
I describe this hemisphere limitation as unfortunate, and it remains one of the biggest items the spatial team would like to fix up, but in this case it has actually helped catch an error.
I'll talk a bit more about our hemisphere limitation in a future post. It's not quite as simple as we let on.
Cheers,
-Isaac
Hi Folks,
We don't call them betas, but that's basically what CTPs are: they let us suss out problems before we drop a final product on everyone. Well, we've found a regression in the February CTP spatial support that we'd like to let you know about.
Essentially, a costing problem was introduced that leads us to choose poor plans for spatial queries. This seems to usually manifest itself by choosing merge joins instead of loop joins between the spatial index and the base table, and that generally doesn't work too well.
Why is a merge join such a bad plan in this case? If you've been following my indexing posts, you'll understand that we generally end up with a small number of rows that pass our primary filter (our index). We have to then join those back to the base table to pull out the spatial object to which the index row refers.
This is usually best done through a loop join that runs over each index row and seeks into the base table for each of the objects. However, if a merge join is chosen, then we risk scanning the entire base table to feed into our merge. This is usually not a wise choice.
What can you do about this if you're running the February CTP? A good general fix is to force loop joins in your query. Perhaps the simplest way to do this is by adding an "OPTION (LOOP JOIN)" to the end of your query.
On the positive side, this has already been fixed, so you should see better plans in our next public release.
Cheers,
-Isaac
Hi Folks,
I just thought I'd take a few moments clarify the upcoming coordinate order swap for the geography type. Here's a quick FAQ on the issue:
- What exactly is the change?
We are swapping the coordinate order for well-known text (WKT) and well-known binary (WKB) formats from latitude-longitude to longitude-latitude. E.g., a point in the Seattle area might be represented in WKT as "POINT(44 -122)" in the current ordering; in the new order it will be "POINT(-122 44)".
- Why are we making this change?
Because of customer feedback. While common practice almost universally uses latitude-longitude ordering, the de facto standard for WKT and WKB is to use a longitude-latitude order. Making this change will make it much easier for customers to load WKT and WKB data into the geography type.
- What about GML?
The standard paractice for GML is to use latitude-longitude ordering, and so no change will be made for GML.
- Is the on-disk format changing?
Nope. Ths is just a matter of changing our routines that import and export data. Nothing beyond those routines will change.
- What about geometry?
There is no change to the geometry type. Geometry deals with objects on the plane, and is pretty agnostic about the coordinate order outsidfe of the STX and STY properties.
- When is the swap happening?
It is not the current Februrary CTP, but the team is working on it now. Expect to see it in our next public release.
If there are more questions, I'll be happy to answer them here.
Cheers,
-Isaac
Minor Update: I originally got the WKT examples at the top wrong, placing commas between coordinates. The examples have been updated to their correct, comma-free state. (Thanks, Steven!)
Hi Folks,
Well, it looks like Ed finally started blogging. (I'll try not to let him take too much pressure off me and slide worse than I have.) In his first post, he reveals what new spatial functionality is available in the most recent CTP.
We have one more big trick up our sleeves, but for now you'll have to stay tuned.
Cheers,
-Isaac
Hi Folks,
Last time, we highlighted several problems with a simple grid index. If you don't recall---and since it's been a while, that wouldn't be a surprise---you may want to review them. In this post I'll start to describe how we get around them.
In SQL Server, we don't use a simple grid like the one described. Instead, we use a multi-level grid, which ends up looking very much like a quad tree. Basically, instead of having one grid level, we have four, with each lower-level grid nested in the one above.
Let's look at an example. If we have a 2x2 grid at each level, then we can tile the object below with three tiles at the top-level grid:

These tiles cover the object, which is a property we want to maintain for our indexing scheme, but there's a lot of extra coverage. This extra means we're likely to have a good number of false positives from our primary filter. We can go to a second-level gridding instead:

At this level, we've trimmed some of the extra slop from the edges, but our description of the object has become more complex as well. With our four levels, we can push this even further---here's what a fourth-level tiling of the object would look like:

This description is nice and tight---there's very extra covering of the object---but it's also very complicated. What's key to keeping the complexity down is that we don't require that all object use the same level tiling, nor do we require that all tiles for a particular object all be at the same level. We can mix as we wish.
The ability to mix levels in our tiling leads us to two optimizations. The first, which is always used, is that if we find that all subtiles of a particular tile are touched by the object, then we won't further subdivide the cell. Subdividing gives us nothing but more cells. Perhaps this is clearer visually; below is a picture of the same tiling above with this optimization applied:

As you can see, this greatly reduces the number of tiles we have, and does so by mixing levels. A second way to reduce tiles is through an explicit limit. We default this limit to 16, but the user can tweak it to their liking. When we decompose an object, we do a breath-first walk down the tree decomposing tiles, and we stop when we hit the pre-defined limit. E.g., we might end up with:

There's much more to say about this, but before I leave it for now, let me point out that while these examples use 2x2 decompositions at each level, we use more. The exact amount is, in fact, adjustable: the user can set each level to low (4x4) medium (8x8) or high (16x16).
How should the user tweak these various parameters? It's hard to give general advice, as it's a very data- and query-dependent calculation, but we feel that we have picked some defaults that will work well for most people. We may have better advice as more people start using the index and we get reports on what worked for them. (Yes, we're begging you to send us your findings!)
Next time we'll continue with some more details about the mult-level grid.
Cheers,
-Isaac
Hi Folks,
Well, I never did complete my series of posts describing our indexing scheme. I guess I ought pick it up again now.
We've gone through some preliminaries so far, describing why we'd like spatial index, what a simple indexing might look like, and how we would use that simple scheme internally. Over the next few posts, my plan is to describe the actual scheme we use and give some idea how it is used internally. I'll start with the geometry type, and then cover geography. There's more alike than not.
First, though, let's look back at where we've been.
Recall our simple grid index: we chose a bounding box, and then broke up space into a set of tiles. These tiles were numbered using some scheme. (In our example, we just numbered them in row-major order.)
We can find at least a few points to critique1:
- The tigher the bound around an object, the better the primary filter will perform, but small objects---say, points---may not be covered very tightly by a cell.
- Large objects may have the opposite problem: the fraction of a set of covering cells that actually covers the object may be large, but it may take a lot of cells to do so. If we have to persist each of those cells, our index is going to get quite large.
- The index numbering we used doesn't preseve locality very well. I.e., two close objects by several cells with wildly different numbers. This has implications for performance, as local spatial operation may have to read values from all over the index.
Given that we want to stick with a tiling scheme, we want to find some scheme that will address these issues. I should point out that there is no perfect solution to issue (3), but we would still like to do better than we have.
Next time: Our Basic Multi-Level Grids
Cheers,
-Isaac
1 There are more issues we could pick on, but for the moment, I'm only going to choose the issues we're going to fix.2 We'll come back to some others later in the series.
2 I may have inadvertanly missed some issues that we will fix as well. As you can probably tell, it's not like I've planned out these posts too carefully...
Hi Folks,
Oops! I missed another neat tool in my last post. Simon Sabin has created a utility called SpatialViewer that lets you draw geometries. Better yet, he's posted the thing to Codeplex so you can futz around with it.

I'm always looking for more fun tools to use with SQL Server Spatial... :)
Cheers,
-Isaac
Hi Folks,
I'm finally getting back in the rhythm of things after the hollidays. While I work on my next spatial indexing post, I thought I'd highlight a few very cool tools that people have built to work with SQL Server Spatial.
Morten Nielsen has put together a couple nice utiliites. The first let's you load Shapfiles directly into a SQL Server table, and there's a great deal of data in Shapefile format. E.g., I was able to very easily read in census block group data through Morten's loader. This is really a breeze: you point it at your instance, turn a few simple knobs, and away it goes:

Once the data is loaded, you can paste in a SQL query and use Morten's other utility to view it.

Craig Dunn has also been working on some nice visualizations. His Geoquery application can execute spatial queries against SQL Server and display their results. It also includes a really nice set of examples that demonstrate various spatial operations. E.g., here's a buffer:

Right now, the geography representations aren't quite right---they map edges to lines on the plane rather than great circles---but it looks like this is due for a fix in v0.7. I can't wait!
Cheers,
-Isaac
Hi Folks,
I guess I wasn't paying enough attention while I was on my end-of-year vacation, since I read right over the last paragraphy of Paul Ramsey's recent post:
That kind of community responsiveness practically cries out for... a five pound box of chocolates! Note to Isaac and company: don't eat them all in one sitting, no matter how tempting it might be!
And so I should not have been surprised when an unexpected package arrived in my office yesterday:

Wow---that's a big box of chocolate! I've never received such a gift for a rectified mistake before. (It makes me wonder what we should do next: remove SRID support, or disable STIntersects? Do we get another box if we add them back in? ;) )
Well, for the entire team, thank you Paul. (We may change our tune after we've gorged ourselves.) Our number one goal is to do the right thing for our customers, but we aren't perfect. Thanks for helping us suss this out before we RTM.
(Now we've just got to get the fix in...)
Cheers,
-Isaac
Hi Folks,
If you've been following the spatial forum over at MSDN, you've probably run across this thread. In short, we've ignited quite a controversy with the coordinate ordering for the well-known text and well-known binary we use for the geography type: we use latitude-longitude; many would prefer longitude-latitude.
While I don't buy into many of the arguments presented, there is one I do very much believe in: usability. As it would appear that the vast majority of data in WKT and WKB formats use longitude-latitude ordering, we probably ought align to ease the transfer of data in and out of SQL Server.
We'll be trying to make this switch in an upcoming CTP. There will be no change in the next CTP---that ship has unfortunately sailed---but we hope to make this change before we RTM. I want to be clear: this is not a done deal, and it's possible our schedule will derail us, but we'll likely make this change.
We do not intend to change our coordinate ordering anywhere else. GML, both in theory and practice, appears to use a latitude-longitude ordering, so we will continue to do so.
As always, I welcome comments either here or on the forums.
Happy holidays!
Cheers,
-Isaac