As part of the blog series TSQL - Solve it YOUR Way, today's topic is a variation of the popular T-SQL question titled "Gaps and Islands" followed by three different solutions presented by three of the more helpful and creative contributors in the TSQL MSDN forums, Steven Wang, Alejandro Mesa, and Tom Cooper.
Gaps and Islands questions revolve around sequential data with missing values. The term "gaps" refers to the ranges of missing values, while the team "islands" refers to the ranges of existing/sequential values. This topic takes on many forms and shows up occasionally in SQL Server interviews, with variations such as the following:
Topic: Given a series of integer ranges representing painted segments of a bridge, find the areas of the bridge where there is no paint, as well as the areas where the paint has been overlapped.
Recently, a question was asked in the TSQL MSDN forums that was a variation of the gaps and islands question, with a small twist. The question revolves around a bridge of length 100 that has been painted in various segments as shown with the following data:
For this exercise, we need to determine how to return the gaps (the areas where the bridge is NOT painted) as well as where the paint has been overlapped.
Explanation of Steven's solution:
As documented in the comments throughout the solution, Steven takes the following approach.
The beauty of this approach is that it could easily be adapted to report on multiple bridges or roads, whatever it is that you are painting, as in the following modified example.
Alejandro's Explanation:
I see here two problems in one. The first one would be finding overlapping ranges, and then identifying gaps. Itzik Ben-Gan included these two problems in his last book about window functions. He explains the solutions for SQL Server 2012 and also for previous versions in the following book: Microsoft® SQL Server® 2012 High-Performance T-SQL Using Window Functions http://shop.oreilly.com/product/0790145323088.do I am going to use his approach to tackle your question since I didn't see any solution using the new support to the OVER clause (window functions).
Finding gaps using LEAD function is simple and I think that reading the code is enough. For the solution about packing overlapping ranges, the key is in splitting the range (each row) in two sets, one for the initial of the range (FromLen) and another for the end (ToLen), and adding a type (+1 or -1) depending if it is the start (FromLen) or the end of the range (ToLen). We are going to calculate the running total of [type] minus the [sub], and we will be interested only on those where the formula ([cnt]) is zero. This will give us pairs of start and end, and we can identify the pairs by enumerating those rows and using ((ROW_NUMBER() OVER(...) - 1 / 2) + 1), the rest is grouping and calculating the MIN and MAX per group.
Note: I didn't include an ORDER BY clause for presentation purpose, neither setup proper indexes to support the OVER clause.
Tom's Explanation:
As noted in the original forum post, this is essentially an “Islands and Gaps” problem. The desired result is all the islands of unpainted points. But the input data is somewhat atypical. In a typical islands and gaps problem, the data is a series of points (for example, you have a table where each row had two columns – employeeid and sickday (datatype date)). And you want a report show the ranges of days an employee was out sick. So that you could produce a report like Employee 5 was out Jan 5 thru Jan 7. There is a good deal of information on how to solve this type of problem available both online and in books. One good discussion is by Itzik Ben-Gan at http://www.manning.com/nielsen/SampleChapter5.pdf.
So if the input data was a set of rows showing all the unpainted points, using that technique would solve our problem. But instead, the data is a series of ranges of painted points (possibly overlapping). So we want to first find all the points which are not in any of those ranges. To do this we need to first get all numbers between the minimum and maximum points (1 and 100 in this case). We do that either by using a Numbers table that was previously created or by building a Numbers cte as part of the query. I choose to build one with a cte since that way I did not have to assume the OP had a permanent numbers table. The method I used is a common one, the earliest reference I know to it is also by Itzik Ben-Gan at http://www.sqlmag.com/article/sql-server/virtual-auxiliary-table-of-numbers. Once I had this table of numbers I then kept only the numbers which were in the range 1 to100 and were not in any of the painted ranges in the original data. What’s left is a standard islands and gaps problem.
As you can see, all three of the above solutions provide the result we were looking for, but do so in a very different style. The original thread provides variations of the solutions presented here as well as further discussion. Each of these solutions leverages different SQL Server language constructs and includes different considerations in the final solutions. I hope that you are able to learn a lot by trying out the problem yourself and then reading through the additional solutions.
Special thanks to Steven, Alejandro, and Tom for their valuable forums contribution and for contributing to this series!
Hope that helps,Sam Lester (MSFT)
Steven Wang has worked with SQL server for more than 10 years. Steven is an active SQL server community participant and speaks at events such as TechEd, CodeCamp, SQL User Group etc.
Blog: www.MSBICOE.com | LinkedIn: Steven Wang | MSDN Profile: Steven Wang - Shangzhou
Alejandro Mesa has been working with SQL Server for more than 10 years, being an MVP since 2007. His passion is T-SQL and he is active in both English and Spanish forums under the alias of Hunchback.
Tom Cooper began his programming career in 1968, began working with database software in 1977, and first worked with Microsoft SQL Server in 1994 (version 4.21). He is now very happily retired.