Introduction:

As part of the blog series TSQL - Solve it YOUR Way, today's topic is a variation of the popular T-SQL question titled "Gaps and Islands" followed by three different solutions presented by three of the more helpful and creative contributors in the TSQL MSDN forums, Steven Wang, Alejandro Mesa, and Tom Cooper. 

Gaps and Islands questions revolve around sequential data with missing values.  The term "gaps" refers to the ranges of missing values, while the team "islands" refers to the ranges of existing/sequential values.  This topic takes on many forms and shows up occasionally in SQL Server interviews, with variations such as the following:

  • Find your favorite sports team's longest winning streak given the win/loss results
  • Given a table of natural numbers, return a two column table showing the missing numbers (ex:  missing from range 6-8, 11-21, etc)
  • Find the missing Order IDs given an Orders table with sequential Order IDs
  • Write a function that returns the hours worked in the past month by a specific employee

Topic:  Given a series of integer ranges representing painted segments of a bridge, find the areas of the bridge where there is no paint, as well as the areas where the paint has been overlapped.

Recently, a question was asked in the TSQL MSDN forums that was a variation of the gaps and islands question, with a small twist.  The question revolves around a bridge of length 100 that has been painted in various segments as shown with the following data:

For this exercise, we need to determine how to return the gaps (the areas where the bridge is NOT painted) as well as where the paint has been overlapped.

Solution #1:  Provided by Steven Wang 

 

Code Snippet
  1. /****************************************************************
  2. Solution Preparation:
  3. 1. Create a temp table #BridgeP
  4. 2. Create a temp table #BridgeMaxLength to hold the max length of all bridges
  5. 3. Insert some test data
  6. 4. Create nonclustered index to boost the performance for large volume of data.
  7.  
  8. Note: the solution I presented here is inspired from the book:
  9.       Inside Microsoft SQL Server 2008: T-SQL Programming
  10. ****************************************************************/
  11. Create Table #BridgeP(    ID Int Identity(1,1) Not Null
  12.                         , BridgePID int Not Null
  13.                         , FromLen Decimal(14,4) Not Null
  14.                         , ToLen Decimal(14,4) Not Null
  15.                         , Constraint pk_BridgeMeaurement Primary Key (ID)
  16.                         , Constraint check_From_Less_than_to Check(FromLen <= ToLen)
  17.                     );
  18.  
  19. Create Table #BridgeMaxLength ( BridgePID int Not Null
  20.                                 , BridgeMaxLen Decimal(14,4)
  21.                             );
  22.  
  23. Insert into #BridgeP
  24. Values (1, 10.0002, 20.0004)
  25. , (1, 18, 21.0004)
  26. , (1, 11.1234, 20.0002)
  27. , (1, 40, 50.0003)
  28. , (1, 19.11, 35)
  29. , (2, 46.2966, 50.6472)
  30. , (2, 31.7084, 46.1286)
  31. , (2, 178.2604, 190.6864)
  32. , (2, 38.7394, 43.7234)
  33. , (2, 27.266, 43.3031)
  34. , (2, 122.8653, 142.6958)
  35. , (3, 14.7856, 21.6676)
  36. , (3, 67.6054, 68.1884)
  37. , (3, 64.8543, 66.4028)
  38. , (3, 123.331, 134.4094)
  39. , (3, 199.3904, 201.1482)
  40. , (3, 88.1347, 103.9153);
  41.  
  42. Insert into #BridgeMaxLength
  43. Values (1, 100.0000)
  44. , (2, 200.0000)
  45. , (3, 250.0000);
  46.  
  47.   Create Nonclustered Index idx_Bridge_From_To On#BridgeP (BridgePID, FromLen, ToLen);
  48.   Create Nonclustered Index idx_Bridge_To_From On#BridgeP (BridgePID, ToLen, FromLen);
  49.  
  50. /**********************************************************************************
  51. Solution starts here:
  52.  
  53. 1. Since the bridge paint measurement records have overlaps, we first need to identify
  54. all those paint starting points for a bridge. These starting points must not be adjacent
  55. to any end painting points recorded and must be less than any other starting points within
  56. the same packed painting group.
  57. **********************************************************************************/
  58.  
  59. ;With RangeStart
  60. As
  61. (
  62. Select    Distinct A.BridgePID, A.FromLen As RangeSTart
  63. From    #BridgeP A
  64. Where    Not Exists
  65.             (    Select    *
  66.                 From    #BridgeP B
  67.                 Where    B.BridgePID =  A.BridgePID
  68.                         And B.FromLen < A.FromLen
  69.                         And B.ToLen >= A.FromLen
  70.             )
  71. )
  72. /**********************************************************************************
  73. 2. With the same logic as starting point but in a reversed way, the ending points
  74. can be identified as below:
  75. **********************************************************************************/
  76. , RangeEnd
  77. As
  78. (
  79. Select    Distinct A.BridgePID, A.ToLen As RangeEnd
  80. From    #BridgeP A
  81. Where    Not Exists
  82.         (    Select    *
  83.             From    #BridgeP B
  84.             Where    B.BridgePID =  A.BridgePID
  85.                     And B.ToLen > A.ToLen
  86.                     And B.FromLen <= A.ToLen
  87.         )
  88. )
  89. /**********************************************************************************
  90. 3. Since a bridge should always start from 0 and end with the maximum length, we add
  91. these 2 points together with all starting points and end points.
  92. The union all query results must come back with an even number (pairs of start and
  93. end point) for each bridge.
  94. 4. In order to put starting point and an end point pair into the same group. I used
  95. row_Number function:
  96. (Row_Number() Over(Partition By BridgePID Order by GapStart) - 1) / 2
  97. If the row number is 1, it returns back 0; if the row number is 2, it also return back
  98. 0.
  99. **********************************************************************************/
  100. , GapValue
  101. As
  102. (
  103. Select    BridgePID, GapStart
  104.         , (Row_Number() Over(Partition By BridgePID Order by GapStart) - 1) / 2 As MyGroup
  105. From
  106.         (
  107.             Select    BridgePID, 0.0000 As GapStart
  108.             From    #BridgeMaxLength
  109.                 Union All
  110.             Select    BridgePID, RangeStart
  111.             From    RangeStart S
  112.                 Union All
  113.             Select    BridgePID, RangeEnd
  114.             From    RangeEnd E
  115.                 Union All
  116.             Select    BridgePID, BridgeMaxLen
  117.             From    #BridgeMaxLength
  118.         ) X
  119. )
  120.  
  121. Select        BridgePID, Min(GapStart) As GapStart, Max(GapStart) As GapEnd
  122. From        GapValue
  123. Group By    BridgePID, MyGroup
  124. Having        Min(GapStart) <> Max(GapStart)
  125. Order by    BridgePID, GapStart;
  126.  
  127.  
  128. Drop Table #BridgeP;
  129. Drop Table #BridgeMaxLength;

 

Explanation of Steven's solution:

As documented in the comments throughout the solution, Steven takes the following approach. 

    1. Since the bridge paint records have overlaps, we first need to identify all of those paint starting points for the entire bridge. These starting points must not be adjacent to any end painting points recorded and must be less than any other starting points within the same packed painting group.
    2. Using similar logic, but in a reversed order, we will identify the ending points.
    3. Since a bridge should always start from 0 and end with the maximum length, we add these 2 points together with all starting points and end points.  The union all query results must come back with an even number (pairs of start and end point) for each bridge.
    4. In order to put starting point and ending point pairs into the same group, I use the Row_Number function:
      (Row_Number() Over(Partition By BridgePID Order By GapStart) - 1) / 2
      If the row number is 1 or 2, then return 0. 

 

Solution #2:   Provided by Alejandro Mesa

 

Code Snippet
  1. SET NOCOUNT ON;
  2. USE tempdb;
  3. GO
  4. Declare @BridgeP Table(
  5. BridgePID int,
  6. FromLen Decimal(14,4),
  7. ToLen Decimal(14,4)
  8. );
  9.  
  10. Declare @MaxBridgeLenght Decimal(14,4) = 100.0000;
  11.  
  12. Insert into @BridgeP
  13. Values
  14.   (1,10.0002,20.0004)
  15. , (2,18.0000,21.0004)
  16. , (3,11.12345,20.0002)
  17. , (4,40.0000,50.0003)
  18. , (5,19.1100,35.0000)
  19. ;
  20.  
  21. WITH C0 AS (
  22. SELECT
  23.     FromLen, ToLen
  24. FROM
  25.     @BridgeP
  26. UNION ALL
  27. SELECT
  28.     FromLen, ToLen
  29. FROM
  30.     (VALUES (0, 0), (100, 100) ) AS T(FromLen, ToLen)
  31. )
  32. , C1 AS (
  33. SELECT
  34.     FromLen AS ln,
  35.     +1 AS [type],
  36.     1 AS sub
  37. FROM
  38.     C0
  39.  
  40. UNION ALL
  41.  
  42. SELECT
  43.     ToLen AS ln,
  44.     -1 AS [type],
  45.     0 AS sub
  46. FROM
  47.     C0
  48. )
  49. , C2 AS (
  50. SELECT
  51.     ln,
  52.     SUM([type]) OVER(
  53.     ORDER BY ln, [type] DESC
  54.     ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  55.     ) - sub AS cnt
  56. FROM
  57.     C1
  58. )
  59. , C3 AS (
  60. SELECT
  61.     ln, ((ROW_NUMBER() OVER(ORDER BY ln) - 1) / 2) + 1 AS grp
  62. FROM
  63.     C2
  64. WHERE
  65.     cnt = 0
  66. )
  67. , C4 AS (
  68. SELECT
  69.     MIN(ln) AS FromLen,
  70.     MAX(ln) AS ToLen
  71. FROM
  72.     C3
  73. GROUP BY
  74.     grp
  75. )
  76. , C5 AS (
  77. SELECT
  78.     *,
  79.     LEAD(FromLen, 1, ToLen) OVER(ORDER BY FromLen) AS nxt_FromLen
  80. FROM
  81.     C4
  82. )
  83. SELECT
  84.     ToLen AS GapStart,
  85.     nxt_FromLen AS GapEnd
  86. FROM
  87.     C5
  88. WHERE
  89.     nxt_FromLen - ToLen > 0.0001;
  90. GO
  91.  
  92. /*
  93.  
  94. GapStart    GapEnd
  95. 0.0000    10.0002
  96. 35.0000    40.0000
  97. 50.0003    100.0000
  98.  
  99. */

 

The beauty of this approach is that it could easily be adapted to report on multiple bridges or roads, whatever it is that you are painting, as in the following modified example. 

Code Snippet
  1. SET NOCOUNT ON;
  2. USE tempdb;
  3. GO
  4. DECLARE @BridgeP TABLE (
  5. BridgeID int NOT NULL,
  6. BridgePID int,
  7. FromLen Decimal(14,4),
  8. ToLen Decimal(14,4)
  9. );
  10.  
  11. Insert into @BridgeP
  12. Values
  13.   (1, 1,10.0002,20.0004)
  14. , (1, 2,18.0000,21.0004)
  15. , (1, 3,11.12345,20.0002)
  16. , (1, 4,40.0000,50.0003)
  17. , (1, 5,19.1100,35.0000)
  18.  
  19. , (2, 1,10.0002,20.0004)
  20. , (2, 2,18.0000,21.0004)
  21. , (2, 3,11.12345,20.0002)
  22. , (2, 4,40.0000,50.0003)
  23. , (2, 5,25.1100,35.0000)
  24.  
  25. ;
  26.  
  27. -- Itzik's approach
  28. WITH C0 AS (
  29. SELECT
  30.     BridgeID, FromLen, ToLen
  31. FROM
  32.     @BridgeP
  33. UNION ALL
  34. SELECT
  35.     B.BridgeID, T.FromLen, T.ToLen
  36. FROM
  37.     (SELECT DISTINCT BridgeID FROM @BridgeP) AS B
  38.     CROSS JOIN
  39.     (VALUES (0, 0), (100, 100) ) AS T(FromLen, ToLen)
  40. )
  41. , C1 AS (
  42. SELECT
  43.     BridgeID,
  44.     FromLen AS ln,
  45.     +1 AS [type],
  46.     1 AS sub
  47. FROM
  48.     C0
  49.  
  50. UNION ALL
  51.  
  52. SELECT
  53.     BridgeID,
  54.     ToLen AS ln,
  55.     -1 AS [type],
  56.     0 AS sub
  57. FROM
  58.     C0
  59. )
  60. , C2 AS (
  61. SELECT
  62.     BridgeID,
  63.     ln,
  64.     SUM([type]) OVER(
  65.     PARTITION BY BridgeID
  66.     ORDER BY ln, [type] DESC
  67.     ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  68.     ) - sub AS cnt
  69. FROM
  70.     C1
  71. )
  72. , C3 AS (
  73. SELECT
  74.     BridgeID,
  75.     ln,
  76.     ((ROW_NUMBER() OVER(PARTITION BY BridgeID ORDER BY ln) - 1) / 2) + 1 AS grp
  77. FROM
  78.     C2
  79. WHERE
  80.     cnt = 0
  81. )
  82. , C4 AS (
  83. SELECT
  84.     BridgeID,
  85.     MIN(ln) AS FromLen,
  86.     MAX(ln) AS ToLen
  87. FROM
  88.     C3
  89. GROUP BY
  90.     BridgeID,
  91.     grp
  92. )
  93. , C5 AS (
  94. SELECT
  95.     BridgeID,
  96.     FromLen,
  97.     ToLen,
  98.     LEAD(FromLen, 1, ToLen) OVER(PARTITION BY BridgeID ORDER BY FromLen) AS nxt_FromLen
  99. FROM
  100.     C4
  101. )
  102. SELECT
  103.     BridgeID,
  104.     ToLen AS GapStart,
  105.     nxt_FromLen AS GapEnd
  106. FROM
  107.     C5
  108. WHERE
  109.     nxt_FromLen - ToLen > 0.0001;
  110. GO
  111.  
  112. /*
  113.  
  114. BridgeID    GapStart    GapEnd
  115. 1    0.0000    10.0002
  116. 1    35.0000    40.0000
  117. 1    50.0003    100.0000
  118. 2    0.0000    10.0002
  119. 2    21.0004    25.1100
  120. 2    35.0000    40.0000
  121. 2    50.0003    100.0000
  122.  
  123. */

 

Alejandro's Explanation:

I see here two problems in one. The first one would be finding overlapping ranges, and then identifying gaps.
 
Itzik Ben-Gan included these two problems in his last book about window functions. He explains the solutions for SQL Server 2012 and also for previous versions in the following book:
 
Microsoft® SQL Server® 2012 High-Performance T-SQL Using Window Functions
 http://shop.oreilly.com/product/0790145323088.do
 
I am going to use his approach to tackle your question since I didn't see any solution using the new support to the OVER clause (window functions).

Finding gaps using LEAD function is simple and I think that reading the code is enough. For the solution about packing overlapping ranges, the key is in splitting the range (each row) in two sets, one for the initial of the range (FromLen) and another for the end (ToLen), and adding a type (+1 or -1) depending if it is the start (FromLen) or the end of the range (ToLen). We are going to calculate the running total of [type] minus the [sub], and we will be interested only on those where the formula ([cnt]) is zero. This will give us pairs of start and end, and we can identify the pairs by enumerating those rows and using ((ROW_NUMBER() OVER(...) - 1 / 2) + 1), the rest is grouping and calculating the MIN and MAX per group.

 Note: I didn't include an ORDER BY clause for presentation purpose, neither setup proper indexes to support the OVER clause.

 

Solution #3:  Provided by Tom Cooper

 

Code Snippet
  1. -- Sample data
  2. Declare @Bridge Table(BridgePID int, fromlen int, tolen int);
  3. Insert @Bridge(BridgePID, fromlen, tolen) Values
  4. (1             , 10         ,20),
  5. (2             ,18          ,2),
  6. (3            ,11           ,20),
  7. (4            ,40           ,50),
  8. (5             ,19          ,35);
  9.  
  10. -- Create Numbers cte
  11. ;With N2 As (Select 1 As Number Union All Select 1),
  12. N4 As (Select na.Number From N2 na Cross Join N2 nb),
  13. N16 As (Select na.Number From N4 na Cross Join N4 nb),
  14. N256 As (Select na.Number From N16 na Cross Join N16 nb),
  15. Numbers As (Select Row_Number() Over (Order By n.Number) As Number From N256 n),
  16. -- Now get only the missing numbers and a row_number value over those numbers
  17. MissingNumbers As
  18. (Select n.Number, Row_Number() Over (Order By n.Number) As SeqNbr
  19. From Numbers n
  20. Where n.Number <= 100 And Not Exists(Select * From @Bridge b Where n.Number Between b.fromlen And b.tolen))
  21. -- Now product the islands report
  22. Select Min(m.Number) As GapStart, Max(m.Number) As GapEnd
  23. From MissingNumbers m
  24. Group By m.Number - m.SeqNbr;

 

Tom's Explanation:

As noted in the original forum post, this is essentially an “Islands and Gaps” problem.  The desired result is all the islands of unpainted points.  But the input data is somewhat atypical.  In a typical islands and gaps problem, the data is a series of points (for example, you have a table where each row had two columns – employeeid and sickday (datatype date)).  And you want a report show the ranges of days an employee was out sick.  So that you could produce a report like Employee 5 was out Jan 5 thru Jan 7.  There is a good deal of information on how to solve this type of problem available both online and in books.  One good discussion is by Itzik Ben-Gan at http://www.manning.com/nielsen/SampleChapter5.pdf.

So if the input data was a set of rows showing all the unpainted points, using that technique would solve our problem.  But instead, the data is a series of ranges of painted points (possibly overlapping).  So we want to first find all the points which are not in any of those ranges.  To do this we need to first get all numbers between the minimum and maximum points (1 and 100 in this case).  We do that either by using a Numbers table that was previously created or by building a Numbers cte as part of the query.  I choose to build one with a cte since that way I did not have to assume the OP had a permanent numbers table.  The method I used is a common one, the earliest reference I know to it is also by Itzik Ben-Gan at http://www.sqlmag.com/article/sql-server/virtual-auxiliary-table-of-numbers
Once I had this table of numbers I then kept only the numbers which were in the range 1 to100 and were not in any of the painted ranges in the original data.  What’s left is a standard islands and gaps problem. 

 

Conclusion:


As you can see, all three of the above solutions provide the result we were looking for, but do so in a very different style.  The original thread provides variations of the solutions presented here as well as further discussion.  Each of these solutions leverages different SQL Server language constructs and includes different considerations in the final solutions. I hope that you are able to learn a lot by trying out the problem yourself and then reading through the additional solutions.

Special thanks to Steven, Alejandro, and Tom for their valuable forums contribution and for contributing to this series!

Hope that helps,
Sam Lester (MSFT)

 

Contributor Bios:

Steven Wang has worked with SQL server for more than 10 years. Steven is an active SQL server community participant and speaks at events such as TechEd, CodeCamp, SQL User Group etc.

Blog: www.MSBICOE.com | LinkedIn: Steven Wang | MSDN Profile: Steven Wang - Shangzhou

Alejandro Mesa has been working with SQL Server for more than 10 years, being an MVP since 2007.  His passion is T-SQL and he is active in both English and Spanish forums under the alias of Hunchback.

Tom Cooper began his programming career in 1968, began working with database software in 1977, and first worked with Microsoft SQL Server in 1994 (version 4.21).  He is now very happily retired.