As part of my research work, I generate thousands of simulations on an almost daily basis for various scenarios. One of my scenarios involves determining strategies for selecting the most profitable short and long stock entry/exit positions. Calculating this involves data mining stock history information, analyzing moving averages of the equities, and correlating parameters associated with the equities to find the the optimal entry and exit position for a given stock. I've found one way to calculate this is using brute force computational power to simulate the trades over historic time periods and measure the results. (There probably is an easier way with advanced calculus using derivatives, etc.). In order to play out the simulation, I need to track the position of the equity associated with the order - whether there is an order pending, if the position has been opened, or if the position has been closed, or if the order was cancelled. For example, I need to measure the dollar value of positions as well as cash in the portfolio to tell how much "buying power" remains to purchase additional equities at a given point of time in the simulation.
I track dates associated with all of the events, so a "normalized" way to do this is simply to look at the date as shown in these views:
1: CREATE VIEW [Simulator].[view_ValidOrders]
2: AS
3: SELECT PortfolioOrderId, PortfolioId, TradingSymbol, CreateDate, OpenDate, UpdateDate, CloseDate, TradingMethod, Shares, LimitPriceOpen, FilledPriceOpen,
4: LimitPriceClose, FilledPriceClose, StopLimitPrice
5: FROM Simulator.PortfolioOrder
6: WHERE (CancelDate IS NULL);
7: CREATE VIEW [Simulator].[view_OpenOrders]
8: AS
9: SELECT PortfolioOrderId, PortfolioId, TradingSymbol, CreateDate, OpenDate, UpdateDate, CloseDate, TradingMethod, Shares, LimitPriceOpen, FilledPriceOpen,
10: LimitPriceClose, FilledPriceClose, StopLimitPrice
11: FROM Simulator.view_ValidOrders
12: WHERE ((CloseDate IS NULL) and NOT (OpenDate IS NULL));
13:
14: CREATE VIEW [Simulator].[view_OpenOrders]
15: AS
16: SELECT PortfolioOrderId, PortfolioId, TradingSymbol, CreateDate, OpenDate, UpdateDate, CloseDate, TradingMethod, Shares, LimitPriceOpen, FilledPriceOpen,
17: LimitPriceClose, FilledPriceClose, StopLimitPrice
18: FROM Simulator.view_ValidOrders
19: WHERE ((CloseDate IS NULL) and NOT (OpenDate IS NULL));
I can then index on the cancel-date, open-date, and close-date in combination with other fields to try to optimize the queries. But, why do this, when you only need a single status since each state is mutually exclusive? Based on that, we can create a computed field called OrderStatus (we'll come back to the other computed column "Profit" in a moment) that just has the status as in:
CREATE TABLE [Simulator].[PortfolioOrder](
[PortfolioOrderId] [int] IDENTITY(1,1) NOT NULL,
[PortfolioId] [int] NOT NULL,
[TradingSymbol] [varchar](50) NOT NULL,
[OpenDate] [date] NULL,
[CloseDate] [date] NULL,
[CancelDate] [date] NULL,
[TradingMethod] [char](1) NOT NULL,
[Shares] [smallmoney] NOT NULL,
[FilledPriceOpen] [smallmoney] NULL,
[FilledPriceClose] [smallmoney] NULL,
...
[Profit] AS ([shares]*
case
when [TradingMethod]='L' then [FilledPriceClose]-[FilledPriceOpen]
else [FilledPriceOpen]-[FilledPriceClose] end) PERSISTED,
[OrderStatus] AS (CONVERT([tinyint],
case
when [CancelDate] IS NOT NULL then (0)
when [CloseDate] IS NULL AND [OpenDate] IS NULL then (1)
when [CloseDate] IS NULL AND [OpenDate] IS NOT NULL then (2)
else (3) end,(0))) PERSISTED,
...
So, how does the computed field help us with performance? First, I need to make sure the computed field is persisted and the caveat on that is that only deterministic values can be computed. Not only that, the computed field must be derived directly from columns in the table, rather than other functions that look at other tables. For example, I have a current value column on my portfolio that is computed based on the value of all of the underlying positions, but that column is not persist-able since it depends on a function that sums the results from the order detail data (I could use an indexed view for this to get around that, but there are quite a few rules and restrictions around indexed views - see my earlier blog on indexed views).
The second thing to do is to make sure that the storage for the computed field is optimal. I wouldn't normally use a floating value for example as a persisted field that gets indexed since floating values are approximate. Also, I want to make sure the computed field uses as little storage as possible, so that my index is as small as possible, thus potentially reducing the number of levels required for a large number of rows. That is why I use the CONVERT function for the order status - by default, SQL would set this to be an integer field which is 4 bytes, but all I really need is a single byte or TINYINT.
Note, that I also use the included columns technique to only include what is needed. I leverage another persisted computed field called profit rather than the all the source fields needed to calculate the profit so that a covering index will work for my queries, thus eliminating the need to do a lookup back to the base table (see the link at end of article for more info on covering indexes). This keeps the included data small, which also contributes to a smaller size index entry.
Using the profit field as an include field along with other relevant fields, and then using the order status gives us the following index, which is just right for the types of queries we need to run in order to perform a stock trading simulation. This allows us to be able to quickly calculate the profit (or loss) and hence current buying power for the portfolio as well as find orders in a given status (such as newly submitted orders with a limit price) that may need to still be opened into a position or open positions that need to be closed to seal a profit.
Index for finding profit of closed orders for portfolio using computed fields:
CREATE NONCLUSTERED INDEX [IX_PortfolioOrder_Status] ON [Simulator].[PortfolioOrder]
(
[PortfolioId] ASC, -- 4 bytes
[OrderStatus] ASC -- 1 byte
)
INCLUDE ( [Profit]) - 4 bytes
Total: 9 bytes
Contrast that with the below index structure that had additional included columns for (Shares, PriceAtClose, PriceAtOpen instead of just Profit) and included the cancel date to filter out invalid orders, or alternatively did not include any of the date fields, but then had to filter after retrieving by PortfolioId.
Index for finding profit of closed orders for portfolio without benefit of computed fields:
CREATE NONCLUSTERED INDEX [IX_PortfolioOrder_Status] ON [Simulator].[PortfolioOrder]
(
[PortfolioId] ASC, -- 4 bytes
[CloseDate] ASC-- 4 bytes, probably not worth having this as part of index and just incorporate
-- into the include list and filter.
)
INCLUDE ( [FilledPriceAtOpen], [FilledPriceAtClose], [TradingMethod], [Shares]) - 13 bytes
Total: 21 bytes
So, how effective is this technique? I"m sorry I don't have hard numbers before-and-after, but I can honestly tell you that my simulations for this application are running about twice as fast now. I believe the main reason is that the number of index levels was reduced, because I was at a threshold due to the number of rows in the Portfolio Orders. Up to a few hundred thousand rows, the performance was similar, however once we cross that number, we require a second index level due to the sheer length of the index when using the . Each additional level in the index requires another I/O to per row lookup.
This brings up another point about indexes and performance. If you experience an overnight-drop in performance on some queries as more data is added, it's probably because one of your indexes just had to add another level to support the binary tree. You'll want to understand how SQL stores indexes to get a better insight on this and other ways to mitigate the issue if merely shrinking the size of the supporting columns is not feasible. Take a look especially at filtered indexing (which we could have used for this scenario, also), where an index can be defined based on horizontal partitioning - i.e. only create this index for rows meeting a specific filter.
This points out the need to carefully consider the lengths of the columns you are indexing as well as the sizes of the fields you include in your include list (at some point it doesn't make sense if you're including every field, may as well forget the list and do a lookup). First, do you really need to use the whole column? If not, you might want to use a computed field to get a substring of what you really need to index. Second, are you using the right data type? If you're shamelessly using GUIDs for primary keys, shame on you, you are wasting 4 times the storage space of a single identity (and not only that creating fragmentation because your inserts are happening all over the tree causing page splits).
I'd like to go deeper into how to calculate number of bytes per row in an index, etc, but don't have time and there are other resources that provide a much better explanation than I can give. For more information, see the following links:
Coming back from the Sci-Fi world of AI, etc to some real world scenarios... Earlier I posted a generic stored procedure that automatically unpivots data so that columns become rows. Today, I provide the inverse capability, although not truly generic at this point, but the technique is generic.
Recently, I've been working with Microsoft System Center Configuration Management (SCCM) reporting and needed to create a pivot view of a task sequence. If you're not immersed in SCCM, the example probably won't make much sense, but the technique for generating the view may be useful for other scenarios. SCCM provides a lot, (and I mean ALOT) of views and reports out of the box and it provides the ability to create your own reports. SCCM 2007 stores all of it's data in a SQL Server 2005 database along with the views. For more about SCCM, see http://www.microsoft.com/systemcenter/configurationmanager/en/us/default.aspx. For more about creating custom reports, see http://www.microsoft.com/downloads/details.aspx?FamilyId=87BBE64E-5439-4FC8-BECC-DEB372A40F4A&displaylang=en
As much as I'd like to get into SCCM more, let's focus on the pivot view generation which is applicable to any scenario. In our scenario, we have multiple steps for which to report the status for a task sequence. A task sequence is a series of tasks to perform on a selected set of computers (known as a collection). You can think of it as a workflow, and SCCM lets you define whether a sub-task must complete successfully, or may be allowed to fail and allow the task sequence to continue. Task sequences may be re-run on a given machine.
Our business case is that we need to track how well our task sequences are doing and we want to drill down on specific sub-tasks (steps) that have meaning to our specific workflow and provide a single view of how all the sub-tasks performed within a date range.
SCCM captures this information in a view they provide called I created a view on top of this to ensure that only the distinct steps are captured per advertisement)
CREATE VIEW [dbo].[V_Step_Summary] AS
SELECT AdvertisementId, Step,
SUM(CASE WHEN ExitCode = 0 THEN 1 ELSE 0 END) AS SuccessCount,
SUM(CASE WHEN ExitCode <> 0 THEN 1 ELSE 0 END) AS FailureCount,
COUNT(*) AS AttemptCount
FROM [dbo].[v_TaskExecutionStatus]
GROUP BY AdvertisementId, Step
I also created a view for extracting the status of the steps by date since the underlying data is stored by date/time, and this is SQL 2005, so I can't just down-convert to date. I need to be able to group by date, so that the users can see the count of success/failures in a single day.
CREATE VIEW [dbo].[V_Step_Summary_ByDate] AS
SELECT TOP 100 PERCENT AdvertisementId, Step,
DATEADD(DD,DATEDIFF(DD,0,ExecutionTime),0) AS ExecutionDate,
SUM(CASE WHEN ExitCode = 0 THEN 1 ELSE 0 END) AS SuccessCount,
SUM(CASE WHEN ExitCode <> 0 THEN 1 ELSE 0 END) AS FailureCount,
COUNT(*) AS AttemptCount
FROM [dbo].[v_TaskExecutionStatus]
GROUP BY AdvertisementId, Step,
DATEADD(DD,DATEDIFF(DD,0,ExecutionTime),0)
OK, so now we have the building blocks, let's look at the views that pivot the data. We have 3 views: 1 to pivot just the success count, 1 to pivot just the failure count, and 1 to outer join both together to give us both failures and success.
Brace yourself, here are the views:
CREATE VIEW [dbo].[V_StepPivot_SuccessCount] AS SELECT
AdvertisementID, ExecutionDate, 'SuccessCount' as CountType,
[00] AS [S00],
[01] AS [S01],
[02] AS [S02],
[03] AS [S03],
[04] AS [S04],
[05] AS [S05],
[06] AS [S06],
[07] AS [S07],
[08] AS [S08],
[09] AS [S09],
[10] AS [S10],
[11] AS [S11],
[12] AS [S12],
[13] AS [S13],
[14] AS [S14],
[15] AS [S15],
[16] AS [S16],
[17] AS [S17],
[18] AS [S18],
[19] AS [S19],
[20] AS [S20],
[21] AS [S21],
[22] AS [S22],
[23] AS [S23],
[24] AS [S24],
[25] AS [S25],
[26] AS [S26],
[27] AS [S27],
[28] AS [S28],
[29] AS [S29],
[30] AS [S30],
[31] AS [S31],
[32] AS [S32],
[33] AS [S33],
[34] AS [S34],
[35] AS [S35],
[36] AS [S36],
[37] AS [S37],
[38] AS [S38],
[39] AS [S39],
[40] AS [S40],
[41] AS [S41],
[42] AS [S42],
[43] AS [S43],
[44] AS [S44],
[45] AS [S45],
[46] AS [S46],
[47] AS [S47],
[48] AS [S48],
[49] AS [S49],
[50] AS [S50],
[51] AS [S51],
[52] AS [S52],
[53] AS [S53],
[54] AS [S54],
[55] AS [S55],
[56] AS [S56],
[57] AS [S57],
[58] AS [S58],
[59] AS [S59],
[60] AS [S60],
[61] AS [S61],
[62] AS [S62],
[63] AS [S63],
[64] AS [S64],
[65] AS [S65],
[66] AS [S66],
[67] AS [S67],
[68] AS [S68]
FROM (SELECT AdvertisementID, ExecutionDate,Step,SuccessCount FROM dbo.V_Step_Summary_ByDate) p
PIVOT (SUM (SuccessCount) For Step in (
[00],
[01],
[02],
[03],
[04],
[05],
[06],
[07],
[08],
[09],
[10],
[11],
[12],
[13],
[14],
[15],
[16],
[17],
[18],
[19],
[20],
[21],
[22],
[23],
[24],
[25],
[26],
[27],
[28],
[29],
[30],
[31],
[32],
[33],
[34],
[35],
[36],
[37],
[38],
[39],
[40],
[41],
[42],
[43],
[44],
[45],
[46],
[47],
[48],
[49],
[50],
[51],
[52],
[53],
[54],
[55],
[56],
[57],
[58],
[59],
[60],
[61],
[62],
[63],
[64],
[65],
[66],
[67],
[68]
)) AS PVT
GO
CREATE VIEW [dbo].[V_StepPivot_FailureCount] AS SELECT
AdvertisementID, ExecutionDate, 'FailureCount' as CountType,
[00] AS [F00],
[01] AS [F01],
[02] AS [F02],
[03] AS [F03],
[04] AS [F04],
[05] AS [F05],
[06] AS [F06],
[07] AS [F07],
[08] AS [F08],
[09] AS [F09],
[10] AS [F10],
[11] AS [F11],
[12] AS [F12],
[13] AS [F13],
[14] AS [F14],
[15] AS [F15],
[16] AS [F16],
[17] AS [F17],
[18] AS [F18],
[19] AS [F19],
[20] AS [F20],
[21] AS [F21],
[22] AS [F22],
[23] AS [F23],
[24] AS [F24],
[25] AS [F25],
[26] AS [F26],
[27] AS [F27],
[28] AS [F28],
[29] AS [F29],
[30] AS [F30],
[31] AS [F31],
[32] AS [F32],
[33] AS [F33],
[34] AS [F34],
[35] AS [F35],
[36] AS [F36],
[37] AS [F37],
[38] AS [F38],
[39] AS [F39],
[40] AS [F40],
[41] AS [F41],
[42] AS [F42],
[43] AS [F43],
[44] AS [F44],
[45] AS [F45],
[46] AS [F46],
[47] AS [F47],
[48] AS [F48],
[49] AS [F49],
[50] AS [F50],
[51] AS [F51],
[52] AS [F52],
[53] AS [F53],
[54] AS [F54],
[55] AS [F55],
[56] AS [F56],
[57] AS [F57],
[58] AS [F58],
[59] AS [F59],
[60] AS [F60],
[61] AS [F61],
[62] AS [F62],
[63] AS [F63],
[64] AS [F64],
[65] AS [F65],
[66] AS [F66],
[67] AS [F67],
[68] AS [F68]
FROM (SELECT AdvertisementID, ExecutionDate,Step,FailureCount FROM dbo.V_Step_Summary_ByDate) p
PIVOT (SUM (FailureCount) For Step in (
[00],
[01],
[02],
[03],
[04],
[05],
[06],
[07],
[08],
[09],
[10],
[11],
[12],
[13],
[14],
[15],
[16],
[17],
[18],
[19],
[20],
[21],
[22],
[23],
[24],
[25],
[26],
[27],
[28],
[29],
[30],
[31],
[32],
[33],
[34],
[35],
[36],
[37],
[38],
[39],
[40],
[41],
[42],
[43],
[44],
[45],
[46],
[47],
[48],
[49],
[50],
[51],
[52],
[53],
[54],
[55],
[56],
[57],
[58],
[59],
[60],
[61],
[62],
[63],
[64],
[65],
[66],
[67],
[68]
)) AS PVT
CREATE VIEW [dbo].[V_StepPivot_FailureCount] AS SELECT
AdvertisementID, ExecutionDate, 'FailureCount' as CountType,
[00] AS [F00],
[01] AS [F01],
[02] AS [F02],
[03] AS [F03],
[04] AS [F04],
[05] AS [F05],
[06] AS [F06],
[07] AS [F07],
[08] AS [F08],
[09] AS [F09],
[10] AS [F10],
[11] AS [F11],
[12] AS [F12],
[13] AS [F13],
[14] AS [F14],
[15] AS [F15],
[16] AS [F16],
[17] AS [F17],
[18] AS [F18],
[19] AS [F19],
[20] AS [F20],
[21] AS [F21],
[22] AS [F22],
[23] AS [F23],
[24] AS [F24],
[25] AS [F25],
[26] AS [F26],
[27] AS [F27],
[28] AS [F28],
[29] AS [F29],
[30] AS [F30],
[31] AS [F31],
[32] AS [F32],
[33] AS [F33],
[34] AS [F34],
[35] AS [F35],
[36] AS [F36],
[37] AS [F37],
[38] AS [F38],
[39] AS [F39],
[40] AS [F40],
[41] AS [F41],
[42] AS [F42],
[43] AS [F43],
[44] AS [F44],
[45] AS [F45],
[46] AS [F46],
[47] AS [F47],
[48] AS [F48],
[49] AS [F49],
[50] AS [F50],
[51] AS [F51],
[52] AS [F52],
[53] AS [F53],
[54] AS [F54],
[55] AS [F55],
[56] AS [F56],
[57] AS [F57],
[58] AS [F58],
[59] AS [F59],
[60] AS [F60],
[61] AS [F61],
[62] AS [F62],
[63] AS [F63],
[64] AS [F64],
[65] AS [F65],
[66] AS [F66],
[67] AS [F67],
[68] AS [F68]
FROM (SELECT AdvertisementID, ExecutionDate,Step,FailureCount FROM dbo.V_Step_Summary_ByDate) p
PIVOT (SUM (FailureCount) For Step in (
[00],
[01],
[02],
[03],
[04],
[05],
[06],
[07],
[08],
[09],
[10],
[11],
[12],
[13],
[14],
[15],
[16],
[17],
[18],
[19],
[20],
[21],
[22],
[23],
[24],
[25],
[26],
[27],
[28],
[29],
[30],
[31],
[32],
[33],
[34],
[35],
[36],
[37],
[38],
[39],
[40],
[41],
[42],
[43],
[44],
[45],
[46],
[47],
[48],
[49],
[50],
[51],
[52],
[53],
[54],
[55],
[56],
[57],
[58],
[59],
[60],
[61],
[62],
[63],
[64],
[65],
[66],
[67],
[68]
)) AS PVT
GO
CREATE VIEW [dbo].[V_StepPivot_Count] AS SELECT sc.AdvertisementID, sc.ExecutionDate,
[S00], [F00],
[S01], [F01],
[S02], [F02],
[S03], [F03],
[S04], [F04],
[S05], [F05],
[S06], [F06],
[S07], [F07],
[S08], [F08],
[S09], [F09],
[S10], [F10],
[S11], [F11],
[S12], [F12],
[S13], [F13],
[S14], [F14],
[S15], [F15],
[S16], [F16],
[S17], [F17],
[S18], [F18],
[S19], [F19],
[S20], [F20],
[S21], [F21],
[S22], [F22],
[S23], [F23],
[S24], [F24],
[S25], [F25],
[S26], [F26],
[S27], [F27],
[S28], [F28],
[S29], [F29],
[S30], [F30],
[S31], [F31],
[S32], [F32],
[S33], [F33],
[S34], [F34],
[S35], [F35],
[S36], [F36],
[S37], [F37],
[S38], [F38],
[S39], [F39],
[S40], [F40],
[S41], [F41],
[S42], [F42],
[S43], [F43],
[S44], [F44],
[S45], [F45],
[S46], [F46],
[S47], [F47],
[S48], [F48],
[S49], [F49],
[S50], [F50],
[S51], [F51],
[S52], [F52],
[S53], [F53],
[S54], [F54],
[S55], [F55],
[S56], [F56],
[S57], [F57],
[S58], [F58],
[S59], [F59],
[S60], [F60],
[S61], [F61],
[S62], [F62],
[S63], [F63],
[S64], [F64],
[S65], [F65],
[S66], [F66],
[S67], [F67],
[S68], [F68]
FROM V_StepPivot_FailureCount fc
FULL OUTER JOIN V_StepPivot_SuccessCount sc
ON sc.AdvertisementId = fc.AdvertisementId
AND sc.ExecutionDate = fc.ExecutionDate
Wow, 68 columns are required - 1 for each step! That's a lot of pivoting. If you think I typed in all 68 column names, you're wrong! I generated the code using the below stored procedure. I didn't bother using SP_EXECUTESQL - just output and then manually cut and paste the output into a new query window and then run it from there. It wouldn't be that hard to put this all into a variable and then execute the variable to build the view directly from the stored proc. Note the retrieval from V_Step_Summary in order to generate the list of columns
CREATE PROCEDURE [dbo].[USP_Generate_StepPivot_Views_V3] AS
BEGIN
SET NOCOUNT ON
SELECT 'CREATE VIEW V_StepPivot_SuccessCount AS SELECT ' AS Line
UNION ALL SELECT ' AdvertisementID, ExecutionDate, ''SuccessCount'' as CountType,'
UNION ALL select distinct
(' [' + (CASE WHEN STEP < 10 THEN '0' + CONVERT(VARCHAR(1), STEP) ELSE CONVERT(VARCHAR(2),Step) END) + ']')
+
(' AS [S' + (CASE WHEN STEP < 10 THEN '0' + CONVERT(VARCHAR(1), STEP) ELSE CONVERT(VARCHAR(2),Step) END) +
(CASE WHEN STEP = (SELECT MAX(STEP) FROM V_Step_Summary) THEN ']' ELSE '],' END))
FROM dbo.V_Step_Summary
UNION ALL SELECT ' FROM (SELECT AdvertisementID, ExecutionDate,Step,SuccessCount FROM dbo.V_Step_Summary_ByDate) p'
UNION ALL SELECT ' PIVOT (SUM (SuccessCount) For Step in ('
-- This part generates the pivoted column list by selecting the distinct rows that
-- correspond to the pivot columns to create. The case statement is used to identify
-- the last step.
UNION ALL select distinct
(' [' + (CASE WHEN STEP < 10 THEN '0' + CONVERT(VARCHAR(1), STEP) ELSE CONVERT(VARCHAR(2),Step) END) +
(CASE WHEN STEP = (SELECT MAX(STEP) FROM V_Step_Summary) THEN ']' ELSE '],' END))
FROM dbo.V_Step_Summary
UNION ALL SELECT ' )) AS PVT'
UNION ALL SELECT 'GO'
UNION ALL SELECT 'CREATE VIEW V_StepPivot_FailureCount AS SELECT ' AS Line
UNION ALL SELECT ' AdvertisementID, ExecutionDate, ''FailureCount'' as CountType,'
UNION ALL select distinct
(' [' + (CASE WHEN STEP < 10 THEN '0' + CONVERT(VARCHAR(1), STEP) ELSE CONVERT(VARCHAR(2),Step) END) +']')
+
(' AS [F' + (CASE WHEN STEP < 10 THEN '0' + CONVERT(VARCHAR(1), STEP) ELSE CONVERT(VARCHAR(2),Step) END) +
(CASE WHEN STEP = (SELECT MAX(STEP) FROM V_Step_Summary) THEN ']' ELSE '],' END))
FROM dbo.V_Step_Summary
UNION ALL SELECT ' FROM (SELECT AdvertisementID, ExecutionDate,Step,FailureCount FROM dbo.V_Step_Summary_ByDate) p'
UNION ALL SELECT ' PIVOT (SUM (FailureCount) For Step in ('
-- This part generates the pivoted column list by selecting the distinct rows that
-- correspond to the pivot columns to create. The case statement is used to identify
-- the last step.
UNION ALL select distinct
(' [' + (CASE WHEN STEP < 10 THEN '0' + CONVERT(VARCHAR(1), STEP) ELSE CONVERT(VARCHAR(2),Step) END) +
(CASE WHEN STEP = (SELECT MAX(STEP) FROM V_Step_Summary) THEN ']' ELSE '],' END))
FROM dbo.V_Step_Summary
UNION ALL SELECT ' )) AS PVT'
UNION ALL SELECT 'GO'
UNION ALL SELECT 'CREATE VIEW V_StepPivot_Count AS SELECT sc.AdvertisementID, sc.ExecutionDate, '
UNION ALL select distinct
(' [S' + (CASE WHEN STEP < 10 THEN '0' + CONVERT(VARCHAR(1), STEP) ELSE CONVERT(VARCHAR(2),Step) END) + '],'
+
(' [F' + (CASE WHEN STEP < 10 THEN '0' + CONVERT(VARCHAR(1), STEP) ELSE CONVERT(VARCHAR(2),Step) END)
+
(CASE WHEN STEP = (SELECT MAX(STEP) FROM V_Step_Summary) THEN ']' ELSE '],' END)))
FROM dbo.V_Step_Summary
UNION ALL SELECT ' FROM V_StepPivot_FailureCount fc'
UNION ALL SELECT ' FULL OUTER JOIN V_StepPivot_SuccessCount sc '
UNION ALL SELECT ' ON sc.AdvertisementId = fc.AdvertisementId'
UNION ALL SELECT ' AND sc.ExecutionDate = fc.ExecutionDate'
UNION ALL SELECT 'GO'
END
OK, so you ask "how long did it take you to write that stored procedure"? Didn't it take longer than if you'd just written the views by hand? Yes, probably did, but not by much. And I didn't have to spend much time testing the output view, since it was generated directly from the data that defines the view as being correct. Plus what happens if the number of steps change, do I want to change the pivot view by hand each time? For example, we could actually modify the stored proc to put the code into a variable which it then executes, rather than outputting. Once that is in place, we can create a scheduled task to automatically run this stored procedure on a scheduled basis to ensure that the pivot view is always correct.
Technorati Tags:
SQL Server,
SCCM,
Tips
My academic research relates to software automation – for me that means code writing code. Most code generators today are pretty stupid, though. My hypothesis is basically that software should be able to learn from experience so that it can not only generate tedious code, but also discover algorithms based on results of code execution. So, my goal isn’t just to develop software that generates mundane code components like simple data classes or default stored procedures, but build generators that learn from examining the environment along with trial and error in order to write intelligent systems.
To illustrate this point, I am using Towers of Hanoi as a proof-of-concept of a very simple example for how software can learn an algorithm and then write code that includes the algorithm for a problem domain. The recursive algorithm for solving towers of Hanoi is rather simple – see http://www.cs.cmu.edu/~cburch/survey/recurse/hanoiimpl.html. I found an example in C# for solving the puzzle using the recursive algorithm at http://azerdark.blogspot.com/2008/07/c-tower-of-hanoi.html
Here is a walk-through:
Let Hn = # moves for a stack of n disks.
Here is the optimal strategy:
Move top n−1 disks to spare peg. (Hn−1 moves)
Move bottom disk. (1 move)
Move top n−1 to bottom disk. (Hn−1 moves)
Note that: Hn = 2Hn−1 + 1
The # of moves is described by the below recurrence relation:
Solving Tower of Hanoi RR
Hn = 2 Hn−1 + 1
= 2 (2 Hn−2 + 1) + 1 = 22 Hn−2 + 2 + 1
= 22(2 Hn−3 + 1) + 2 + 1 = 23 Hn−3 + 22 + 2 + 1
…
= 2n−1 H1 + 2n−2 + … + 2 + 1
= 2n−1 + 2n−2 + … + 2 + 1 (since H1 = 1)
=
= 2n − 1
The code is fairly trivial:
procedure Hanoi(n: integer; source, dest, by: char);
Begin
if (n=1) then
writeln('Move the plate from ', source, ' to ', dest)
else begin
Hanoi(n-1, source, by, dest);
Hanoi(1, source, dest, by);
Hanoi(n-1, by, dest, source);
end;
End;
But, suppose I didn’t know the algorithm and I want to write code that can discover the algorithm for the game (or potentially any other game) by simply playing the game and looking for patterns. To do such a task requires a framework for capturing the moves and analyzing the states of all the pieces (variables) involved. The framework also needs to be able to know the rules are for performing the moves and how to determine if victory is achieved. In order to do this, I could have targeted the Hanoi puzzle specifically, but I want a framework that will work for potentially any game.
Enter the Gaming Automation Framework (GAF). To build this framework, I utilize SQL Server 2008 as my storage engine as well as leveraging functions and triggers in order to support a schema that not only supports playing the game, but also enforces rules on a game-by-game basis and supports functions for checking the status of a game as well as dictating the best moves.
Here is an overview of the GAF Components:
- Game Instantiation SQL Server Database
- Schema for game definitions and instantiation of games
- Database trigger framework for invoking functions enforcing rules that are specific to the game being played and for recording state of the game after each move.
- Stored procedures for creating games, game instances, performing moves, and undoing moves.
- Database function framework for defining what constitutes a valid move and what state arrangement defines success (i.e. winning the game).
- Game Player/Solver windows form
- Specification of the game to play along with the number of variable instantiations (i.e. for tower of Hanoi we might chose 3 disks, 4 disks, 5 disks, etc. For checkers we might chose 8 or 12 pieces, etc.)
- Function to play game using depth-first-search (DFS) to probe all moves in order to generate a solution.
- Function to analyze the game moves and determine a pattern and generate code to automatically solve the game (I’m still working on this part).
Let’s step through the major components:
Below are the schemas for game and game variable. Currently this only supports contiguous value check for the variables. I plan to improve this by allowing a discrete ranges to be associated with the variables.
CREATE TABLE [Games].[Game](
[GameId] [int] IDENTITY(1,1) NOT NULL,
[GameName] [varchar](50) NOT NULL,
[MoveRuleFunctionName] [varchar](256) NOT NULL,
–- function to get list of valid moves
[GameStatusFunctionName] [varchar](256) NOT NULL,
–- function to check if game over/won/lost
[BestMoveRuleFunctionName] [varchar](256) NULL,
CONSTRAINT [PK_Game_1] PRIMARY KEY CLUSTERED
(
[GameId] ASC
))
CREATE TABLE [Games].[GameVariable](
-- The “domain” of pieces”
[GameId] [int] NOT NULL,
[VariableId] [int] IDENTITY(1,1) NOT NULL,
[VariableName] [varchar](50) NOT NULL,
[MinInstances] [int] NOT NULL, -- Mininum number of “pieces”
[MaxInstances] [int] NOT NULL, – Max number of pieces.
[SimulationInstances] [int] NOT NULL,
-– This is the number of brute force simulations
-- to perform in order to find a solution, not currently using.
[MinValue] [int] NOT NULL,
–- min value or (range) (i.e. 1 for peg 1 for Hanoi
-- that the piece interacts with
[MaxValue] [int] NOT NULL,
–- max value (i.e. 3 for Hanoi for peg 3
[StartValue] [int] NOT NULL,
-- staring value of the piece – i.e. peg 1 for Hanoi
CONSTRAINT [PK_GameVariable] PRIMARY KEY CLUSTERED
(
[VariableId] ASC
)
So for the Tower of Hanoi, we define 3 variable rows to represent the disks and use the min and max value to control what peg the disks may be placed on.
Game Table Data:
The game table provides a level of abstraction from the specific game by storing metadata about how to play the game, check the status, and win the game in terms of the names of the functions. This allows multiple games to be setup with different rules. The simulation engine can then evaluate for a particular instance of a game which functions apply.
Games.GameVariable
The game variable relates to the game table and provides information on the variables to be used. I use the terms “variable”/“domain”/”game piece” and “domain”, ”value”, “location” interchangeably not just to confuse you, but because these are different ways to think of these entities depending on the type of scenario we are solving for. However, they are analogous and represent the mapping of a variable-value assignment. This could be a game piece/location, but could represent other types of domain/range pairs.
For our tower of Hanoi example, we only use a single variable to represent the disk and use the instance fields to control how many disks there are. In a a game where there are different types of pieces, separate variables would be created. To represent the valid states of the variable – we utilize start value to represent starting position, final value to represent winning position, and min and max to control the range of values. I plan to improve this so that not only can a range of values, but so that discrete sets of values can be used. For Tower of Hanoi, however this suffices.
Here are the tables used to support instances of the game. Note the use of the columns to store game-state for each move. This is the key component used for the automation aspect in order to discover patterns and translate into a solution algorithm.
CREATE TABLE [Games].[Instance](
[InstanceId] [int] IDENTITY(1,1) NOT NULL,
[GameId] [int] NOT NULL,
[GameDescription] [varchar](255) NULL,
[VariableInstanceCount] [int] NOT NULL,
[DomainStates] [varchar](max) NULL,
-– Each char represents a move and the value in the character
-- is the variable moved. (i.e. disk number)
[RangeStates] [varchar](max) NULL,
-– Each char represents a move and the value in the character
-- is the value mapped to the moved variable (i.e. peg)
[GameStatus] AS ([games].[fn_CheckHanoiGameStatus]([InstanceId],(1))),
CONSTRAINT [PK_GameInstance] PRIMARY KEY CLUSTERED
(
[InstanceId] ASC
))
CREATE TABLE [Games].[InstanceMove](
[InstanceMoveId] [int] IDENTITY(1,1) NOT NULL,
[MoveNumber] [int] NULL,
[InstanceId] [int] NOT NULL,
[InstanceVariableId] [int] NOT NULL,
-- The “piece” that was moved
[InstanceVariableValue] [int] NOT NULL,
-– The “location” that piece moved to
CONSTRAINT [PK_GameMove] PRIMARY KEY CLUSTERED
(
[InstanceMoveId] ASC
))
CREATE TABLE [Games].[InstanceVariable](
-- Current value for the piece (“domain" or variable)
-- and location (“range” or value)
[InstanceVariableId] [int] IDENTITY(1,1) NOT NULL,
[VariableId] [int] NOT NULL,
[InstanceId] [int] NOT NULL,
[VariableCurrentValue] [int] NOT NULL,
[InstanceVariableNumber] [int] NOT NULL,
CONSTRAINT [PK_MovePosition] PRIMARY KEY CLUSTERED
(
[InstanceVariableId] ASC
))
CREATE TABLE [Games].[InstanceVariableValueStates](
-- One per combination of domain and range (piece/location)
[InstanceId] [int] NOT NULL,
[InstanceVariableNumber] [int] NOT NULL,
[InstanceVariableValue] [int] NOT NULL,
[MoveStates] [varchar](max) NOT NULL,
–- Bit mask indicating if the variable
-- was moved on the bit corresponding to the move to the specified value.
CONSTRAINT [PK_InstanceVariableValueStates] PRIMARY KEY CLUSTERED
(
[InstanceId] ASC,
[InstanceVariableNumber] ASC,
[InstanceVariableValue] ASC
-- Please don’t get mad I’m using a compound primary key instead of generating
–- a surrogate, think of this as mainly a logical design at this point…
)
Here is the relationship diagram that shows how games and game instances inter-relate. The game instance is just that – an instance of a game, we could have multiple instances per games. Note that this isn’t a perfectly normalized design, some redundancy has been put in to avoid having to write complex queries to link game moves and their variables states back to the instance. We can clean that up later.
A trigger automatically sets the move number and sets the values in the Instance InstanceVariable after each move as well as performing some other tasks that we will get into later. Having the latest instance variable value is handy for processes that involve checking rules, etc, because they are interested in the latest state. They are usually interested in the prior state, also, so I’ll probably enhance this to store the prior value as well.
CREATE TRIGGER [Games].[InstanceMove_ITrig]
ON [Games].[InstanceMove]
AFTER INSERT
AS
BEGIN
DECLARE @InstanceId INT
DECLARE @InstanceVariableNumber INT
DECLARE @InstanceVariableValue INT
DECLARE @LastVariableNumber INT = –1
-- To force variable to be evaluated as null
DECLARE @OldVariableValue INT
DECLARE @InstanceMoveId INT
DECLARE @CustomErrorMessage varchar(255)=''
SET XACT_ABORT ON
BEGIN TRANSACTION
BEGIN TRY
-- Get variable information
SELECT
@InstanceId = INSERTED.InstanceId ,
@InstanceMoveId = INSERTED.InstanceMoveId,
@InstanceVariableValue= INSERTED.InstanceVariableValue,
@InstanceVariableNumber = iv.InstanceVariableNumber
FROM INSERTED
INNER JOIN Games.InstanceVariable iv
ON iv.InstanceVariableId = INSERTED.InstanceVariableId
IF (SELECT COUNT(*) FROM Games.InstanceMove
WHERE InstanceId = @InstanceID) > 1
BEGIN
SELECT TOP 1 @LastVariableNumber = InstanceVariableNumber
FROM Games.InstanceMove im
INNER JOIN Games.InstanceVariable iv
ON im.InstanceVariableId
= iv.InstanceVariableId
WHERE im.InstanceId = @InstanceId
AND im.InstanceMoveId <> @InstanceMoveId
ORDER BY MoveNumber DESC
END
-- Set move number
UPDATE [Games].[InstanceMove]
SET MoveNumber =
(SELECT COUNT(*) FROM [Games].[InstanceMove]
WHERE InstanceId = INSERTED.InstanceId)
FROM INSERTED
WHERE INSERTED.InstanceMoveId = InstanceMove.InstanceMoveID
-- TODO: This part of the code is currently specific to Tower of Hanoi,
need to use the MoveResultStatus in order
-- to check if this is a bad move and use the mappings to display
-- the error message. That will
-- allow this trigger to work for any game, not just Tower of Hanoi.
DECLARE @MoveResult INT -- 1 if valid, 0 if invalid, -1 if
-- no more moves left
SET @MoveResult = Games.fn_CheckHanoiMove
(@InstanceId, @InstanceVariableNumber,@InstanceVariableValue,
@LastVariableNumber,
@OldVariableValue, 1)
IF @MoveResult <= –2
-- Ignore -1 error, since that means we just now ran out of moves,
just need to error out if we have used up all moves
BEGIN
SET @CustomErrorMessage = 'Invalid Move: ' +
(CASE
WHEN @MoveResult = -6 THEN 'Out of Moves'
WHEN @MoveResult = -2 THEN 'Destination peg '
+ CONVERT(VARCHAR(2),@InstanceVariableValue)
+ ' has a smaller disk on it already than disk '
+ CONVERT(VARCHAR(2),@InstanceVariableNumber)
WHEN @MoveResult = -3 THEN 'Source peg '
+ CONVERT(VARCHAR(2),@OldVariableValue)
+ ' has a smaller disk on top than disk '
+ CONVERT(VARCHAR(2),@InstanceVariableNumber)
WHEN @MoveResult = -4 THEN 'Disk '
+ CONVERT(VARCHAR(2),@InstanceVariableNumber)
+ ' moved to same peg '
+ CONVERT(VARCHAR(2),@InstanceVariableValue) + ' as it is already on'
WHEN @MoveResult = -5 THEN 'Disk '
+ CONVERT(VARCHAR(2),@InstanceVariableNumber)
+ ' moved twice in a row'
WHEN @MoveResult = -7 THEN
'Biggest disk Must be moved to center peg at the game point in order to win '
END)
RAISERROR (@CustomErrorMessage,16,1)
END
-- Update Instance Variable
UPDATE Games.InstanceVariable
SET VariableCurrentValue = Inserted.InstanceVariableValue
FROM INSERTED
INNER JOIN Games.InstanceVariable IV
ON IV.InstanceVariableId = INSERTED.InstanceVariableId
-- Update bit masks for moves
UPDATE Games.InstanceVariableValueStates
SET MoveStates = COALESCE(MoveStates,'') + '1'
WHERE InstanceId = @InstanceId
AND InstanceVariableNumber = @InstanceVariableNumber
AND InstanceVariableValue= @InstanceVariableValue
UPDATE Games.InstanceVariableValueStates
SET MoveStates = COALESCE(MoveStates,'') + '0'
WHERE InstanceId = @InstanceId
AND NOT (InstanceVariableNumber = @InstanceVariableNumber
AND InstanceVariableValue= @InstanceVariableValue)
UPDATE Games.Instance
SET
DomainStates = COALESCE(DomainStates + ',','')
+ CONVERT(VARCHAR(2),@InstanceVariableNumber),
RangeStates = COALESCE(RangeStates + ',','')
+ CONVERT(VARCHAR(2),@InstanceVariableValue)
WHERE InstanceId = @InstanceId
COMMIT TRANSACTION
There are 4 stored procedures used to play the game
- InsertInstance – Creates a new game. Based on the passed-in number of instances which gets set on the Game, Instance variables are generated by the trigger.
- InsertMove – Inserts a new move which then results in the execution of the InstanceMove_ITrig trigger above. Returns back the status of the game.
- UndoMove – Rolls back a move. This is used when the simulator plays a “loosing” move and allows the simulator to try all different game combinations without having to start all over when hitting a dead-end.
- GetInstanceStates – Gets the state data from the Instance table for a particular game instance. The states can then be analyzed for patterns to automate solution discovery.
These are all simple stored procedures with the exception of the UndoMove:
CREATE PROCEDURE [Games].[UndoLastMove]
-- Add the parameters for the stored procedure here
@InstanceId INT
AS
BEGIN
SET NOCOUNT ON
SET XACT_ABORT ON
DECLARE @MoveNumber INT = 0
SELECT @MoveNumber = COUNT(*) FROM Games.InstanceMove
WHERE InstanceId= @InstanceId
IF @MoveNumber = 0
BEGIN
RETURN -- Nothing to do
END
BEGIN TRANSACTION
BEGIN TRY
-- Identify the variable and value affected by the last move
-- and the last move id
DECLARE @InstanceVariableId INT = 0
DECLARE @InstanceVariableValue INT = 0
DECLARE @InstanceMoveId INT = 0
DECLARE @PriorInstanceVariableValue INT = 0
-- Find the current variable and value affected
SELECT @InstanceVariableId = InstanceVariableId,
@InstanceVariableValue = InstanceVariableValue,
@InstanceMoveId = InstanceMoveId
FROM Games.InstanceMove
WHERE InstanceId = @InstanceId
AND MoveNumber = @MoveNumber
-- Delete the move
DELETE FROM Games.InstanceMove
WHERE InstanceId = @InstanceId
AND InstanceMoveId = @InstanceMoveId
-- Find the prior value for the variable, since the most current
-- is now deleted
SELECT TOP 1 @PriorInstanceVariableValue =
(CASE WHEN InstanceVariableValue IS NULL THEN gv.StartValue
ELSE InstanceVariableValue END)
FROM Games.GameVariable gv
INNER JOIN Games.InstanceVariable iv
ON iv.VariableId = gv.VariableId
LEFT JOIN Games.InstanceMove im
ON im.InstanceVariableId = iv.InstanceVariableId
WHERE iv.InstanceId = @InstanceId
AND iv.InstanceVariableId = @InstanceVariableId
ORDER BY MoveNumber DESC
-- Drop the last state from Games.Instance
UPDATE Games.Instance
SET DomainStates =
(CASE WHEN @MoveNumber > 1
THEN SUBSTRING(DomainStates,1,DATALENGTH(DomainStates) -
CHARINDEX(',',REVERSE(DomainStates)))
ELSE NULL END),
RangeStates =
(CASE WHEN @MoveNumber > 1
THEN SUBSTRING(RangeStates,1,DATALENGTH(RangeStates) -
CHARINDEX(',',REVERSE(RangeStates)))
ELSE NULL END)
WHERE InstanceId = @InstanceId
-- Drop the last bit all the Games.InstanceVariableValueStates
UPDATE Games.InstanceVariableValueStates
SET MoveStates =
(CASE WHEN @MoveNumber > 1 THEN SUBSTRING(MoveStates,1,
@MoveNumber -1) ELSE '' END)
WHERE InstanceId = @InstanceId
-- Correct the current value on InstanceVariable
UPDATE Games.InstanceVariable
SET VariableCurrentValue = @PriorInstanceVariableValue
WHERE InstanceVariableId = @InstanceVariableId
COMMIT TRANSACTION
END TRY
BEGIN CATCH
ROLLBACK TRANSACTION
EXECUTE usp_GetErrorInfo
END CATCH
END
CREATE FUNCTION [Games].[fn_CheckHanoiGameStatus]
(
-- Add the parameters for the function here
@InstanceId INT,
@PostMoveCheck bit = 0
)
RETURNS INT
AS
BEGIN
-- Declare the return variable here
-- -1 = Game lost, 0 = Undecided, 1 = Game won, -2 = Game not started
DECLARE @ResultVar INT = 0
DECLARE @NumberOfDisks INT = 0
DECLARE @NumberOfMoves INT = 0
DECLARE @NumberOfDisksOnLastPeg INT = 0
DECLARE @MaxMoves INT = 0;
DECLARE @LastMoveVariableId INT = 0;
DECLARE @TwoMovesBackVariableId INT = 0;
DECLARE @MinVariableId INT = 0
SELECT @NumberOfMoves = COUNT(*) FROM Games.InstanceMove
WHERE InstanceId = @InstanceId
-- SELECT @NumberOfDisks = COUNT(*) FROM Games.InstanceVariable
-- WHERE InstanceId = @InstanceId
SELECT @NumberOfDisks = VariableInstanceCount, @MaxMoves =
(POWER(2,@NumberOfDisks) - 1)
FROM Games.Instance
WHERE InstanceId = @InstanceId
SELECT @MinVariableId = MIN(InstanceVariableId)
FROM Games.InstanceVariable
WHERE InstanceId = @InstanceId
IF @PostMoveCheck = 1
BEGIN
SELECT @TwoMovesBackVariableId = InstanceVariableId
FROM Games.InstanceMove im
WHERE MoveNumber = (@NumberOfMoves - 2)
AND InstanceId = @InstanceId
SELECT @LastMoveVariableId = InstanceVariableId
FROM Games.InstanceMove
WHERE MoveNumber = @NumberOfMoves
AND InstanceId = @InstanceId
END
DECLARE @IsBigDiskOnLastPegYet BIT = 0
SELECT @IsBigDiskOnLastPegYet = COUNT(*)
FROM Games.InstanceVariable
WHERE InstanceVariableNumber = @NumberOfDisks
AND VariableCurrentValue = 3
AND InstanceId = @InstanceId
SELECT @NumberOfDisksOnLastPeg = COUNT(*) FROM Games.InstanceVariable
WHERE InstanceId = @InstanceId AND VariableCurrentValue = 3
-- Check for moves less than 2 raised to number of disks and for all disks
-- on the correct peg. The rules function keeps disks from being put on in
-- the wrong order.
DECLARE @NumberOfMovesLeft INT = @MaxMoves - @NumberOfMoves
IF @NumberOfMoves < POWER(2,@NumberOfDisks)
AND (@NumberOfDisksOnLastPeg = @NumberOfDisks)
BEGIN
SET @ResultVar = 1
END
ELSE BEGIN
DECLARE @MidPointMoves INT = ((@MaxMoves + 1) /2)
IF @NumberOfMoves > @MidPointMoves
BEGIN
IF NOT EXISTS
(SELECT 0 FROM Games.InstanceMove im
INNER JOIN Games.InstanceVariable iv
ON im.InstanceVariableId = iv.InstanceVariableId
AND iv.InstanceVariableNumber = @NumberOfDisks
AND im.InstanceVariableValue = 3
AND im.MoveNumber = @MidPointMoves)
BEGIN
RETURN -1
END
END
IF
(@NumberOfMoves + @NumberOfDisks - @NumberOfDisksOnLastPeg) > @MaxMoves
-- @NumberOfMoves > POWER(2,@NumberOfDisks - @NumberOfDisksOnLastPeg) - 1
-- Big diks must reach be on third peg by half of the remaining moves
OR (@IsBigDiskOnLastPegYet = 0 AND @NumberOfMoves >
(POWER(2,@NumberOfDisks)/2) -1)
-- OR (@NumberOfMovesLeft > (POWER(2,@NumberOfDisks - @NumberOfDisksOnLastPeg)
- POWER(2,@NumberOf))
-- Every other move must be the smallest peg
OR (@TwoMovesBackVariableId <> @LastMoveVariableId
AND @MinVariableId = @TwoMovesBackVariableId AND @PostMoveCheck =1)
-- The big disk must move to the center peg at the midpoint of the game.
BEGIN
SET @ResultVar = -1
END
END
RETURN @ResultVar
END
Ultimately, I’d like to eliminate the use of the functions altogether and “schematize” this. That is, the rules and game status should be represent-able within a generic schema. This would allow the user to be able to create any game through a generic interface rather than requiring the development of a function. Based on the fact that the rules and status are completely based on states of entities and their inter-relationships, it should be possible to create a schema to represent this.
Below are the functions used for the game move/game status. Although these functions are specific to the game, they are defined at the game level and the intent is to execute them dynamically in the game engine based on the game selected. Therefore, we are still able to have a generic framework for doing the game even though we have create domain-specific code.
The below code does cheat a little bit – it has a couple of “hints” to prevent non-productive moves and terminate the game early. For example, if the user doesn’t move (or can’t move) the bottom disk on the middle peg in the middle move, we know that the game is lost. (we define losing as not solving the puzzle in the minimum moves). We also don’t let the same move to be reversed, since again that means the optimal moves can’t be achieved. These hints can be picked up through examining the winning pattern, and my long-term goal is for these to be incorporated into the “hinting” aspect of the simulator rather than having to be defined ahead of time.
REATE FUNCTION [Games].[fn_CheckHanoiMove]
(
-- Add the parameters for the function here
@InstanceId int,
@VariableNumber int,
@VariableValue int,
@LastVariableNumber int = NULL,
@OldVariableValue int = NULL,
@PostCheck bit = 0
)
RETURNS INT
AS
BEGIN
-- Declare the return variable here
-- 0 = Move Invalid
-- -1 = Game over - out of moves
-- 1 = Move Valid
DECLARE @ResultVar INT = 1
DECLARE @NumberOfDisks INT = 0
IF @LastVariableNumber IS NULL
BEGIN
SELECT TOP 1 @LastVariableNumber = InstanceVariableNumber
FROM Games.InstanceMove im
INNER JOIN Games.InstanceVariable iv
ON im.InstanceVariableId
= iv.InstanceVariableId
WHERE im.InstanceId = @InstanceId
ORDER BY MoveNumber DESC
END
IF @OldVariableValue IS NULL
BEGIN
SELECT @OldVariableValue = VariableCurrentValue
FROM Games.InstanceVariable
WHERE InstanceVariableNumber = @VariableNumber
AND InstanceId = @InstanceId
END
-- Represents the peg that we were on before
SELECT @NumberOfDisks = VariableInstanceCount FROM Games.Instance
WHERE InstanceId = @InstanceId
-- Validate that there is not already a peg with the disk
-- which is a lower number disk
-- (Disks are numbered from 1 to N with 1 as the top disk)
IF (SELECT COUNT(*) FROM Games.InstanceMove
WHERE InstanceId = @InstanceId) = (POWER(2,@NumberOfDisks) - 1 )
BEGIN -- No more moves after this one
SET @ResultVar = -1
END
IF (SELECT COUNT(*) FROM Games.InstanceMove
WHERE InstanceId = @InstanceId) > (POWER(2,@NumberOfDisks) - 1 )
BEGIN -- No more moves
SET @ResultVar = -6
END
ELSE BEGIN
-- Check that destination peg does not already have a smaller peg on it
IF ((SELECT COUNT(*) FROM Games.InstanceVariable
WHERE VariableCurrentValue = @VariableValue
AND InstanceVariableNumber< @VariableNumber
AND InstanceId = @InstanceId) > 0)
BEGIN
SET @ResultVar = -2 -- Destination peg has a bigger disk on it
END
ELSE BEGIN
-- Check that this disk was the top disk on the move-from peg
IF ((SELECT COUNT(*) FROM Games.InstanceVariable
WHERE InstanceVariableNumber < @VariableNumber
AND VariableCurrentValue = @OldVariableValue
AND InstanceId = @InstanceId) > 0)
BEGIN
SET @ResultVar = -3
-- Source peg had a smaller disk on it.
END
ELSE BEGIN
IF (@VariableValue = @OldVariableValue)
-- Move must change the position
BEGIN
SET @ResultVar = -4 -- Disk moved to same peg
END
ELSE BEGIN
IF (@LastVariableNumber = @VariableNumber AND NOT
@LastVariableNumber IS NULL)
BEGIN
-- Don't allow nonsense move
-- Moved same piece twice in a row
SET @ResultVar = -5
END
ELSE BEGIN
DECLARE @NumberOfMoves INT = 0
SELECT @NumberOfMoves = COUNT(*)
FROM Games.InstanceMove WHERE InstanceId = @InstanceId
IF @PostCheck = 0 -- For pre-check,
-- increment by one so we are on the right move
BEGIN
SET @NumberOfMoves = @NumberOfMoves + 1
END
DECLARE @MidPointMoves INT =
POWER(2,@NumberOfDisks) / 2
IF @NumberOfMoves = @MidPointMoves
AND (@VariableNumber <> @NumberOfDisks
OR @VariableVAlue <> 3)
BEGIN
-- Only valid move here is the big disk to last peg
SET @ResultVar = -7
END
END
END
END
END
END
-- Return the result of the function
RETURN @ResultVar
END
Regarding these functions, with the exception of the Tower of Hanoi rule check, which we can easily convert to a generic fashion by storing the function name in the game row representing Hanoi and using dynamic SQL, all of the code is generic and not specific to any particular type of game. Let’s take a look at the Windows form used to play the game and play it for a simple 2 disk scenario using brute force. In this scenario, we only “lose” the game once before having to backup. You’ll see as we increase the number of disks on the below graphics, that the loss rate increases very rapidly for games with higher number of disks.
At this point, we now have an interface that allows us to play a game (although it is somewhat tailored to Hanoi and requires a little bit more work to make truly generic) and record the results in state data. Given the above input along with the trigger interaction, here is the data generated including the states of the disks and the pegs as well as overall game states using various variable configuration (2 disks, 3 disks, etc.). Note the patterns. The domain states refer to which variable was moved on a move and the range states indicate which peg was altered for a move. For example for the 2-disk scenario, on Move 1, we moved Disk 1 to Peg 3, On Move 2, we moved Disk 2 to Peg 3 and on Move 3, we moved Disk 1 to Peg 3.
Above are the results of 3 to 5 disk scenarios. Remember, the game is being solved through brute-force at this point. We are just trying to simulate different variations in order to find a pattern or general algorithm for solving the game. All the application understands is how to call a stored procedure to make a rule based on another stored procedure that provides a list of all the legal rules, another procedure to find out the status and determine if a game is still winnable or if it has been won, and finally a stored procedure that can be called to back out a move and try again without having to start all over. Somewhere in the list is the required rule, as there is only 1 “winning” solution to any Hanoi game that uses the fewest moves. Using this approach, along with adding a couple of hints to prevent obvious dead-ends game-plays or detect losses half-way through, we can practically simulate up to 6 disks before the time complexity involved with the recursive depth-searches becomes overwhelming.
Below is the main code in the c# program for playing the game. As you can see, there is nothing specific to Hanoi, this could be used to try to play any game and find the solution by brute force.
public MoveStatus(fn_GetValidHanoiMoves_View move, bool used)
{
Move = move;
Used = used;
}
private List<MoveStatus> getMoveStatuses(int instanceId)
{
List<MoveStatus> moveStatuses = new List<MoveStatus>();
{
var gameMoves = from p in _dc.fn_GetValidHanoiMoves(instanceId)
select p;
foreach (fn_GetValidHanoiMoves_View gameMove in gameMoves)
{
// Add them to the list
moveStatuses.Add(new MoveStatus(gameMove, false));
}
//_dc.SubmitChanges();
}
return moveStatuses;
}
private void move(List<MoveStatus> moveStatuses)
{
int? gameStatus = 0;
Stack<List<MoveStatus>> moveStack = new Stack<List<MoveStatus>>();
while (gameStatus != 1)
{
// Create a new list on the stack unless we are in undo mode
if (moveStack.Count() >= Math.Pow(2.00,
(float)this.numericUpDownObjects.Value))
{
// Went too far, roll back the stack
_dc.UndoLastMove(_InstanceId);
moveStatuses = moveStack.Pop();
}
bool madeMove = false;
// Try a move until one works
foreach (MoveStatus moveStatus in moveStatuses)
{
if (!moveStatus.Used)
{
int? lastMoveId = 0;
moveStatus.Used = true;
_attempts++;
_dc.InsertInstanceMove(ref lastMoveId, ref gameStatus,
_InstanceId,
moveStatus.Move.InstanceVariableId,
moveStatus.Move.InstanceVariableValue);
this.listBox1.Items.Add("Move " +
moveStatus.Move.InstanceVariableNumber.ToString() +
" to "
+ moveStatus.Move.InstanceVariableValue.ToString()
+ "; stack=" + moveStack.Count().ToString());
_dc.SubmitChanges();
// if move fails, undo it.
if (gameStatus == -1)
{
_losses++;
_dc.UndoLastMove(_InstanceId);
this.listBox1.Items.Add("Undo; stack="
+ moveStack.Count().ToString());
_dc.SubmitChanges();
}
// If move was ok or game-won then break out
else
{
madeMove = true;
break;
}
}
}
// If a move worked, we can move on, else we need to backtrack
if (gameStatus == 0 && madeMove)
{
moveStack.Push(moveStatuses);
moveStatuses = (getMoveStatuses(_InstanceId.Value));
}
if (gameStatus == -1 || !madeMove || moveStatuses.Count()== 0)
{
// If game status is loosing, or we weren't able to make
// a move
// or the move status count (generated in line just above)
// is 0, then we undo and pop the stack back.
_dc.UndoLastMove(_InstanceId);
this.listBox1.Items.Add("Undo; stack="
+ moveStack.Count().ToString());
_dc.SubmitChanges();
moveStatuses = (moveStack.Pop());
// Go back to the prior move statuses
}
}
}
Let’s take a closer look at domain states and range states to understand their significance:
|
Disks
|
DomainStates
|
RangeStates
|
|
2
|
1,2,1
|
2,3,3
|
|
3
|
1,2,1,3,1,2,1
|
3,2,2,3,1,3,3
|
|
4
|
1,2,1,3,1,2,1,4,1,2,1,3,1,2,1
|
2,3,3,2,1,2,2,3,3,1,1,3,2,3,3
|
|
5
|
1,2,1,3,1,2,1,4,1,2,1,3,1,2,1,5,1,2,1,3,1,2,1,4,1,2,1,3,1,2,1
|
3,2,2,3,1,3,3,2,2,1,1,2,3,2,2,3,1,3,3,1,2,1,1,3,3,2,2,3,1,3,3
|
Based on the patterns and their interrelationships, we can determine what a solution would be for N disks through some mathematical introspection. We should be able to do this generically in code. For example, note that where N is the number of disks, N is always moved on the T/2 moves where T is the total number of moves required for the optimal solution. Note also that there is symmetry in the domain states before and after the middle move – i.e. for disk-2 scenario, the same disk is moved on the first and final third move, for 3-disk scenario, we have 1,2,1, both before and after the middle-move and this pattern continues for the 4 and 5 disk scenarios. Thus we can generate a hint for an optimal solution to automatically replay the first half of the game when trying to solve for a greater number of disks. There are additional patterns in the range states and how they relate to the domains. By simply finding all of the patterns, one should be able to define the winning pattern for any number of disks without even knowing the recurrence relation.
Another way to look at this is from the point of view of the variable states, which my infrastructure provides through the code previously shown in the trigger. It just involves appending a 1 or 0 on to the bit mask based on whether or not the variable was affected.
UPDATE Games.InstanceVariableValueStates
SET MoveStates = COALESCE(MoveStates,'') + '1'
WHERE InstanceId = @InstanceId
AND InstanceVariableNumber = @InstanceVariableNumber
AND InstanceVariableValue= @InstanceVariableValue
UPDATE Games.InstanceVariableValueStates
SET MoveStates = COALESCE(MoveStates,'') + '0'
WHERE InstanceId = @InstanceId
AND NOT (InstanceVariableNumber = @InstanceVariableNumber
AND InstanceVariableValue= @InstanceVariableValue)
At the end of the game, we end up with the following. the bit mask just indicates what move the disk is moved to a particular peg. For example in instance 200, the 3 disk game, we find that disk 2 is moved onto disk 3 on the second move.
|
Instance
|
Variable
|
Value
|
Bit Mask (each bit is a move, and a 1 indicate a pairing of the variable (disk) and the value (peg).
|
|
199
|
1
|
1
|
000 - Disk 1 never moved to peg 1 (started there, but never moved there)
|
|
199
|
1
|
2
|
100 - Disk 1 moved to disk 2 on move 1
|
|
199
|
1
|
3
|
001 - Disk 1 moved to disk 3 on move 3 (last move)
|
|
199
|
2
|
1
|
000
|
|
199
|
2
|
2
|
000
|
|
199
|
2
|
3
|
010– Disk 2 moved onto peg 3 on move 2
|
|
200
|
1
|
1
|
0000100
|
|
200
|
1
|
2
|
0010000
|
|
200
|
1
|
3
|
1000001
|
|
200
|
2
|
1
|
0000000
|
|
200
|
2
|
2
|
0100000
|
|
200
|
2
|
3
|
0000010
|
|
200
|
3
|
1
|
0000000
|
|
200
|
3
|
2
|
0000000
|
|
200
|
3
|
3
|
0001000
|
|
201
|
1
|
1
|
000010000010000
|
|
201
|
1
|
2
|
100000100000100
|
|
201
|
1
|
3
|
001000001000001
|
|
201
|
2
|
1
|
000000000100000
|
|
201
|
2
|
2
|
000001000000000
|
|
201
|
2
|
3
|
010000000000010
|
|
201
|
3
|
1
|
000000000000000
|
|
201
|
3
|
2
|
000100000000000
|
|
201
|
3
|
3
|
000000000001000
|
|
201
|
4
|
1
|
000000000000000
|
|
201
|
4
|
2
|
000000000000000
|
|
201
|
4
|
3
|
000000010000000
|
|
202
|
1
|
1
|
0000100000100000100000100000100
|
|
202
|
1
|
2
|
0010000010000010000010000010000
|
|
202
|
1
|
3
|
1000001000001000001000001000001
|
|
202
|
2
|
1
|
0000000001000000000001000000000
|
|
202
|
2
|
2
|
0100000000000100000000000100000
|
|
202
|
2
|
3
|
0000010000000000010000000000010
|
|
202
|
3
|
1
|
0000000000000000000100000000000
|
|
202
|
3
|
2
|
0000000000010000000000000000000
|
|
202
|
3
|
3
|
0001000000000000000000000001000
|
|
202
|
4
|
1
|
0000000000000000000000000000000
|
|
202
|
4
|
2
|
0000000100000000000000000000000
|
|
202
|
4
|
3
|
0000000000000000000000010000000
|
|
202
|
5
|
1
|
0000000000000000000000000000000
|
|
202
|
5
|
2
|
0000000000000000000000000000000
|
|
202
|
5
|
3
|
0000000000000001000000000000000
|
|
203
|
1
|
1
|
000010000010000010000010000010000010000010000010000010000010000
|
|
203
|
1
|
2
|
100000100000100000100000100000100000100000100000100000100000100
|
|
203
|
1
|
3
|
001000001000001000001000001000001000001000001000001000001000001
|
|
203
|
2
|
1
|
000000000100000000000100000000000100000000000100000000000100000
|
|
203
|
2
|
2
|
000001000000000001000000000001000000000001000000000001000000000
|
|
203
|
2
|
3
|
010000000000010000000000010000000000010000000000010000000000010
|
|
203
|
3
|
1
|
000000000000000000010000000000000000000000010000000000000000000
|
|
203
|
3
|
2
|
000100000000000000000000000100000000000000000000000100000000000
|
|
203
|
3
|
3
|
000000000001000000000000000000000001000000000000000000000001000
|
|
203
|
4
|
1
|
000000000000000000000000000000000000000100000000000000000000000
|
|
203
|
4
|
2
|
000000000000000000000001000000000000000000000000000000000000000
|
|
203
|
4
|
3
|
000000010000000000000000000000000000000000000000000000010000000
|
|
203
|
5
|
1
|
000000000000000000000000000000000000000000000000000000000000000
|
|
203
|
5
|
2
|
000000000000000100000000000000000000000000000000000000000000000
|
|
203
|
5
|
3
|
000000000000000000000000000000000000000000000001000000000000000
|
|
203
|
6
|
1
|
000000000000000000000000000000000000000000000000000000000000000
|
|
203
|
6
|
2
|
000000000000000000000000000000000000000000000000000000000000000
|
|
203
|
6
|
3
|
000000000000000000000000000000010000000000000000000000000000000
|
Examining these patterns bit-by-bit for shift operations, power-functions, etc, may be a more practical way to decipher the winning patterns:
My goal is that the code should be able to find patterns, store the patterns in a dictionary and then consult a dictionary of patterns learned over time including patterns related to how to learn, match it up to the states based on how the variables change, and identify a recursive relation to solve for future games, not by using brute force, but by using the learned rules. As time goes on, we can add to the dictionary new algorithms and then use them to look for patterns. Therefore, this can be the basis for an extensible/self-learning framework.
Now, that we have captured the data states of the game, we can examine the game states as they relate to the number of variables. The engine for examining and learning the patterns can be the same engine that does the simulation, thus this can be a recursive self-learning organism feeding off of achieving goals within the constraints of rules. We should be able to apply this technique to any scenario, including complex ones by schematizing the scenario into a game-based schema. Since we can do that for Hanoi, although this is a very simple case, my thesis is that the same fundamental approach can be used for other scenarios, which may be much more complex.
I apologize that some of the code isn’t formatted nicely and wrapped strangely in some places. I used a handy add-in for inserting code (Insert Code from http://www.shahine.com/omar/InsertCodeForWindowsLiveWriter.aspx), but some of my code lines are too long and got truncated. Due to time constraints, I just went in and put hard returns where necessary in the blog post to make sure everything at least displays on the page.
If you feel like I left you hanging, it’s because I did. I will have more to post on this later after I figure out the recognition piece. Stay tuned…
For today, here’s a simple trick. Ever need to get a list of all the calendar dates for a period? This is very simple using a user defined function with a table. Below is a simple version. I have a more complex version I am using for my application that filters based on another table containing holidays, etc.
-- =============================================
-- Author: Bob Leithiser
-- Create date: 7/5/2009
-- Description: Return a list of dates in table format for a specified date range
-- =============================================
CREATE FUNCTION [Util].[udf_GetCalendarDates]
(
@StartDate DATE,
@EndDate DATE
)
RETURNS
@CalendarDates TABLE
(
-- Add the column definitions for the TABLE variable here
CalendarDate DATE
)
AS
BEGIN
-- Fill the table variable with the rows for your result set
DECLARE @CalendarDate DATE = @StartDate
WHILE @CalendarDate <= @EndDate
BEGIN
BEGIN
INSERT INTO @CalendarDates (CalendarDate) VALUES (@CalendarDate)
END
SET @CalendarDate = DATEADD(DD,1,@CalendarDate)
END
RETURN
END
GO
select * from util.udf_GetCalendarDates('20090101','20090130')
and now we get:
CalendarDate
2009-01-01
2009-01-02
2009-01-03
2009-01-04
2009-01-05
2009-01-06
2009-01-07
2009-01-08
2009-01-09
2009-01-10
2009-01-11
2009-01-12
2009-01-13
2009-01-14
2009-01-15
2009-01-16
2009-01-17
2009-01-18
2009-01-19
2009-01-20
2009-01-21
2009-01-22
2009-01-23
2009-01-24
2009-01-25
2009-01-26
2009-01-27
2009-01-28
2009-01-29
2009-01-30
Here’s an example where I found this useful. I needed to generate a list of holidays to avoid processing of days not containing any business activity data as part of an analytic/simulation application.
insert into Load.Holiday (ExchangeName,HolidayDate)
select '*', CalendarDate
from util.udf_GetCalendarDates('2007-05-01','2009-07-04')
where not exists
(select distinct marketdate from dbo.EquityHistory where MarketDate = CalendarDate )
and DATEPART(WEEKDAY,CalendarDate) NOT IN (1,7)
select * from Load.holiday
(20 row(s) affected)
ExchangeName HolidayDate
* 2007-05-28
* 2007-07-04
* 2007-09-03
* 2007-11-22
* 2007-12-25
* 2008-01-01
* 2008-01-21
* 2008-02-18
* 2008-03-21
* 2008-05-26
* 2008-07-04
* 2008-09-01
* 2008-11-27
* 2008-12-25
* 2009-01-01
* 2009-01-19
* 2009-02-16
* 2009-04-10
* 2009-05-25
* 2009-07-03
Technorati Tags:
SQL Server,
Tips If you’re like me and spend a lot of time in SQL Query Analyzer, querying data directly, you may find the column display format tedious for tables with lots of columns or where you are only working with a couple of entries in the table anyways.
Consider the following data that I was just trying to dump out as part of another blog post related to my doctoral research on automated software.
That’s not too bad, but I only have 5 columns. What if I have a lot more as in
select * from HumanResources.Employee Where EmployeeId = 1 using adventureworks database:
Those are just the first few columns, Unless you have a 30 inch wide screen with 3000 pixels across, you still won’t be able to see everything across the width of the screen without scrolling.
Wouldn’t it be nice if we could just do something like exec util_PivotAllColumns
EXEC [dbo].[util_PivotAllColumns]
@FromSpecifier = N'Person.Contact',
@AfterFromClause = 'WHERE ContactId = 1',
@ColumnList = '*',
@PrintSelectStatement = 1
and have SQL like below generated automatically,
SELECT 0 AS ColSeq, 'EmployeeID' AS ColName, CONVERT(NVARCHAR(MAX),[EmployeeID]) AS ColValue
FROM HumanResources.Employee WHERE EmployeeId = 1
UNION ALL SELECT 1 AS ColSeq, 'NationalIDNumber' AS ColName, CONVERT(NVARCHAR(MAX),[NationalIDNumber]) AS ColValue
FROM HumanResources.Employee WHERE EmployeeId = 1
UNION ALL SELECT 2 AS ColSeq, 'ContactID' AS ColName, CONVERT(NVARCHAR(MAX),[ContactID]) AS ColValue
FROM HumanResources.Employee WHERE EmployeeId = 1
UNION ALL SELECT 3 AS ColSeq, 'LoginID' AS ColName, CONVERT(NVARCHAR(MAX),[LoginID]) AS ColValue
FROM HumanResources.Employee WHERE EmployeeId = 1
UNION ALL SELECT 4 AS ColSeq, 'ManagerID' AS ColName, CONVERT(NVARCHAR(MAX),[ManagerID]) AS ColValue
FROM HumanResources.Employee WHERE EmployeeId = 1
UNION ALL SELECT 5 AS ColSeq, 'Title' AS ColName, CONVERT(NVARCHAR(MAX),[Title]) AS ColValue
FROM HumanResources.Employee WHERE EmployeeId = 1
UNION ALL SELECT 6 AS ColSeq, 'BirthDate' AS ColName, CONVERT(NVARCHAR(MAX),[BirthDate]) AS ColValue
FROM HumanResources.Employee WHERE EmployeeId = 1
UNION ALL SELECT 7 AS ColSeq, 'MaritalStatus' AS ColName, CONVERT(NVARCHAR(MAX),[MaritalStatus]) AS ColValue
FROM HumanResources.Employee WHERE EmployeeId = 1
UNION ALL SELECT 8 AS ColSeq, 'Gender' AS ColName, CONVERT(NVARCHAR(MAX),[Gender]) AS ColValue FR
OM HumanResources.Employee WHERE EmployeeId = 1
UNION ALL SELECT 9 AS ColSeq, 'HireDate' AS ColName, CONVERT(NVARCHAR(MAX),[HireDate]) AS ColValue
FROM HumanResources.Employee WHERE EmployeeId = 1
UNION ALL SELECT 10 AS ColSeq, 'SalariedFlag' AS ColName, CONVERT(NVARCHAR(MAX),[SalariedFlag]) AS ColValue
FROM HumanResources.Employee WHERE EmployeeId = 1
UNION ALL SELECT 11 AS ColSeq, 'VacationHours' AS ColName, CONVERT(NVARCHAR(MAX),[VacationHours]) AS ColValue
FROM HumanResources.Employee WHERE EmployeeId = 1
UNION ALL SELECT 12 AS ColSeq, 'SickLeaveHours' AS ColName, CONVERT(NVARCHAR(MAX),[SickLeaveHours]) AS ColValue
FROM HumanResources.Employee WHERE EmployeeId = 1
UNION ALL SELECT 13 AS ColSeq, 'CurrentFlag' AS ColName, CONVERT(NVARCHAR(MAX),[CurrentFlag]) AS ColValue
FROM HumanResources.Employee WHERE EmployeeId = 1
UNION ALL SELECT 14 AS ColSeq, 'rowguid' AS ColName, CONVERT(NVARCHAR(MAX),[rowguid]) AS ColValue
FROM HumanResources.Employee WHERE EmployeeId = 1
UNION ALL SELECT 15 AS ColSeq, 'ModifiedDate' AS ColName, CONVERT(NVARCHAR(MAX),[ModifiedDate]) AS ColValue
FROM HumanResources.Employee WHERE EmployeeId = 1
UNION ALL SELECT 1 AS ColSeq, 'NameStyle' AS ColName, CONVERT(NVARCHAR(MAX),[NameStyle]) AS ColValue
FROM Person.Contact
So, we could see our data like this directly from Query Analyzer
Enter the util_PivotAllColumns stored proc:
/****** Object: StoredProcedure [dbo].[util_PivotAllColumns] Script Date: 06/16/2009 16:43:29 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
-- =============================================
-- Author: Bob Leithiser
-- Create date: 6/16/2009
-- Description: Pivots all columns from a table and selects primary key value if specified as a parameter
-- WARNING: This isn't safe from SQL Injection, not to be used for production, just a testing/dumping tool.
-- Uses sys.columns view from current database where the stored proc is located, so you must create in every
-- database that you want to use this in.
-- =============================================
CREATE PROCEDURE [dbo].[util_PivotAllColumns]
@FromSpecifier NVARCHAR(MAX),
@AfterFromClause NVARCHAR(MAX) = NULL, -- typically the where clause, but make it flexible for group by, etc.
-- Downside is that this means user must supply the WHERE keyword rather than just the criteria.
@ColumnList NVARCHAR(MAX) = '*',
@PrintSelectStatement BIT = 0
AS
BEGIN
/* Usage Example:
EXEC [dbo].[util_PivotAllColumns]
@FromSpecifier = N'Person.Contact',
@AfterFromClause = 'WHERE ContactId = 1',
@ColumnList = '*',
@PrintSelectStatement = 1
-- You can omit the optional parameters and for a quick table dump just do:
EXEC [dbo].[util_PivotAllColumns] 'Tablename'
*/
-- TODO: Validate input parameters and add try/catch exception handling
-- TODO: Support multiple tables in the FROM clause
-- TODO: Add parsing to support column list
SET NOCOUNT ON
-- Assuming just a single table at this point, not parsing for multiple
-- DECLARE @TableName SYSNAME = OBJECT_NAME(OBJECT_ID(@FromSpecifier))
-- DECLARE @SchemaName SYSNAME = OBJECT_SCHEMA_NAME(OBJECT_ID(@FromSpecifier))
-- Get the column list if not provided
IF COALESCE(@ColumnList,'*') = '*'
BEGIN
DECLARE @SQLCmd NVARCHAR(MAX)
DECLARE @ColumnName SYSNAME
DECLARE ColumnCursor CURSOR FOR
SELECT [name]
FROM sys.columns
WHERE object_id = OBJECT_ID(@FromSpecifier) -- Need to add parsing for multiple tables, joins, etc.
OPEN ColumnCursor
FETCH NEXT FROM ColumnCursor INTO @ColumnName
DECLARE @ColOrder INT = 0
WHILE @@FETCH_STATUS = 0
BEGIN
DECLARE @ColumnSpecifier NVARCHAR(2000) = N''
-- Once past first column, tack on UNION ALL
IF @ColOrder > 0
SET @ColumnSpecifier = N' UNION ALL SELECT '
ELSE SET @ColumnSpecifier = N'SELECT '
-- Add the column SEQuencer
SET @ColumnSpecifier =
@ColumnSpecifier + CONVERT(NVARCHAR(2000),@ColOrder) + N' AS ColSeq, '
-- 2000 columns ought to be enougn
-- Add the column NAME
SET @ColumnSpecifier = @ColumnSpecifier + N'''' + @ColumnName + N''' AS ColName, '
-- Add the column VALUE - Have to convert to same type - use nvarchar - so all the unions get along
SET @ColumnSpecifier = @ColumnSpecifier + N'CONVERT(NVARCHAR(MAX),[' + @ColumnName + '])
AS ColValue'
-- Add the FROM clause and AFTER From Clause (typically the WHERE clause - must include WHERE)
SET @ColumnSpecifier = @ColumnSpecifier + N' FROM ' + @FromSpecifier + ' ' +
COALESCE(@AfterFromClause,'')
-- Add the column specifier to the SQL Command String and toss in c/r l/f to make more source query
-- readable
SET @SQLCmd = COALESCE(@SQLCmd,N'') + CHAR(13) + CHAR(10) + @ColumnSpecifier
-- Increment the column sequencer
SET @ColOrder = @ColOrder + 1
FETCH NEXT FROM ColumnCursor INTO @ColumnName
END
CLOSE ColumnCursor
DEALLOCATE ColumnCursor
END
ELSE BEGIN
-- Parse the column list and do inline replacements
PRINT 'Sorry, I dont parse column lists yet'
END
SELECT @SQLCmd -- For debugging
IF @PrintSelectStatement = 1
BEGIN
PRINT @SQLCmd
END
EXEC sp_ExecuteSQL @stmt = @SQLCmd
END
GO
I asked around a little and found out about some neat dynamic SQL generators for pivoting, see
http://www.sommarskog.se/pivot_sp.sp
http://www.sqlteam.com/article/dynamic-cross-tabs-pivot-tables
but these were overkill for what I needed, plus I wanted something quick and easy to use without having to think about how to summarize the data.
codeproject
I promised last time that I would follow up with some code to provide a proof of concept for using bit masks for date patterns. I have good news! I have it working in my simulation application and have prepared a little demo for how this works with a Windows form.
I found it wasn’t too difficult to use .NET BitArrays and convert to varbinary and also was able to do this using SQL-to-LINQ. The tricky part was writing a table-valued function that could actually interpret the bit mask and return back all the associated dates. This would probably be better done via SQLCLR with a .NET function, but I’m out of practice with doing .NET CLR functions so will put that off for another iteration.
So, first of all here’s a screen shot of the testing form.
The goal of the form was just to prototype the concept by allowing the specification of a day of week pattern to repeat for a particular period. In the above example, I wanted to generate a bit pattern that would contain 14 bits (actually 16 bits -8 bits for each week - as I explain later) representing whether an event occurred on a given day offset from Monday, June 22 and then store the data in a database.
Here’s the table definition I used to store the data:
CREATE TABLE [dbo].[TestBitArray](
[PeriodStartDate] [date] NOT NULL,
[PeriodLength] [smallint] NOT NULL,
[EventName] [varchar](50) NOT NULL,
[SelectedWeekDayMask] [tinyint] NOT NULL,
[DateRangeBitArray] [varbinary](50) NOT NULL,
CONSTRAINT [PK_TestBitArray] PRIMARY KEY CLUSTERED
(
[PeriodStartDate] ASC,
[PeriodLength] ASC,
[EventName] ASC
))
Here’s what ends up in the database
| PeriodStartDate |
PeriodLength |
EventName |
SelectedWeekDayMask |
DateRangeBitArray |
| 6/22/2009 |
14 |
TestEvent1 |
74 |
0x4A4A |
Please excuse me for using a compound primary key, this is just a quick and dirty example! I wanted to test with different combinations of period start dates, lengths, and event names.
Just to contrast different ways to deal with bits, I decided to store my selection bit mask as a simple tinyint rather than as a byte. I use tinyint, since the max number of bits that can be set for a week is 7 for each day, which works out to 127 maximum value using low order bits.
Without getting too bogged down, (full code is in the attachment) here’s what the .NET code fragment looks like to calculate the bit mask and display the items in the list box.
BitArray ba = new BitArray((((periodDays-1) / 7) * 8) + 8 , false);
this.listBoxResults.Items.Clear();
DateTime currentDate = periodStartDate;
int j = 0;
for (int i = 0; i < periodDays; i++)
{
if ((i % 7) == 0)
{
j++; // Don't set trailing bit, so we store 7 days per byte - simplifies comparisons
}
currentDate = periodStartDate.Add(new TimeSpan(i, 0, 0, 0));
int currentDayOfWeekIndex = (int)currentDate.DayOfWeek;
ba.Set(i+j, (checkForDayOfWeek(currentDate.DayOfWeek)));
this.listBoxResults.Items.Add(currentDate.ToShortDateString() + " : " + ba[i+j].ToString());
}
_ba = ba;
Note, the extra bit not being used so that each week is allocated to 1 byte, so 2 weeks of data is actually 14 bytes. I decided to go ahead and “waste” a bit so that each week is in a single byte. This makes it easier to look at the data byte-by-byte, which is the only way to natively view it from T-SQL and see what the patterns are for each week. I could have just strung them all together, but didn’t think it worth the space savings and it’s not too hard to work around in the functions on both the .NET and SQL side.
Storing to the database is a bit tricky and requires some helper functions that are included in the attachment.
TestBitArrayDataDataContext dc = new TestBitArrayDataDataContext();
dc.UpsertTestBitArray(
this.dateTimePickerStartDate.Value,
(short)this.numericUpDownPeriodLength.Value,
BitArrayConversions.ConvertBitMaskToByte(selectedDaysMask),
BitArrayConversions.BitArrayToByteArray(_ba),this.textBoxEvent.Text);
dc.SubmitChanges();
It is simple to read integer-type (tinyint, smallint, integer) data type into a bit array as well as to load from an array of varbinary.
For the selected days bit mask:
BitArray selectedDays = new BitArray(System.BitConverter.GetBytes(tb.SelectedWeekDayMask));
For the computed period detailed history bitmap:
_ba = new BitArray(tb.DateRangeBitArray.ToArray());
Now, the fun part is the SQL function to return the data, again for brevity, I’ve not pasted the supporting functions but are included in the TestBitArray.sql attachment in the zip.
CREATE FUNCTION [dbo].[udf_GetDatesForBitMask]
(
-- Add the parameters for the function here
@PeriodStartDate date,
@DateBitMask varbinary(8000)
)
RETURNS
@PeriodDates TABLE
(
EventDate DateTime
)
AS
BEGIN
-- Fill the table variable with the rows for your result set
DECLARE @ByteLength smallint = LEN(@DateBitMask)
DECLARE @ByteString varchar(8000) = master.dbo.fn_varbintohexstr(@DateBitMask)
DECLARE @BytePointer smallint = 0
DECLARE @EventDate datetime
WHILE @BytePointer < @ByteLength
BEGIN
DECLARE @ByteValue tinyint = dbo.fn_ConvertHexUnsignedCharToTinyInt(SUBSTRING(@ByteString,3+@BytePointer*2,2))
DECLARE @BitPower smallint = 7
WHILE @BitPower > 0 -- First bit not used
BEGIN
DECLARE @BitTestValue smallint = POWER(2.00,@BitPower)
IF @ByteValue >= @BitTestValue
BEGIN
SET @EventDate = DATEADD(DD,@BitPower - 1 + @BytePointer * 7 ,@PeriodStartDate)
SET @ByteValue = @ByteValue - @BitTestValue
INSERT @PeriodDates (EventDate) Values (@EventDate)
END
SET @BitPower = @BitPower -1
END
SET @BytePointer = @BytePointer + 1
END
RETURN
END
The ConvertHexUnsignedCharToTinyInt function is based on a similar function for converting to varbinary from string at http://sqlblog.com/blogs/peter_debetta/archive/2007/03/09/t-sql-convert-hex-string-to-varbinary.aspx
So, what do we get when we actually query the table and join with the function?
/****** Script for SelectTopNRows command from SSMS ******/
SELECT [EventName]
,[PeriodStartDate]
,[PeriodLength]
,[Eventdate]
,master.dbo.fn_varbintohexstr(DateRangeBitArray) as BitArray
FROM [dbo].[TestBitArray]
CROSS APPLY dbo.udf_GetDatesForBitMask(PeriodStartDate,DateRangeBitArray)
Where EventName = 'TestEvent1'
Order by PeriodStartDate, Eventdate
|
EventName
|
PeriodStartDate
|
PeriodLength
|
Eventdate
|
BitArray
|
|
TestEvent1
|
6/22/2009
|
14
|
6/22/2009
|
0x4a4a
|
|
TestEvent1
|
6/22/2009
|
14
|
6/24/2009
|
0x4a4a
|
|
TestEvent1
|
6/22/2009
|
14
|
6/27/2009
|
0x4a4a
|
|
TestEvent1
|
6/22/2009
|
14
|
6/29/2009
|
0x4a4a
|
|
TestEvent1
|
6/22/2009
|
14
|
7/1/2009
|
0x4a4a
|
|
TestEvent1
|
6/22/2009
|
14
|
7/4/2009
|
0x4a4a
|
Voila, we’ve stored 2 weeks of history for an event in a single row of a table at a cost of just 2 bytes and we are able to query it with T-SQL and no .CLR by joining to a table valued function. A month could be stored in 5 bytes and whole year’s worth of history could be stored in 53 bytes.
For our example,l we repeated the same days for each week, but that is just a limitation of our test harness, we could have stored a different pattern of bits within each byte to represent dates.
Later, I’ll show more practical value with the simulation example including fuzzy logic matching to show for example what stocks traded together with statistical confidence within a particular period of time, by merely comparing bit masks enumerating a few hundred summary rows instead of enumerating through hundreds of thousands of history rows. The main application I see for this design pattern is analytic applications that want to be able to rapidly compare correlations of entities based on events associated with periods over time. This is also useful to reduce storage requirements and I/O performance if there is a huge mass of history that is linked to a specific event during a period where you only need to track if the event occurred and not the details of the event itself.
I’ve included an attachment with all of the code, if you wish to try it out.
I guess my title sounds more like a subject for an academic paper, which I may actually do in time, but in any case, the use of bitmasks for represent events occurring over a fixed period of time has powerful implications for detailed correlation with minimal data storage. Previously I write about the use of periods to link detailed history data to summary data periods to help with analysis. Since then, I’ve evolved the design from an indexed view to a static period summary table, which I’ll talk about in a later post.
However, for this post, I will focus on functionality possible with bitmasks within a fixed period of time. In our scenario, we are testing the results of shorting/longing strategies by testing them against history. It is a robot-type implementation where the software doesn’t know what the actual is and then it executes a trading algorithm with variable parameters and then recording the final outcome. I don’t record the detailed results because the storage costs and processing time would be unrealistic. For example, even with a 200% ROI, using short and long strategies, there are over 12 million strategies meeting the criteria against a list of 5000 stocks on AMEX/NYSE against the 30 or so periods for the last year (periods being defined as rolling 1, 3, 6, and 12 months). Adding details outlining the execution of each transaction involved in the simulation would multiple this several times, based on the fact that the most effective strategies have involved frequent trading.
But, it would be very handy to be able to pull up the results of a simulation and actually see what dates were involved in the execution for drill-down reporting, especially if the equity was one of special interest and being profiled (more about that later also). However, the business value for this comes in with being able to do correlative parallel analysis. This allows for correlating at the detail level in order to establish latencies between events. For example, GM recently went broke (yes, Obama really isn’t going to bail out GM, at least not the “old one), the current common stock is worth nothing and their creditors may not be able to collect on the debts. So, we see a potential cascading effect between the stock action of a failing company and it’s creditors. Over the past months, this effect was evident in the market if you look at GM’s suppliers.
What would be interesting would be to see if these kinds of cascading events are really priced into the market – i.e., will the stock value of GM’s suppliers continue to drop even after GM is bankrupt? Is there a delay in when an event occurs that cascades. If so, one could adjust their stock strategy accordingly. On a more positive note, what about if one company’s stock is accelerating and it’s product sales have an effect on related companies, maybe not even in the same sector?
Storing the time series within a fixed period gives us capability to do this analysis to determine latencies. For example, if we store a bit pattern for a 30 day period of when profitable trades were executed. i.e. 00101000… with a start period of April 1, 2009 would represent events occurred on April 3 and April 5 within the period. Let’s say company X had this pattern, but company Y exhibited the same pattern offset by 2 bits – i.e 000010100.. to represent dates April 5 and 7. Using bit-shifting via .NET CLR function we could detect a parallel match. This is way over-simplified, the reality is that overall market pressures would skew these masks some so that they are not exact patterns. However, using “fuzzy” logic, one can check for the differences between the bit masks and determine if there is a correlation based on statistical methods. I.e. if the bit mask for Y offset by 2 shows a density that is within 20 – 30% proximity of the bitmask for X, then we can have a certain degree of confidence based on amount of data sampled per population (confidence interval) that there is a meaningful pattern.
This is a pretty deep topic that I should address in an academic paper, but hopefully that gives you an idea.
That’s all the time I have now. Later, we’ll get into the code involved for actually accomplishing this.
Last time we discussed the use of a period table to consolidate analysis into smaller segments of data. This can be done for an analytical application where the details are not needed or as part of a rollup. Today, we’re going to look at how to use the period table. For the application discussed in part 1, it would be useful to have information related to analyzing the period quickly accessible without having to aggregate all of the history rows. To do this, we create an indexed view with schema binding.
There are several rules for indexed view discussed in the “Creating Indexed Views” article at http://msdn.microsoft.com/en-us/library/ms191432.aspx. This includes not using AVG, MIN, MAX and using COUNT_BIG with SUM in order to derive the averages as well as restrictions requiring the use of ANSI_NULLS, QUOTED_IDENTIFIER as summarized below:
The user that executes the CREATE INDEX statement must be the view owner. The following SET options must be set to ON when the CREATE INDEX statement is executed: - ANSI_NULLS
- ANSI_PADDING
- ANSI_WARNINGS
- CONCAT_NULL_YIELDS_NULL
- QUOTED_IDENTIFIER
The NUMERIC_ROUNDABORT option must be set to OFF. This is the default setting. If the database is running in 80 compatibility mode or earlier, the ARITHABORT option must be set to ON. When you create a clustered or nonclustered index, the IGNORE_DUP_KEY option must be set to OFF (the default setting). The view cannot include text, ntext, or image columns, even if they are not referenced in the CREATE INDEX statement. If the SELECT statement in the view definition specifies a GROUP BY clause, the key of the unique clustered index can reference only columns specified in the GROUP BY clause. So, given the below table and our period table from last time, we create an indexed view as follows. Notice the conversion of the dailyLow and DailyHigh fields to Money (they were stored as decimal in the equity history table). You can’t include any imprecise data in the aggregations, hence the need for the convert. Make sure your query options are correct before executing the indexed view statement:
Underlying Tables:
CREATE TABLE [dbo].[EquityHistory](
[TradingSymbol] [varchar](25) NOT NULL,
[MarketDate] [date] NOT NULL,
[PriceAtClose] [float] NULL,
[Volume] [bigint] NOT NULL,
[DailyHigh] [float] NOT NULL,
[DailyLow] [float] NOT NULL,
[PriceAtOpen] [float] NULL,
[DateUpdated] [datetime2](0) NULL,
CONSTRAINT [PK_EquityHistory] PRIMARY KEY CLUSTERED
(
[TradingSymbol] ASC,
[MarketDate] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
View Statement:
CREATE VIEW [dbo].[IndexView_EquityHistoryPeriod] with SCHEMABINDING AS
SELECT TradingSymbol, PeriodId, StartDate, EndDate, MonthsInPeriod,
SUM(CONVERT(MONEY, DailyLow, 0)) AS SumPeriodLow, SUM(CONVERT(MONEY, DailyHigh, 0))
AS SumPeriodHigh, SUM(Volume) AS TotalVolume, COUNT_BIG(*) AS TradingDayCount
FROM dbo.EquityHistory INNER JOIN
dbo.Period ON MarketDate BETWEEN StartDate AND EndDate
WHERE (Volume > 0)
GROUP BY TradingSymbol, PeriodId, StartDate, EndDate, MonthsInPeriod
Below is the SQL to build the clustered index.
One thing to watch out for, when altering a view like this, the index is not scripted by default – i.e, if you do modify view or script view as alter, the index will not show up and when you recreate the view, the index is removed.
SET ARITHABORT ON
GO
SET CONCAT_NULL_YIELDS_NULL ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_NULLS ON
GO
SET ANSI_PADDING ON
GO
SET ANSI_WARNINGS ON
GO
SET NUMERIC_ROUNDABORT OFF
GO
CREATE UNIQUE CLUSTERED INDEX [IX_IndexedView_EquityHistoryPeriod] ON [dbo].[IndexView_EquityHistoryPeriod]
(
[PeriodId] ASC,
[TradingSymbol] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO
Recently, I started a personal project for analyzing stock market trends and calculating the optimal trading strategies. I've learned a lot of new stuff about LINQ and SQL 2008 from this. Based on that I'm opening a new series on data warehouse tips.
In my project, one of the things I quickly realized after calculating that I would need to generate about 1 billion rows in order to have sufficient detail for analysis of 5,000 publicly traded stocks with 1 year of history was that I needed to consolidate the data. It was obvious I needed a way to summarize data into periods and use those as dimensions for analyzing the performance, rather than calculating all of this from the detailed data.
What would be nice, would be to have periods that could overlap and align. For example, I want to be able to go back and look at last month, last quarter, last 6 months, and last year. However each one of these requires a different set of transactions because even though the closing date is the same, the opening date is different. The variation in open date affects the price at which a stock would be purchased at (or short-sold at), thus leading to a different set of transactions based on an entry/exit strategy. In order to focus on the technical aspect of this, I won't get into all of the application design considerations. Suffice to say, that discrete, yet aligned periods were needed.
So I immediately wrote a stored procedure (final version below) to try to insert my periods and decided to leverage the merge capability.
Below is the table definition of the Period Table
CREATE TABLE [dbo].[Period](
[PeriodId] [smallint] NOT NULL,
[StartDate] [date] NOT NULL,
[EndDate] [date] NOT NULL,
[MonthsInPeriod] AS (datediff(month,[StartDate],dateadd(day,(1),[EndDate]))),
CONSTRAINT [PK_Period] PRIMARY KEY CLUSTERED
(
[PeriodId] ASC
)
) ON [PRIMARY]
Note, that we can calculate the number of months in the period (assumes all periods are in 1 month increments) easily using a computed field with datediff
And the Merging (upsert) stored procedure
CREATE PROCEDURE [dbo].[UpsertMonthlyPeriod]
@PeriodDate date,
@Months int
AS BEGIN
DECLARE @EndDate date
SET @EndDate = DATEADD(DD,-1,DATEADD(MM,@Months, @PeriodDate))
MERGE INTO Period AS Target
USING (VALUES (@PeriodDate, @EndDate)) AS Source (NewStartDate, NewEndDate)
ON Target.StartDate = Source.NewStartDate
AND Target.EndDate =Source.NewEndDate
WHEN NOT MATCHED BY TARGET THEN
INSERT (StartDate, EndDate)
VALUES (NewStartDate, NewEndDate);
END
This is very simple example of the new Merge functionality. The USING portion identifies the values to use for mapping the source data to the target, which in this case was simply start-date and end-date (remember I am basically allowing overlapping periods for analysis purposes).
The below stored procedure then actually generates the periods by calling the Upsert stored proc:
CREATE PROCEDURE [dbo].[SetupPeriods]
@StartDate date,
@Periods int,
@CreateCummulative bit
AS BEGIN
DECLARE @x int = 0
DECLARE @PeriodDate date
SET @PeriodDate = @StartDate
WHILE @x < @Periods
BEGIN
-- Create Monthly periods
DECLARE @PeriodMonths int
EXEC dbo.UpsertMonthlyPeriod @PeriodDate = @PeriodDate, @Months=1
-- Create Quarerly periods
IF @x % 3 = 0 OR @CreateCummulative =1
BEGIN
EXEC dbo.UpsertMonthlyPeriod @PeriodDate = @PeriodDate, @Months=3
END
-- Create Bi-Annual periods
IF @x % 6 = 0 OR @CreateCummulative =1
BEGIN
EXEC dbo.UpsertMonthlyPeriod @PeriodDate = @PeriodDate, @Months=6
END
-- Create Annual periods
IF @x % 12 = 0 OR @CreateCummulative =1
BEGIN
EXEC dbo.UpsertMonthlyPeriod @PeriodDate = @PeriodDate, @Months=12
END
SET @x = @x+1
SET @PeriodDate = DATEADD(MM,@x,@StartDate)
END
END
The end result is the data shown below:
PeriodId StartDate EndDate MonthsInPeriod
8041 2008-04-01 2008-04-30 1
8043 2008-04-01 2008-06-30 3
8046 2008-04-01 2008-09-30 6
8040 2008-04-01 2009-03-31 12
8051 2008-05-01 2008-05-31 1
8053 2008-05-01 2008-07-31 3
8056 2008-05-01 2008-10-31 6
8050 2008-05-01 2009-04-30 12
8061 2008-06-01 2008-06-30 1
8063 2008-06-01 2008-08-31 3
8066 2008-06-01 2008-11-30 6
8071 2008-07-01 2008-07-31 1
8073 2008-07-01 2008-09-30 3
8076 2008-07-01 2008-12-31 6
8081 2008-08-01 2008-08-31 1
8083 2008-08-01 2008-10-31 3
8086 2008-08-01 2009-01-31 6
8091 2008-09-01 2008-09-30 1
8093 2008-09-01 2008-11-30 3
8096 2008-09-01 2009-02-28 6
8101 2008-10-01 2008-10-31 1
8103 2008-10-01 2008-12-31 3
8106 2008-10-01 2009-03-31 6
8111 2008-11-01 2008-11-30 1
8113 2008-11-01 2009-01-31 3
8116 2008-11-01 2009-04-30 6
8121 2008-12-01 2008-12-31 1
8123 2008-12-01 2009-02-28 3
9011 2009-01-01 2009-01-31 1
9013 2009-01-01 2009-03-31 3
9021 2009-02-01 2009-02-28 1
9023 2009-02-01 2009-04-30 3
9031 2009-03-01 2009-03-31 1
9041 2009-04-01 2009-04-30 1
...
10011 2010-01-01...
..
But, wait! Where did the PeriodId come from? It's not an identity field and the numbers appear to follow a pattern.
Can you guess the pattern?
...
Yes, that's right - last 2 digits of year + 2 digit month + length of period (except that 12 month period is considered "0"). This was so that it could still fit in smallint field. I've got millions of rows linking back to the period and want to keep the index small, so didn't want the extra digit.
The trick for this was an insert/update trigger shown below:
CREATE TRIGGER [dbo].[Period_IU_Trig]
ON [dbo].[Period]
AFTER INSERT, UPDATE
AS
BEGIN
-- Set Period Id to be meaningful (YY + MM + Period Length)
SET NOCOUNT ON;
UPDATE Period
SET PeriodId = CAST(CAST(DATEPART(YY,INSERTED.StartDate) - 2000 AS VARCHAR(2)) +
CAST(RIGHT(100 + (DATEPART(MM,INSERTED.StartDate)),2) AS VARCHAR(2)) +
CAST((CASE WHEN INSERTED.MonthsInPeriod = 12 THEN 0 ELSE INSERTED.MonthsInPeriod END)AS CHAR(1)) AS SMALLINT)
FROM INSERTED
WHERE INSERTED.PeriodId = Period.PeriodId
END
Note, I normally don't advocate having "meaningful" surrogate keys (an oxymoron), but in this case, its very handy in testing to quickly identify the period without going back to the source table.
To present this to the users, I provide a view that formats the dates to show the date range in descending order of the period along with the length in months. For this scenario, this is easier to work with than a calendar control and works great for a drop down as you can see below
The stored procedure for formatting this is shown below:
CREATE procedure [Reports].[SelectCurrentPeriod]
AS
SELECT TOP (100) PERCENT PeriodId,
CAST(StartDate AS VARCHAR(10)) + ' To '
+ CAST(EndDate AS VARCHAR(10))
+ ' ('
+ CAST(MonthsInPeriod AS VARCHAR(2))
+ ' Months)' as PeriodDesc
FROM dbo.Period
WHERE (EndDate <
(SELECT MAX(LoadDate) AS Expr1
FROM dbo.LoadHistory))
There you have it, a period table and some supporting procedures. For our next tip, we'll look at how the period table actually comes into play when linking up with the data.
I recently discovered to my chagrin that Windows Home Server is not exactly a secure backup platform. Let me explain before I get in too much trouble...
On the plus side, The latest versions of WHS connectors can actually backup BitLocker and EFS volumes and it works with Vista 64 bit, and actually Windows Server 2008 (64 bit to boot) (though it's not documented). I'm finding out that just about anything that works with Vista 64 bit, in the way of not just application software, but drivers as well, works with W2K8. Also, WHS is running on a W2K3 server which means you can have strong passwords, etc.
But, the plus side turns out to be the negative side if somebody happens to steal your WHS. The problem is that all that nice and secure information that was encrypted can now be restored without encryption, you just need to get access to the drive. But, alas, "I used a strong password that nobody would guess". Once again, there's a problem - WHS provides an administrative recovery console procedure if you happen to forget the password - THAT DOES NOT REQUIRE A PASSWORD - When you use that feature, you can now get your WHS that you forgot the password for, up and running again, and access all that useful backup information. Voila, your household thieve just graduated to become an identity thief and has information to all that "secure" bit-locked or EFS data.
You may be thinking, why not just run BitLocker on the WHS, that won't work, it's W2K3. What about running EFS on the protected volume. I don't think that will work, these are raw volumes, and I don't see anything to indicate that this would be supported. You're not even supposed to be logging into that WHS even though remote desktop is available for it.
So, for those of us who don't just worry about ours tuff being stolen, but what might happen to it afterwards (I live in California, after all), but we'd also like to backup our laptops and benefit from the nice automation features of WHS, what is one to do?
Alas, Virtualization to the rescue... The idea is that you secure your WHS inside of an environment that can't be compromised. In my case, that is my Windows Server 2008 running on a Dell Precision workstation with BitLocker turned on. So, if somebody swipes my Precision (which is unlikely since they probably can't lift it and get it out of the house for several minutes with our home alarm system blaring all the time...), then they have to crack that O/S before they can even see the WHS Virtual disks. Installing WHS on a virtual is a breeze, BTW, worked right out of the box. So, now I backup my Microsoft laptop as well as my Windows 2008 Server (yes, you can backup the server that actually contains the virtual WHS including the virtual machines to the WHS). I then backup the WHS virtual machine to a removable disk which can be put in the safety deposit box.
So, what do do with my other WHS, the "real" one that is actually running in a box. I still use that to do backups of family computers that don't contain a lot of confidential information and use the file server features. There actually is no problem having more than 1 home server on the same network, just run the connector software install over on the machines that need to be changed over to a different WHS.
To make this a bit more clear, here is a diagram of my backup configuration:
So, how did I get to be so paranoid, don't ask...
Have you ever been asked by a high-level architect or CIO what Microsoft's approach is to a given scenario? If you have, then you know the challenges faced because as you read through MSDN and TechNet and even at the Microsoft.com site level, you find a lot of material which is very large and complex. Trying to distill things, particularly multiple things into a short executive-type summary is difficult. It's not to say that the information isn't there, it's just that there is so much of it and it is scattered around.
I was asked earlier this week to provide Microsoft's options for SQL Server high-availability. Below is what I cobbled together from MSDN - there's not much original material in here, but hopefully you'll find this helpful, if you need to provide the same type of information in a distilled format. I wrote this for SQL 2005. I confess I haven't gotten up to speed enough on SQL 2008 to know if there are any significant modifications needed, but I suspect that most of this should be application for 2008 as well.
Microsoft provides an architectural framework to support a high availability (HA) environment known as the Windows Server System Reference Architecture. It is a framework of best practices that leverage the high availability technologies built into the Microsoft line of products including SQL Server as well as Microsoft .NET Application Servers. It encompasses the following products and technologies:
- Database Engine
- SQL Server Database – Failover Clustering, Mirroring, Log Shipping, Replication
- SQL Server Analysis Services – Failover Clustering
- SQL Server Notification Services – Failover Clustering
- .NET Application Servers - Network Load Balancing Farm, Application Pooling approach with High-Availability Back-end SQL Database.
- SQL Server Reporting Services
- Microsoft .NET Framework Web Services (IIS/Component Enterprise Services)
- Microsoft Office SharePoint Server (MOSS) and Windows SharePoint Services
- BizTalk
There are several options available with SQL Server in regards to high availability. The relevant technologies are:
- Failover clustering provides high-availability support for an entire instance of SQL Server. A failover cluster is a combination of one or more nodes, or servers, with two or more shared disks. Applications such as SQL Server and Notification Services are each installed into a Microsoft Cluster Service (MSCS) cluster group, known as a resource group. At any time, each resource group is owned by only one node in the cluster. The application service has a virtual name that is independent of the node names, and is referred to as the failover cluster instance name. An application can connect to the failover cluster instance by referencing the failover cluster instance name. The application does not have to know which node hosts the failover cluster instance.
- A SQL Server failover cluster instance appears on the network as a single computer, but has functionality that provides failover from one node to another if the current node becomes unavailable. For example, during a non-disk hardware failure, operating system failure, or planned operating system upgrade, you can configure an instance of SQL Server on one node of a failover cluster to fail over to any other node in the disk group.
- A failover cluster protects against server failure, but does not protect against disk failure. Use failover clustering to reduce system downtime and provide higher application availability.
- Failover clustering does not protect against a geographic disaster and is intended for high availability within a data center. SQL Mirroring, SQL Replication or Log Shipping can be used to augment Failover Clustering for geographic fault tolerance.

Figure 1 - SQL Clustering
- Database mirroring is primarily a software solution to increase database availability by supporting almost instantaneous failover. Database mirroring can be used to maintain a single standby database, or mirror database, for a corresponding production database that is referred to as the principal database. Database mirroring provides geographic failover, since the mirrored database may be in an entirely different geographic location.
- The mirror database is created by restoring a database backup of the principal database with no recovery. This makes the mirror database is inaccessible to clients. However, it is possible to use it indirectly for reporting purposes by creating a database snapshot on the mirror database. The database snapshot provides clients with read-only access to the data in the database as it existed when the snapshot was created.
- Each database mirroring configuration involves a principal server that contains the principal database, and a mirror server that contains the mirror database. The mirror server continuously brings the mirror database up to date with the principal database.
- Database mirroring runs with either synchronous operation in high-safety mode, or asynchronous operation in high-performance mode. In high-performance mode, the transactions commit without waiting for the mirror server to write the log to disk, which maximizes performance. In high-safety mode, a committed transaction is committed on both partners, but at the risk of increased transaction latency.
- In its simplest configuration, database mirroring involves only the principal and mirror servers. In this configuration, if the principal server is lost, the mirror server can be used as a warm standby server, with possible data loss. High-safety mode supports an alternative configuration, high-safety mode with automatic failover. This configuration involves a third server instance, known as a witness, which enables the mirror server to act as a hot standby server. Failover from the principal database to the mirror database typically takes a few seconds.
- Database Mirroring can be integrated with Failover cluster to provide a geographic failover layer on top of a local failover approach. Typically, when mirroring is used with clustering, the principal server and mirror server both reside on clusters, with the principal server running on the failover clustered instance of one cluster and the mirror server running on the failover clustered instance of a different cluster. If a cluster failover makes a principal server temporarily unavailable, client connections are disconnected from the database. After the cluster failover completes, clients can reconnect to the principal server on the same cluster, or on a different cluster or an unclustered computer, depending on the operating mode

Figure 2 - SQL Mirroring

Figure 3 - SQL Mirroring integrated with Clustering
SQL Server Log Shipping
- Like database mirroring, log shipping operates at the database level. Log shipping can be used to maintain one or more warm standby databases, referred to as secondary databases, for a corresponding production database that is referred to as the primary database. Each secondary database is created by restoring a database backup of the primary database with no recovery, or with standby. Restoring with standby permits the resulting secondary database to be used for limited reporting purposes.
- A log shipping configuration includes a single primary server that contains the primary database, one or more secondary servers that each have a secondary database, and a monitor server. Each secondary server updates its secondary database at regular intervals from log backups of the primary database. Log shipping involves a user-modifiable delay between when the primary server creates a log backup of the primary database and when the secondary server restores the log backup. Before a failover can occur, a secondary database must be brought fully up-to-date by manually applying any unrestored log backups.
- Log shipping provides the flexibility of supporting multiple standby databases. If you require multiple standby databases, you can use log shipping alone or as a supplement to database mirroring. When these solutions are used together, the current principal database of the database mirroring configuration is also the current primary database of the log shipping configuration.

Figure 4 - SQL Log Shipping
- Replication uses a publish-subscribe model, allowing a primary server, referred to as the Publisher, to distribute data to one or more secondary servers, or Subscribers. Replication allows real-time availability and scalability across these servers. It supports filtering to provide a subset of data at Subscribers, and also allows partitioned updates. Subscribers are online and available for reporting or other functions, without query recovery. SQL Server offers three types of replication: snapshot, transactional, and merge. Transactional replication provides the lowest latency and is most commonly used for high availability.

Figure 5 - SQL Server Transactional Replication
References
· http://www.microsoft.com/windowsserver2003/wssra/default.mspx - Microsoft Windows Server System Reference Architecture
· http://msdn.microsoft.com/en-us/library/ms190202.aspx - SQL Server 2005 Books Online: Configuring High Availability
At this point, all I've given you is a set of tables that can store metadata about database tables and their relationships - they pretty much reflect a subset of what you can get from the system views. Let's take a look at the process for capturing this information as the database is modified. The following are the major steps:
- Create Entity Tracker database
- Create change log table, supporting tables to capture the data model snapshot, and a view to help later with being able to dynamically construct a graphical data model diagram.
- Create supporting stored procedure and user-defined function to be invoked by DDL Trigger to create the database snapshot
- Create the DDL Trigger to execute the stored procedure whenever a change occurs to any entities affecting the database.
- Implement the DDL trigger and stored procedure into the database to monitor. Note that the actual tables containing the changes are centralized into tables in the Entity Tracker database.
After these steps are done any change to any table or relationship will result in a new version of the information being stored in the Entity Tracker database. Over time, this will allow us to monitor how the data model has changed for a particular database.
Attached is a zip that contains the SQL needed to create the Entity Tracker database tables, UpdateSchemaRelations stored procedure, and UpdateSchemaVersion DDL trigger. Let's take a look at each step and some of the code.
1) Create the EntityTracker database - That's actually not in the zip (CREATE DATABASE ENTITYTRACKER is all you need or use SSMS). You might want to call it something else, that's not a problem, but you'll need to modify the UpdateSchemaRelations stored procedure in that case to match. The reason is that this stored procedure is actually implemented into the target database in addition to the DDL trigger. The reason is that the stored proc references system tables from the database being monitored and I haven't figured out a way to get the stored proc to be able to do that aside from putting it into the target DB (if somebody else knows a way, please comment).
2) Next, the change log and supporting tables (the ones from the diagram in the introductory post) need to be created along with the view to facilitate querying relationships. The script EntityTracker.Tables.sql and EntityTracker.Views.sql in the EntityTrackerCore.zip handles this. Note, we only care about the view_SchemaTableRelations view at this point, you can disregard the other views for now.
3) Create supporting stored procedure (UpdateSchemaRelations) for DDL Trigger to invoke and supporting user-defined function. These files are named UpdateSchemaRelations.Procedure.sql and the function is found in EntityTracker.Functions.sql (We are only interested in the dbo.udf_GetTableMaxDepth function in the EntityTracker.Functions.zip). This is the core logic that actually makes the snapshot of the data. Where possible, the procedure uses standard Information Schema views with the exception of the relationships snapshot (had to use Sys.Indexes for retrieving key info due to problem with unique index not included as table constraints when the primary columns are already defined as a separate primary key).
Before getting buried in the code, let's look at the high-level logic:
- Insert list of tables into SchemaTable table
- Insert list of referential constraints into TableRelation table
- Insert Relation column pairs associated with referential constraints into RelationColumnPair table
- Calculate the relationship for each depth using a custom TSQL recursive function that traverses the relationships to find the deepest level for each table. One of the challenges for this process is that a RDBMS allows self-joined tables and circular relationships so the function must check for these conditions to avoid infinite recursion. Note that this function relies on the dbo.udf_GetTableMaxDepth function which relies on the view_SchemaTableRelations.
For the next posts, we'll drill down into the specific .SQL code and finally we will look at the user interface in more detail. I'll also make a live demo available that you can use to get an appreciation of the user experience. After that, we'll go back down into the code used to support the UI and discuss extending this technique for other scenarios.
Have you ever wanted to view how a database model has evolved over time? It's not that hard to setup an infrastructure for supporting this. In this series of articles, I'll go over the process for this. For a jump start on the whole process, I've provided a PowerPoint and a paper that I wrote on the entire process. In this series, I'll provide more of a step-by-step, since the end-to-end solution encompasses more components than can easily be digested in a single sitting.
Approach
- Create a schema to support capturing all entities and their relations along with a change log table with a revision id as the root key for all related tables.
- Create a stored procedure to insert from the system tables and views into the appropriate tables
- Add a DDL trigger to invoke the stored procedure to insert the snapshot
- Create a services layer to query the model
- Create a Click-once application to query the model and view in either replay or real-time mode.
Schema
Below is the schema I chose to support the required data.
Here is a summary of the key tables and their purposes:
- ChangeLog – Contains details about the database change – originating user, date/time, object changed, command issued, and change log version ID. Each entry in the change log correlates to a complete snapshot of a database schema for a point in time.
- SchemaTable – Contains detail about the table changed including the affected schema for a particular database version.
- TableRelation – Identifies the relationships between different tables for a particular database version.
- RelationColumnPair – Identifies the table columns that are used to join two tables together in a relationship.
That's all I have time for today - got to catch the train. For the next post, I plan to get more into the guts of the stored procedures that actually populate this data and I'll provide a snapshot of what the functionality looks like. Hopefully, I'll have a demo version up and running on my server in the next couple of day so you can see the interface. I'm busy setting up an ISA Server to allow me to take advantage of my 8 different IP addresses to facilitate having more than 1 web server tied to my leithisers.com domain.
After a false start a couple months back, I'm determined to get more involved in the community and start blogging daily and not let all the distractions get me away from this. My background includes over 22 years in software development. I am currently working for Microsoft as a Sr. Consultant working in the Department of Defense sector. My primary areas of focus include .NET, Team System and Team Foundation Server, SQL Server and SOA and interactions with SharePoint, BizTalk, Windows Workflow, and Windows Communication Foundations. I am also pursuing a PhD from Auburn University. My hope is to start posting whatever I can from my work experience and academic experience. My area of research is software automation, particularly as it is enabled through well-defined relational repositories.
Here is what I plan to post over the next few days to get things rolling
1) The Data Model Monitor - This is a tool that not only audits SQL Server database changes using a Database Definition Language trigger, but includes a schema and supporting stored procedures to capture a snapshot of a database model after each change. Along with this, there is a replay function that allows a user to see how a database model has evolved. The tool also supports
2) Working with Hyper-V to setup a virtual lab - Experiences from the field. I'll be sharing with you what I am learning as I am endeavoring to setup a virtual lab of several machines running under Hyper-V.
3) Data and Business Layer generator - This tool generates data and business objects to help automate these aspects of an application and de-couple these layers from the UI. I'll be discussing the Microsoft Enterprise library as part of this exercise.
4) Basic techniques for finding bottlenecks with SQL Server. I'll discuss how to use the trace output, store the data in a database and then use queries to identify quickly how to find the low hanging fruit for a poorly performing database to cleanup queries and setup appropriate indexes.
5) Setting up ADFS with SharePoint and providing a SOA interface point through an Enterprise Bus.
6) Using Team Foundation Server work items with Microsoft Project integration in order to implement time tracking for a development project.
This will keep me busy for a while...
"Microsoft Bob"