Many people have been asking for complete coding samples involving data mining.  Also, many people have been frustrated trying to generate reports using time series predictions.  So I decided to kill two birds with one stone and create a sample stored procedure you can use to see how some things are done, while simultaneously providing the ability to generate meaningful time series reports.

First a little background on the issue.  The desired result for a time series report is to produce a chart that shows some historical data in one color, and then shows some forecasted data in another color.  The difficulty lies in that Reporting Services requires that you produce all data for a single chart in a single dataset, and you cannot get both the historical data and the forecasted data from a single DMX query.

Basically I needed to return both results from a single query so I created the following stored procedure

public DataTable TimeSeriesReport(string ModelName,
                                                       string ColumnName, 
                                                       string SeriesName,
                                                       int NumHistoricalPoints,
                                                       int NumPredictions)

where ModelName is the name of the model; ColumnName is the name of the column to forecast; SeriesName is the name of the series to forecast if the model has series (more on this later); NumHistoricalPoints is the number of points from the original data to return; and NumPredictions is the number of predicted points to return.

Since SQL Server Data Mining offers a lot of flexibility in how time series models are modeled, I limited the types of models that are supported for this exercise.  This stored procedure work only with models that are "flat", i.e. that do not have nested tables.  These types of time series models are generally created only when you are creating models from OLAP cubes.  The procedure can be modified to support such models, but I just didn't do so for the purpose of this exercise.

The procedure supports models that contain both KEY and KEY TIME columns or only KEY TIME columns.  That is you can have a model that looks like this:

CREATE MINING MODEL WineForecasting
(
     Quarter DATETIME KEY TIME,
     Red       DOUBLE PREDICT,
     White    DOUBLE PREDICT,
     Sparkling DOUBLE PREDICT
) USING Microsoft_Time_Series

or like this

CREATE MINING MODEL WineForecasting
(
     Quarter DATETIME KEY TIME,
     WineType STRING KEY,
     Value DOUBLE PREDICT
) USING Microsoft_Time_Series

where WineType could be "Red", "White" or "Sparkling."  If you specify the SeriesName parameter in the call to the stored procedure, the procedure assumes you are using the latter model form.  If you leave the SeriesName empty (""), the procedure assumes the former model form.

The NumHistoricalPoints parameter is there to limit the amount of data shown in the result.  Reports tend to want to show all the data from the source query, and if there are 1000 historical points and you only want to forecast 5, the historical data visually overwhelms the forecasted data and you get a useless result.  Typical queries using this stored procedure would look like:

CALL TSSprocs.TimeSeriesReport("WineForecasting", "Red", "", 25, 10)

or

CALL TSSprocs.TimeSeriesReport("WineForecasting", "Value", "Red", 25, 10)

The output set has three columns - one for the timestamp, one for the historical data, and one for the forecasted data.  The flexibility of data types supported for the time stamp column complicated the code, so for simplicity I short circuited that flexibility and simply replaced it with an ordinal, with numbers <= 0 indicating historical rows, and numbers > 0 indicated forecasted rows.  Additionally to make the resulting chart "look right" I put the last historical point as the first forecasted point.  This allows Reporting Services and other charting engines to produce a seamless chart.  Without this addition, there would be a gap in the chart between historical and forecasted data, and it would just look ugly.

The outline of the code is basically:

  1. Validate Inputs
  2. Create output data table
  3. Return empty table if executing for prepare
  4. Query Historical Data and populate data table
  5. Query Forecasted data and populate data table
  6. Return data table

The code implementation was fairly straightforward and in total took about three hours including debugging, although I did find myself leverage quite a few little tricks as I progressed through the code, which I will discuss as I analyze the implementation in my next post.