Using DTS and Analysis Services to Analyse Blog Logs

Using DTS and Analysis Services to Analyse Blog Logs

  • Comments 2

Following on from the entry I wrote about analysing blog logs, one or two people have asked me how I set everything up to do this. It's a really good use of Analysis Services, actually, and offers a good example of how Microsoft's business intelligence tools work together. Nothing is too complicated, although there are a number of steps involve to get everything working. If you've a general idea of what DTS and Analysis Services are, you should be able to follow these instructions fairly well - drop a note in the comments section if you're struggling and I'll see what I can do to help out. Note that the blogs hosted here use BlogX, and so what I've written here presumes a log format that matches the one on our server.

Step 1: Generating a raw log file
The first step is of course to get hold of the raw log files that contain the information to be analysed. The log files on my server are stored in a separate file per day, so I use a standard command-line utility called wget to pull down these files over HTTP using a batch file. (Incidentally, here's a quick-and-dirty C# version of wget that I wrote.) Once I've got hold of the files, I concatenate them together into a single log file called bloglog.txt using a batch file. This is about 60MB in size, as of the last count.

Step 2: Generating a destination table
Next we need to create a database table into which the cleansed, transformed version of the log files will be inserted. To achieve this, simply execute the following SQL (or similar):

   CREATE TABLE [Hits] (
      [HitID] [bigint] IDENTITY (1, 1) NOT NULL,
      [HitTime] [datetime] NULL,
      [URL] [varchar] (400) NULL,
      [URLReferrer] [varchar] (400) NULL,
      [UserAgent] [varchar] (400) NULL,
      CONSTRAINT [PK_Hits] PRIMARY KEY CLUSTERED ([HitID])  
   ) 

Step 3: Generating a DTS Package
The main thing to do here is to copy across the data from the raw tab-delimited file to the Hits table. You can do this by creating a Transform Data task that copies the appropriate columns across (using Copy Column transforms). For the date column, you need to use an ActiveX Script transform, since the date format is different in each case. Here's the VBScript to slice the input appropriately:

   '  Copy each source column to the destination column
   Function Main()
      DTSDestination("HitTime") = Left(DTSSource("Col003"), 10) & _
         " " & Mid(DTSSource("Col003"), 12, 8)
      Main = DTSTransformStat_OK
   End Function

Test the package, and then execute it. If all goes well, you should now have a SQL Server table containing the log file. (If you'd like to save time in this step, you can download the DTS package I created and modify it to match your own table and source file names.)

(Optional) You might now want to go back and add some steps to the DTS package to clean up the table before each execution, run the batch file created in step 1, and even process the cube we'll create in step 4. This is how the final DTS package should look:

Step 4: Creating an Analysis Services Cube
Within Analysis Manager, create a new database. Set the data source to the SQL database you've just created. Now right-click on the Cubes node and choose the Cube Wizard. In a more complex data source, you'd split the fact table (the values you're browsing) and the dimension tables (the things you use to slice and dice the values) out into separate tables, but here everything is conveniently stored in the one table. The HitID is the measure - although the value has no meaning, we'll set the usage of this later.

Create four dimensions, using the star schema model and accepting all the defaults:

  1. URL (standard dimension type based on the URL column)
  2. Referrer (standard dimension type based on the URLReferrer column)
  3. User Agent (standard dimension type based on the UserAgent column)
  4. Time (time dimension type, year / quarter / month / day hierarchy, based on the HitTime column)

When the wizard dumps you unceremoniously into the Cube Editor, select the HitId measure and click the Properties button in the bottom left hand corner. Set the Aggregate Function to be Count, not Sum. This switches the aggregations to count how many entries there are, not the values contained in them. Lastly, choose Tools / Process Cube, designing storage using the defaults and then executing the process. The cube is now available for browsing.

NB Make sure you choose the option to incrementally update the shared dimensions, if you've created the dimensions above as shared - otherwise the processing step will fail on repeated attempts.

Step 5: Use Excel to Browse the Cube
Now comes the fun part! In Excel, choose Data / PivotTable and PivotChart Report. Select External Data Source and get the data from the OLAP cube (you might have to set up a new query using Microsoft Query). Accept all the other defaults, and you'll wind up with an empty PivotTable embedded in the spreadsheet. You can now drag fields to the PivotTable - make sure you drag HitID to the centre of the table, and then drag the dimensions to rows or columns. The context-sensitive menus allow you to customise sorting, grouping and formatting. Create a PivotChart to see a graphical view, such as this:

Well done if you've got this far! Have fun analysing your blog data and identifying some of the trends. You might be surprised as to which kinds of blog entries are the most popular, or what aggregators are visiting your site...

  • hello,
    This is a good site, but we would like to have a tutorial of step by step for new data warehouse users. I have already my operational data base but I don't know the steps used to create a star schema to use for my data warehouse.
    Thank you
  • I've been doing fine with the pivot table and Office XP, but have just upgraded to Office 2003 and it's stopped working :-( It lets me choose a cube but failes when it comes to create the chart with "Unable to initialise data source". I'm wondering if this is a known problem. I'm on Analysis Services Feb CTP, but we've also tested this on Beta 2.
Page 1 of 1 (2 items)