rdoherty's WebLog

  • PDC ‘09 Business Intelligence Workshop Announced

    I just wanted to spread the word that I recruited Andrew Brust to deliver the first-ever PDC workshop focused exclusively on building business intelligence solutions.  He’ll be teaming up with Dom Demsak to deliver an intensive all-day workshop designed exclusively for developers.

    The workshop is entitled Developing Microsoft BI Applications – The How and the Why.  If you are a .NET developer who wants to round out their skill set by understanding how to build reporting and analytics solutions that deliver real business value, then I suggest that you register for this workshop now before it sells out.

    PDC09Bling_Workshops_HandsDirty_136

  • Announcing the SQL Server 2008 Developer Training Kit

    I’m pleased to announce that the SQL Server 2008 Developer Training Kit is available as a free download from the Microsoft Download center, so download your copy today at the following URL:

    http://go.microsoft.com/?linkid=9665942 

    Read on for more details….

    clip_image002

    SQL Server 2008 offers an impressive array of capabilities for developers that build upon key innovations introduced in SQL Server 2005. The SQL Server 2008 Developer Training Kit will help you understand how to build web applications which deeply exploit the rich data types, programming models and new development paradigms in SQL Server 2008. The training kit is brought to you by Microsoft Developer and Platform Evangelism.

    Overview and Benefits

    The training kit offers the following benefits:

    • Learn how to build web applications that exploit the unique features and capabilities of SQL Server 2008.
    • Provides a comprehensive set of presentations, demos and hands-on labs
    • Designed for web developers who are already familiar with SQL Server application development.
    • Easy to download and install and works on free editions of SQL Server 2008 and Visual Studio 2008.

    Software Requirements

    The following software is not included with the training kit and must be downloaded and installed separately.

    • SQL Server 2008 Express with Advanced Services (or higher-level SKU)
    • Microsoft Visual Web Developer 2008 Express (or higher-level SKU)
    • AdventureWorks 2008 Sample Databases

    Intended Audience

    The training kit is designed for the following technical roles:

    • Web developers who build applications for the Microsoft platform.
    • Microsoft evangelists, technical specialists and consultants.

    Contents

    The training kit includes the following content:

    • Presentations (6)
      • Filestream
      • Spatial
      • T-SQL
      • Date and Time Types
      • SQLCLR
      • Reporting Services
    • Demos (12)
      • AdventureWorks Racing All-Up SQL Server 2008 Demo
      • SQL Server 2008 All-Up Spatial Demo
      • Spatial Types Demo
      • Intro to Filestream Demo
      • SQL CLR Nullable Types Demo
      • Programming with Filestream Demo
      • Reporting Services Web Application Integration Demo
      • Date and Time Support in SQL Server 2008 Demo
      • T-SQL Table-Valued Parameters Demo
      • T-SQL Row Constructors Demo
      • T-SQL Grouping Sets Demo
      • T-SQL Merge Demo
    • Hands-on Labs (3)
      • Using Spatial Data in TSQL
      • Using Spatial Data in Managed Code
      • Using SQL CLR in SQL Server 2008
  • March Madness On Demand: Data Collection Service Architecture

    In my previous post I talked about how we used SQL Server 2008 Service Broker to improve the durability of a data collection service for our March Madness Video On-Demand Silverlight player.  This post provides a bit of background on the architecture of the data collection service. 

    We were expecting a ton of traffic during the tournament, and we didn’t want to lose any data or become unresponsive.  Service Broker provided just the ticket.  We used 16 Service Broker “initiators” on our web tier to receive the log requests and deliver them asynchronously to a Service Broker “target” on our data tier.  There they were read off a queue, processed and consolidated into an OLTP database.   Here’s a picture of the architecture along with some stats.

    March Madness On-Demand

    The system was collecting log data at the rate of 60 MB / sec at its peak during the tournament.  Our target queue can process messages at the peak rate of about 600 messages / sec with our current hardware configuration (2X quad-core Intel Xeon).  If our traffic exceeds that (which it has on a couple of occasions) we start to see messages back up in the target queue, and the system always catches up after traffic drops off.  It’s worth noting that we could improve the throughput of our target queue with some re-engineering, but for the time being we decided to just throw more hardware at it to improve throughput.  More on that in a future post.

    Here’s what the service broker traffic looked like (courtesy of System Center Operations Manager) from our web tier initiator queues on the peak day which was March 20th, 2009 during Round 1 of the Tournament.  The Y axis measures bytes sent / sec from our initiator queues:

    mmod_webtierssbtraffic 

    Now that I’ve covered architecture, I’ll move on to some specifics in subsequent posts.

  • March Madness on Demand

    One of the best things about my job is that I get the opportunity to jump in and lend a hand to other teams who are trying to accomplish big things and need some great database technology to back them up.  Back in December I was approached by Eric Schmidt of DPE's Media and Advertising team to build a massive monitoring and telemetry system for the March Madness On Demand (MMOD) project with CBSSports.com.   This is our latest effort to drive Silverlight adoption for major media events, PR is here.

    Let me see, should I build a huge web farm to collect QoS data from millions of MMOD web viewers and roll it up into a massive real-time SQL Server 2008 data warehouse that provides Operational Business Intelligence to the virtual MMOD team, and Consumer Business Intelligence capabilities to MMOD viewers.  Um, YES!

    This is a similar project to the 2008 Summer Olympics, but we are taking it to the next level.  The operations team will use the data warehouse for monitoring and telemetry as we did last time.  It's like having leads attached to a million heart patients so we can detect an abnormal heart rate and treat it before it gets out of hand. 

    What's new this round is analyzing usage patterns and preferences in the data warehouse and feeding it back into the player to drive a higher quality user experience.  Here's how it works.

    First, the monitoring data is collected at the Silverlight 2.0 video player client and queued on 20 front-end Windows Server 2008 web servers running an IIS 7.0 / ASP.NET / Windows Communication Foundation / SQL Server Service Broker application.

    Next, the messages are routed to a target Service Broker queue where they are processed and consolidated in a log table on a central SQL Server 2008 OLTP database. 

    Every five minutes, a SQL Server Integration Services (SSIS) package wakes up and extracts the new log table data from the OLTP database, transforms it, and loads it into a central SQL Server 2008 Data Warehouse on a separate system to be used for operational and consumer BI.  Here's a screen shot of the SSIS package executing:

    MMOD_SSIS

    Every ten minutes, a SQL Server Reporting Services (SSRS) report gets executed using a scheduled report subscription.  The data source for the report is a Transact-SQL stored procedure which gets the top 12 videos viewed in the past 60 minutes.

    Here's the code for the GetTopVideos stored procedure.  Note that it automatically switches to the top 12 videos of all time if there wasn't much activity in the last 60 minutes.

    CREATE PROCEDURE [VideoMonDW].[GetTopVideoList]
    (
        @LiveVideoType nvarchar(450) = N'Live Game Video'
        ,@VideoStartEventCode tinyint = 18
    )
    AS
    BEGIN
        DECLARE @StopDate AS datetimeoffset(7);
        SELECT
            @StopDate = MAX(ServerLogTime)
        FROM
            VideoMonDW.RawLog;
        DECLARE @StartDate datetimeoffset(7) = DATEADD(minute,-16,@StopDate);
        DECLARE @Top12List AS TABLE
        (
            VideoId bigint
            ,VideoViews bigint
        );

        INSERT @Top12List
        (
            VideoId
            ,VideoViews
        )
        SELECT TOP 12
            RL.VideoId
            ,COUNT(RL.LogId)
        FROM
            VideoMonDW.RawLog RL
        WHERE
            (RL.EventCode = @VideoStartEventCode)
            AND (RL.VideoTypeId IN (SELECT VideoTypeId FROM VideoMonDW.VideoTypes WHERE VideoType <> @LiveVideoType))
            AND (RL.Created BETWEEN @StartDate AND @StopDate)
        GROUP BY
            RL.VideoId
        ORDER BY
            2 DESC;
        IF @@ROWCOUNT <= 2
        BEGIN
            -- Not much activity in the last hour so switch to top 12 list of all time
            DELETE @Top12List;
            INSERT @Top12List
            (
                VideoId
                ,VideoViews
            )
            SELECT TOP 12
                RL.VideoId
                ,COUNT(RL.LogId)
            FROM
                VideoMonDW.RawLog RL
            WHERE
                (RL.EventCode = @VideoStartEventCode)
                AND (RL.VideoTypeId IN (SELECT VideoTypeId FROM VideoMonDW.VideoTypes WHERE VideoType <> @LiveVideoType))
            GROUP BY
                RL.VideoId
            ORDER BY
                2 DESC;
        END;

        SELECT
            VI.VideoTitle AS [title]
            ,VI.VideoKey AS [guid]
            ,TL.VideoViews AS [description]
            ,VI.Created AS [pubDate]
        FROM
            @Top12List TL
                INNER JOIN VideoMonDW.Videos VI ON TL.VideoId = VI.VideoId;
        RETURN 0;
    END;

    The SSRS subscription renders the report as XML transformed into RSS 2.0 using a server-side XSLT transform, and drops it into our web farm which uses DFSR to replicate the update to all our web servers.  The feed looks like this:

    MMOD_RSS 

    We wanted to avoid latency issues for users who are far away from our web farm hosted at ORCS in Charlotte, NC, so we push the file to a CDN every 10 minutes and the Silverlight player picks it up from there. 

    And all this for what?  To provide real-time information on user preferences in our Silverlight 2.0 HQ player.  Here's a screen shot of the "What's Hot" page which lists the top 12 game videos viewers are watching in the last 60 minutes.  The data comes from our RSS feed:

    MMOD_TopVideos

    I'm heading to MIX '09 next week to talk more about this. If you are there look me up!  Check out visitmix.com for the latest from the MIX community.

  • VSTS Database Edition GDR Feedback Vote Request

    If you use the Microsoft® Visual Studio Team System 2008 Database Edition GDR and SQL Server CLR Integration, do me a favor and vote for this feedback item on Microsoft Connect:

    • Connection: Visual Studio and .NET Framework
    • Feedback ID: 418322
    • Title: Database Edition GDR: SQLCLR assembly dependencies cause deployment scripts to fail

    Thanks!

  • Feedback 101: How Microsoft Works (Sometimes...)

    As a SQL Server Technical Evangelist I have a lot of customers.  My internal customers include the SQL Server and Visual Studio development teams.  The primary thing they are looking for from me is to drive great feedback back into the development teams to help them build software that actually addresses the needs of database developers.  I do this in a variety of ways, including trying to facilitate face time between the development teams and real database developers who are using their technology.

    Unfortunately much of this feedback is anecdotal.  It requires the person representing the development team to be diligent about reflecting the feedback they received in product plans and specifications, and the reality is that the loop doesn't always get closed.  So I've been thinking about other ways to drive feedback that ensure a suggestion will get looked at as part of the product planning cycle.  Now Microsoft cooks up a new feedback engine du jour about every other week, but there is one feedback engine which has had some staying power, and that is Microsoft Connect.  Whether you love Connect or hate it, the bottom line is that when I go to the wall with development over some piece of feedback, they want evidence that it will have broad appeal, and that evidence comes from Connect.

    So from now on, I'm going to log all of my suggestions as feedback on Connect, then I'm going to blog about them here.  My hope is that folks who watch my blog will spread the word and vote for the feedback.  The more votes a feedback item gets, the higher the chance that it will get addressed. 

    If you have submitted feedback on Connect that you feel is important and want me to drive votes for it, let me know by emailing sqldevx@microsoft.com.  Make sure you include the appropriate connection (e.g. "SQL Server" or "Visual Studio and .NET Framework") as well as the ID and title of your feedback item on Connect so I can find it.  If I think its important enough to blog about, I'll do it.

    Technorati Tags:
  • SQL Server 2008 Integration Services - Scalability Lessons Learned

    My colleague Larry Clark contacted me today regarding my PDC presentation covering the 2008 Summer Olympics video monitoring project.  Basically he wanted more information on some of the lessons I learned to make my SSIS package scale.  This package I built for this project ran every five minutes, and at peak loads needed to extract up to a million log rows, transform them, and load them into our data warehouse.  I built about four versions of this package before I got it right.  Here's some of the things I learned.

    1. Use snapshots to improve concurrency. 

      My source table was seeing a ton of insert activity.  I needed to get in and read the new rows very quickly with minimal impact on concurrency.  I did not want to block new inserts to the source table while my ETL process was running.  Dirty reads were not an option because I relied on a sequence number in the source table as my high watermark. 

      I found that the fastest way to do this is to create a snapshot of the new rows by running select subquery in an insert statement.  The subquery pulls rows from the source which are then inserted into a scratch table on the same instance.  This approach moves the data in-process, minimizing the amount of time that locks need to be held on the source table.  The tradeoff is the additional IO required to write out the copy.  The scratch table is a simple heap with no indexes.   

      In my package, I implemented this using an Execute SQL Task that runs the following stored procedure:
    2. CREATE PROCEDURE [VideoMon].[ExtractLogRows]
          @MaxLogId bigint
      AS
      BEGIN
          TRUNCATE TABLE [VideoMon].[RawLogExport];
          INSERT INTO [VideoMon].[RawLogExport]
          ([LogId]
          ,[BrowserId]
          ,[BitRate]
          ...
          ,[VideoUrlId]
          FROM [VideoMon].[RawLog]
          WHERE [VideoMon].[RawLog].[LogId] > @MaxLogId;
      END

      Note the search argument which allows me to extract only the new log rows since the last load.  The @MaxLogId is the high water mark from the previous load.

      Remember SSIS imposes some default locking behavior to handle rolling back changes if a package fails.  You may need to override that behavior to avoid unintentionally holding locks on your source while the package is running. I did this by changing the default IsolationLevel at the package level from Serializable to Unspecified, then explicitly setting the IsolationLevel on my control flow containers as required. 

      This approach is markedly different from those advocated by people who think scratch table IO is evil, and that all transformation operations should occur in-flight.  I was one of those people before I built this package.  SSIS has an impressive array of capabilities that make it possible to transform data without having to write it back to disc multiple times.  The volume of data I was dealing with made this approach impossible.  In one early version of the package, I used multiple sort and merge transforms to manipulate all of the data in memory.  This worked great with smaller batches of rows but once volume picked up it put the whole system under memory pressure and my package started failing.  The resulting package uses both approaches judiciously (scratch tables and in-flight transformation) for a winning approach.

      One final point.  I was asked by my colleague Tom Pizzato why I didn't use the new Change Data Capture (CDC) support in SQL Server 2008 to implement this part.  The answer is that since I had complete control over the schema of the source system, I could build my own light-weight change data capture mechanism and reduce the complexity of the system a bit.  If I didn't have that kind of control I would have definitely used CDC.

    3. Reduce or eliminate dependencies on MS DTC.

      My initial attempts at building this package resulted in some very large, very long-running distributed transactions.  For smaller batches of new rows, the package seemed to work fine, but as the row volumes increased, the package started to fail with errors like this:

      Error: 0xC0202009 at Update Client Addresses, ClientAddresses Destination [25]: SSIS Error Code DTS_E_OLEDBERROR.  An OLE DB error has occurred. Error code: 0x80004005.  An OLE DB record is available.  Source: "Microsoft SQL Server Native Client 10.0"  Hresult: 0x80004005  Description: "The Microsoft Distributed Transaction Coordinator (MS DTC) has cancelled the distributed transaction.".

      Unfortunately I didn't have time to properly investigate the problem I was having with MS DTC, so I changed the architecture of the package to minimize the scope and duration of the distributed transaction.  This turned out to be a winning strategy.  I was able to limit the scope of the distributed transaction to a single data flow task that copied data from the scratch table on my source to a scratch table on my destination.  This was a raw insert operation, I avoided performing transformations during this move to make sure I wasn't slowing things down and unintentionally extending the duration of the distributed transaction.

      Again the tradeoff is a bit of additional IO, I used scratch tables both on my source and destination, but this greatly improved the reliability and performance of the package as a whole.
    4. Design for parallel execution.

      The SSIS team has done a great job of improving thread scheduling and management in the SQL Server 2008 release.  This eliminates the need to manually configure the MaxConcurrentExecutables package level property to avoid starving the system.  Many package designers make the mistake of executing tasks serially when the don't need to be.  If you are executing tasks serially, make sure that you are doing so because you have a logical requirement related to data integrity.  If such a requirement does not exist, run them in parallel. 

      In my case I was able to run all of the data flow tasks associated with dimension table loading in parallel except for one which I saved till the end because it required more CPU horsepower (more on that later).  Many of these parallel data flows rarely updated tables because the dimensions they loaded were relatively static, so by running them in parallel I was making good use of my hardware and not holding up other data flow tasks that were actually doing real work.

      Here's a picture of the control flow in the package, you can see the parallel dimension loading routines clustered in the middle:

      OlympicsSSIS
    5. Don't extract more data than you need.

      This one probably goes without saying, but I'm including it for newbies.   The only data I could guarantee was new during a load was the fact data (log rows).  I could not guarantee that all of the dimension data represented in those facts was new.  Some of these dimensions could get quite large, so my package needed to be very selective about pulling incremental dimension data from the source system.  Here's how I implemented it for my video URL dimension:

      dimload 

      This data flow task first uses a SELECT DISTINCT to get a unique set of video URL keys from the destination scratch table containing my new fact data.  Next I do a lookup against my destination dimension table.  A match indicates that the dimension row already exists and no action is required.  Note that all the processing up to this point happens at my destination, imposing no load whatsoever on my source.  A non-match indicates that the dimension row doesn't exist, so I do a second lookup against the source system to retrieve the new dimension row.  Note that I'm only doing this lookup for rows I absolutely know that I do not have.  Since the video URL key already exists, I know it has to exist on my source, so the operation can be done with a dirty read which I implemented by setting the IsolationLevel on the data flow task to ReadUncommitted.  The combination of being very selective on which dimension rows I need to retrieve, combined with the less restrictive isolation level ensures that I am not consuming unnecessary resources at the source.

      This is of course a very simplistic dimension.  If it were a slowly-changing dimension I would have used a more sophisticated approach such as the new SQL Server 2008 MERGE command.
    6. Consider using a script transformation instead of an Integration Services expression for complex calculations.

      SSIS includes a built-in C++ like expression syntax for calculating data values.  The syntax is fast and easy to use for simple calculations, but in my case I had a relatively complex calculation that would have been very painful to implement using an Integration Services expression, not to mention hard to maintain.  After a failed attempt to write the expression, I turned to a custom script transformation to solve the problem. 

      As rows flow through a custom script transformation, it has the power to modify them in memory and pass them along to the next element in the data flow.  My transform needed to calculate an IP number from an IP address in order to geo-tag a log event.  I was able to locate a C# routine on Justin Cook's blog post here that performs the IP Number calculation (thx Jason!).  Using this code inside my data flow proved surprisingly easy.  I used a derived column transform to add a blank column to the result set of the right size and type, then the pasted Jason's code into my custom script transform.  The resulting script looks like this:
    7. public override void MyAddressInput_ProcessInputRow(MyAddressInputBuffer Row)
      {
          Row.IpNumber = IPAddressToNumber(Row.ClientAddress);
      }

      public static Int64 IPAddressToNumber(String IPaddress)
      {
          Int64 i;
          string[] arrDec;
          Int64 num = 0;
          arrDec = IPaddress.Split('.');
          for (i = arrDec.Length - 1; i >= 0; i--)
          {
              num += ((Int64.Parse(arrDec[i]) % 256) * (Int64)Math.Pow(256, (3 - i)));
          }
          return num;
      }

      The first method runs once per-row, and calls Jason's routine to calculate the IP Number and store it in my blank derived column.  Note that the script gets compiled at run-time, so this is not executed as an interpreted operation as was the case in the DTS days.  I didn't have the opportunity to compare the performance of the script transformation approach vs. the Integration Services expression approach as I gave up trying to write the expression, it was way too complex and would likely have been very fragile from a maintenance perspective.  I found this to be a much better approach and the performance was great.

    8. Consider using a conditional split transform to apply more CPU power to an expensive lookup operation.

      I mentioned earlier that one of my transforms required more CPU power, specifically the transform that geo-tags client IP addresses.  We licensed a data set of all registered IP address ranges and their associated geographies from Ip2Location.  Associating an IP address to its correct range involved converting the IP address to an IP number.  I found a fast way of doing this described in section #5 above. Once I had the IP number, I had to write a query in Transact-SQL to associate it with the correct IP address range in the Ip2Location data set. 

      This turned out not to be a trivial task.  The Ip2Location data set had over 6 million rows in it, and the search arguments were not a direct match, but needed to use aggregation and range searches to determine the correct range that an IP number belonged to.  A range was defined as a starting IP Number (FROM address) and an ending IP Number (TO address), and I used an IDENTITY to uniquely identify each range when I initially loaded the data set.  I could have created a unique index on the combination of the FROM and TO address, but I thought I would be able to make the query perform better if there was a single column primary key.  This also simplified my dimensional data model by allowing foreign key references to be a single column, which had major performance advantages later when I built my cubes.

      First I needed to get the IP range lookup query to work, then I had to figure out the right indexing strategy on the data set to make it perform well.  I decided to implement the query as a scalar User Defined Function (UDF) so I could re-use it in a variety of operations:
    9. CREATE FUNCTION [VideoMonDW].[GetIpId]
      (
          @IpNumber bigint
      )
      RETURNS bigint
      WITH SCHEMABINDING
      AS
      BEGIN
          -- Declare the return variable here
          DECLARE @IpId bigint

          -- Add the T-SQL statements to compute the return value here
          SELECT
              @IpId = IP_ID
          FROM
              VideoMonDW.Ip2Location IP
          WHERE
              (IP.IP_FROM = (SELECT MAX(IP_FROM) FROM VideoMonDW.Ip2Location IP2 WHERE IP2.IP_FROM <= @IpNumber)) AND
              (IP.IP_TO = (SELECT MIN(IP_TO) FROM VideoMonDW.Ip2Location IP3 WHERE IP3.IP_TO >= @IpNumber));

          -- Return the result of the function
          RETURN @IpId;
      END

      The "IP ID" is the unique identifier I generated using an IDENTITY that uniquely identifies an IP address range.  Note the search arguments in red.  These are very expensive aggregation and range search operations.  Running this query with no indexes took over 30 seconds per lookup on my dev box.  That obviously was not going to scale.  Next I created indexes on every combination of the primary key (IP ID), FROM and TO columns I could think of, including compound indexes.  Finally, I re-ran the query using SHOWPLAN and figured out which indexes were actually getting used.  The results showed me that the following indexes were getting used:

      IP_ID (Unique clustered index)
      FROM, TO (Non-clustered index)
      TO (Non-clustered index)

      These indexes got the query execution time down to 4 seconds on my dev box.   That was still too expensive when you consider the volumes I was dealing with.  I had tuned the lookup query as much as I could at this point, I needed to find another way to make them scale.  What I decided to do was split my data set into two streams using the Conditional Split transform like this:

      condsplit 

      The data set comes into the conditional split which has a default output for odd-numbered rows, and a second output for even-numbered rows using a simple condition:

      ClientAddressId % 2 == 0

      SSIS will automatically spin up multiple threads to process each side of the stream, giving me double the CPU horsepower than I would have had if I had stuck with a single stream.  Note that my seleciton of two streams was completely arbitrary, it could have been four, or six.  I just found that a simple even/odd calculation was easy to understand and most likely easier to maintain.

      Note also that I used an OLE DB Command as the destination for each stream instead of a standard OLE DB destination adaptor.  OLE DB Command transformations execute a SQL command once for each row in the data set.  I used this approach for two reasons.  First, since I had implemented my IP address range lookup as a TSQL function (see GetIpId function above), I wanted to call that function from the values clause of an insert statement rather than writing another lookup transformation in order to save time and reduce complexity.  Secondly, OLE DB Destination adapters default to an AccessMode of "OpenRowset Using FastLoad".  This is the fastest way to bulk-insert rows into a table, but after some testing I found my two threads were blocking each other with table locks. 

      Here's what the syntax of the insert statement looks like in my OLE DB command objects, note use of the GetIpId function in the values clause:

      INSERT [VideoMonDW].[ClientAddresses
      (
        [ClientAddressId],
        [ClientAddress],
        [Created],
        [IpId]
      )
      VALUES
      (
        ?,
        ?,
        ?,
        VideoMonDW.GetIpId(?)
      );

      OLE DB Command Objects do not use bulk-loading API's, so I no longer had a problem with my two insert threads blocking each other.  This is a great example of how you can use simple design approaches in SSIS to help make your package scale better. 

  • Visual Studio 2008 for DB Pros "GDR" Release Announced

    At last, Visual Studio 2008 for DB Pros is fully in sync with the SQL Server 2008 release.  The new "GDR" release sports a lot of new interesting capabilities as well.  Full details can be found on Gert Drapers blog post here.

  • Real World SQL Server 2008: Inside the Olympics Video Monitoring Project

    This summer I had the pleasure of teaming up with Eric Schmidt, Director of DPE's Media and Advertising Initiatives team, to provide a set of monitoring services around the Silverlight 2.0 online broadcast event of the 2008 Summer Olympics at www.nbcolympics.com.  It was the coolest thing I've worked on in quite some time, and gave me the opportunity to deeply exploit lots of new features in SQL Server 2008 to build a massively scalable web based monitoring system that helped the NBC operations team at 30 Rock detect, diagnose and resolve issues affecting user experience.

    The project was a big success from a lot of different perspectives, and we learned an awful lot doing it.  If you are interested in learning more, check out the following PDC 2008 session at www.microsoftpdc.com:

    BB07 SQL Server 2008: Developing Large Scale Web Applications and Services

    I co-presented this session with my colleagues Jose Blakeley, Partner Architect with the SQL Server Engine team, and Hala Al-Adwan, VP of Data at MySpace.   Jose was the principle presenter and talked about how with some careful engineering, SQL Server 2008 can be used to build and deploy massively scalable web applications.  My presentation focused on the Olympics project as a case study for messaging design patterns to achieve massive scale, and Hala's presentation focused on data partitioning to achieve massive scale. My demo focuses on the SQL Server 2008 backend architecture for the monitoring system.  You can download a recording of the entire session here.

    To learn about the client piece of this project, including Silverlight 2.0 instrumentation, check out the following PDC 2008 presentation at www.microsoftpdc.com:

    PC39 Inside the Olympics: An Architecture and Development Review

    In this talk, my Platform Evangelism colleagues Eric Schmidt and Jason Suess will discuss the project as a whole.

    I plan to do some blogging on some key discoveries I made as I worked on this project so stay tuned.

  • SQL Server 2008 Top 10 List for Developers

    Well now that SQL Server 2008 has shipped, Zach and I are starting to put the finishing touches on all of our SQL Server 2008 early adopter initiatives and move on to planning for future releases.  Before we do that I have some unfinished business to take care of.

    My boss, Mauricio, has been on me for months asking me to list the top 10 features of SQL Server 2008.  I have been stalling.  I think every single feature in the product is compelling, and inevitably I am going to leave something out and will hear about it later from the team that built it.  But with PDC '08 just around the corner, it's now or never.

    So let me qualify my post by saying that this list represents the 10 things I talk to developers about when I discuss SQL Server 2008.  Often I don't have the luxury of spending all day (which I could) having this discussion, so I have to pick a few things that I think will stick.    

    1. Location Awareness via Spatial Technologies: The introduction of the .NET CLR types for Geography and Geometry enable a whole new class of location-aware database applications.  These types are first-class citizens in SQL2K8, and offer direct manipulation in Transact-SQL, programmatic manipulation in .NET, as well as a host of flexible indexing options.  See my post here. Note that this post is a bit dated, points are now expressed in lat/long order in the final release.  While this is a version 1.0 implementation, it feels much more mature, and we already have several customers and partners implementing production solutions.
    2. Improved BLOB Support via FILESTREAM: Finally, a high performance streaming option for BLOBs (binary large objects) in SQL Server. This is exposed as an attribute on varbinary(max), and uses NTFS to persist the BLOBs in chunks rather than cluttering up your precious data and index pages, SQL Server takes care of the garbage collection.  It's fully integrated from a transactional and management perspective and supports backup/restore with your database data.  Getting at the data programmatically requires some rethinking.  If you are familiar with Win32 file streaming API's (which are the highest performance way to access BLOB data in Windows) you will have a head start.  Native code developers will find new FILESTREAM API's in SNAC (SQL Native Access Client), and managed code developers will find a new SqlFileStream managed class in Visual Studio 2008 SP1. See my post here. Note that this post is a bit dated as well, the managed code api's were not yet available. 
    3. Improved Date and Time Types: Quentin Clark told me this was the single largest engineering effort for SQL Server 2008.  I started thinking about it and realized how pervasive a change to a data type is, from the storage engine, to the query processor, to data programmability API's, to backward compatibility.  I have to say I'm glad the SQL team decided to take the hit, these new types have already come to the rescue in a couple of projects I've worked on.  DateTime2 is the "new and improved" DateTime, with up to seven digits of accuracy (100 nanoseconds) and improved range with no goofy start date in the 1700's.  Alas the last of the Sybase holdovers appears to be slipping away!!  DateTimeOffset is the new "timezone aware" version of DateTime2, allowing you to save a timezone offset value along with your DateTime2 value.  Now datetime calculations can include values from different timezones.  Tired or storing that extra baggage if you just want to store a date or a time?  Enter the Date and Time types.  There are definitely some nuances to using these new types and converting systems from the old types to the new types.  You should manipulate them programmatically with care, remember that these new types have increased precision and you don't want to lose data during a conversation operation.  Also, from a .NET data programmability perspective, you'll want to use the standard System.DateTime and System.DateTimeOffset (introduced in the .NET Framework 3.0)  types to host these values, as the old .NET System.Data.SqlDateTime type only handles old DateTime values.  If you need to support null values, you'll need to implement them as .NET Nullable types.
    4. CLR Everywhere: While not exactly new for 2008, support for managed code in all of the core services has improved significantly.  So much so that the SQL team itself is starting to build core capabilities using managed code, Spatial Types are just one example of this.  On a recent project, I was able to take some C# functions I wrote and use them in the database engine, in Reporting Services and in Integration Services.  The ability to break out of the core server programmability paradigms such as Transact-SQL and SSIS expressions and leverage the power of the .NET Framework is an absolute life saver.
    5. Transact-SQL Improvements:  There are many, including Table-Value Parameters (TVPs), row constructors, rich assignment operators, and the ability to do variable assignment right in the DECLARE statement.  There is an important new DML command in MERGE, a.k.a. "UPSERT" that allows you to either insert a new row or update an existing row in the same statement.  And for those relational die hards who refuse to be drug into the world of Analysis Services, a new aggregation capability called GROUPING SETS will make it easier to roll up those aggregate functions.
    6. Change Data Capture: CDC is the closest thing to a log reader that you will ever see from Microsoft.  We open up some virtual tables against the transaction log and surface any changes to affected tables, which you can then grab in your application and do interesting things with.  CDC is a low-level thing that you would use if you want to build your own sync capabilities.  SQL Server Replication is what you would use if you don't want to have to build all that plumbing. 
    7. Change Tracking:  Applications that support occasionally connected sync capabilities have to track a lot of metadata to make sure that offline changes get applied correctly and to handle conflicts.  In SQL2K8 you can enable change tracking at the table level, and SQL Server will manage all that for you.  When used in conjunction with the Microsoft Sync Framework, developers now have fine-grained sync support all the way from the database engine out to the API level to build rich occasionally connected solutions.  For SQL Server-only installations that don't need a lot of fine-grained control, SQL Server Merge Replication may still be the solution of choice. 
    8. LINQ for SQL:  Application developers who are not database developers have long wished for a more tightly bound way to express SQL in their source code.  Dynamic SQL can't be checked at compile time, and stored procedures aren't visible in application source code.  Enter LINQ for SQL, or "Language Integrated Query".  LINQ for SQL allows developers to code their queries against local pre-generated classes, not remote database objects.  The classes isolate the developer from the specifics of where the data is stored and in what format, and offer a higher degree of type fidelity and developer productivity.  With that isolation comes certain tradeoffs, such as fine grained control over the SQL that gets generated, update semantics etc. Most solutions will be a blend of LINQ and dynamic SQL, but one thing is for sure, LINQ is here to stay.
    9. Entity Framework: The Entity Framework offers an even higher level of abstraction to collections of "entities".  These entities are modeled prior to developing against them, and highly productive programming paradigms (such as LINQ for Entities) are enabled on top of them which further isolate the developer from ugly data access plumbing and physical database goo.  While this is definitely version 1.0 technology, it is clear that the data programmability team is very serious about moving this technology into the future, so give it a good look now so you can understand how it works.  The same tradeoffs apply here as with LINQ to SQL, but at a larger degree of abstraction.  You can see various incarnations of the Entity Framework showing up in other projects, such as ADO.NET Data Services.
    10. BI Programmability Enhancements: This category needs another top 10 list, but in my world every developer should know how to build integrated reporting and analytics solutions, and therefore should know about what's new for developers in SQL Server 2008 Integration Services, Reporting Services and Analysis Services.  Here's my top picks for each of these three important components
      1. Integration Services: Support for C# scripts, especially in the Transformation Script Components, is an absolute life saver. Lookup Transformations have been completely re-worked for the better, with much improved caching options and the ability to share cache's across data flows and packages.  I also really like the new Data Profiling Task which can help you get a handle on a new data source very quickly.
      2. Reporting Services: Big investments have been made in providing a more productive report design experience for both the Visual Studio based "BI Development Studio" and the new stand-alone report designer known as "Report Builder 2.0".  The new RDL Tablix element make's it easier to build complex reports that may have previously required multiple data regions and or reports.
      3. Analysis Services: Our flagship OLAP server, and its integrated Data Mining functionality have seen a lot of improvements.  I particularly like the design-time guidance when building cubes, reminding you on the right way to build attribute hierarchies and the like before you shoot yourself in the foot.  From a data mining perspective, the time series algorithm has been updated and should see a lot of use, and the data mining add-ins for Excel 2007 have also been improved.

    Whew, I'm glad I have that done!  Best of luck to all you eager database developers out there, I hope your experience with SQL Server 2008 will be as exciting and interesting as mine has.

  • Tessellation?

    Okay, it's time to come clean.  As a technical evangelist, particularly one that focuses on driving platform adoption, I have to be familiar with an ever-increasing array of technologies.  Compound that with the aging process, vast quantities of fried brain cells from my misspent youth and various and sundry character flaws (such as the over use of superlatives), and the result is that I don't always have a complete understanding of everything of which I speak.  Shocking, I know...

    Usually I do a pretty good job of motoring past these soft spots and steering the conversation into more familiar territory before whomever I'm talking with flips my bozo bit.  But sometimes there's just no way out and I get caught.  I had just such an experience last week, and in the interest of full disclosure (and laughs) I thought I'd come clean.

    So I'm doing this podcast for Greg Low of Solid Quality Mentors in Australia on SQL Server 2008 developer enhancements.  Now Greg's a pretty sharp guy, and I know going in that he's not going to learn much from me, but what the heck its a great chance to reach out to the developer community and get people excited about SQL Server 2008.  The interview is going pretty well, and Greg's doing a great job of filling in the weak spots, until we reach the end and start talking about spatial support in SQL Server 2008

    Then I did it.  I used the word "tessellation".  As soon as the word came out of my mouth I regretted it.  Greg perked up and asked me to define it, at which point I fell on my sword and said I had absolutely no clue. 

    DECLARE @rogerBozoBit [bit] = 1;

    We both had a chuckle, and he offered to remove it from the podcast.  I told him I thought he should leave it in for laughs, but he removed it after all.  Thanks Greg for protecting my dignity, but I have none.  Besides, I didn't want to loose the opportunity to actually define the word so that other fast talking evangelists won't fall into the same trap.  There's an excellent discussion of the topic in the SQL Server 2008 books online that I will try to summarize here.

    Generally speaking, tessellation is a technique used to simplify the manipulation of complex structures by forming the structure out of a bunch of polygons.   In the context of SQL Server 2008 spatial support, tessellation comes into play when you are talking about spatial indexes.  By all accounts we've provided a pretty powerful and flexible indexing capability for spatial data that decomposes an indexed space into a grid hierarchy.  When searching for a match, SQL Server uses tessellation to fit an instance of spatial data into this grid hierarchy.  In this case, where a shape is being compared to a grid hierarchy, SQL Server uses various tessellation rules to see how many cells in the the grid hierarchy are touched by the current shape being compared to the index, and how deep down the grid hierarchy the search needs to go.  SQL Server provides a couple of different tessellation options when creating a spatial index which map directly to usage of the geometry vs. the geography type.

    So there, I've grossly oversimplified tessellation so the term can be butchered by evangelists worldwide.  

  • Demystifying Spatial Support in SQL Server 2008

    In our last series of posts Zach and I talked about using SQL Server's new FILESTREAM support to store, manage and retrieve media in SQL Server to enable rich user experiences.  We've been at it again this week trying to get some traction with the new spatial data types and query semantics in SQL Server 2008.  This new functionality makes it possible to integrate location awareness into just about any type of application.  What could be more compelling than allowing a user to interact with information that is relevant to locations that they care about, like home, work, school or vacation destinations?  This technology has long been the domain of sophisticated GIS applications, but by integrating spatial capabilities into SQL Server 2008, Microsoft is making location awareness available to the average database developer.

    Zach has some experience with GIS applications, but I have none at all.  So the challenge was whether a couple of database geeks could figure out how to leverage this spatial stuff in an application in a couple of days.  I'm happy to say that we made great progress and we'd like to share some of our experiences in hopes that it will demystify the spatial functionality in SQL Server 2008.  Rather than regurgitate the same stuff you can read in books online when CTP5 comes out, I'm going to try to focus on some of the key discoveries we made in hopes that it might save you some time.

    The scenario we focused on was "geo-locating" the media we stored in SQL Server 2008 in our previous FILESTREAM sample.  By tagging these images and videos with some location information, it's possible to do spatial queries that retrieve all the pictures that were taken in a particular geographic area.  There are a ton of cool applications for this kind of functionality, I'll leave it to your imagination to think of some. 

    I figured a logical place to start was by simply adding a geography column to our eventMedia table like so:

    CREATE TABLE [dbo].[eventMedia] (
        [mediaId] [bigint] NOT NULL IDENTITY PRIMARY KEY,
        [mediaKey] [uniqueidentifier] NOT NULL ROWGUIDCOL UNIQUE,
        [title] [nvarchar](256) NOT NULL,
        [dateCreated] [datetime] NOT NULL DEFAULT(GETDATE()),
        [createdBy] [nvarchar](256) NOT NULL,
        [fileName] [nvarchar](256) NOT NULL,
        [contentType] [nvarchar](256) NOT NULL,
        [location] [geography] NOT NULL,
        [binaryFile] [varbinary](max) FILESTREAM DEFAULT(0x));
    GO

    Next I started thinking about how to get shapes into our new geography column.  Turns out it was a whole lot easier than I expected.  Under the covers, the geography type is implemented as a UDT.  That means geography is not a scalar type like an integer or date, rather its an object with methods and properties.  This is a great example of how the .NET Framework Common Language Runtime integration introduced in SQL Server 2005 is now enabling a whole new generation of features and capabilities in SQL Server 2008. 

    Everyone knows how to assign values to scalar types, but how do you create an instance of an object in Transact-SQL?  In the case of the geography type there are a variety of ways, but the simplest that I found is the Parse() method.  This takes a string representation of a shape as an argument.  The format of that string is defined by the Well Known Text (WKT) specification published by the OGC.  For example, here's some Transact-SQL code that creates an instance of a geometry object and initializes its value to a POINT shape:

    DECLARE @g [geography];
    SET @g = geography::Parse('POINT (39.96296 -75.16834)');
    PRINT @g.ToString();

    In this example the point shape's coordinates are the latitude and longitude of my old apartment building in Philadelphia.  It's worth mentioning that the geography type utilizes a default coordinate system known as WGS 84 which is used by most GPS systems.  Once you've initialized a geography instance, there's a ton of methods and properties you can use to interrogate them, compare them, and create new instances from existing ones.  Most of these capabilities are defined in the OGC's Simple Features for SQL Specification

    The Parse() method is great because it can deal with all of the different types of shapes, allowing me to code a single INSERT statement that can load any kind of valid shape into the geography column.  Here's some C# code we used to insert geography data into our our eventMedia table:

    SqlCommand cmd = new SqlCommand("INSERT [dbo].[eventMedia]([mediaKey], [title], [createdBy], [fileName], [contentType], [location]) VALUES( @mediaKey, @title, @createdBy, @fileName, @contentType, geography::Parse(@location));", cxn, txn);
    cmd.Parameters.Add("@mediaKey", SqlDbType.UniqueIdentifier).Value = mediaKey;
    cmd.Parameters.Add("@title", SqlDbType.NVarChar, 256).Value = title;
    cmd.Parameters.Add("@createdBy", SqlDbType.NVarChar, 256).Value = createdBy;
    cmd.Parameters.Add("@fileName", SqlDbType.NVarChar, 256).Value = sourceFileName;
    cmd.Parameters.Add("@contentType", SqlDbType.NVarChar, 256).Value = contentType;
    cmd.Parameters.Add("@location", SqlDbType.NVarChar).Value = location;

    Next we started thinking about what kind of shapes to load into our new geography column.  My first inclination was to start loading a bunch of map data.  Zach and I really didn't want to write a bunch of graphics code to draw maps, there are great services out there like Virtual Earth that draw maps way better than we could.  We quickly realized that there really wasn't any need to load a whole map, all we needed to demonstrate "geo-locating" our media files was to create some shapes that represent the map and regions that we are interested in, then tag each of our media files with a specific location within that map.  To keep things simple, we decided to store those shapes in the same table as our media files so we could query them all in the same place. 

    Our demo design uses the following "types" of rows in the eventMedia table:

    • Map row: There is only one of these.  It's a rectangle represented as a geography POLYGON instance whose four points are the lat/long coordinates of our total map area.  For good measure we store an image of the map region from a map drawn in Microsoft Streets & Trips.  I simply drew a rectangle on top of an existing map in Streets and Trips, then used the location sensor tool in Streets & Trips to get the lat/long coordinates of the four corners.
    • Region rows: There are three of these.  They are irregular closed polygons that define regions of interest contained within the overall map.  Like the map, we store an image of the region I created in Streets & Trips along with the lat/long coordinates for each of the points in the polygons.
    • Media rows:  There are several of these, which represent pictures stored as varbinary(max) FILESTREAM instances.  We tag them with location information using a geography POINT shape whose lat/long coordinates correspond to the location where the picture was taken.  By comparing these points to our regions, the demo shows how you can retrieve media based upon its location within a map or region of interest.

    For the demo we decided to focus on the Pro Cycling Tour International Championship in Philadelphia, PA.  It's relatively self-contained and has some cool sections that correspond neatly to our concept of "regions of interest".  

    Map and Region Rows

    Region Map Race Area Wall Area Parkway Area
    Description This is the total map area we will deal with in this sample.  This helps constrain things to a specific geography in Philadelphia rather than the whole earth. The entire Pro Cycling Championship race route is contained inside this irregular polygon. One of the most interesting parts of the race route.  Contains the infamous 17% grade climb known as the "Manayunk Wall". The race starts and finishes here on the Benjamin Franklin Parkway.
    JPEG Image: map_area race_area wall_area parkway_area
    WKT Shape Def POLYGON((
    39.95372 -75.23177, 39.95372 -75.16456, 40.03618 -75.16456, 40.03618 -75.23177, 39.95372 -75.23177))
    POLYGON((
    39.95601 -75.17031, 39.95778 -75.16786, 39.97789 -75.18870, 39.99237 -75.18521, 40.00677 -75.18603, 40.01136 -75.19922, 40.03142 -75.21746, 40.02586 -75.22534, 40.01430 -75.21052, 40.00634 -75.19192, 39.99570 -75.19248, 39.98374 -75.20526, 39.97704 -75.19437, 39.96920 -75.19087, 39.95601 -75.17031))
    POLYGON((
    40.02387 -75.22280, 40.02810 -75.21442, 40.03142 -75.21746, 40.02586 -75.22534, 40.02387 -75.22280))
    POLYGON((
    39.95601 -75.17031, 39.95778 -75.16786, 39.96874 -75.17921, 39.96512 -75.18441, 39.95601 -75.17031))

    Media Rows

    Media Parkway Photo Wall Photo (Bottom) Wall Photo (Top) Logan Circle Photo
    Description Taken on the Ben Franklin Parkway near the finish line. This shot was taken from the bottom of the Manayunk Wall. This shot was taken at the top of the Manayunk Wall. This is another shot from the Benjamin Franklin Parkway.
    JPEG Image: bike9
    wall_race wall_race2 parkway_area2
    WKT Shape Def POINT (39.96045 -75.17396) POINT (40.02593 -75.22457) POINT (40.02920 -75.21986)
    POINT (39.95813 -75.17052)

    One interesting thing I ran into when creating the geography polygons was that you have to define the polygon's points in a counter-clockwise fashion to have the proper "ring orientation".  If you define geography points in a clockwise fashion you will get the following error:

    Msg 6522, Level 16, State 1, Line 2
    A .NET Framework error occurred during execution of user defined routine or aggregate 'geography':
    Microsoft.SqlServer.Types.GLArgumentException: 24205: The specified input does not represent a valid geography instance because it exceeds a single hemisphere. Each geography instance must fit inside a single hemisphere. A common reason for this error is that a polygon has the wrong ring orientation.
    Microsoft.SqlServer.Types.GLArgumentException:
       at Microsoft.SqlServer.Types.GLNativeMethods.ThrowExceptionForHr(GL_HResult errorCode)
       at Microsoft.SqlServer.Types.GLNativeMethods.GeodeticIsValid(GeometryData g)
       at Microsoft.SqlServer.Types.SqlGeography.IsValidExpensive()
       at Microsoft.SqlServer.Types.SqlGeography.ConstructGeographyFromUserInput(GeometryData g, Int32 srid)
       at Microsoft.SqlServer.Types.SqlGeography.STGeomFromText(SqlChars geometryTaggedText, Int32 srid)
       at Microsoft.SqlServer.Types.SqlGeography.Parse(SqlString s).

    I don't confess to understand this 100%, but Ed Katibah told me its kind of like turning the world inside-out. Ed also mentioned that sister data type geometry polygons don't have this limitation and can be loaded in either order (clockwise or counter-clockwise).  Without getting into too much detail here, that led to a conversation about what the right type was to use for our demo.  I had thought that geometry was limited to standard x and y coordinates and wouldn't handle lat/long coordinates, but was surprised to learn that it does.  The basic difference between the two is that geography types account for the curvature of the earth, while geometry types don't.  That means for relatively small surface areas they are roughly equivalent, but if you are dealing with larger surface areas you will definitely want to stick with the geography type.

    Another important thing to remember about polygons is that you need to "close" them with a final point that is the same as the first point in the polygon.  If you don't do this you get the following error:

    Msg 6522, Level 16, State 1, Line 2
    A .NET Framework error occurred during execution of user defined routine or aggregate 'geography':
    System.FormatException: 24119: The Polygon input is not valid because the start and end points of the exterior ring are not the same. Each ring of a polygon must have the same start and end points.
    System.FormatException:
       at Microsoft.SqlServer.Types.GeometryDataBuilder.EndFigure()
       at Microsoft.SqlServer.Types.OpenGisWktReader.ParseLineStringText(FigureAttributes attributes)
       at Microsoft.SqlServer.Types.OpenGisWktReader.ParsePolygonText()
       at Microsoft.SqlServer.Types.OpenGisWktReader.ParsePolygonTaggedText()
       at Microsoft.SqlServer.Types.OpenGisWktReader.ParseGeometryTaggedText()
       at Microsoft.SqlServer.Types.OpenGisWktReader.ReadGeometry()
       at Microsoft.SqlServer.Types.SqlGeography.STGeomFromText(SqlChars geometryTaggedText, Int32 srid)
       at Microsoft.SqlServer.Types.SqlGeography.Parse(SqlString s).

    This points out one of my few beefs about the CLR implementation in SQL Server, and that is that CLR exceptions always generate the same Transact-SQL error number (6522).  Somebody on the CLR team once explained to me why it has to be this way, I think it has to do with limitations of the existing Transact-SQL error handling framework and RAISERROR in comparison to the extremely rich exception handling capabilities in the .NET Framework.  I'm sure my buddy Bob Beauchemin would be able to explain this in gory detail.

    One last tip about polygons, and any other shape that uses lat/long coordinates.  I made several typos when initially recording my lat/long coordinates, resulting in less than satisfying results for the demo.  I called Ed to get some debugging recommendations.  Ed mentioned a great spatial partner named Safe Software who is working on a beta version of their spatial ETL tool named FME, which can be used to move spatial data in and out of SQL Server 2008 from other formats.  One of their components will actually allow you to visualize shapes stored in a geography column, making it a heck of a lot easier to determine if the polygon you thought you created is actually the polygon you created. 

    Anyway back to the matter at hand, we created a .NET Framework command-line utility named FileStreamLoader designed to populate our eventMedia table.  This was the easiest way for us to pump in all of our pictures and videos using FILESTREAM.  It's tough to do that kind of stuff in a Transact-SQL script, so database geeks please go learn .NET programming it will make your life a whole lot easier.  I'll post the code for FileStreamLoader up on CodePlex once CTP5 ships.

    So after loading up our map/region rows using FileStreamLoader I started running some queries to see how they worked.  Here's a good one that retrieves all of the polygons from eventMedia:

    SELECT
        N'Map/Region Row' As [Label],
        [mediaKey] As [Key],
        [title] As [Title],
        ROUND([location].STArea(),2) AS [Area],
        ROUND([location].STLength(),2) AS [Length],
        [location].ToString() AS [Location]
    FROM
        [dbo].[eventMedia]
    WHERE
        [location].STDimension() = 2;

    I used a few spatial methods to illustrate some of the rich functionality you get when using the geography type.  The STArea() values are the total area of the polygon in square meters.  The STLength() values are the total length around the polygon in meters.  The ToString() values are WKT string representation of the geography instance.  I used the STDimension() method to filter the result set down to just 2-dimensional objects.  The results look like this:

    Label Key Title Area Length Location
    Map/Region Row cc80cfee-bc14-435b-97ee-6dc1f2ca8f9c Map

    52552559.9

    29791.3 POLYGON ((39.95372 -75.23177, 39.95372 -75.16456, 40.03618 -75.16456, 40.03618 -75.23177, 39.95372 -75.23177))
    Map/Region Row 1254b31c-c37b-4d12-86da-9404fd9e42cd Race Area 6432902.35 22165.07 POLYGON ((39.95601 -75.17031, 39.95778 -75.16786, 39.97789 -75.1887, 39.99237 -75.18521, 40.00677 -75.18603, 40.01136 -75.19922, 40.03142 -75.21746, 40.02586 -75.22534, 40.0143 -75.21052, 40.00634 -75.19192, 39.9957 -75.19248, 39.98374 -75.20526, 39.97704 -75.19437, 39.9692 -75.19087, 39.95601 -75.17031))
    Map/Region Row ffd2bfed-195c-46da-8b6b-0a8728d8a43a Parkway Area 689476.79 4015.39 POLYGON ((39.95601 -75.17031, 39.95778 -75.16786, 39.96874 -75.17921, 39.96512 -75.18441, 39.95601 -75.17031))
    Map/Region Row dbd03037-7451-4451-8a26-b7b95fd2486f Wall Area 334024.82 2529.11 POLYGON ((40.02387 -75.2228, 40.0281 -75.21442, 40.03142 -75.21746, 40.02586 -75.22534, 40.02387 -75.2228))

    The demo application needs to determine what the overall map space is.  We can hard code this, but its better if the app can determine it dynamically using a spatial query.  This query gets the map region from the table, which is the 2-dimensional geography object with the biggest area:

    DECLARE @mapKey [uniqueidentifier];
    DECLARE @map [geography];

    SELECT TOP 1
        @mapKey = [mediaKey],
        @map = [location]
    FROM
        [dbo].[eventMedia]
    WHERE
        [location].STDimension() = 2
    ORDER BY
        [location].STArea() DESC;

    Next, the demo application needs a list of sub-regions within the overall map space.  Again, its best if the app can get this list using a spatial query:

    SELECT
        N'Region Row' AS [Label],
        [mediaKey] AS [Key],
        [title] AS [Title]
    FROM
        [dbo].[eventMedia]
    WHERE
        ([mediaKey] <> @mapKey) AND
        ([location].STDimension() = 2) AND
        ([location].STIntersects(@map) = 1);

    The STIntersects() function is one of the most important functions for enabling spatial queries.  In this case, each 2-dimensional geography instance in the location column is evaluated against the map instance to see if they intersect. The results represent the set of "sub-regions" in our map:

    Label Key Title
    Region Row ffd2bfed-195c-46da-8b6b-0a8728d8a43a Parkway Area
    Region Row 1254b31c-c37b-4d12-86da-9404fd9e42cd Race Area
    Region Row dbd03037-7451-4451-8a26-b7b95fd2486f Wall Area

    While I don't discuss it in this blog post, its important to note that the STIntersects() method supports usage of the new spatial index type in SQL Server 2008.  We didn't really need one for such a small number of rows, but when dealing with larger scale GIS systems indexing can become critical. 

    Finally, on to the basic spatial query the demo application needs to support.  We need to retrieve a list of pictures that were taken within the boundaries of our map.  This query gets all of the pictures that have been geo-located within our map:

    SELECT
        N'Map Picture Row' AS [Label],
        [mediaKey] AS [Key],
        [title] AS [Title],
        [location].ToString() AS [Location],
        LEN([binaryFile]) AS [File Size]
    FROM
        [dbo].[eventMedia]
    WHERE
        ([contentType] = N'image/jpeg') AND
        ([location].STDimension() = 0) AND
        ([location].STIntersects(@map) = 1);

    The results look like this:

    Label Key Title Location File Size
    Map Picture Row 1135AEF5-E36A-40F5-AA31-9AFD8808C7F9 Finish Line POINT (39.96045 -75.17396) 44826
    Map Picture Row 181C2858-D0A3-44FF-BF6B-5BE571306EDA The Manayunk Wall POINT (40.02593 -75.22457) 49931
    Map Picture Row 62DF5D51-60B1-45D0-93B8-76A1D2D76CCB Top of the Manayunk Wall POINT (40.0292 -75.21986) 510277
    Map Picture Row CD0D0373-B706-48C4-9C99-BD22470A4F44 Logan Circle POINT (39.95813 -75.17052) 46612

    Since the pictures are stored as a FILESTREAM, the demo application can go out and retrieve them using new Win32 file streaming API support in SQL Server 2008.  Now let's do a final spatial query that gets all of the pictures in one of our "areas of interest", namely the Manayunk Wall:

    DECLARE @wallArea [geography];

    SELECT
        @wallArea = [location]
    FROM
        [dbo].[eventMedia]
    WHERE
        [wallKey] = 'dbd03037-7451-4451-8a26-b7b95fd2486f';

    SELECT
        N'Wall Picture Row' AS [Label],
        [mediaKey] AS [Key],
        [title] AS [Title],
        [location].ToString() AS [Location],
        LEN([binaryFile]) AS [File Size]
    FROM
        [dbo].[eventMedia]
    WHERE
        ([contentType] = N'image/jpeg') AND
        ([location].STDimension() = 0) AND
        ([location].STIntersects(@wallArea) = 1);

    Here are the results:

    Label Key Title Location File Size
    Wall Picture Row 181C2858-D0A3-44FF-BF6B-5BE571306EDA The Manayunk Wall POINT (40.02593 -75.22457) 49931
    Wall Picture Row 62DF5D51-60B1-45D0-93B8-76A1D2D76CCB Top of the Manayunk Wall POINT (40.0292 -75.21986) 510277

    And there we have it!  These are the two pictures that were taken in the Manayunk Wall area.  Obviously the demo gets more interesting with more photos and more regions.

    Now you know how to use the new spatial functionality in SQL Server 2008 to geo-locate media files in your application.  Hopefully all you database developers out there won't be afraid to dive in and start using this amazing new functionality.  Zach is going to do some additional posts in our spatial series which discuss how to leverage these new spatial features within a WPF application.

    - Roger

  • Handling NULL instances of varbinary(max) FILESTREAM columns

    The table structure I proposed in my last post used the following column definition for my FILESTREAM column:

    CREATE TABLE [dbo].[eventMedia] (
    ...
        [file] [varbinary](max) FILESTREAM);

    So by default I am allowing NULL's to be stored in the file column.

    Zach and I changed the design a bit to the following:

    CREATE TABLE [dbo].[eventMedia] (
    ...
        [file] [varbinary](max) FILESTREAM DEFAULT(0x));

    So by default we are initializing every file instance to an empty binary string. The net effect of this change is a zero-length file will get created in the file system when new rows get inserted into the eventMedia table.  So why did we do this?

    Without stealing Zach's thunder, you can't open up a handle to an instance if the column is NULL.  If you can't open up a handle to it, that means you can't write to it either, which will result in a dead-end.   So all you DBA's creating FILESTREAM columns take note!

    - Roger

  • Getting Traction with SQL Server 2008 Filestream

    Zach and I just spent a couple of days figuring out how to make use of the new FILESTREAM support in SQL Server 2008 and we thought we'd share a little bit about the experience in hopes it might save somebody some time.

    There's a ton of information out there regarding FILESTREAM, but in case you need more detail you can check out some of the SQL Server 2008 CTP content on Connect.  FILESTREAM support will be enabled in CTP5.  From the perspective of user experience, FILESTREAM is going to enable some interesting scenarios.

    SQL Server has always provided the capability to store binary data, and thus you could grab any type of file and stuff it into a SQL Server varbinary(max) column.  But blobs have different usage patterns than relational data, and SQL Server's storage engine is primarily concerned with doing I/O on relational data stored in pages and extents, not streaming blobs. 

    So the bottom line is that storing blob's in SQL Server have always had some limitations from a performance perspective.  Most developers resorted to storing files in the file system, then just storing a path to the file in the database.  Perfectly legitimate approach, and for some apps may still be the right way to go even with the advent of FILESTREAM.

    But when you do that you introduce a whole new set of challenges around manageability, backup and concurrency that should already be obvious to the reader.  Enter FILESTREAM.

    Now SQL Server 2008 can store blobs in its own private namespace on the local NTFS filesystem instead of in-line with relational data.  That's good cuz the NTFS file system was built to stream blobs.  NTFS is even more interesting form a database perspective because it's transactional and supports recovery.  So you can imagine the SQL Server Storage Engine and NTFS having a little mutual self-respect love fest.

    Again this has obvious advantages from a manageability, backup and concurrency perspective that I won't reiterate here.  But what I will do is talk about our first experience with FILESTREAM from a database perspective, and Zach's gonna talk about it from an application development perspective.

    Our first project was to wire up a rich WPF app written in Visual Studio 2008 with some video stored in FILESTREAM in a pre-release build of SQL Server 2008 CTP5.  We quickly found out that you can't connect to CTP5 from managed code written in Visual Studio 2008 Beta 2 due to a TDS compatibility issue.  Once we tracked down a release candidate of Visual Studio 2008 we were good to go.

    Next I went into SSMS and built out a database to store our video.  My first surprise was IntelliSense support in the Transact-SQL Editor!  Check it out:

    tsqlintellisense

    In order to use the FILESTREAM attribute on a varbinary(max) column you have to enable FILESTREAM support for the SQL Server instance using:

    EXEC [sp_filestream_configure] @enable_level = 3;

    At this point I got a message that I needed to reboot the server in order for the setting to take effect.  The dev team is working on removing this requirement for RTM.

    Next I created the database.  In order to use FILESTREAM you have to have a special filegroup for storing the FILSTREAM data which I called FileStreamGroup1:

    CREATE DATABASE AdventureWorksRacing ON PRIMARY
      ( NAME = AdventureWorksRacing_data,
        FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Data\AdventureWorksRacing_data.mdf',
        SIZE = 2MB,
        MAXSIZE = 50MB,
        FILEGROWTH = 15%),
    FILEGROUP FileStreamGroup1 CONTAINS FILESTREAM
      ( NAME = AdventureWorksRacing_media,
        FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Data\AdventureWorksRacing_media')
    LOG ON
      ( NAME = AdventureWorksRacing_log,
        FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Data\AdventureWorksRacing_log.mdf',
        SIZE = 5MB,
        MAXSIZE = 25MB,
        FILEGROWTH = 5MB);
    GO

    When I went out to the AdventureWorksRacing_media folder there was some initial NTFS folders with GUID's for names, and some header files and log folders.  This is all storage engine gobbledy-goop for creating the FILESTREAM namespace. 

    Next I created a table to store our video with a FILESTREAM column:

    CREATE TABLE [dbo].[eventMedia] (
        [mediaId] [uniqueidentifier] NOT NULL ROWGUIDCOL PRIMARY KEY,
        [dateCreated] [datetime] NOT NULL DEFAULT(GETDATE()),
        [createdBy] [nvarchar](256) NOT NULL,
        [fileName] [nvarchar](256) NOT NULL,
        [mediaType] [nvarchar](256) NOT NULL,
        [location] [geometry] NULL,
        [file] [varbinary](max) FILESTREAM);
    GO;

    Tables with FILESTREAM columns required a ROWGUIDCOL column.  This is used by the storage engine to keep track of instances in the filesystem.  Next I wrote some TSQL to insert some garbage to see the affect in the AdventureWorksRacing_media folder.  I'm not showing that code here because every lame FILESTREAM sample shows it and it's worthless and makes me crazy.  The correct way to insert data is using a native or managed client that can actually put real binary data into the column as opposed to a hello world string cast as binary data.

    Anyway I digress...

    After inserting some garbage I saw some new folders and files:

    filestreamdirs 

    Next I tried to delete some rows and see the affect on the file system.  Nothing changed right off the bat, and when I talked to the devs about it they said that storage gets freed up on a filestream filegroup when a valid log truncation point occurs.  Otherwise you wouldn't get proper backup/restore behavior.  This made perfect sense to me!

    One quick word of warning, trying to open up a query results grid in SSMS on gobs of filestream data is not a good idea.  Take it from me.  As usual, our tools are powerful and flexible enough to enable a dummy like me to shoot myself in the foot.

    Anywho we were up and running with a database to store our videos.  Next up Zach's gonna talk about how we wired our WPF client up to it to play videos in this post.

  • SQL Server 2008 - The Engine that Powers Immersive User Experiences

    One of the things Zach and I are trying to accomplish is to get developers to understand that SQL Server 2008 isn't just some bland bit bucket for storing and retrieving data.  It's THE engine that powers immersive user experiences!  Catchy huh?  I'll give Zach the credit for coming up with that one.

    Whether you are talking about applications that serve up documents, rich media or maps, SQL Server 2008 has rich data types, powerful search, and flexible data management capabilities that make it the ideal platform for building applications for Windows and the next web now.

    Zach and I got our greedy little hands on an early build of CTP5 and are busy building some applications that exploit these new features and capabilities.  We'll be blogging about that and other stuff over the next few weeks.

    - Roger

More Posts Next page »

© 2009 Microsoft Corporation. All rights reserved. Terms of Use  |  Trademarks  |  Privacy Statement
Microsoft
Page view tracker