SQL As a Federation Database

One of the nice things you can do with SQL Server 2005 is utilize it as a central federating data server. In this manner of usage, SQL Server acts as the gateway to connect to data that is distributed in different sources across the enterprise.

There are a couple options for how you approach it. One option is to bulk load data an external source into SQL Server, producing a copy of that data. SQL Server Integration Services (SSIS) is a nice set of capabilities to help do this. Applications can then query that data, enabling business intelligence applications and business analysis efforts, with high performance because all the data is located on the SQL Server itself. The downside: as soon as you make a copy fo the data it begins to go stale; you may need to update the copy periodically to counter this. (SSIS let's you schedule a data copy task if you like, to address this). Another option is to set up a reference from SQL Server to the external data source. In SQL Server parlance, this is called setting up a linked server. In this option the data is not copied to SQL Server, but it is accessible from queries made on the SQL Server. The downside here? When performing queries on the remote data, you'll likely get poorer performance, because there's an extra hop in the chain. On the other hand with this latter approach, there is no duplication of data, and you won't have the staleness problem. I've described how to do this previously.

Why would you want to do this? Assuming you have data in, let's say, a remote Oracle, or a remote DB2, why would you want to access it via SQL Server, either by copying it or by referencing it as remote data? This really comes down to applications. If your application - let's say it is an Excel spreadsheet that uses a database query to grab data from a remote database, or maybe it is a databound web part in a Sharepoint page - if the application design assumes a single database as a data source, then it makes sense to collect all of the data into a single central point.

Is this Truly Interop?

This blog is ostensibly "all about .NET Interop", but this particular post is msotly about using SQL Server to federate disparate data sources. It's a bit of a stretch to cast data federation as Interop, but not too much of a stretch. Look, most .NET applications - whether a Windows Forms app, a WCF Service, a Sharepoint web part, or something else - most of these applications connect to data, and most data in the world is not in SQL Server. One option is to connect that .NET app to the non-Microsoft datasource. Another option is to use SQL Server to federate. There are other options, too. But in any case, federation plays a role in .NET interop.

How does it work?

I want to talk now about the linked server option. To set up a linked server, you can use the stored procedure sp_addlinkedsevrer, or you can define it using SQL Management Studio. Once you have the linked server defined, the query syntax for a linked server looks like this:

  SELECT * FROM LinkedServer.Catalog.Schema.Table where column = 'xxxx'

Something Interesting

Here's something interesting though: the where clause for queries to linked servers is not always remoted. huh? No, really! Several factors may cause a WHERE clause to not be remoted. The distributed query processor (DQP) engine within SQL Server is ultimately responsible for deciding whether to send the WHERE clause to the remote database. not. Among the inputs the DQP uses to make the decision: the existence of Indexes in the remote tables; the collation of the remote table; and statistics of the remote database. Using this information, the DQP builds a model for the structure of the remote data, and then makes the decision. The existence of index, statistics and collation information will increase the chances that the WHERE clause will be sent to the remote database.

This obviously has implications for performance. If the table is very large and yet has no index, then a query on a linked server can mean that lots and lots of data will be transmitted across the network. This can kill your performance.  To avoid this, follow good database practices - create indices on the remote tables where it makes sense!

A Sample of Performance Data

We defined a DB2 (running on Windows Server, but that's not important) as a linked server, and then measured the performance of queries. We used a table containing over 20 million rows. For this database, the following query returns 116,834 rows :

SELECT GROUP_NBR FROM EMIS_CLAIM_MODEL WHERE CFO_CD = '000352'

This data is from queries executed in SQL Query Analyzer:

Query elapsed time (T1) elapsed time (T2)

EXEC('SELECT GROUP_NBR FROM EMIS_CLAIM_MODEL WHERE CFO_CD = ''000352''') AT DB2V82_ENGLISH_1252

2:19 2:09

SELECT GROUP_NBR FROM DB2V82_ENGLISH_1252.PSS.PSS.EMIS_CLAIM_MODEL WHERE CFO_CD = '000352'

11:46 11:26

SELECT * FROM OPENQUERY(DB2V82_ENGLISH_1252,'SELECT GROUP_NBR FROM EMIS_CLAIM_MODEL WHERE CFO_CD = ''000352''')

2:27 2:09

And, calling a stored proc (which returns a result set for a similar query <returns a different column only> + an out parameter set by an exit handler if the query fails):

declare @myinput char(6);

declare @myreturn integer;

set @myinput = '000352';

exec('call PSS.QUERYEMIS2(?,?)',@myinput, @myreturn out) at DB2V82_ENGLISH_1252

print @myreturn;

3:08 2:28

Notes on this test: The above tests were run from a box running SQL Server 2005 Enterprise, with the Microsoft DB2OLEDB provider + sp1 fix.  Each query ran them 3 times, and we got similar results each time except on the stored procedure call where 1 test took ~1 minute longer. At the same time, I was hitting this same table from another machine running a VB application issuing a query in a loop using our HIS 2004 provider, so times might be better under less "load", but probably not that much difference in this scenario.

The T1 column is with the Defer Prepare and Derive Parameters both set to false. The T2 column is with both these parameters set to true.

Conclusions

Use indexes with the remote linked servers. Check your performance using the various query syntax options; don't assume that one syntax delivers better results than another. In our case the query syntax using four-part names delivered significantly worse performance than the other options.  Also - Response time (as measured here) is not the sole criteria for any distributed app - you may want to consider performance under real-world load.