Connecting directly to a SQL Azure database from a reporting tool (like Microsoft Excel and PowerPivot, or SQL Server Reporting Services) from your desktop or local data center is possible using a data source that looks like:
This is useful if you have a single database on SQL Azure that you need to query. What if you have reached the 50GB limit that is currently the top size of a SQL Azure database? The solution is to split your database into many databases containing the same schema. See this paper on database sharding with SQL Azure.
Expecting the users to always define multiple data sources to connect to multiple databases is cumbersome and may not perform well. For example, a PowerPivot user would have to open a separate link to every SQL Azure database individually. This paper describes a solution using a local SQL Server that can ease the access to all these databases and will usually give better performance. The solution is called Partitioned Views and is implemented using Linked Servers. Since customers don’t usually like to expose their databases to the internet, we are proposing a separate server that contains no actual data and only contains the views necessary for applications to get an internet link to SQL Azure databases. This is not the only solution because you can implement stored procedures using OPENROWSET or OPENQUERY for solutions that need more logic.
A read-only solution is the easiest to implement. If modifications must be made to the SQL Azure tables via the Distributed Partitioned Views, you must read the restrictions in SQL Server Books Online. It also helps to read about how to implement a partitioned view. Using the check constraints on the tables as shown in the examples in this link is essential if you want the optimizer to only touch the right tables when you use the constrained columns in a WHERE clause in your queries.
This concept relies on the linked server feature of SQL Server. Since ADO.NET, ODBC, PHP and JDBC are the only providers currently supported by SQL Azure, ODBC data source names (DSN) are required.
Run odbcad32.exe to setup a system DSN using SQL Server Native Client. Or go into Control Panel\System and Security\Administrative Tools and click on Data Sources (ODBC). Each database you create on SQL Azure needs a separate DSN and a separate linked server definition. Use TCP, not named pipes because you will be communicating to the SQL Azure servers through the internet.
Scale the creation of ODBC connections to multiple databases by modifying & importing a .REG file into the registry (regedit.exe) patterned after this sample:
[HKEY_LOCAL_MACHINE\SOFTWARE\ODBC\ODBC.INI\ODBC Data Sources]
"Azure_ODBC1"="SQL Server Native Client 10.0"
An ODBC DSN and an associated Linked Server need to be created for each Database that will be queried. Here are two sample T-SQL commands that will create a linked server and associated login:
EXEC master.dbo.sp_addlinkedserver @server = N'Azure_ODBC1',@srvproduct=N'Any', @provider=N'MSDASQL', @datasrc=N'Azure_ODBC1'
/* For security reasons the linked server remote logins password is changed to ######## */
EXEC master.dbo.sp_addlinkedsrvlogin @rmtsrvname=N'Azure_ODBC1',
By default, SQL Azure allows 149 databases per server. To exceed this threshold, either 1) create databases in other servers, or 2) contact the Azure business desk and request that your limit be increased from the default threshold. Each database can be a maximum of 50GB, giving a default total 7.45TB of storage per Server. Remember to factor in index sizes when planning your total data needs. The 50GB limit has to cover both data and index space. Log space is considered separate from the 50GB limit.
In order to make all your SQL Azure databases appear as one data source, you need to create a distributed partitioned view (DPV) in your local, on-premise SQL Server database. This database must be at least Enterprise edition because DPVs are not supported in Standard Edition or less.
This example statement creates the view with two SQL Azure databases on a local SQL Server:
CREATE VIEW dbo.dpv_Test AS
SELECT * FROM Azure_ODBC0.test0.dbo.fact
SELECT * FROM Azure_ODBC1.test1.dbo.fact;
This is an example of a simple select using a four-part name. More complex statements are allowed, such as adding joins and where clauses. Joins should use four part names in all table references to keep the join work on the same machine. The worst performance occurs when cross server joins are implemented.
The simple example above will return all rows from these tables in all databases. This may be your intention if, for example, you want to bring all the data into PowerPivot for aggregations and pivoting. Be mindful of the cost for bandwidth of returning all the rows. To estimate cost, go to the main Windows Azure pricing site. The appendix provides a query that can be used at any time to get the current billing estimate. Be mindful that the billing values reflect current pricing which you are likely to be modified as Azure matures.
Notes: It is not the view that is partitioned in a distributed partitioned view, it is the data the view points to that is partitioned. The data does not have to be partitioned, this view can point to one database and can contain one select. The tables being referenced in the view can exist in both SQL Azure and local databases as long as the correct permissions are granted for the appropriate users.
There is a simple method for running a query against a single table directly on a SQL Azure server from your application (without using the DPV). As an example, if you open a query window on your local SQL Server Management Studio, while connected to a local instance, you can follow the sample:
SELECT * FROM [YourAzureServer@database.windows.net].[YourDatabase].[YourSchema].[YourTable]
To query all the tables on your Azure server from a local server, simply use the view created above:
SELECT * FROM dpv_Test
Write to a table through the view:
INSERT INTO dpv_Test VALUES (1, ‘Test’, 1)
The first time the view is invoked, either through a select or insert/update/delete command, it will cache the table constraints on the local server. This gives the view some information about where the insert should be directed and it will only touch that database. And if the constraints are used in the WHERE clause of a query, the optimizer will do partition elimination and only touch the databases it needs to touch.
Note: Beware that distributed partitioned views have several restrictions that are well documented in B Online. (This link is one of many to read, use it as a launch point for the other related pages.)
Linked Servers are only available in SQL Server Enterprise Edition or higher.
If your view spans multiple Azure data centers, you need to be aware of one more thing before you get started. On your local server you need to set the Lazy Schema Validation option to TRUE so that the query is not sent to every server for every command. The schema check will defer to run time and may fail if the result set formats are incompatible. :
EXEC sp_serveroption 'LocalServerName', 'lazy schema validation', true
EXEC sp_serveroption ‘RemoteServerName’, ‘lazy schema validation’, true
Another server level setting that can help performance is to ensure that the collation settings are compatible on all servers. This is an optimization option and is not required. What can happen if the collations are not compatible is the filters supplied by the WHERE clauses do not get applied on the remote server. Instead, all data is returned by the query and the WHERE clause gets applied on the local machine.
EXEC sp_serveroption 'LocalServerName', ‘collation compatible’, true
EXEC sp_serveroption ‘RemoteServerName’, ‘collation compatible’, true
This solution is not perfect and at times it can be a struggle. Most of the difficulties have been mentioned in this blog, but the two main ones are summarized here for convenience. The blog contains the solutions/workarounds to these problems so it is worth reading.
1. Optimizer sometimes ignores your WHERE clause and sends the query to every server, every database. The solution is to put constraints on your SQL Azure tables so the optimizer can determine which shard to touch and which to eliminate.
2. Non-remotable query where all data comes back to the local server without being filtered by the remote database.
If the view needs to touch all the databases and one is not available, the query will fail. If the view has determined that it can do partition elimination and does not need the unavailable database, then the query will succeed. The only exception to this is the first time a DPV is executed after a restart of your local SQL Server, it needs to touch every destination database to get the constraints and store them locally (in memory). Remember to turn on lazy schema validation so that every query after the first one does not get sent to each database just to check the schema.
How does SQL Server know which databases to touch? To test this scenario, create the different databases on SQL Azure so that your main table contains data from each month of a certain year. Perhaps create the databases called AppDB2010_01 and AppDB2010_02 to make it easier for you to remember what the database contains. These names are meaningless and have no bearing on what you actually store in the database. Create a table, perhaps call it FACT and make sure it has a DATE or DATETIME (or any other date oriented data type). The key to making this solution work efficiently is to create a constraint on this DATE field so that SQL Server knows it only contains data within a certain range. In this example, it will only contain data for a certain month of a certain year; i.e. January 2010. This constraint is what gets cached on your local server after the first use of the DPV so that SQL Server knows which tables to touch during a query. Now test to confirm using a query such as:
SELECT * FROM dpv_Test WHERE datefield = ‘2010-01-15’.
How do you know a query did not touch a certain database? Since SQL Profiler does not work in SQL Azure, the only way to know which queries ran on a certain server is to use certain DPVs to see what plans have been generated. The following query will return a list of recently generated statistics and plans for a server:
SELECT * FROM sys.dm_exec_query_stats qs CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) ORDER BY last_execution_time DESC;
It is possible to use a local SQL Server to simplify access to a multiple SQL Azure database. The performance may be better than individual links from a front end tool when partition elimination is used.
Thanks to Lubor Kollar for creating these queries.
SELECT * FROM sys.Database_Usage
SELECT * FROM sys.Bandwidth_Usage
SUM ( CASE WHEN USAGE.SKU = 'Web'
THEN (Quantity * 9.99/31)
WHEN USAGE.SKU = 'Business'
THEN (Quantity * 99.99/31)
END ) AS CostInDollars
FROM sys.Database_Usage USAGE
WHERE datepart(yy, TIME) = datepart(yy, GetUTCDate())
AND datepart(mm, TIME) = datepart(mm, GetUTCDate())
GROUP BY SKU
CASE WHEN USAGE.Direction = 'Egress'
THEN 0.15 * USAGE.BandwidthInKB/(1024*1024)
WHEN USAGE.DIRECTION = 'Ingress'
THEN 0.10 * USAGE.BandwidthInKB/(1024*1024)
END AS CostInDollars
SUM(Quantity) AS BandwidthInKB
WHERE datepart(yy, TIME) = datepart(yy, GetUTCDate())
AND datepart(mm, TIME) = datepart(mm, GetUTCDate())
AND class = 'External'
GROUP BY Time_Period, Direction
) AS USAGE
Microsoft Access is a great application to use with SQL Azure and can produce great reports.
Nice post, especially since the documentation says you can't have a linked server to SQL Azure.
A couple of points: You say only the enterprise edition of SQL supports linked servers, they work on our standard editions fine.
Also this method doesn't seem to support money or varchar max data types, as well as tables behind synonyms. Errors produced are:
The OLE DB provider "MSDASQL" for linked server "XXX" supplied inconsistent metadata for a column. The column "XXX" (compile-time ordinal X) of object ""blah"" was reported to have a "DBCOLUMNFLAGS_ISLONG" of 0 at compile time and 128 at run time. for a varchar MAX and:
The OLE DB provider "MSDASQL" for linked server "XXX" supplied inconsistent metadata for a column. The column "XXX" (compile-time ordinal X) of object ""blah"" was reported to have a "DBTYPE" of 131 at compile time and 6 at run time. for the money type.
A workaround is to use openquery (also works for synonyms):
SELECT * FROM OPENQUERY(mylinkedserver,'select mymoneycolumn,cast(myvarcharmaxcolumn as text) from dbname.schemaname.tablename')
I'd obviously be interested in a better way of doing this :)
One tweak - the article says linked servers are only available in Enterprise Edition, but note that they're also available in Developer Edition. Just something to think about when you're playing around with this feature.
Great instructions and sample code. One correction is that you can do it all with a single ODBC entry rather than making one for each connection. You just need to add @provstr to sp_AddLinkedServer:
,@provstr='Server=tcp:YourAzureServer.database.windows.net,1433;Database=YourDatabaseName;User ID=YourLogin;Password=YourPassword;Connection Timeout=60'
That being said, everything in your posting worked for me except for the insert/update against the DP View.
I keep getting an error from Azure indicating my view "...is not updatable because a primary key was not found on table (federated table)...". Updating directly through the linked server connection works, though, so it's definitely connecting to the right place.
It'd be great to know if this still works for you. Then I'd know if it's a problem on my side or a problem with Azure.
Good useful post.
Thanks, a big help.