When you horizontally partitioning data across multiple SQL Azure databases or using Data Sync Server for SQL Azure, there might come a time when you need to write to a member database without causing primary key merge conflicts. In this case you need to be able to generate a primary key that is unique across all databases. In this article we will discuss different techniques to generate primary keys and their advantages and disadvantage.
One way to generate a unique primary keys is to use the NEWID() function in Transact-SQL, which generates a GUID as a uniqueidentifier data type. The GUID is guaranteed to be unique across all databases.
Advantages:
Disadvantage:
Another option is to use a bigint data type in place of an int. In this technique, the primary key is generated from being an identity column; however each identity in each database starts at a different offset. Different offset create the non-conflicting primary keys.
The first question most people ask, is bigint data type big enough to represent all the primary keys need. The bigInt data type can be as large as 9,223,372,036,854,775,807 because it is stored in 8 bytes. This is 4,294,967,298 times bigger than the maximum size of an int data type: 2,147,483,647. This means that you could potentially have 4 billion SQL Azure databases horizontally partitioned with tables of around 2 billion rows. More information about data types and sizes can be found here.
On the first SQL Azure database you would create the table like this:
CREATE TABLE TEST(x bigint PRIMARY KEY IDENTITY (1,1))
On the second SQL Azure database you would create the table like this:
CREATE TABLE TEST(x bigint PRIMARY KEY IDENTITY (2147483648,1))
And continue incrementing the seed value for each database in the horizontal partitioning.
Disadvantages:
In this technique a single identity database is built where all the primary keys are stored, however none of the data. This identity database just has a set of matching tables that contain a single column of integers (int data type) as an auto incrementing identity. When an insert is needed on any of the tables across the whole partition, the data tier code inserts into the identity database and fetches the @@IDENTITY. This primary key from the identity database is used as the primary key to insert into the member database or the partition. Because the identity database is generating the keys there is never a conflict.
So how many integers can a 50 Gigabyte SQL Azure database hold? This is a good question, because if you run out of space on your database acting as the primary key pool, then you can’t insert anymore rows. If all your tables where single column integers in the primary key database you could have 25,000 tables with two million rows (table size of 2 Megabytes) in a 50 Gigabyte SQL Azure database. 50 Gigabytes is currently the largest SQL Azure database you could use for your primary key database. Or some combination of that, like 12,000 tables of 4 million rows, or 6,000 tables of 8 million rows.
Another technique is to use two columns to represent the primary key. The first column is an integer that specifies the partition or the member database. The second column is an int IDENTITY, that auto increments. With multiple member or partition databases the second column would have conflicts, however together the two columns would create a unique primary key.
Here is an example of a CREATE TABLE statement with a double column primary key:
CREATE TABLE TEST ([partition] int, [id] int IDENTITY, CONSTRAINT PK_TEST PRIMARY KEY([partition], [id]));
Remember you need to add a second column for all the primary keys, and a second column to all foreign key references.
Do you have a favorite technique for solving the same problem; share it with us in the comments below. Or give us your opinion about what technique is the best. Have questions, concerns, comments? Post them below and we will try to address them.
My understanding is that newid is non sequential and so a major problem with the first approach, particularly on SQL Azure, is that things can get very slow. Unless you adopt a second column in combination with this that you cluster on. Is this right?
I'd also like to understand the issues with BigInt and Data Sync Services. I don't know much about the latter yet but i am looking at exactly this issue just now :-)
Steven: That is correct, I covered that in this blog post: "Uniqueidentifier and Clustered Indexes" blogs.msdn.com/.../10007304.aspx
Steven: There are several blog posts about Data Sync Service for SQL Azure, starting with this one: "Introduction to Data Sync Service for SQL Azure" blogs.msdn.com/.../10035099.aspx
I'm also interested in why bigint "works well for horizontal partitioning, but not for Data Sync Service."
According to the approach outlined here, you would not have conflicts due to the offsets (unless I am missing something.)
Andy & Steven: Good question. Data Sync Service creates an exact copy of your hub database to all your member databases. This means that the identity offset will be the same on all the databases. This means that inserts to member databases might have conflicting primary keys. So just to be clear, bigint will work as a primary key with Data Sync Service, however the technique of using a bigint with an identify offset will not. Secondly, conflicts are not bad in certain scenarios, however the article is about creating non-conflicting primary keys.
Ahh - that makes sense. Would be neat if Data Sync Service could allow a config to store this info so we could seed it declaratively - however i get the situation as it stands.
Data Sync Services I suspect will become a critical part of any architecture that needs to automatically scale so BigInt alone may not be the best.
However, if, you used DSS and each database with a different BigInt starting seed as discussed above was defined as a hub and 2 separate member servers assigned to each hub could that not work? Would effectively be completely independent copies, no?
Steven: As long as you didn't write to the member databases, otherwise on the bi-directional sync you would have a conflict.
Hi Steven,
I am definitely in agreement with you on this. I think most people start building their database not taking into account the need to synchronize it and the things you would need to take account from a db design perspective to support sync. It is for that reason that I also think we need to account for this in the Data Sync Service. Seeding the BigInt's are a good technique to accomodate this and is something that we have been researching. Of course that is under the assumption that the seed is sufficiently big enough and one database does not run over the other databases seed value which is something we will have to monitor in the data sync service and adjust the seeds as needed.
Once again thanks for the feedback.
Liam Cavanagh
Sr. Program Manager - SQL Azure Data Sync
"Data Sync Service creates an exact copy of your hub database to all your member databases" - this is true if the member databases don't have a copy of the table already. If they did (i.e., we created it before performing the sync) and seeded it appropriately, then the bigint scheme could work with Data Sync - is this right?
Another scheme that may work, assuming you only need to have bi-directional sync between 2 servers (i.e., SQL Server and SQL Azure) is to use a IDENTITY(1,1) for the SQL Server and IDENTITY(-1,-1) for SQL Azure. Therefore keys for SQL Azure become bigger negatively and keys for SQL Server grow positively as usual.