Creating Primary Keys Across Databases

When you horizontally partition data across multiple SQL Azure databases, or use Data Sync Service for SQL Azure, there may come a time when you need to write to a member database without causing primary key merge conflicts. In that case you need to be able to generate a primary key that is unique across all databases. In this article we will discuss several techniques for generating primary keys, along with their advantages and disadvantages.

UniqueIdentifier

One way to generate unique primary keys is to use the NEWID() function in Transact-SQL, which generates a GUID as a uniqueidentifier data type. GUIDs generated this way are, for all practical purposes, guaranteed to be unique across all databases.
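
As a minimal sketch (the table and column names here are illustrative), a table keyed this way can declare NEWID() as the default for its primary key column:

CREATE TABLE TEST(x uniqueidentifier PRIMARY KEY DEFAULT NEWID())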

Advantages:

  • It is a native type to SQL Azure.
  • The GUID space is so vast that you will effectively never run out of keys.
  • Works with both horizontal partitioning and Data Sync Service.

Disadvantages:

  • There is no way to tell from a GUID which database generated it. This can cause extra complications when doing horizontal partitioning.
  • The uniqueidentifier data type is large (16 bytes) and will add to the size of your rows.

Bigint

Another option is to use a bigint data type in place of an int. In this technique the primary key is still generated by an IDENTITY column; however, the identity in each database starts at a different offset. The different offsets keep the generated primary keys from conflicting.

The first question most people ask is whether the bigint data type is big enough to represent all the primary keys needed. The bigint data type can be as large as 9,223,372,036,854,775,807 because it is stored in 8 bytes. That is 4,294,967,298 times bigger than the maximum value of an int data type: 2,147,483,647. This means you could potentially have about 4 billion SQL Azure databases horizontally partitioned, each with tables of around 2 billion rows. More information about data types and sizes can be found in the Transact-SQL data type documentation.

On the first SQL Azure database you would create the table like this:

CREATE TABLE TEST(x bigint PRIMARY KEY IDENTITY (1,1))

On the second SQL Azure database you would create the table like this:

CREATE TABLE TEST(x bigint PRIMARY KEY IDENTITY (2147483648,1))

Continue incrementing the seed value for each additional database in the horizontal partitioning.
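
For example, assuming each partition is given a range of 2,147,483,647 keys (following the pattern above), the third database would be seeded at 2 × 2,147,483,647 + 1:

CREATE TABLE TEST(x bigint PRIMARY KEY IDENTITY (4294967295,1))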

Advantages:

  • It is easier to upgrade from legacy tables that used an int data type as the primary key to a bigint data type (the legacy table becomes the first partition).
  • You can repartition more easily than with some of the other techniques, since moving rows involves a straightforward CASE statement over key ranges (not a recalculated hash).
  • The data tier code implementing the partitioning can figure out which partition a primary key is in, unlike with a uniqueidentifier primary key.
  • The bigint data type consumes 8 bytes of space, half the 16 bytes of the uniqueidentifier data type.

Disadvantages:

  • The database schema for each partition is different.
  • This technique works well for horizontal partitioning, but not for Data Sync Service.

Primary Key Pool

In this technique a single identity database is built where all the primary keys are generated, but none of the data is stored. This identity database has a set of tables matching those in the data databases, each containing just a single auto-incrementing int IDENTITY column. When an insert is needed on any table across the whole partition, the data tier code first inserts into the corresponding table in the identity database and fetches the new identity value. That value is then used as the primary key for the insert into the member database or partition. Because a single identity database generates all the keys, there is never a conflict.
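
As a minimal sketch of the pattern (the OrdersKeys and Orders names are hypothetical, not from any particular schema), the identity database holds one single-column key table per data table, and the data tier allocates a key before each real insert:

-- In the identity database: one key table per data table.
CREATE TABLE OrdersKeys (id int PRIMARY KEY IDENTITY(1,1));

-- To allocate a key: insert an empty row, then read the identity back.
-- SCOPE_IDENTITY() is generally safer than @@IDENTITY because it ignores
-- identity values generated by triggers in other scopes.
INSERT INTO OrdersKeys DEFAULT VALUES;
SELECT SCOPE_IDENTITY();

-- The data tier then uses that value as the explicit primary key when
-- inserting the real row into the member or partition database.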

So how many integers can a 50 Gigabyte SQL Azure database hold? This is a good question, because if you run out of space in the database acting as the primary key pool, you cannot insert any more rows anywhere. A single-column int row occupies roughly 13 bytes once row and page overhead are counted, so a 50 Gigabyte database, currently the largest SQL Azure database available, can hold on the order of 4 billion key rows in total. Those can be divided among tables however you need: for example, roughly 2,000 tables of 2 million rows each, or 1,000 tables of 4 million rows.

Advantages:

  • This is the easiest technique to implement with legacy tables; there are no data type changes. However, the IDENTITY attribute needs to be removed from the data tables.
  • Works with both horizontal partitioning and Data Sync Service.

Disadvantages:

  • It works well only in low-write, high-read scenarios; under heavy writes, contention for the primary key database becomes an issue.
  • Every INSERT requires an extra round trip to the primary key database, which can cause performance issues, especially if the data database and the primary key database are not in the same data center.
  • There is no way for the data tier to identify from the primary key which partition the data is located in. For this reason the technique works best with Data Sync Service, where you have a known member database you are reading from and writing to.
  • There is a hard limit on the number of primary keys (int data type) you can hold in the largest 50 Gigabyte SQL Azure database, which caps the total number of rows across your whole set of partitions.

Double Column Primary Key

Another technique is to use two columns to represent the primary key. The first column is an integer that identifies the partition or member database. The second column is an auto-incrementing int IDENTITY. Across multiple member or partition databases the second column alone would have conflicts; however, together the two columns form a unique primary key.

Here is an example of a CREATE TABLE statement with a double column primary key:

CREATE TABLE TEST ([partition] int, 
    [id] int IDENTITY,
     CONSTRAINT PK_TEST 
        PRIMARY KEY([partition], [id]));

Remember that you need to add the second column to all primary keys, and also to every foreign key that references them, as in the sketch below.
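
For instance (the TEST_DETAIL table is a hypothetical illustration, not part of the original schema), a child table's foreign key back to TEST must carry both columns:

CREATE TABLE TEST_DETAIL ([partition] int,
    [id] int IDENTITY,
    [test_partition] int,
    [test_id] int,
    CONSTRAINT PK_TEST_DETAIL
        PRIMARY KEY([partition], [id]),
    CONSTRAINT FK_TEST_DETAIL_TEST
        FOREIGN KEY([test_partition], [test_id])
        REFERENCES TEST([partition], [id]));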

Advantages:

  • It is easier to upgrade legacy tables by adding an additional column than it is to convert those tables to uniqueidentifier primary keys.
  • Two integers consume 8 bytes of space, half the 16 bytes of the uniqueidentifier data type.
  • Works with both horizontal partitioning and Data Sync Service.
  • The data tier code implementing the partitioning can figure out which partition a primary key is in, unlike with a uniqueidentifier primary key.

Disadvantages:

  • It can feel unnatural and cumbersome to have two columns as your primary key.
  • The data tier code needs to keep track of two integers to reference a row. This can be encapsulated by using language features to create a new data type, such as a custom struct in C#.

Summary

Do you have a favorite technique for solving this problem? Share it with us in the comments below, or give us your opinion about which technique is best. Have questions, concerns, comments? Post them below and we will try to address them.

  • My understanding is that NEWID() is non-sequential, so a major problem with the first approach, particularly on SQL Azure, is that things can get very slow, unless you adopt a second column in combination with this that you cluster on. Is this right?

  • I'd also like to understand the issues with bigint and Data Sync Services. I don't know much about the latter yet, but I am looking at exactly this issue just now :-)

  • Steven:  That is correct, I covered that in this blog post: "Uniqueidentifier and Clustered Indexes" blogs.msdn.com/.../10007304.aspx

  • Steven:  There are several blog posts about Data Sync Service for SQL Azure, starting with this one: "Introduction to Data Sync Service for SQL Azure" blogs.msdn.com/.../10035099.aspx

  • I'm also interested in why bigint "works well for horizontal partitioning, but not for Data Sync Service."

    According to the approach outlined here, you would not have conflicts due to the offsets (unless I am missing something.)

  • Andy & Steven:  Good question. Data Sync Service creates an exact copy of your hub database to all your member databases.  This means that the identity offset will be the same on all the databases.  This means that inserts to member databases might have conflicting primary keys.  So just to be clear, bigint will work as a primary key with Data Sync Service, however the technique of using a bigint with an identify offset will not.  Secondly, conflicts are not bad in certain scenarios, however the article is about creating non-conflicting primary keys.

  • Ahh - that makes sense. It would be neat if Data Sync Service could allow a config to store this info so we could seed it declaratively - however, I get the situation as it stands.

    Data Sync Service, I suspect, will become a critical part of any architecture that needs to scale automatically, so bigint alone may not be the best.

    However, if you used DSS and each database (with a different bigint starting seed as discussed above) was defined as a hub, with 2 separate member servers assigned to each hub, could that not work? They would effectively be completely independent copies, no?

  • Steven: As long as you didn't write to the member databases; otherwise, on the bi-directional sync you would have a conflict.

  • Hi Steven,

    I am definitely in agreement with you on this. I think most people start building their database without taking into account the need to synchronize it, and the things you would need to account for from a db design perspective to support sync. It is for that reason that I also think we need to account for this in the Data Sync Service. Seeding the bigints is a good technique to accommodate this and is something that we have been researching. Of course, that is under the assumption that the seed range is sufficiently big and one database does not run over another database's seed value, which is something we will have to monitor in the Data Sync Service, adjusting the seeds as needed.

    Once again thanks for the feedback.

    Liam Cavanagh

    Sr. Program Manager - SQL Azure Data Sync

  • "Data Sync Service creates an exact copy of your hub database to all your member databases" - this is true if the member databases don't have a copy of the table already. If they did (i.e., we created it before performing the sync) and seeded it appropriately, then the bigint scheme could work with Data Sync - is this right?

  • Another scheme that may work, assuming you only need bi-directional sync between 2 servers (i.e., SQL Server and SQL Azure), is to use an IDENTITY(1,1) for the SQL Server and IDENTITY(-1,-1) for SQL Azure. The keys for SQL Azure grow more negative while the keys for SQL Server grow positively as usual.
