In this post, the term "distributed database" refers to a set of SQL Server databases, each managed by a SQL server running on a separate computer. All databases have identical schemas, and data that originates in one database is replicated to all other databases, or nodes. A common example would be a system with a central database server and a number of remote machines used by field personnel, each with a local database replica. Data replication in such system can be implemented using a number of techniques, for example, merge replication, peer-to-peer replication, or Sync Services for ADO.NET.
A common problem arising during design of such distributed databases is surrogate key generation. Keys must be unique across the entire distributed database, rather than unique just within a particular database node. A common approach used in practice is to use GUID columns as keys. While straightforward and simple from a developer's perspective, this however has a big disadvantage - the random nature of GUIDs quickly leads to extensive index fragmentation in the database. Additionally, the size of GUID keys is four times larger than the size of integer keys, leading to corresponding increase in index size.
An alternative approach is to use compound two-column keys: one column identifies the database node, while the other column identifies a row within a table on that node. The combination of two columns creates a key that is unique across the distributed database. This works reasonably well, however using compound keys may be somewhat awkward in practice: for example, a table with multiple foreign keys that reference such compound keys will have twice the number of foreign key columns. Storage issues aside, the likelihood of developer confusion and errors would be higher if this approach is used.
In this post, I will present a method of generating single column keys that avoids these problems. The method is based on combining multiple values of smaller numeric data types into a single value of a larger numeric data type.
We have two values to be combined: the database node identifier (DBNodeID), and the row identifier for a particular table on a particular database node (RowID). Let's assume that both are integer values, which would not be unreasonable in practice. Each integer uses four bytes of storage, so to combine the two values without loss of information, we need eight bytes. We will use bigint as the data type for the combined value, which does require eight bytes of storage. To combine two integer values into one bigint value we will use a technique called bit shifting.
Here's an example. Let's say we need to pack DBNodeID 2 and RowID 14 into a single bigint value. In bit representation, these two values appear as follows:
DBNodeID (2): 00000000 00000000 00000000 00000010
RowID (14): 00000000 00000000 00000000 00001110
Using bit shifting, we shift the bits of the first integer to the left, into the two high words of the bigint, and use the bits of the second integer for the two low words of the bigint. Here's the result:
DistributedRowID: 00000000 00000000 00000000 00000010 00000000 00000000 00000000 00001110
In decimal, this corresponds to 8589934606 - a single bigint value that can be used as a key value for a row. This method will generate values that are derived from both DBNodeID and RowID values, and are guaranteed to be unique across the distributed database. In a sense, this method is similar to the two-column compound key approach mentioned earlier, with the advantage that only one key column is needed.
So how can we implement this bit shifting operation in SQL Server? Left-shifting a value by N bits is equivalent to multiplying that value by 2^N. This means that in order to pack the DBNodeID integer into the two high words of a bigint, we need to multiply it by 2^32 (there are 32 bits to represent an integer). Once the DBNodeID integer is packed in the two high words of a bigint, we add the second integer (RowID) to the result to obtain the key value. Here's a T-SQL example (assuming SQL Server 2008 from here on):
DECLARE @DBNodeID int = 2;
DECLARE @RowID int = 14;
SELECT @DBNodeID * POWER(CAST(2 AS bigint), 32) + @RowID AS DistributedRowID
There are multiple ways to implement this approach in a particular database design. One is to have a table, replicated to each database node, to be used as a key generator for all tables that have distributed surrogate keys. Switching to T-SQL again:
CREATE TABLE dbo.KeyGenerator
DBNodeID int NOT NULL,
TableName sysname NOT NULL,
RowID int NOT NULL,
DistributedRowID AS ISNULL((DBNodeID * POWER(CAST(2 AS bigint), 32) + RowID), 0),
CONSTRAINT pkKeyGenerator PRIMARY KEY (DBNodeID, TableName)
We can populate this table with data for a hypothetical three-node two-table distributed database:
INSERT INTO dbo.KeyGenerator (DBNodeID, TableName, RowID)
(1, 'Table1', 1),
(2, 'Table1', 1),
(3, 'Table1', 1),
(1, 'Table2', 1),
(2, 'Table2', 1),
(3, 'Table2', 1);
SELECT * FROM dbo.KeyGenerator produces the following (note the computed key values in the last column):
DBNodeID TableName RowID DistributedRowID
----------- ---------- ----------- --------------------
1 Table1 1 4294967297
1 Table2 1 4294967297
2 Table1 1 8589934593
2 Table2 1 8589934593
3 Table1 1 12884901889
3 Table2 1 12884901889
If an application needs to insert a row into Table2 on database node 1, it can run the following UPDATE query to obtain the key value for the new row, and increment the corresponding RowID value in a single statement:
DECLARE @NewDistributedKey bigint;
UPDATE dbo.KeyGenerator SET
RowID += 1,
@NewDistributedKey = DistributedRowID
WHERE DBNodeID = 1 AND TableName = 'Table2';
The selected value is 4294967297. Executing SELECT * FROM dbo.KeyGenerator one more time produces this result:
DBNodeID TableName RowID DistributedRowID
1 Table2 2 4294967298
Note that the RowID and DistributedRowID values for the second row have been incremented by 1, so the next time the UPDATE query is executed, it will obtain 4294967298 as the next key value for the Table2 table.
In summary, using the approach presented in this post, you can implement a distributed database system that uses single-column numeric surrogate keys, instead of widely used but problematic GUIDs, or more cumbersome compound keys.
Sounds like a reasonable solution, but it seems like a lot of extra work...
If the main problem with GUIDs is just
"extensive index fragmentation", it might be possible to use GUIDs but just run some routine (e.g., nightly) jobs in the database to rebuild indices on your tables, and keep the fragmentation in-check...
Yes, I know this may not be an option for
everyone (in particular it may not work for VLDBs or for 24 x 7 shops), but for others it's certainly feasible.
Have a good day!
Yes, this approach does require some (but hardly a lot of) extra work when designing the database. Also, it may not be optimal for systems requiring very high throughput, because of the key generation overhead compared to keys generated with an IDENTITY column or a NEWID() default. Choosing the best approach will require some thought and thorough knowledge of requirements - the problem is that in many distributed designs today, GUIDs are chosen as the most convenient approach for the developer, without considering their drawbacks.
I think the dbo.KeyGenerator table can be bottle neck.