Uniqueidentifier and Clustered Indexes - SQL Azure Team Blog - Site Home - MSDN Blogs

Uniqueidentifier and Clustered Indexes

Uniqueidentifier and Clustered Indexes

Rate This
  • Comments 7

I love GUIDs -- the uniqueidentifier data type in SQL Server. I use them for everything, domain names, unique error message, and for primary keys in my database tables, etc… -- don’t get me started. About two years ago, I started using uniqueidentifier for primary keys because of their unique ability to merge.

Merging

When you create a GUID in SQL Server using the NewId() command you are guaranteed that it will be unique across the whole universe. Which means if you have two databases (with the same schema) completely disconnected adding rows to the same table, using a primary key of uniqueidentifier will ensure that they primary keys don’t conflict. In comparison, if the two databases had an identity integer column as the primary key for their tables, they would be very likely to insert the same primary key in both tables.

What does this have to do with merging? It is ridiculously easy to merge the tables of our disconnected databases with each other if they are using uniqueidentifier as their primary keys. On the other hand, it is much harder to resolve conflicts when merging between the identity integer primary keys and updating the foreign key references to those primary keys.

Clustering and uniqueidentifier

It isn’t a good idea to create a clustered index on a uniqueidentifier column and generate your GUIDs with NEWID(). The reason for this is that NEWID() generates GUIDs in non-sequential order and SQL Server orders a clustered index sequentially. It will work – SQL Server will let you build a clustered index around a uniqueidentifier column, however it will cause the SQL Server to do unnecessary work and cause performance slowdowns. The reason for this is that to insert data into the middle of a clustered index (out of sequential order) causes SQL Server to make room for the data by rearranging the cluster.

So if it isn’t a good idea then why do people do it? Well, in SQL Server, if I assign a column as the primary key in SQL Server Management Studio it automatically generates a clustered index for you regardless of the data type of that column. Therefore, if you want a table with a uniqueidentifier data type as a primary key you need to change that index to a non-clustered index.

Non-clustered indexes don’t reorder the data as rows are inserted to the table, so they don’t have the performance impact of a clustered index on inserts of non-sequential data.

Along comes SQL Azure

If you love uniqueidentifier data types for primary keys like I love them and you are creating non-clustered indexes on your primary keys then you need to pick a clustered index for the table also. The reason is that SQL Azure requires one (and only one) clustered index on all tables. However, just because SQL Azure requires a clustered index doesn’t mean it should be the primary key column.

Picking a Clustered Index

There are a couple of strategies for picking your clustered indexes; one of the easiest and best is to add another column of data type datetime and use that column for your clustered index. Here is what you need to do:

1. Add the column as data type datetime

2. I usually call it Date

3. Set the Default Value to GetDate().

4. Make it non-null.

5. Create your clustered index on it before you insert data into you.

Adding a default value of GetDate() to the column writes the date and time that the row was inserted into the column automatically. This insures that the data for the row is inserted at the end of the table data – there is no rearranging of the cluster. Adding the data to the end ensures the best performance for inserts.

Another good choice for a clustered index is a column that reflects the ordering of the table in the majority of the select statements. For example, if you have a table called categories and you have an integer column called ordervalue and you always call the table with an SELECT … ORDER BY [ordervalue] then making ordervalue the clustered index makes sense. Here is why: even though it hinders performance to insert a non-sequential ordervalue into the cluster you will get a performance benefit when you call your data, since the rows will be read sequential from the cluster, that it depending on your workload characteristics.

NEWSEQUENTIALID()

I would be remise if I didn’t mention the NEWSEQUENTIALID() function which not supported in SQL Azure. If the NEWID() function generates unique non-sequential uniqueidentifier than NEWSEQUENTIALID() function generates unique sequential uniqueidentifier. The only trick to NEWSEQUENTIALID() function is that the GUIDs are generated partial based on the network card of the computer.

This means that you can successfully have a uniqueidentifier as a primary key column and use that primary key column the required clustered index for SQL Server. As long as you use the NEWSEQUENTIALID() function column to fill that column.

If privacy is a concern, do not use this function. It is possible to guess the value of the next generated GUID and, therefore, access data associated with that GUID. For example if you are passing the primary key to the table in a query string on a web browser. See MSDN for more information.

However, NEWSEQUENTIALID() function isn’t support for SQL Azure. If you try to use it you will get this error:

Msg 40511, Level 15, State 1, Line 1

Built-in function 'NEWSEQUENTIALID' is not supported in this version of SQL Server.

Migration

If you are migrating an existing SQL Server database to SQL Azure you need to do these things to work successfully with uniqueidentifier data type as your primary key.

  • Convert the generation of GUIDs from NEWSEQUENTIALID() to NEWID()
  • Remove all clustered indexes from uniqueidentifier data type columns – this will not prevent you from migrating, however it will give you better performance.
  • Pick a column (or add a column) to build your clustered index around (see the recommendations above).

Do you have questions, concerns, comments? Post them below and we will try to address them.

  • Better solution: Write your own function on datetime + guid to generate your new ids that are sortable and incremental. There are many approaches out there. And the best part is that you can write the same thing in .net so you dont' have to round trip for the value created like you do with newsequentialid.

    End result is similar speed to ints, clustering on your primary key (where it should be because you're doing most of your searching on it in joins!) and the uniqueness of guids.

  • I believe that what Geminiman is talking about is the guid.comb guid.

    http://jeffreypalermo.com/blog/use-guid-comb-in-your-database-if-you-need-guid-keys-but-don-t-want-to-take-a-big-performance-hit/

  • Picking a clustered index is both and art and a science.  We only touched on the possibilities for making a good choice.  I would hesitate to implement any solution that could guarantee uniqueness with the primary key when generating a GUID.  Best to handle the performance of the join on the primary key with a non-clustered index and pick your clustered index based on the table usage.

  • Good discussion that for the most part applies equally to non-SQL Azure instances of SQL Server.  I'd point out that clustered indexes on GUID tend to get an unfairly bad rap in the community.  One reason is that (with a non-SQL Azure SQL instance, at least) you can eliminate the frequent page split issue simply by using NEWSEQUENTIALID.  Even if you don't do that, the insert overhead of the page splits would be a negligible contributor to the overall cost of most query workloads.  

    One comment about:

    > Non-clustered indexes don’t reorder the data as rows are inserted to the table

    To clarify, a nonclustered index is a B-tree just like a clustered index, and just like a clustered index it must remain ordered according to the index key(s) as new data is inserted.  If the data arrives out of order according to the NC index key(s), you're going to get regular page splits within the nonclustered index on inserts.  This will be the case for every nonclustered index you build on *any* column, unless that index's leading column just happens to be correlated with your clustering key in a particular way (constantly increases or decreasing as the clustering key increases).  

  • Good Post. Wanted to create a table in my Azure SQL Database, installed the sql management tool and clicked create table. DOH - no designer? Realizing how long time ago i wrote SQL. Could you give a little sample for creating a table based on the idea above with a column with uniqueidentifier and default to newid(). and a clustered datetime column with default value also.

  • Personally, I would not recommend using a datetime column for the clustered key because it does not guarantee uniqueness of the values. A better choice would be to use an int/bigint identity column.

  • What is the benefit of using a GUID as primary key over an integer? I'm building a Windows Azure application that I want to be scalable so in that context, wht do I get from a GUID that justifies the datasize and indexing complications of a GUID?

Page 1 of 1 (7 items)
Leave a Comment
  • Please add 6 and 7 and type the answer here:
  • Post