I recently read a blog post on what was better using GUIDs or Integer values. This is been an age long debate and there are advocates in both camps stressing on the disadvantages of the other. Well both implementations have their advantages and disadvantages. At the outset, I shall mention that the answer to this debate is: IT DEPENDS! J
It is highly dependent on your database design, migration needs and overall architecture. There is a good reason why SQL Server replication uses GUIDs to track the changes to the replicated articles. So, it is not that the usage to GUIDs is necessarily a bad practice. SQL Server Books Online lists the following disadvantages for uniqueidentifier data type:
· The values are long and obscure. This makes them difficult for users to type correctly, and more difficult for users to remember.
· The values are random and cannot accept any patterns that may make them more meaningful to users.
· There is no way to determine the sequence in which uniqueidentifier values were generated. They are not suited for existing applications that depend on incrementing key values serially.
· At 16 bytes, the uniqueidentifier data type is relatively larger than other data types, such as 4-byte integers. This means indexes that are built using uniqueidentifier keys might be relatively slower than indexes using an int key.
If you are using NEWID function in SQL Server, then this generates random UUIDs which have a huge domain but the chances of GUID collisions are always there though the probability is very slim in nature. If you are using NEWID function to generate uniqueidentifiers as row identifiers in your table, then you need to think again! Uniqueness of the row should be enforced using a Unique or Primary Key constraint on the table. NewSequentialID function uses identification number of the computer network card plus a unique number from the CPU clock to generate the uniqueidentifier (Reference article). So the chance of getting a globally unique value is practically guaranteed as long as the machine has a network card. Moreover, possibility of a GUID collision while using NewSequentialID is virtually impossible.
I shall use a simple example of how database operations can be affected when using the following as clustered indexes:
1. Random GUIDs (using NEWID function)
2. Sequential GUIDs (using NewSequentialGUID function)
3. BIGINT identity value
4. INT identity value
The last three would always introduce a hot-spot in your database table when concurrent inserts are occurring on the table as the INSERT operations would always insert into the last page of the table as the data for a clustered index is ordered in ascending order by default. This however is a discussion for a different post altogether. The tables that I have created are quite narrow and typically the average record size would be small:
create table tblGUID (RecID uniqueidentifier default newid(), Record varchar(255), ModifiedTime datetime, SessionID int)
create table tblSeqGUID (RecID uniqueidentifier default newsequentialid(), Record varchar(255), ModifiedTime datetime, SessionID int)
create table tblINT (RecID bigint identity(1,1), Record varchar(255),ModifiedTime datetime, SessionID int)
create table tblBigINT (RecID int identity(1,1), Record varchar(255),ModifiedTime datetime, SessionID int)
The clustered index for all the above tables are defined on the RecID column. The tables have no other indexes defined on them. I created a WHILE loop to insert 1 million records in each table and here are the times taken:
tblGUID: 22 seconds
tblSequential GUID: 16 seconds
tblBIGINT: 23 seconds
tblINT: 23 seconds
Given that you have a beefy server, the above time difference would not make much of a difference unless and until you only have a high number of concurrent INSERT workload on the server or during a Data Load operation which would cause a significant impact. What is interesting to note is that the fragmentation on the tables after the first batch of 1 million inserts.
Average Record Size
Average Page Density
If you look at the above data, you will see that the random GUIDs have 99% logical fragmentation in the tables. This is due to the random nature of the GUIDs generated which end up causing high number of page splits in the database.
Since GUIDs require more storage than integers, you will notice the number of pages for the tables using GUIDs have higher page counts. After inserting another million records in the above tables, I inserted 10,000 records (in batches of 100) in the above tables and tracked the times for the inserts at a more granular level:
tblGuid : 23 seconds (10 ms to 1120 ms)
tblSeqGuid : 0 seconds (3 ms to 10 ms)
tblBigINT : 0 seconds (3 ms to 13 ms)
tblINT : 0 seconds (3 ms to 20 ms)
You will find that the minimum and maximum times mentioned above show a high degree of variation for the random GUID table. The rest of the times are more or less comparable. The time taken for an insert is dependent on the range that the newly generated GUID falls into. The above data clearly shows that using random GUIDs can not only create fragmentation and page splits during Insert/Update operations.
Since NewSequentialID function uses the network card identifier to generate a new value, it might not be preferred where privacy is a concern because it is possible to guess the value of the next generated GUID and, therefore, access data associated with that GUID.
Now when I run a simple SELECT query against the above tables (using a cold buffer cache), I find the a COUNT(*) for the entire table shows the following IO statistics:
Table 'tblGUID'. Scan count 9, logical reads 20942, physical reads 236, read-ahead reads 20564, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times: CPU time = 375 ms, elapsed time = 6212 ms.
Table 'tblSeqGUID'. Scan count 9, logical reads 14519, physical reads 383, read-ahead reads 14216, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times: CPU time = 466 ms, elapsed time = 1442 ms.
Table 'tblBigINT'. Scan count 9, logical reads 12229, physical reads 287, read-ahead reads 12007, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times: CPU time = 374 ms, elapsed time = 1039 ms.
Table 'tblINT'. Scan count 9, logical reads 11144, physical reads 252, read-ahead reads 10983, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times: CPU time = 392 ms, elapsed time = 908 ms.
You will see that the number of pages read for the table using random GUIDs is the highest due to the high fragmentation on the tables. After I rebuilt the clustered index of the tblGUID table, the same query with a cold buffer cache only performed 14K+ reads which is comparable to the tblSeqGUID table.
Table 'tblGUID'. Scan count 9, logical reads 14376, physical reads 143, read-ahead reads 10642, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
However if you have frequent inserts in the table, you cannot keep the fragmentation down to a negligible value.
I have summarized what we discussed above in the table below:
Comparable but the index keys are larger in size. For wider tables, this would be slower than Integer values.
Faster than sequential GUIDs
Highest, due to smaller size of RIDs
JOIN Performance/SORT operations
(Rank 4 = Least performance)
Better than random GUIDs due lesser fragmentation (Rank: 3)
High performance (Rank: 2)
Merging data across servers
Rare chance of duplicates
Globally unique. Virtually no chance of collisions
Limited by range of BIGINT
Limited by range of INT
The above considerations make the use of GUIDs unfavorable for a clustered index in environments which have large number of queries performing JOIN operations in OLTP and when referential integrity is enforced on the database among multiple tables. Throw non-clustered indexes that you created on the table as covering indexes for the frequently run queries against the database, you can have a significant performance bottleneck.
So hopefully the above points will help you choose an appropriate column for the clustered index on your table. The other strategy is to use natural keys. With this strategy, you do not use any kind of manufactured key, but instead use a business key to uniquely identify records. For example, a table that is used to store customer information might use the social security number column as the primary key instead of an identity column. The drawback to this approach is that the primary key might become large if more than one column is required to uniquely identify a record. Furthermore, this compound key must be propagated to other tables to support one or more foreign key relationships. These relationships, in turn, adversely affect join performance.
Based on the above considerations, you should be able to take an educated decision whether to use an integer, business key or GUIDs for your primary key of your table.
As a parting note, I would like to say that there is no short cut to testing. You would need to test the design that you decide to implement with a workload which is as close to your production workload to make sure that your design is performing as expected.
System Specifications used for the test:
Operating System: Microsoft Windows Server 2008 R2 Enterprise (x64)
Machine: Hewlett-Packard HP Z800 Workstation
Processor: 2 Quad Core [Intel(R) Xeon(R) CPU E5506 @ 2.13GHz, 2128 Mhz, 4 Core(s), 4 Logical Processor(s)]
RAM: 16.0 GB
Hard Disk: Barracuda 7200.12 SATA 3Gb/s 500GB Hard Drive
SQL Server: SQL Server 2008 R2
Regards,Amit BanerjeeSEE, SQL Support
Amit BanerjeeSEE, SQL Support
Congratulations, very nice and interesting article.
Very Helpful. Thank you!
Another very good, scientific, facts-based article - Simply, but very worthy
Very interesting insights. It would be even more helpful if you followed up with specific statistics regarding the impact of GUIDs on joins, as opposed to simply stating the rank. But of course, you emphasized testing.
Excellent introduction to a design decision that needs to be made.