GUID guide, part two

GUID guide, part two

Rate This
  • Comments 16

So how is it that a GUID can be guaranteed to be unique without some sort of central authority that ensures uniqueness, a la the ISBN system?

Well, first off, notice that the number of possible GUIDs is vastly larger than the number of possible ISBNs. Because the last of the thirteen digits is a checksum, there are only 1012 possible ISBNs. That is about a hundred unique ISBNs for every person on earth. That's almost exactly 240, so you could represent an ISBN by a 40 bit number (again, ignoring the checksum). There are 2128 possible GUIDs; that's about 40 billion billion billion unique GUIDs for every person on earth. This alone gives us the intuition that it ought to be pretty easy to ensure that two of them never collide; there are a lot of GUIDs to choose from!

There are a number of possible strategies for making a unique GUID, and in fact information about the strategy used is encoded in the first four bits of the third "group"; almost every GUID you see will be of the form {xxxxxxxx-xxxx-1xxx-xxxx-xxxxxxxxxxxx} or {xxxxxxxx-xxxx-4xxx-xxxx-xxxxxxxxxxxx}.

If there is a one in that place then the algorithm used to guarantee uniqueness is essentially a variation on the ISBN strategy. The GUID is guaranteed to be unique "in space" by choosing some of the bits as the MAC address of the network card in the machine. (The tricky problem of ensuring that no two network cards in the world have the same MAC address is solved somehow by someone else; how that problem is solved, we don't particularly care. The cost of solving that problem is passed on to you, the consumer, in the purchase cost of the network card.)

(UPDATE: Larry Osterman pointed out to me that of course, this MAC address solution is not foolproof. First, you could deliberately or accidentally set your MAC address to an already-used value. Second, a hardware manufacturer might forget to set the address at all and have it be all zeros. Third, two virtual machines might be using the same physical network card and those virtual machines could be generating GUIDs at exactly the same time fast enough to cause collisions.)

We know that we can rely on this as a source of uniqueness in space. Most of the remaining bits are a high-resolution timestamp. Therefore every generated GUID is unique in both space and time, and is therefore globally unique.

This system does in practice have a few weak spots. The most obvious one is that it does not work well if the machine does not have a network card! Version-one GUIDs generated on machines without network cards are not guaranteed to be unique. The less obvious one is that there is a slim chance that two GUIDs could be generated "at the same time". Perhaps two GUID generators were running on two different processors in the same machine at the same time. Or perhaps a GUID was generated, the clock in the machine was deliberately "turned back", and the same GUID was then generated again, just by bad luck. There are a few "extra" bits in a GUID that are used to mitigate these timing problems, so in practice they don't arise.

There are a number of interesting consequences of this algorithm. The first is that such GUIDs are almost completely non-random. Many people seem to be of the mistaken belief that GUIDs are a source of randomness, when in fact they are only guaranteed to be a source of uniqueness.

The second is that it is possible for GUIDs generated on a particular machine with this algorithm to be monotone increasing. This is actually a very nice property for GUIDs to have; GUIDs are often used as primary keys in databases, and it can be much cheaper to insert a large number of rows into an indexed-by-primary-key database table if the rows are already sorted and you're always inserting after every previous entry. This again demonstrates that using a sequence of GUIDs as a sequence of random 128 bit numbers is a terrible idea; random numbers are typically not monotone increasing!

The third interesting consequence is that code or documents that contains a GUID generated with this first algorithm contain information that uniquely identifies the machine that was used to create the GUID. Just as a skilled reader can determine interesting facts about a book from its ISBN, so too can a skilled reader determine when and where a GUID was generated if it has a one as its thirteenth hex digit. This fact was used when tracking down and prosecuting the author of the famous Melissa virus. (We'll discuss the implications of this in more detail in an upcoming episode.)

The fourth interesting consequence is that no proper subsequence of the bits of a GUID have the global uniqueness property, as Raymond pointed out back in 2008. And indeed, we have no reason to expect that a smaller set of bits would have the same properties as a larger set of bits! You don't expect to be able to saw one aircraft in half and end up with two things that both fly.

Next time we'll talk about GUIDs that have a four in their thirteenth hex digit; they use a completely different technique for ensuring uniqueness.

  • MAC addresses are not very unique, some reasons:

    - 00:00:00:00:00:00 is rather common, when the vendor forgot to flash the MAC address

    - Some batches of MAC addresses have been used multiple times due to administrative errors

    - ISP's tend to allow only one DHCP lease per end user, and thus, if you went from a bridged connection to a routed one, you'd loose your DHCP lease. So the modem has a feature to "clone" the PC's MAC address.

    - If you copy a virtual machine, you can end up with two virtual machines with the same MAC running at different locations

    - Network admins can assign a MAC address as "known". In a variation on the "clone" theme, this causes users to copy their MAC address around just so they can use their new PC.

    Still, a GUID seems to be pretty unique in practice.

  • This is a great article on an interesting subject, thanks.

  • I have offtopic question! About immutability.

    So Roslyn is deeply immutable library. Is it clear win? How things are going? I mean feedback etc. Is it worth the efforts comparing to mutable approach? If everything fine then could immutability be apllied say to c# xml library? What do think? Is it time for imperative developers to switch to immutable stuff? Tnx

  • Good article, but I feel like mentioning -- and I'm sure you know -- that in a properly constructed GUID, between five and seven (usually six) bits are reserved to provide information about the type and version of the GUID. In practice, there are only two values of those 5-7 bits in common use, effectively reducing them to one bit, so in practice there are much only about 2^123 GUIDs rather than 2^128. Still a lot, but it decreases it from 40 billion billion billion per person to one billion billion billion. :-)

  • The name-based GUIDs described by RFC4122 have a "3" or a "5" as the thirteenth hex digits. There's no built-in support for creating these GUIDs in .NET, so I implemented my own: code.logos.com/.../generating_a_deterministic_guid.html

  • This is a great series of posts - fascinating and useful. Thanks!

  • Why is it that the 13th digit is the one that's set?  Why not the first bit (as opposed to digit).  Assuming values other than 1 and 4 aren't valid you end up rendering a large number of possible GUIDs "invalid", unless they are reserved for currently unknown mechanisms for generating a GUID.  It still leaves the question of why not have it at the start though...

  • Perhaps it is not at the start for sorting reasons?

  • MAC addresses are 48 bits = 12 hex digits. I would guess that this wasn't planned to be an algorithm flag, but that the times just happen to always start with a 1 based on whatever epoch they use.

  • Do you still need to smash a network card with a hammer (blogs.msdn.com/.../71307.aspx) for an app to generate hardware-independent GUIDs?

  • @servy42

    I think that originally there was only one type of GUID, and they didn't think of a type designator. So when they introduced this designator, they used a digit that was the same for all GUIDs created with the MAC method.

  • I am curious if adding randomization into GUID would not make it more unique?

  • Well, if you use a method giving guaranteed unique values, there's no way to make it more unique.

    Now, all the guids using hashing / random guids only are pretty likely unique in a reasonably sized data-set.

    There you could improve things by adding more uniquifiers. But that's not really relevant in realistic use-cases not dealing with more than 10^20 or so items or having to fend of attackers.

  • @Servy42:  The strategy identifier is a digit, not a bit, in order to have more than two of them in the future (even if only two methods are currently defined, and I'm not completely sure abou that).

    Second, it doesn't matter WHERE the strategy number appears in the GUID.  The bits that make up the strategy number could be scattered throughout the 128-bit GUID (although it's a little easier to spot the way it's done).  Why do you care whether the identifier is at the front or in the middle?

  • @Paulius:  The phrase "more unique" doesn't make sense.  Unique means "one of a kind".  Something can't be "more unique" than something else.

    GUIDs are designed to be unique, period.  Also, they are not designed to be random.  But the fact that they are guaranteed to be unique when generated by cooperating entities is all that you need to know.  The algorithm is designed to GUARANTEE uniqueness, not just "rarity".  Adding randomness to that won't help the situation, since the situation is already perfect for its intended use.  (Note that some strategies use randomness as one component, along with other components, of the GUID being generated.)

Page 1 of 2 (16 items) 12