<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type="text/xsl" href="http://blogs.msdn.com/utility/FeedStylesheets/rss.xsl" media="screen"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:wfw="http://wellformedweb.org/CommentAPI/"><channel><title>Socks, birthdays and hash collisions</title><link>http://blogs.msdn.com/b/ericlippert/archive/2010/03/22/socks-birthdays-and-hash-collisions.aspx</link><description>Suppose you’ve got a huge mixed-up pile of white, black, green and red socks, with roughly equal numbers of each. You randomly choose two of them. What is the probability that they are a matched pair? 
 There are sixteen ways of choosing a pair of socks</description><dc:language>en-US</dc:language><generator>Telligent Evolution Platform Developer Build (Build: 5.6.50428.7875)</generator><item><title>re: Socks, birthdays and hash collisions</title><link>http://blogs.msdn.com/b/ericlippert/archive/2010/03/22/socks-birthdays-and-hash-collisions.aspx#9997866</link><pubDate>Sat, 17 Apr 2010 22:14:47 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9997866</guid><dc:creator>Harold</dc:creator><description>&lt;p&gt;Hey guys, you are aware that .NET generates &amp;quot;V4&amp;quot; GUIDs by default, right? So it'll be a pseudorandom number, not your MAC address.&lt;/p&gt;
&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=9997866" width="1" height="1"&gt;</description></item><item><title>re: Socks, birthdays and hash collisions</title><link>http://blogs.msdn.com/b/ericlippert/archive/2010/03/22/socks-birthdays-and-hash-collisions.aspx#9997275</link><pubDate>Fri, 16 Apr 2010 15:04:38 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9997275</guid><dc:creator>MikeH</dc:creator><description>&lt;p&gt;&amp;quot;... straight line on a log-log&amp;quot; of course that only goes so far (since of course it's never going to go above 100%) but is there a graph that doesn't tail-off? &amp;nbsp;&lt;/p&gt;
&lt;p&gt;Could it be that if we compare socks pulled out with expected number of pairs that the straight line will continue. It seems reasonable that expected number approximates to percentage with small percentages but (switching examples) if you ask 364 people for their birthday you will still not get 100% but you would expect to get dozens(?) of collisions. (sadly I'm not sure how to calculate this and how many pairs of socks do you have when you have 3 matching socks anyhow?)&lt;/p&gt;
&lt;p&gt;...or how would we draw the chances of NOT getting a pair as you pull out large number of socks (there's a factorial in there somewhere!)?&lt;/p&gt;
&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=9997275" width="1" height="1"&gt;</description></item><item><title>re: Socks, birthdays and hash collisions</title><link>http://blogs.msdn.com/b/ericlippert/archive/2010/03/22/socks-birthdays-and-hash-collisions.aspx#9987239</link><pubDate>Tue, 30 Mar 2010 07:40:25 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9987239</guid><dc:creator>stan</dc:creator><description>&lt;p&gt;Say you want to compile a list of unique names from a phone book. Obviously some type of hash table would be the way to do this, but given the comments above, collisions seem likely. So the question would be which hashing algorithm would be optimal for avoiding collisions? &lt;/p&gt;
&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=9987239" width="1" height="1"&gt;</description></item><item><title>re: Socks, birthdays and hash collisions</title><link>http://blogs.msdn.com/b/ericlippert/archive/2010/03/22/socks-birthdays-and-hash-collisions.aspx#9984665</link><pubDate>Wed, 24 Mar 2010 22:14:28 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9984665</guid><dc:creator>JM</dc:creator><description>&lt;p&gt;@Paul: no, this is pointless. Computing the hash of a string requires looking at every character, and then you might as well compare those pairwise while you're at it. If you have to do expensive culture-sensitive equality comparisons it *might* pay off, but it's the kind of optimization you'd have to be very sure about (per Anon, measuring carefully would not be optional). It's highly unlikely culture-sensitive string comparisons are a bottleneck in your application -- and if they are, it's more likely you want them to be culture-insensitive instead in any case.&lt;/p&gt;
&lt;p&gt;As to why hashing uses every character: it doesn't need to, but if you don't look at every character, collisions become extremely likely for some input sets (namely the ones that have identical characters in whatever positions you're checking, which is very common with generated strings). In general that's not a good way of getting a performance gain, and String.GetHashCode() dutifully looks at every character.&lt;/p&gt;
&lt;p&gt;People sometimes suggest that the framework should pre-computer or cache hash values of strings to speed up comparisons (or just hashing in general), but this is silly for the simple reason that this would penalize *every* string creation for the sake of speeding up repeated comparisons. This doesn't pay off -- string creation is vastly more common than repeated string comparison against the same string instance (mind you, it needs to be the same *instance*, we can't know it's the same *characters* as that's exactly the problem we're trying to solve). Of course, when stored in a dictionary, the hash value *is* kept along with the key, because we *do* have repeated comparisons. In C#, a string-based &amp;quot;switch&amp;quot; statement is implemented as a dictionary lookup by the compiler, again a situation where you do have repeated comparisons which is dutifully optimized. &lt;/p&gt;
&lt;p&gt;To summarize, computing hashes makes sense when you're hashing, it doesn't make sense in general. Sounds obvious, but in my experience it's remarkably hard to convince people of this -- they get fixated on the &amp;quot;comparing hash values is faster&amp;quot; angle without being able to consider the total cost of all operations.&lt;/p&gt;
&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=9984665" width="1" height="1"&gt;</description></item><item><title>re: Socks, birthdays and hash collisions</title><link>http://blogs.msdn.com/b/ericlippert/archive/2010/03/22/socks-birthdays-and-hash-collisions.aspx#9984646</link><pubDate>Wed, 24 Mar 2010 21:41:31 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9984646</guid><dc:creator>Random832</dc:creator><description>&lt;p&gt;&amp;gt;&amp;gt; Everyone talks about version 1 GUIDs (time &amp;amp; mac address), but almost all software generates version 4 GUIDs (random numbers). &amp;nbsp;I've experienced at LEAST 10 duplicate GUIDs in the last 8 years.&lt;/p&gt;
&lt;p&gt;That is statistically very unlikely (even if it's not impossible) unless they're being generated in a particularly poor manner.&lt;/p&gt;
&lt;p&gt;note that using System.Random or C's rand() function or anything else that only starts with a 32-bit seed qualifies - you've basically got a 32-bit random* number (the original seed) that's been stretched out to 122 bits without adding any more real randomness, since all your bits are dependent on that original seed so you can only generate 2^32 possible GUIDs. You'll have that same high probability of collisions with semi-random GUIDs generated in the same way by the same implementation, and an unknown (somewhere between that high probability and zero) probability between it and different implementations.&lt;/p&gt;
&lt;p&gt;*or not even really so, since the default seed for System.Random is time-dependent... and rand()'s _default_ seed is fixed, but its _typical_ seed is going to also be time-dependent&lt;/p&gt;
&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=9984646" width="1" height="1"&gt;</description></item><item><title>re: Socks, birthdays and hash collisions</title><link>http://blogs.msdn.com/b/ericlippert/archive/2010/03/22/socks-birthdays-and-hash-collisions.aspx#9984576</link><pubDate>Wed, 24 Mar 2010 19:29:42 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9984576</guid><dc:creator>Jack</dc:creator><description>&lt;p&gt;Its useful to point out that 23 = sqrt(365), and that the point at which it is likely (&amp;gt;=50%) that you will have a collision is at the squareroot of the size of your keyspace. &lt;/p&gt;
&lt;p&gt;Following from this, you can trivially calculate the tipping point for any hash given its length, as it is 2**(hashlength/2)&lt;/p&gt;
&lt;p&gt;Futhermore, you chances are that good if you assume random uniform distribution. When you're hashing a bunch of items, they probably have things in common that make them less than perfectly random.&lt;/p&gt;
&lt;p&gt;Moral of the story: Be careful using hashes as UIDs, but if you're using a 128bit hash you will *probably* be fine.&lt;/p&gt;
&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=9984576" width="1" height="1"&gt;</description></item><item><title>re: Socks, birthdays and hash collisions</title><link>http://blogs.msdn.com/b/ericlippert/archive/2010/03/22/socks-birthdays-and-hash-collisions.aspx#9984420</link><pubDate>Wed, 24 Mar 2010 15:40:43 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9984420</guid><dc:creator>Drew</dc:creator><description>&lt;p&gt;Everyone talks about version 1 GUIDs (time &amp;amp; mac address), but almost all software generates version 4 GUIDs (random numbers). &amp;nbsp;I've experienced at LEAST 10 duplicate GUIDs in the last 8 years.&lt;/p&gt;
&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=9984420" width="1" height="1"&gt;</description></item><item><title>re: Socks, birthdays and hash collisions</title><link>http://blogs.msdn.com/b/ericlippert/archive/2010/03/22/socks-birthdays-and-hash-collisions.aspx#9983792</link><pubDate>Tue, 23 Mar 2010 18:15:06 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9983792</guid><dc:creator>Gabe</dc:creator><description>&lt;p&gt;As a quick first-order approximation, cut the number of bits in half to get a 50% chance of a collision and shift right 3 bits to get a 1% chance. E.g. if you have 2^32 sock colors, choose approximately 2^16 socks to get a 50% chance of a duplicate or 2^13 socks to get about a 1% chance of a duplicate color.&lt;/p&gt;
&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=9983792" width="1" height="1"&gt;</description></item><item><title>re: Socks, birthdays and hash collisions</title><link>http://blogs.msdn.com/b/ericlippert/archive/2010/03/22/socks-birthdays-and-hash-collisions.aspx#9983648</link><pubDate>Tue, 23 Mar 2010 15:06:13 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9983648</guid><dc:creator>Anthony W Pegram</dc:creator><description>&lt;P&gt;There are also multicolored socks. For example, former-president Clinton's cat.&lt;/P&gt;
&lt;DIV class=yellowbox&gt;
&lt;P&gt;The former president's former cat. Socks died in 2009. -- Eric&lt;/P&gt;&lt;/DIV&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=9983648" width="1" height="1"&gt;</description></item><item><title>re: Socks, birthdays and hash collisions</title><link>http://blogs.msdn.com/b/ericlippert/archive/2010/03/22/socks-birthdays-and-hash-collisions.aspx#9983484</link><pubDate>Tue, 23 Mar 2010 09:44:03 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9983484</guid><dc:creator>Daniel Earwicker</dc:creator><description>&lt;p&gt;jsrfc58 said: To complicate matters, there are also other types of socks that are not worn--wind socks, drift socks, etc.&lt;/p&gt;
&lt;p&gt;Eric said: Clearly more research needs to be done on this problem&lt;/p&gt;
&lt;p&gt;I've been trying to think of a way of punning on garter and Gartner for too many minutes now. It just ain't gonna happen.&lt;/p&gt;
&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=9983484" width="1" height="1"&gt;</description></item></channel></rss>