How can I speed up hashtable lookups with struct object as keys?

Published 20 March 06 06:54 PM

When you have struct objects as the key in a hashtable, the lookup operation of the hashtable performs miserably. This can be attributes to the GetHashCode() function which is used internally to do the lookup.

If a struct contains only simple value types (int, short, etc.), the algorithm which computes the GetHashCode creates hashes which fall into mostly the same bucket.

Example, lets say, the hashtable creates 10 buckets. Then, most probably, all the keys are being put into the same bucket. Hence when a lookup is performed, the .NET runtime has to traverse through this entire bucket to get the value.

BUCKET1 - Value1, Value2, Value3,...,valuen
BUCKET2 - (empty)
BUCKET3 - (empty)
BUCKET4 - (empty)
BUCKET5 - (empty)
BUCKET6 - (empty)
BUCKET7 - (empty)
BUCKET8 - (empty)
BUCKET9 - (empty)
BUCKET10- (empty)

Hence, instead of the lookup operation being O(1), it becomes O(n) on an average case.

To overcome this drawback, consider overriding the GetHashCode() function and making the life easier for the .NET Runtime.

An example would be to create a string by merging all your value types in the struct and joining them by using a special character as a demarcator.

Since your struct is a lookup criteria, it is sure that all the struct values will be different, and hence the string generated is guaranteed unique.

Now the string generated has a method(GetHashCode()), since it is derived from System.Object (like all other objects). Just return the output of this API. A code example will help to understand better.

struct
{
	int a;
	short b;
	public struct(int _a, short _b)
	{
		a = _a;
		b = _b;
	}
	public override int GetHashCode()
	{
		string hash = a.ToString() + ":" b.ToString();
		return hash.GetHashCode();
	}
}

Since you are generating hashcode explicitly which is guaranteed to be unique, it will boost the performance of your lookup.

Reference: http://msdn.microsoft.com/library/default.asp?url=/library/en-us/cpref/html/frlrfsystemobjectclassgethashcodetopic.asp

[author: Vipul Patel, C# MVP]

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

# nat said on March 21, 2006 6:29 AM:
Hi Patel,

Will it actually slow things down as you have to concatenate two strings while have to get the hashcode from string? It seems to be an expensive operation. Would it be better to multiply the value with some prime value like in other sample?

public override int GetHashCode()
{
return a.GetHashCode() * 29 + b.GetHashCode();
}
# Jason Haley said on March 21, 2006 8:34 AM:
# C#, VS Deployment and all geek talk said on March 21, 2006 11:53 AM:
 
Check out
http://blogs.msdn.com/csharpfaq/archive/2006/03/20/556192.aspx 
for a FAQ on...
# vipul said on March 21, 2006 12:04 PM:
Hi Nat,

I agree that concatenate operations on strings are expensive.

But the lookup on struct if GetHashCode() is not overridden is very time-consuming. I got the idea of an faq about facing a real-life problem.

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/cpref/html/frlrfsystemobjectclassgethashcodetopic.asp mentions the above point.
# mashiharu said on March 21, 2006 2:20 PM:
After thinking about this for a while I wonder what you're trying to accomplish.

My first problem is your statement:

"Since you are generating hashcode explicitly which is guaranteed to be unique, it will boost the performance of your lookup."

Hashes are not unique.
http://en.wikipedia.org/wiki/Hash_function
# vipul said on March 21, 2006 3:19 PM:
Hi mashiharu,

The purpose of this faq is to enable better lookup for structs. I will get in touch with the csharpfaq team to get the subject title of this post corrected.

Basically when you are trying to use struct objects as-is in the hashtable as a the key, the lookup is terrible. The purpose of the above code snippet is an example on how to generate hash code explicitly. Here, I have taken example of creating a string by concatenating all the fields of the struct and joining them using a predetermined key ':'.

Using string is just an example, you can use whatever mechanism you want, as long it guarantees
a. A hashcode generated for a struct type is unique and same, no matter how many times it is generated.
b. If your two struct objects are same, then the hash code generated should be the same.

Hope it clarifies. Please feel free to suggest any changes which will result in a better example. We are all here to learn and share.

Thanks
Vipul
# nat said on March 21, 2006 3:27 PM:
My point is that using string to do hashing may actually slow your code down rather than making it faster especially in scenario of small data points or when the first field in the struct is well dispersed.

BTW, your perception on GetHashCode may be a little bit incorrect. GetHashCode on struct (or ValueType in C#) is overridden inside ValueType object. The default implementation is to get hash code from the first field only. Please read this for more detail... http://www.informit.com/content/images/0321245660/items/wagner_item10.pdf

But in Roter, it seems to use every field to calculate hash code though...
http://dotnet.di.unipi.it/Content/sscli/docs/doxygen/fx/bcl/valuetype_8cs-source.html
# vipul said on March 21, 2006 3:27 PM:
Hi mashiharu,

Asnwering your original question, the string I am generating is bound to be unique as it is being generated from a set of struct objects which are used as a key for the hashtable. And since the string is unique, the GetHashCode on the string will definitely generate a unique hash code.

Thanks
Vipul
# vipul said on March 21, 2006 3:49 PM:
hi Nat,

<quote>
especially in scenario of small data points or when the first field in the struct is well dispersed.
<quote>

This is explicitly the condition which was not available to me for a real-life situation. My struct was composed of 11 fields(short, int, short, int,....)

and many(about 15%) of the value types had the same first field, they used to differ on the other fields. With 1,000,000+ records, my performance was dismal.  

<quote>
GetHashCode on struct (or ValueType in C#) is overridden inside ValueType object. The default implementation is to get hash code from the first field only<quote>
In the reference, you have a string as the first member, which is quaranteed to be unique. That may not the case with other structs on other examples.
# nat said on March 21, 2006 3:58 PM:
btw, good news or not I don't know... .NET 2.0 seems to change the behavior to be what Rotor does. So you get the same behavior but it will be slower as it uses reflection to do the job.
# mashiharu said on March 21, 2006 4:27 PM:
"And since the string is unique, the GetHashCode on the string will definitely generate a unique hash code."

Wrong.

Try these examples:

Console.WriteLine("0:93121".GetHashCode());
Console.WriteLine("0:2546870".GetHashCode());

# mashiharu said on March 21, 2006 4:34 PM:
Or if you are on .net 1.1:

Console.WriteLine("0:22980214".GetHashCode());
Console.WriteLine("0:12538469".GetHashCode());
# vipul said on March 21, 2006 4:49 PM:
mashiharu,

yes, there is something fishy about ':'.  I beleive it has something to do with the formating specifier.

I tried concatenating with '%' and '_' and it generated unique codes. Can you please verify, if they are also prone to a loop-hole. After your response, I will update the FAQ article accordingly. We need readers to have the best solution.
# mashiharu said on March 21, 2006 4:59 PM:
No, there is no problem with the ":" and it was a clever way to ensure the string are unique.  The hash however is not (see wikipedia link above).

Example: You have a 2-bit number (0-3) which you want to generate a 1-bit (0-1) unique key for.

0 -> 0
1 -> 1
2 -> ?
3 -> ?

Can't be done.  

You get the same problem when you try to take a 32 + 16 = 48 bit number and turn into a 32-bit unique key.
# vipul said on March 21, 2006 5:16 PM:
Mashiharu, can you please suggest a good workaround which can be used in the FAQ sample?
# nat said on March 21, 2006 5:19 PM:
why don't you use the way i suggest? It should work...and it performs much better.

public override int GetHashCode()
{
return a.GetHashCode() * 29*29 + b.GetHashCode()*29 + c.GetHashCode() + ....;
}
# nat said on March 21, 2006 5:23 PM:
If you say so, then you might need to change the wording in your blog that "Since you are generating hashcode explicitly which is guaranteed to be unique". The key to generate code may be unique in your case but not the hash code itself.
# Omer van Kloeten said on March 23, 2006 3:09 AM:
The Framework Design Guidelines specify that the maximum size of a struct should be around 16 bytes. That's around four integer fields, which is probably a lot less than the 11 fields you have.
Maybe the performance hit should be solved by changing the struct to a class...
# Dotmad (on .Net) said on March 14, 2008 8:33 AM:

After going this week to the Microsoft performance open house , here are few things to consider: Create

# Great C# Resource! &laquo; Alexander The Great said on April 15, 2008 3:10 PM:

PingBack from http://alexandersarchive.wordpress.com/2008/04/15/great-c-sharp-resource/

# Brian said on October 30, 2009 6:57 AM:

The easiest way to speed up hashcode lookups on any object is to override the GetHashCode function to return an int that is generated at object creation time.  This makes it so the HashCode only gets generated once (when the object is created) rather than every time the GetHashCode function is called.  There are issues with using this technique with mutable objects...but this is true even if you don't use this technique.

Leave a Comment

(required) 
(optional)
(required) 

  
Enter Code Here: Required

This Blog

Syndication

Page view tracker