Sign In
Scalability Notes
[Read -> Think -> Write]
Translate This Page
Translate this page
Powered by
Microsoft® Translator
Options
Blog Home
Email Blog Author
Share this
RSS for posts
RSS for comments
Search
Advanced search options...
Search In:
Everything
Blogs
Forums
People
Groups
Places
Pages
Date range:
All Time
Last Year
Last 6 Months
Last 3 Months
Last Month
Last Week
Last Two Days
Tags
database
distributed system
engineering
hpc
network
parallel
scalability
search
Archive
Archives
December 2010
(1)
September 2010
(1)
August 2010
(1)
April 2010
(1)
February 2010
(2)
January 2010
(4)
December 2009
(1)
November 2009
(1)
October 2009
(1)
September 2009
(1)
August 2009
(4)
June 2009
(2)
May 2009
(1)
April 2009
(1)
March 2009
(2)
February 2009
(4)
January 2009
(1)
Consistent Hashing - Theory & Implementation
MSDN Blogs
>
Scalability Notes
>
Consistent Hashing - Theory & Implementation
Consistent Hashing - Theory & Implementation
changl
17 Sep 2009 1:45 AM
Comments
0
Consistent Hashing - Theory & Implementation
What's it?
The consistent hashing comes from the solving of hot spot problem in Internet system, I.E., it comes from the distributed cache system. [1][2]
The idea is simple and straight forward (more detail in paper[1][2]):
Hot Spot
->
Centric Cache
->
Distributed Cache
(Communicate using IP Multicast) ->
Harvest Cache
(Tree Structure Cache, structured communication)[9] ->
Random Tree Cache
(different Tree for different cached Objects, hash mapping from tree node to machine node) ->
Random Tree + Consistent Hashing Cache
(deal with machine node dynamics: machine may leave/down and join/recover)
Essentially, a
Consistent Hash Function
is one that changes minimally as the range of the function changes. A hash function can be looked as a mapping from items to buckets, suppose we have such a mapping m1. If some buckets are added/removed, we have another mapping m2. In normal hash functions, m1 -> m2 will introduce many item movements (from one bucket to another). A consistent hash function has the characteristic that item movements are minimal.
How does it accomplish that?
In consistent hash function:
1. Items and buckets are all hashed(normal hash) into some integer interval (for example: [0, 1024]).
2. Item is mapped to a bucket that is closet to it.
3. Bucket may be replicated to improve even distribution balance.
NOTE: here "
closet
" means the first bucket it meets when traverse clock wise along the integer circle (see diagram below)
Suppose the normal hash output range is [0, 1023], you have 4 buckets and 8 items, each bucket is replicated twice. One possible consistent hashing may be illustrated as below:
Current Mapping/Hashing:
Bucket1 - Item1, Item3, Item8
Bucket2 - Item2, Item7
Bucket3 - Item4, Item6
Bucket4 - Item5
If Bucket3 is down/broken, the new Mapping/Hashing will be:
Bucket1 - Item1, Item3, Item8
Bucket2 - Item2, Item6, Item7
Bucket4 - Item4, Item5
You can see that only Items on Bucket3 are changed, and they are distributed among the remaining buckets.
If a new Bucket5 is added, you can see that only small number of items are changed, and the new bucket gets load from the original 4 Buckets.
How to Implement it?
Normal hash function is stateless, the return value only determines by the input parameter, but the consistent hash function is a stateful function. The state is how the buckets are arranged on the integer circle.
It's natural to store how the buckets are arranged on the integer circle as a search tree, since it's in fact a typical search algorithm - you need to know which segment an item (its hash value) belongs to.
In practical system, this stateful data structure will be stored on each client that uses this consistent hash function. As buckets(machine nodes) join and leave, the state will change. But different client may see the join/leave at different time, or even in different order, thus will produce different hash value using the same consistent hash function(It's said to have different view in paper[1]).
But it is proven in [1], that the number of buckets that one item may belong and the number of items that on one particular bucket won't be very large (the so called
Spread
/
Load
property).
A C++ version of consistent hashing function can be found
here
, it uses STL map as the binary search tree.
The impact of the bucket replica number can be visualized as below (code can be found in
testmain.cxx
):
You can see that as the replica count increases, the item distribution over buckets will become more and more even.
[Reference]
1. Theory Paper
Consistent hashing and random trees
: distributed caching protocols for relieving hot spots on the World Wide Web
2. Practical Paper
Web Caching with Consistent Hashing
3. Blog
About Consistent Hashing with Java Code
4. Blog
Understanding Consistent Hash
5.
http://en.wikipedia.org/wiki/Consistent_hashing
6.
http://en.wikipedia.org/wiki/Distributed_hash_table
7.
The Chord P2P system
8.
A Hierarchical Internet Object Cache
0 Comments
scalability
Blog - Comment List MSDN TechNet
Comments
Loading...
Leave a Comment
Name
Comment
Please add 6 and 4 and type the answer here:
Post