Welcome to MSDN Blogs Sign in | Join | Help

Introducing HashSet<T> [Kim Hamilton]

HashSet<T> is in our latest CTP, and you can find it in the System.Collections.Generic namespace. The naming discussion over the last month has motivated me to recap some naming highlights for HashSet, so hang in til the end if you’re interested.

HashSet is an unordered collection containing unique elements. It has the standard collection operations Add, Remove, Contains, but since it uses a hash-based implementation, these operation are O(1). (As opposed to List<T> for example, which is O(n) for Contains and Remove.) HashSet also provides standard set operations such as union, intersection, and symmetric difference.

    HashSet<int> theSet1 = new HashSet<int>();
    theSet1.Add(1);
    theSet1.Add(2);
    theSet1.Add(2);
    // theSet1 contains 1,2

    HashSet<int> theSet2 = new HashSet<int>();
    theSet2.Add(1);
    theSet2.Add(3);
    theSet2.Add(4);
    // theSet2 contains 1,3,4

    theSet1.UnionWith(theSet2);
    // theSet1 contains 1,2,3,4

HashSet’s default Add operation returns a bool letting you know whether the item was added, so in the code sample above, you could check the return type to check whether the item was already in the set.

    bool added = theSet1.Add(2); // added is true
    added = theSet1.Add(2); // added is false

If you’re familiar with our ICollection<T> interface, notice that this means ICollection<T>.Add (returning void) has an explicit implementation, allowing HashSet<T> to introduce its own Add.

A note on uniqueness: HashSet determines equality according to the EqualityComparer you specify, or the default EqualityComparer for the type (if you didn’t specify). In the above example we didn’t specify an EqualityComparer so it will use the default for Int32. In the next example, we’ll use an OddEvenComparer, which considers items equal if they are both even or both odd.

    class OddEvenComparer : IEqualityComparer<int> {
        public OddEvenComparer() {}
        public bool Equals(int x, int y) {
            return (x & 1) == (y & 1);
        }

        public int GetHashCode(int x) {
            return (x & 1);
        }
    }

    ...

    // Now use the comparer
    HashSet<int> oddEvenSet = new HashSet<int>(new OddEvenComparer());
    oddEvenSet.Add(1);
    oddEvenSet.Add(3);
    oddEvenSet.Add(4);
    // oddEventSet contains 1,4; it considered 1 and 3 equal.

Notice the name UnionWith in the first example. UnionWith, as with the other set operations, modifies the set it’s called on and doesn’t create a new set. This distinction is important because the Linq operations Union, Intersect, etc on IEnumerable create a new set. So HashSet’s methods aren’t duplicating Linq; they’re provided in case you want to avoid creating a new set, and they’re distinguished by the With suffix.

Now for some naming fun, which will demonstrate some other framework guidelines. We would have liked to name this feature Set. This is because it’s preferred to use a common name rather than one that reveals details about the implementation. To borrow an example from Krzysztof Cwalina and Brad Abram’s book Framework Design Guidelines, a type used to submit print jobs to a print queue should be named Printer, and not PrintQueue. Applying this guideline to this class – HashSet, while more technically precise, isn’t as recognizable at Set. You can see this guideline in other class names in the System.Collections.Generic namespace: List<T> instead of ArrayList<T>, Dictionary<T> instead of Hashtable<T>.

This brings up the question of whether naming it Set would have been a bad idea if we add other sets in the future, such as an OrderedSet. However, a hash-based unordered set can reasonably be considered the “go-to” set because of its good performance, so distinguishing it with the name Set would still be acceptable.

Any guesses as to why we didn’t go with the name Set?

Published Thursday, November 09, 2006 10:43 AM by BCLTeam

Comments

Thursday, November 09, 2006 2:04 PM by Wesner Moise

# re: Introducing HashSet<T> [Kim Hamilton]

FXCop rule prohibits naming a type after a language keyword. "set" is also used in C#, but also in VB6, though it is either gone or optional in VB.NET.

It would be nice to support operator overloading for sets.

Thursday, November 09, 2006 2:04 PM by Alan Dean

# re: Introducing HashSet<T> [Kim Hamilton]

Because "set" is a reserved contextual keyword

http://msdn2.microsoft.com/en-gb/library/x53a06bb.aspx

Thursday, November 09, 2006 2:05 PM by Tom

# re: Introducing HashSet<T> [Kim Hamilton]

I would strongly prefer the name Set instead of HashSet.  As you state, HashSet refers too strongly to the implementation.  Users of the class don't really care about the implementation, we just care about the functionality and performance characteristics.

Thursday, November 09, 2006 3:34 PM by TheMuuj

# re: Introducing HashSet<T> [Kim Hamilton]

Even though you no longer have to use Set when assigning references in VB, it is still a keyword.  It is used in declaring property accessors:

Property Text() As String

Get

 Return _text

End

Set(ByVal value As String)

 _text = value

End Set

End Property

I'm not sure if Set is a true keyword or a contextual keyword, but either way there's potential for confusion.

Thursday, November 09, 2006 4:19 PM by Timothy Fries

# re: Introducing HashSet<T> [Kim Hamilton]

PowerCollections uses Set<T>... I can't say I've ever had a problem with keyword collision with it.

Thursday, November 09, 2006 4:57 PM by DotNetKicks.com

# Introducing HashSet

You've been kicked (a good thing) - Trackback from DotNetKicks.com

Thursday, November 09, 2006 5:48 PM by Kael's WebLog

# System.Collections.Generic."Hash"Set?

HashSet is a new generic collection that has been added to the System.Collections.Generic namespace.

Thursday, November 09, 2006 11:30 PM by dono

# re: Introducing HashSet<T> [Kim Hamilton]

> Any guesses as to why we didn’t

> go with the name Set?

I assume you mean Set<T>.

That makes the VB argument mostly irrelevant.

I am _really_ glad that you did not call it Set<T>. I eagerly await the day that ISet<T> is introduced along with other set types, TreeSet<T>, OrderSet<T> etc.

However, I think it is a mistake to introduce HashSet<T> now without an ISet<T>. A lot of code will need to be rewritten later (such as updating signatures) because of this choice.

I hope the type is unsealed. Lets say I want to design a TreeSet<T>. It really does not make sense for me to extend HashSet<T> to do this. Yet without an ISet<T>, there really aren't any other options. I could ignore HashSet<T> entirely and write everything from scratch. However, that does not fit well into the _framework_.

Please reconsider this issue.

Friday, November 10, 2006 4:58 AM by Doug

# re: Introducing HashSet<T> [Kim Hamilton]

A Set type is great to have. I've gotten by with Dictionary<KeyType, object> so far (leaving the value as null), but this is even better.

Another type that I use often is Dictionary<KeyType, List<ValueType>> - a key maps to a set of zero or more values. I would love to see something like this in the BCL.

Friday, November 10, 2006 6:46 AM by Israel Aece

# re: Introducing HashSet<T> [Kim Hamilton]

Hello Kim,

When and where this collection will be available?

Friday, November 10, 2006 6:50 AM by Israel Aéce

# System.Collections.Generic.HashSet

The BCL Team work in a new type collection called (temporary) HashSet that is a collection containing

Friday, November 10, 2006 7:18 AM by Sébastien Ros

# Then a HashList ?

I think this is a great idea, as it is really a missing feature currently. Thus why don't you go a step beyond by creating an HashList which would be o(1) (or did I miss it)

Friday, November 10, 2006 1:38 PM by Kim Hamilton

# re: Introducing HashSet<T> [Kim Hamilton]

Hi Isreal,

It's in our latest Orcas CTP, which you can download here:

http://www.microsoft.com/downloads/details.aspx?FamilyId=C09B5A2D-EB6A-44B6-8BBD-3764A2FDA9CE&displaylang=en

It's in System.Core.dll, in the System.Collections.Generic namespace.

Thanks,

Kim

Monday, November 13, 2006 2:03 AM by Kim Hamilton

# re: Then a HashList?

Hi Sebastien,

I'm interpreting HashList as a collection in which you can perform both index-based lookups and hash-based lookups. In other words, a generalization of NameValueCollection (which accepts only Strings).

We haven't gotten a lot of requests for this type of collection, and as with NVC, it would have some additional overhead. So we'd still recommend picking one or the other (List or HashSet) if you don't actually require both types of lookups. See the related blog on NVC overhead here:

http://blogs.msdn.com/bclteam/archive/2006/09/05/741660.aspx

Thanks,

Kim

Monday, November 13, 2006 2:15 AM by Kim Hamilton

# re: Introducing HashSet<T> [Kim Hamilton]

Hi Dono,

>> A lot of code will need to be rewritten later (such as updating signatures) because of this choice.

We had a lot of discussion on this point during the design of HashSet (which Justin summarized); we'll most likely follow up soon with some more details.

Thanks,

Kim

Saturday, November 18, 2006 4:03 AM by Sébastien Ros

# re: Introducing HashSet<T> [Kim Hamilton]

Kim,

By HashList I was referring to an IList<T> implementation which would be O(1) for Add, Remove and Contains. Isn't it one of the main matters of HashSet ?

Wednesday, November 22, 2006 11:08 AM by diegov

# re: Introducing HashSet<T> [Kim Hamilton]

What I don't get is the reason you change the semantics of Add so lightly.

I would prefer that you were more explicit. My suggestion:

Create a public bool TryAdd() and then a void Add() that throws on duplicates as usual. Of course, you can implement Add() as a wrapper around TryAdd() if convinient.

Wednesday, November 22, 2006 2:03 PM by Kim Hamilton

# re: Introducing HashSet<T> [Kim Hamilton]

Hi Diegov,

Earlier versions of our design used Add/TryAdd, but we removed it to be consistent with the set interpretation of add. Even though a set is maintained such that it doesn't contain duplicate items, attempting to add a duplicate isn't exceptional. We provided the Add that returns bool for cases where the user may want different control flow based on whether the item was already present in the set, but this shouldn't be achieved through exceptions (with their associated perf cost, etc).

So in fact, this isn't a change of semantics of Add. You can think of it this way -- if the items are equal according to the equality comparer, then it's an implementation choice whether the original item is replaced by the new duplicate item. After Add is finished, the item is present in the set.

I should also mention that we came to this decision only after much debate. Using Add/TryAdd was a common suggestion, but more than 2/3rds of people reviewing the API thought that having Add throw on duplicates would be surprising for a set.

Some readers may be wondering why IDictionary throws on duplicate adds. That's because it takes both a key and a value, and although the key may be the same, the values may differ.

Thanks,

Kim

Wednesday, November 22, 2006 2:13 PM by Kim Hamilton

# re: Introducing HashSet<T> [Kim Hamilton]

Hi Sebastien,

>> Isn't it one of the main matters of HashSet ?

Not exactly; HashSet is unordered and doesn't contain duplicates, but an IList is ordered and may contain duplicates.

Thanks,

Kim

Wednesday, November 22, 2006 3:30 PM by diegov

# re: Introducing HashSet<T> [Kim Hamilton]

Well, I guess the majority rules :) Anyway, the use of IEqualityComparer<T>.GetHashCode() gives the slight feeling that this is also a dictionary, only the key is function of the value. I guess then, it is not completely possible to abstract of the fact that HashSet is implemented as with a hash. Which is another reason not to call it Set :)

Wednesday, February 07, 2007 6:50 PM by Daniel Moth

# System.Collections.Generic.HashSet

System.Collections.Generic.HashSet

Thursday, February 15, 2007 9:41 AM by Mike Chaliy's Blog

# .Net Framework 3.5 - System.Core.dll

Незабаром в нас буде нова версія .Net Framework 3.5, яка схоже, як і 3.0, буде спиратися на базовий .Net

Tuesday, June 12, 2007 9:47 AM by James Manning's blog

# (Hash)Set<T> in the 3.5 BCL

Since it's coming up on 2 years since I complained that the BCL didn't have Set&lt;T&gt; I figured I

Wednesday, August 08, 2007 3:22 PM by Patrick Smacchia [MVP C#]

# New .NET 3.5 core stuff

While installing VisualStudio 2008 Beta2 I was surprised that the NET framework 2.0 installation got

Tuesday, September 11, 2007 11:42 PM by Michael Freidgeim's Blog

# Generic function to removeDuplicates from Generic List

Generic function to removeDuplicates from Generic List

Thursday, September 13, 2007 2:32 PM by Jack Gudenkauf (JackG) WebLog

# What’s new in the .Net Framework 3.5 (Orcas)

Visual Studio (VS) 2008 and .Net 3.5, code named “Orcas”, will be released in the near future. This release

Monday, October 22, 2007 7:30 PM by Dariusz quatscht

# Nachlese: TechTalk - Visual Studio 2008 & .NET 3.5

Mein TechTalk ist nun zu Ende. Meine letzte Station heute in Berlin war Lustig und Amüsant, ich hoffe

Wednesday, November 07, 2007 6:25 AM by Pedram Rezaei's Ramblings

# System.Core.dll of .NET Framework 3.5

I recently wrote an MSDN Flash article on the hidden gems of System.Core.dll of .NET Framework 3.5 which

Wednesday, November 07, 2007 7:10 AM by Noticias externas

# System.Core.dll of .NET Framework 3.5

I recently wrote an MSDN Flash article on the hidden gems of System.Core.dll of .NET Framework 3.5 which

Monday, November 19, 2007 7:42 PM by BCL Team Blog

# .NET Framework 3.5 Now Available! [Justin Van Patten]

.NET Framework 3.5 and Visual Studio 2008 have officially shipped! Soma has the announcement on his blog

Tuesday, April 22, 2008 3:03 PM by Rik Robinson's Blog

# The New Generic Kid on the Block - HashSet&lt;T&gt;

In mathematics, a Set is typically thought of as a collection of distinct objects that is usually defined

New Comments to this post are disabled
 
Page view tracker