A Set<T> Class? [Ari Weinstein]

A Set<T> Class? [Ari Weinstein]

  • Comments 16

Recently, I have heard a number of requests for the addition of a Set<T> class to the System.Collection.Generic namespace.  Basic implementation of this class that implements only ICollection<T> would cover all the etremely vital operations a set should implement, but that would just be a weak List<T>. We want to cover more interesting functionality; at least Union, Intersect, Difference, Subset, and Disjoint.

 

While any person using set would expect these operations from a set class, there are many additional interesting options.  Of them are symmetric difference, mapping/correspondence, predicates, min/max, and powersets.  Does anyone out there need so much goodness, or is it a bit more than what you would need?

 

Other questions were more centered around the implementation.  A set based on a hash would have extremely good performance with near-constant insert and remove, but would have a  slow search time O(n).  An ordered set, on the other hand, would probably take O(log n) for insert, search, and remove.  This means that for most applications, the unordered list provides better results. Of course, the speed of these operations has serious impact on all of the other set operations.  We may implement both.

 So again, to solicit some more user comments (and thanks for the past ones!):

 ·                                 Would the addition of a set class to the BCL make sense for the programs you are writing?  Would you rather see something else implemented?

·                                 How much functionality would you like to see in the class; do the members in the prototype below satisfy your needs, or are you looking for more (or less)? 

·                                 Do you see yourself using sets for ordered or unordered collections of values, or both?

 

Here's the source for a simple program that does some basic operation on odd and even sets of integers, it highlights the most basic operations one may want to do with a set:

 

Check it out and respond up here or send me an email at ArielW@microsoft.com!

 

 

using System;

using System.Collections;

using System.Collections.Generic;

class Starter{

static void Main(){

/* Produces the Following output:

Set of even integers under 10:

0

2

4

6

8

Set of odd integers under 10:

1

3

5

7

9

Do the even and odd sets intesect?

no

Even union odd integers:

0

2

4

6

8

1

3

5

7

9

Are even integers a subset of all integers?

yes

Are all integers a subset of even integers?

no

Difference of All integers from Odd integers:

0

2

4

6

8

*/

Set<int> EvenSet = new Set<int>();

Set<int> OddSet = new Set<int>();

for(int i=0; i<10; ++i){

if (i % 2 == 0)

EvenSet.Add(i);

else

OddSet.Add(i);

}

Console.WriteLine("Set of even integers under 10:");

foreach (int i in EvenSet)

Console.WriteLine("\t" + i);

Console.WriteLine("\nSet of odd integers under 10:");

foreach (int i in OddSet)

Console.WriteLine("\t" + i);

Console.WriteLine("\nDo the even and odd sets intesect?");

if(EvenSet.IntersectsSet(OddSet))

Console.WriteLine("\tyes");

else

Console.WriteLine("\tno");

Console.WriteLine("\nEven union odd integers:");

Set<int> EvenAndOddSet = EvenSet.Union(OddSet);

foreach(int i in EvenAndOddSet)

Console.WriteLine("\t" + i);

Console.WriteLine("\nAre even integers a subset of all integers?");

if (EvenSet.IsSubsetOf(EvenAndOddSet))

Console.WriteLine("\tyes");

else

Console.WriteLine("\tno");

Console.WriteLine("\nAre all integers a subset of even integers?");

if (EvenAndOddSet.IsSubsetOf(EvenSet))

Console.WriteLine("\tyes");

else

Console.WriteLine("\tno");

Console.WriteLine("\nDifference of All integers from Odd integers:");

foreach (int i in EvenAndOddSet.Difference(OddSet))

Console.WriteLine("\t" + i);

}

}

public class Set<T> : ICollection<T>{

private List<T> list;

public Set()

{

list = new List<T>();

}

//

//ICollection<T> methods

//

public void Add(T item)

{

if(!list.Contains(item))

list.Add(item);

}

public void Clear()

{

list.Clear();

}

public bool Contains(T item)

{

return list.Contains(item);

}

public void CopyTo(T[] array, int arrayIndex)

{

list.CopyTo(array, arrayIndex);

}

public bool Remove(T item)

{

return list.Remove(item);

}

public int Count

{

get {return list.Count;}

}

public bool IsReadOnly

{

get { return false; }

}

IEnumerator System.Collections.IEnumerable.GetEnumerator()

{

// TODO: Add Class1.System.Collections.IEnumerable.GetEnumerator implementation

return list.GetEnumerator();

}

IEnumerator<T> System.Collections.Generic.IEnumerable<T>.GetEnumerator()

{

return list.GetEnumerator();

}

public Set<T> Union(Set<T> otherSet)

{

Set<T> UnionSet = new Set<T>();

foreach (T item in this)

UnionSet.Add(item);

foreach (T item in otherSet)

UnionSet.Add(item);

return UnionSet;

}

public Set<T> Intersect(Set<T> otherSet)

{

Set<T> IntersectSet = new Set<T>();

foreach (T item in this)

if(otherSet.Contains(item))

IntersectSet.Add(item);

return IntersectSet;

}

public Set<T> Difference(Set<T> otherSet){

Set<T> DifferenceSet = new Set<T>();

foreach (T item in this)

DifferenceSet.Add(item);

foreach (T item in otherSet)

DifferenceSet.Remove(item);

return DifferenceSet;

}

public bool IsSubsetOf(Set<T> otherSet){

foreach (T item in this)

if(!otherSet.Contains(item))

return false;

return true;

}

public bool IntersectsSet(Set<T> otherSet)

{

return !IsDisjointFrom(otherSet);

}

public bool IsDisjointFrom(Set<T> otherSet){

foreach (T item in this)

if(otherSet.Contains(item))

return false;

return true;

}

}

  • I have to say I'm a fan of the Java approach to the Collections classes better than the one taken by the .NET framework so far. Admittedly I haven't looked into the .NET approach in great detail but at first glance there are a lot of things missing compared with the Java model.

    The approach taken by Java is to provide a rich set (no pun intended) of interfaces - Collection, Map, Set, List, SortedSet, SortedMap - and then provide a few alternative implementations of each interface with different performance characteristics.

    So you have ArrayList and LinkedList, TreeMap and HashMap, and TreeSet and HashSet. All of these can be used together and you can choose the implementation whose performance characteristics best suit your particular need.

    Want fast constant-time insertion and will always be iterating in order? Use LinkedList. Don't care so much about inserting but need fast random access? Use ArrayList. Etc.

    List<T> may be extremely well optimized but, assuming it's implemented as an array, it'll never be as good as a linked list for insertion at arbitrary points in a large list.

    The way this applies to your question is that, naturally, I believe that it would be nice to have an ISet<T> with *both* concrete implementations.

    FWIW, I'd also like to see common specialized collections like an LRU cache and a Dictionary that preserves order-of-insertion (at the expense of lookup performance) be part of the framework too...
  • The standard set operators like union, intersection, etc can be implemented wickedly fast if the sets are either sorted linked lists (then it can be done in-place) or sorted arrays.

    Just go through both lists at once, I assume you can figure out the rest.
  • I agree with the ISet<T> approach, but otherwise suspect that a set optimized for writing wouldn't be as useful as one optimized for reading (and, therefor, complicated calculations upon).

    For completeness, is there an implementation that possibly takes more memory, but remains balanced in terms of reading and writing. Depending on the application, the tradeoff may be worth the overhead.

    .. hence the ISet<T>
  • Your remark that a hash-based Set would have O(n) search times is incorrect. Search times would be constant, as would be insert and delete. An ordered set would have slower search times O(log n), because of the additional ordering requirement.

    The problem with a hash-based set is that results will be unordered or, at least, ordered chronologically according to time of insert.
  • I agree with the poster who suggests that multiple ISet<T> implementations are needed.

    I know ive needed a HashSet, a ListSet and more in my time.
  • An ordered set is more desperately needed than a hash-based set. A hash-based set can be faked up using a Dictionary<TValue,(whatever)>. An ordered set would be more efficient for vital operations like sorted-merge (aka union), intersection, subtraction etc.

    There is a need for an ordered multiset, as well as a tree-based associative array in order to make performance guarantees, rather than trusting to hashes in Dictionary<,>.
  • I haven't really had a really big need for a Set-class yet in my projects. I can image some uses though so it would be nice to have one, but since it is relatively easy to implement and since I wouldn't need it very often, I could live with having to implement one myself if I would need one. And as far as the complicated features are concerned (e.g. calculating the symmetric difference of two sets), can't say that I ever had a need for that. Naturally, it depends on what kind of projects one develops. I can imagine that somebody who's into developing games, financial or statistical software for example might have a need for such a class.
    However, while we're on the topic of enhancements to System.Collections and System.Collections.Generic: very often, I need a sort of Hashtable and Dictionary<T> that maintains the sortorder in which the elements were added.

    I currently implement such a class myself like this:
    public class MyHashtable: ICollection
    {
    private ArrayList mItems = new ArrayList();
    private Hashtable mKeys = new Hashtable();

    ...
    }

    mItems contains all the items in the original sortorder. mKeys contains the key and index in mItems of every item. This allows this class to be quickly searched using index and key.
  • I would prefer to see both types of sets included: a hash-set and a tree-based set.

    There's been some erroneous comments made before.

    A hash-set will perform much better than a tree-based set in virtually all cases. It will use less memory, has much better locality, and runs in constant time.

    It has lower constant and lower order. It's easy to see why: One does not make a data structure faster by adding work such as ordering.

    For a hash-set,
    Insert, Remove, Contains take O(1)
    Union, Difference, Intersection takes O(m+n)
    Enumeration O(N)

    For a tree-set,
    Insert, Remove, Contains take O(log(N))
    Union, Difference, Intersection takes O((M+N)log(M+N))
    Enumeration O(N)

    The hash-set has better performance, but may breakdown to linear-time for objects with poor hash functions. The tree-set provides guaranteed, but much slower average performance and it requires that the object implement IComparable.

    What I'd like to see is a WeakSet in which objects drop out after they are no longer referenced.
  • Michael, I believe that Dictionary<K,V> actually preserves the orders in which items are added, provided that no objects were previously removed.
  • A heap-based Priority Queue is also a nice add. That is a really easy add and actually requires less code than some of the other collection classes that have already been added.
  • I need a hash table that maintains insertion order, e.g. constant time searches with iteration as fast as possible. Note that the current SortedList class *does not* maintain insertion order.

    Also, here are some things that don't make much sense in the 1.1 API:
    1. IList doesn't include ToArray().
    2. IList doesn't include AddRange(), it's defined on ArrayList.
    3. SortedList doesn't implent IList. Why is it named list?
    4. ObjectCollection. WTF is this all about? How do I pass an IList to a method which wants one of these.
  • I haven't tought about this more than 1 minute, but why not create a Set class that would simply apply set operations on existing collections? Something like the Sort pattern?
  • I'm suprised you did not mentioned Peter Golde's PowerCollections project. I know he has worked closely with the BCL team on that. He's implemented Set<T>, with all the set operations, and a host of other collections mentioned in the comments here. In addition, he has a wonderful set of STL-style algorithms implemented as static methods that operate against the standard generic collection interfaces. I would simply like to see his implementations slurped into the BCL. They are quite elegant, and designed to work WITH the BCL rather than as a replacement for the existing collections. I am using them extensively.
  • lets see a non-blocking queue implementation as well.

    http://www.cs.rochester.edu/u/scott/synchronization/pseudocode/queues.html



  • Cool. I definitely like the idea of having a Set class built into the BCL! (Been wishing for that since 1.0...)

    In your first paragraph, you said that a Set without its own interface would basically just be a List<T>. This isn't right, because sets (by definition) don't contain duplicates. If you add the same element twice, the second add is just ignored.

    I've used set-like stuff before. Usually I've just used a Hashtable, but a genuine Set class would make the code more readable. The features I'd want would be foreach (with no ordering implied, just like Hashtable.Keys), Add, Remove, Contains, Union, Intersection, Clone (shallow copy), and possibly Difference.

    Anything that implies ordering (either keeping elements in the order they were originally added, or sorting them) is making the class into something other than a real set, so while such things might be useful on rare occasions, I don't see much value in putting them in the BCL. (If you're keeping the original order, but you add the same element a second time, does it stay in the original location, or move to the end? If you don't try to impose ordering onto the Set, life is so much simpler.) A Set that requires its elements to be sorted, in particular, seems like a bad idea, because most of the time, I'd be using objects that don't *have* any innate ordering -- accounts, printers, etc. -- and it'd be silly to impose ordering just to use a Set class.

    I definitely agree with what others have said about having an ISet interface. That way, even if there's only one Set implementation that ships with the BCL 2.0, there's still a standard interface, so others can write their own and still have them work with everyone else's code.
Page 1 of 2 (16 items) 12