UK Consulting Links
UK Consulting Blogs
As you may know when working with LINQ there is a method that allows one to return distinct elements from a sequence. Two prototypes are supported:
Distinct<TSource>(IEqualityComparer<TSource>)Distinct()
Unlike most sequence operations that take some form of a Func parameter, the Distinct() method takes an instance of an IEqualityComparer Type. It is the implementation of this Type that I will be discussing which will hopefully provide some insight into why this approach has been taken.
In a future posting, available here, I will provide a implementation of an Distinct() extension method to IEnumerable, which accepts a Func returning a property value. A performance comparison will also be performed.
I am going to demonstrate the Distinct() method using the following class definition:
public class Product{ // Key comparison elements public string Reference { get; set; } public ProductVersion ProductVersion { get; set; } // Other attributes not used on comparison public string Description { get; set; } // Ensure have a sensible string value public override string ToString() { if (this.ProductVersion == null) { return string.Format("Product: {0}", this.Reference ?? "Null Product"); } else { return string.Format("Product: {0}, Version: {1}-{2}-{3}", this.Reference, this.ProductVersion.VersionName, this.ProductVersion.VersionNumber, this.ProductVersion.SubVersionNumber); } } // Derive a unique identifier public string UniqueIdentifier { get { if (this.ProductVersion == null) { return this.Reference; } else { return string.Format("{0}:{1}-{2:00}-{3:00}", this.Reference, this.ProductVersion.VersionName, this.ProductVersion.VersionNumber, this.ProductVersion.SubVersionNumber); } } }} public class ProductVersion{ public int VersionNumber { get; set; } public int SubVersionNumber { get; set; } public string VersionName { get; set; }}
A distinct product is one defined where the Reference and ProductVersion values are unalike. The ProductVersion will also only be distinct if all its properties are unalike.
For testing and demonstration purposes the following code will be used:
// Build collectionList<Product> products = new List<Product>();products.Add(new Product() { Reference = "AA", Description = "Bike AA", ProductVersion = new ProductVersion() { VersionNumber = 1, SubVersionNumber = 0, VersionName = "A" } });products.Add(new Product() { Reference = "AA", Description = "Bike AAv2", ProductVersion = new ProductVersion() { VersionNumber = 1, SubVersionNumber = 1, VersionName = "A" } });products.Add(new Product() { Reference = "AA", Description = "Bike AAv3", ProductVersion = new ProductVersion() { VersionNumber = 2, SubVersionNumber = 0, VersionName = "A" } });products.Add(new Product() { Reference = "AB", Description = "Bike AB", ProductVersion = new ProductVersion() { VersionNumber = 1, SubVersionNumber = 0, VersionName = "B" } });products.Add(new Product() { Reference = "AC", Description = "Bike ACv1", ProductVersion = new ProductVersion() { VersionNumber = 1, SubVersionNumber = 0, VersionName = "C" } });products.Add(new Product() { Reference = "AC", Description = "Bike ACv2", ProductVersion = new ProductVersion() { VersionNumber = 1, SubVersionNumber = 1, VersionName = "C" } }); // Add some duplicatesproducts.Add(new Product() { Reference = "AA", Description = "Bike AAv2", ProductVersion = new ProductVersion() { VersionNumber = 1, SubVersionNumber = 1, VersionName = "A" } });products.Add(new Product() { Reference = "AA", Description = "Bike AAv3", ProductVersion = new ProductVersion() { VersionNumber = 2, SubVersionNumber = 0, VersionName = "A" } });products.Add(new Product() { Reference = "AC", Description = "Bike ACv2", ProductVersion = new ProductVersion() { VersionNumber = 1, SubVersionNumber = 1, VersionName = "C" } }); IEnumerable<Product> distinctProducts = products.Distinct<Product>(new ProductEqualityComparer()); foreach (Product product in distinctProducts.OrderBy(prod => prod.UniqueIdentifier)) Console.WriteLine(product.ToString());
The line creating the distinct collection is:
IEnumerable<Product> distinctProducts = products.Distinct<Product>(new ProductEqualityComparer());
public class ProductEqualityComparer : IEqualityComparer<Product>{ public bool Equals(Product x, Product y) { // If reference same object including null then return true if (object.ReferenceEquals(x, y)) { return true; } // If one object null the return false if (object.ReferenceEquals(x, null) || object.ReferenceEquals(y, null)) { return false; } // Compare properties for equality return (x.Reference == y.Reference && x.ProductVersion == y.ProductVersion); } public int GetHashCode(Product obj) { return obj.GetHashCode(); }}
If you run and debugs this code you will quickly see that the Distinct() operation does not return a collection of products that are distinct from each other according to the stated criteria for uniqueness. If you debug the code you will also see that the Equals() method of the comparer class never actually gets called. So what is happening?
The important method here is GetHashCode(). What is actually happening is that the Distinct() operation is first using the comparers implementation of GetHashCode() to first derive a hash-code for the Product. Only when a product returns like hash-codes is Equals() called.
This implementation returns a hash-code based on the object instance. Thus one based on the product values is needed:
public int GetHashCode(Product obj){ if (object.ReferenceEquals(obj, null)) { return 0; } int referencehHash = obj.Reference == null ? 0 : obj.Reference.GetHashCode(); int versionHash = obj.ProductVersion == null ? 0 : obj.ProductVersion.GetHashCode(); return referencehHash ^ versionHash;}
Of course this implementation assumes a couple of things. Firstly that equality comparisons of the ProductVersion return meaningful results, and secondly that the GetHashCode() returns a value based on the ProductVersion instance values; rather than the base implementation. For completeness here is the ProductVersion implementation which overrides the == operator:
public class ProductVersion{ public int VersionNumber { get; set; } public int SubVersionNumber { get; set; } public string VersionName { get; set; } public static bool operator ==(ProductVersion x, ProductVersion y) { // If reference same object including null then return true if (object.ReferenceEquals(x, y)) { return true; } // If one object null the return false if (object.ReferenceEquals(x, null) || object.ReferenceEquals(y, null)) { return false; } // Compare object property values return VersionsEqual(x, y); } public static bool operator !=(ProductVersion x, ProductVersion y) { return !(x == y); } public override bool Equals(object obj) { // If comparing to null return false if (object.ReferenceEquals(obj, null)) { return false; } // Cast and compare object property values ProductVersion version = (ProductVersion)obj; return VersionsEqual(this, version); } public override int GetHashCode() { int nameHash = this.VersionName == null ? 0 : this.VersionName.GetHashCode(); return nameHash ^ this.VersionNumber ^ this.SubVersionNumber; } private static bool VersionsEqual(ProductVersion x, ProductVersion y) { // Compare properties for equality return (x.VersionNumber == y.VersionNumber && x.SubVersionNumber == y.SubVersionNumber && x.VersionName == y.VersionName); }}
Remember, one has to use object.ReferenceEquals() calls for null checking when overriding the == operator
The IEqualityComparer thus firstly utilizes the product hash-code to place the product into a hast table. Only if a collision occurs is is the Equals() method called to determine uniqueness. You can easily see this in action if you code a GetHashCode() method that returns constant value for all products.
Once all this has been implemented the Distinct() methods works as expected, returning:
Product: AA, Version: A-1-0Product: AA, Version: A-1-1Product: AA, Version: A-2-0Product: AB, Version: B-1-0Product: AC, Version: C-1-0Product: AC, Version: C-1-1
In conclusion, distinct elements are found only when the hash-code are equal and the products with the same hash-code are equal based on the Equals() method call.
Hopefully one can see that this implementation is a lot more efficient than a process based on sorting the collection; and explains why the returned collection is in an unspecified order.
Written by Carl Nolan