Find Duplicates using LINQ

Sometimes you need to find the duplicates in a list. I'm currently developing a little utility that tests code in a word processing document (now posted, you can find it here). Each code snippet in the document has an identifier, and one of the rules that I'm imposing on this code testing utility is that there should be no duplicate identifiers in the set of documents that contain code snippets to be tested. An easy way to find duplicates is to write a query that groups by the identifier and then filters for groups that have more than one member. In the following example, we want to know that 4 and 3 are duplicates:

int[] listOfItems = new[] { 4, 2, 3, 1, 6, 4, 3 };
var duplicates = listOfItems
    .GroupBy(i => i)
    .Where(g => g.Count() > 1)
    .Select(g => g.Key);
foreach (var d in duplicates)
    Console.WriteLine(d);
 

This produces the following:

4
3
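
The same grouping pattern extends to objects keyed by an identifier, which is closer to the snippet-identifier rule described above. What follows is only a minimal sketch, not the utility's actual code: the `Snippet` type, its `Id` and `Document` properties, and the `FindDuplicateIds` helper are hypothetical names made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical type for illustration: a code snippet and the document it came from.
public class Snippet
{
    public string Id { get; set; }
    public string Document { get; set; }
}

public static class DuplicateFinder
{
    // Returns each identifier that occurs more than once, mapped to its number of occurrences.
    public static Dictionary<string, int> FindDuplicateIds(IEnumerable<Snippet> snippets)
    {
        return snippets
            .GroupBy(s => s.Id)
            .Where(g => g.Count() > 1)
            .ToDictionary(g => g.Key, g => g.Count());
    }
}

public static class Program
{
    public static void Main()
    {
        // Sample data is invented for this sketch.
        var snippets = new[]
        {
            new Snippet { Id = "S001", Document = "First.docx" },
            new Snippet { Id = "S002", Document = "First.docx" },
            new Snippet { Id = "S001", Document = "Second.docx" },
            new Snippet { Id = "S003", Document = "Second.docx" },
        };

        foreach (var pair in DuplicateFinder.FindDuplicateIds(snippets))
            Console.WriteLine("{0} appears {1} times", pair.Key, pair.Value);
    }
}
```

Returning the counts along with the keys makes it easier to report how many times each identifier was repeated.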

  • Hi Eric,

    I am looking to remove duplicates when loading XML into a Dictionary. I have achieved it using the following:

    counties = XDocument.Load(HttpContext.Current.Server.MapPath("/data/counties.xml"))
        .Descendants("Row")
        .GroupBy(i => i.Attribute("Code").Value)
        .Where(g => g.Count() == 1)
        .Select(g => new { Code = g.Key, Desc = g.First().Attribute("Descrip").Value })
        .ToDictionary(x => x.Code, x => x.Desc);

    So I remove the duplicates by grouping by the "Code" attribute, allowing only groups with one element.

    Is there a more elegant way?

  • Actually, the above code doesn't work as I want it to: .Where(g => g.Count() == 1) means that if duplicates exist, both entries are removed, not just the "copies".

  • Hi Daniel,

    I think that you can just remove the Where, and it will work.  You would then be adding one entry into the dictionary for each unique Key.  You also don't need the Select - you can write the ToDictionary like this:

    .ToDictionary(g => g.Key, g => g.First().Attribute("Descrip").Value)

    -Eric
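
Put together, the suggestion above might look like the following sketch. It assumes the same `Row` elements with `Code` and `Descrip` attributes as in the question, but it parses a small inline document with `XDocument.Parse` instead of loading a file, and the sample values, `CountyLoader`, and `Load` are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class CountyLoader
{
    // Builds a Code -> Descrip dictionary, keeping the first row seen for each code.
    public static Dictionary<string, string> Load(XDocument doc)
    {
        return doc
            .Descendants("Row")
            .GroupBy(r => r.Attribute("Code").Value)
            .ToDictionary(g => g.Key, g => g.First().Attribute("Descrip").Value);
    }
}

public static class Program
{
    public static void Main()
    {
        // Invented sample data; code "A" appears twice.
        var doc = XDocument.Parse(
            @"<Rows>
                <Row Code=""A"" Descrip=""First A"" />
                <Row Code=""B"" Descrip=""Only B"" />
                <Row Code=""A"" Descrip=""Second A"" />
              </Rows>");

        foreach (var pair in CountyLoader.Load(doc))
            Console.WriteLine("{0} = {1}", pair.Key, pair.Value);
    }
}
```

Because each group contributes exactly one entry, ToDictionary never sees the same key twice, so it does not throw on the duplicated code.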

  • Hi Eric,

    I am looking to do something similar to your example but I have my items in 2 collections:

    string[] x = new string[] { @"firstPath\a.txt", @"firstPath\b.txt", @"firstPath\c.txt", @"firstPath\d.txt" };

    string[] y = new string[] { @"secondPath\a.txt", @"secondPath\e.txt", @"secondPath\f.txt", @"secondPath\g.txt" };

    I want to end up with the results:

    { @"firstPath\a.txt", @"secondPath\e.txt", @"secondPath\f.txt", @"secondPath\g.txt" }

    I've tried different combinations of Except(), Intersect(), and Union() with Lambdas but just can't seem to get the right results.

    Any assistance is greatly appreciated!

  • Hi Dave,

    You could do something like this:

    // uniqueness is based on the 'BaseName' so here is a function to get it
    static string BaseName(string path) { return path.Split('\\').ElementAt(1); }

    static void Main(string[] args)
    {
       string[] x = new string[] { @"firstPath\a.txt", @"firstPath\b.txt",
           @"firstPath\c.txt", @"firstPath\d.txt" };
       string[] y = new string[] { @"secondPath\a.txt", @"secondPath\e.txt",
           @"secondPath\f.txt", @"secondPath\g.txt" };

       // find all elements in x that are also in y
       var x1 = x.Where(p => y.Select(z => BaseName(z)).Contains(BaseName(p)));

       // find all elements in y that are not in x
       var y1 = y.Where(p => !x1.Select(z => BaseName(z)).Contains(BaseName(p)));

       // concatenate for complete collection
       var all = x1.Concat(y1);

       foreach (var z in all)
           Console.WriteLine(z);
    }

    You could do optimizations by materializing into intermediate arrays - should be done based on need and your real-world data.

    -Eric
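
One possible reading of the "materializing" suggestion above is to compute the base names once and keep them in a `HashSet<string>`, so that each membership test no longer re-enumerates the other array. This is a sketch under that assumption rather than a measured optimization; `PathMerger` and `Merge` are hypothetical names, and `BaseName` is copied from the reply above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PathMerger
{
    // Same rule as in the reply above: the segment after the first backslash.
    static string BaseName(string path) { return path.Split('\\').ElementAt(1); }

    public static List<string> Merge(string[] x, string[] y)
    {
        // Materialize the base names of y once.
        var yNames = new HashSet<string>(y.Select(p => BaseName(p)));

        // Elements of x whose base name also appears in y.
        var x1 = x.Where(p => yNames.Contains(BaseName(p))).ToList();

        // Base names that were taken from x.
        var x1Names = new HashSet<string>(x1.Select(p => BaseName(p)));

        // Elements of y whose base name was not already taken from x.
        var y1 = y.Where(p => !x1Names.Contains(BaseName(p)));

        return x1.Concat(y1).ToList();
    }
}

public static class Program
{
    public static void Main()
    {
        string[] x = { @"firstPath\a.txt", @"firstPath\b.txt", @"firstPath\c.txt", @"firstPath\d.txt" };
        string[] y = { @"secondPath\a.txt", @"secondPath\e.txt", @"secondPath\f.txt", @"secondPath\g.txt" };

        foreach (var z in PathMerger.Merge(x, y))
            Console.WriteLine(z);
    }
}
```

Whether this is worth doing depends on the size of the real data; for arrays of four items the original query is perfectly adequate.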

  • Hi Eric,

    Thank you so much for your help and quick response!  It works perfectly!

    I have a few followup questions (just for my own learning):

    1. Though the end result is the same, is there any reason one would use Concat() instead of Union() in this case?  Note that order is of no importance here.

    2. Is there any way to do this as I was originally trying to do, using any of Intersect(), Except(), or Where()?

    3. What is "Best Practice" - using the strongly-typed generic versions of LINQ methods or the non-generic (e.g. Select() vs. Select<TSource, TResult>())?

    Thanks again for your time - this is extremely helpful!

  • Hi Dave,

    >> Though the end result is the same, is there any reason one would use Concat() instead of Union() in this case?  Note that order is of no importance here.

    Concat will perform better than Union, which must check for duplicates.  Both are lazy, but Concat simply yields the items of the first collection followed by the items of the second, whereas Union has to remember every item it has already yielded (in a hash set) so that it can skip repeats, which costs extra time and memory.

    >> Is there any way to do this as I was originally trying to do, using any of Intersect(), Except(), or Where()?

    Problem is, you are not really intersecting sets.  Your rules are: if the basename exists in the first set, take it.  Then take all items from the second list where the basename isn't in the first set.  If you don't care which list the items in the result come from, then you could use Intersect to find any items in lists that are also in other lists, and you could use Except to include items from each source list that don't exist in other lists.  Alternatively, you could keep around a 'priority' to indicate which list the full path should come from.  One approach to using those methods would be to make your own equality comparer.

    >> What is "Best Practice" - using the strongly-typed generic versions of LINQ methods or the non-generic (e.g. Select() vs. Select<TSource, TResult>())?

    Actually, in both cases, you are using the same Select<TSource, TResult> method.  When you don't specify the type parameters, then the C# compiler infers those types.  There are a number of places where the C# compiler infers types - using the var keyword to declare a variable, or using a lambda expression where you don't specify the types of the arguments to the lambda.  This is another place where the compiler does type inference.

    -Eric
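
The equality-comparer approach mentioned above could be sketched roughly as follows. `BaseNameComparer`, `ComparerDemo`, and `Merge` are hypothetical names introduced here, the sketch assumes every path is non-null and contains a backslash, and it relies on `Intersect` and `Except` yielding elements from the sequence they are called on.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Treats two paths as equal when their base names match.
// Assumes non-null paths that contain a backslash.
public class BaseNameComparer : IEqualityComparer<string>
{
    static string BaseName(string path) { return path.Split('\\').ElementAt(1); }

    public bool Equals(string a, string b)
    {
        return BaseName(a) == BaseName(b);
    }

    public int GetHashCode(string path)
    {
        return BaseName(path).GetHashCode();
    }
}

public static class ComparerDemo
{
    public static List<string> Merge(string[] x, string[] y)
    {
        var comparer = new BaseNameComparer();

        // Elements of x whose base name also occurs in y.
        var shared = x.Intersect(y, comparer);

        // Elements of y whose base name does not occur in x.
        var onlyInY = y.Except(x, comparer);

        return shared.Concat(onlyInY).ToList();
    }
}

public static class Program
{
    public static void Main()
    {
        string[] x = { @"firstPath\a.txt", @"firstPath\b.txt", @"firstPath\c.txt", @"firstPath\d.txt" };
        string[] y = { @"secondPath\a.txt", @"secondPath\e.txt", @"secondPath\f.txt", @"secondPath\g.txt" };

        foreach (var z in ComparerDemo.Merge(x, y))
            Console.WriteLine(z);
    }
}
```

Calling Intersect on x is what makes the shared item come from the first list; swapping the arguments would pick the path from the second list instead.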

  • Thanks, this little code snippet saved me a lot of manual labor.

  • Is there a short way to find duplicate keys in any row or column?
