Welcome to MSDN Blogs Sign in | Join | Help

Tables Part 6: Removing duplicates from tables (and ranges) of data

Being able to remove duplicate rows of information from a table of data is a request we hear fairly often from our customers (and one of the top questions in the comments in this blog).  Some users know that this capability exists in Excel today; unfortunately it is buried under advanced filter settings and it’s not terribly easy to use.  So we set out in Excel 12 to build a better interface specifically for this task so that any user could easily remove unnecessary data from their spreadsheet.

Remove duplicates can be found in two places in Excel 12, on the Data ribbon as well as the Table ribbon (just like sort and filters, it’s not necessary to have a table in order to use this feature).  To use the feature, a user simply has to select the data they want to examine for duplicates and press the “Remove Duplicates” button.  This will bring up a dialog that looks like this:


(Click to enlarge)

You’ll notice that all my column headers appear in the dialog.  To remove duplicates, just select the columns that Excel should use to evaluate duplicates.  For example, in my table above I want to remove all duplicate rows where the first name is the same and the last name is the same.  In other words, if there is more than one row where FirstName = David and LastName = Gainer then the extra duplicate rows will be removed.  So my table which looked like this:


(Click to enlarge)

Now looks like the following after I remove duplicates.


(Click to enlarge)

Note that remove duplicates physically removes data from your spreadsheet.  It does not hide rows.  You can, of course, back out (undo) of remove duplicates if you make a mistake.  If you wanted to first take a look at the duplicate values, you could use the new “Highlight Duplicate Values” feature that we have added to conditional formatting to do so.

In my next post I will finish up table-specific features by reviewing the work we did with table styles.

Published Friday, November 04, 2005 3:13 PM by David Gainer
Filed under:

Comments

# re: Tables Part 6: Removing duplicates from tables (and ranges) of data

Friday, November 04, 2005 7:10 PM by Harlan Grove
In current versions when one uses advanced filters to remove duplicates, all fields in the result (so all fields in the source range if filtered in place) are included when checking for duplicates. Will XL12 default to all fields selected in the new dialog?

Also, using advanced filters, one has the option of copying the filtered data to a different location. Will that be an option in XL12 using the new UI?

As for removing duplicates, presumably this is top-to-bottom, so the topmost record of possibly many duplicates remains and subsequent duplicate records deleted.

One thing that screws up unique record filtering is visually duplicate records that have different amounts of leading or trailing spaces in various fields. It's not always practical simply to remove all spaces, and Excel currently provides no means other than the TRIM function and macros to remove only leading/trailing spaces. It'd be VERY USEFUL if this new feature included a checkbox to ignore (or not) leading and trailing spaces (and define space liberally as either characters with decimal codes 32 or 160).

# re: Tables Part 6: Removing duplicates from tables (and ranges) of data

Friday, November 04, 2005 7:53 PM by David Gainer
Hi - yes, all fields selected by default. The feature removes data in place. Good feedback on the spaces - as the feature currently sits, you would need to use the extra column.

# re: Tables Part 6: Removing duplicates from tables (and ranges) of data

Friday, November 04, 2005 8:10 PM by Rob van Gelder
Nice work on making this quicker.
Unique Advanced Filter is something I use a lot - and, you're right - there are a lot of clicks to get there.

The ability to select just some columns seems, at first, like a useful feature. Once I started thinking, I dont think I'll use it much.
The problem is that some of your checkout times are lost.
I cant think of a time where I've wanted to destroy remaining values but just keep the first.

It would be more useful if there were options for what to do with the remaining columns.
Such as:
Use first, use last, sum, min, max, average, join with comma separator, join with <user defined separator>

In your example, I might have wanted to keep the min(checkout) and max(checkin).


Regarding Harlan's comments (are you renaming to the David and Harlan blog yet? ;)

Agree there should be a "copy filtered to" feature.

On the autotrim checkbox:
I would not like to see that feature. In my experience, the users who would use that feature are the very same who let untrimmed strings slide in their raw data - which leads to all sorts of integrity issues later.
Better to have those features in a 'data cleaning' window which concentrates such features. eg. trim spaces, Proper Case, alphanum only, text stored numbers to real numbers, etc...


Nice work - I cant wait to try the beta...

# re: Tables Part 6: Removing duplicates from tables (and ranges) of data

Friday, November 04, 2005 8:25 PM by Jean Martineau
Thank’s for the remove duplicates improvement.

I agree with Harlan that we need to keep
|the option of copying the filtered data to a
|different location.
In this case, I also need to keep an option to specify a Criteria range.

Jean

# re: Tables Part 6: Removing duplicates from tables (and ranges) of data

Friday, November 04, 2005 9:51 PM by Dave Solimini
wow. it would be difficult for me to express how incredibly happy i am to see this feature improved! You folks look to be doing some amazing work. Hats-off!

# re: Tables Part 6: Removing duplicates from tables (and ranges) of data

Friday, November 04, 2005 10:40 PM by Ken Clifton
I too would like to see options to move the data to a different location, sheet, file, etc.

Also have an option for each field you could pick keep first, last, Nth position.

Also be able to pick what each field will do like MIN, MAX, AVG, COUNT, SUM, OLDEST, NEWEST, etc.

That way if you had an employee with data you could average some columns, sum others, get the MIN, MAX what ever you might need with this one feature. This would be a GREAT selling point.

# re: Tables Part 6: Removing duplicates from tables (and ranges) of data

Saturday, November 05, 2005 1:04 AM by Biff
I'll echo what many have already noted:

Copying filtered data to another location is very useful, especially the unique values.

# re: Tables Part 6: Removing duplicates from tables (and ranges) of data

Saturday, November 05, 2005 1:06 AM by Orion Adrian
I love the data cleaning idea. Stuff like trim, change case, reformat, remove charactes (like ()$.-_, etc.) would be very useful ot me.

Orion Adrian

# re: Tables Part 6: Removing duplicates from tables (and ranges) of data

Saturday, November 05, 2005 9:27 AM by Roy
All the tables in your screenshots look great! I hope there's a new easy way to format them like that.

# re: Tables Part 6: Removing duplicates from tables (and ranges) of data

Sunday, November 06, 2005 12:17 PM by Harlan Grove
Rob van Gelder...
...
|Better to have those features in a 'data
|cleaning' window which concentrates such
|features. eg. trim spaces, Proper Case, alphanum
|only, text stored numbers to real numbers,
|etc...
...

I agree. If the goal of all these changes is the greatest good for the greatest number of users for the fewest keystrokes or mouse clicks, then someone on the development team need to realize that a very great deal of the data stored and manipulated in spreadsheets is text rather than numbers, and Excel provides very few tools for manipulating text that don't require using formulas in worksheet cells.

With respect to autotrim, it kinda exists already in Data > Text to Columns, which can be run using fixed width fields but only one field. Excel will strip off leading and trailing ACSII spaces (decimal character code 32) but not nonbreaking HTML spaces (decimal character code 160). This is pretty obscure.

There are also frequent questions about converting ranges to all upper case or proper case. The only practical approach currently is writing UPPER or PROPER formulas elsewhere, pasting their values on top of the original range, and clearing the range of formulas. A lot of work for most users.

If it's a good idea to relieve users of the complexity of using formulas in conditional formatting, wouldn't it also be a good idea to relieve them ot the complexity of using formulas in data cleansing?

# re: Tables Part 6: Removing duplicates from tables (and ranges) of data

Monday, November 07, 2005 1:11 AM by Arun Philip
It would be nice if there's some way of highlighting the values and then getting a user approval prior to removing the duplicates. Basically, it would combine the functionality of Conditional Formatting's Highlight Duplicate Values, with the Remove Duplicates Feature.

# re: Tables Part 6: Removing duplicates from tables (and ranges) of data

Monday, November 07, 2005 1:38 PM by headtoadie
Sometimes it is the duplicates that we *really* want and not the uniques.

For example, if I have a customer history file and I want to identify the repeat customers then it is the duplicates that I'm really after.

# re: Tables Part 6: Removing duplicates from tables (and ranges) of data

Monday, November 07, 2005 4:32 PM by David Gainer
Hello folks - again, thanks for the comments and suggestions from everybody.

headtoadie - you can use conditional formatting (in Excel 12) to highlight dupes and then filter on the conditional formatting, for a total of about 3-4 mouse clicks.

everyone -

...
|Better to have those features in a 'data
|cleaning' window which concentrates such
|features. eg. trim spaces, Proper Case, alphanum
|only, text stored numbers to real numbers,
|etc...
...
|I agree. If the goal of all these changes is
|the greatest good for the greatest number of
|users for the fewest keystrokes or mouse
|clicks, then someone on the development team
|need to realize that a very great deal of the
|data stored and manipulated in spreadsheets
|is text rather than numbers, and Excel
|provides very few tools for manipulating text
|that don't require using formulas in |worksheet cells.
...

What are the most common 'data clensing' tasks you find yourself doing?

# re: Tables Part 6: Removing duplicates from tables (and ranges) of data

Monday, November 07, 2005 5:53 PM by Harlan Grove
David Gainer...
...
|What are the most common 'data clensing' tasks
|you find yourself doing?

Myself? I don't use Excel for data cleansing.

My impression of users asking about this in newsgroups or stating problems that would be resolved by data cleansing, deleting leading and trailing ASCII spaces, deleting nonbreaking spaces, changing letter case and converting text that looks numeric into numbers are the most frequent data cleansing/conditioning needs. All except changing letter case could be accomplished using Data > Text to Columns.

# re: Tables Part 6: Removing duplicates from tables (and ranges) of data

Tuesday, November 08, 2005 4:10 AM by XL-Dennis
David,

I fully agree with Harlan Grove. Per se the data cleansing should be less in view of the XML-era and general development but it tend to be the opposite.

The keyissue is that many corporate central systems in use can create, as output, so called Excel-files (core textfiles with the xls-extension) which looks great on it's face but create a demand for cleaning up before anything can be done with the files. It also involve the issue with dot and comma as separators for values.

Of course we can argue that this is something that the system vendors should solve but usually this is not a keyissue for central IT-departments in their dialogs with the vendors. In the end it's up to to the end-users to solve it in Excel which is reflected in newsgroups and public forums.

Kind regards,
Dennis

# re: Tables Part 6: Removing duplicates from tables (and ranges) of data

Tuesday, November 08, 2005 5:52 AM by Nigel Harper
Dave,

I think it is more important to ask why we end up with duplicates in the first place. XL-Dennis rightly points out that many text files created by exporting data from other apps and renaming with an xls extension often contain extra characters and confuse numeric text and numbers.

Additionally gadgets containing contacts to be imported into Outlook are a major bugbear. When I worked at a large investment bank, part of my duties were to be the in-house gadget guru. Almost every new banker arrived with his/her gizmo. This generally contained around 2500 contacts made up of generations of contacts from the various preceding gadget of the day. Each one, sometimes even each model from the same manufacturer, has different ways of storing 'multi-field' fields such as address, the synchronising software is not much better. Assistants would tear their hair out trying to filter and sort for duplicates that had built up over the years sometimes because of a linefeed or tab character etc. Inevitably it would come to me in the form of a spreadsheet to use text function formulas to sort things out.

Harlan is correct that much can be done with text to columns but there is more that it could do. To be able to evaluate rows by accounting for leading, trailing and instances of strings, automatic trimming and cleaning, numeric evaluation and case sensitivity WITHOUT actually having to change the contents would be very useful.

'Duplicates' is actually a bit of a misnomer, most of the time the reason there are duplicates is because the similar rows are not actually duplicates. Hope this helps,

Nigel

# re: Tables Part 6: Removing duplicates from tables (and ranges) of data

Tuesday, November 08, 2005 10:38 AM by headtoadie
On data cleaning...one area where we experience problems is with text strings that contain a single "E" (ex. 72E0050). Excel tends automatically convert these to scientific notation. Once it is converted, the only way to get it back to original format is to retype it.

# Revealed - Thousands of leads in 6 minutes and 37 seconds

New Comments to this post are disabled
 
Page view tracker