Duplicate Files 2


A long time ago I posted a filter (AddNote) for adding notes to objects.  Some time later I posted a function (Get-MD5) for calculating the MD5 hash of a file, and somebody asked how that could be used in a script to list all the files in a given folder that are very likely the same.  I like that question because the answer allows me to combine both of these functions in a way I find pretty neat.  First of all, let's create another filter called AttachMD5.

 

filter AttachMD5
{
  # Hash the incoming file and attach the result as a note called MD5.
  $md5Hash = Get-MD5 $_;
  return ($_ | AddNote MD5 $md5Hash);
}

 

The filter expects to get a [System.IO.FileInfo] object via the pipeline.  It calculates the file's MD5 hash, uses the AddNote filter to attach the hash as a note called MD5, and finally returns the object.
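By the way, if you don't have those earlier posts handy, here is a rough sketch of what AddNote and Get-MD5 do.  These bodies are approximations written against the shipping cmdlet and class names (Add-Member and System.Security.Cryptography.MD5), not the code from the original postings, so treat them as a starting point rather than the real thing:

filter AddNote ($name, $value)
{
  # Approximation: attach a note property to the incoming object and pass it along.
  $_ | Add-Member -MemberType NoteProperty -Name $name -Value $value -PassThru
}

function Get-MD5 ($file)
{
  # Approximation: takes a FileInfo and returns its MD5 hash as a
  # space-separated string of byte values, which groups cleanly with
  # Group-Object later on.
  $md5 = [System.Security.Cryptography.MD5]::Create()
  $stream = [System.IO.File]::OpenRead($file.FullName)
  $bytes = $md5.ComputeHash($stream)
  $stream.Close()
  "$bytes"
}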

 

MSH>$foo = dir test.txt | AttachMD5
MSH>$foo.MD5
216 129 182 155 10 202 51 188 245 219 199 220 92 68 140 194
MSH>

 

Now we have all the pieces we need to write a script that will tell us if there are any files that are very likely duplicates.  The plan is to get a list of FileInfo objects, attach the MD5 to each one, group by Length and MD5, and finally print out all the groups that have more than one item.  Here is one way to do that:

 

$input |
  where { $_ -is [System.IO.FileInfo] } |
  AttachMD5 |
  group-object Length,MD5 |
  where { $_.Count -gt 1 } |
  foreach { "$($_.Group | foreach { $_.FullName } )" }

 

 

Take that bit, copy it into a script along with the other functions and filters, and let's try it out.

 

MSH>"abc" > a.txt

MSH>"xyz" > b.txt

MSH>"abc" > c.txt

MSH>"xyz" > d.txt

MSH>"jkl" > e.txt

MSH>"abc" > f.txt

MSH>dir | c:\monad\getdups.msh

C:\temp\a.txt C:\temp\c.txt C:\temp\f.txt

C:\temp\b.txt C:\temp\d.txt

MSH>

 

 

If we wanted to find all the very likely duplicate files in a directory structure, we could just recurse through it and pipe the results to the script:

 

MSH>dir . -recurse | c:\monad\getdups.msh

 

Now… you should know that this script isn't exactly the most performant thing in the world.  After all, it's calculating the MD5 hash for every file, which isn't really necessary.  I'll leave improving the performance as an exercise for you guys.  One quick way to improve it would be to group by Length first, discard all the groups that don't have more than one file, and only then calculate the MD5.  Want to measure whether you are really improving performance?  Give the time-expression cmdlet a try.

 

MSH>time-expression { dir | getdups.msh }
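In case you want a starting point for that exercise, here is one rough sketch of the Length-first idea.  It is untested against the old Monad builds, so treat it as an approximation rather than the canonical answer:

# Group by Length first (cheap), keep only the size collisions, and hash just
# those candidate files before grouping by Length and MD5 as before.
$input |
  where { $_ -is [System.IO.FileInfo] } |
  group-object Length |
  where { $_.Count -gt 1 } |
  foreach { $_.Group } |
  AttachMD5 |
  group-object Length,MD5 |
  where { $_.Count -gt 1 } |
  foreach { "$($_.Group | foreach { $_.FullName } )" }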

[Edit: Monad has now been renamed to Windows PowerShell. This script or discussion may require slight adjustments before it applies directly to newer builds.]

Comments
  • Finding duplicate files based on LastWriteTime and Length:

    get-childitem -force -recurse -erroraction continue |
      sort-object lastwritetime,length |
      Group-Object -Property lastwritetime,Length |
      ? { $_.count -gt 1 } |
      foreach { $_.group | select-object directory,name,length,lastwritetime } |
      export-csv files.csv

  • Now if we could only find out why PowerShell chokes when asked to just do its job and run the script:

    M:\document\WindowsPowerShell>dir
    Volume in drive M is Data
    Volume Serial Number is 0CE7-363C

    Directory of M:\document\WindowsPowerShell

    17/10/2007  04:25 PM    <DIR>          .
    17/10/2007  04:25 PM    <DIR>          ..
    17/10/2007  04:25 PM               645 Get Duplicate Files.ps1
    17/10/2007  03:55 PM               716 Microsoft.PowerShell_profile.ps1
                  2 File(s)          1,361 bytes
                  2 Dir(s)  131,402,600,448 bytes free

    M:\document\WindowsPowerShell>powershell.exe Get_Duplicate_Files.ps1
    The term 'Get_Duplicate_Files.ps1' is not recognized as a cmdlet, function, operable program, or script file. Verify the term and try again.
    At line:1 char:23
    + Get_Duplicate_Files.ps1 <<<<

  • The problem is how you called the PS1 script. I presume that you are calling PowerShell from cmd.exe. In that case you have to do this:

    M:\document\WindowsPowerShell>powershell.exe ./Get_Duplicate_Files.ps1

    You missed the "./".

  • How to find duplicate files on a remote system based on hash:

    Scenario: dumps of software on a file server are copied to local computers, and the names or extensions of the files are changed. Now how can I find the copy of a given file on a remote system, based on the hash of the file that exists on my file server, using PowerShell?

  • You can use DuplicateFilesDeleter; it is a fast way to find and delete all duplicate files :)
