[NBLUG/talk] Finding duplicate files

Ross Thomas spamb8r at netscape.net
Mon Jul 7 16:06:01 PDT 2003


Eric Eisenhart wrote:
> On Mon, Jul 07, 2003 at 01:25:35AM -0700, Ross Thomas wrote:
>> Also handles embedded blanks and tabs.  Misbehaves when new-lines are
>> embedded in a file name (sort & uniq aren't that sophisticated).
> 
> Actually, the GNU sort has a "-z" option, equivalent to xargs' "-0" option.

While sort and uniq are the commands that would have the problem,
and GNU sort does have a '-z' option, md5sum isn't capable of
producing null-terminated output, which defeats the '-z'.
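To illustrate, here's a sketch (my own, not from the earlier message) of
one way around that limitation: hash each file individually and append
the NUL terminator yourself, so the NUL-aware modes of GNU sort and
uniq can still be used.  The scratch files and the 'uniq -z' option (a
GNU extension) are assumptions on my part, and the sketch carries only
the hash in each record to keep the mechanics visible.

```shell
# md5sum always ends each record with a newline and offers no NUL
# output mode, so "sort -z" can't consume it directly.  Workaround
# sketch: one md5sum call per file, NUL-terminating each record
# ourselves.  (Demo files are scratch files created here.)
dir=$(mktemp -d)
printf 'same contents' > "$dir/a"
printf 'same contents' > "$dir/b"
result=$(
    for f in "$dir/a" "$dir/b"
    do
        printf '%s\0' "$(md5sum < "$f" | cut -c1-32)"
    done | sort -z | uniq -zd | tr -d '\0'
)
# Both files have identical contents, so uniq -zd leaves exactly one
# 32-character hash behind.
echo "$result"
rm -rf "$dir"
```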

> One problem here.  "ln file1 file2" will create a duplicate that actually
> refers to the same file.

This may or may not be a problem; it depends on the intent of the
user.  For searching, hard-linked files are technically duplicates
(even though they refer to the same disk storage): you have two ways
of referencing the same file contents.  A matter of semantics.  That
equivalence obviously breaks down when you start changing file
contents.  :-(
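To make the "same storage, two names" point concrete, a quick sketch
(the scratch directory and file names are mine):

```shell
# Hard-linking a file creates a second directory entry pointing at
# the same inode, i.e. the same disk storage.
dir=$(mktemp -d)
echo data > "$dir/file1"
ln "$dir/file1" "$dir/file2"
# ls -i prints the inode number before each name; both names should
# report the same inode.
ino1=$(ls -i "$dir/file1" | awk '{print $1}')
ino2=$(ls -i "$dir/file2" | awk '{print $1}')
echo "$ino1 $ino2"
rm -rf "$dir"
```

Editing through either name changes the contents seen through both,
which is why the "duplicate" reading stops holding once files are
modified.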

However, in the original shell script you could replace the md5sum
command with an invocation of the following script, placed somewhere
in the user's $PATH.  You could also let the user pick which command
to invoke via an option and substitute accordingly.

------------ Cut Here ----------------
#!/bin/sh

# For each file argument, print the MD5 hash (plus the two trailing
# spaces md5sum emits) followed by the file's inode number and name.
# Reading via stdin makes md5sum print "-" instead of the file name,
# which cut then discards.
for i in "$@"
do
    echo -n "`md5sum < \"$i\" | cut -c1-34`"
    ls -1i "$i"
done

------------ Cut Here ----------------
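For illustration, here's how the helper above might be plugged into a
duplicate-finding pipeline.  The function name "md5ls", the scratch
files, and the 'uniq -w'/'-D' options (GNU extensions) are my own
assumptions, not from the original script; printf stands in for
echo -n purely to keep the demo self-contained.

```shell
# Per-file helper as a shell function (hypothetical name "md5ls"),
# then a pipeline that sorts on the hash and reports every line whose
# first 32 columns (the hash) repeat.
md5ls() {
    for i in "$@"
    do
        printf '%s' "$(md5sum < "$i" | cut -c1-34)"
        ls -1i "$i"
    done
}
dir=$(mktemp -d)
echo dup  > "$dir/x"
echo dup  > "$dir/y"
echo solo > "$dir/z"
# Count the lines belonging to duplicate groups: the two "dup" files
# should match, the unique one should not.
dupes=$(md5ls "$dir/x" "$dir/y" "$dir/z" | sort | uniq -w32 -D | wc -l)
echo "$dupes"
rm -rf "$dir"
```

Because the hash occupies the first 32 columns of every line,
'uniq -w32' groups on the hash alone while the inode and name ride
along for reporting.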

HTH.

Ross.
