[NBLUG/talk] Finding duplicate files
Ross Thomas
spamb8r at netscape.net
Mon Jul 7 16:06:01 PDT 2003
Eric Eisenhart wrote:
> On Mon, Jul 07, 2003 at 01:25:35AM -0700, Ross Thomas wrote:
>> Also handles embedded blanks and tabs. Misbehaves when new-lines are
>> embedded in a file name (sort & uniq aren't that sophisticated).
>
> Actually, the GNU sort has a "-z" option, equivalent to xargs' "-0" option.
While sort and uniq are the commands that will have the problem, and
sort has the '-z' option, md5sum isn't capable of producing
null-terminated output, which would defeat the '-z'.
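One way around that limitation is to rebuild the NUL-terminated records
yourself instead of expecting md5sum to emit them. This is only a sketch,
and it assumes bash (for read -d '') plus GNU coreutils recent enough
that uniq also accepts '-z':

```shell
# Sketch: NUL-delimited duplicate finder that survives newlines in
# file names. Assumes bash and GNU sort/uniq with -z support.
dir=$(mktemp -d)
printf 'same'  > "$dir/copy1"
printf 'same'  > "$dir/copy2"
printf 'other' > "$dir/unique"
find "$dir" -type f -print0 |
while IFS= read -r -d '' f; do
    # md5sum reads stdin, so it prints "-" instead of a file name;
    # keep just the 32-char hash and append the real name ourselves,
    # terminating each record with NUL instead of newline.
    printf '%s  %s\0' "$(md5sum < "$f" | cut -c1-32)" "$f"
done |
sort -z | uniq -z -w32 --all-repeated | tr '\0' '\n'
rm -r "$dir"
```

The final tr is only there to make the result readable on a terminal;
drop it if you feed the output into another NUL-aware tool.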
> One problem here. "ln file1 file2" will create a duplicate that actually
> refers to the same file.
This may or may not be a problem, depending on the intent of the user.
For searching, hard-linked files are technically duplicates (even though
they refer to the same disk storage): you have two ways of referencing
the same file contents. A matter of semantics. This obviously breaks
down when you start changing file contents. :-(
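To illustrate the semantic point, here's a minimal sketch (it assumes
GNU stat with the '-c' format option):

```shell
# Two names, one file: a hard link shares the original's inode,
# so the "duplicate" contents occupy no extra disk storage.
dir=$(mktemp -d)
echo "hello" > "$dir/file1"
ln "$dir/file1" "$dir/file2"            # second directory entry, same inode
stat -c '%i' "$dir/file1" "$dir/file2"  # prints the same inode number twice
rm -r "$dir"
```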
However, in the original shell script you could replace the md5sum
command with an invocation of the following script, located in the
user's $PATH. You could also let an option decide which of the two
to invoke and substitute the command accordingly.
------------ Cut Here ----------------
#!/bin/sh
# For each argument, print its MD5 checksum followed by its inode
# number and name, so hard links can be told apart from true copies.
# (printf instead of "echo -n", which isn't portable under /bin/sh.)
for i in "$@"
do
    printf '%s' "$(md5sum < "$i" | cut -c1-34)"
    ls -1i "$i"
done
------------ Cut Here ----------------
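For example, saved under a name of your choosing ("md5ino" below is my
label, not anything from the thread) and made executable, the wrapper
prints one line per file; same hash plus same inode means a hard link,
same hash with different inodes means a true copy on disk:

```shell
# Hypothetical demo of the wrapper: "md5ino" is an assumed name.
dir=$(mktemp -d)
cat > "$dir/md5ino" <<'EOF'
#!/bin/sh
for i in "$@"
do
    printf '%s' "$(md5sum < "$i" | cut -c1-34)"
    ls -1i "$i"
done
EOF
chmod +x "$dir/md5ino"
echo "hello" > "$dir/orig"
ln "$dir/orig" "$dir/link"
# Both lines show the same 32-char hash and the same inode number.
"$dir/md5ino" "$dir/orig" "$dir/link"
rm -r "$dir"
```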
HTH.
Ross.