Skip to content

Find duplicated files and list them to stdout. During search output broken symlinks to stderr.

License

Notifications You must be signed in to change notification settings

rlp1938/Duplicates

Repository files navigation

Readme for program duplicates.

The aim of the program is to report files which are duplicated, ie
copied. To attain that aim the pathnames to all files under the user
input dir or dirs are recorded along with the size in bytes and inode
number. This file is then sorted on size, inode then pathname. That
file is then processed so that, any file with a unique size is
discarded. Then for any group of files having the same size and inode
number, all but the last one is discarded. That group necessarily are
all linked, hard linked or symlinked, it does not matter. At that poiint
files of the same size but different inodes are checked to see if any
byte differs within the first 128 Kbytes. Obviously if they diifer a
files is discarded. If not then the pair of files are md5summed. If the
hashes match then both files are recorded as being duplicates, otherwise
the first file is discarded.

The result so far is in order of md5sum, inode, then path. This is of
very little use to a human, so the groups of duplicates are grouped
by the path name of the first item in the group. That is then output to
stdout. If you redirect this to a file, such file may be used as input
to 'processdups' for manual processing.

I copied the md5.c, md5.h and other requirements from GNU coreutils-8.21
The necessary files were available after ./configure && make on that
package. I had tried libmhash but it took about 20% to 30% longer to
hash a file just ander 1 gig.

About

Find duplicated files and list them to stdout. During search output broken symlinks to stderr.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published