GitHub - rlp1938/Duplicates: Find duplicated files and list them to stdout. During search output broken symlinks to stderr.

Readme for program duplicates.

The aim of the program is to report files which are duplicated, i.e.
copied. To attain that aim, the pathnames of all files under the
user-supplied dir or dirs are recorded along with each file's size in
bytes and inode number. That list is then sorted on size, inode, then
pathname. The sorted list is then processed so that any file with a
unique size is discarded. Then, for any group of files having the same
size and inode number, all but the last one are discarded. The members
of such a group are necessarily all linked to one another; whether hard
linked or symlinked does not matter. At that point files of the same
size but different inodes are checked to see if any byte differs within
the first 128 Kbytes. Obviously, if they differ, a file is discarded. If
not, then the pair of files is md5summed. If the hashes match then both
files are recorded as being duplicates, otherwise the first file is
discarded.
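
The sort and the first two discard steps described above can be
sketched roughly as follows. This is only an illustration, not the code
in duplicates.c: the struct filerec type and the names cmp_filerec and
filter_candidates are hypothetical, and the records are held in memory
rather than in a file. The later stages (the 128 Kbyte comparison and
the md5sum) are not shown.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical record: one per file found during the scan. */
struct filerec {
    off_t size;
    ino_t inode;
    const char *path;
};

/* qsort comparator: order by size, then inode, then pathname. */
static int cmp_filerec(const void *a, const void *b)
{
    const struct filerec *x = a, *y = b;

    if (x->size != y->size)
        return x->size < y->size ? -1 : 1;
    if (x->inode != y->inode)
        return x->inode < y->inode ? -1 : 1;
    return strcmp(x->path, y->path);
}

/*
 * Sort recs, then compact the array in place:
 *   pass 1 drops every record whose size occurs only once;
 *   pass 2 keeps only the last record of each run that shares
 *          both size and inode (i.e. linked files).
 * Returns the number of records kept at the front of recs.
 */
static size_t filter_candidates(struct filerec *recs, size_t n)
{
    size_t i, j, out = 0;

    assert(recs != NULL || n == 0);
    if (n > 1)
        qsort(recs, n, sizeof *recs, cmp_filerec);

    /* Pass 1: a unique size cannot have a duplicate. */
    for (i = 0; i < n; i = j) {
        for (j = i + 1; j < n && recs[j].size == recs[i].size; j++)
            ;
        if (j - i > 1) {
            memmove(&recs[out], &recs[i], (j - i) * sizeof *recs);
            out += j - i;
        }
    }
    n = out;
    out = 0;

    /* Pass 2: same size and inode means the same underlying file. */
    for (i = 0; i < n; i++) {
        int last_of_run = (i + 1 == n) ||
                          recs[i + 1].size != recs[i].size ||
                          recs[i + 1].inode != recs[i].inode;
        if (last_of_run)
            recs[out++] = recs[i];
    }
    return out;
}
```

Note that after pass 2 a size may be left with a single survivor (for
example two hard links to one file and nothing else of that size). The
sketch leaves such a record in place, since the text above does not say
how that case is handled.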

The result so far is in order of md5sum, inode, then path. This is of
very little use to a human, so the groups of duplicates are reordered
by the path name of the first item in each group. That is then output
to stdout. If you redirect this to a file, that file may be used as
input to 'processdups' for manual processing.
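
One way to picture that final reordering is the sketch below. It is an
assumption about the shape of the data, not the program's actual code:
struct dupgroup, cmp_first_path and order_groups_by_path are
hypothetical names, and each group is assumed to already hold its
member paths in sorted order.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical group: the files that share one md5sum. */
struct dupgroup {
    const char *md5;     /* digest shared by the group */
    const char **paths;  /* member pathnames, at least one */
    size_t npaths;
};

/* qsort comparator: order groups by their first pathname. */
static int cmp_first_path(const void *a, const void *b)
{
    const struct dupgroup *x = a, *y = b;

    return strcmp(x->paths[0], y->paths[0]);
}

/* Reorder groups from md5sum order into first-pathname order. */
static void order_groups_by_path(struct dupgroup *groups, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        assert(groups[i].npaths > 0);
    if (n > 1)
        qsort(groups, n, sizeof *groups, cmp_first_path);
}
```

Sorting whole groups rather than individual lines keeps each set of
duplicates together while putting the sets in an order that follows the
directory layout.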

I copied md5.c, md5.h and the other requirements from GNU
coreutils-8.21. The necessary files were available after running
./configure && make on that package. I had tried libmhash but it took
about 20% to 30% longer to hash a file just under 1 gig.