Track duplicate files and merge them as hardlinks
undup is like a number of other command-line utilities that is intended to find duplicates in a given set of files and replace them with hardlinks to save space.
There are plenty of these types of utilities. I originally wrote this in PHP because I need to run it on a NAS that provided a PHP interpreter. In order to achieve a decent performance this cached file hashes. Because the size, of the data increased, I wanted to improve the performance of this code by re-writting it in C.
Its only unique feature (that distinguishes it from the other alternatives referred here) is the use of file to cache hashes.
undup is distributed in source form and a tarball containing a statically linked executable. The recommended way to install undup is to compile from source code.
To compile undup you need a gcc-4.8 compiler. I have only tested building it with :
Download the source from GitHub and enter:
make prod
For static build:
make EMBED_GDBM=1 LXFLAGS=-static prod
Then you can copy the resulting undup
binary files to the
appropriate location (usually /usr/bin
). You can also copy the
provided undup.1
man page to the appropriate location.
I also the include the XBUILD
script to create an executable for the
ARM architecture which I use in my NAS device. Customize this file
to taste.
- EMBED_GDBM=1 : Link with an embedded gdbm library
- LXFLAGS=-static : Build static binary
The man page can be found here.
To get a help page:
[alex@pc3 undup]$ ./undup -h
Usage: ./undup [options] dir
-5: use MD5 hashes (default)
-C: disable hash caching
-c catalogue: create a file catalogue
-e: execute (disables dry-run mode)
-H algo : Select hash algo
-h|?: this help message
-I pattern: include pattern
-K: shows cache stats
-l lockfile: create a exclusive lock
-m: shows memory stats
-q: supress additional info
-S: use SHA256 hashes
-s: scan only
-V: version info
-v: show additional info
-X pattern: exclude pattern
Scan file system:
[alex@pc3 undup]$ ./undup -s ./data/test1
Using hash: md5
[SCAN-ONLY] Scanning ./data/test1
Files found: 21675
Size clusters found: 5025
Scan file system, showing memory statistics:
[alex@pc3 undup]$ ./undup -s -m ./data/test1
Using hash: md5
[SCAN-ONLY] Scanning ./data/test1
Files found: 21675
Size clusters found: 5025
De-duplicate a file system, using SHA256 checksums and showing cache stats:
[alex@pc3 undup]$ ./undup -e -K -S ./data/test1
Using hash: sha256
Scanning ./data/test1
Files found: 21675
Size clusters found: 5025
Files de-duped: 14553
Storage freed: 1.2G
Hash cache: 0:2429 (0% hit ratio)
More verbose output:
[alex@pc3 undup]$ ./undup -v -K -S ./data/test1/lb1
Using hash: sha256
Scanning ./data/test1/lb1
Files found: 9
Size clusters found: 3
Files de-duped: 3
Storage freed: 376.0K
- (180117f0):u=2001 g=1000 m=664 (323K):uthash-master.zip
-> (2009570b):raro
-> (2009570b):c/uthash-master.zip
- (180117f2):u=2001 g=1000 m=664 (52K):crypto-algorithms-master.zip
-> (2009570a):c/crypto-algorithms-master.zip
This software makes use of the following libraries:
- uthash - in memory hash table. MIT style license.
- cu - C unit test library. BSD Licensed.
- crypto-algorithms - Basic implementations of standard cryptography algorithms, like AES, MD5, and SHA-1. Public domain.
- GDBM - library of
database functions to create and manipulate a hashed database.
Licensed as GPLv3. The system gdbm library will be used unless
EMBED_GDBM=1 is used during build. This is the default for
XBUILD
. - human_readable.c from busybox - GPLv2 Licensed.
Few alternatives:
This project makes use of github actions. You can find the build status page here.
Release steps:
- Update Change log in
README.md
- Run:
./ghrelease -g x.y.z
- 2.2.4: Infra changes
- switching from travis to github actions.
- 2.2.3: Minor update
- updated documentation
- Added option '-P' to show process statistics (for debugging and VM sizing)
- hcache is inside the directory (better if the directory is a mounted filesystem)
- 2.2.2: Minor updates
- compile with musl
- add option to build static binary
- inode in catalogue is unsigned
- removing alpine releases in favor for a single static release
- 2.2.1: Maintenance release
- Code clean-ups
- Updating
alpine linux
release to v3.8.
- 2.2.0:
- Added excludes/includes options
- Added option -H to select all hashes
- 2.1.1:
- Fixing tests
- Updating libraries
- 2.1.0:
- Updated Makefile so it is also possible to build using the current
os libgdbm instead of the embedded copy using:
- make GDBM_UNPACK=: GDBM_DEP= GDBM_REF=-lgdbm prod
- Upgrading gdbm to v1.12
- Updated Makefile so it is also possible to build using the current
os libgdbm instead of the embedded copy using:
- 2.0.2:
- Automatically update version display
- Some documentation clarification
- 2.0.1: Minor updates
- Added error checks
- updated documentation
- 2.0.0: C rewrite
- Re-wrote it in C
- 1.0.0: Initial PHP release
- Create a github actions workflow to create release artifacts
- Read and display /proc/self/status
- hcd file inside directory tree
- Fix mem leak
- Migrate to gh-actions
undup is licensed as GPLv2 software.
undup
Copyright (C) 2015, 2018 Alejandro Liu
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License along
with this program; if not, see <http://www.gnu.org/licenses>