Skip to content

Dupsanalyzer is a tool for collecting duplicate statistics for real datasets

License

GPL-2.0, GPL-2.0 licenses found

Licenses found

GPL-2.0
LICENSE
GPL-2.0
COPYING
Notifications You must be signed in to change notification settings

jtpaulo/dupsanalyzer

Repository files navigation

Duplicates analyzer

Dupsanalyzer is a tool that can be used for processing a specific real dataset and extracting statistics regarding duplicate content. The tool is based on DEDISgen code written by me and on the Rabin fingerprinting code written by Joel Lawrence Tucci.

The main novelty of Dupsanalyzer over DEDISgen is that it also allows analyzing data with variable sized fingerprints (Rabin fingerprints). On the other hand, if you intend to generate a distribution for running with DEDISbench (a realistic benchmark for deduplication systems) you should use DEDISgen, because DEDISbench is designed for fixed-size block distributions.

Rabin fingerprints expected size is defined by the user. Moreover, generated fingerprints are forced to have a minimum size and maximum size, respectively, half or the double of the size defined by the user.

For more information regarding DEDISgen you may read these two published papers:

  • "DEDISbench: A Benchmark for Deduplicated Storage Systems"
  • "Towards an Accurate Evaluation of Deduplicated Storage Systems" or visit the project at:
  • https://github.com/jtpaulo/dedisbench

For more information about the Rabin Fingerprints you may visit:

Running Dupsanalyzer

-f or -d Find duplicates in folder -f or in a Disk Device -d

-p Path for the folder or disk device

-b or/and -r Size of fixed blocks to analyze in bytes or/and size of Rabin blocks to analyze in bytes

Optional Parameters

-z Path for the folder where duplicates databases are created default: ./gendbs/ . duplicate_db is the database with duplicates information hash->number_dups

-b Size of fixed blocks to analyze in bytes eg: -b1024,4096,8192

-r Size of Rabin blocks to analyze in bytes eg: -b1024,4096,8192

Examples

generate duplicate statistics for files in folder /dir/duplicates/ and its subfolders. Use Rabin fingerprints with a size of 4096 and Fixed-size blocks with a size of 1024 and 4096 bytes.

./Dupsanalyzer -f -p/dir/duplicates/ -o/dir/dist_file -r1024 -b1024,4096

generate duplicate distribution file /dir/dist_file for device /path/device. Analyze for a size of 4096 . A file dist_file4096 and dist_file8192 will be created.

./Dupsanalyzer -d -p/path/device -o/dir/dist_file -r4096

For more information please contact: Joao Paulo jtpaulo at di.uminho.pt

About

Dupsanalyzer is a tool for collecting duplicate statistics for real datasets

Resources

License

GPL-2.0, GPL-2.0 licenses found

Licenses found

GPL-2.0
LICENSE
GPL-2.0
COPYING

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published