An experimental platform for chunk-level data deduplication. Key words: DDFS, Sparse Index, Extreme Binning, SiLo, Sample Index, BLC; CBR, CFL, CAP, HBR; ASM, OPT; GC, Cumulus

General Information
===================
destor is a prototype deduplication system.
It is an experimental tool intended to help researchers implement and evaluate new ideas.

Features
========
1. Content-based chunking by Rabin hash, and fixed-sized chunking.
2. Chunk-level pipeline.
3. Container-based storage.
4. In-memory index and DDFS index.
5. LRU cache and nearly-optimal cache.
6. CFL monitor, selective deduplication, and CFL-assisted HBR.
7. Context-based rewriting algorithm (CBR).
8. Extreme Binning index without file-level dedup.
9. SiLo index.
10. History-based rewriting algorithm (HBR), and CBR-assisted HBR.
11. Capping rewriting algorithm, rolling forward assembly area algorithm, and capping-assisted HBR.
12. A cache filter that improves the rewriting algorithms.
13. Sparse Index.
14. Sample Index.
15. Segment Binning.
16. Garbage collection scheme of Cumulus.
17. Block Locality Caching index.
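
As a rough illustration of what fixed-sized chunking and fingerprint-based deduplication measure, the sketch below splits a file into equal chunks, fingerprints each chunk with SHA-1, and counts the distinct fingerprints. It is not how destor is implemented; the function name dedup_stats is hypothetical, and the sketch assumes GNU coreutils (split, sha1sum, mktemp) are available.

```shell
#!/bin/sh
# dedup_stats FILE CHUNK_BYTES
# Print "<total chunks> <unique chunks>" for FILE under fixed-sized chunking.
# Hypothetical helper for illustration only; destor does this in C.
dedup_stats() {
    tmp=$(mktemp -d) || return 1
    # Fixed-sized chunking: every chunk is CHUNK_BYTES long, except possibly the last.
    split -b "$2" -a 6 "$1" "$tmp/chunk."
    # Fingerprint each chunk with SHA-1 and keep only the hex digest column.
    sha1sum "$tmp"/chunk.* | cut -d ' ' -f 1 > "$tmp/fingerprints"
    total=$(( $(wc -l < "$tmp/fingerprints") ))
    # Chunks with equal fingerprints are treated as duplicates.
    unique=$(( $(sort -u "$tmp/fingerprints" | wc -l) ))
    rm -rf "$tmp"
    echo "$total $unique"
}
```

The gap between the two numbers is the number of chunks a chunk-level deduplication system would not need to store again. Content-based chunking differs only in where the chunk boundaries are placed.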

Related papers
==============
1. Avoiding the Disk Bottleneck in the Data Domain Deduplication File System, @FAST'08.
2. Extreme Binning: Scalable, Parallel Deduplication for Chunk-based File Backup, @MASCOTS'09.
3. SiLo: A Similarity-Locality based Near-Exact Deduplication Scheme with Low RAM Overhead and High Throughput, @USENIX ATC'11.
4. Chunk Fragmentation Level: An Effective Indicator for Read Performance Degradation in Deduplication Storage, @HPCC'11.
5. Assuring Demanded Read Performance of Data Deduplication Storage with Backup Datasets, @MASCOTS'12. 
6. Reducing impact of data fragmentation caused by in-line deduplication, @SYSTOR'12.
7. Improving Restore Speed for Backup Systems that Use Inline Chunk-Based Deduplication, @FAST'13.
8. Sparse Indexing: Large Scale, Inline Deduplication Using Sampling and Locality, @FAST'09.
9. Building a High-Performance Deduplication System, @USENIX ATC'11.
10. Cumulus: Filesystem Backup to the Cloud, @FAST'09.
11. Block Locality Caching for Data Deduplication, @SYSTOR'13.

Environment
===========
64-bit Linux.

Dependencies
============
1. libssl-dev, required to compute SHA-1 digests.
2. GLib 2.32 or later.
3. libmysqlclient-dev.
4. The Makefile is generated automatically by GNU autoconf and automake.
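
Before running ./configure, the libraries listed above can be probed with pkg-config. This is a minimal sketch, not part of destor: the function name missing_deps is hypothetical, and the module names glib-2.0, openssl, and mysqlclient are assumptions about how your distribution registers these packages.

```shell
#!/bin/sh
# PKG_CONFIG can be overridden, as in autoconf-generated configure scripts.
PKG_CONFIG=${PKG_CONFIG:-pkg-config}

# missing_deps: print one line per dependency that pkg-config cannot find.
# The module names are assumptions; adjust them for your distribution.
missing_deps() {
    "$PKG_CONFIG" --atleast-version=2.32 glib-2.0 || echo "glib-2.0 >= 2.32"
    "$PKG_CONFIG" --exists openssl || echo "openssl"
    "$PKG_CONFIG" --exists mysqlclient || echo "mysqlclient"
}
```

An empty result from missing_deps suggests the libraries are visible to the build; any printed line names a package to install first.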

INSTALL
=======
If all dependencies are installed,
compiling destor is straightforward:

    ./configure
    make
    make install

To uninstall destor, run:

    make uninstall

Running
=======
After a successful build and install, the destor executable is placed in /usr/local/bin by default.
Create a config file, destor.config, in the directory where you run destor.
A sample destor.config is in the scripts directory,
along with many scripts for experiments.
Run the rebuild script to clean data.

destor provides six commands:
1. start a backup job,
    destor <directory or file>
2. start a recovery job,
    destor -r<jobid> <dest directory>
3. start a delete job,
    destor -d<jobid>
4. look up system statistics,
    destor -s
5. show help,
    destor -h
6. make a trace,
    destor -t <path of raw files>

destor accepts several options.
1. Backup options
    --index=[RAM|DDFS|EXBIN|SILO|SPARSE|SAMPLE]
    --rewrite=[NO|CFL|CBR|CAP|HBR]
2. Restore options
    --cache=[LRU|ASM|OPT]
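
The commands and options above can be combined into small wrappers for experiments. The following is a sketch under stated assumptions: the function names backup_with and restore_with and the DESTOR variable are hypothetical, and placing the long options before the positional arguments is an assumption about destor's argument parsing rather than something the list above states.

```shell
#!/bin/sh
# DESTOR names the executable; overriding it (e.g. DESTOR=echo) gives a dry run.
DESTOR=${DESTOR:-destor}

# backup_with SRC INDEX REWRITE
# Start a backup job using the given fingerprint index and rewriting algorithm.
backup_with() {
    "$DESTOR" --index="$2" --rewrite="$3" "$1"
}

# restore_with JOBID DEST CACHE
# Restore job JOBID into DEST using the given restore cache.
restore_with() {
    "$DESTOR" --cache="$3" -r"$1" "$2"
}

# Example (paths and the job id are placeholders):
#   backup_with /path/to/dataset DDFS CAP
#   restore_with <jobid> /path/to/restore OPT
```

Setting DESTOR=echo prints the command line each wrapper would run, which is a cheap way to check a batch of experiments before starting them.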

Configure
=========
The configuration file of destor, destor.config, is somewhat complicated.
It exposes many parameters of the deduplication system, such as the index type (RAM, DDFS, etc.), the cache type (LRU, OPT, and ASM), the rewriting algorithm (CFL, CBR, etc.), and so on.
1. INDEX_TYPE: the type of the fingerprint index. destor currently supports RAM, DDFS, EXBIN, SILO, SPARSE, and SAMPLE.
2. READ_CACHE_TYPE: the cache used for recovery. destor currently supports LRU, OPT, and ASM.
3. REWRITE: the rewriting algorithm to use, if any. destor supports NO, CFL, CBR, CAP, HBR, etc.
4. These algorithms have many parameters of their own; set them with care.

Bugs
====
1. The path of the backup/restore directory cannot exceed 200 bytes.
2. If a running destor crashes or is killed, data consistency is not guaranteed; run the rebuild script afterwards.
3. destor cannot back up datasets with more than 2^31 chunks.
4. Concurrent backup/restore is not supported.
5. If the working path in destor.config is modified, the rebuild script in the scripts folder must be modified to match.
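
The first and third limits above can be checked before a backup is started. This is a minimal sketch with hypothetical helper names (path_ok, chunks_ok); the chunk count is only a rough estimate from the dataset size and an average chunk size that you supply, since the real count depends on the chunking parameters in use.

```shell
#!/bin/sh
# path_ok PATH: succeed when PATH is at most 200 bytes long.
path_ok() {
    # wc -c counts bytes, not characters, which matters for multi-byte names.
    len=$(( $(printf '%s' "$1" | wc -c) ))
    [ "$len" -le 200 ]
}

# chunks_ok DATASET_BYTES AVG_CHUNK_BYTES
# Succeed when the estimated chunk count does not exceed 2^31.
# AVG_CHUNK_BYTES is an assumption supplied by the caller.
chunks_ok() {
    # Round up so that a partial trailing chunk is counted.
    est=$(( ($1 + $2 - 1) / $2 ))
    [ "$est" -le 2147483648 ]
}
```

For example, with an assumed average chunk size of 8192 bytes, 2^31 chunks correspond to 2^44 bytes (16 TiB) of data, so chunks_ok 17592186044416 8192 succeeds and anything larger fails.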

Author
======
Min Fu,
Email : fumin@hust.edu.cn
blog : fumin.hustbackup.cn
