GitHub - migumar2/libCSD: C++ Library implementing Compressed String Dictionaries

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 27 Commits
FMIndex		FMIndex
Hash		Hash
HuTucker		HuTucker
Huffman		Huffman
RePair		RePair
XBW		XBW
iterators		iterators
libcds		libcds
utils		utils
Build.cpp		Build.cpp
LICENSE.md		LICENSE.md
Makefile		Makefile
README		README
StringDictionary.cpp		StringDictionary.cpp
StringDictionary.h		StringDictionary.h
StringDictionaryFMINDEX.cpp		StringDictionaryFMINDEX.cpp
StringDictionaryFMINDEX.h		StringDictionaryFMINDEX.h
StringDictionaryHASHHF.cpp		StringDictionaryHASHHF.cpp
StringDictionaryHASHHF.h		StringDictionaryHASHHF.h
StringDictionaryHASHRPDAC.cpp		StringDictionaryHASHRPDAC.cpp
StringDictionaryHASHRPDAC.h		StringDictionaryHASHRPDAC.h
StringDictionaryHASHRPF.cpp		StringDictionaryHASHRPF.cpp
StringDictionaryHASHRPF.h		StringDictionaryHASHRPF.h
StringDictionaryHASHUFFDAC.cpp		StringDictionaryHASHUFFDAC.cpp
StringDictionaryHASHUFFDAC.h		StringDictionaryHASHUFFDAC.h
StringDictionaryHHTFC.cpp		StringDictionaryHHTFC.cpp
StringDictionaryHHTFC.h		StringDictionaryHHTFC.h
StringDictionaryHTFC.cpp		StringDictionaryHTFC.cpp
StringDictionaryHTFC.h		StringDictionaryHTFC.h
StringDictionaryPFC.cpp		StringDictionaryPFC.cpp
StringDictionaryPFC.h		StringDictionaryPFC.h
StringDictionaryRPDAC.cpp		StringDictionaryRPDAC.cpp
StringDictionaryRPDAC.h		StringDictionaryRPDAC.h
StringDictionaryRPFC.cpp		StringDictionaryRPFC.cpp
StringDictionaryRPFC.h		StringDictionaryRPFC.h
StringDictionaryRPHTFC.cpp		StringDictionaryRPHTFC.cpp
StringDictionaryRPHTFC.h		StringDictionaryRPHTFC.h
StringDictionaryXBW.cpp		StringDictionaryXBW.cpp
StringDictionaryXBW.h		StringDictionaryXBW.h
Test.cpp		Test.cpp

Repository files navigation

libCSD is a C++ library providing some different techniques for managing
string dictionaries in compressed space. These approaches are inspired on
the paper:

==========================================================================
  "Compressed String Dictionaries"
  Nieves R. Brisaboa, Rodrigo Cánovas, Francisco Claude, 
  Miguel A. Martínez-Prieto, and Gonzalo Navarro
  10th Symposium on Experimental Algorithms (SEA'2011), p.136-147, 2011.
==========================================================================

The current release is a beta version providing full functionality for
string location and ID extraction. Besides, some techniques support queries
by prefix and substrings. The set of techniques is described as follows:

- "Hash":  implements four different techniques which combines hashing and 
	   Huffman / Re-Pair compression.
- "FC":    implements two different techniques based on Front-Coding 
	   compression. PFC is a straightforward byte-oriented implementation,
	   and HTFC performs Hu-Tucker compression over the headers while
	   supports Huffman / Re-Pair compression for the internal strings.
- "RPDAC": uses Re-Pair for compressing the strings and.
- "FMI"  : self-indexes the dictionary and provides all types of queries using
	   the FM-Index facilities.
- "XBW"  : obtains a compressed trie of the dictionary and transforms it to
	   support all types of queries in highly compressed space.

The techniques support different parameterizations to achieve varied and
competitive space/time tradeoffs. 

Compiling the library
====================
1) For building the libcds library, type "make" within the libcds folder.
2) Once built libcds, type "make" within the root folder and libCSD is ready.


Building a dictionary
=====================
The library provides a simple command-line script for building dictionaries:

./Build <type> <parameters> <in> <out>

  - The first parameter chooses the <type> of dictionary to be built. 
  - The set of <parameters> provide specific configuration values.
  - The <in> parameter locates the original dictionary file.
  - The <out> parameter tells the destination path.

It is worth noting that "in" dictionary file must be lexicographically sorted
and strings must be ended with the '\0' ASCII char.

Examples:
=========
./Build 1 h 10 geonames dicts/geo.10

  Builds a HASH dictionary for "geonames" and stores it as "dicts/geo.10". The 
  dictionary uses a hash table which hash 10% more cells than strings in 
  "geonames" and compresses them using Huffman (h).

./Build 4 r 16 geonames dicts/geo.16

  Builds a HTFC (Hu-Tucker Front-Coding) dictionary for "geonames" and stores 
  it as "dicts/geo.16". The dictionary uses buckets of 16 strings and 
  compresses them using Re-Pair.


Testing a dictionary
====================
The library also provides a command-line script for testing purposes:

./Test <mode> <opt> <in> <file>

- This script supports four different <modes>:

  - 'r' is used for running the test chosen in <opt>:
	 - 'l' (for testing locate), 'e' (extract).
	 - 'pl' (prefix location), 'pe' (prefix extraction).
	 - 'sl' (substring location), 'pe' (substring extraction). 
  - 'g' is used for generating a basic testbed comprising <opt> valid strings
	for locate and <opt> valid IDs for extract.
  - 'p' is used for generating a prefix testbed comprising 100,000 query 
	patterns of some different lengths. The <opt> parameter tells the
	a reference string length, and the testbed obtain different files
	with patterns longer and shorter than this length.
  - 's' is used for generating a substring testbed with the features described
	above.

- The parameter <in> locates the file storing the compressed string dictionary.
- The last <file> parameter locates the file comprising the testbed (for 'r'
  mode) or the destination path for saving the corresponding testbed (modes 
  'g', 'p', or 's').


Examples:
=========
./Test r l dicts/geo.10 tests/geo.strings

  Uses the testbed "tests/geo.strings" for analyzing how the dictionary stored 
  at "dicts/geo.10" behaves for 'locate'.

./Test g 100000 dicts/geo.10 tests/geo

  Uses the dictionary stored at "dicts/geo.10" for generating a basic testbed 
  of 100,000 strings and IDs which are stored as "tests/geo".

./Test p 16 dicts/geo.10 tests/geo

  Uses the dictionary stored at "dicts/geo.10" for generating a prefix testbed
  comprising patterns of lengths longer and shorter than 16 chars. The 
  resulting pattern set are stored at "tests/geo".


If you find bugs or have any issue with library, please ask us. Enjoy the 
library and if you find it useful for your research, please cite our paper:

@article{MPBCCN16,
  author    = "M.A. Mart{\'{\i}}nez-Prieto and N. Brisaboa and R. C{\'a}novas  
                and F. Claude and G. Navarro",
  title     = "Practical Compressed String Dictionaries",
  booktitle = "Information Systems",
  year      = "2016",
  volume    = {56},
  pages     = "73--108"
}