refresh-bio/ORCOM

ORCOM

ORCOM (Overlapping Reads COmpression with Minimizers) is a set of two tools for effective, high-performance compression of DNA sequencing reads:

  • orcom_bin - performing a DNA reads clusterization into bins,
  • orcom_pack - performing compression of the DNA reads stored in bins.

Currently, ORCOM works with FASTQ files and compresses only the DNA stream, discarding read names and quality scores. By compressing only the DNA stream, it reduces the H. sapiens reads (ERA015743), consisting of 1.34G reads of length 100–102, from 136GB to 5.5GB, a compression ratio of 0.327 bits per base.
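The quoted ratio can be sanity-checked from the rounded figures above. The sketch below assumes decimal gigabytes (1 GB = 10^9 bytes) and a mean read length of 101 bases; neither is stated in the text, and `bits_per_base` is a hypothetical helper name, not part of ORCOM.

```shell
# Rough check of the bits-per-base figure from the rounded numbers in the text.
# Assumptions (not from the source): 1 GB = 10^9 bytes, mean read length = 101.
bits_per_base() {
    # $1 = compressed size in bytes, $2 = number of reads, $3 = mean read length
    LC_ALL=C awk -v bytes="$1" -v reads="$2" -v len="$3" \
        'BEGIN { printf "%.3f\n", (bytes * 8) / (reads * len) }'
}

bits_per_base 5500000000 1340000000 101
```

With these rounded inputs the result is about 0.325 bits per base, close to the reported 0.327; the small gap is expected given the rounding of the archive size and read count.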

For more information please check out the official website.

Building

Pre-built binaries

Pre-built binaries for the Linux platform can be downloaded from the official website.

Build prerequisites

ORCOM currently provides Makefiles for building on Linux; however, it should also compile on Windows and Mac OS X. The only prerequisite is that the zlib library is available on the system.

ORCOM binaries can be compiled in two ways, depending on the multithreading library chosen; a separate makefile for each is provided in the subproject directories. In the first (default) case, threading support from the C++11 standard is used (requires g++ version >= 4.8). In the second case, the boost::threads library is used and must be present on the build system.

By default, binaries are compiled with g++; compiling with Clang or Intel icpc should also be possible.

Compiling

To compile ORCOM using g++ >= 4.8 with C++11 standard and dynamic linking, in the main directory type:

make

To compile ORCOM using boost::threads with static linking:

make boost

The resulting orcom_bin and orcom_pack binaries will be placed in the bin subdirectory.

To compile each subprogram separately, use the makefiles provided in each subprogram's directory.

Usage

DNA stream compression with ORCOM is a two-stage process: orcom_bin is run first, followed by orcom_pack. Decompressing the DNA stream requires running only orcom_pack.
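The two-stage flow can be wrapped in a small script. This is a minimal sketch, not part of ORCOM: the function names `run`, `orcom_compress` and `orcom_decompress` and the `DRY_RUN` variable are hypothetical, and only the e/d modes and the -i and -o options documented in this README are used.

```shell
# Sketch of the two-stage flow; all function names here are hypothetical.
# Set DRY_RUN=1 to print the commands instead of executing them.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

orcom_compress() {
    # $1 = input FASTQ, $2 = bin files prefix, $3 = archive files prefix
    # Stage 1 clusters reads into bins; stage 2 compresses the bins.
    run orcom_bin e -i"$1" -o"$2" &&
    run orcom_pack e -i"$2" -o"$3"
}

orcom_decompress() {
    # $1 = archive files prefix, $2 = output DNA file
    # Decompression needs orcom_pack only.
    run orcom_pack d -i"$1" -o"$2"
}
```

For example, with DRY_RUN=1 set, `orcom_compress NA19238.fastq NA19238.bin NA19238.orcom` prints the orcom_bin command followed by the orcom_pack command without running either.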

orcom_bin

orcom_bin clusters DNA records into separate bins representing signatures. As input it takes a single FASTQ file or a set of FASTQ files and writes the binned records to two separate files: a *.bdna file, containing the encoded DNA stream, and a *.bmeta file, containing archive meta-information.

Command line

orcom_bin is run from the command prompt:

orcom_bin <e|d> [options]

in one of the two modes:

  • e - encoding,
  • d - decoding,

with available options:

  • -i<file> - input file,
  • -f"<f1> <f2> ... <fn>" - input file list,
  • -g - input compressed in .gz format,
  • -o<f> - output files prefix,
  • -p<n> - signature length, default: 8,
  • -s<n> - skip-zone length, default: 12,
  • -b<n> - FASTQ input buffer size (in MB), default: 256,
  • -t<n> - worker threads number, default: 8,
  • -v - verbose mode, default: false.

The parameters -p<value> and -s<value> control the record clustering process and signature selection. The parameter -b<value> concerns the bin sizes before and after clustering: the FASTQ buffer size should be set as large as possible to achieve the best ratio (at the cost of high memory consumption). The parameter -t<value> sets the total number of processing threads (not including two I/O threads).

Examples

Encode (cluster) reads from the NA19238.fastq file, using a signature length of 6, a skip-zone length of 6, and 4 processing threads with a 256 MB FASTQ block buffer, saving the output to NA19238.bin bin files:

orcom_bin e -iNA19238.fastq -oNA19238.bin -t4 -b256 -p6 -s6 

Encode reads from NA19238_1.fastq and NA19238_2.fastq files saving output to NA19238.bin bin files:

orcom_bin e -f"NA19238_1.fastq NA19238_2.fastq" -oNA19238.bin

Encode reads from gzip-compressed FASTQ files (-g) in the current directory using 8 processing threads and saving the output to NA19238.bin bin files:

orcom_bin e -f"$( ls *.fastq.gz )" -oNA19238.bin -g -t8
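Inside double quotes, the output of ls keeps its newlines, whereas the -f option is documented as a space-separated list. Whether orcom_bin also accepts newline separators is not stated here, so the following sketch builds the list with explicit spaces. `fastq_list` is a hypothetical helper, and it assumes no file name contains whitespace.

```shell
# Join file names into the single space-separated string documented for -f.
# Assumption: no file name contains spaces or newlines.
fastq_list() {
    # "$*" joins all arguments with the first character of IFS (a space by default).
    printf '%s\n' "$*"
}

# Possible use (not executed here):
#   orcom_bin e -f"$(fastq_list *.fastq.gz)" -oNA19238.bin -g -t8
```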

Decode reads from NA19238.bin bin files and save the DNA reads to the NA19238.dna file:

orcom_bin d -iNA19238.bin -oNA19238.dna

orcom_pack

orcom_pack compresses DNA records. As input it takes the files produced by orcom_bin (*.bdna and *.bmeta) and generates two output files: a *.cdna file, containing the compressed streams, and a *.cmeta file, containing archive meta-information.

Command line

orcom_pack is run from the command prompt:

orcom_pack <e|d> [options]

in one of the two modes:

  • e - encoding,
  • d - decoding,

with available options:

  • -i<file> - orcom_bin generated bin files prefix,
  • -o<file> - output files prefix,
  • -e<n> - encode threshold value, default: 0 (0 - auto),
  • -m<n> - mismatch cost, default: 2,
  • -s<n> - insert cost, default: 1,
  • -t<n> - threads count, default: 8,
  • -v - verbose mode, default: false.

The parameters -e<value>, -m<value> and -s<value> control the internal record encoding step; the encoding threshold should be adapted to the read length of the dataset. The parameter -t<value> sets the total number of processing threads (not including two I/O threads).

Examples

Encode (compress) clustered reads from NA19238.bin bin files using 4 processing threads and save the output to NA19238.orcom archive files:

orcom_pack e -iNA19238.bin -oNA19238.orcom -t4

Encode clustered reads from NA19238.bin bin files, setting the read matching parameters (insert cost 2, mismatch cost 1, encoding threshold 40), and save the result to NA19238.orcom archive files:

orcom_pack e -iNA19238.bin -oNA19238.orcom -s2 -m1 -e40 

Decode (decompress) reads from NA19238.orcom archive saving the DNA reads to NA19238.dna file:

orcom_pack d -iNA19238.orcom -oNA19238.dna
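After decoding, one way to check a round trip is to compare the reads of the original FASTQ file with the decoded file while ignoring order, since clustering into bins may change the order of reads. The sketch below rests on assumptions not confirmed by this README: FASTQ records occupy exactly four lines each, and the decoded .dna file holds one read per line. `fastq_dna_sorted` and `same_reads` are hypothetical helpers.

```shell
# Print the DNA lines of a FASTQ file (2nd line of each 4-line record), sorted.
# Assumption: every record spans exactly four lines.
fastq_dna_sorted() {
    awk 'NR % 4 == 2' "$1" | LC_ALL=C sort
}

# Succeed if both files contain the same multiset of reads, ignoring order.
same_reads() {
    # $1 = original FASTQ, $2 = decoded DNA file (assumed: one read per line)
    a="$(mktemp)" && b="$(mktemp)" || return 2
    fastq_dna_sorted "$1" > "$a"
    LC_ALL=C sort "$2" > "$b"
    cmp -s "$a" "$b"
    status=$?
    rm -f "$a" "$b"
    return "$status"
}

# Possible use (not executed here):
#   same_reads NA19238.fastq NA19238.dna && echo "round trip OK"
```

Sorting both sides needs time and disk space proportional to the dataset, so for very large inputs this is better suited to a subsample than to a full run.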

Citing

Grabowski, Sz., Deorowicz, S., Roguski, L. (2014) Disk-based compression of data from genome sequencing, Bioinformatics, 31:1389–1395
