- implement the simplest FM-index using just basic data structures
- implement FM-index using wavelet trees
git clone --recursive https://github.com/iborko/fmindex
- run
make
- executables will be located in
bin
folder (bin/fmindex
,bin/test
)
Binary will be located in bin
folder (fmindex
). Usage:
fmindex <sequence> <reads> [<occurrence_table> [<bucket_size>]]
<sequence>
- path to the sequence on which the search will be made, FASTA format<reads>
- path to the reads that will be searched, FASTQ format<occurrence_table>
- can be0
(matrix occurrence table),1
(wavelet tree based occurrence table)<bucket_size>
- bucket size for bit string rank in wavelet tree, optional, default is 20
Example:
fmindex Esch_coli_536.fna Esch_coli_536_reads.fq 1 40
For every read from <reads>
program generates two lines. First line is the FASTQ header and the second line is the list of all position indices of the current read in <sequence>
. Indices are splitted with whitespaces.
Script test_run.sh
can be used to run program on one of the example sequence. Example sequences are located in the test_data
folder.
Example of run:
test_run.sh test_data/Esch_coli_536.fna
This script will generate sets of (1000, 5000, 10000, 50000, 100000, 500000, 1000000) reads and run the fmindex
on them, measuring CPU time and memory using time
program.
1 Jochen Singer: A Wavelet Tree Based FM-Index for Biological Sequences in SeqAn, link