rHAT

rHAT is a hash table base, fast sequence alignment algorithm

it first construct a hash table(call RHT) in order to find the read's seed in the reference DNA sequence

then do the extension process to get a alignment result

usage:

make

//help
./create_RHT -h
./run_rHAT -h


//use E.coli.fa file to create RHT
./create_RHT -d E.coli.fa


//-d for dna file, -r for read file, -o for ouput file
./run_rHAT -d E.coli.fa -r dna_read -o out

The rHAT program has two different parts.The first part is create_RHT. The second part is run_rHAT.

The first part create_RHT is aimed to use a reference DNA array to create an RHT table. The second part run_rHAT is for using the RHT table and a read to run rHAT process.

To run rHAT should set the pointerlistlen length and window length. They are store in PointerListLen(size_t [11]) and WindowListLen(size_t [2048])

##The main process of rHAT program

1.0 create_RHT process:
1.1 first of all, the DNA string is store in a std::string named dna_w;
1.2 for each char in dna_w, convert into a two bit numnber(A->00, C->01, G->10, T->11) and then append to a variant named tmp;
1.3 take the last 2*PointerListLen bit nums in tmp to indicate pointer indices;
1.4 take the dna_w index to count the window index;
1.5 use pointer index and window index to bind together and store into a array named PW(uint64_t);
1.6 sort(std::sort) the PW array and then separate it into a pointer lists and a window lists as the RHT table;
1.7 write the RHT table to a file called out_RHT;


2.0 run_rHAT process:
2.1 get the DNA and read arrays form file and store in dna_f(std::string) and read(std::string);
2.2 get the RHT table from out_RHT(see 1.7);
2.3 for each read, split the center string of read. Does this process by counting the center read len(size_t theLen, if read < WindowlistLen/2 then theLen=read len), the len from read's start postion to read-center's start postion(size_t len_up_stream) and the len from read-center's end postion to read's end position(size_t len_down_stream);
2.4 in read center array do 1.1 to 1.2 process, use pointer index and RHT table loaded before to find the window index;
2.5 create an array to count the each window index's appearing time of 2.4 process;
2.6 use a minimum heap to get the k highest hitted windew index in the array;
2.8 use the whole read array to create a read l-mer RHT table;
2.8.1 use read to do 1.2 and 1.3 process;
2.8.2 take the read index as the read RHT window index, and do 1.5 process;
2.8.3 sort the read PW array;
2.9 for each k highest hitted window index in 2.6 do create-DAG and do_alignment process;

3.0 DAG process:
3.1 use window index to get DNA window array of DNA string and extend upstream len_upstream(see 2.3) char and extend downstream(see 2.3) char to create DNA window array;
3.2 use DNA window arrays(3.1) to run 1.2 and 1.3 process and use the pointer index and read's RHT table to find the read hitted index;
3.3 set the DNA window index(3.2) and read hitted inde(3.2) and match length (default PointerListLen) to create in a directed acyclic graph;
3.4 if two point p1 and p2 in DAG where p1.index_w = p2.index_w-1, p1.index_r=p2.index_r-1 and p1.len=p2.len+1, there is no need to store p2 and as a result the p2 can be deleted;
3.5 add the start(a, b, 0)(a is the window start postion in 3.2->1.2 pocess and b is the read start postion in 2.8.1->1.2 process) and end(windowlen, readlen, 0) point to DAG;
3.6 for each point in DAG, if two of them satisfy the condition "VRi + VLi <= VRj <= VRi + VLi + Twait and VSi + VLi <= VSj"(more detail see [ref][1] 2.4 3), add an edge(in a 2D array) to link that two pointers;
3.7 use the point and edge information to find a path from start point to end point(see 3.5) that having the maximum score(see [ref][1] 2.4 4, by a dynamic programming process).
3.8 for the path(3.7) in DAG, the point in the path is the window&read array that matched, then the path between two points is the gap array of window&read, so cut the gap array of window&read to do alignment process(4.0);

4.0 do alignment process:
4.1 get the first and last gaps in window&read to do semi-global alignment using ksw;
4.2 get the remaining gaps to do global alignment process;

5.0 print the result:
5.1 for a read, there have k-highest hitted window indexes, for each hitted index, there have a max alignment score from 3.0 and 4.0. print the max score in that alignment and print out corebonding alignment;

##Main data structure used in create_RHT.cpp

class RHT to create RHT table(see 1.5)

uint32_t *P;
uint32_t *W;
uint64_t *PW;

PW[] is a 64 bit number array. for each unit in PW[], the first 32 bit is the pointer index and the last 32 bit is the window index in DNA string;(see 1.6) P[] is use to store first 32 bit info in PW[] W[] is th last 32 bit array in PW[];

##Main data structure used in run_RHT.cpp

in the process of getting the k-highest window hitted index, the counting process is work by a mini heap:

priority_queue<WINDOW_CNT>
the WINDOW_CNT is 
typedef struct window_cnt {
    uint32_t index_of_W;
    size_t cnt;
    bool operator<(const window_cnt& w) const
    {
        return cnt > w.cnt;
    }
}WINDOW_CNT;

the WINDOW_CNT stores the window index and count number, and will be compared by count number.

the DAG pointer is some instant of

typedef struct window_node {
    uint32_t index_of_W, index_of_R;
    size_t len;
}WINDOW_NODE;

which have window index, read index and length information

Name		Name	Last commit message	Last commit date
Latest commit History 78 Commits
.gitignore		.gitignore
DAG.h		DAG.h
Haffman.c		Haffman.c
Haffman.h		Haffman.h
LV_deep.c		LV_deep.c
LV_deep.h		LV_deep.h
Makefile		Makefile
README.md		README.md
RHT.h		RHT.h
SNP_vector.c		SNP_vector.c
SNP_vector.h		SNP_vector.h
Sam_file		Sam_file
alignment.h		alignment.h
ch20_reform.vcf		ch20_reform.vcf
cigar.h		cigar.h
create_RHT		create_RHT
create_RHT.cpp		create_RHT.cpp
dna_read		dna_read
freq.stat		freq.stat
func.h		func.h
graph.cpp		graph.cpp
graph.h		graph.h
kseq.h		kseq.h
ksw.c		ksw.c
ksw.h		ksw.h
run_rHAT		run_rHAT
run_rHAT.cpp		run_rHAT.cpp
test.cpp		test.cpp
test_f.cpp		test_f.cpp
w.index		w.index
zz		zz

sujunhao/_rHAT

Folders and files

Latest commit

History

Repository files navigation

rHAT

usage:

About

Resources

Stars

Watchers

Forks

Languages