rcanovas/libCSAM

libCSAM

libCSAM contains several C++ programs to compress, decompress, and access each of the fields of any SAM format file. Part of this library was taken from Francisco Claude's libcds project (https://github.com/fclaude/libcds/). The Boost C++ library (http://www.boost.org/) must also be installed on your computer.

Methods

-[CompressSAM]:

  Use: ./CompressSAM <arch> <opt>
  opt: 
      -q qm: How the quality values are stored. q = 0 - lossless, 1 - pblock, 2 - rblock. Default: q = 1
      -m mode: Mode used to store the Representative Array. mode = 0 ASCII, 1 Binary Global, 2 Binary Local. Default: mode = 1
      -l lossy: Lossy parameter used to compress the quality scores, depending on the mode used. Default: 0
      -s sample: Sample rate used for Fields and Quality structure. Default: s = 1000
      -p position: Sample position rate used for Seq, Rname, and Pos. Default: p = 1000
      
  output: .csam File

  Example: ./CompressSAM ./data/file.sam -q 2 -l 60 -s 500 -p 100
  output:  file.sam.csam

-[CompressQual]:

Use: ./CompressQual <arch> <opt>
arch:   SAM format file. Note that only the quality field will be used.
opt:
    -q mode:    How the quality values are stored.
                    mode=0  gzip
                    mode=1  P-Block
                    mode=2  R-Block
                    mode=3  Bins based on the LogBinning of the Wan et al.
                            2011 paper. Note that UniBinning is
                            also implemented but is not included
                            in this program.
                    mode=4  Only one value is stored to represent
                            all the qualities
                    default: 0
    -m mode:    If 'q' is 1 or 2, gives the mode used to store
                the Representative Array.
                    mode=0  ASCII
                    mode=1  Binary Global
                    mode=2  Binary Local
                    default: 1
    -l lossy:   Lossy parameter used to compress the quality
                scores, depending on the mode used.
                    P-Block: maximum distance between the values
                             and their representative (1, 2, 3, 4)
                    R-Block: Max/Min maximum difference allowed,
                             receives the extra overhead pbb
                             (5, 20, 40, 100)
                    Bins:    number of bins to be used (94, 20,
                             10, 5)
    -r:         Reorder the quality scores by reference and
                position. The permutation is also stored.
    -s sample:  Size of the sample rate that will be used.
                default: no sample.

output: .cqual File

Example: ./CompressQual ./data/file.sam -q 2 -l 60 -s 500 -r
output:  file.sam.cqual
Compresses the quality scores of file.sam using R-Block with r=1.60,
storing a sample every 500 lines and reordering the file
by reference and position in the reference.
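
The P-Block option above bounds the distance between each quality value and its representative. A minimal Python sketch of that idea follows. It is not libCSAM's implementation: the function name `pblock`, the greedy left-to-right partitioning, the midpoint representative, and the Phred+33 offset are all assumptions made for illustration.

```python
def pblock(qual, p, offset=33):
    """Greedily split a quality string into (representative, run_length) blocks.

    A block keeps growing while max - min <= 2 * p, so the midpoint
    representative is within distance p of every value in the block.
    (Hypothetical helper, not part of libCSAM.)
    """
    values = [ord(c) - offset for c in qual]
    blocks = []
    i = 0
    while i < len(values):
        lo = hi = values[i]
        j = i + 1
        while j < len(values):
            new_lo = min(lo, values[j])
            new_hi = max(hi, values[j])
            if new_hi - new_lo > 2 * p:
                break  # adding this value would violate the bound
            lo, hi = new_lo, new_hi
            j += 1
        blocks.append(((lo + hi) // 2, j - i))
        i = j
    return blocks
```

Storing only the representative and the run length of each block is what makes the scheme lossy: a larger `p` gives fewer, longer blocks at the cost of a larger per-value error.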

-[CompressSeq]:

  Use: ./CompressSeq <arch> <opt>
  <arch>:  must be a .sam or .rps (rname pos seq) file
  opt: 
      -s sample:  Size of the sample rate that will be used. Default: no sample.
      
  output: .cseq file

-[DecompressSAM]:

   Use: ./DecompressSAM <arch>
   arch: .csam File
   output:.sam File containing the SAM information

-[DecompressQual]:

   Use: ./DecompressQual <arch>
   arch: .cqual File
   output:.qual File containing the quality scores

-[DecompressSeq]:

   Use: ./DecompressSeq <arch>
   arch: .cseq File
   output:.seq File containing only the sequence field

-[CountReadsSample]: Counts the number of reads within each of the intervals listed in the sample_interval file. Also gives some stats about the intervals found.

    Use: ./CountReadsSample <arch>.csam sample_interval_file
    output: On screen

-[GetIntervalSAM]: Extracts from a csam file all the alignment lines within the interval (ref,x,y)

    Use: ./GetIntervalSAM <arch> ref_name pos_x pos_y
    arch: .csam File
    ref_name: reference name
    pos_x, pos_y:  interval positions
    output: file_name + "_inter.sam" File
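
To make the (ref,x,y) query concrete, the sketch below applies the same kind of filter to plain, uncompressed SAM lines. It only illustrates the query semantics, not how GetIntervalSAM searches a .csam file. The function names are hypothetical, and treating the bounds as inclusive and testing only the POS field (rather than read overlap) are assumptions.

```python
def in_interval(sam_line, ref, x, y):
    """Return True if an alignment line maps to `ref` with x <= POS <= y."""
    if sam_line.startswith("@"):  # header lines carry no alignment
        return False
    fields = sam_line.rstrip("\n").split("\t")
    # SAM columns (0-based): 2 = RNAME, 3 = POS
    return fields[2] == ref and x <= int(fields[3]) <= y


def get_interval(sam_lines, ref, x, y):
    """Keep only the alignment lines that fall inside the interval."""
    return [line for line in sam_lines if in_interval(line, ref, x, y)]
```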

-[GetIntervalSeq]: Same as GetIntervalSAM but extracting only the SEQ field.

-[GetIntervalSAMSample]: Same as GetIntervalSAM but receives a file containing many intervals to query

       Use: ./GetIntervalSAMSample <arch>.csam sample_interval_file
       Use: ./GetIntervalSAMSample <arch>.csam sample_interval_file BuffSizeInBytes

-[GetIntervalSeqSample]: Same as GetIntervalSAMSample but extracting only the SEQ field

-[GetIntervalSSN]: Same as GetIntervalSAMSample but extracting only a selection of the fields and replacing the rest with empty values. For the moment it extracts a minimal set (QNAME FLAG RNAME POS MAPQ SEQ). Modify line 113 of the source file to change this option (TODO: do it by command line)

       Use: ./GetIntervalSSN <arch>.csam sample_interval_file
       output: file_name + "_inter.sam" File

Stats Methods

The stats_src folder of this library also contains the following programs:

-[Change_qual]: Changes the quality field of a SAM file with the quality file given

    Use: python ./Change_qual.py file.sam new_qual.qual
    output: newSAM.sam

-[Change_qual_letter]: Changes the quality field of a SAM file to only one quality score value

    Use: python ./Change_qual_letter.py file.sam letter name_output.sam
    output: name_output.sam
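
The transformation described for Change_qual_letter can be sketched per line as follows. This is not the script from stats_src: the function name is hypothetical, and repeating the letter once per base of SEQ is an assumption about how the single value is expanded.

```python
def set_constant_quality(sam_line, letter):
    """Replace the QUAL field of one alignment line with a repeated letter.

    Header lines (starting with '@') are returned unchanged. The result
    has no trailing newline.
    """
    line = sam_line.rstrip("\n")
    if line.startswith("@"):
        return line
    fields = line.split("\t")
    # SAM columns (0-based): 9 = SEQ, 10 = QUAL
    fields[10] = letter * len(fields[9])
    return "\t".join(fields)
```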

-[ComputeEntroHist]: Computes the order-0 entropy of a file and returns the histogram of its symbols

  (compile first: g++ -o ComputeEntroHist ComputeEntroHist.cpp)
  Use: ./ComputeEntroHist <arch> <out_arch>
  output: Prints the entropy of the file on screen and writes the histogram of the symbols to <out_arch>
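
For reference, the order-0 (empirical) entropy of a sequence is H = -sum(p_s * log2(p_s)) over its symbols s, where p_s is the relative frequency of s. The Python sketch below computes it together with the histogram. It is an illustration only, not the C++ program; the function name is hypothetical, and reporting the result in bits per symbol (log base 2) is an assumption.

```python
import math
from collections import Counter


def entropy_hist(data):
    """Return (order-0 entropy in bits per symbol, symbol histogram)."""
    hist = Counter(data)
    n = len(data)
    if n == 0:
        return 0.0, hist
    entropy = -sum((c / n) * math.log2(c / n) for c in hist.values())
    return entropy, hist
```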

-[ComputeMetrics]: Compares two quality files and computes some distance metrics

    (compile first: g++ -o ComputeMetrics ComputeMetrics.cpp)
    Use: ./ComputeMetrics qualityFile.qual referenceFile.qual
    output: Prints the Manhattan, Max:Min, MSE, Chebyshev, Soergel and Lorentzian metrics on screen.
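
Most of the listed metrics have standard textbook definitions, sketched below in Python for two equal-length lists of numeric quality values. This is not the ComputeMetrics source: the function name is hypothetical, whether the program normalizes any of these values is not stated here, and Max:Min is left out because its exact definition is not given in this document.

```python
import math


def distance_metrics(a, b):
    """Standard distances between two equal-length, non-empty sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    diffs = [abs(x - y) for x, y in zip(a, b)]
    manhattan = sum(diffs)
    denom = sum(max(x, y) for x, y in zip(a, b))
    return {
        "manhattan": manhattan,                            # sum |x - y|
        "mse": sum(d * d for d in diffs) / len(diffs),     # mean squared error
        "chebyshev": max(diffs),                           # max |x - y|
        "soergel": manhattan / denom if denom else 0.0,    # sum |x - y| / sum max(x, y)
        "lorentzian": sum(math.log(1 + d) for d in diffs)  # sum ln(1 + |x - y|)
    }
```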

-[Get_qual]: Extracts the quality field of a SAM file

    Use: python ./Get_qual.py file.sam
    output:  file.sam.qual

-[Get_seq]: Extracts the reference, position and sequence fields of a SAM file

    Use: python ./Get_seq.py file.sam
    output:  file.sam.rps
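
Both Get_qual and Get_seq amount to selecting columns from the tab-separated alignment lines. The sketch below shows that selection in Python. It is not the code of either script: the names are hypothetical, and skipping header lines and joining the selected fields with a single space are assumptions about the output layout.

```python
# 0-based SAM column indices: RNAME = 2, POS = 3, SEQ = 9, QUAL = 10
QUAL_COLUMNS = (10,)
RNAME_POS_SEQ_COLUMNS = (2, 3, 9)


def extract_columns(sam_lines, columns):
    """Yield the selected columns of each alignment line, space-joined."""
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        yield " ".join(fields[c] for c in columns)
```

For example, `list(extract_columns(open("file.sam"), QUAL_COLUMNS))` would collect one quality string per read.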

-[getVCF]: Simple example of how to generate the vcf file of a BAM file using mpileup and bcftools.

    Use: ./getVCF reference_file file.bam
    output:  file.bam.vcf

-[get_vcf_stats]: Compares two vcf files computing true positive, false positive, false negative, precision, recall, and MSE.

    Use: python ./get_vcf_stats.py original.vcf second.vcf
    output: Prints the stats on screen
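
The counts reported by a comparison like this relate to precision and recall in the usual way: precision = TP / (TP + FP) and recall = TP / (TP + FN). A minimal Python sketch follows. It is not get_vcf_stats.py: the function name is hypothetical, identifying a variant by a (chrom, pos, ref, alt) tuple is an assumption, and the MSE figure is omitted because this document does not say what it is computed over.

```python
def vcf_stats(original, second):
    """Compare two collections of variant keys, treating `original` as truth."""
    truth, called = set(original), set(second)
    tp = len(truth & called)   # present in both files
    fp = len(called - truth)   # only in the second file
    fn = len(truth - called)   # only in the original file
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall}
```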

Note: These programs assume that the computer has enough RAM to read and store the complete input.


License

GPL-3.0 (see the LICENSE file). The repository also contains a COPYING file.
