GitHub - Bioinfo-Tools/PANGEA-plus: A new implementation of PANGEA pipeline for metagenomics with multiple classification methods and consensus analysis

#PANGEA+

A new implementation of the PANGEA pipeline for faster and more accurate metagenomics, with multiple classification methods and consensus analysis.

#Download

(LINUX):

wget https://github.com/Bioinfo-Tools/PANGEA-plus/tarball/master -O BioinfoTools_PANGEA-plus.tar.gz

(MAC):

curl -L https://github.com/Bioinfo-Tools/PANGEA-plus/tarball/master -o BioinfoTools_PANGEA-plus.tar.gz

#Extract the files:

tar -xvf BioinfoTools_PANGEA-plus.tar.gz

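Set your working directory to the extracted PANGEA-plus directory. The folder name created by tar may differ from the one shown here; check the tar output and adjust the cd command if needed.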

cd BioinfoTools_PANGEA-plus
export PANGEAWD=$PWD
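Since every later command depends on `PANGEAWD`, it may help to confirm that it points at a directory containing the sub-folders used in this README. The following is a minimal sketch, not part of PANGEA+: the helper name `check_pangea_layout` is hypothetical, and the folder list is simply taken from the paths that appear in the commands below.

```shell
# Hypothetical helper: verify that a directory looks like a PANGEA+ checkout.
# Prints each missing sub-folder and returns non-zero if any is absent.
check_pangea_layout() {
    dir=$1
    missing=0
    for sub in Trim Classify Tax_class Consensus Megaclust Megaclustable Scripts; do
        if [ ! -d "$dir/$sub" ]; then
            echo "missing: $dir/$sub"
            missing=1
        fi
    done
    return "$missing"
}

# Usage: check_pangea_layout "$PANGEAWD" && echo "layout OK"
```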

#Install parallel BLAST (for High Performance Computing clusters)

cd $PANGEAWD/Classify/Runblast
sh install_MPI_blast.sh

#Trimming your input sequences

cd $PANGEAWD/Trim
perl trim2.3.pl -a ../input_A.txt -b ../input_B.txt -g 100

where:

    perl trim2.3.pl ...
    -a   raw Illumina input file, read 1
    -b   raw Illumina input file, read 2 (if any)
    -g   size of the gap between paired ends (if any)
    -t   truncate size (if any)
    -q   quality file (in case of FASTA input)
    -qc  quality cutoff value
    -lc  minimum length

Supported formats: FASTA, FASTQ and QSEQ.

Results are saved in the $PANGEAWD/output/trim2 folder.
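To see how many reads survive trimming, one option is to compare record counts before and after. The sketch below assumes FASTQ files use the common four-lines-per-record layout without line wrapping; the helper names are hypothetical, and because the names of the files written under output/trim2 are not stated here, the file path is passed explicitly.

```shell
# Count records in a FASTQ file, assuming the common 4-lines-per-record layout
# (header, sequence, '+', qualities) with no line wrapping.
count_fastq_reads() {
    lines=$(wc -l < "$1")
    echo $((lines / 4))
}

# Count records in a FASTA file by counting header lines that start with '>'.
count_fasta_reads() {
    grep -c '^>' "$1"
}

# Usage: count_fastq_reads "$PANGEAWD/input_A.txt"
```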

#Download / Install Blast

cd $PANGEAWD/Classify/Runblast
sh install_blast.sh

#Download NCBI database for classification

cd $PANGEAWD
wget ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nt.gz
gunzip nt.gz

#Format the database

$PANGEAWD/Classify/Runblast/makeblastdb -in $PANGEAWD/nt -out $PANGEAWD/nt -dbtype nucl
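A quick check that formatting actually produced a database can save a failed cluster job later. The sketch below assumes the usual makeblastdb file naming for nucleotide databases: a `.nin` index file for a single-volume database, or a `.nal` alias file when the database is split into volumes, which is likely for nt. The helper name `blast_nucl_db_exists` is hypothetical.

```shell
# Return success if a nucleotide BLAST database with the given prefix exists.
# Checks for the index file of a single-volume database (<prefix>.nin) or the
# alias file written for multi-volume databases (<prefix>.nal).
blast_nucl_db_exists() {
    [ -f "$1.nin" ] || [ -f "$1.nal" ]
}

# Usage: blast_nucl_db_exists "$PANGEAWD/nt" || echo "database not formatted yet"
```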

#Classify your sequences using parallel BLAST search

cd $PANGEAWD/Classify/Runblast

Example of parallel BLAST (MPI-blastn) executed on a PBS/Torque/Maui HPC cluster:

Use the example submission script available in the $PANGEAWD/Scripts directory.

*EDIT THE FILE submit_MPI-blast.job FIRST!

qsub $PANGEAWD/Scripts/submit_MPI-blast.job

where: input.fasta refers to your sequences after trimming.

To run parallel BLAST on multiple input files at the same time:

*EDIT THE FILE submit_multiple_MPI-blast.job FIRST! Follow the instructions in the file.

*Replace ./dir/ with your input directory and change the values of these parameters before running: "database="; "total_processes="; "nodes="

for i in ./dir/*.fasta; do qsub submit_multiple_MPI-blast.job -v in="$i",out="$i.txt",database=database_name,nodes=4,total_processes=16; done

where:

    ./dir/            your input sequences directory
    nodes=            the number of requested nodes
    total_processes=  the total number of processes requested
    database=         the name of the database

The output files will have the same names as your inputs, with a .txt suffix appended.
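Before submitting many jobs at once, it can be useful to print the qsub commands and review them. The following sketch builds the same command as the loop above without running it. The function name `print_blast_submissions` is hypothetical, and the variable list passed to `-v` has no spaces after the commas, so it stays a single argument.

```shell
# Print (rather than run) one qsub command per FASTA file in a directory,
# so the submissions can be reviewed first.
# Arguments: input directory, database name, nodes, total processes.
print_blast_submissions() {
    in_dir=$1; database=$2; nodes=$3; total=$4
    for f in "$in_dir"/*.fasta; do
        [ -e "$f" ] || continue   # skip the unexpanded pattern when no files match
        echo "qsub submit_multiple_MPI-blast.job -v in=$f,out=$f.txt,database=$database,nodes=$nodes,total_processes=$total"
    done
}

# Usage: print_blast_submissions ./dir database_name 4 16
```

Once the printed commands look right, the output can be piped to `sh` to submit them. This assumes file names without spaces.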

Example using your own blastn installation:

export PATH=$PANGEAWD/Classify/Runblast:$PATH
blastn -query input.fasta -db database.formated -outfmt 6 -out blast_output.txt
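The `-outfmt 6` option writes one tab-separated line per hit, with the query ID in column 1 and the bit score in column 12 of the default layout. As a rough sketch of how such output could be reduced to one best hit per query before further parsing (this is not part of the PANGEA+ pipeline, and `best_hit_per_query` is a hypothetical helper name):

```shell
# Keep the highest-scoring hit for each query in BLAST tabular output (-outfmt 6).
# Column 1 is the query ID and column 12 the bit score in the default layout.
# Output order follows the first appearance of each query; ties keep the first hit.
best_hit_per_query() {
    awk -F '\t' '
        !($1 in best) { order[++n] = $1 }
        !($1 in best) || $12 + 0 > score[$1] { best[$1] = $0; score[$1] = $12 + 0 }
        END { for (i = 1; i <= n; i++) print best[order[i]] }
    ' "$1"
}

# Usage: best_hit_per_query blast_output.txt > blast_best_hits.txt
```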

#Parse the taxonomic classification based on the NCBI taxonomy databases

Running NCBI-taxcollector:

cd $PANGEAWD/Tax_class
make all
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_nucl.dmp.tar.gz
tar -xvf taxdump.tar.gz
tar -xvf gi_taxid_nucl.dmp.tar.gz
./tax_class -c
perl NCBI-taxcollector-0.01.pl -f $PANGEAWD/parallel_output.txt -o $PANGEAWD/parallel_output_class.txt > report.txt

where:

    parallel_output.txt        the MPI-blastn results
    parallel_output_class.txt  the parsed and classified output generated by this program

#Classify your sequences using RDP Classifier search

cd $PANGEAWD/Classify/RunRDP/
sh install_RDPClassifier.sh
java -Xmx1g -jar rdp_classifier-2.5.jar -q $PANGEAWD/input_trimmed.txt -o output_rdp.txt

Where: -q refers to the query file. -o refers to the output file.

#Classify your sequences using parallel SOAP Aligner search

Format your database:

 cd $PANGEAWD/Classify/Runsoap/soap2.21release
 ./2bwt-builder $PANGEAWD/database.fasta

Run sequence search:

 ./soap -a $PANGEAWD/input_trimmed.fasta -D $PANGEAWD/database.fasta.index -o $PANGEAWD/output_soap.txt -p 8 -M 4

Where:

    -D      Prefix name for reference index [*.index].
    -a      Query file, for SE reads alignment or one end of PE reads.
    -b      Query b file, one end of PE reads.
    -o      Output file
    -p      Number of threads
    -M INT  Match mode for each read or the seed part of read, which shouldn't contain more than 2 mismatches, [4]
              0: exact match only
              1: 1 mismatch match only
              2: 2 mismatch match only
              3: [gap] (coming soon)
              4: find the best hits

#Run Consensus Analysis

cd $PANGEAWD/Consensus
perl Consensus_BLAST_SOAP_RDP-1.1.pl -b output_blast_class.txt -r output_rdp.txt -o output_consensus.txt

Where:

    -b  Classification results (BLAST) parsed by NCBI-taxcollector
    -r  Classification results (RDP)
    -s  Classification results (SOAP2)
    -o  Output file (txt)

The output should look like this:

     S000008953	[0]Bacteria;[1]Firmicutes;[2]Bacilli;[3]Bacillales;[4]Bacillaceae;[5]Bacillus;[6]Bacillus_sp._8A18S6;		92.61	1435	81	23	29	1452	1	1421	0.0	2039
     #Matches found: 4
     S000010870	[0]Bacteria;[1]Firmicutes;[2]Bacilli;[3]Bacillales;[4]Bacillaceae;[5]Bacillus;[6]Bacillus_sp._8A18S6;		91.78	1435	90	26	49	1469	1	1421	0.0	1971
     #Matches found: 4
     S000014058	[0]Bacteria;[1]Firmicutes;[2]Bacilli;[3]Bacillales;[4]Bacillaceae;[5]Bacillus;[6]Bacillus_sp._8A18S6;		92.20	1435	88	22	29	1453	1	1421	0.0	2008
     #Matches found: 4
     S000016099	[0]Bacteria;[1]Firmicutes;[2]Bacilli;[3]Bacillales;[4]Bacillaceae;[5]Bacillus;[6]Bacillus_sp._8A18S6;		91.66	1438	86	29	49	1469	1	1421	0.0	1960
     #Matches found: 4
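In the sample above, the second tab-separated column holds the lineage, with each level tagged as `[N]` (for instance `[0]` for domain and `[5]` for genus). As an illustration of how that column could be tallied, the sketch below counts sequences per taxon at a chosen level. It assumes the layout shown in the sample; `count_taxa_at_level` is a hypothetical helper, not a PANGEA+ script.

```shell
# Count sequences per taxon at a given level of the consensus output.
# Level numbers follow the [N] tags in the lineage column (column 2).
# Lines starting with '#' (such as "#Matches found: 4") are skipped.
# Arguments: consensus output file, level number.
count_taxa_at_level() {
    awk -F '\t' -v level="$2" '
        /^[[:space:]]*#/ { next }
        {
            n = split($2, ranks, ";")
            tag = "[" level "]"
            for (i = 1; i <= n; i++)
                if (index(ranks[i], tag) == 1)
                    count[substr(ranks[i], length(tag) + 1)]++
        }
        END { for (t in count) print count[t] "\t" t }
    ' "$1" | sort -k1,1nr -k2,2
}

# Usage: count_taxa_at_level output_consensus.txt 5
```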

#Cluster your results by identity:

Example for 80% identity*:

perl $PANGEAWD/Megaclust/megaclust2.pl -i $PANGEAWD/output_consensus.txt -o $PANGEAWD/output_consensus.megaclust_80_hits.txt -b 100 -s 80 -e 1e-20

*More examples and automatic scripts at $PANGEAWD/Scripts
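When clustering at several identity levels, the commands can be generated in a loop. The sketch below only prints them for review. It assumes, based on the 80% example above, that `-s` takes the percent identity and that output files follow the same naming pattern; `-b` and `-e` are kept at the example's values. The function name `print_megaclust_cmds` is hypothetical.

```shell
# Print one megaclust2.pl command per identity threshold for review.
# Assumes, from the 80% example, that -s takes the percent identity.
# Arguments: consensus file (ending in .txt), then one or more thresholds.
print_megaclust_cmds() {
    input=$1; shift
    for pct in "$@"; do
        echo "perl \$PANGEAWD/Megaclust/megaclust2.pl -i $input -o ${input%.txt}.megaclust_${pct}_hits.txt -b 100 -s $pct -e 1e-20"
    done
}

# Usage: print_megaclust_cmds "$PANGEAWD/output_consensus.txt" 80 90 95
```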

#Generate summary table for classified results:

Example for Domain level (80%) similarity*:

perl $PANGEAWD/Megaclustable/megaclustable.pl -m $PANGEAWD/output_consensus.megaclust_80_hits.txt -t 0 -o $PANGEAWD/results/megaclustable/DomainTable.txt

*Note: in the -m option, list all the output files generated by the megaclust execution for every sample. More examples and automatic scripts are in $PANGEAWD/Scripts.

The classification output should be like this:

		1	2	3	4	5	6	7	8	9	10
Bacteria	479	4	32	7507	11977	13245	2129	11222	539	2411	
Eukaryota	1	4	5	5	2	17	78	3	10	3	
Archaea		1	0	0	0	0	0	0	0	0	1
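A table like this can be summarized further with standard tools. As a small sketch, the function below sums the counts across samples for each row. It assumes the first line is the header of sample numbers and that taxon names contain no spaces; `row_totals` is a hypothetical helper name.

```shell
# Sum the per-sample counts in each row of a megaclustable-style table.
# The first line is treated as the header of sample numbers and skipped;
# fields are split on any run of whitespace, so uneven tabs are tolerated.
row_totals() {
    awk 'NR > 1 && NF > 1 {
        total = 0
        for (i = 2; i <= NF; i++) total += $i
        print $1 "\t" total
    }' "$1"
}

# Usage: row_totals "$PANGEAWD/results/megaclustable/DomainTable.txt"
```

For the sample table above, the totals come to 49545 for Bacteria, 128 for Eukaryota and 2 for Archaea.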

