forked from hyattpd/Prodigal
Prodigal Gene Prediction Software
License
sjackman/Prodigal
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
/******************************************************************************* PRODIGAL (PROkaryotic DynamIc Programming Genefinding ALgorithm) Copyright (C) 2007-2014 University of Tennessee / UT-Battelle Code Author: Doug Hyatt This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>. *******************************************************************************/ README for Command-Line Prodigal, Version 2.7.0 I. Introduction Prodigal (PROkaryotic DynamIc programming Genefinding ALgorithm) is an open source lightweight microbial genefinding program developed at University of Tennessee and Oak Ridge National Laboratory. The code was written by Doug Hyatt, with ideas from Loren Hauser, and with further assistance from Frank Larimer and Miriam Land. Gwo-Liang Chen and Philip Locascio also contributed to previous microbial genefinding work at ORNL prior to Prodigal. Reference: Hyatt D, Chen GL, Locascio PF, Land ML, Larimer FW, Hauser LJ. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics. 2010 Mar 8;11(1):119. The metagenomic/draft code was written by Doug Hyatt, with ideas from Loren Hauser and Edward C. Uberbacher. Detailed documentation for Prodigal can be found at the main web site (http://prodigal.ornl.gov/) or at the Prodigal github repository (http://github.com/hyattpd/prodigal/). To install Prodigal on Linux or MacOSX, just type 'make'. On Windows, you should compile with mingw (http://www.mingw.org/). Other methods of compilation are not supported on Windows. Note that compilation may take a few minutes due to the addition of precalculated training file data to the main program. We did this assuming the user would rather have a single executable and not depend on external files, but it does mean it takes a bit of time to compile. Do 'prodigal -h' to get a list of options, i.e.: Usage: prodigal [-a protein_file] [-c] [-d mrna_file] [-f out_format] [-g trans_table] [-h] [-i input_file] [-m mode] [-n] [-o output_file] [-q] [-s start_file] [-t train_file] [-v] [-w summ_file] Gene Modeling Parameters -m, --mode: Specify mode (normal, train, or anon). normal: Single genome, any number of sequences. (Default) train: Do only training. Input should be multiple FASTA of one or more closely related genomes. Output is a training file. anon: Anonymous sequences, analyze using preset training files, ideal for metagenomic data or single short sequences. -g, --trans_table: Specify a translation table to use Auto: Tries 11 then 4 (Default) 11: Standard Bacteria/Archaea 4: Mycoplasma/Spiroplasma #: Other genetic codes 1-25 -c, --nopartial: Closed ends. Do not allow partial genes (genes that run off edges or into gaps.) -n, --nogaps: Do not treat runs of N's as gaps. This option will build gene models that span runs of N's. -t, --training_file: Read and use the specified training file instead of training on the input sequence(s) (Only usable in normal mode.) Input/Output Parameters -i, --input_file: Specify input file (default stdin). -o, --output_file: Specify output file (default stdout). -a, --protein_file: Write protein translations to the named file. -d, --mrna_file: Write nucleotide sequences of genes to the named file. -w, --summ_file: Write summary statistics to the named file. -s, --start_file: Write all potential genes (with scores) to the named file. -f, --output_format: Specify output format (gbk, gff, sqn, or sco). gff: GFF format (Default) gbk: Genbank-like format sqn: Sequin feature table format sco: Simple coordinate output -q, --quiet: Run quietly (suppress normal stderr output). Other Parameters -h, --help: Print help menu and exit. -v, --version: Print version number and exit. II. Running Prodigal on Single Genomes Prodigal can be run in one pass on a single genome, regardless of if the genome consists of one sequence or many contigs. Prodigal can read sequences in FASTA format, as well as Genbank and EMBL formats. The Genbank and EMBL parsers are very simple/dumb, however, so these should be used with caution. To run Prodigal on a single finished genome, simply do: prodigal < genome.seqs > my.genes or prodigal -i genome.seqs -o my.genes By default, Prodigal outputs GFF format (in versions prior to 2.7.0, the default output format was a Genbank feature table). You can also specify gbk (Genbank table), sqn (NCBI Sequin-compatible format), or SCO (lightweight format used by TiCO). By default, Prodigal predicts genes that run off the edges of the sequence. In addition, Prodigal treats any two consecutive completely ambiguous codons (i.e. amino NN, or dna nnnnnn) as a gap. Prodigal will also predict partial genes that run into gaps. If you wish to disallow this (sometimes correct for finished genomes, but it seldom makes any difference), use the -c option, and Prodigal will no longer predict partial genes, i.e. prodigal -c -i genome.seqs -o my.genes If you do not wish for Prodigal to treat runs of N's as gaps, then you can disable this behavior with the '-n' option, e.g. prodigal -n -i genome.seqs -o my.genes Note that the above option will cause gene models occasionally to be built that span gaps, i.e. you will see runs of X's in your protein translations (and runs of N's in your mRNAs) if you have gaps in your sequences. If you wish to retain a training file (for future sequences, perhaps), you may run Prodigal in training mode: prodigal -m train -o my.trn (To write a training file) This will train on the supplied sequences and write to the 'my.trn' training file. To then analyze a sequence or sequences using this training file, you would do: prodigal -t my.trn -i genome.seq -o my.genes (To read/use the training file) Note that the above method of doing training has changed in 2.7.0 (and should now be much less confusing). III. Running Prodigal on Multiple Sequences/Draft/Plasmids/Chromosomes/etc. With version 2.00+, you may now run Prodigal in a single step on multiple FASTA files. The recommended way of doing this is to put all the contigs in a single file, i.e. cat genome.contig* > genome.allseqs.fna Prodigal can then be run as normal: prodigal -i genome.allseqs.fna -o my.genes Outputs will appear for each sequence in the same order as the input file. You may also run Prodigal in two steps: (1) training and (2) analysis, if you want to analyze sequences one at a time and store the output in separate files. Prodigal can train on a multiple FASTA format file, so you can do, for example: cat genome.contig* | prodigal -m train -o genome.trn to create a training file to use later. To do the analysis on single sequences once you have trained, do: prodigal -t my.trn < genome.contig1 > contig1.genes prodigal -t my.trn < genome.contig2 > contig2.genes prodigal -t my.trn < genome.contig3 > contig3.genes etc. One final note about training files: they are BINARY files, so if you train on some architectures, the training file may not work if you attempt to use it on a different architecture. IV. Running Prodigal on Metagenomic Samples Prodigal can be run on metagenomic sequences using the "-m anon" option. All "metagenomic" means in this case is that Prodigal uses the best of 50 pre-generated training files rather than doing any training of its own. "Anonymous" mode simply means that you don't know to which genome each sequence belongs ahead of time. Prior to 2.7.0, the "-p meta" option was used, but we decided to change this option to a more meaningful/accurate description, since "anonymous" mode should also be used to analyze short sequences (<100KB) for which Prodigal doesn't have enough training data to perform well in normal mode. The "-p meta" is deprecated and may be removed in a future version, but it still works as of 2.7.0. Prodigal scans the sequence and tries a variety of precalculated training files, and outputs the predictions that achieve the best score. Model information is printed in the comment lines showing which training file was ultimately used. You can run one sequence at a time, or on multiple FASTA. Either way works fine, since Prodigal doesn't need to train or do anything special. You cannot specify translation table, training files, etc., in anonymous mode, as Prodigal will simply use the characteristics of the 50 canned training files instead. V. Training Files and Different Versions of Prodigal Training files between different versions of Prodigal are NOT GUARANTEED TO BE COMPATIBLE. Specifically, training files from versions 1, 1.01, 1.02-1.05, 1.10 and 1.20+ are all incompatible with each other. VI. Obtaining a List of Protein Translations or Nucleotide Sequences At the user's request, Prodigal will print protein translations to a specified file. The -a option specifies a file name to write the protein translations to. Protein translations are written in FASTA format, with a header providing a numerical id for the gene, its location, and its strand (1 for forward, -1 for reverse). The -d option provides a similar function for the nucleotide sequences of genes. VII. Generating a List of All Potential Start/Stop Pairs and Summary Files The -s option can be used to write a file containing all the starts in the entire genome above a particular length (90bp is the #defined option for minimum gene size). This file is generally only useful if you are hand curating a genome and wish to see the scores of the starts not chosen by the program. The output looks like this: Beg End Std Total CodPot StrtSc Codon RBSMot Spacer RBSScr UpsScr TypeScr 337 2799 + 339.43 323.32 16.11 ATG GGAG/GAGG 5-10bp 10.94 1.32 3.85 343 2799 + 313.32 323.67 -10.35 GTG 4Base/6BMM 13-15bp -2.92 -0.95 -6.48 346 2799 + 303.01 322.46 -19.45 TTG None None -8.11 1.12 -12.46 367 2799 + 314.39 323.67 -9.28 GTG GGxGG 5-10bp -2.37 -0.43 -6.48 433 2799 + 299.65 304.00 -4.35 GTG GGA/GAG/AGG 5-10bp 2.93 -0.79 -6.48 478 2799 + 277.33 292.78 -15.45 GTG None None -8.11 -0.86 -6.48 484 2799 + 284.17 289.52 -5.36 ATG None None -8.11 -1.10 3.85 559 2799 + 244.91 264.56 -19.64 TTG None None -8.11 0.93 -12.46 etc. where the first two values are the beginning and end, followed by the strand. The next three scores are the total score, the coding potential, and the start score (a composite of ATG/GTG/TTG score, RBS motif score, -1/-2 and -15 to -45 upstream region score, and length factors). Following these scores are the start codon, the RBS motif (for Shine-Dalgarno, we report only the bin assigned to this node, which may contain multiple possible motifs; for the non-SD motif organisms, the motif and spacer will be exactly determined), the spacer, the score for this motif, the upstream score for the -1/-2 and -15 to -45 region, and the type score for ATG/GTG/TTG. Normally, the sum of the last three scores will be the same as the start score, but this score is penalized for genes less than 250bp. The dynamic programming also includes some contextual factors not displayed in this file, such as operon distances, overlaps, etc., so the highest scoring total score start may not always be what is used in the final models. This file is a DUMP of every single ORF with a potential start, even if the score is horribly negative. This in no way represents any sort of actual output that should be taken as real genes. It is simply a resource provided for those who wish to examine individual ORFs in more detail to see how Prodigal scores the various potential starts. In addition, the "-w" option will generate a file of summary statistics about the genome, including average contig length, average gene length, start codon usage across the genome, and RBS motif usage across the genome. VIII. Mycoplasma and Other Nonstandard Stop Codons Prodigal supports all of the Genbank translation tables with the '-g' option. The expected value for the '-g' option is the number of the translation table to use. The primary code is 11 (standard microbial table). Mycoplasma uses 4. So '-g 4' would be the correct option to analyze mycoplasma genomes. Translation tables are available at NCBI: http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi If the user does not specify a translation table, the program will check the average gene length in the training set and automatically try genetic code 4 if genetic code 11 doesn't seem to work. IX. Excluding RNA Genes and Other Regions from Gene Predictions To avoid predicting genes in known non-protein-coding regions, the user may mask out tRNAs, rRNAs, and other miscellaneous RNAs/objects by replacing their sequence with n's. This will cause these regions to be treated as gaps. The user should run with the '-c' option in this case to prevent gene models from being built that run into or span gaps. X. Understanding the Prodigal Output As of 2.00, Prodigal now outputs detailed information about each gene via the attributes field in GFF3 (or the /note field in Genbank). This information is the same as described above in section VII. In addition, the comment lines contain global information about the data, i.e. sequence length, FASTA header, a numerical id, the GC content, whether or not this is a single- or metagenomic- genome analysis, the translation table, and whether or not the organism uses the SD motif (uses_sd). The "partial=01", etc., field is used to indicate if genes continue off the edges of the contig. A '0' indicates that the gene is contained within the contig, and a '1' indicates the gene runs off that edge. So '11' runs off both edges of the contig, '10' runs off the left edge, '01' runs off the right edge, and '00' is fully contained within the contig. XI. How Much Sequence Does Prodigal Need to Train? With v2.00, you now have the option of running the "anonymous" gene finder on small contigs. Ideally, Prodigal should be given at least 100 high quality genes on which to train (100kbp+). While Prodigal will accept a 20KB sequence and train on it, keep in mind that only giving the program 20 genes to train on is not going to produce great results. You are probably much better off running the anonymous mode version with '-m anon' (or, better yet, get more contigs from the same genome and put them together so you have enough data to train). People frequently submit single small (20kbp) segments to the web server, but they are likely not getting very good results by doing so (since 20kbp is not enough to train on). XII. Questions/Concerns If you have any questions about this software, feel free to contact the author Doug Hyatt at doug.hyatt@gmail.com.
About
Prodigal Gene Prediction Software
Resources
License
Stars
Watchers
Forks
Packages 0
No packages published
Languages
- C 100.0%