GitHub - sjackman/Prodigal: Prodigal Gene Prediction Software

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 19 Commits
CHANGES		CHANGES
LICENSE		LICENSE
Makefile		Makefile
README		README
VERSION		VERSION
anonymous.c		anonymous.c
anonymous.h		anonymous.h
bitmap.c		bitmap.c
bitmap.h		bitmap.h
datatypes.h		datatypes.h
dprog.c		dprog.c
dprog.h		dprog.h
gene.c		gene.c
gene.h		gene.h
main.c		main.c
node.c		node.c
node.h		node.h
preset.c		preset.c
sequence.c		sequence.c
sequence.h		sequence.h
setup.c		setup.c
setup.h		setup.h
summary.c		summary.c
summary.h		summary.h
training.c		training.c
training.h		training.h

Repository files navigation

/*******************************************************************************
PRODIGAL (PROkaryotic DynamIc Programming Genefinding ALgorithm)
Copyright (C) 2007-2014 University of Tennessee / UT-Battelle

Code Author: Doug Hyatt

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.
*******************************************************************************/

README for Command-Line Prodigal, Version 2.7.0

I. Introduction

Prodigal (PROkaryotic DynamIc programming Genefinding ALgorithm) is an open
source lightweight microbial genefinding program developed at University of
Tennessee and Oak Ridge National Laboratory.

The code was written by Doug Hyatt, with ideas from Loren Hauser, and with
further assistance from Frank Larimer and Miriam Land. Gwo-Liang Chen and
Philip Locascio also contributed to previous microbial genefinding work at ORNL
prior to Prodigal.

Reference: Hyatt D, Chen GL, Locascio PF, Land ML, Larimer FW, Hauser LJ.
Prodigal: prokaryotic gene recognition and translation initiation
site identification. BMC Bioinformatics. 2010 Mar 8;11(1):119.

The metagenomic/draft code was written by Doug Hyatt, with ideas from Loren
Hauser and Edward C. Uberbacher.

Detailed documentation for Prodigal can be found at the main web site
(http://prodigal.ornl.gov/) or at the Prodigal github repository
(http://github.com/hyattpd/prodigal/).

To install Prodigal on Linux or MacOSX, just type 'make'. On Windows, you
should compile with mingw (http://www.mingw.org/). Other methods of
compilation are not supported on Windows.

Note that compilation may take a few minutes due to the addition of
precalculated training file data to the main program. We did this assuming
the user would rather have a single executable and not depend on external
files, but it does mean it takes a bit of time to compile.

Do 'prodigal -h' to get a list of options, i.e.:

Usage: prodigal [-a protein_file] [-c] [-d mrna_file] [-f out_format]
[-g trans_table] [-h] [-i input_file] [-m mode] [-n]
[-o output_file] [-q] [-s start_file] [-t train_file]
[-v] [-w summ_file]

Gene Modeling Parameters

-m, --mode: Specify mode (normal, train, or anon).
normal: Single genome, any number of
sequences. (Default)
train: Do only training. Input should
be multiple FASTA of one or more
closely related genomes. Output
is a training file.
anon: Anonymous sequences, analyze using
preset training files, ideal for
metagenomic data or single short
sequences.
-g, --trans_table: Specify a translation table to use
Auto: Tries 11 then 4 (Default)
11: Standard Bacteria/Archaea
4: Mycoplasma/Spiroplasma
#: Other genetic codes 1-25
-c, --nopartial: Closed ends. Do not allow partial genes
(genes that run off edges or into gaps.)
-n, --nogaps: Do not treat runs of N's as gaps. This option
will build gene models that span runs of N's.
-t, --training_file: Read and use the specified training file
instead of training on the input sequence(s)
(Only usable in normal mode.)

Input/Output Parameters

-i, --input_file: Specify input file (default stdin).
-o, --output_file: Specify output file (default stdout).
-a, --protein_file: Write protein translations to the named file.
-d, --mrna_file: Write nucleotide sequences of genes to the
named file.
-w, --summ_file: Write summary statistics to the named file.
-s, --start_file: Write all potential genes (with scores) to the
named file.
-f, --output_format: Specify output format (gbk, gff, sqn, or sco).
gff: GFF format (Default)
gbk: Genbank-like format
sqn: Sequin feature table format
sco: Simple coordinate output
-q, --quiet: Run quietly (suppress normal stderr output).

Other Parameters

-h, --help: Print help menu and exit.
-v, --version: Print version number and exit.

II. Running Prodigal on Single Genomes

Prodigal can be run in one pass on a single genome, regardless of if the
genome consists of one sequence or many contigs. Prodigal can read sequences
in FASTA format, as well as Genbank and EMBL formats. The Genbank and EMBL
parsers are very simple/dumb, however, so these should be used with caution.

To run Prodigal on a single finished genome, simply do:

prodigal < genome.seqs > my.genes

prodigal -i genome.seqs -o my.genes

By default, Prodigal outputs GFF format (in versions prior to 2.7.0, the
default output format was a Genbank feature table). You can also specify
gbk (Genbank table), sqn (NCBI Sequin-compatible format), or SCO (lightweight
format used by TiCO).

By default, Prodigal predicts genes that run off the edges of the sequence. In
addition, Prodigal treats any two consecutive completely ambiguous codons (i.e.
amino NN, or dna nnnnnn) as a gap. Prodigal will also predict partial genes
that run into gaps.

If you wish to disallow this (sometimes correct for finished genomes, but it
seldom makes any difference), use the -c option, and Prodigal will no longer
predict partial genes, i.e.

prodigal -c -i genome.seqs -o my.genes

If you do not wish for Prodigal to treat runs of N's as gaps, then you can
disable this behavior with the '-n' option, e.g.

prodigal -n -i genome.seqs -o my.genes

Note that the above option will cause gene models occasionally to be built that
span gaps, i.e. you will see runs of X's in your protein translations (and
runs of N's in your mRNAs) if you have gaps in your sequences.

If you wish to retain a training file (for future sequences, perhaps), you may
run Prodigal in training mode:

prodigal -m train -o my.trn (To write a training file)

This will train on the supplied sequences and write to the 'my.trn' training
file. To then analyze a sequence or sequences using this training file, you
would do:

prodigal -t my.trn -i genome.seq -o my.genes (To read/use the training
file)

Note that the above method of doing training has changed in 2.7.0 (and should
now be much less confusing).

III. Running Prodigal on Multiple Sequences/Draft/Plasmids/Chromosomes/etc.

With version 2.00+, you may now run Prodigal in a single step on multiple
FASTA files. The recommended way of doing this is to put all the contigs in a
single file, i.e.

cat genome.contig* > genome.allseqs.fna

Prodigal can then be run as normal:

prodigal -i genome.allseqs.fna -o my.genes

Outputs will appear for each sequence in the same order as the input file.

You may also run Prodigal in two steps: (1) training and (2) analysis, if you
want to analyze sequences one at a time and store the output in separate files.

Prodigal can train on a multiple FASTA format file, so you can do, for example:

cat genome.contig* | prodigal -m train -o genome.trn

to create a training file to use later.

To do the analysis on single sequences once you have trained, do:

prodigal -t my.trn < genome.contig1 > contig1.genes
prodigal -t my.trn < genome.contig2 > contig2.genes
prodigal -t my.trn < genome.contig3 > contig3.genes
etc.

One final note about training files: they are BINARY files, so if you train on
some architectures, the training file may not work if you attempt to use it on a
different architecture.

IV. Running Prodigal on Metagenomic Samples

Prodigal can be run on metagenomic sequences using the "-m anon" option. All
"metagenomic" means in this case is that Prodigal uses the best of 50
pre-generated training files rather than doing any training of its own.
"Anonymous" mode simply means that you don't know to which genome each sequence
belongs ahead of time.

Prior to 2.7.0, the "-p meta" option was used, but we decided to change this
option to a more meaningful/accurate description, since "anonymous" mode should
also be used to analyze short sequences (<100KB) for which Prodigal doesn't
have enough training data to perform well in normal mode. The "-p meta" is
deprecated and may be removed in a future version, but it still works as of
2.7.0.

Prodigal scans the sequence and tries a variety of precalculated training files,
and outputs the predictions that achieve the best score. Model information is
printed in the comment lines showing which training file was ultimately used.
You can run one sequence at a time, or on multiple FASTA. Either way works
fine, since Prodigal doesn't need to train or do anything special.

You cannot specify translation table, training files, etc., in anonymous mode,
as Prodigal will simply use the characteristics of the 50 canned training files
instead.

V. Training Files and Different Versions of Prodigal

Training files between different versions of Prodigal are NOT GUARANTEED TO BE
COMPATIBLE. Specifically, training files from versions 1, 1.01, 1.02-1.05, 1.10
and 1.20+ are all incompatible with each other.

VI. Obtaining a List of Protein Translations or Nucleotide Sequences

At the user's request, Prodigal will print protein translations to a specified
file. The -a option specifies a file name to write the protein translations to.
Protein translations are written in FASTA format, with a header providing a
numerical id for the gene, its location, and its strand (1 for forward, -1 for
reverse). The -d option provides a similar function for the nucleotide
sequences of genes.

VII. Generating a List of All Potential Start/Stop Pairs and Summary Files

The -s option can be used to write a file containing all the starts in the
entire genome above a particular length (90bp is the #defined option for minimum
gene size). This file is generally only useful if you are hand curating a
genome and wish to see the scores of the starts not chosen by the program. The
output looks like this:

Beg End Std Total CodPot StrtSc Codon RBSMot Spacer RBSScr UpsScr TypeScr

337 2799 + 339.43 323.32 16.11 ATG GGAG/GAGG 5-10bp 10.94 1.32 3.85
343 2799 + 313.32 323.67 -10.35 GTG 4Base/6BMM 13-15bp -2.92 -0.95 -6.48
346 2799 + 303.01 322.46 -19.45 TTG None None -8.11 1.12 -12.46
367 2799 + 314.39 323.67 -9.28 GTG GGxGG 5-10bp -2.37 -0.43 -6.48
433 2799 + 299.65 304.00 -4.35 GTG GGA/GAG/AGG 5-10bp 2.93 -0.79 -6.48
478 2799 + 277.33 292.78 -15.45 GTG None None -8.11 -0.86 -6.48
484 2799 + 284.17 289.52 -5.36 ATG None None -8.11 -1.10 3.85
559 2799 + 244.91 264.56 -19.64 TTG None None -8.11 0.93 -12.46
etc.

where the first two values are the beginning and end, followed by the strand.
The next three scores are the total score, the coding potential, and the start
score (a composite of ATG/GTG/TTG score, RBS motif score, -1/-2 and -15 to -45
upstream region score, and length factors). Following these scores are the
start codon, the RBS motif (for Shine-Dalgarno, we report only the bin assigned
to this node, which may contain multiple possible motifs; for the non-SD motif
organisms, the motif and spacer will be exactly determined), the spacer, the
score for this motif, the upstream score for the -1/-2 and -15 to -45 region,
and the type score for ATG/GTG/TTG. Normally, the sum of the last three scores
will be the same as the start score, but this score is penalized for genes less
than 250bp.

The dynamic programming also includes some contextual factors not displayed in
this file, such as operon distances, overlaps, etc., so the highest scoring
total score start may not always be what is used in the final models.

This file is a DUMP of every single ORF with a potential start, even if the
score is horribly negative. This in no way represents any sort of actual
output that should be taken as real genes. It is simply a resource provided
for those who wish to examine individual ORFs in more detail to see how Prodigal
scores the various potential starts.

In addition, the "-w" option will generate a file of summary statistics about
the genome, including average contig length, average gene length, start codon
usage across the genome, and RBS motif usage across the genome.

VIII. Mycoplasma and Other Nonstandard Stop Codons

Prodigal supports all of the Genbank translation tables with the '-g' option.
The expected value for the '-g' option is the number of the translation table to
use.

The primary code is 11 (standard microbial table). Mycoplasma uses 4. So
'-g 4' would be the correct option to analyze mycoplasma genomes.

Translation tables are available at NCBI:
http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi

If the user does not specify a translation table, the program will check the
average gene length in the training set and automatically try genetic code 4
if genetic code 11 doesn't seem to work.

IX. Excluding RNA Genes and Other Regions from Gene Predictions

To avoid predicting genes in known non-protein-coding regions, the user may mask
out tRNAs, rRNAs, and other miscellaneous RNAs/objects by replacing their
sequence with n's. This will cause these regions to be treated as gaps.
The user should run with the '-c' option in this case to prevent gene models
from being built that run into or span gaps.

X. Understanding the Prodigal Output

As of 2.00, Prodigal now outputs detailed information about each gene via the
attributes field in GFF3 (or the /note field in Genbank). This information is
the same as described above in section VII. In addition, the comment lines
contain global information about the data, i.e. sequence length, FASTA header, a
numerical id, the GC content, whether or not this is a single- or metagenomic-
genome analysis, the translation table, and whether or not the organism uses
the SD motif (uses_sd).

The "partial=01", etc., field is used to indicate if genes continue off the
edges of the contig. A '0' indicates that the gene is contained within the
contig, and a '1' indicates the gene runs off that edge. So '11' runs off both
edges of the contig, '10' runs off the left edge, '01' runs off the right edge,
and '00' is fully contained within the contig.

XI. How Much Sequence Does Prodigal Need to Train?

With v2.00, you now have the option of running the "anonymous" gene finder
on small contigs. Ideally, Prodigal should be given at least 100 high quality
genes on which to train (100kbp+). While Prodigal will accept a 20KB sequence
and train on it, keep in mind that only giving the program 20 genes to train
on is not going to produce great results. You are probably much better off
running the anonymous mode version with '-m anon' (or, better yet, get more
contigs from the same genome and put them together so you have enough data
to train). People frequently submit single small (20kbp) segments to the web
server, but they are likely not getting very good results by doing so (since
20kbp is not enough to train on).

XII. Questions/Concerns

If you have any questions about this software, feel free to contact the author
Doug Hyatt at doug.hyatt@gmail.com.