GitHub - genome-vendor/gmap-gsnap: Unofficial repo for software vendoring or packaging purposes

Branches Tags
Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
config		config
debian		debian
src		src
tests		tests
util		util
AUTHORS		AUTHORS
COPYING		COPYING
ChangeLog		ChangeLog
INSTALL		INSTALL
Makefile.am		Makefile.am
Makefile.in		Makefile.in
NEWS		NEWS
README		README
VERSION		VERSION
acinclude.m4		acinclude.m4
aclocal.m4		aclocal.m4
config.site		config.site
configure		configure
configure.ac		configure.ac
Repository files navigation

0.  Availability
============

The source code for this package is available from
http://research-pub.gene.com/gmap.  License terms are provided in the
COPYING file.


1.  Building and installing GMAP and GSNAP
==========================================

Prerequisites: a Unix system (including Cygwin on Windows), a C
compiler, and Perl

Step 1: Set your site-specific variables by editing the file
config.site.  In particular, you should set appropriate values for
"prefix" and probably for "with_gmapdb", as explained in that file.
If you are compiling this package on a Macintosh, you may need to edit
CFLAGS to be

CFLAGS = '-O3 -m64'

since Macintosh machines will make only 32-bit executables by default.


Step 2: Build, test, and install the programs, by running the
following GNU commands

    ./configure
    make
    make check   (optional)
    make install

Note 1: Instead of editing the config.site file in step 1, you may type
everything on the command line for the ./configure script in step 2,
like this

    ./configure --prefix=/your/usr/local/path --with-gmapdb=/path/to/gmapdb

If you omit --with-gmapdb, it defaults to ${prefix}/share.  If you
omit --prefix, it defaults to /usr/local.  Note that on the command
line, it is "with-gmapdb" with a hyphen, but in a config.site file,
it is "with_gmapdb" with an underscore.


Note 2: If you want to keep your version of config.site or have
multiple versions, you can save the file to a different filename, and
then refer to it like this

    ./configure CONFIG_SITE=<config site file>


Note 3: GSNAP is designed for short reads of a limited length, and
relies upon a maximum read length variable MAX_READLENGTH defined at
compile time (default 250).  You may set this variable by providing it
to configure like this

    ./configure MAX_READLENGTH=<length>

or by defining it in your config.site file (or in the file provided to
configure as the value of CONFIG_SITE).  Or you may set the value of
MAX_READLENGTH as an environment variable before calling ./configure.
If you do not set MAX_READLENGTH, it will have the default value shown
when you run "./configure --help".

Note that MAX_READLENGTH applies only to GSNAP.  GMAP, on the other
hand, can process queries up to 1 million bp.


Note 4: GSNAP can read from gzip-compressed FASTA or FASTQ input
files.  This feature requires the zlib library to be present
(available from http://www.zlib.net).  The configure program will
detect the availability of zlib automatically.  However, to disable
this feature, you can add "--disable-zlib" to the ./configure command
or edit your config.site file to have the command "disable_zlib".


Note 5: GSNAP optionally supports the Goby input and output file
formats.  To implement this functionality, you need to obtain and
compile the libraries from http://campagnelab.org/software/goby.  If
the resulting header files are located in /path/to/goby/include and
the library files are in /path/to/goby/lib, you can then add the flag
"--with-goby=/path/to/goby" to your ./configure command or edit your
config.site file to have this directory as the value for "with_goby".



2.  Downloading a pre-built GMAP/GSNAP database
===============================================

A GMAP/GSNAP "database" is a set of genomic index files, representing
the genome in a hash table format.  You can use the programs
gmap_build or gmap_setup to build your own database (as described
below), but you can started quickly by downloading a pre-built
GMAP/GSNAP database from the same place you obtained the GMAP program
(see above for URL).

Place the database in the GMAPDB directory you specified in the
config.site file when you built the gmap program.  You should include
a subdirectory for each GMAP database; for example, if you downloaded
a database called <genome>, your directory structure should look like
this

    /path/to/gmapdb/<genome>/
    /path/to/gmapdb/<genome>/<genome>.chromosome
    /path/to/gmapdb/<genome>/<genome>.chromosome.iit
    ...
    /path/to/gmapdb/<genome>/<genome>.version

Note that the GMAP database format has changed with the 2011-08-15
release.  Older versions of GMAP and GSNAP will not work with the
newer databases, but the current version of the programs is backward
compatible with the older databases.  Also, versions of GMAP and GSNAP
before 2008 may require symbolic links to work even with the older
databases.

The old databases have the index files <genome>.ref3offsets and
<genome>.ref3positions.  The new databases have the index files
<genome>.ref12153gammaptrs, <genome>.ref12153offsetscomp, and
<genome>.ref153positions, if built using a base size of 12, a k-mer
size of 15, and skipping every 3 bp in the genome.  If the k-mer size
is equal to the base size, then the gammaptrs file will be absent.

Also, the name of the positions file has changed starting with
2012-02-14 version.  Previously, the file was named
<genome>.ref12153positions, but it is now named
<genome>.ref153positions, since the contents are independent of the
base size.  If you create a database with a newer version of the
package, and want older versions of the GMAP or GSNAP to work with
these newer versions, you will need to make a symbolic link like this:

    ln -s <genome>.ref153positions <genome>.ref12153.positions



3.  Setting up to build a GMAP/GSNAP database (one chromosome per FASTA entry)
==============================================================================

You can also build your own genomic database, using one of two utility
programs provided with this package: gmap_build (the newer, one-step
method) or gmap_setup (the older way that uses Makefile and requires
multiple steps).  Note that the total sequence length in your database
cannot exceed 2^32 = 4,294,967,296 (about 4 billion) bp.  This is
because the format uses 32-bit pointers.  If your total sequence
provided to the utility programs exceeds 4 billion bp, the programs
will abort.  Below I use the "genome" and "chromosome", but the input
sequences can be anything you wish to align to, including transcripts
or small fragments.

You will need to start with a set of FASTA files containing either
entire chromosomes or contigs that represent pieces of chromosomes.
If your FASTA entries each contain a single chromosome, and the
accession for each chromosome is the chromosome number/letter, you can
simply run this command

    gmap_build -d <genome> [-k <kmer size>] <fasta_files...>

which will build and install the GMAP index files in your default
GMAPDB location.

You can see the full usage of gmap_build by doing "gmap_build --help",
but here are some useful flags.  If your FASTA files are gzipped, you
can add the flag "-g" to gmap_build.  You can control the k-mer size
for the genomic index with the -k flag, which can range from 12 to 15.
The default value for -k is 15, but this requires your machine to have
4 GB of RAM to build the indices.  If you do not have 4 GB of RAM,
then you will need to reduce the value of -k or find another machine.
Here are the RAM requirements for building various indices:

    k-mer of 12: 64 MB
    k-mer of 13: 256 MB
    k-mer of 14: 1 GB
    k-mer of 15: 4 GB

These are the RAM requirements for building indices, but not to run
the GMAP/GSNAP programs once the indices are built, because the
genomic indices are compressed.  For example, the genomic index for a
k-mer of 15 gives a gammaptrs file of 64 MB and an offsetscomp file of
about 350 MB, much smaller than the 4 GB that would otherwise be
required.  Therefore, you may want to build your genomic index on a
computer with sufficient RAM, and distribute that index to be used by
computers with less RAM.

If you want to build your genomic databases with more than one k-mer
size, you can re-run gmap_build with different values of -k.  This
will overwrite only the identical files from the previous runs.  You
can then choose the k-mer size at run-time by using the -k flag for
either GMAP or GSNAP.



4.  Setting up to build a GMAP/GSNAP database (more complex cases)
==================================================================

If gmap_build works for you, you can skip to section 5.  Otherwise, if
you have more complicated needs than gmap_build can handle, there is a
more general build tool called gmap_setup, which creates a Makefile
with this command

    gmap_setup -d <genome> [-k <kmer size>] <fasta_file>...

and then has you run a few make commands, based on the directions it
provides.  Again, you can type "gmap_setup --help" to see the full set
of options.

Note that the term

    <fasta_file>... 

above indicates that multiple files can be listed.  The files can be
in any order, and the contigs can be in any order within these files.
By default, the GMAP setup process will sort the contigs and
chromosomes into their appropriate "chrom" order.  For the human
genome, this order is 1, 2, ..., 10, 11, ..., 22, X, Y, M, followed by
all other chromosomes in numeric/alphabetical order.  If you don't
want this sort, provide the "-s none" flag to gmap_setup or
gmap_build.  Other sort options besides "none" and "chrom" are "alpha"
and "numeric-alpha".

We show the full set of gmap_setup options under item 4d below, but
we first discuss some specific situations for using the program.


4a.  Chromosomes represented as contig pieces
=============================================

If your FASTA entries consist of contigs, each of which has a mapping
to a chromosomal region in the header, you may need to add the -C flag
to gmap_setup, like this

    gmap_setup -d <genome> -C <fasta_file>...

Then gmap_setup will try to parse a chromosomal region from each
header.  The program knows how to parse the following patterns:

    chr=1:217281..257582  [may insert spaces around '=', or omit '=' character]
    chr=1                 [may insert spaces around '=', or omit '=' character]
    chromosome 1                                                         [NCBI .mfa format]
    chromosome:NCBI35:22:1:49554710:1                                    [Ensembl format]
    /chromosome=2                                                        [Celera format]
    /chromosome=2 /alignment=(88840247-88864134) /orientation=rev        [Celera format]
    chr1:217281..257582
    chr1                  [may insert spaces after 'chr']

If only the chromosome is specified, without coordinates, the program
will assign its own chromosomal coordinates by concatenating the
contigs within each chromosome.  If gmap_setup cannot figure out the
chromosome, it will assign it to chromosome "NA".


4b.  Using an MD file
=====================

Another possibility is that your FASTA entries consist of contigs,
each of which has mapping information in an external file.  Genomes
from NCBI typically include an ".md" file (like seq_contig.md) that
specifies the chromosomal coordinates for each contig.  To use this
information, provide the -M flag to gmap_setup, like this

    gmap_setup -d <genome> -M <mdfile> <fasta_file>...

The program will then try to parse the mdfile (which often changes
formats) and verify with you which columns contain the contig names
and chromosomal coordinates.


4c.  Compressed FASTA files or files requiring processing
=========================================================

If your genome files don't satisfy any of the cases above, you may
need to write a small script that pipes the sequences in FASTA format
to gmap_setup.  You can tell gmap_setup about your script with the -E
flag, like this

    gmap_setup -d <genome> -E 'gunzip -c chr*.gz'
    gmap_setup -d <genome> -E 'cat *.fa | ./add-chromosomal-info.pl'

You can think of the command as a Unix pipe for processing each FASTA
file before it is read by gmap_setup.


4d.  General use of gmap_setup program
======================================

Any of the steps above (4a, 4b, or 4c) will create a Makefile, called
Makefile.<genome>.  You will then use this Makefile to build a
GMAP/GSNAP database.  You will be prompted to use this Makefile
through the following commands:

    make -f Makefile.<genome> coords
    make -f Makefile.<genome> gmapdb
    make -f Makefile.<genome> install

Note that older versions of GMAP allowed the building of genomic
databases containing lower-case characters by doing "make -f
Makefile.<genome> gmapdb_lc" or "make -f Makefile.<genome>
gmapdb_lc_masked", but these will not work with GSNAP, and I am not
certain if these still work with the most recent GMAP either, so they
are not currently supported.


The first step in using this Makefile is to create a file called
coords.<genome>.  You may manually edit this file, if you wish, before
proceeding with the rest of the Makefile steps.  The coords file
contains one contig per line, in the following format:

<contig_name>	<chromosomal_mapping>	<optional_strain>

where the chromosomal_mapping is in the form
<chr_name>:<start>..<end>.  Here are some examples:

NT_077911.1	1:217281..257582
NT_091704.1	22U:1..166566

If you want the contig to be inserted as its reverse complement, then
list the coordinates in the reverse direction (starting with the
higher number), like this:

NT_039199.1	1:61563373..61273712

You may delete lines or comment them out with a '#' character, which
will effectively omit those contigs from your genome build.  You may
also change chromosomal assignments (in column 2) at this stage.

Note: Previous versions of GMAP allowed you to specify alternate
strains in column 3, but this feature added too much complexity and is
no longer supported.


You then will run "make -f Makefile.<genome> gmapdb".  This creates a
compressed version of the genome, in the file <genome>.genomecomp,
which can hold only the standard, upper-case A, C, G, T, N, and X
characters.  It converts all lower-case characters to upper-case, and
all non-ACGTNX characters to 'N'.  This command also creates a hash
table of the genome, with files that end with "gammaptrs",
"offsetscomp", and "positions".

Finally, running "make -f Makefile.<genome> install" will place all
database files in a subdirectory specified by your "-d" flag under the
directory specified either by the "-D" flag or, if not specified, the
value of --with-gmapdb you provided at configure time.



Running GMAP
============

To see the full set of options, type "gmap --help".  The following are
some common examples of usage.  For more examples, see the document
available at http://www.gene.com/share/gmap/paper/demo-slides.pdf

For each of the examples below, we assume that you have installed a
genome database called <genome> in your GMAPDB directory.  (If your
database is located elsewhere, you can specify the -D flag to gmap or
set the environment variable GMAPDB to point to that directory.)

* Mapping only: To map one or more cDNAs in a FASTA file onto a
  genome, run GMAP as follows:

    gmap -d <genome> <cdna_file>

* Mapping and alignment: If you want to map the cDNAs to a genome,
  and show the full alignment, provide the -A flag:

    gmap -d <genome> -A <cdna_file>


* Alignment only: To align one or more cDNAs in a FASTA file onto a
  given genomic segment (also in a FASTA file), use the -g flag
  instead of the -d flag:

    gmap -g <genome_segment> -A <cdna_file>


* Batch mode: If you have a large number of cDNAs to run, and you have
  sufficient RAM to run in batch mode, add the "-B 3", "-B 4", or "-B
  5" option.  Details for these options are provided by running "gmap
  --help".

    gmap -d <genome> -B 5 -A <cdna_file>


* Multithreaded mode: If your machine has several processors, you can
  make batch mode run even faster by specifying multiple threads with
  the -t flag:

    gmap -d <genome> -B 5 -A -t <nthreads> <cdna_file>

  Note that with multiple threads, the output results will appear in
  random order, depending on which thread finishes its computation
  first.  If you wish your output to be in the same order as the 
  input cDNA file, add the '-O' (letter O, not the number 0) flag
  to get ordered output.

  Guidelines: The -t flag specifies the number of computational
  threads.  In addition, if your machine supports threads, GMAP also
  uses one thread for reading the input query sequences, and one
  thread for printing the output results.  Therefore, the total number
  of threads will be 2 plus the number you specify.  The program will
  work optimally if it uses one thread per available processor.  If
  you specify too many threads, you can cause your computer to thrash
  and slow down.  Note that other programs running on your computer
  also need processors.


* Compressed output: If you want to store the alignment results in a
  compressed format, use the -Z flag.  You can uncompress the results
  by using the gmap_uncompress.pl program:

    gmap -d <genome> -Z <cdna_file> > x
    cat x | gmap_uncompress


Building map files
==================

This package includes an implementation of interval index trees
(IITs), which permits efficient lookup of interval information.  The
gmap program also allows you (with its -m flag) to look up pre-mapped
annotation information that overlaps your query cDNA sequence.  These
interval index trees (or map files) are built using the iit_store
program included in this package.  To build a map file, do the
following:

Step 1: Put your map information for a given genome into a map file
with the following FASTA-like format:
   
    >label coords optional_tag
    optional_annotation (which may be zero, one, or multiple lines)

For example, the label may be an EST accession, with the coords
representing its genomic position.  Labels may be duplicated if
necessary.

The coords should be of the form

    chr:position
    chr:startposition..endposition

The term "chr:position" is equivalent to "chr:position..position".  If
you want to indicate that the interval is on the minus strand or
reverse direction, then <endposition> may be less than
<startposition>.

Tags are very general and can be used for a variety of purposes.  For
example, you could 


Step 2: Run iit_store on this map file as follows

    cat <mapfile> | iit_store -o <mapname>

The program will create a file called <mapname>.iit.

Now you can retrieve this information with iit_get

    iit_get <mapname>.iit <coords>

where <coords> has the format "chr:position" or
"chr:startposition..endposition".  The iit_get program has other
capabilities, including the ability to retrieve information by label,
like this:

    iit_get <mapname>.iit <label>

More details can be found by running "iit_get --help".

If you plan to use this map information frequently, you may want to
place it with its corresponding genome for future use.  In each
GMAP/GSNAP database, there is a subdirectory for storing map files,
like this

    /path/to/gmapdb/<genome>/<genome>.maps/

(If you don't remember where your default gmap directory is, run "gmap
--version" to find it.)  If you put your <mapname>.iit file into this
maps subdirectory, you can get additional functionality.  First, you
can run the program get-genome, which is used mainly for getting
genome sequence, to get map information instead, like this

    get-genome -d <genome> -m <mapname> <coords>

Second, you can use GMAP with the -m flag to retrieve map information
that corresponds to a given cDNA sequence like this

    gmap -d <genome> -m <mapname> <cdna_file>

Finally, GMAP and the IIT utilities support the GFF3 format.  GMAP can
generate its results in GFF3 format, and iit_store can parse GFF3
files using its -G and -l flags.  More details about iit_store can be
found by doing "iit_store --help".


Running GSNAP
=============

GSNAP uses the same database as GMAP does, so you will need to process
the genome using gmap_setup as explained above, if you haven't done
that already.

To see the full set of options for GSNAP, type "gsnap --help".  A key
parameter you will need to set is the "-m" flag, which is the maximal
score you will allow per read (or each end of a paired-end read).  The
score equals the number of mismatches, plus penalties for indels and
local or distant splicing, if any.  If you do not set a value for
"-m", then GSNAP will pick a value, depending on the length of each
read, that will allow it to run fairly quickly.

For DNA-Seq, the automatic setting should be fine, unless you need to
accommodate penalty values for indels or splicing, or your reads are
of poor quality.

For RNA-Seq, in previous versions, we recommended a moderately high
value of -m, such as 10 or so, to handle alignments that cross an
intron-exon boundary.  But now that GSNAP can find terminal alignments
and has GMAP integrated in its algorithm, it is better to select a
small value for -m, such as the default value or something small like
4 or 5 for a 75-bp read.

Input to GSNAP should be either in FASTQ or FASTA format.  The FASTQ
input may include quality scores, which will then be included in SAM
output, if that output format is selected.  For single-end reads, the
FASTQ file may be piped into GSNAP, or given as its command-line
argument, like this

    cat <fastq_file> | gsnap -d <genome>

or

    gsnap -d <genome> <fastq_file>


For paired-end reads, the two corresponding FASTQ files should be
given as command-line arguments, like this

    gsnap -d <genome> <fastq_file_1> <fastq_file_2>

A pipe cannot work since GSNAP needs to access both FASTQ files in
parallel.  The reads in FASTQ files may have varying lengths, if
desired.

GSNAP also has the ability to deal with files compressed with gzip, if
the configure script at compile time can find a zlib library in your
system (see Note 3 in the section above about building and installing
GMAP and GSNAP).  If so, and your files are gzipped, you can then read
in gzipped files directly like this

    gsnap --gunzip -d <genome> <fastq.gz>

for single-end reads, or

    gsnap --gunzip -d <genome> <fastq_1.gz> <fastq_2.gz>

for paired-end reads.


For FASTA format, you should include one line per read (or end of a
paired-end read).  The same FASTA file can have a mixture of
single-end and paired-end reads of varying lengths, if desired.

Single-end reads:

Each FASTA entry should contain one short read per line, like this

>Header information
AAAACATTCTCCTCCGCATAAGCCTGCGTCAGATTA

Each short read can have a different length.  However, the entire read
needs to be on a single line, and may not wrap around multiple lines.
If it extends to a second line, GSNAP will think that the read is
paired-end.


Paired-end reads:

Each FASTA entry should contain two short reads, one per line, like
this

>Header information
AAAACATTCTCCTCCGCATAAGCCTAGTAGATTA
GGCGTAGGTAGAAGTAGAGGTTAAGGCGCGTCAG

By default, the program assumes that the second end is in the reverse
complement direction compared with the first end.  If they are in the
same direction, you may need to use the --circular-input (or -c) flag.

GSNAP and GMAP can also read an extended FASTA format that include
quality scores, which look like this

    >Header information
    AAAACATTCTCCTCCGCATAAGCCTGCGTCAGATTA
    +
    <quality scores>

for single-end reads, or

    <Header information
    AAAACATTCTCCTCCGCATAAGCCTGCGTCAGATTA
    +
    <quality scores>

for the second-end of a paired-end read.  In addition, GSNAP can read
an extended FASTA format for paired-end reads, like this:

    >Header information
    AAAACATTCTCCTCCGCATAAGCCTAGTAGATTA
    GGCGTAGGTAGAAGTAGAGGTTAAGGCGCGTCAG
    +
    <quality scores 1>
    <quality scores 2>

This extended FASTA format is useful if paired-end information needs
to be piped into GSNAP via stdin.


SAM output format
=======================

GSNAP can generate SAM output format, by providing the "-A sam" flag
to GSNAP.  In addition, GMAP can also print its alignments in SAM
output, using the "-f samse" or "-f sampe" options, for single-end or
paired-end data.  The sampe option will generate SAM flags to indicate
whether the read is the first or second end of a pair, which requires
that you provide GMAP with an extended FASTA format having a ">" or
"<" character in the header to indicate that information.  However,
the sampe option will change only the SAM flags, and not change the
underlying alignment algorithm.  GMAP does not know how to find
concordance between paired-end reads like GSNAP does.

GSNAP provides some special SAM flags as follows:

XQ: A non-normalized mapping quality score

X2: The second best XQ score among all multimapping alignments for a
given read.  If there is only a single alignment, this value is 0.

XB: Prints the barcode extracted from the end of the read.  Applies only
if --barcode-length is not 0.

XP: Prints the primer inferred from a paired-end read.  Applies only
if the --adapter-strip flag is specified.

XS: Prints the strand orientation (+ or -) for a splice.  Appears only
if splicing is allowed (-N or -s flag provided), and only for reads
containing a splice.  The value "+" means the expected GT-AG, GC-AG,
or AT-AC dinucleotide pair is on the plus strand of the genome, and
"-" means the dinucleotides are on the minus strand.  If the
orientation is not obvious, because the dinucleotides do not match
GT-AG, GC-AG, AT-AC, or their complements, then the program applies a
probabilistic splice model to determine the orientation.

XT: Prints the intron dinucleotides and splice probabilities around a
distant splicing event (genomic deletion, inversion, scramble, or
translocation).


GSNAP output format
===================

By default, GSNAP prints its output in a FASTA-like format, which we
developed before we incorporated the SAM format.  The default GSNAP
output has some advantages over SAM output, especially for debugging
purposes.

the orientation of splice junctions, in indicating translocations that
occur in a single read, and in reflecting information about known
splice sites or known SNPs.  

Here is some output from GSNAP on a paired-end read:

>GGACTGCGCACCGAACGGCAGCGACTTCCCGTAGTAGCGGTGCTCCGCGAAGACCAGTAGAGCCCCCCGCTCGGCC   1 concordant    ILLUMINA-A1CCE9_0004:1:1:1510:2090#0
 GGACTGCGCACCGAACGGCAGCGACTTCCCGTAGTAGCGct-----------------------------------   1..39   +9:139128263..139128301 start:0..acceptor:0.99,dir:antisense,splice_dist:214,sub:0+0=0,label:NM_013379.DPP7.exon4/13  segs:2,align_score:2  pair_score:5,pair_length:112
,-------------------------------------acGTGCTCCGCGAAGACCAGTAGAGCCCCCCGCTCGGCC   40..76  +9:139128516..139128552 donor:0.96..end:0,dir:antisense,splice_dist:214,sub:0+0=0,label:NM_013379.DPP7.exon3/13
 
<CTTCGCCAACAACTCGGGCTTCGTCGCGGAGCTGGCGGCCGAGCGGGGGGCTCTACTGGTCTTCGCGGAGCACCGC   1 concordant    ILLUMINA-A1CCE9_0004:1:1:1510:2090#0
 CTTCGCCAACAACTCGGcCTTCGTCGCGGAGCTGGCGGCCGAGCGGGGGGCTCTACTGGTCTTCGCGGAGCACgtg   1..73   -9:139128588..139128516 start:0..end:3,sub:3+1=4      segs:1,align_score:3  pair_score:5,pair_length:112
 
Each end of a read gets its own block, with the first read starting
with ">" and the second read for paired-end reads starting with "<".
The block starts with a header line that has in column 1, the query
sequence in its original direction (and with lower-case preserved if
any); in column 2, the number of hits for that query and if the read
is paired-end, the relationship between the ends (as discussed in the
next paragraph); and in column 3, the accession number for the query.

The two ends of a paired-end read can have the following
relationships: "concordant", "paired", or "unpaired".  A paired-end
read is concordant if the ends align to the same chromosome, in the
expected relative orientations, and having an inferred insert length
greater than zero and within the "--pairmax" parameter.  The inferred
insert length is the distance from the end of the first-end alignment
to the start of the second-end alignment, plus the read lengths of the
two ends.  There may be more than one concordant alignment for a given
read, and if so, the alignments for each end are reported in
corresponding order.

If a concordant relationship cannot be found, then the program will
report any paired relationships it can find.  A paired alignment
occurs when the two ends align to the same chromosome, but fail some
criterion for concordance.  There are different subtypes of paired
alignments, depending on which criterion is violated.  If the
orientations are opposite what is expected, the paired subtype is
"inversion".  If they are in the expected orientation, but the
distance is greater than the "--pairmax" parameter, then the paired
subtype is "toolong".  If they are in the expected orientation, but
the inferred insert length appears to be negative, then the paired
subtype is "scramble".  In GSNAP output, a paired subtype is shown in
a label called "pairtype", which can have the values
"pairtype:inversion", "pairtype:toolong", and "pairtype:scramble".

Otherwise, if neither a concordant nor paired alignment can be found,
then the program will align each end separately, and report the
relationship as being "unpaired".

GSNAP can find translocation splices within a single end of a read,
but it tries to be conservative about reporting them.  If there is any
alignment that does not involve such a translocation, then it will not
report the translocation.  It therefore reports translocation splices
only when no other alignment is found within the concordant, paired,
or unpaired categories.  Therefore, such results are listed in the
header as having "(transloc)" appear after the "concordant", "paired",
or "unpaired" result type.

After the query line, each of the genomic hits is shown, up to the
'-n' parameter.  If too many hits were found (more than the '-n'
parameter), the behavior depends on whether the "--quiet-if-excessive"
flag is given to GSNAP.  If not, then the first n hits will be printed
and the rest will not be printed.  If the "--quiet-if-excessive" flag
is given to GSNAP, then no hits will be printed if the number exceeds
n.

Each of the genomic hits contains one or more alignment segments,
which is a region of continuous one-to-one correspondence (matches or
mismatches) between the query and the genome.  Multiple segments occur
when the alignment contains an insertion, deletion, or splice.  The
first segment is marked by a space (" ") at the beginning of the line,
while the second and following segments are marked by a comma (",") at
the beginning of the line.  (In the current implementation of GSNAP
that allows only a single indel or splice, the number of segments is
at most two.)

The segments contain information in tab-delimited columns as follows:

Column 1: Genomic sequence with matches in capital letters, mismatches
in lower-case letters, and regions outside the segment with dashes.
For deletions in the query, the deleted genomic sequence is also
included in lower case.  For spliced reads, the two dinucleotides at
the intron ends are included in lower case.

Column 2: Range of query sequence aligned in the segment.  Coordinates
are inclusive, with the first nucleotide considered to be position 1.

Column 3: Range of genomic segment aligned, again with inclusive
coordinates, with the first nucleotide in each chromosome considered
to be position 1.  Plus and minus strands are marked with a "+" or "-"
sign.

Column 4: Segment information, delimited by commas.  The first item
reports on the ends of the segment, which can be of type "start",
"end", "ins", "del", "donor", "acceptor", or "term".  After "start"
and "end", we report the number of nucleotides clipped or trimmed from
the segment.  In our example above, "end:3" means that 3 nucleotides
should be trimmed from the end.  Trimming finds a local maximum of
matches to mismatches from the end and is computed only if the "-T"
flag is specified, and the value for "-T" limits the amount of
trimming allowed.  After "ins" and "del", we report the number of
nucleotides that were inserted or deleted in the query relative to the
genome.  After "donor" or "acceptor", we report the probability of the
splice site, based on the MaxEnt model.  The "term" label indicate a
terminal segment, where the entire read could not be aligned, but more
than half of the read could be aligned from either end.

Each segment will also show after the "sub" tag, he number of
mismatches in that segment including the part that is trimmed, if any.
If SNP-tolerant alignment is chosen, then the number of SNPs is also
shown (see details below under SNP-tolerant alignment).  Other
information may also be included with the segment information, such as
the orientation and distance of the splice or known splice labels, if
appropriate flags and information are given to GSNAP.  Splices are
marked with a splice_type, which can be "consistent", "inversion",
"scramble", or "translocation".  A "translocation" splice includes
splices on the same chromosome where the splice distance exceeds the
parameter for localsplicedist.

Column 5: Alignment or hit information, delimited by commas.  For the
first segment in a hit (the one starting with a space), this column
provides the number of segments (denoted by "segs:") and the score of
the alignment (denoted by "align_score:").

Column 6: Pair information (for paired-end reads only).  For the first
segment in a hit (with the same information repeated on both ends of a
concordant pair), this column provides the score of the pair (which is
the sum of the alignment scores) and the inferred length of the insert
(ignoring splices within each segment, but not between segments).  


Detecting known and novel splice sites in GSNAP
===============================================

GSNAP can detect splice junctions in individual reads.  You can detect
splices using a probabilistic model using the --novelsplicing (or -N)
flag.  You can also detect splices from a set of splice sites that you
provide, using the --splicesites (or -s) flag.  You may specify both
flags, which will report splice junctions involving both known and
novel sites.

Output for a splicing junction will look like this:

>TCCGTGACGTGGATTGGTGCTGCACCCCTCATC      1       Header
 TCCGTGACGTGGATTGgt---------------      1..16   +19:56050054..56050069  start:0..donor:0.99,splice_dist:1238,dir:sense,sub:0+0=0,label:NM_001648.KLK3.exon1/5|NM_001030047.KLK3.exon1/5|NM_001030048.KLK3.exon1/5|NM_001030049.KLK3.exon1/6|NM_001030050.KLK3.exon1/2       
,--------------agGTGCTGCACCCCTCATC      17..33  +19:56051308..56051324  acceptor:0.99..end:0,dir:sense,sub:0+0=0,label:NM_001648.KLK3.exon2/5|NM_001030047.KLK3.exon2/5|NM_001030048.KLK3.exon2/5|NM_001030049.KLK3.exon2/6|NM_001030050.KLK3.exon2/2
 
After the "donor:" or "acceptor:" splice site type, the model score
probability is given, even if the splice site is known.  For known
splice sites, the "label:" field will provide information about the
site.  If there is more than one known splice site at a genomic
position, the labels are separated by a "|" delimiter.

There are several advantages to specifying a database of known splice
sites.  First, GSNAP will then be able to detect splicing involving
atypical splice sites, that would otherwise give low scores using its
probabilistic model.  A known splice site is treated as if its model
probability is 1.0.  Second, GSNAP can find splicing involving short
exons.  Such cases have a single end aligning to three exons,
separated by two introns.  Third, GSNAP can identify splicing at the
ends of reads with greater sensitivity, even if they have short
overlaps onto the next exon.  Fourth, GSNAP can detect known long
splices, because expected splice lengths can be included with the
splice site information.

GSNAP allows for known splicing at two levels: at the level of known
splice sites and at the level of known introns.  At the site level,
GSNAP finds splicing between arbitrary combinations of donor and
acceptor splice sites, meaning that it can find alternative splicing
events.  At the intron level, GSNAP finds splicing only between the
set of given donor-acceptor pairs, so it is constrained not to find
alternative splicing events, only introns included in the given list.
For most purposes, I would recommend using known splice sites, rather
than known introns, unless you are certain that all alternative
splicing events are known are represented in your file.

GSNAP can tell the difference between known site-level and known
intron-level splicing based on the format of the input file.  To
perform known site-level splicing, you will need to create a file with
the following format:

>NM_004448.ERBB2.exon1 17:35110090..35110091 donor 6678
>NM_004448.ERBB2.exon2 17:35116768..35116769 acceptor 6678
>NM_004448.ERBB2.exon2 17:35116920..35116921 donor 1179
>NM_004448.ERBB2.exon3 17:35118099..35118100 acceptor 1179
>NM_004449.ERG.exon1 21:38955452..38955451 donor 783
>NM_004449.ERG.exon2 21:38878740..38878739 acceptor 783
>NM_004449.ERG.exon2 21:38878638..38878637 donor 360
>NM_004449.ERG.exon3 21:38869542..38869541 acceptor 360

Each line must start with a ">" character, then be followed by an
identifier, which may have duplicates and can have any format, with
the gene name or exon number shown here only as a suggestion.  Then
there should be the chromosomal coordinates which straddle the
exon-intron boundary, so one coordinate is on the exon and one is on
the intron.  (Coordinates are all 1-based, so the first character of a
chromosome is number 1.)  Finally, there should be the splice type:
"donor" or "acceptor".  You may optionally store the intron distance
at the end.  GSNAP can use this intron distance, if it is longer than
its value for --localsplicedist, to look for long introns at that
splice site.  The same splice site may have different intron distances
in the database; GSNAP will use the longest intron distance reported
in searching for long introns.
                                                                                                
Note that the chromosomal coordinates are in the sense direction.
Therefore, genes on the plus strand of the genome (like NM_004448) have
the coordinates in ascending order (e.g., 35110090..35110091).
Genes on the minus strand of the genome (like NM_004449) have the
coordinates in descending order (e.g., 38955452..38955451).
                                                                                                
On the other hand, to perform known intron-level splicing, you will need
to create a file with the following format:

>NM_004448.ERBB2.intron1 17:35110090..35116769
>NM_004448.ERBB2.intron2 17:35116920..35118100
>NM_004449.ERG.intron1 21:38955452..38878739
>NM_004449.ERG.intron2 21:38878638..38869541

Again, coordinates are 1-based, and specify the exon coordinates
surrounding the intron, with the first coordinate being from the donor
exon and the second one being from the acceptor exon.


There are several ways to help you generate these files.  First, if
you have a GTF file, you can use the included programs gtf_splicesites
and gtf_introns like this:

    cat <gtf file> | gtf_splicesites > foo.splicesites
    cat <gtf file> | gtf_introns > foo.introns

Second, if you retrieve an alignment tracks from UCSC, like this:

    ftp://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/refGene.txt.gz

if you are aligning to genome hg18, or

    ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/refGene.txt.gz

if you are aligning to genome hg19, you can process this track using
the included program psl_splicesites or psl_introns, like this:

    gunzip -c refGene.txt.gz | psl_splicesites -s 1 > foo.splicesites
    gunzip -c refGene.txt.gz | psl_introns -s 1 > foo.introns

Note that alignment tracks in UCSC sometimes have an extra column on
the left.  The "-s" flag allows you to indicate how many columns
should be skipped.

Once you have built this splicesites or introns file, you process it
as a map file (see "Building map files" above), by doing

    cat foo.splicesites | iit_store -o <splicesitesfile>, or
    cat foo.introns | iit_store -o <intronsfile>

If you want to include more than one track, you can do this:

    gunzip -c refGene.txt.gz | psl_splicesites -s 1 > foo
    gunzip -c knownGene.txt.gz | psl_splicesites > bar
    cat foo bar | iit_store -o <splicesitesfile>


A third way to build a known splicesites or known introns file is
useful if you have cDNA sequences rather than an alignment track, or
if you do not trust the alignment track and prefer to use cDNA
sequences.  GMAP has an option "-f splicesites" that finds splice
sites in cDNA sequences and reports them in the correct splicesite
format.  Likewise, GMAP can build an intron file, with the option "-f
introns".

When processing known cDNA sequences, you should also run GMAP with
the "-n 1" flag, so you get the best alignment, and with the "-z
sense_force" or "-z sense_filter" flag.  The sense_force option will
help GMAP know that the introns in your cDNA sequences are in the
correct GT-AG sense, and is applicable when you have a high quality
set of cDNA sequences.  The sense_filter option will allow GMAP to try
either sense or antisense, and to filter out sequences that appear to
be antisense; this is applicable if you are uncertain about the
validity of your cDNA sequences.

Again once you have built either a known splicesites or known introns
file, you process it as a map file by doing

    cat <file> | iit_store -o <splicesitesfile>, or
    cat <file> | iit_store -o <intronsfile>
                                                                                                
which creates <splicesitesfile>.iit or <intronsfile>.iit.


Regardless of how you built <splicesitesfile>.iit or <intronsfile>.iit,
you put it in the maps subdirectory by doing
                                                                                                
    cp <splicesitesfile>.iit /path/to/gmapdb/<genome>/<genome>.maps, or
    cp <intronsfile>.iit /path/to/gmapdb/<genome>/<genome>.maps
                                                                                                
Then, you may use the file by doing this:
                                                                                                
    gsnap -d <genome> -s <splicesitesfile> <shortreads>, or
    gsnap -d <genome> -s <intronsfile> <shortreads>, or
                                                                                                

SNP-tolerant alignment in GSNAP
===============================

GSNAP has the ability to align to a reference space of all possible
major and minor alleles in a set of known SNPs provided by the user.
This ability is provided by the -v flag, and produces output like
this:

>ATGGTAATCCTGCTCAGTACGAGAGGAACCGCAGGA   2       Header
 ATGGTAATCCTGCTCAGTACGAGAGGAACCGCAGGt   1..36   -12:34263937..34263902  start:0..end:0,sub:1+0=1       
 ATGGTAATCCTGCTCAGTAGGAGAGGAACCCCAGGt   1..36   -21:14379300..14379265  start:0..end:0,sub:1+2=3

The notation "sub:1+0=1" indicates that the SNP-tolerant alignment has
one mismatch ("sub:1") and zero minor SNP alleles ("+0"), for a total
of one differences ("=1") relative to the reference genome with all
major alleles.  Likewise, the notation "sub:1+2=3" indicates one
SNP-tolerant mismatch, two minor SNP alleles, and 3 mismatches
relative to the reference sequence with all major alleles.

Note that by default, GSNAP shows only differences relative to both
the reference and alternate genomes in lower case.  If you want to
show differences relative to the reference genome in lower case, you
will need to provide the flag --show-refdiff, which would give the
output above as:

>ATGGTAATCCTGCTCAGTACGAGAGGAACCGCAGGA   2       Header
 ATGGTAATCCTGCTCAGTACGAGAGGAACCGCAGGt   1..36   -12:34263937..34263902  start:0..end:0,sub:1+0=1       
 ATGGTAATCCTGCTCAGTAgGAGAGGAACCcCAGGt   1..36   -21:14379300..14379265  start:0..end:0,sub:1+2=3

For SAM output format, all differences from the reference sequence are
shown in the NM and MD fields, although the alignment scoring and
mapping qualities are based on a SNP-tolerant alignment.

To specify a set of known SNPs, you will need to create a file with the
following format:

>rs62211261 21:14379270 CG
>rs62211262 21:14379281 CG

Each line must start with a ">" character, then be followed by an
identifier (which may have duplicates).  Then there should be the
chromosomal coordinate of the SNP.  (Coordinates are all 1-based, so
the first character of a chromosome is number 1.)  Finally, there
should be the two possible alleles.  (Previous versions required that
these be in alphabetical order: "AC", "AG", "AT", "CG", "CT", or "GT",
but that is no longer a requirement.)  These alleles must correspond
to the possible nucleotides on the plus strand of the genome.  If the
one of these two letters does not match the allele in the reference
sequence, that SNP will be ignored in subsequent processing as a
probable error.

GSNAP also supports the idea of a wildcard SNP.  A wildcard SNP allows
all nucleotides to match at that position, not just a given reference
and alternate allele.  It is essentially as if an "N" were recorded at
that genomic location, although the index files still keep track of
the reference allele.  To indicate that a position has a wildcard SNP,
you can indicate the genotype as "WN", where "W" is the reference
allele.  Another indication of a wildcard SNP is to provide two
separate lines at that position with the genotypes "WX" and "WY",
where "W" is the reference allele and "X" and "Y" are two different
alternate alleles.

To help you generate the file, this package includes a program called
dbsnp_iit that can process the dbsnp files from UCSC.  First, you will
need to get a dbsnp file, like this

    ftp://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/snp130.txt.gz

For versions before snp132, you may also want to exclude exceptions,
which will require this file:

    ftp://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/snp130Exceptions.txt.gz

You can then process these files using dbsnp_iit, like this:

    gunzip -c snp130.txt.gz | dbsnp_iit [-w <weight>] [-e snp130Exceptions.txt.gz] > <snpfile>.txt, or
    gunzip -c snp130.txt.gz | dbsnp_iit [-w <weight>] [-e snp130Exceptions.txt] > <snpfile>.txt

For versions starting with snp132, the exceptions are included in
column 19 of the snp file, so the -e flag is not necessary.  The
dbsnp_iit program will read the exceptions from that column.

The option "-w <weight>" makes use of the dbSNP item weight, a value
from 1 to 3, where lower weight means higher confidence.  Items will
be included if the item weight is the given value <weight> or less.
The default value of -w is 1, which is the criterion UCSC uses to
build its ambiguous version of the genome.  To allow all item weights,
specify "-w 3".

SNPs have various types of exceptions, as provided either by the "-e
<exceptions_file>" flag, or starting with snp132, in the snp file
itself.  By default, dbsnp_iit will exclude all types of exceptions.
However, if you want to include some types of exceptions, you will
need to modify the dbsnp_iit program (written in Perl) to indicate a
"1" for the type of exception you want to include.

Once you have built a <snpfile>.txt file using dbsnp_iit, you create
an IIT file by doing this
                                                                                                
    cat <snpfile>.txt | iit_store -o <snpfile>
                                                                                                
which creates <snpfile>.iit.  If you wish to store this file for other
people to use, you can then put it in a central place:

    cp <snpfile>.iit /path/to/gmapdb/<genome>/<genome>.maps

Or you can keep it in your own directory for your own use.  Then you
need to create a reference space index and compressed genome by doing

    snpindex -d <genome> [-V <snpdirectory>] -v <snpfile> <snpfile>.iit

If you specify the [-V <snpdirectory>] option, then the resulting SNP
index files are placed in the given directory.  Otherwise, if you
don't specify this, then the SNP files are saved with the reference
index files.

If your <snpfile>.iit file is already in a <genome>.maps subdirectory,
then you can simply run

    snpindex -d <genome> [-V <snpdirectory>] -v <snpfile>

and the program will find the indicated <snpfile>.iit file.
Additional options for snpindex can be found by doing "snpindex
--help".


Finally, you can perform SNP-tolerant alignment by doing

    gsnap -d <genome> [-V <snpdirectory>] -v <snpfile> <shortreads>

You can also retrieve snp information for genomic segments by running
get-genome with the -v and -f flags.  For more details, run
"get-genome --help".


Alignment of reads from bisulfite-treated DNA in GSNAP
======================================================

GSNAP has the ability to align reads from bisulfite-treated DNA, which
converts unmethylated cytosines to uracils that appear as thymines in
reads.  GSNAP is able to identify genomic-T to read-C mismatches, and
produces output like this:

>CTACGTcGTAGACATATTGATTCAGAATTTGAAGTTTAGCTAGATCTTAc     1       Read
 C.ACG.tGTAGACATA.TGATTCAGAAT.TGAAGTTTAGCTAGA.C.TAg     1..50   sub:2   +9:1000000..1000049
 
As with all GSNAP output, the first line is the query, and the ones
afterward represent genomic segments.  The "." symbol indicates an
unmethylated C in the genome that appears as a T in the read.
Mismatches are shown by lower case characters in the genomic segment.
In position 6, we see an example of a genomic-T that appears as a
read-C, representing a mismatch.  The last position also shows a
mismatch between a genomic-G and read-C.

To process reads from bisulfite-treated DNA, you will first need to
create the necessary index files by doing

    cmetindex -d <genome>

Then, you can align the reads by doing

    gsnap -d <genome> --mode=cmet

on your set of short reads.


RNA-editing tolerance in GSNAP
==============================

Just as GSNAP has a program cmetindex and a mode called "cmet" for
tolerance to C-to-T changes, it can be tolerant to A-to-G changes
using the program atoiindex and a mode called "atoi".  This mode is
designed to facilitate alignments that are tolerant to RNA editing
where A's are converted to I's, which appear as G's to a sequencer.

To process reads under RNA-editing tolerance, you will first need to
create the necessary index files by doing

    atoiindex -d <genome>

Then, you can align the reads by doing

    gsnap -d <genome> --mode=atoi

on your set of short reads.