Skip to content

arq5x/bamtools

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 

Repository files navigation

--------------------------------------------------------------------------------
README : BAMTOOLS
--------------------------------------------------------------------------------

BamTools: a C++ API & toolkit for reading/writing/manipulating BAM files.

I.   Introduction
     a. The API
     b. The Toolkit

II.  Installation

III. Usage
     a. The API
     b. The Toolkit

IV.  License

V.   Acknowledgements

VI.  Contact

--------------------------------------------------------------------------------
I. Introduction:
--------------------------------------------------------------------------------

BamTools provides both a programmer's API and an end-user's toolkit for handling
BAM files.

----------------------------------------
Ia. The API:
----------------------------------------

The API consists of 2 main modules: BamReader and BamWriter. As you would
expect, BamReader provides read-access to BAM files, while BamWriter handles
writing data to BAM files. BamReader provides the interface for random-access
(jumping) in a BAM file, as well as generating BAM index files.

BamMultiReader is an extra module that allows you to manage multiple open BAM
files for reading. It provides some validation & bookkeeping under the hood to
keep all files sync'ed up for you.

Additional files used by the API:

 - BamAlignment.* : implements the BamAlignment data structure

 - BamAux.h       : contains various constants, data structures and utility
                    methods used throught the API.

 - BamIndex.*     : implements both the standard BAM format index (".bai") as
                    well as a new BamTools-specific index (".bti").

 - BGZF.*         : contains our implementation of the Broad Institute's BGZF
                    compression format.

----------------------------------------
Ib. The Toolkit:
----------------------------------------

If you've been using the BamTools since the early days, you'll notice that our
'toy' API examples (BamConversion, BamDump, BamTrim,...) are now gone.  We have
dumped these in favor of a suite of small utilities that we hope both
developers and end-users find useful:

usage: bamtools [--help] COMMAND [ARGS]

Available bamtools commands:

        convert         Converts between BAM and a number of other formats
        count           Prints number of alignments in BAM file(s)
        coverage        Prints coverage statistics from the input BAM file
        filter          Filters BAM file(s) by user-specified criteria
        header          Prints BAM header information
        index           Generates index for BAM file
        merge           Merge multiple BAM files into single file
        random		Select random alignments from existing BAM file(s)
        sort            Sorts the BAM file according to some criteria
        split		Splits a BAM file on user-specifed property, creating a
                            new BAM output file for each value found
        stats           Prints some basic statistics from input BAM file(s)

See 'bamtools help COMMAND' for more information on a specific command.

--------------------------------------------------------------------------------
II. Installation :
--------------------------------------------------------------------------------

----------------------------------------
IIa. Get CMake
----------------------------------------

BamTools has been migrated to a CMake-based build system.  We believe that this 
should simplify the build process across all platforms, especially as the 
BamTools API moves into a shared library (that you link to instead of compiling 
lots of source files directly into your application). CMake is available on all 
major platforms, and indeed comes *out-of-the-box* with many Linux distributions.

To see if you have CMake (and which version), try this command:

  $ cmake --version

BamTools requires CMake version >= 2.6.4. If you are missing CMake or have an 
older version, check your OS package manager (for Linux) or download it here:
http://www.cmake.org/cmake/resources/software.html .

----------------------------------------
IIb. Build BamTools 
----------------------------------------

Ok, now that you have CMake ready to go, let's build BamTools.  A good
practice in building applications is to do an out-of-source build, meaning
that we're going to set up an isolated place to hold all the intermediate 
installation steps.

In the top-level directory of BamTools, type the following commands:

  $ mkdir build
  $ cd build
  $ cmake ..

Windows users:
This creates a Visual Studio solution file, which can then be built to create 
the toolkit executable and API DLL's.

Everybody else:
After running cmake, just run:

  $ make

Then go back up to the BamTools root directory.

  $ cd ..

----------------------------------------
IIIb. Check It 
----------------------------------------

Assuming the build process finished correctly, you should be able to find the 
toolkit executable here:

  ./bin/

The BamTools-associated libraries will be found here:

  ./lib/

--------------------------------------------------------------------------------
III. Usage :
--------------------------------------------------------------------------------

** General usage information - perhaps explain common terms, point to SAM/BAM
spec, etc **

----------------------------------------
IIIa. The API
----------------------------------------

The API, as noted above, contains 2 main modules - BamReader & BamWriter - for
dealing with BAM files. Alignment data is made available through the
BamAlignment data structure.

A simple (read-only) scenario for accessing BAM data would look like the
following:

        // open our BamReader
        BamReader reader;
        reader.Open("someData.bam", "someData.bam.bai");

        // define our region of interest
        // in this example: bases 0-500 on the reference "chrX"
        int id = reader.GetReferenceID("chrX");
        BamRegion region(id, 0, id, 500);
        reader.SetRegion(region);

        // iterate through alignments in this region,
        // ignoring alignments with a MQ below some cutoff
        BamAlignment al;
        while ( reader.GetNextAlignment(al) ) {
            if ( al.MapQuality >= 50 )
                // do something
        }

        // close the reader
        reader.Close();

To use this API in your application, you simply need to do 3 things:

    1 - Build the BamTools library (see Installation steps above). 

    2 - Import BamTools API with the following lines of code
        #include "BamReader.h"     // (or "BamMultiReader.h") as needed
        #include "BamWriter.h"     // as needed
        using namespace BamTools;  // all of BamTools classes/methods live in
                                   // this namespace

    3 - Link with '-lbamtools' ('l' as in Lima).

You may need to modify the -L flag (library path) as well to help your linker 
find the (BAMTOOLS_ROOT)/lib directory.

See any included programs for more detailed usage examples. See comments in the 
header files for more detailed API documentation.

Note - For users that don't want to bother with the new BamTools shared library
scheme: you are certainly free to just compile the API source code directly into 
your application, but be aware that the files involved are subject to change.  
Meaning that filenames, number of files, etc. are not fixed.  You will also need 
to be sure to link with '-lz' for ZLIB functionality (linking with '-lbamtools' 
gives you this automatically).

----------------------------------------
IIIb. The Toolkit
----------------------------------------

BamTools provides a small, but powerful suite of command-line utility programs
for manipulating and querying BAM files for data.

--------------------
Input/Output
--------------------

All BamTools utilities handle I/O operations using a common set of arguments.
These include:

 -in <BAM file>

The input BAM files(s).

    If a tool accepts multiple BAM files as input, each file gets its own "-in"
    option on the command line. If no "-in" is provided, the tool will attempt
    to read BAM data from stdin.

    To read a single BAM file, use a single "-in" option:
    > bamtools *tool* -in myData1.bam ...ARGS...

    To read multiple BAM files, use multiple "-in" options:
    > bamtools *tool* -in myData1.bam -in myData2.bam ...ARGS...

    To read from stdin (if supported), omit the "-in" option:
    > bamtools *tool* ...ARGS...

  -out <BAM file>

The output BAM file.

    If a tool outputs a result BAM file, specify the filename using this option.
    If none is provided, the tool will typically write to stdout.

    *Note: Not all tools output BAM data (e.g. count, header, etc.)

 -region <REGION>

A region of interest. See below for accepted 'REGION string' formats.

    Many of the tools accept this option, which allows a user to only consider
    alignments that overlap this region (whether counting, filtering, merging,
    etc.).

    An alignment is considered to overlap a region if any part of the alignments
    intersects the left/right boundaries. Thus, a 50bp alignment at position 70
    will overlap a region beginning at position 100.

    REGION string format
    ----------------------
    A proper REGION string can be formatted like any of the following examples:
        where 'chr1' is the name of a reference (not its ID)and '' is any valid
        integer position within that reference.

    To read
    chr1               - only alignments on (entire) reference 'chr1'
    chr1:500           - only alignments overlapping the region starting at
                         chr1:500 and continuing to the end of chr1
    chr1:500..1000     - only alignments overlapping the region starting at
                         chr1:500 and continuing to chr1:1000
    chr1:500..chr3:750 - only alignments overlapping the region starting at
                         chr1:500 and continuing to chr3:750. This 'spanning'
                         region assumes that the reference specified as the
                         right boundary will occur somewhere in the file after
                         the left boundary. On a sorted BAM, a REGION of
                         'chr4:500..chr2:1500' will produce undefined
                         (incorrect) results.  So don't do it. :)

    *Note: Most of the tools that accept a REGION string will perform without an
           index file, but typically at great cost to performance (having to
           plow through the entire file until the region of interest is found).
           For optimum speed, be sure that index files are available for your
           data.

 -forceCompression

Force compression of BAM output.

    When tools are piped together (see details below), the default behavior is
    to turn off compression. This can greatly increase performance when the data
    does not have to be constantly decompressed and recompressed. This is
    ignored any time an output BAM file is specified using "-out".

--------------------
Piping
--------------------

Many of the tools in BamTools can be chained together by piping.  Any tool that
accepts stdin can be piped into, and any that can output stdout can be piped
from. For example:

> bamtools filter -in data1.bam -in data2.bam -mapQuality ">50" | bamtools count

will give a count of all alignments in your 2 BAM files with a mapQuality of
greater than 50. And of course, any tool writing to stdout can be piped into
other utilities.

--------------------
The Tools
--------------------

    convert         Converts between BAM and a number of other formats
    count           Prints number of alignments in BAM file(s)
    coverage        Prints coverage statistics from the input BAM file
    filter          Filters BAM file(s) by user-specified criteria
    header          Prints BAM header information
    index           Generates index for BAM file
    merge           Merge multiple BAM files into single file
    random          Select random alignments from existing BAM file(s)
    sort            Sorts the BAM file according to some criteria
    split           Splits a BAM file on user-specifed property, creating a new
                       BAM output file for each value found
    stats           Prints some basic statistics from input BAM file(s)

----------
convert
----------

Description: converts BAM to a number of other formats

Usage: bamtools convert -format <FORMAT> [-in <filename> -in <filename> ...]
                        [-out <filename>] [other options]

Input & Output:
  -in <BAM filename>               the input BAM file(s) [stdin]
  -out <BAM filename>              the output BAM file [stdout]
  -format <FORMAT>                 the output file format - see below for
                                       supported formats

Filters:
  -region <REGION>                 genomic region. Index file is recommended for
                                       better performance, and is read
                                       automatically if it exists. See 'bamtools
                                       help index' for more details on creating
                                       one.

Pileup Options:
  -fasta <FASTA filename>          FASTA reference file
  -mapqual                         print the mapping qualities

SAM Options:
  -noheader                        omit the SAM header from output

Help:
  --help, -h                       shows this help text

** Notes **

    - Currently supported output formats ( BAM -> X )

        Format type   FORMAT (command-line argument)
        ------------  -------------------------------
        BED           bed
        FASTA         fasta
        FASTQ         fastq
        JSON          json
        Pileup        pileup
        SAM           sam
        YAML          yaml

        Usage example:
        > bamtools convert -format json -in myData.bam -out myData.json

    - Pileup Options have no effect on formats other than "pileup"
      SAM Options have no effect on formats other than "sam"

----------
count
----------

Description: prints number of alignments in BAM file(s).

Usage: bamtools count [-in <filename> -in <filename> ...] [-region <REGION>]

Input & Output:
  -in <BAM filename>                the input BAM file(s) [stdin]
  -region <REGION>                  genomic region. Index file is recommended
                                       for better performance, and is used
                                       automatically if it exists. See
                                       'bamtools help index' for more details
                                       on creating one

Help:
  --help, -h                        shows this help text

----------
coverage
----------

Description: prints coverage data for a single BAM file.

Usage: bamtools coverage [-in <filename>] [-out <filename>]

Input & Output:
  -in <BAM filename>                the input BAM file [stdin]
  -out <filename>                   the output file [stdout]

Help:
  --help, -h                        shows this help text

----------
filter
----------

Description: filters BAM file(s).

Usage: bamtools filter [-in <filename> -in <filename> ...]
                       [-out <filename> | [-forceCompression]]
                       [-region <REGION>]
                       [ [-script <filename] | [filterOptions] ]

Input & Output:
  -in <BAM filename>                the input BAM file(s) [stdin]
  -out <BAM filename>               the output BAM file [stdout]
  -region <REGION>                  only read data from this genomic region (see
                                       README for more details)
  -script <filename>                the filter script file (see README for more
                                       details)
  -forceCompression                 if results are sent to stdout (like when
                                       piping to another tool), default behavior
                                       is to leave output uncompressed. Use this
                                       flag to override and force compression

General Filters:
  -alignmentFlag <int>              keep reads with this *exact* alignment flag
                                       (for more detailed queries, see below)
  -insertSize <int>                 keep reads with insert size that matches
                                       pattern
  -mapQuality <[0-255]>             keep reads with map quality that matches
                                       pattern
  -name <string>                    keep reads with name that matches pattern
  -queryBases <string>              keep reads with motif that matches pattern
  -tag <TAG:VALUE>                  keep reads with this key=>value pair

Alignment Flag Filters:
  -isDuplicate <true/false>         keep only alignments that are marked as
                                       duplicate [true]
  -isFailedQC <true/false>          keep only alignments that failed QC [true]
  -isFirstMate <true/false>         keep only alignments marked as first mate
                                       [true]
  -isMapped <true/false>            keep only alignments that were mapped [true]
  -isMateMapped <true/false>        keep only alignments with mates that mapped
                                       [true]
  -isMateReverseStrand <true/false> keep only alignments with mate on reverse
                                       strand [true]
  -isPaired <true/false>            keep only alignments that were sequenced as
                                       paired [true]
  -isPrimaryAlignment <true/false>  keep only alignments marked as primary
                                       [true]
  -isProperPair <true/false>        keep only alignments that passed paired-end
                                       resolution [true]
  -isReverseStrand <true/false>     keep only alignments on reverse strand
                                       [true]
  -isSecondMate <true/false>        keep only alignments marked as second mate
                                       [true]

Help:
  --help, -h                        shows this help text

  *****************
  * Filter Script *
  *****************

The BamTools filter tool allows you to use an external filter script to define
complex filtering behavior.  This script uses what I'm calling properties,
filters, and a rule - all implemented in a JSON syntax.

  ** Properties **

A 'property' is a typical JSON entry of the form:

    "propertyName" : "value"

Here are the property names that BamTools will recognize:

    alignmentFlag
    cigar
    insertSize
    isDuplicate
    isFailedQC
    isFirstMate
    isMapped
    isMateMapped
    isMateReverseStrand
    isPaired
    isPrimaryAlignment
    isProperPair
    isReverseStrand
    isSecondMate
    mapQuality
    matePosition
    mateReference
    name
    position
    queryBases
    reference
    tag

For properties with boolean values, use the words "true" or "false".
For example,

   "isMapped" : "true"

will keep only alignments that are flagged as 'mapped'.

For properties with numeric values, use the desired number with optional
comparison operators ( >, >=, <, <=, !). For example,

    "mapQuality" : ">=75"

will keep only alignments with mapQuality greater than or equal to 75.

If you're familiar with JSON, you know that integers can be bare (without
quotes). However, if you a comparison operator, be sure to enclose in quotes.

For string-based properties, the above operators are available.  In addition,
 you can also use some basic pattern-matching operators.  For example,

    "reference" : "ALU*"  // reference starts with 'ALU'
    "name" : "*foo"       // name ends with 'foo'
    "cigar" : "*D*"       // cigar contains a 'D' anywhere

Notes -
The reference property refers to the reference name, not the BAM reference
numeric ID.

The tag property has an extra layer, so that the syntax will look like this:

    "tag" : "XX:value"

where XX is the 2-letter SAM/BAM tag and value is, well, the value.
Comparison operators can still apply to values, so tag properties of:

    "tag" : "AS:>60"
    "tag" : "RG:foo*"

are perfectly valid.

 ** Filters **

A 'filter' is a JSON container of properties that will be AND-ed together. For
example,

{
   "reference" : "chr1",
   "mapQuality" : ">50",
   "tag" : "NM:<4"
}

would result in an output BAM file containing only alignments from chr1 with a
mapQuality >50 and edit distance of less than 4.

A single, unnamed filter like this is the minimum necessary for a complete
filter script.  Save this file and use as the -script parameter and you should
be all set.

Moving on to more potent filtering...

You can also define multiple filters.
To do so, you just need to use the "filters" keyword along with JSON array
syntax, like this:

{
   "filters" :
      [
        {
          "reference" : "chr1",
          "mapQuality" : ">50"
        },
        {
          "reference" : "chr1",
          "isReverseStrand" : "true"
        }
      ]
}

These filters will be (inclusive) OR-ed together by default. So you'd get a
resulting BAM with only alignments from chr1 that had either mapQuality >50 or
on the reverse strand (or both).

 ** Rule **

Alternatively to anonymous OR-ed filters, you can also provide what I've called
a "rule". By giving each filter an "id", using this "rule" keyword you can
describe boolean relationships between your filter sets.

Available rule operators:

    &   // and
    |   // or
    !   // not

This might sound a little fuzzy at this point, so let's get back to an example:

{
   "filters" :
      [
        {
          "id" : "filter1",
          "reference" : "chr1",
          "mapQuality" : ">50"
        },
        {
          "id" : "filter2",
          "reference" : "chr1",
          "isReverseStrand" : "true"
        },
        {
          "id" : "filter3",
          "reference" : "chr1",
          "queryBases" : "AGCT*"
        }
      ],

   "rule" : " (filter1 | filter2) & !filter3 "
}

In this case, we would only retain aligments that passed filter 1 OR filter 2,
AND also NOT filter 3.

These are dummy examples, and don't make much sense as an actual query case. But
hopefully this serves an adequate primer to get you started and discover the
potential flexibility here.

----------
header
----------

Description: prints header from BAM file(s).

Usage: bamtools header [-in <filename> -in <filename> ...]

Input & Output:
  -in <BAM filename>                the input BAM file(s) [stdin]

Help:
  --help, -h                        shows this help text

----------
index
----------

Description: creates index for BAM file.

Usage: bamtools index [-in <filename>] [-bti]

Input & Output:
  -in <BAM filename>                the input BAM file [stdin]
  -bti                              create (non-standard) BamTools index file
                                       (*.bti). Default behavior is to create
                                       standard BAM index (*.bai)

Help:
  --help, -h                        shows this help tex

----------
merge
----------

Description: merges multiple BAM files into one.

Usage: bamtools merge [-in <filename> -in <filename> ...]
                      [-out <filename> | [-forceCompression]] [-region <REGION>]

Input & Output:
  -in <BAM filename>                the input BAM file(s)
  -out <BAM filename>               the output BAM file
  -forceCompression                 if results are sent to stdout (like when
                                       piping to another tool), default behavior
                                       is to leave output uncompressed. Use this
                                       flag to override and force compression
  -region <REGION>                  genomic region. See README for more details

Help:
  --help, -h                        shows this help text

----------
random
----------

Description: grab a random subset of alignments.

Usage: bamtools random [-in <filename> -in <filename> ...]
                       [-out <filename>] [-forceCompression] [-n]
                       [-region <REGION>]

Input & Output:
  -in <BAM filename>                the input BAM file [stdin]
  -out <BAM filename>               the output BAM file [stdout]
  -forceCompression                 if results are sent to stdout (like when
                                       piping to another tool), default behavior
                                       is to leave output uncompressed. Use this
                                       flag to override and force compression
  -region <REGION>                  only pull random alignments from within this
                                       genomic region. Index file is
                                       recommended for better performance, and
                                       is used automatically if it exists. See
                                       'bamtools help index' for more details
                                       on creating one

Settings:
  -n <count>                        number of alignments to grab. Note that no
                                       duplicate checking is performed [10000]

Help:
  --help, -h                        shows this help text

----------
sort
----------

Description: sorts a BAM file.

Usage: bamtools sort [-in <filename>] [-out <filename>] [sortOptions]

Input & Output:
  -in <BAM filename>                the input BAM file [stdin]
  -out <BAM filename>               the output BAM file [stdout]

Sorting Methods:
  -byname                           sort by alignment name

Memory Settings:
  -n <count>                        max number of alignments per tempfile
                                       [10000]
  -mem <Mb>                         max memory to use [1024]

Help:
  --help, -h                        shows this help text

----------
split
----------

Description: splits a BAM file on user-specified property, creating a new BAM
output file for each value found.

Usage: bamtools split [-in <filename>] [-stub <filename stub>]
                      < -mapped | -paired | -reference | -tag <TAG> >

Input & Output:
  -in <BAM filename>                the input BAM file [stdin]
  -stub <filename stub>             prefix stub for output BAM files (default
                                       behavior is to use input filename,
                                       without .bam extension, as stub). If
                                       input is stdin and no stub provided, a
                                       timestamp is generated as the stub.

Split Options:
  -mapped                           split mapped/unmapped alignments
  -paired                           split single-end/paired-end alignments
  -reference                        split alignments by reference
  -tag <tag name>                   splits alignments based on all values of TAG
                                       encountered (i.e. -tag RG creates a BAM
                                       file for each read group in original
                                       BAM file)

Help:
  --help, -h                        shows this help text

----------
stats
----------

Description: prints general alignment statistics.

Usage: bamtools stats [-in <filename> -in <filename> ...] [statsOptions]

Input & Output:
  -in <BAM filename>                the input BAM file [stdin]

Additional Stats:
  -insert                           summarize insert size data

Help:
  --help, -h                        shows this help text

--------------------------------------------------------------------------------
IV. License :
--------------------------------------------------------------------------------

Both the BamTools API and toolkit are released under the MIT License.
Copyright (c) 2009-2010 Derek Barnett, Erik Garrison, Gabor Marth,
    Michael Stromberg

See included file LICENSE for details.

--------------------------------------------------------------------------------
V. Acknowledgements :
--------------------------------------------------------------------------------

 * Aaron Quinlan for several key feature ideas and bug fix contributions
 * Baptiste Lepilleur for the public-domain JSON parser (JsonCPP)
 * Heng Li, author of SAMtools - the original C-language BAM API/toolkit.

--------------------------------------------------------------------------------
VI. Contact :
--------------------------------------------------------------------------------

Feel free to contact me with any questions, comments, suggestions, bug reports,
    etc.

Derek Barnett
Marth Lab
Biology Dept., Boston College

Email: barnetde@bc.edu
Project Websites: http://github.com/pezmaster31/bamtools   (ACTIVE SUPPORT)
                  http://sourceforge.net/projects/bamtools (major updates only)

About

API and toolkit for reading, writing, and manipulating BAM (genome alignment) files

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published