Nanoscope is a metagenomic analysis pipeline for synthetic long reads. It was introduced in:
High-resolution structure of the human microbiome revealed with synthetic long reads.
Volodymyr Kuleshov, Chao Jiang, Wenyu Zhou, Fereshteh Jahanbani, Serafim Batzoglou, Michael Snyder.
Nature Biotechnology, 2015.
Nanoscope assumes that the following basic UNIX and genomic analysis utilities are installed on your system:
- GNU awk
- sed
- perl v5 or higher
- python v2
- GNU make
- GNU Parallel
- NCBI Blast >= 2.2.25
- BWA >= 0.7.5
- samtools >= 0.1.18
- bedtools >= 2.17.0
- pysam >= 0.8.2
The programs awk, perl, python, and blast must be in your $PATH
during installation.
Paths to all the others can be specified at runtime.
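The $PATH requirement above can be checked before starting the installation. The following is a minimal sketch, not part of Nanoscope: the executable names (for example `parallel` for GNU Parallel and `makeblastdb` for NCBI Blast) are assumptions about how the tools are named on your system, and the script does not verify version numbers. It is written for Python 3 and runs independently of the pipeline.

```python
import shutil

# Executable names below are assumptions; adjust them to match your system.
# Tools that must be on $PATH during installation:
INSTALL_TIME_TOOLS = ["awk", "perl", "python", "makeblastdb"]
# Tools whose paths can instead be given at runtime:
RUNTIME_TOOLS = ["sed", "make", "parallel", "bwa", "samtools", "bedtools"]


def missing_tools(names, which=shutil.which):
    """Return the entries of `names` that cannot be found on $PATH."""
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    for label, tools in [("install-time", INSTALL_TIME_TOOLS),
                         ("runtime", RUNTIME_TOOLS)]:
        missing = missing_tools(tools)
        print(label, "tools missing:", ", ".join(missing) if missing else "none")
```

Any tool reported as missing in the install-time group should be added to $PATH before running the installation script.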
Nanoscope uses the following software in its analysis:
- Soapdenovo 2.04
- Celera assembler 8.1
- AMOS 3.1.0
- SPAdes 3.5.0
- Mummer 3.13 (modified)
- CD-Hit 4.6 (modified)
- FCP 1.0.5
- Quast 2.3
- Lens 1.0
The source code of these programs is part of the git repository; it needs to be compiled during the installation process. Several programs have been modified to work with long reads; therefore you need to compile the source code included in this package rather than use a separately installed copy. The programs' dependencies are part of a standard Linux setup; if some dependencies are missing, you will be notified at installation time.
A Nanoscope installation uses approximately 50G of disk space. Each Nanoscope run requires about 50G of disk space and 15G of RAM per long read library. Most stages can be parallelized, and we recommend using at least 16 cores.
In our experience, one run can take anywhere from one day (1 library, 10 cores) to one week (7 libraries, 25 cores).
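The per-library figures above can be turned into a rough capacity check. This sketch assumes that both the 50G of disk and the 15G of RAM scale linearly with the number of long read libraries, and that "G" means gigabytes; neither is guaranteed by the figures above, so treat the result as an estimate only. The function names are hypothetical.

```python
import shutil

# Figures quoted above; linear scaling per library is an assumption.
RUN_DISK_GB_PER_LIBRARY = 50
RUN_RAM_GB_PER_LIBRARY = 15


def estimate_run_resources(n_libraries):
    """Estimate disk and RAM (in GB) for a run with `n_libraries` libraries."""
    if n_libraries < 1:
        raise ValueError("at least one long read library is required")
    return {
        "disk_gb": RUN_DISK_GB_PER_LIBRARY * n_libraries,
        "ram_gb": RUN_RAM_GB_PER_LIBRARY * n_libraries,
    }


def has_enough_disk(path, required_gb):
    """Return True if the filesystem holding `path` has `required_gb` free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb


if __name__ == "__main__":
    needed = estimate_run_resources(7)
    print("estimated needs:", needed)
    print("enough disk here:", has_enough_disk(".", needed["disk_gb"]))
```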
Before installing Nanoscope, make sure that awk, perl, python, and makeblastdb are in your $PATH.
To install Nanoscope, clone the git repo and run
the installation script for the third-party software:
git clone https://github.com/kuleshov/nanoscope.git;
cd nanoscope;
git submodule init;
git submodule update;
cd sw;
bash install.sh
The installation of third-party tools can take up to 5 hours. The longest part is the installation of the FCP package, which involves downloading the NCBI RefSeq genomes (7.5G) and building their blast database.
The only package used by Nanoscope not included in the repository is the optional SPAdes assembler. To compile SPAdes, the user needs the latest version of several somewhat esoteric tools on which we do not want Nanoscope to depend. To use SPAdes, we ask the user to manually download binaries for their system, and add the appropriate path to the Nanoscope configuration.
To test whether the pipeline was installed successfully, we have provided a small testing package in nanoscope/test:
cd nanoscope/test;
make run;
This executes Nanoscope on a small subset of reads in fastq format.
Running the pipeline involves two steps: (1) creating and initializing a new directory in which the run will take place; (2) starting the run with some additional run-specific parameters.
Both steps are done using the wrapper script nanoscope.py:
usage: nanoscope.py [-h] {status,init,run} ...
optional arguments:
-h, --help show this help message and exit
Commands:
{status,init,run}
init Initialize pipeline
run Start/resume pipeline
status Display pipeline status
A run is initialized using the init command:
usage: nanoscope.py init [-h] -l LONG [-s SHORT_SINGLE]
[-p SHORT_PAIRED SHORT_PAIRED]
[-c CONTIGS [CONTIGS ...]]
[--short-insert-size SHORT_INSERT_SIZE]
[--short-read-length SHORT_READ_LENGTH] [--path PATH]
folder
positional arguments:
folder Pipeline folder
optional arguments:
-h, --help show this help message and exit
-l LONG, --long LONG Long reads fastq
-s SHORT_SINGLE, --short-single SHORT_SINGLE
Short reads fastq (unpaired)
-p SHORT_PAIRED SHORT_PAIRED, --short-paired SHORT_PAIRED SHORT_PAIRED
Short reads fastq (paired)
-c CONTIGS [CONTIGS ...], --contigs CONTIGS [CONTIGS ...]
Pre-assembled contigs
--short-insert-size SHORT_INSERT_SIZE
Short read insert size
--short-read-length SHORT_READ_LENGTH
Short read insert size
--path PATH Path to nanoscope
During initialization, the user must specify input DNA sequences:
- A mandatory fastq file with long reads
- A highly recommended fastq file with short reads (paired-end or unpaired). Short reads are required for the abundance estimation stage.
- An optional set of pre-assembled contigs that will be merged with the assembled contigs
For example:
python bin/nanoscope.py init \
-l test/reads.toy.long.fastq.gz \
-s test/reads.toy.short.fastq \
test/test-run
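When many samples need to be initialized, the init command line can be assembled programmatically. The helper below is a hypothetical sketch that uses only the flags listed in the usage text above. Its name, its default `script` path, and the choice to place `-c` before the other options are assumptions; the ordering is there because the usage text resembles argparse output, where a multi-value option placed directly before the positional folder could absorb it.

```python
def build_init_command(folder, long_reads, short_single=None,
                       short_paired=None, contigs=None,
                       script="bin/nanoscope.py"):
    """Build the argument list for `nanoscope.py init` (hypothetical helper)."""
    cmd = ["python", script, "init"]
    if contigs:
        # Multi-value option goes first so it cannot swallow the folder.
        cmd += ["-c"] + list(contigs)
    cmd += ["-l", long_reads]
    if short_single:
        cmd += ["-s", short_single]
    if short_paired:
        first, second = short_paired  # exactly two paired-end files
        cmd += ["-p", first, second]
    cmd.append(folder)
    return cmd


if __name__ == "__main__":
    print(" ".join(build_init_command(
        "test/test-run",
        "test/reads.toy.long.fastq.gz",
        short_single="test/reads.toy.short.fastq",
    )))
```

The returned list can be passed to `subprocess.check_call` from the repository root to perform the initialization.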
The folder test/test-run will be populated with folders containing scripts that launch the various steps of the analysis.
The most important of these scripts is test/test-run/config-and-run.sh.
This file contains all the configuration parameters used across the pipeline.
They will all be set to sensible default values; advanced users however may
choose to customize this file. The most important options are:
- Paths to various programs
- Assembler options (see also test-run/config-files for that)
- Flags for turning off assembly merging or the entire assembly stage
The run command starts a new run:
usage: nanoscope.py run [-h] [-p PROCESSORS] [-r] [--up-to UP_TO] [--skip-asm]
[--skip-minimus] [--spades]
folder
positional arguments:
folder Pipeline folder
optional arguments:
-h, --help show this help message and exit
-p PROCESSORS, --processors PROCESSORS
Number of processors to use
-r, --restart Resume pipeline from the begginning
--up-to UP_TO Compute up to given stage
--skip-asm Skip read assembly
--skip-minimus Skip minimus contig merging
--spades Assemble short and long reads with spades
This executes the config-and-run.sh script in the run folder, which sources the pipeline configuration settings and starts the pipeline scripts.
For example:
python bin/nanoscope.py run \
-p 10 \
test/test-run
The run command also supports additional options: --skip-asm, which skips read assembly; --skip-minimus, which skips assembly merging; and --up-to, which runs the pipeline only up to a given stage. Finally, use --spades to assemble both short and long reads using the SPAdes assembler; this produces longer contigs but may assemble much less total sequence.
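The run options described above can be combined in a small wrapper when launching runs from a script. This is a hypothetical sketch restricted to the flags shown in the usage text; the function name and the default `script` path are assumptions. The valid stage names for `--up-to` are not listed here, so the value is passed through unchanged.

```python
def build_run_command(folder, processors=None, up_to=None, skip_asm=False,
                      skip_minimus=False, spades=False,
                      script="bin/nanoscope.py"):
    """Build the argument list for `nanoscope.py run` (hypothetical helper)."""
    cmd = ["python", script, "run"]
    if processors is not None:
        cmd += ["-p", str(processors)]
    if up_to is not None:
        cmd += ["--up-to", str(up_to)]  # stage name passed through as given
    if skip_asm:
        cmd.append("--skip-asm")
    if skip_minimus:
        cmd.append("--skip-minimus")
    if spades:
        cmd.append("--spades")
    cmd.append(folder)
    return cmd


if __name__ == "__main__":
    print(" ".join(build_run_command("test/test-run", processors=10)))
```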
The output of Nanoscope will be found in the results subfolder of the run folder.
Output includes:
- asm.report.txt: a Quast report of the assembly results
- taxa.abundances: taxa identified by Nanoscope and their estimated abundances
- taxa.contig.assignments: taxonomic labels of assembled contigs
- taxa.haplotypes: bacterial haplotypes produced by Lens; see the Lens documentation for more info
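After a run finishes, a quick check can confirm which of the outputs listed above are present. The sketch below assumes the outputs sit directly inside the results subfolder under exactly these names; the helper name is hypothetical. Some outputs may legitimately be absent, for example abundance estimates when no short reads were supplied, or assembly results when assembly was skipped.

```python
import os

# Output names as listed above; their location directly under
# <run folder>/results is an assumption.
EXPECTED_OUTPUTS = [
    "asm.report.txt",
    "taxa.abundances",
    "taxa.contig.assignments",
    "taxa.haplotypes",
]


def missing_outputs(run_folder, exists=os.path.exists):
    """Return the expected outputs not found in `run_folder`/results."""
    results = os.path.join(run_folder, "results")
    return [name for name in EXPECTED_OUTPUTS
            if not exists(os.path.join(results, name))]


if __name__ == "__main__":
    missing = missing_outputs("test/test-run")
    print("missing outputs:", ", ".join(missing) if missing else "none")
```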
Please send feedback to Volodymyr Kuleshov.