NGS Analyzer-MINI

version: 1.1 (based on NGS Analyzer 1.0)
update: 2019/03/20
contact: miniapp@riken.jp

About NGS Analyzer

This miniapp is based on a genome analysis tool, NGS Analyzer, which has been developed in Center for Genomic Medicine, RIKEN. NGS Analyzer analyzes output data generated by a next-generation genome sequencer in high speed and identifies genetic differences among persons or cancer cell's mutations much precisely.

Original NGS Analyzer

Install

Compile

Compiling this program requires a C/C++ compiler which supports OpenMP, an MPI library and Gnu Make. This package provides Makefiles for K computer/Fujitsu FX10 and Intel X86_64 architecture machines.

Download this package and extract its contents

make

K computer/FX10
```
   $ make -f makefile.k
```
X86_64 machines with GCC
```
   $ make -f makefile.x86_64_gcc
```

You can find executables under 'bin/' directory if make succeeds.

Prepare data

As this package does not provide input data, you need to create it. To create data, refer to README_DATA.md.

What NGS Analyzer-MINI does - brief explanation

This miniapp provides shellscripts that execute a genome analysis workflow of the original NGS Analyzer. These shellscripts consist of two programs: a script that executes the whole workflow and three scripts that execute each step of the workflow.

The workflow consists of the following steps.

Mapping
Performs sequence alignment, that is, it calculates the best matches for the input genome and reference genome. After alignment, it splits the results into files based on contigs (DNA segments).
Merge
Merges the alignment results per contigs. This step is performed when the input genome is split and then aligned in parallel.
Remove duplicates
Removes duplicated sequence patterns from the alignment results.
Detect mutations
Statistically identifies genetic differences.

Each step is implemented by using well-known and widely-used alignment tool and generic alignment format tool, etc.

How to Run the program

Running the whole workflow

This package provides a program that sequentially executes Mapping, Merge, Remove duplicates and Detect mutation as 'workflow'. 'workflow' can be run in parallel using multiple processes using MPI.

How to run

Place input files
Create directories whose names are MPI rank numbers, and then copy file pairs under ./input/00-read directory. In the following example, we assume that 12 gene file pairs are processed by 6 nodes (MPI processes).

 $ cd ./input
 $ mkdir -p 00-read-rank/0
 $ mkdir -p 00-read-rank/1
 $ mkdir -p 00-read-rank/2
 $ mkdir -p 00-read-rank/3
 $ mkdir -p 00-read-rank/4
 $ mkdir -p 00-read-rank/5
 $ cp 00-read/part_1.00 00-read-rank/0
 $ cp 00-read/part_2.00 00-read-rank/0
 $ cp 00-read/part_1.01 00-read-rank/0
 $ cp 00-read/part_2.01 00-read-rank/0
 $ cp 00-read/part_1.02 00-read-rank/1
 $ cp 00-read/part_2.02 00-read-rank/1
 $ cp 00-read/part_1.03 00-read-rank/1
 $ cp 00-read/part_2.03 00-read-rank/1
 $ cp 00-read/part_1.04 00-read-rank/2
 $ cp 00-read/part_2.04 00-read-rank/2
 $ cp 00-read/part_1.05 00-read-rank/2
 $ cp 00-read/part_2.05 00-read-rank/2
 $ cp 00-read/part_1.06 00-read-rank/3
 $ cp 00-read/part_2.06 00-read-rank/3
 $ cp 00-read/part_1.07 00-read-rank/3
 $ cp 00-read/part_2.07 00-read-rank/3
 $ cp 00-read/part_1.08 00-read-rank/4
 $ cp 00-read/part_2.08 00-read-rank/4
 $ cp 00-read/part_1.09 00-read-rank/4
 $ cp 00-read/part_2.09 00-read-rank/4
 $ cp 00-read/part_1.10 00-read-rank/5
 $ cp 00-read/part_2.10 00-read-rank/5
 $ cp 00-read/part_1.11 00-read-rank/5
 $ cp 00-read/part_2.11 00-read-rank/5
 $ cd ..

Run
Run as an interactive job.

 $ mpiexec -n 6 ./bin/workflow ./input/bwa_db/reference.fa \
 ./input/seq_contig.md ./input/reference.fa ./input/reference.fa.fai \
 ./input/00-read-rank

After job completes, workflow_OUT directory is created in the current directory and you can find output in the directory.

A sample jobscript for K computer that performs the above procedure 1 and 2 using rank-directory is provided as k_job_script/workflow-job.sh. This jobscript uses 12 nodes.

    $ pjsub ./k_job_script/workflow-job.sh

Running each step of the workflow

This package provides three scripts each of which runs Mapping, Remove duplicates and Detect mutations. We can estimate the performance of large scale analysis by running and profiling these scripts.

Mapping

File IO

Input files
- Target read pair (two files)
- Reference sequence file formatted for BWA(six files)
- Information on contigs in the reference sequence (one file)
Output files
- Mapping results split based on Contigs
Intermediate files
- Suffix Array (SA) files of the input read pair (two files)
- bwa alignment result (one file)

How to run

Run bin/01-alignment.sh. You can specify the number of threads to calculate SA as a parameter. Default is 8.

$ ./bin/01-alignment.sh -t 8 ./input/bwa_db/reference.fa \
./input/seq_contig.md ./input/01-alignment/sra_1.fastq.0 \
./input/01-alignment/sra_2.fastq.0

The intermediate files are generated in 01-alignment.sh_MED/ directory and the output files are generated in 01-alignment.sh_OUT directory.

A jobscript for K computer is provided as k_job_script/01-alignment-job.sh.

Remove duplicates

File IO

Input files
- SAM file
- Index of reference sequence
Output file
- BAM file (duplicates are removed)
Intermediate files
- BAM file(include duplicates)
- Sorted BAM fileM(include duplicates)

How to run

Run bin/02-rmdup.sh. You can specify the amount of memory used in sort. Default is 800MB.

$ ./bin/02-rmdup.sh ./input/reference.fa.fai \
./input/02-rmdup/NT_008818.16.sam

The intermediate files are generated in 02-rmdup.sh_MED/ directory and the output files are generated in 02-rmdup.sh_OUT/ directory.

A jobscript for K computer is provided as k_job_script/02-rmdup-job.sh.

Detect mutations

File IO

Input files
- BAM file
- Reference sequence
Output files
- Indel, SNP and summary files
Intermediate files
- pileup file

How to run

Run bin/03-snp.sh.

$ ./bin/03-snp.sh ./input/reference.fa ./input/03-snp/NT_008818.16.bam

The intermediate files are generated in 03-snp.sh_MED/ directory and the output files are generated in 03-snp.sh_OUT/ directory.

A jobscript for K computer is provided as k_job_script/03-snp-job.sh.

Estimate the performance of large scale analysis

This package also provides a method for estimating the performance of large scale analysis using the results of small analysis obtained by provided scripts. You can estimate it by running each program in 'Running each step of the workflow', obtaining their profiles and filling them in an attached Excel sheet, model/perf-model.xlsx.

The model has the following assumption.

The model does not assume load on the storage devices. As a result, the calculated IO performance means the maximum required performance. Moreover, there will be large difference between the model and the real execution time when multiple nodes are used.
The cost for 'Merge' is not included in the model.
In 'Remove duplicates' and 'Detect mutations' step, depending on patterns of input genome, sizes of input files widely vary. This model shows the estimated performance of a case where size of all files are almost same and it does not show that of a case where size of files differ.

Obtain profiles

You need to obtain elapse time, FLOPS, MIPS and memory throughput of the above three programs in 'Running each step of the workflow' by using profiler. You also need to count size of input/intermediate/output files of each program.

This package provides scripts for obtaining these profiles on K computer and Fujitsu FX10 using Fujitsu profiler. You can use them as follow.

Execute each program with '-p' profiling option.
```
 $ ./bin/01-alignment.sh -p -t 8 ./input/bwa_db/reference.fa \
 ./input/seq_contig.md ./input/01-alignment/sra_1.fastq.0 \
 ./input/01-alignment/sra_2.fastq.0
 $ ./bin/02-rmdup.sh ../data/reference.fa.fai \
 ../data/02-rmdup/NT_005120.16.sam
 $ ./bin/03-snp.sh ../data/reference.fa ../data/03-snp/NT_005120.16.bam
```
By executing with this option, profiles (01-alignment.sh_PROF, 02-rmdup.sh_PROF, 03-snp.sh_PROF) are generated as well as intermediate and output files. Jobscripts for K computer are provided as k_job_script/01-alignment-job-profile.sh, k_job_script/02-rmdup-job-profile.sh and k_job_script/03-snp-job-profile.sh.

Obtain information by running profile parser.

 $ ./bin/parse-01-alignment.rb ./input/01-alignment/sra_1.fastq.0 \
 ./input/01-alignment/sra_2.fastq.0 01-alignment-job.sh_MED \
 01-alignment-job.sh_OUT 01-alignment-job.sh_PROF
 $ ./bin/parse-02-rmdup.rb ./input/02-rmdup/NT_008818.16.sam \
 ./02-rmdup-job.sh_MED ./02-rmdup-job.sh_OUT ./02-rmdup-job.sh_PROF
 $ ./bin/parse-03-snp.rb ./input/03-snp/NT_008818.16.bam \
 ./03-snp-job.sh_MED ./03-snp-job.sh_OUT ./03-snp-job.sh_PROF

Fill results in highlighted cells in page 2-4 of the Excel sheet.

Estimate the performance of large scale analysis

Fill number of input files and number of CPUs of each step in highlighted cells in the 'Parameters for a large scale run' table in the Summary sheet. Then values in 'Calculated performance of a large scale run' table are automatically updated.

Estimate the required performance when maximum elapse time is given

This model calculates the required performance of each workflow step when the execution should be completed within the specified elapse time.

Fill elapse time and number of files of each step in highlighted cells in the 'Parameters for time restriction' table in the Summary sheet. Then values in 'Estimated performance under given time threshold' table are automatically updated.

Estimate the required performance for staging when maximum staging time is given

This model calculates the required IO performance of stage-in/out of input/output files on a system that employs staging when the staging should be completed within the specified time.

Fill required time for stage-in/out, number of target files and number of nodes in highlighted cells in 'Parameters for time restriction - Staging' table in the Summary sheet. Then values in 'Estimated time for staging under time threshold' table are automatically updated.

Target exa-scale problem setting

Size of one genome data
- 100 times bigger than current size (100TB/human)
Number of genomes: 200,000
- Size of total data is 20,000 times bigger than current size (2,000PB)
Required computational power
- When analyzing in three years, 190 genomes should be processed a day.
  - 455 seconds/genome
- When analyzing in five years, 110 genomes should be processed a day.
  - 785 seconds/genome
- The analyses can be performed in multiple sites in parallel.

Changes from original NGS Analyzer

Remove unnecessary sub-programs
Remove unnecessary codes
- Reduce command line options
- Remove comments and never-executed codes
remove shellscripts and configuration files used when running the original NGS Analyzer on K computer
Implement scripts for executing the genome analysis workflow
- a script for running the whole workflow with multiple MPI processes
- scripts for individually running major steps of the workflow

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
SNP_indel_caller		SNP_indel_caller
bwa-0.5.9rc1_kei		bwa-0.5.9rc1_kei
input		input
k_job_script		k_job_script
model		model
samtools-0.1.8_kei		samtools-0.1.8_kei
script		script
splitSam2Contig2		splitSam2Contig2
workflow		workflow
AUTHORS		AUTHORS
COPYING		COPYING
README.md		README.md
README_DATA.md		README_DATA.md
README_DATA_jp.md		README_DATA_jp.md
README_jp.md		README_jp.md
makefile.k		makefile.k
makefile.x86_64_gcc		makefile.x86_64_gcc

License

fiber-miniapp/ngsa-mini

Folders and files

Latest commit

History

Repository files navigation

NGS Analyzer-MINI

About NGS Analyzer

Install

Compile

Prepare data

What NGS Analyzer-MINI does - brief explanation

How to Run the program

Running the whole workflow

How to run

Running each step of the workflow

Mapping

File IO

How to run

Remove duplicates

File IO

How to run

Detect mutations

File IO

How to run

Estimate the performance of large scale analysis

Obtain profiles

Estimate the performance of large scale analysis

Estimate the required performance when maximum elapse time is given

Estimate the required performance for staging when maximum staging time is given

Target exa-scale problem setting

Changes from original NGS Analyzer

About

Resources

License

Stars

Watchers

Forks

Languages