Skip to content

mrG7/HipSTR

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

HipSTR Build Status

Haplotype-based imputation, phasing and genotyping of STRs

Author: Thomas Willems twillems@mit.edu

License: GNU v2

Introduction

Short tandem repeats (STRs) are highly repetitive genomic sequences comprised of repeated copies of an underlying motif. Prevalent in most organisms' genomes, STRs are of particular interest because they mutate much more rapidly than most other genomic elements. As a result, they're extremely informative for genomic identification, ancestry inference and genealogy.

Despite their utility, STRs are particularly difficult to genotype. The repetitive sequence responsible for their high mutability also results in frequent alignment errors that can complicate and bias downstream analyses. In addition, PCR stutter errors often result in reads that contain additional or fewer repeat copies than the true underlying genotype.

HipSTR was specifically developed to deal with these errors in the hopes of obtaining more robust STR genotypes. In particular, it accomplishes this by:

  1. Learning locus-specific PCR stutter models using an [EM algorithm] (http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm)
  2. Mining candidate STR alleles from population-scale sequencing data
  3. Utilizing phased SNP haplotypes to genotype, phase and/or impute STRs
  4. Employing a specialized hidden Markov model to align reads to candidate sequences while accounting for stutter

Installation

HipSTR requires a standard c++ compiler, CMake as well as Java version 1.7 or later. To obtain HipSTR and all of its associated submodules, use:

% git clone --recursive https://github.com/tfwillems/HipSTR.git

To build, use Make:

% cd HipSTR
% make

On Mac, before running Make, change the line in vcflib/smithwaterman/Makefile from LDFLAGS=-Wl,-s to LDFLAGS=-Wl

Quick Start

To run HipSTR in its most broadly applicable mode, run it on all samples concurrently using the syntax:

./HipSTR --bams          run1.bam,run2.bam,run3.bam,run4.bam
         --fasta         /data/
         --regions       str_regions.bed
         --stutter-out   stutter_models.txt
         --str-vcf       str_calls.vcf.gz
  • --bam : a comma-separated list of BAM files generated by any indel-sensitive aligner (e.g. lobSTR or BWA-MEM) and sorted and indexed using samtools
  • --regions : a BED file containing the coordinates for each STR region of interest
  • --fasta : the directory containing [FASTA files] (https://en.wikipedia.org/wiki/FASTA_format) for each chromosome in the BED file. In the above usage example, if str_regions.bed contains chr1, chr2, and chr10, the corresponding files would be /data/chr1.fa, /data/chr2.fa and /data/chr10.fa. Alternatively, you can supply the path for a single FASTA file containing all of the relevant sequences

For each region in str_regions.bed, HipSTR will:

  1. Learn locus-specific stutter models and output them to stutter_models.txt
  2. Use the stutter model and haplotype-based alignment algorithm to genotype each individual
  3. Output the resulting STR genotypes to str_calls.vcf.gz, a [bgzipped] (http://www.htslib.org/doc/tabix.html) VCF file. This VCF will contain calls for each sample in any of the BAM files' read groups.

In-depth usage

HipSTR has a variety of usage options designed to accomodate scenarios in which the sequencing data varies in terms of the number of samples and the coverage. Most scenarios will fall into one of the following categories:

  1. 200 or more low-coverage (~5x) samples
  2. 50 or more high-coverage (~30x) samples
  3. Handful of low-coverage (~5x) samples
  4. Handful of high-coverage (~30x) samples

1. De novo stutter estimation + STR calling with de novo allele generation

This mode is identical to the one suggested in the Quick Start section as it suits most applications. HipSTR will output the learned stutter models to stutter_models.txt and the STR genotypes in bgzipped VCF format to str_calls.vcf.gz

./HipSTR --bams             run1.bam,run2.bam,run3.bam,run4.bam
         --fasta            /data/
         --regions          str_regions.bed
         --stutter-out      stutter_models.txt
         --str-vcf          str_calls.vcf.gz

2. External stutter models + STR calling with de novo allele generation

The sole difference in this mode is that we no longer output stutter models using the --stutter-out option and instead input them from a file using the --stutter-in file. For more details on the stutter model file format, see below. For humans, we've provided a file containing stutter models for each STR locus under PCR or PCR-free conditions at FILL ME IN

./HipSTR --bams             run1.bam,run2.bam,run3.bam,run4.bam
         --fasta            /data/
         --regions          str_regions.bed
         --stutter-in       ext_stutter_models.txt
         --str-vcf          str_calls.vcf.gz

3. External stutter models + STR calling with a reference panel

This mode is very similar to mode #2, except that we provide an additional VCF file containing known STR genotypes at each locus using the --str-vcf option. HipSTR will not identify any additional candidate STR alleles in the BAMs when this option is specified, so it's best to use a VCF that contains STR genotypes for a wide range of populations and individuals. For humans, we've provided such a file based on the 1000 Genomes Project at FILL ME IN

./HipSTR --bams             run1.bam,run2.bam,run3.bam,run4.bam
         --fasta            /data/
         --regions          str_regions.bed
         --stutter-in       ext_stutter_models.txt
         --ref-vcf          ref_strs.vcf.gz
         --str-vcf          str_calls.vcf.gz

STR imputation

Speed

HipSTR doesn't currently have multi-threaded support, but there are several options available to accelerate analyses:

  1. Analyze each chromosome in parallel using the --chrom option. For example, --chrom chr2 will only genotype the BED regions on chr2
  2. Utilize the length-based genotyper instead of the sequence-based genotyper using the --len-genotyper flag. This will vastly accelerate analyses, but it only considers the sizes of indels in reads independently. As a result, it is incapable of determining the exact sequence of an STR and is susceptible to alignment errors
  3. If you have hundreds of BAM files, we recommend that you merge them into a more manageable number (10-100) using the samtools merge command. Large numbers of BAMs can lead to slow disk IO and poor performance

Call Filtering

Additional Usage Options

Option Description
--viz-out <aln_viz.html.gz> Output a bgzipped file containing Needleman-Wunsch alignments
for each locus. The resulting file can be readily visualized with VizAln
--haploid-chrs <list_of_chroms> Comma separated list of chromosomes to treat as haploid
By default, all chromosomes are treated as diploid
--no-rmdup Don't remove PCR duplicates. By default, they'll be removed
--snp-vcf <phased_snps.vcf.gz> Bgzipped VCF file containing phased SNP genotypes for samples
that are being genotyped. These SNPs will be used to physically
phase any STRs when a read or its mate pair overlaps a heterozygous site
Always use this option if you have available phased SNP genotypes
--bam-samps <list_of_read_groups> Comma separated list of samples in same order as BAM files.
Assign each read to the sample corresponding to its file. By default,
each read must have an RG flag and the associated sample is used instead
--bam-lbs <list_of_read_groups> Comma separated list of libraries in same order as BAM files.
Assign each read to the library corresponding to its file. By default,
each read must have an RG flag and the associated library is used instead
NOTE: This option is required when --bam-samps has been specified

This list is comprised of the most useful and frequently used additional options, but is not all encompassing. For a complete list of options, please type either ./HipSTR or ./HipSTR --help

Alignment Visualization

When deciphering and inspecting STR calls, it's extremely useful to visualize the supporting reads. HipSTR facilitates this through the --viz-out option, which writes a bgzipped file containing alignments for each call that can be readily visualized using the VizAln command included in HipSTR main directory. If you're interested in visualizing alignments, you first need to index the file using tabix. For example, if you ran HipSTR with the option --viz-out alns.html.gz, you should use the command

tabix -p bed aln.html.gz

to generate a [tabix] (http://www.htslib.org/doc/tabix.html) index for the file so that we can rapidly extract alignments for a locus of interest. This command only needs to be run once after the file has been generated.

You could then visualize the calls for sample ERR218433 at locus chr1 51639636 using the command

./VizAln aln.html.gz chr1 51639636 ERR218433

This command will automatically open a rendering of the alignments in your browser and might look something like: Read more words! The top bar represents the reference sequence and the red text indicates the name of the sample and its associated call at the locus. The remaining rows indicate the alignment for each read used in genotyping. In this particular example, 7 reads have a 4bp deletion, 9 reads have a 8bp deletion and 1 read has a 12bp deletion. The solitary 12bp deletion is likely the result of PCR stutter and HipSTR therefore genotypes this sample as -4 | -8

If we wanted to inspect all calls for the same locus, we could use the command

./VizAln aln.html.gz chr1 51639636

NOTE: Because the --viz-out file can become fairly large if you're genotyping thousands of loci or thousands of samples, in some scenarios it may be best to rerun HipSTR using this option on the subset of loci in which you're interested.

File Formats

BAM files

HipSTR requires BAM files produced by any indel-sensitive aligner. These files must have been sorted by position using the samtools sort command and then indexed using samtools index. To associate a read with its sample of interest, HipSTR uses read group information in the BAM header lines. These @RG lines must contain an ID field, an LB field indicating the library and an SM field indicating the sample. For example, if a BAM contained the following header line

@RG ID:lobSTR;RUN1 LB:ERR044603 SM:HG01914

an alignment with the RG tag

RG:Z:lobSTR;RUN1

will be associated with sample HG01914 and library ERR044603. In this manner, HipSTR can analyze BAMs containing more than one sample and/or more than one library and can handle BAMs in which a single sample's reads are spread across multiple files.

Alternatively, if your BAM files lack RG information, you can use the --bam-samps and -bam-lbs flags to specify the sample and library associated with each BAM. In this setting, however, a BAM can only contain a single library and a single read group. For example, the command

./HipSTR --bams             run1.bam,run2.bam,run3.bam,run4.bam
         --fasta            /data/
         --regions          str_regions.bed
         --stutter-out      stutter_models.txt
         --str-vcf          str_calls.vcf.gz
         --bam-samps        SAMPLE1,SAMPLE1,SAMPLE2,SAMPLE3
         --bam-lbs          LIB1,LIB2,LIB3,LIB4

essentially tells HipSTR to associate all the reads in the first two BAMS with SAMPLE1, all the reads in the third file with SAMPLE2 and all the reads in the last BAM with SAMPLE3.

STR region BED file

The BED file containing each STR region of interest is a tab-delimited file comprised of 5 required columns and one optional column:

  1. The name of the chromosome on which the STR is located
  2. The start position of the STR on its chromosome
  3. The end position of the STR on its chromosome
  4. The motif length (i.e. the number of bases in the repeat unit)
  5. The number of copies of the repeat unit in the reference allele

The 6th column is optional and contains the name of the STR locus, which will be written to the ID column in the VCF. Below is an example file which contains 5 STR loci

NOTE: The table header is for descriptive purposes. The BED file should not have a header

CHROM START END MOTIF_LEN NUM_COPIES NAME
chr1 13784267 13784306 4 10 GATA27E01
chr1 18789523 18789555 3 11 ATA008
chr2 32079410 32079469 4 15 AGAT117
chr17 38994441 38994492 4 12 GATA25A04
chr17 5529940 55299992 4 13 AAT245

For humans, we've provided a BED file containing STR loci for hg19 at FILL ME IN

For other model organisms, we recommend that you

  1. Use Tandem Repeats Finder or other tools to scan the reference genome for STRs
  2. Reformat the resulting output to conform with the format outlined above

VCF file

For more information on the VCF file format, please see the VCF spec. For filtering and parsing VCFs, we recommend the fantastic python package PyVCF

INFO fields

INFO fields contains statistics about each genotyped locus in the VCF. The INFO fields reported by HipSTR primarily describe the learned/supplied stutter model for the locus and its reference coordinates and sequence characteristics.

FIELD DESCRIPTION
INFRAME_PGEOM Parameter for in-frame geometric step size distribution
INFRAME_UP Probability that stutter causes an in-frame increase in obs. STR size
INFRAME_DOWN Probability that stutter causes an in-frame decrease in obs. STR size
OUTFRAME_PGEOM Parameter for out-of-frame geometric step size distribution
OUTFRAME_UP Probability that stutter causes an out-of-frame increase in obs. STR size
OUTFRAME_DOWN Probability that stutter causes an out-of-frame decrease in obs. STR size
BPDIFFS Base pair difference of each alternate allele from the reference allele
START Inclusive start coodinate for the repetitive portion of the reference allele
END Inclusive end coordinate for the repetitive portion of the reference allele
PERIOD Length of STR motif
AC Alternate allele counts
NSKIP Number of samples not genotyped due to various issues

FORMAT fields

FORMAT fields contain information about the genotype for each sample at the locus. In addition to the most probable phased genotype (GT), HipSTR reports information about the posterior likelihood of this genotype (PQ) and its unphased analog (Q). HipSTR also reports the expected base pair difference from the reference for the genotype by marginalizing over all genotype probabilities (BPDOSE), a value that may be useful in association studies.

FIELD DESCRIPTION
GT Genotype
GB Base pair differences of genotype from reference
Q Posterior probability of unphased genotype
PQ Posterior probability of phased genotype
DP Read depth
DSNP Number of reads with SNP phasing information
PDP Fractional reads supporting each haploid genotype
BPDOSE Posterior mean base pair difference from reference
ALLREADS Base pair difference observed in each read
PALLREADS Expected base pair diff in each read based on haplotype alignment probs
GL log-10 genotype likelihoods
PL Phred-scaled genotype likelihoods

Stutter model

To model PCR stutter artifacts, we assume that there are three types of stutter artifacts:

  1. In-frame changes: Change the size of the STR in the read by multiples of the repeat unit. For instance, if the repeat motif is AGAT, in-frame changes could lead to differences of -8, -4, 4, 8, and so on.
  2. Out-of-frame changes: Change the size of the STR by non-multiples of the repeat unit. For instance, if the repeat motif is AGAT, out-of-frame changes could lead to differences of -3, -2, -1, 1, 2, 3 and so on.
  3. No stutter change: The size of the STR in the read is the same as the size of the underlying STR.

Stutter model files contain the information necessary to model each of these artifacts in a tab-delimited BED-like format with exactly 9 columns. An example of such a file is as follows:

CHROM START END IGEOM IDOWN IUP OGEOM ODOWN OUP
chr1 13784267 13784306 0.95 0.05 0.01 0.9 0.01 0.001
chr1 18789523 18789555 0.8 0.01 0.05 0.9 0.001 0.001
chr2 32079410 32079469 0.9 0.01 0.01 0.9 0.001 0.001
chr17 38994441 38994492 0.9 0.001 0.001 0.9 0.001 0.001
chr17 5529940 55299992 0.95 0.01 0.01 0.9 0.001 0.001

NOTE: The table header is for descriptive purposes. The stutter file should not have a header

Each of the stutter parameters is defined as follows:

VARIABLE DESCRIPTION
IDOWN Probability that in-frame changes decrease the size of the observed STR allele
IUP Probability that in-frame changes increase the size of the observed STR allele
ODOWN Probability that out-of-frame changes decrease the size of the observed STR allele
OUP Probability that out-of-frame changes increase the size of the observed STR allele
IGEOM Parameter governing geometric step size distribution for in-frame changes
OGEOM Paramter governing geometric step size distribution for out-of-frame changes

About

Genotype, phase and impute short tandem repeats using whole-genome sequencing data

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • C++ 89.9%
  • Python 8.3%
  • Makefile 1.2%
  • Shell 0.6%