vg

variation graph data structures, interchange formats, alignment, genotyping, and variant calling methods

Variation graphs provide a succinct encoding of the sequences of many genomes. A variation graph (in particular as implemented in vg) is composed of:

nodes, which are labeled by sequences and ids
edges, which connect two nodes via either of their respective ends
paths, which describe genomes, sequence alignments, and annotations (such as gene models and transcripts) as walks through nodes connected by edges

This model is similar to a number of sequence graphs that have been used in assembly and multiple sequence alignment. Paths provide coordinate systems relative to genomes encoded in the graph, allowing stable mappings to be produced even if the structure of the graph is changed. For visual documentation, please refer to a presentation on the topic: Resequencing against a human whole genome variation graph (April 14, 2015).
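The node, edge, and path model above can be sketched in a few lines of Python. This is a toy illustration, not vg's implementation: the names `ToyGraph`, `add_path`, `path_sequence`, and `path_offsets` are hypothetical, and edges here only join the end of one node to the start of the next, whereas vg edges may attach to either end of a node.

```python
from dataclasses import dataclass, field


@dataclass
class ToyGraph:
    """Hypothetical toy model of a variation graph (forward strand only)."""
    nodes: dict = field(default_factory=dict)  # node id -> sequence label
    edges: set = field(default_factory=set)    # (from_id, to_id) pairs
    paths: dict = field(default_factory=dict)  # path name -> list of node ids

    def add_node(self, node_id, sequence):
        self.nodes[node_id] = sequence

    def add_edge(self, from_id, to_id):
        self.edges.add((from_id, to_id))

    def add_path(self, name, node_ids):
        # A path is a walk: each consecutive pair of nodes must share an edge.
        for a, b in zip(node_ids, node_ids[1:]):
            if (a, b) not in self.edges:
                raise ValueError(f"no edge from {a} to {b}")
        self.paths[name] = list(node_ids)

    def path_sequence(self, name):
        # Spell out the sequence a path embeds in the graph.
        return "".join(self.nodes[i] for i in self.paths[name])

    def path_offsets(self, name):
        # Path-relative coordinates: start offset of each node along the path.
        offsets, pos = {}, 0
        for i in self.paths[name]:
            offsets[i] = pos
            pos += len(self.nodes[i])
        return offsets


# A single SNP (T/C) becomes a "bubble" between two shared flanking nodes.
toy = ToyGraph()
for node_id, seq in [(1, "ACG"), (2, "T"), (3, "C"), (4, "GGA")]:
    toy.add_node(node_id, seq)
for edge in [(1, 2), (1, 3), (2, 4), (3, 4)]:
    toy.add_edge(*edge)
toy.add_path("ref", [1, 2, 4])
toy.add_path("alt", [1, 3, 4])

print(toy.path_sequence("ref"))  # ACGTGGA
print(toy.path_sequence("alt"))  # ACGCGGA
print(toy.path_offsets("ref"))   # {1: 0, 2: 3, 4: 4}
```

Because positions are expressed relative to a named path rather than to node ids, a coordinate such as "offset 4 on ref" keeps its meaning if nodes are later split or renumbered, as long as the path is updated along with them.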

Usage

Building

You'll need the protobuf and jansson development libraries installed on your system.

sudo apt-get install protobuf-compiler libprotoc-dev libjansson-dev automake libtool

You can also run make get-deps.

Other libraries may be required. Please report any build difficulties.

Now, obtain the repo and its submodules:

git clone --recursive https://github.com/ekg/vg.git

Then build with make, and run with ./vg.

Variation graph construction

The simplest thing to do with vg is to build a graph and align to it. At present, you'll want to use a reference and VCF file to do so. If you're working in the test/ directory:

vg construct -r small/x.fa -v small/x.vcf.gz >x.vg

Viewing, conversion

vg view provides a way to convert the graph into various formats:

# GFA output
vg view x.vg >x.gfa

# dot output suitable for graphviz
vg view -d x.vg >x.dot

# json version of binary alignments
vg view -a x.gam >x.json
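To give a feel for what a GFA rendering of a graph contains, here is a minimal Python sketch that writes a toy graph as GFA 1 style lines (H header, S segments, L links, P paths). The function `to_gfa` and its inputs are hypothetical, all orientations are assumed to be forward, and the output is not claimed to match what vg view emits byte for byte.

```python
def to_gfa(nodes, edges, paths):
    """Render a toy graph as GFA 1 style text.

    nodes: dict of node id -> sequence
    edges: iterable of (from_id, to_id), all assumed forward-to-forward
    paths: dict of path name -> list of node ids
    """
    lines = ["H\tVN:Z:1.0"]
    for node_id in sorted(nodes):
        lines.append(f"S\t{node_id}\t{nodes[node_id]}")
    for from_id, to_id in sorted(edges):
        # "0M" declares that adjacent segments do not overlap.
        lines.append(f"L\t{from_id}\t+\t{to_id}\t+\t0M")
    for name in sorted(paths):
        walk = ",".join(f"{node_id}+" for node_id in paths[name])
        lines.append(f"P\t{name}\t{walk}\t*")
    return "\n".join(lines) + "\n"


gfa_text = to_gfa(
    nodes={1: "ACG", 2: "T", 3: "C", 4: "GGA"},
    edges=[(1, 2), (1, 3), (2, 4), (3, 4)],
    paths={"ref": [1, 2, 4]},
)
print(gfa_text, end="")
```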

Alignment

As this is a small graph, you could align to it using a full-length partial order alignment:

vg align -s CTACTGACAGCAGAAGTTTGCTGTGAAGATTAAATTAGGTGATGCTTG x.vg

Note that you don't have to store the graph on disk at all; you can simply pipe it into the local aligner:

vg construct -r small/x.fa -v small/x.vcf.gz | vg align -s CTACTGACAGCAGAAGTTTGCTGTGAAGATTAAATTAGGTGATGCTTG -

Most commands allow the streaming of graphs into and out of vg.

Mapping

If your graph is large, you will want to use vg index to store the graph and vg map to align reads. vg map implements a kmer-based seed-and-extend alignment model similar to that used in aligners like novoalign or MOSAIK. First, an on-disk index is built with vg index; it includes the graph itself and kmers of a particular size. When mapping, any kmer size shorter than that used in the index can be employed, and by default the mapper will decrease the kmer size to increase sensitivity when alignment at a particular k fails.

# construct the graph
vg construct -r small/x.fa -v small/x.vcf.gz >x.vg

# store the graph in the index, and also index the kmers in the graph of size 11
# you can provide a list of .vg files on the command line, which is useful if you
# have constructed a graph for each chromosome in a large reference
vg index -s -k 11 x.vg

# align a read to the indexed version of the graph
# note that the graph file is not opened, but x.vg.index is assumed
vg map -s CTACTGACAGCAGAAGTTTGCTGTGAAGATTAAATTAGGTGATGCTTG x.vg >read.gam

# simulate a bunch of 150bp reads from the graph and map them
vg map -r <(vg sim -n 1000 -l 150 x.vg) x.vg >aln.gam

# surject the alignments back into the reference space of sequence "x", yielding a BAM file
vg surject -p x -b aln.gam >aln.bam
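The seeding idea behind vg map (index kmers of one size, query with that size or shorter, and fall back to a smaller k when nothing is found) can be sketched as follows. This is a toy under loud assumptions: it indexes a plain string rather than a graph, it treats "no exact seed hit" as the failure that triggers a smaller k, and it omits the extend step entirely. The names `build_kmer_index`, `lookup`, and `find_seeds`, and the `k_min` and `step` parameters, are hypothetical and are not vg's API.

```python
import bisect


def build_kmer_index(sequence, k):
    """Sorted (kmer, position) pairs; windows at the tail may be shorter than k."""
    return sorted((sequence[i:i + k], i) for i in range(len(sequence)))


def lookup(index, kmer):
    """Positions whose indexed kmer starts with `kmer`.

    Prefix matching is what lets a query shorter than the indexed k
    reuse the same index.
    """
    start = bisect.bisect_left(index, (kmer,))
    hits = []
    for key, pos in index[start:]:
        if not key.startswith(kmer):
            break
        hits.append(pos)
    return hits


def find_seeds(index, read, k_max, k_min=4, step=2):
    """Try exact kmer seeds at k_max, shrinking k until some seed is found."""
    k = k_max
    while k >= k_min:
        seeds = [
            (offset, pos)
            for offset in range(len(read) - k + 1)
            for pos in lookup(index, read[offset:offset + k])
        ]
        if seeds:
            return k, seeds
        k -= step  # smaller kmers tolerate more mismatches per read
    return None, []


reference = "ACGTTGCAAGGCTTAACCGGATCGATTACA"
kmer_index = build_kmer_index(reference, 11)

# reference[5:21] with one mismatch in the middle, so no 11-mer or 9-mer matches
read = "GCAAGGCGTAACCGGA"
k_used, seeds = find_seeds(kmer_index, read, k_max=11)
print(k_used, seeds)  # 7 [(0, 5), (8, 13), (9, 14)]
```

Each seed is a (read offset, reference position) pair. Here all three seeds lie on the same diagonal (position minus offset equals 5), which is the kind of agreement an extend step would use to choose where to attempt a full alignment.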

Command line interface

A variety of commands are available:

construct: graph construction
view: conversion (dot/protobuf/json/GFA)
index: index features of the graph in a disk-backed key/value store
find: use an index to find nodes, edges, kmers, or positions
paths: traverse paths in the graph
align: local alignment
map: global alignment (kmer-driven)
stats: metrics describing graph properties
join: combine graphs (parallel)
concat: combine graphs (serial)
ids: id manipulation
kmers: generate kmers from a graph
sim: simulate reads by walking paths in the graph
mod: various transformations of the graph
surject: force graph alignments into a linear reference space

Implementation notes

vg is based around a graph object (vg::VG) which has a native serialized representation that is almost identical on disk and in-memory, with the exception of adjacency indexes that are built when the object is parsed from a stream or file. These graph objects are the results of queries of larger indexes, or manipulation (for example joins or concatenations) of other graphs. vg is designed for interactive, stream-oriented use. You can, for instance, construct a graph, merge it with another one, and pipe the result into a local alignment process. The graph object can be stored in an index (vg::Index), aligned against directly (vg::GSSWAligner), or "mapped" against in a global sense (vg::Mapper), using an index of kmers.

Once constructed, a variation graph (.vg is the suggested file extension) is typically around the same size as the reference (FASTA) and uncompressed variant set (VCF) which were used to build it. The index, however, may be much larger, perhaps more than an order of magnitude. This is less of a concern as it is not loaded into memory, but could be a pain point as vg is scaled up to whole-genome mapping.

The serialization of very large graphs (>62MB) is enabled by the use of protocol buffer ZeroCopyStreams. Graphs are decomposed into sets of N (presently 10k) nodes, and these are written, with their edges, into graph objects that can be streamed into and out of vg. Graphs of unbounded size are possible using this approach.
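The chunked layout described above can be illustrated with a small Python sketch that splits a graph into fixed-size groups of nodes and reassembles it from the stream of chunks. It assumes, for illustration only, that an edge is written with every chunk holding one of its endpoints; the source states only that nodes are written "with their edges". The names `chunk_graph` and `merge_chunks` are hypothetical, and no protocol buffer encoding is performed.

```python
def chunk_graph(nodes, edges, chunk_size):
    """Yield (nodes, edges) sub-graphs holding at most chunk_size nodes each.

    Assumption: an edge travels with any chunk that contains one of its
    endpoints, so edges that cross a chunk boundary are written twice.
    """
    node_ids = sorted(nodes)
    for start in range(0, len(node_ids), chunk_size):
        ids = set(node_ids[start:start + chunk_size])
        chunk_nodes = {i: nodes[i] for i in ids}
        chunk_edges = {e for e in edges if e[0] in ids or e[1] in ids}
        yield chunk_nodes, chunk_edges


def merge_chunks(chunks):
    """Rebuild the whole graph by taking the union of streamed chunks."""
    nodes, edges = {}, set()
    for chunk_nodes, chunk_edges in chunks:
        nodes.update(chunk_nodes)
        edges |= chunk_edges  # set union removes the duplicated boundary edges
    return nodes, edges


# A linear chain of 7 single-base nodes, streamed 3 nodes at a time.
chain_nodes = {i: "ACGTACG"[i - 1] for i in range(1, 8)}
chain_edges = {(i, i + 1) for i in range(1, 7)}

chunks = list(chunk_graph(chain_nodes, chain_edges, chunk_size=3))
rebuilt_nodes, rebuilt_edges = merge_chunks(chunks)
print(len(chunks))  # 3
print(rebuilt_nodes == chain_nodes and rebuilt_edges == chain_edges)  # True
```

Because each chunk is bounded in size and a consumer only needs to hold one chunk at a time while merging, the total graph size is not limited by any per-message size cap.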

Development

License

MIT

