GitHub - mfumagalli/ngsF: Estimation of per-individual inbreeding coefficients under a probabilistic framework

mfumagalli / ngsF Public

forked from fgvieira/ngsF

Notifications You must be signed in to change notification settings
Fork 0
Star 0

Estimation of per-individual inbreeding coefficients under a probabilistic framework

View license

0 stars 8 forks Branches Tags Activity

Star

Notifications

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
bgzf		bgzf
examples		examples
EM.cpp		EM.cpp
LICENSE		LICENSE
Makefile		Makefile
README		README
ngsF.cpp		ngsF.cpp
parse_args.cpp		parse_args.cpp
read_data.cpp		read_data.cpp
shared.cpp		shared.cpp
shared.h		shared.h

Repository files navigation

# ngsF

ngsF is a program to estimate per-individual inbreeding coefficients, under a probabilistic framework that takes the uncertainty of genotype's assignation into account.



## Compile
To compile entire package just run the Makefile:

% make



## Run ngsF

% ./ngsF [options] -n_ind INT -glf /path/to/glf/file -out /path/to/output/file



### Options

-glf		FILE	Input GL file. Format as 3*n_ind*n_sites doubles in binary.
			It can be uncompressed [default] or in BGZIP format. If "-" 
			reads uncompressed stream from STDIN.

-out		FILE	Output file name.

-n_ind		INT	Sample size (number of individuals).

-n_sites	INT	Total number of sites.

-chunk_size	INT	Size of each analysis chunk. [default: 10000]

-call_geno		Assume called genotypes.

-approx_EM		Use the faster approximated EM ML algorithm

-fast_lkl		Enables an intra-EM LogLkl calculation.
			NOTE: For debugging use only! LogLkl is calculated while 
			estimating the next iteration parameters, speeding up 
			the computation (no need for between-iteration de-novo LKL 
			calculation). This LKL will reflect the previous iteration 
			values, and will miss all skipped sites (f == 0).

-init_values    CHAR    Initial values of individual F and site frequency:
 		 or	 r:     random in (0,1)
		FILE	 e:     estimated (not recommended for low coverage data)
			 u:	uniform = 0.01 [default]
			 FILE:  read from FILE (n_ind + n_sites doubles)

-max_iters	INT	Maximum number of EM iterations. [default: 1500]

-min_epsilon	FLOAT	Maximum parameters RMSD between iterations to assume 
			convergence. [default: 1e-5]

-n_threads	INT	Number of threads to use. [default: 1]

-version		Prints program version and exits.

-quick			Quick run. Only computes initial "freq" and "indF" values.
			NOTE: For debugging use only! Do not use in any analysis.

-verbose	INT	Selects verbosity level. [default: 1]
			NOTE: Values more than 4 print large amounts of data.
			For debugging use only!



### Input data
As input ngsF needs a Genotype Likelihood (GL) file, formatted as 3*n_ind*n_sites doubles in binary. Currently, all sites in the file must be variable, so a previous SNP calling step is needed.



### Stopping Criteria
An issue on iterative algorithms is the stopping criteria. ngsF implements a dual condition threshold: relative difference in log-likelihood and estimates RMSD (F and freq). As for which threshold to use, simulations show that 1e-5 seems to be a reasonable value. However, if you're dealing with low coverage data (2x-3x), it might be worth to use lower thresholds (between 1e-6 and 1e-9).



### Example of genotype calling analysis on inbred samples

- Call SNPs (p-value < 1e-4) and get genotype likelihoods

% ./angsd -bam bam_files.list -only_proper_pairs 0 -minMapQ 0 -minQ 20 -baq 1 -C 50 -ref ref_seq.fas -out input -GL 1 -doMajorMinor 1 -doGlf 3 -doPost 1 -doMaf 2 -doSNP 1 -minLRT 15.1366 -doZ 1


- Infer per-individual inbreeding coefficients

  1) Using true EM algorithm only

  % ./ngsF -n_ind INT -glf input.glf -out output_2.indF


  2) Using approximated EM step. This method was developed to speed up analysis by providing more accurate initial values

  % ./ngsF -n_ind INT -glf input.glf -out output_1.indF -approx_EM

  % ./ngsF -n_ind INT -glf input.glf -out output_2.indF -init_values output_1.indF.pars


- Call genotypes including per-individual inbreeding coefficients in the priors

% ./angsd_inbreeding -bam bam_files.list -only_proper_pairs 0 -minMapQ 0 -minQ 20 -baq 1 -C 50 -ref ref_seq.fas -indF output_2.indF -out out -GL 1 -doMajorMinor 1 -doMaf 2 -doPost 1 -doGeno 2 -doZ 1




## Hints
- Dataset: as a rule of thumb, use at least 1000 high confidence independent SNP sites.

- Low coverage data: since the initial estimates are not reliable, it is recommended to use random starting points and more strict stopping criteria (eg. -init_values r -min_epsilon 1e-9).

- High coverage data: although F is not really useful in the prior, it seems lower initial values perform better (-init_values u).

- Memory Usage: By default ngsF loads the entire file into memory. However, if the file is too big and not enough memory is available, ngsF can also load chunks as they are needed. This is implemented on the BGZF library (from SAMTOOLS package), which allows for fast random access to BGZIP compressed files through an internal virtual index. This library can only deal with BGZIP files but a binary to compress them is provided.
If you want to use this library just add -D_USE_BGZF to the FLAGS on the Makefile.