Skip to content
forked from fgvieira/ngsF

Estimation of per-individual inbreeding coefficients under a probabilistic framework

License

Notifications You must be signed in to change notification settings

mfumagalli/ngsF

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

9 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

# ngsF

ngsF is a program to estimate per-individual inbreeding coefficients, under a probabilistic framework that takes the uncertainty of genotype's assignation into account.



## Compile
To compile entire package just run the Makefile:

% make



## Run ngsF

% ./ngsF [options] -n_ind INT -glf /path/to/glf/file -out /path/to/output/file



### Options

-glf		FILE	Input GL file. Format as 3*n_ind*n_sites doubles in binary.
			It can be uncompressed [default] or in BGZIP format. If "-" 
			reads uncompressed stream from STDIN.

-out		FILE	Output file name.

-n_ind		INT	Sample size (number of individuals).

-n_sites	INT	Total number of sites.

-chunk_size	INT	Size of each analysis chunk. [default: 10000]

-call_geno		Assume called genotypes.

-approx_EM		Use the faster approximated EM ML algorithm

-fast_lkl		Enables an intra-EM LogLkl calculation.
			NOTE: For debugging use only! LogLkl is calculated while 
			estimating the next iteration parameters, speeding up 
			the computation (no need for between-iteration de-novo LKL 
			calculation). This LKL will reflect the previous iteration 
			values, and will miss all skipped sites (f == 0).

-init_values    CHAR    Initial values of individual F and site frequency:
 		 or	 r:     random in (0,1)
		FILE	 e:     estimated (not recommended for low coverage data)
			 u:	uniform = 0.01 [default]
			 FILE:  read from FILE (n_ind + n_sites doubles)

-max_iters	INT	Maximum number of EM iterations. [default: 1500]

-min_epsilon	FLOAT	Maximum parameters RMSD between iterations to assume 
			convergence. [default: 1e-5]

-n_threads	INT	Number of threads to use. [default: 1]

-version		Prints program version and exits.

-quick			Quick run. Only computes initial "freq" and "indF" values.
			NOTE: For debugging use only! Do not use in any analysis.

-verbose	INT	Selects verbosity level. [default: 1]
			NOTE: Values more than 4 print large amounts of data.
			For debugging use only!



### Input data
As input ngsF needs a Genotype Likelihood (GL) file, formatted as 3*n_ind*n_sites doubles in binary. Currently, all sites in the file must be variable, so a previous SNP calling step is needed.



### Stopping Criteria
An issue on iterative algorithms is the stopping criteria. ngsF implements a dual condition threshold: relative difference in log-likelihood and estimates RMSD (F and freq). As for which threshold to use, simulations show that 1e-5 seems to be a reasonable value. However, if you're dealing with low coverage data (2x-3x), it might be worth to use lower thresholds (between 1e-6 and 1e-9).



### Example of genotype calling analysis on inbred samples

- Call SNPs (p-value < 1e-4) and get genotype likelihoods

% ./angsd -bam bam_files.list -only_proper_pairs 0 -minMapQ 0 -minQ 20 -baq 1 -C 50 -ref ref_seq.fas -out input -GL 1 -doMajorMinor 1 -doGlf 3 -doPost 1 -doMaf 2 -doSNP 1 -minLRT 15.1366 -doZ 1


- Infer per-individual inbreeding coefficients

  1) Using true EM algorithm only

  % ./ngsF -n_ind INT -glf input.glf -out output_2.indF


  2) Using approximated EM step. This method was developed to speed up analysis by providing more accurate initial values

  % ./ngsF -n_ind INT -glf input.glf -out output_1.indF -approx_EM

  % ./ngsF -n_ind INT -glf input.glf -out output_2.indF -init_values output_1.indF.pars


- Call genotypes including per-individual inbreeding coefficients in the priors

% ./angsd_inbreeding -bam bam_files.list -only_proper_pairs 0 -minMapQ 0 -minQ 20 -baq 1 -C 50 -ref ref_seq.fas -indF output_2.indF -out out -GL 1 -doMajorMinor 1 -doMaf 2 -doPost 1 -doGeno 2 -doZ 1




## Hints
- Dataset: as a rule of thumb, use at least 1000 high confidence independent SNP sites.

- Low coverage data: since the initial estimates are not reliable, it is recommended to use random starting points and more strict stopping criteria (eg. -init_values r -min_epsilon 1e-9).

- High coverage data: although F is not really useful in the prior, it seems lower initial values perform better (-init_values u).

- Memory Usage: By default ngsF loads the entire file into memory. However, if the file is too big and not enough memory is available, ngsF can also load chunks as they are needed. This is implemented on the BGZF library (from SAMTOOLS package), which allows for fast random access to BGZIP compressed files through an internal virtual index. This library can only deal with BGZIP files but a binary to compress them is provided.
If you want to use this library just add -D_USE_BGZF to the FLAGS on the Makefile.

About

Estimation of per-individual inbreeding coefficients under a probabilistic framework

Resources

License

Stars

Watchers

Forks

Packages

No packages published