forked from fgvieira/ngsF
-
Notifications
You must be signed in to change notification settings - Fork 0
Estimation of per-individual inbreeding coefficients under a probabilistic framework
License
mfumagalli/ngsF
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
# ngsF ngsF is a program to estimate per-individual inbreeding coefficients, under a probabilistic framework that takes the uncertainty of genotype's assignation into account. ## Compile To compile entire package just run the Makefile: % make ## Run ngsF % ./ngsF [options] -n_ind INT -glf /path/to/glf/file -out /path/to/output/file ### Options -glf FILE Input GL file. Format as 3*n_ind*n_sites doubles in binary. It can be uncompressed [default] or in BGZIP format. If "-" reads uncompressed stream from STDIN. -out FILE Output file name. -n_ind INT Sample size (number of individuals). -n_sites INT Total number of sites. -chunk_size INT Size of each analysis chunk. [default: 10000] -call_geno Assume called genotypes. -approx_EM Use the faster approximated EM ML algorithm -fast_lkl Enables an intra-EM LogLkl calculation. NOTE: For debugging use only! LogLkl is calculated while estimating the next iteration parameters, speeding up the computation (no need for between-iteration de-novo LKL calculation). This LKL will reflect the previous iteration values, and will miss all skipped sites (f == 0). -init_values CHAR Initial values of individual F and site frequency: or r: random in (0,1) FILE e: estimated (not recommended for low coverage data) u: uniform = 0.01 [default] FILE: read from FILE (n_ind + n_sites doubles) -max_iters INT Maximum number of EM iterations. [default: 1500] -min_epsilon FLOAT Maximum parameters RMSD between iterations to assume convergence. [default: 1e-5] -n_threads INT Number of threads to use. [default: 1] -version Prints program version and exits. -quick Quick run. Only computes initial "freq" and "indF" values. NOTE: For debugging use only! Do not use in any analysis. -verbose INT Selects verbosity level. [default: 1] NOTE: Values more than 4 print large amounts of data. For debugging use only! ### Input data As input ngsF needs a Genotype Likelihood (GL) file, formatted as 3*n_ind*n_sites doubles in binary. Currently, all sites in the file must be variable, so a previous SNP calling step is needed. ### Stopping Criteria An issue on iterative algorithms is the stopping criteria. ngsF implements a dual condition threshold: relative difference in log-likelihood and estimates RMSD (F and freq). As for which threshold to use, simulations show that 1e-5 seems to be a reasonable value. However, if you're dealing with low coverage data (2x-3x), it might be worth to use lower thresholds (between 1e-6 and 1e-9). ### Example of genotype calling analysis on inbred samples - Call SNPs (p-value < 1e-4) and get genotype likelihoods % ./angsd -bam bam_files.list -only_proper_pairs 0 -minMapQ 0 -minQ 20 -baq 1 -C 50 -ref ref_seq.fas -out input -GL 1 -doMajorMinor 1 -doGlf 3 -doPost 1 -doMaf 2 -doSNP 1 -minLRT 15.1366 -doZ 1 - Infer per-individual inbreeding coefficients 1) Using true EM algorithm only % ./ngsF -n_ind INT -glf input.glf -out output_2.indF 2) Using approximated EM step. This method was developed to speed up analysis by providing more accurate initial values % ./ngsF -n_ind INT -glf input.glf -out output_1.indF -approx_EM % ./ngsF -n_ind INT -glf input.glf -out output_2.indF -init_values output_1.indF.pars - Call genotypes including per-individual inbreeding coefficients in the priors % ./angsd_inbreeding -bam bam_files.list -only_proper_pairs 0 -minMapQ 0 -minQ 20 -baq 1 -C 50 -ref ref_seq.fas -indF output_2.indF -out out -GL 1 -doMajorMinor 1 -doMaf 2 -doPost 1 -doGeno 2 -doZ 1 ## Hints - Dataset: as a rule of thumb, use at least 1000 high confidence independent SNP sites. - Low coverage data: since the initial estimates are not reliable, it is recommended to use random starting points and more strict stopping criteria (eg. -init_values r -min_epsilon 1e-9). - High coverage data: although F is not really useful in the prior, it seems lower initial values perform better (-init_values u). - Memory Usage: By default ngsF loads the entire file into memory. However, if the file is too big and not enough memory is available, ngsF can also load chunks as they are needed. This is implemented on the BGZF library (from SAMTOOLS package), which allows for fast random access to BGZIP compressed files through an internal virtual index. This library can only deal with BGZIP files but a binary to compress them is provided. If you want to use this library just add -D_USE_BGZF to the FLAGS on the Makefile.
About
Estimation of per-individual inbreeding coefficients under a probabilistic framework
Resources
License
Stars
Watchers
Forks
Packages 0
No packages published