lh3/naivepca

NaivePCA performs PCA on population genotype data. It implements the basic algorithm described by Patterson et al. (2006). More precisely, suppose we have m samples of ploidy h and n biallelic markers. Let G_ij be the number of non-reference alleles for sample i at marker j. NaivePCA computes:

\mu_j  = \sum_{i=1}^m G_{ij} / m
p_j    = \mu_j / h
M_{ij} = \frac{G_{ij}-\mu_j}{\sqrt{p_j(1-p_j)}}
X_{ij} = \sum_{k=1}^n M_{ik} M_{jk} / n

and finds the eigenvectors of the m-by-m matrix X = (X_ij). Note that if G_ij is missing, M_ij is set to zero, and the computation of μ_j must be adjusted accordingly.
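The formulas above can be sketched in a few lines of NumPy. This is only an illustration, not NaivePCA's own code: the function name `naive_pca` is hypothetical, missing genotypes are encoded as NaN, μ_j is assumed to be averaged over the non-missing samples only, and markers with p_j equal to 0 or 1 are assumed to be skipped so that the division is defined.

```python
import numpy as np

def naive_pca(G, h=2):
    """Eigen-decompose X = M M^T / n for an (m, n) genotype matrix G.

    G holds non-reference allele counts; NaN marks missing data.
    Returns eigenvalues (descending) and the matching eigenvectors as columns.
    """
    G = np.asarray(G, dtype=float)
    observed = ~np.isnan(G)

    # Assumption: mu_j is the mean over non-missing samples only.
    counts = observed.sum(axis=0)
    sums = np.where(observed, G, 0.0).sum(axis=0)
    mu = sums / np.maximum(counts, 1)

    p = mu / h
    var = p * (1.0 - p)

    # Assumption: drop markers that are fully missing or monomorphic,
    # because p_j(1 - p_j) = 0 would make M_ij undefined.
    keep = (counts > 0) & (var > 0)
    if not keep.any():
        raise ValueError("no informative markers")

    # M_ij = (G_ij - mu_j) / sqrt(p_j (1 - p_j)), with missing entries set to 0.
    scaled = (G[:, keep] - mu[keep]) / np.sqrt(var[keep])
    M = np.where(observed[:, keep], scaled, 0.0)

    # X is m-by-m and symmetric, so eigh applies; it returns ascending order.
    X = M @ M.T / M.shape[1]
    evals, evecs = np.linalg.eigh(X)
    order = np.argsort(evals)[::-1]
    return evals[order], evecs[:, order]
```

Here n is taken to be the number of retained markers, which is a further assumption; it only rescales the eigenvalues and leaves the eigenvectors unchanged.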

The input of NaivePCA looks like:

sample1  110022110100202021001122*201
sample2  2012201102*221020211*1222001

where each digit represents a genotype and any other character is treated as missing data. For now, NaivePCA does not support real-valued matrices. The output is TAB-delimited. The first column is the sample name. The i-th column (i ≥ 2) gives the eigenvector corresponding to the (i-1)-th largest eigenvalue.
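As a rough illustration of these formats, the sketch below reads lines in the input layout into a matrix and writes rows in the output layout. The helper names `parse_genotypes` and `format_rows` are hypothetical, and the six-significant-digit number format is an assumption; NaivePCA's actual output precision is not stated here.

```python
import numpy as np

def parse_genotypes(lines):
    """Parse 'name genotype-string' lines into sample names and an (m, n) array.

    Each digit is one genotype; any other character becomes NaN (missing).
    """
    names, rows = [], []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        names.append(fields[0])
        rows.append([float(c) if c in "0123456789" else float("nan")
                     for c in fields[1]])
    if len({len(r) for r in rows}) > 1:
        raise ValueError("samples have different numbers of markers")
    return names, np.array(rows, dtype=float)

def format_rows(names, evecs):
    """Return TAB-delimited lines: sample name, then one value per eigenvector.

    Column i+1 of a line holds the sample's entry in eigenvector column i,
    so evecs is expected to be ordered by decreasing eigenvalue.
    """
    return ["\t".join([name] + [f"{v:.6g}" for v in row])
            for name, row in zip(names, evecs)]

if __name__ == "__main__":
    names, G = parse_genotypes([
        "sample1  110022110100202021001122*201",
        "sample2  2012201102*221020211*1222001",
    ])
    print(names, G.shape)
```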
