rlichtenwalter/pgsql_genomics

=== Efficient Storage of Genotypic Data in Relational Databases for Rapid Retrieval ===

I. COMPATIBILITY
This software is compatible with PostgreSQL releases 9.4, 9.5, and 9.6 on GNU/Linux x86_64. It has been tested on CentOS 6 and CentOS 7. Older versions of PostgreSQL will not work because the code uses 'WITH ORDINALITY', which was introduced in 9.4. Newer versions will not work because they are incompatible with the current state of the TINYINT library. We hope to either remove that dependency or create a fix in the future. Other architectures and GNU/Linux variants may work but are untested.
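
As an illustration of the supported range, the following is a minimal Python sketch and is not part of this distribution. It assumes the integer that PostgreSQL reports for its 'server_version_num' setting (for example, 90605 for release 9.6.5 and 100000 for release 10.0). The function name 'is_supported' is hypothetical.

```python
def is_supported(server_version_num):
    """Return True if the server belongs to the 9.4, 9.5, or 9.6 series.

    server_version_num is the integer PostgreSQL reports for
    'SHOW server_version_num', e.g. 90400 for 9.4.0 or 90605 for 9.6.5.
    """
    # Lower bound: 9.4.0, the first release with WITH ORDINALITY.
    # Upper bound: anything past the 9.6 series (10.0 is 100000) is excluded.
    return 90400 <= server_version_num < 90700
```

The value can be read from a running server with "psql -At -c 'SHOW server_version_num'" and passed to the function as an int.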

II. USAGE
On a machine with a complete PostgreSQL 9.4, 9.5, or 9.6 installation, run the script 'setup.sh' in the top-level directory. You should have both the PostgreSQL server and the PostgreSQL client tools installed. Note that this script must be run as root because it copies files into privileged locations and uses sudo to execute psql as the database superuser.
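
The prerequisites above can be checked before running 'setup.sh'. The following Python sketch is only an illustration and is not shipped with this distribution. The function name 'preflight' and the assumption that the server and client binaries are named 'postgres' and 'psql' and are on PATH are ours; some installations place them elsewhere.

```python
import os
import shutil

def preflight(tools=("psql", "postgres"), which=shutil.which, euid=None):
    """Return a list of problems; an empty list means the checks passed.

    'which' and 'euid' can be overridden, which makes the check easy to test.
    """
    problems = []
    if euid is None:
        euid = os.geteuid()  # effective user id; 0 means root
    if euid != 0:
        problems.append("setup.sh must be run as root")
    for tool in tools:
        # Assumption: each tool should be discoverable on PATH.
        if which(tool) is None:
            problems.append("missing from PATH: " + tool)
    return problems
```

If the returned list is empty, running './setup.sh' as root from the top-level directory is the next step.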

III. BENCHMARKING NOTES
This distribution generates fake data that is roughly designed to represent the test data in the paper 'Efficient Storage of Genotypic Data in Relational Databases for Rapid Retrieval'. The paper did *NOT* use fake data for its benchmarking. It used a real cohort of 3104 individuals and their unimputed and imputed genotypes, but privacy protections for the cohort, together with the volume of the data, prevent us from releasing it. Notably, the fake data generation uses a biased random number generation procedure to roughly represent the homogeneity that results from low alternate allele frequencies in variants. To keep loading fast for test purposes, the fake sample-major data and the fake variant-major data are *NOT* the same. The real benchmark data in the paper for the sample-major and variant-major representations *WAS* the same.
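
To make the idea of biased generation concrete, the following Python sketch shows one way such data could be produced. It is *NOT* the procedure used by this distribution. It assumes genotypes are encoded as alternate-allele counts (0, 1, or 2) and that each of the two alleles is independently the alternate allele with a small probability 'alt_freq'. The names 'fake_genotypes' and 'to_variant_major' are hypothetical.

```python
import random

def fake_genotypes(n_samples, n_variants, alt_freq=0.05, seed=0):
    """Return a sample-major matrix (one row per sample) of allele counts."""
    if not 0.0 <= alt_freq <= 1.0:
        raise ValueError("alt_freq must be between 0 and 1")
    rng = random.Random(seed)  # seeded so that runs are reproducible
    return [
        # Two independent biased draws, one per allele; True counts as 1.
        [(rng.random() < alt_freq) + (rng.random() < alt_freq)
         for _ in range(n_variants)]
        for _ in range(n_samples)
    ]

def to_variant_major(sample_major):
    """Transpose a sample-major matrix so that each row is one variant."""
    return [list(column) for column in zip(*sample_major)]
```

With a low 'alt_freq', most entries are 0, which mimics the homogeneity described above. Transposing one matrix with 'to_variant_major' yields the same data in both orientations, as in the paper's real benchmark; the fake data here is instead generated separately for each orientation.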

IV. OTHER NOTES
The SQL code and C code here are very similar to, but not identical to, the code used in the paper. This is because the test platform was part of a much broader system incorporating many additional tables and types of data. Additionally, that system included a wide variety of extremely complex, optimized queries that use CTEs, window functions, dynamic SQL, and other features of PostgreSQL to achieve extremely high performance. If you are interested in obtaining some of this additional code for use with a built-out version of this code, please contact the authors. We strove to sanitize and minimize this release so that it is as simple as possible.

V. RECIPES
As requests for support on specific configurations arrive and as time permits, we will create precise recipes for proper configuration in the folder 'recipes'. Each recipe will show the procedure for getting the software working, starting from a minimal install of the listed operating system.
