GitHub - rlichtenwalter/pgsql_genomics: Efficient Storage of Genotypic Data in Relational Databases for Rapid Retrieval

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
c		c
dat		dat
dat_gen		dat_gen
ddl		ddl
dml		dml
dml_gen		dml_gen
lib/tinyint-0.1.1		lib/tinyint-0.1.1
recipes		recipes
.gitignore		.gitignore
CHANGES		CHANGES
LICENSE		LICENSE
README		README
setup.sh		setup.sh

Repository files navigation

=== Efficient Storage of Genotypic Data in Relational Databases for Rapid Retrieval ===

I. COMPATIBILITY
This software is compatible with PostgreSQL Releases 9.4, 9.5, and 9.6 on GNU/Linux x86_64. It has been tested on CentOS 6 and CentOS 7. Older versions of PostgreSQL will not work because of usage of 'WITH ORDINALITY', and newer versions will not work because of incompatibility with the current state of the TINYINT library. We hope to either remove the dependency or create a fix in the future. Other architectures and GNU/Linux variants may work but are untested.

II. USAGE
On a machine with a complete PostgreSQL 9.4, 9.5, or 9.6 installation, run the script 'setup.sh' in the top-level directory. You should have both the PostgreSQL server and the PostgreSQL client tools installed. Note that this script must be run as root because it copies files into privileged locations and uses sudo to execute psql as the database superuser.

III. BENCHMARKING NOTES
This distribution generates fake data that is roughly designed to represent the test data in the paper 'Efficient Storage of Genotypic Data in Relational Databases for Rapid Retrieval'. The paper did *NOT* use fake data for its benchmarking. It used a real cohort of 3104 individuals and their unimputed and imputed genotypes, but privacy protections for the cohort and data volume require us not to release this data. Notably, the fake data generation uses a biased random number generation procedure to roughly represent the homogeneity that results from low alternate allele frequencies in variants. The fake sample-major data and variant-major data is *NOT* the same to facilitate fast loading for test purposes. The real benchmark data in the paper for sample-major and variant-major representations *WAS* the same.

IV. OTHER NOTES
The SQL code and C code here is very similar to but not identical to the code used in the paper. This is because the test platform was part of a much broader system incorporating many additional tables and types of data. Additionally, that system included a wide variety of extremely complex, optimized queries making use of CTEs, window functions, dynamic SQL, and other features of PostgreSQL to achieve extremely high-performance. If you are interested in obtaining some of this additional code for use with a built-out version of this code, please contact the authors. We strove to sanitize and minimize this release so that it was as simple as possible.

V. RECIPES
As requests for support on specific configurations arrive and as time permits, we will create precise recipes for proper configuration in the folder 'recipes'. These recipes will show the proper procedure assuming a minimal install of the listed operating system to make use of the software.

About

Efficient Storage of Genotypic Data in Relational Databases for Rapid Retrieval

Readme

MIT license

Activity

4 stars

1 watching

0 forks

Report repository

Releases

No releases published

Packages

No packages published

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

c

c

dat

dat

dat_gen

dat_gen

ddl

ddl

dml

dml

dml_gen

dml_gen

lib/tinyint-0.1.1

lib/tinyint-0.1.1

recipes

recipes

.gitignore

.gitignore

CHANGES

CHANGES

LICENSE

LICENSE

README

README

setup.sh

setup.sh

Repository files navigation

About

Releases

Packages

Languages

License

rlichtenwalter/pgsql_genomics

Folders and files

Latest commit

History

Repository files navigation

About

Resources

License

Stars

Watchers

Forks

Languages