Skip to content

um-tech-evolution/slatkin-exact-tools

 
 

Repository files navigation

slatkin-exact-tools

Command line tools and Python module for Montgomery Slatkin's "Exact" test of neutrality in a set of gene or trait counts

Author: Mark E. Madsen, Dept. of Anthropology, University of Washington

Acknowledgements

Thanks to Alden H. Wright and colleagues, Dept. of Computer Science, University of Montana, for contributing bug fixes and a Makefile for the test code.

Goal

This package is a refactoring of Montgomery Slatkin's code for an "Exact" test for neutrality.

The refactoring allows the code to be used as a library, in addition to the original use as a console/command line program. The original "enumerate" code is unchanged, and preserved for testing against modifications that I make to the monte carlo version.

The monte carlo version has been migrated into slatkin_impl.h and slatkin_impl.c, with a single "API" function to call: slatkin_mc(), which takes the number of monte carlo replications, and an array of integers representing the allele/trait counts. The function returns a simple struct with members probability, and the theta_estimate.

There is also a C language function called montecarlo() which wraps slatkin_mc(), taking a slightly different set of arguments. This is meant specifically to be easy for SWIG to construct Python and other language modules from, but it might also be useful in C itself given that it takes int** instead of int[].

The remainder of the original functions are located in slatkin.c, and are considered "private" -- since C can't enforce that, just don't use them. Many of them have interesting conventions for the arguments, and I have not rewritten any of this code, just refactored it into packages and added scaffolding.

Output

Slatkin's original output format for the monte carlo program has been changed to output two numbers, separated by a tab. The first number is the P_e probability for the observed configuration given the Ewens Sampling Formula, and the second is the estimate of "theta" given equation 9.26 in Ewens's 1979 book (the equation is also in Ewens 2004, of course, but with a different equation number which escapes me for the moment.)

This modification is designed to allow easy scripting of Slatkin Exact tests, since the STDOUT results are "clean" and can be easily parsed by the next step in a script. I've used this method to access Slatkin tests from within R as well, with good results.

Python Module

The build process constructs a Cython-based extension module, wrapping the C language montecarlo() function, and constructing a wrapper module. The python implementation works as follows:

from slatkin import ewens_montecarlo

# Do 100K monte carlo replicates
mc_reps = 100000 

counts = [91, 56, 27, 9, 7, 3, 1]

(prob, theta) = ewens_montecarlo(mc_reps, counts)

print "prob: %s    theta: %s" % (prob, theta)

The python module is constructed most easily by running make python or make all, and is installed by running make pyinstall. The latter will install the built module to the user's current Python distribution, as determined by the python interpreter in the user's path. In some cases, you may need sudo make pyinstall to make this work, or to make the module available for all users on the system.

Julia Bindings

The file slatkin.jl contains code that allows the Slatkin Monte Carlo method to be called from code written in Julia. Currently, the bindings are not structured as a proper Julia package, but they can be used with the following steps:

  1. Issue make shared to produce slatkin.so
  2. Copy or link slatkin.jl into your project (or just use the full path to the file in the next step)
  3. Import the Julia bindings with include("slatkin.jl")
  4. When you run your script, you must update LD_LIBRARY_PATH to include the directory that contains slatkin.so, so maybe LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/path/to/slatkin-exact-tools"

In the future a full Julia package might be produced.

To run the Julia test file, navigate to the test/ subdirectory and run montecarlo_test.jl.sh, which sets the LD_LIBRARY_PATH variable appropriately before running montecarlo_test.jl.

Julia Caveats

The C algorithm relies on 32 bit integers, so the Julia code, for now, requires 32 bit values. This can be managed as follows:

counts = Int32[8, 4, 3, 2, 2, 1, 1, 1, 1, 1, 1]
result = ewens_montecarlo(Int32(100000), counts)

Keep in mind that all the code here is released under the GPL, so there may be implications if you decide to distribute your Julia application. You may want to include the full source in your project, or even just include this repository as a submodule if you are using Git.

Compatibility with Version 1.2 and Previous Revisions

Prior to Version 1.3, I was using SWIG to generate the Python to C wrapper code and frankly, the typemaps for getting the list in and out are confusing, so I took the easy way out and passed the length of the list as a parameter to the montecarlo() function.

In Version 1.3, I rewrote the wrapper using Cython, which is a really fabulous system. Not only is the complexity vastly reduced, but once the OS X clang compiler handles OpenMP again, I'm set for parallelizing the monte carlo replicates across cores, which will help in situations with many trait counts and large numbers of replicates (which you'd want with many traits to get a stable answer).

The former syntax is used in a number of my running simulation codes, though, so for compatibility, the old calling convention is:

from slatkin import montecarlo
 
counts = [91, 56, 27, 9, 7, 3, 1]

(prob, theta) = montecarlo(mc_reps, counts, len(counts))

print "prob: %s    theta: %s" % (prob, theta)

Random Number Generator

I also noted that Slatkin was implementing his own PRNG in the original code. Today, there's no need to implement our own, and potentially get it wrong. Instead, I'm building in the Mersenne Twister MT19937 as implemented in mersenne.c, which carries the following attribution legend:

// The code as Shawn received it included the following notice:
//
//   Copyright (C) 1997 Makoto Matsumoto and Takuji Nishimura.  When
//   you use this, send an e-mail to <matumoto@math.keio.ac.jp> with
//   an appropriate reference to your work.
//   It would be nice to CC: <Cokus@math.washington.edu> when you write.

Memory Management

Slatkin's original code was meant to be run once against a set of data, at which point the process exits. It was therefore unsuitable for use as a library, to be linked into a long running program. None of the heap memory was ever freed, so you'd leak memory like a sieve using this against 100,000 or 1MM samples.

This has been fixed, which means that it should be possible to write R or Python extensions to wrap the slatkin-common files. This will be done in this repository in upcoming releases.

License

Given that Slatkin put no explicit license in his code, and seemingly placed it in the public domain, I have considered it OK to re-mix and refactor his code in this manner, and re-release the result. I am attaching an GNU General Public License (version 3.0) to this repository and code, to ensure that others will be able to use the code, and pay forward Slatkin's generosity in releasing the code in the first place. This also matches the license given in the Mersenne Twister code.

About

Command line tools and Python module for Montgomery Slatkin's "Exact" test of neutrality in a set of gene or trait counts

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages

  • C 77.6%
  • Python 9.1%
  • Jupyter Notebook 7.5%
  • Makefile 2.9%
  • Julia 2.3%
  • C++ 0.4%
  • Shell 0.2%