sedykh/datasplit: toolbox to prepare data sets (matrices) for modeling



Datasplit is a "Swiss-army-knife" tool for processing and preparing data sets.
Datasplit works with descriptor matrices in the following formats:
.svm, .xa, and .x with a matching .a file (see the input_demo* files)

To print detailed help, run datasplit without parameters.
The "-M=" command key specifies the working mode;
additional help is available for each mode by running "datasplit -M=.."

-------
Usage examples:
1) Scaling descriptors and removing those with fewer than 5 unique values:
	"datasplit input_demo1.xa -M=N -MINO=5 -OUT=scaled.xa"

2) Splitting a dataset into random 5-fold modeling and external sets:
	"datasplit input_demo1.xa -M=L -N=5 -4EXT -OUT=ecv5f.list"

3) Splitting a modeling set into several internal training and test subsets:
	"datasplit input_demo1.xa -M=S -N=10 -OUT=ecv5f.t2t"

4) Splitting a dataset once by randomly sampling 9 points out as the test set:
	"datasplit input_demo1.xa -M=R -S=9 -N=1 -OUT=r1split.list"

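To picture what example 1 does, here is a minimal Python sketch (not datasplit's own code).
It drops descriptor columns with fewer than a given number of distinct values and then
rescales the rest. The function name filter_and_scale is hypothetical, and range scaling
to [0, 1] is an assumption made for illustration; run "datasplit -M=N" to see what
scaling the tool really applies.

```python
def filter_and_scale(matrix, min_unique=5):
    """Drop low-variety descriptor columns, then range-scale the rest.

    matrix is a list of rows (one row per data point, one column per
    descriptor). Returns (kept_column_indices, scaled_matrix).

    ASSUMPTION: min-max scaling to [0, 1] is used for illustration only;
    it is not confirmed to be what datasplit's '-M=N' mode does.
    """
    n_cols = len(matrix[0]) if matrix else 0
    # Keep a column only if it has at least min_unique distinct values.
    kept = [j for j in range(n_cols)
            if len({row[j] for row in matrix}) >= min_unique]
    scaled = [[] for _ in matrix]
    for j in kept:
        column = [row[j] for row in matrix]
        low, high = min(column), max(column)
        span = high - low
        for i, value in enumerate(column):
            # A constant column (span == 0) is mapped to 0.0.
            scaled[i].append((value - low) / span if span else 0.0)
    return kept, scaled


# Toy matrix: column 1 has only 2 distinct values and is removed.
demo = [[0, 0, 10], [1, 1, 20], [2, 0, 30],
        [3, 1, 40], [4, 0, 50], [5, 1, 50]]
kept, scaled = filter_and_scale(demo, min_unique=5)
print(kept)  # [0, 2]
```
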
NB: Examples 2-4 produce "split files" (.t2t or .list), which can be used as input for knn+ modeling.
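
The random splits in examples 2 and 4 can be pictured with the following Python sketch.
It only illustrates the general idea of a leave-group-out split and a random hold-out;
it does not reproduce datasplit's code or its split-file formats. The function names
and the rounding of a fractional test-set size are assumptions.

```python
import random


def leave_group_out(n_points, n_folds, seed=None):
    """Return n_folds (modeling, external) index pairs.

    Every point lands in exactly one external fold; fold sizes differ
    by at most one point.
    """
    rng = random.Random(seed)
    order = list(range(n_points))
    rng.shuffle(order)
    everything = set(order)
    splits = []
    for k in range(n_folds):
        # Deal the shuffled indices round-robin into folds.
        external = sorted(order[k::n_folds])
        modeling = sorted(everything - set(external))
        splits.append((modeling, external))
    return splits


def random_holdout(n_points, size, seed=None):
    """Hold out a random test set once.

    size is either an integer count or a fraction of the data set.
    ASSUMPTION: a fraction is converted with round(); datasplit may
    round differently.
    """
    n_test = size if isinstance(size, int) else round(size * n_points)
    rng = random.Random(seed)
    test = sorted(rng.sample(range(n_points), n_test))
    train = sorted(set(range(n_points)) - set(test))
    return train, test


# Five folds over 20 points, then a single hold-out of 9 points.
folds = leave_group_out(20, 5, seed=1)
train, test = random_holdout(20, 9, seed=1)
print(len(folds), len(test))  # 5 9
```

Fixing the seed makes a split reproducible, which is the role the '-SRND=' flag plays
for datasplit itself.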

Datasplit help printout (just run "datasplit")
---------------------------------------------
#Datasplit V4.14 - Divides data into training and test sets.
Usage:     datasplit filename [flags]
Default output: filename.t2t
Allowed input: .xa files or .x & .a files with matching names.
NB: Do not use blanks or tabs in descriptor or chem.names!

Possible flags are:
'-OUT=' specifies output name, use '.list'-ending to output in old format

'-M=m', m <METHOD> can be:
'S'  = Sphere-Exclusion split <default>
'R'  = Randomly pick test-set
'L'  = Leave-Group-Out split
'Y'  = Y-randomization (progressive scrambling)
'T'  = Transfer input matrix format into output format
'N'  = Normalize input matrix
'U'  = Remove duplicate points
'O'  = Remove outliers
'G'  = Merge two descriptor-matrices
'X'  = Partition based on values of a specified descriptor.
'-N=n', n is the number of splits to do <default 10>
'-S=s'	= Test-set size or fraction, where 's' is a real or integer number <default 0.1>
'-A=..' = Use activity-bins <1-255> for stratified sampling, i.e.'-A=2R' sets 2 bins of
equal act.range ('R'-flag); <default: bins with equal #dpoints>

Parameters related to 'Sphere-Exclusion' or 'Outlier-Removal':
'-Z=zx' = Metric to use; x - is a power coff. <default 2.0>
z: E -Euclidean <default>, T -Tanimoto, R -Corr., C -Cosine
'-R=rx', sphere's radius, where x is initial scale-factor <default 0.01>
r is 'U','D','V': User-defined <by x>, Distance- or Volume-based <default>

'-4EXT': to use 'external/modeling' naming instead of 'test/train'
'-SRND=..': to seed randomizer <default: by time>
Also, for more specific help run with -M= argument.
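
The two binning schemes named under '-A=' (bins of equal activity range versus bins
with an equal number of data points) can be illustrated with a small Python sketch.
This is not datasplit's implementation: the function names, the tie handling, and the
rule of taking at least one test point per bin are assumptions made for the example.

```python
import random


def equal_range_bins(activities, n_bins):
    """Assign each point to one of n_bins bins of equal activity width."""
    low, high = min(activities), max(activities)
    width = (high - low) / n_bins
    bins = []
    for value in activities:
        k = int((value - low) / width) if width > 0 else 0
        bins.append(min(k, n_bins - 1))  # the maximum goes to the last bin
    return bins


def equal_count_bins(activities, n_bins):
    """Assign points to n_bins bins holding (nearly) equal point counts."""
    order = sorted(range(len(activities)), key=lambda i: activities[i])
    bins = [0] * len(activities)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(activities)
    return bins


def stratified_test_set(bins, fraction, seed=None):
    """Sample the same fraction of test points from every bin.

    ASSUMPTION: at least one point is taken from each non-empty bin.
    """
    rng = random.Random(seed)
    test = []
    for b in sorted(set(bins)):
        members = [i for i, x in enumerate(bins) if x == b]
        n_test = max(1, round(fraction * len(members)))
        test.extend(rng.sample(members, n_test))
    return sorted(test)


# Five low activities and one high value: the two schemes disagree.
acts = [1.0, 1.1, 1.2, 1.3, 1.4, 9.0]
print(equal_range_bins(acts, 2))  # [0, 0, 0, 0, 0, 1]
print(equal_count_bins(acts, 2))  # [0, 0, 0, 1, 1, 1]
```

With skewed activities, equal-range bins can leave a bin nearly empty, while
equal-count bins keep the bins balanced, so the choice changes which points a
stratified sample can draw from.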