Implementations of a number of data mining algorithms, which I wrote while working through the book Data Mining and Analysis by Mohammed J. Zaki and Wagner Meira Jr. There are also unit tests and three small programs that apply the algorithms to sample data.
All code, except for code in the "lib" directory, is open-source software under the terms of the MIT License (see the accompanying file LICENSE).
The project source code is organized in the following directories:
- src: implementations of data mining algorithms,
- test: unit tests for code in src,
- lib: the Catch library used for unit tests,
- classification: command-line program that uses the classifier algorithms,
- itemset-mining: command-line program that uses the itemset mining algorithms,
- clustering-gui: GUI program that uses the clustering algorithms.
In addition to the Catch library, which is included in the repository, the following libraries are needed to build the code:
- Boost,
- Eigen library for linear algebra,
- Qt 5 (only for clustering-gui).
The code uses features from C++11 and C++14, so you'll also need an up-to-date C++ compiler. I've tested it with GCC 4.9.1 and Clang 3.5.0.
The project contains CMake build files that can be used to build everything. It's usually best to build in a separate build directory. In the Linux (or OS X) shell, you can use these commands to build everything and run the unit tests:
mkdir build
cd build
cmake ..
make
./test/unittest
If you're building with Clang or GCC, you can enable the AddressSanitizer memory error detector by setting the variable USE_SANITIZER to address. On the command line, that would look something like this:
cmake -DUSE_SANITIZER=address ..
You can enable the undefined behavior sanitizer by setting USE_SANITIZER to undefined:
cmake -DUSE_SANITIZER=undefined ..
The src directory contains implementations of the following algorithms:
- Itemset mining:
  - Brute-force frequent itemset mining algorithm (bruteforce.h)
  - Apriori algorithm for frequent itemset mining (apriori.h)
  - Eclat algorithm for frequent itemset mining (eclat.h)
  - dEclat algorithm for frequent itemset mining (declat.h)
  - Algorithm for generating association rules (associationrules.h)
- Clustering:
  - K-Means clustering algorithm (kmeans.h)
  - Kernel K-Means clustering algorithm (kernelkmeans.h)
- Probabilistic classification:
  - Bayes classifier (bayesclassifier.h)
  - Naive Bayes classifier (naivebayesclassifier.h)
  - K Nearest Neighbors classifier (knnclassifier.h)
- Kernel methods:
  - Homogeneous and inhomogeneous polynomial kernels, Gaussian kernel (kernelfunction.h)
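For reference, the three kernel functions named in the list can be sketched as below. This is a minimal illustration using std::vector and the textbook definitions; the function names, the parameters q, c and sigma, and the Vec alias are assumptions made for this example and are not taken from kernelfunction.h.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;  // hypothetical alias for this sketch

// Dot product of two equally sized vectors.
double dot(const Vec& x, const Vec& y) {
    assert(x.size() == y.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}

// Homogeneous polynomial kernel: K(x, y) = (x . y)^q
double homogeneousPolynomialKernel(const Vec& x, const Vec& y, int q) {
    return std::pow(dot(x, y), q);
}

// Inhomogeneous polynomial kernel: K(x, y) = (c + x . y)^q with c >= 0
double inhomogeneousPolynomialKernel(const Vec& x, const Vec& y, int q, double c) {
    return std::pow(c + dot(x, y), q);
}

// Gaussian kernel: K(x, y) = exp(-||x - y||^2 / (2 * sigma^2))
double gaussianKernel(const Vec& x, const Vec& y, double sigma) {
    assert(x.size() == y.size());
    double squaredNorm = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        const double d = x[i] - y[i];
        squaredNorm += d * d;
    }
    return std::exp(-squaredNorm / (2.0 * sigma * sigma));
}
```

With c set to 0, the inhomogeneous kernel reduces to the homogeneous one, which is why the two are often implemented together.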
itemset-mining is a command-line program that applies the itemset mining algorithms to the "Anonymous Microsoft Web" sample dataset from the UC Irvine Machine Learning Repository. You can download the data and run the program like this:
wget http://archive.ics.uci.edu/ml/machine-learning-databases/anonymous/anonymous-msweb.data
./itemset-mining a 500 0.8 anonymous-msweb.data
Run itemset-mining by itself to get an overview of the command-line arguments.
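To illustrate the two quantities that frequent itemset mining and association rule generation revolve around, here is a minimal sketch of support and confidence. The names Itemset, support and confidence are hypothetical and chosen for this example; the actual interfaces in src may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

using Itemset = std::set<int>;  // hypothetical: items identified by integer IDs

// Support: the number of transactions that contain every item of `items`.
std::size_t support(const std::vector<Itemset>& transactions, const Itemset& items) {
    return static_cast<std::size_t>(std::count_if(
        transactions.begin(), transactions.end(),
        [&items](const Itemset& t) {
            // std::includes needs sorted ranges, which std::set guarantees.
            return std::includes(t.begin(), t.end(), items.begin(), items.end());
        }));
}

// Confidence of the rule X -> Y: support(X union Y) / support(X).
double confidence(const std::vector<Itemset>& transactions,
                  const Itemset& x, const Itemset& y) {
    Itemset both = x;
    both.insert(y.begin(), y.end());
    const std::size_t supportX = support(transactions, x);
    assert(supportX > 0);  // the rule is undefined if X never occurs
    return static_cast<double>(support(transactions, both)) /
           static_cast<double>(supportX);
}
```

An itemset is called frequent when its support reaches a chosen minimum, and a rule is usually reported only when its confidence reaches a chosen minimum as well.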
classification is a command-line program that applies the classification algorithms to the "Skin Segmentation" sample dataset, also from the UC Irvine Machine Learning Repository. It divides the data into training and test datasets by simply using the last 30% as test data, so it's a good idea to randomize the order of the entries first. On the command line, you can do this using the shuf program, which is part of GNU Coreutils:
wget http://archive.ics.uci.edu/ml/machine-learning-databases/00229/Skin_NonSkin.txt
shuf Skin_NonSkin.txt > shuffled.txt
./classification shuffled.txt n
Run classification by itself to get an overview of the command-line arguments.
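The idea of shuffling the data and then holding out the tail as test data can be sketched in C++ as follows. This is only an illustration: splitTrainTest and its parameters are hypothetical names, and the shuffle happens in code here, whereas the classification program relies on the input file already being shuffled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Shuffles `rows` and returns (training rows, test rows), where the test set
// is the last `testFraction` of the shuffled data (rounded down).
template <typename Row>
std::pair<std::vector<Row>, std::vector<Row>>
splitTrainTest(std::vector<Row> rows, double testFraction, unsigned seed) {
    assert(testFraction >= 0.0 && testFraction <= 1.0);

    // A fixed seed makes the split reproducible.
    std::mt19937 rng(seed);
    std::shuffle(rows.begin(), rows.end(), rng);

    const std::size_t testCount =
        static_cast<std::size_t>(static_cast<double>(rows.size()) * testFraction);
    const std::ptrdiff_t trainCount =
        static_cast<std::ptrdiff_t>(rows.size() - testCount);

    std::vector<Row> train(rows.begin(), rows.begin() + trainCount);
    std::vector<Row> test(rows.begin() + trainCount, rows.end());
    return std::make_pair(std::move(train), std::move(test));
}
```

Calling splitTrainTest(rows, 0.3, 42u) would mirror the 30% hold-out described above. Without the shuffle, a file sorted by class label would leave the test set with examples of a single class.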
clustering-gui is a GUI program that applies the clustering algorithms to a set of 2D points specified by the user. Click in the main area of the GUI to set points, then click K-means or Kernel K-means to have one of the algorithms automatically divide the points into a configurable number of clusters.
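For readers who want to see what dividing 2D points into clusters involves, here is a minimal sketch of the standard K-means iteration (Lloyd's algorithm). It is not the code from kmeans.h: Point, squaredDistance and kMeans are hypothetical names, and using the first k points as initial centroids is a simplification chosen to keep the example deterministic.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

struct Point {
    double x;
    double y;
};

// Squared Euclidean distance between two points.
double squaredDistance(const Point& a, const Point& b) {
    const double dx = a.x - b.x;
    const double dy = a.y - b.y;
    return dx * dx + dy * dy;
}

// Returns, for each point, the index of the cluster it was assigned to.
std::vector<std::size_t> kMeans(const std::vector<Point>& points,
                                std::size_t k, std::size_t maxIterations) {
    assert(k > 0 && k <= points.size());

    // Simplification: the first k points serve as initial centroids.
    std::vector<Point> centroids(points.begin(),
                                 points.begin() + static_cast<std::ptrdiff_t>(k));
    std::vector<std::size_t> assignment(points.size(), 0);

    for (std::size_t iter = 0; iter < maxIterations; ++iter) {
        // Assignment step: attach each point to its nearest centroid.
        bool changed = false;
        for (std::size_t i = 0; i < points.size(); ++i) {
            std::size_t best = 0;
            double bestDistance = std::numeric_limits<double>::max();
            for (std::size_t c = 0; c < k; ++c) {
                const double d = squaredDistance(points[i], centroids[c]);
                if (d < bestDistance) {
                    bestDistance = d;
                    best = c;
                }
            }
            if (assignment[i] != best) {
                assignment[i] = best;
                changed = true;
            }
        }
        if (!changed && iter > 0) {
            break;  // assignments are stable, so the centroids would not move
        }

        // Update step: move each centroid to the mean of its points.
        std::vector<Point> sums(k, Point{0.0, 0.0});
        std::vector<std::size_t> counts(k, 0);
        for (std::size_t i = 0; i < points.size(); ++i) {
            sums[assignment[i]].x += points[i].x;
            sums[assignment[i]].y += points[i].y;
            ++counts[assignment[i]];
        }
        for (std::size_t c = 0; c < k; ++c) {
            if (counts[c] > 0) {  // an empty cluster keeps its old centroid
                const double n = static_cast<double>(counts[c]);
                centroids[c] = Point{sums[c].x / n, sums[c].y / n};
            }
        }
    }
    return assignment;
}
```

Kernel K-means follows the same assign-and-update loop but measures distances in the feature space induced by a kernel function instead of in the plane.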