This repository has been archived by the owner on Apr 8, 2018. It is now read-only.

yzhang1991/gamma

The "Gamma" operator

Last updated: April 7, 2018

This repository contains almost all the code I wrote for the "Gamma" operator project during my Ph.D. studies at the University of Houston.

The "Gamma" operator is a matrix operator that computes a summarization matrix (which we call the "Gamma" matrix) for a given input matrix. The "Gamma" matrix can then serve as an intermediate result for computing many linear machine learning models, including linear regression, PCA, the Naive Bayes classifier, and K-means clustering.
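As a rough illustration (this is not code from the repository), the sketch below assumes the formulation used in the papers listed in this README: each data point x with target y is augmented to z = [1, x, y], and the Gamma matrix is the sum of the outer products z zᵀ. Under that assumption, Gamma holds the row count n, the linear sums L, and the quadratic sums Q, which is enough to recover the mean and covariance that PCA needs without rereading the data. The function names `gamma_matrix` and `mean_and_covariance` are hypothetical.

```python
def gamma_matrix(X, y):
    """Accumulate Gamma = sum of z z^T, with z = [1, x_1..x_d, y] (assumed layout)."""
    d = len(X[0])
    size = d + 2
    G = [[0.0] * size for _ in range(size)]
    for row, target in zip(X, y):
        z = [1.0, *row, target]
        for a in range(size):
            for b in range(size):
                G[a][b] += z[a] * z[b]
    return G


def mean_and_covariance(G):
    """Recover the mean vector and population covariance of X from Gamma alone."""
    d = len(G) - 2
    n = G[0][0]                      # number of points
    L = G[0][1:d + 1]                # linear sums: sum of x_i
    mean = [value / n for value in L]
    # Q[a][b] = sum of x_a * x_b sits in the middle d x d block of Gamma.
    cov = [
        [G[a + 1][b + 1] / n - mean[a] * mean[b] for b in range(d)]
        for a in range(d)
    ]
    return mean, cov


if __name__ == "__main__":
    X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    y = [1.0, 0.0, 1.0]
    G = gamma_matrix(X, y)
    mean, cov = mean_and_covariance(G)
    print("n =", G[0][0])
    print("mean =", mean)
    print("cov =", cov)
```

Because Gamma is a plain sum over data points, partial Gamma matrices computed on separate partitions can simply be added together, which is what makes a single-pass, parallel computation possible.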

This research has been published in several papers, listed below:

  • The Gamma Matrix to Summarize Dense and Sparse Data Sets for Big Data Analytics
    Carlos Ordonez, Yiqun Zhang, Wellington Cabrera
    IEEE Transactions on Knowledge and Data Engineering (TKDE)
    28(7): 1905-1918 (2016) [IEEEXplore] [PDF]
  • A Cloud System for Machine Learning Exploiting a Parallel Array DBMS
    Yiqun Zhang, Carlos Ordonez, Lennart Johnsson
    IEEE Proceedings of the DEXA Workshop 2017 [PDF]
  • The Gamma Operator for Big Data Summarization on an Array DBMS
    Carlos Ordonez, Yiqun Zhang, Wellington Cabrera
    Journal of Machine Learning Research (JMLR): Workshop and Conference Proceedings (BigMine 2014: 88-103) [PDF]
  • Time Complexity and Parallel Speedup to Compute the Gamma Summarization Matrix
    Carlos Ordonez, Yiqun Zhang
    Proc. Alberto Mendelzon International Workshop on Foundations of Data Management (AMW), 2016 [PDF]
  • Big Data Analytics Integrating a Parallel Columnar DBMS and the R Language
    Yiqun Zhang, Carlos Ordonez, Wellington Cabrera
    IEEE International Symposium on Cluster, Cloud and Grid Computing (CCGrid), 2016 [IEEEXplore] [PDF]

Repository structure

The earliest work in this project was published in the BigMine 2014 paper. The "Gamma" operator was initially written in C++ and ran on SciDB. You can find those operators in the gamma-scidb directory, in several versions (dense and sparse), including one with GPU acceleration (OpenACC). We later wanted to compare the SciDB implementation with Spark and Vertica; those implementations are in the gamma-spark and gamma-vertica directories. The scalapack-gamma directory contains a ScaLAPACK prototype written by a former student, Hadi Montakhabi. We also tried K-means clustering on SciDB and Vertica, but I believe the Vertica version was never finished.

The scidb-udos folder contains my custom SciDB operator for 2-D array loading, along with some other operators I had fun with while learning how to write SciDB operators. The tools folder contains scripts I used during development and experiments, as well as a few small proof-of-concept programs.

Contact

I apologize for not having had enough time to polish the source code or to provide detailed documentation. By my standards today the code is not especially well engineered, but it carries all the beautiful memories of my Ph.D. life. If you are interested in any of this work, please contact Dr. Carlos Ordonez by email at carlos at central dot uh dot edu.
