GitHub - aaaaaaa323313/UniCAP: An Unified Computing Engine for Fast Data Processing

#UniCAP: A Unified Computing Engine for Fast Data Processing

The UniCAP project is a distributed computing engine for executing data-parallel programs, each of which consists of a directed acyclic graph (DAG) of tasks. It unifies batch processing and stream processing in one system using a timed dataflow model. Clients can either use UniCAP on the NTU Big Data Platform (BDP) or build it from source.
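To make the "DAG of tasks" idea concrete, here is a minimal single-process sketch in Python. It is not UniCAP's API: `run_dag`, the task names, and the dependency table are all hypothetical, and a real engine would distribute the tasks across machines rather than run them in a loop.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_dag(tasks, deps):
    """Run every task after all of its dependencies have finished.

    tasks: task name -> callable taking a dict of dependency results
    deps:  task name -> set of task names it depends on
    """
    results = {}
    # static_order() yields each task only after all of its predecessors.
    for name in TopologicalSorter(deps).static_order():
        inputs = {d: results[d] for d in deps.get(name, ())}
        results[name] = tasks[name](inputs)
    return results


# Hypothetical four-task job: one source, two parallel branches, one sink.
tasks = {
    "load": lambda _: [1, 2, 3, 4],
    "square": lambda r: [x * x for x in r["load"]],
    "double": lambda r: [2 * x for x in r["load"]],
    "sum": lambda r: sum(r["square"]) + sum(r["double"]),
}
deps = {
    "load": set(),
    "square": {"load"},
    "double": {"load"},
    "sum": {"square", "double"},
}

results = run_dag(tasks, deps)
print(results["sum"])
```

The `square` and `double` tasks do not depend on each other, which is the kind of independence a distributed engine can exploit by running them in parallel.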

The main design features of UniCAP are:

###Timed Dataflow

Timed Dataflow aims to reduce the communication overhead of iterative jobs with shared variables. In such jobs, the edge node reads the input data and the shared variables, and updates the shared variables at each iteration. Such applications include:

- Supervised Machine Learning: To train a model, supervised machine learning applications (e.g., logistic regression, artificial neural networks) update the parameters, which are the shared variables, based on the input data and the old parameters at each iteration.
- Graph Processing: Graph processing jobs (e.g., single-source shortest path, PageRank) usually update the weights of the graph nodes, which are the shared variables, based on the graph data and the old weights at each iteration.
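The iteration pattern described above can be sketched with a toy PageRank in plain Python. This is only an illustration of "read the input data and the old shared variables, write new shared variables"; the function name `pagerank_step`, the three-node graph, and the damping factor of 0.85 are assumptions, not taken from UniCAP.

```python
def pagerank_step(graph, ranks, damping=0.85):
    """One iteration: compute new ranks from the graph and the old ranks."""
    n = len(graph)
    new_ranks = {v: (1 - damping) / n for v in graph}
    for v, out_links in graph.items():
        # Each node splits its old rank evenly among its out-links.
        share = ranks[v] / len(out_links)
        for u in out_links:
            new_ranks[u] += damping * share
    return new_ranks


# Input data: a small graph with no dangling nodes (hypothetical).
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}

# Shared variables: one rank per node, rewritten at every iteration.
ranks = {v: 1 / len(graph) for v in graph}
for _ in range(20):
    ranks = pagerank_step(graph, ranks)

print(ranks)
```

In a distributed setting, `ranks` is the state that every worker has to read at the start of each iteration, which is where the communication cost comes from.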

Previous approaches such as Hadoop and Spark incur high communication overhead because they fetch all the shared variables at each iteration. Our experiments show that querying the shared variables can take up to 60% of the total execution time of an iteration. They also show that a large fraction of the shared variables remain static during the computation, yet previous approaches always fetch all of them, even when nothing has changed.

Timed Dataflow tackles this problem by attaching a logical timestamp to each shared variable. At each iteration, the computation nodes (which cache the shared variables from the previous iteration) therefore only need to fetch the shared variables that have changed, rather than all of them. Experiments show that Timed Dataflow speeds up logistic regression and PageRank by 30% and 45%, respectively, compared to Spark.
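The timestamp idea can be sketched as follows. This is a minimal in-process illustration under stated assumptions, not UniCAP's implementation: the class names `TimedStore` and `Worker`, the methods `update`, `fetch_since`, and `sync`, and the use of one logical clock tick per iteration are all hypothetical.

```python
class TimedStore:
    """Shared variables, each tagged with the logical time of its last change."""

    def __init__(self):
        self.clock = 0    # logical time, advanced once per update round
        self.values = {}  # key -> current value
        self.stamps = {}  # key -> logical time at which the value last changed

    def update(self, writes):
        """Apply one iteration's writes; only real changes get a new timestamp."""
        self.clock += 1
        for key, value in writes.items():
            if key not in self.values or self.values[key] != value:
                self.values[key] = value
                self.stamps[key] = self.clock
        return self.clock

    def fetch_since(self, time):
        """Return only the variables that changed after the given logical time."""
        return {k: self.values[k] for k, t in self.stamps.items() if t > time}


class Worker:
    """Computation node that caches shared variables between iterations."""

    def __init__(self, store):
        self.store = store
        self.cache = {}
        self.seen = 0  # logical time of the last sync

    def sync(self):
        """Refresh the cache and return how many variables were transferred."""
        delta = self.store.fetch_since(self.seen)
        self.cache.update(delta)
        self.seen = self.store.clock
        return len(delta)


store = TimedStore()
worker = Worker(store)

store.update({"w0": 0.0, "w1": 0.0, "w2": 0.0})
first_fetch = worker.sync()   # cold cache: everything is new

store.update({"w0": 0.0, "w1": 0.5, "w2": 0.0})
second_fetch = worker.sync()  # only w1 actually changed

print(first_fetch, second_fetch)
```

After the second `sync`, the worker's cache matches the store even though only the changed variable was transferred.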

###Hybrid Vertex

UniCAP incorporates a set of storage systems as its vertices. It supports both pure in-memory storage and mixed memory-disk storage.
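One way to picture mixed memory-disk storage is a store that keeps a bounded number of entries in memory and spills the oldest ones to disk. The sketch below is an assumption made for illustration only: `HybridStore`, its eviction policy, and the pickle-file layout are hypothetical and say nothing about how UniCAP's storage systems actually work.

```python
import os
import pickle
import tempfile
from collections import OrderedDict


class HybridStore:
    """Keeps at most `memory_capacity` entries in RAM; older ones go to disk."""

    def __init__(self, memory_capacity, spill_dir):
        self.capacity = memory_capacity
        self.spill_dir = spill_dir
        self.memory = OrderedDict()  # key -> value, least recently written first
        self.on_disk = {}            # key -> path of the pickled value
        self._next_file = 0          # counter used to build unique file names

    def put(self, key, value):
        # Remove any stale disk copy so the key lives in exactly one place.
        stale = self.on_disk.pop(key, None)
        if stale is not None:
            os.remove(stale)
        self.memory[key] = value
        self.memory.move_to_end(key)
        while len(self.memory) > self.capacity:
            old_key, old_value = self.memory.popitem(last=False)
            path = os.path.join(self.spill_dir, f"{self._next_file}.pkl")
            self._next_file += 1
            with open(path, "wb") as f:
                pickle.dump(old_value, f)
            self.on_disk[old_key] = path

    def get(self, key):
        if key in self.memory:
            return self.memory[key]
        # Raises KeyError if the key was never stored.
        with open(self.on_disk[key], "rb") as f:
            return pickle.load(f)


with tempfile.TemporaryDirectory() as spill_dir:
    store = HybridStore(memory_capacity=2, spill_dir=spill_dir)
    for i in range(4):
        store.put(f"block{i}", [i] * 3)
    in_memory = sorted(store.memory)
    spilled = sorted(store.on_disk)
    restored = store.get("block0")  # transparently read back from disk

print(in_memory, spilled, restored)
```

Setting the capacity high enough that nothing is ever evicted gives the pure in-memory case with the same interface.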

###Hybrid Edge (In Development)

UniCAP supports scheduling tasks on both GPUs and CPUs within a single application. Compared to other approaches, which can use only the CPU or only the GPU, UniCAP maximizes the utilization of system resources.
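Since this feature is still in development, the following is only a sketch of what mixed CPU/GPU scheduling could look like, not a description of UniCAP's scheduler. It uses a simple greedy rule (send each task to the device that would finish it earliest); the function `schedule`, the device names, and the per-task cost estimates are all hypothetical.

```python
def schedule(tasks, devices):
    """Greedily assign each task to the device that would finish it earliest.

    tasks:   list of (task name, {device kind: estimated seconds})
    devices: list of (device name, device kind)
    Returns the plan as (task, device, start time) tuples and the makespan.
    """
    free_at = {name: 0.0 for name, _ in devices}
    plan = []
    for task, cost in tasks:
        # Only consider devices whose kind the task can run on.
        candidates = [d for d in devices if d[1] in cost]
        best_name, best_kind = min(
            candidates, key=lambda d: free_at[d[0]] + cost[d[1]]
        )
        start = free_at[best_name]
        free_at[best_name] = start + cost[best_kind]
        plan.append((task, best_name, start))
    return plan, max(free_at.values())


devices = [("cpu0", "cpu"), ("gpu0", "gpu")]
tasks = [
    ("matmul", {"cpu": 8, "gpu": 1}),
    ("parse", {"cpu": 2}),  # CPU-only task
    ("conv", {"cpu": 6, "gpu": 2}),
    ("filter", {"cpu": 1, "gpu": 3}),
]

plan, makespan = schedule(tasks, devices)
for task, device, start in plan:
    print(f"{task} -> {device} at t={start}")
print("makespan:", makespan)
```

With these made-up estimates, both devices stay busy for the whole run, whereas a CPU-only or GPU-only schedule would leave one of them idle.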

Folders and files (6 commits in the history):

- app
- build
- doc
- etc
- log
- src
- test
- CMakeLists.txt
- README.md
- install.md
