GitHub - Jackson1992/CLAIMS: Claims is an in-memory parallel database prototype, which runs on clusters of shared-nothing servers and provides efficient and scalable data analysis.

CLAIMS is a parallel in-memory database prototype that runs on clusters of commodity servers and provides fast data analysis on relational datasets.

## Highlights

1. Massively parallel execution engine.

CLAIMS relies on a highly parallel query processing engine to dramatically accelerate data analysis. Query evaluation is distributed across the cluster and executed in parallel. Even within a single node, which itself forms a mini-network (e.g., multi-processor, multi-core, NUMA), queries are evaluated in a multi-threaded fashion to leverage the computing power of multi-core hardware.
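As a rough illustration of the intra-node, multi-threaded evaluation described above, the sketch below partitions a column across worker threads, lets each thread compute a private partial aggregate, and merges the partials at the end. The function name `ParallelSum` and the contiguous partitioning scheme are assumptions made for this example and are not taken from the CLAIMS code base.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical helper (not a CLAIMS API): sums `column` with `num_threads`
// worker threads. Each worker scans one contiguous partition and writes a
// private partial result, so no locking is needed until the final merge.
std::int64_t ParallelSum(const std::vector<std::int32_t>& column,
                         unsigned num_threads) {
  assert(num_threads > 0);
  std::vector<std::int64_t> partial(num_threads, 0);
  std::vector<std::thread> workers;
  // Ceiling division so that every row belongs to exactly one partition.
  const std::size_t chunk = (column.size() + num_threads - 1) / num_threads;

  for (unsigned t = 0; t < num_threads; ++t) {
    const std::size_t begin = std::min(column.size(), t * chunk);
    const std::size_t end = std::min(column.size(), begin + chunk);
    workers.emplace_back([&column, &partial, t, begin, end] {
      std::int64_t sum = 0;
      for (std::size_t i = begin; i < end; ++i) sum += column[i];
      partial[t] = sum;  // each thread writes only its own slot
    });
  }
  for (auto& w : workers) w.join();
  return std::accumulate(partial.begin(), partial.end(), std::int64_t{0});
}
```

Calling `ParallelSum(column, std::thread::hardware_concurrency())` would use one worker per hardware thread, provided that call reports a non-zero value. A real engine would additionally place partitions with NUMA locality in mind, which this sketch does not attempt.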

2. Smart intra-node parallelism.

Pipelining query execution across the cluster can effectively improve query response time, but the benefit is diminished if the workloads among execution fragments are imbalanced due to improper intra-node parallelism. To tackle this problem, CLAIMS proposes a novel elastic pipelining technique that automatically adjusts the intra-node parallelism of each query according to the runtime workload. With elastic pipelining, execution fragments that are detected to be the performance bottleneck of the whole query are given more parallelism to accelerate data processing, while the parallelism of execution fragments that are detected to be over-producing is decreased to avoid allocating unnecessary computation resources.

Pipelining query execution among the nodes in the cluster effectively reduces response latency and dramatically saves the storage space needed for intermediate query results. However, its benefits degrade significantly when workloads are imbalanced among execution partitions due to an improperly generated query execution plan. To tackle this problem, CLAIMS proposes a novel elastic pipelining query processing model, which adapts the intra-node parallelism to the runtime workload. Benefiting from elastic pipelining, the parallelism of the different execution fragments in a pipeline adapts to one another, resulting in an optimal intra-node parallelism assignment.
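The idea of shifting parallelism from over-producing fragments to the bottleneck fragment can be sketched as follows. This is a simplified illustration and not the elastic pipelining algorithm used in CLAIMS: the `Fragment` struct, the `RebalanceOnce` function, and the assumption that a fragment's throughput scales linearly with its worker count are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical runtime statistics for one execution fragment of a pipeline.
struct Fragment {
  int parallelism;         // worker threads currently assigned
  double rate_per_worker;  // measured tuples/s processed by one worker
};

// Assumption for this sketch: throughput scales linearly with parallelism.
double Throughput(const Fragment& f) {
  assert(f.parallelism >= 1);
  return f.parallelism * f.rate_per_worker;
}

// Moves one worker from the fastest (over-producing) fragment to the slowest
// (bottleneck) fragment, if the donor would still outpace the old bottleneck.
// Returns true if a worker was moved.
bool RebalanceOnce(std::vector<Fragment>& frags) {
  if (frags.size() < 2) return false;
  std::size_t slow = 0, fast = 0;
  for (std::size_t i = 1; i < frags.size(); ++i) {
    if (Throughput(frags[i]) < Throughput(frags[slow])) slow = i;
    if (Throughput(frags[i]) > Throughput(frags[fast])) fast = i;
  }
  if (slow == fast || frags[fast].parallelism <= 1) return false;
  const double donor_after =
      (frags[fast].parallelism - 1) * frags[fast].rate_per_worker;
  if (donor_after <= Throughput(frags[slow])) return false;  // no gain
  --frags[fast].parallelism;
  ++frags[slow].parallelism;
  return true;
}
```

Calling `RebalanceOnce` repeatedly until it returns `false` on a two-fragment pipeline with 4 workers each, at 100 and 25 tuples/s per worker, ends with 2 and 6 workers: the pipeline's throughput, bounded by its slowest fragment, rises from 100 to 150 tuples/s without using any additional threads.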

3. Efficient in-memory data processing.

Loong employs a large set of optimization techniques to achieve efficient in-memory data processing, including batch-at-a-time processing, cache-sensitive operators, SIMD-based optimization, code generation, and lock-free, concurrent processing structures. These optimizations work together and enable Loong to process up to gigabytes of data per second within a single thread.
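To make batch-at-a-time processing concrete, the following minimal sketch shows two operators that exchange batches of values instead of single tuples, which amortizes the per-call overhead over many rows. The class names, the `Next(Batch&)` interface, and the batch size of 1024 are assumptions for this example and do not reflect the actual operator interface in the repository.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kBatchSize = 1024;  // assumed batch size
using Batch = std::vector<std::int32_t>;

// Hypothetical scan operator: emits the table one batch at a time.
class ScanOperator {
 public:
  explicit ScanOperator(const std::vector<std::int32_t>& table)
      : table_(table) {}

  // Fills `out` with up to kBatchSize values; returns false when exhausted.
  bool Next(Batch& out) {
    assert(pos_ <= table_.size());
    out.clear();
    if (pos_ == table_.size()) return false;
    const std::size_t end = std::min(table_.size(), pos_ + kBatchSize);
    out.assign(table_.begin() + static_cast<std::ptrdiff_t>(pos_),
               table_.begin() + static_cast<std::ptrdiff_t>(end));
    pos_ = end;
    return true;
  }

 private:
  const std::vector<std::int32_t>& table_;
  std::size_t pos_ = 0;
};

// Hypothetical filter operator: keeps values greater than `threshold`.
class FilterGreaterThan {
 public:
  FilterGreaterThan(ScanOperator& child, std::int32_t threshold)
      : child_(child), threshold_(threshold) {}

  // Pulls child batches until at least one value qualifies or input ends.
  bool Next(Batch& out) {
    out.clear();
    while (child_.Next(input_)) {
      for (std::int32_t v : input_) {
        if (v > threshold_) out.push_back(v);
      }
      if (!out.empty()) return true;
    }
    return false;
  }

 private:
  ScanOperator& child_;
  std::int32_t threshold_;
  Batch input_;  // reused across calls to avoid reallocating per batch
};

// Drives the scan -> filter pipeline and counts the qualifying values.
std::size_t CountGreaterThan(const std::vector<std::int32_t>& table,
                             std::int32_t threshold) {
  ScanOperator scan(table);
  FilterGreaterThan filter(scan, threshold);
  Batch batch;
  std::size_t count = 0;
  while (filter.Next(batch)) count += batch.size();
  return count;
}
```

The tight inner loop over a contiguous batch is also the kind of code that compilers can vectorize, which is one reason batch-oriented operators combine well with SIMD-based optimization.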

4. Network communication optimization.

Parallel query processing places a heavy burden on network communication, which becomes the performance bottleneck for in-memory parallel databases because network bandwidth is relatively slow compared with in-memory data processing throughput. When compiling a user query into an execution plan, CLAIMS’s query optimizer leverages a sophisticated selectivity propagation system and cost model to generate physical query plans with minimized network communication cost. Furthermore, Loong deploys a new data exchange implementation, which offers efficient, scalable and skew-resilient network data communication among Loong instances. These optimizations greatly reduce the response time of queries that require network data communication.
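The kind of decision a network-aware cost model makes can be illustrated with a textbook example: choosing how to distribute the inputs of a join. The sketch below is not the CLAIMS optimizer. The names `JoinInput` and `ChooseJoinStrategy`, the cost formulas, and the assumption that both inputs are evenly partitioned across the nodes (and not already co-partitioned on the join key) are all hypothetical.

```cpp
#include <cassert>

enum class JoinStrategy { kBroadcastLeft, kBroadcastRight, kRepartitionBoth };

// Hypothetical optimizer statistics for one join input.
struct JoinInput {
  double base_bytes;   // size of the base relation
  double selectivity;  // fraction surviving the filters below the join
};

// Selectivity propagation in its simplest form: estimated size after filters.
double EstimatedBytes(const JoinInput& in) {
  return in.base_bytes * in.selectivity;
}

// Picks the strategy with the lowest estimated network traffic on a cluster
// of `num_nodes` nodes, assuming evenly partitioned inputs.
JoinStrategy ChooseJoinStrategy(const JoinInput& left, const JoinInput& right,
                                int num_nodes) {
  assert(num_nodes >= 1);
  const double l = EstimatedBytes(left);
  const double r = EstimatedBytes(right);
  const double n = static_cast<double>(num_nodes);
  // Broadcasting sends every partition of one input to the other n-1 nodes.
  const double broadcast_left = l * (n - 1.0);
  const double broadcast_right = r * (n - 1.0);
  // Hash repartitioning moves a (n-1)/n fraction of both inputs on average.
  const double repartition = (l + r) * (n - 1.0) / n;

  if (repartition <= broadcast_left && repartition <= broadcast_right) {
    return JoinStrategy::kRepartitionBoth;
  }
  return broadcast_left <= broadcast_right ? JoinStrategy::kBroadcastLeft
                                           : JoinStrategy::kBroadcastRight;
}
```

With 10 nodes, a 1 GB input filtered down to 0.1% of its size is cheap to broadcast against an unfiltered 1 GB input, whereas two unfiltered 1 GB inputs are cheaper to repartition. This shows why accurate selectivity estimates matter for keeping network traffic low.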

## Performance

Benefiting from its smart, massive parallelism and its in-memory data processing optimizations, CLAIMS is up to 5X faster than Shark and Impala, two state-of-the-art systems in the open source community, on queries over the TPC-H dataset and the Shanghai Stock Exchange dataset.

## Team members

Aoying Zhou, a professor in East China Normal University, is the person in charge of this project.

Minqi Zhou, an associate professor in East China Normal University, is the person in charge of this project.

Li Wang, a Ph.D. student in East China Normal University, manages the master's students in this team and is responsible for designing and implementing the key components of Loong, including the query optimizer, catalog, physical operators, distributed communication infrastructure, storage layout, etc.

Lei Zhang is responsible for designing and implementing the key components of Loong, including the query optimizer, physical operators, persistent data exchange, storage management, etc.

Shaochan Dong is responsible for designing and implementing the in-memory index and index management, data types, as well as data loading and importing.

Xinzhou Zhang is mainly responsible for designing the web UI and implementing the data importing model.

Zhuhe Fang is mainly responsible for designing and implementing the SQL DML parser and physical operators.

Yu Kai is mainly responsible for designing and implementing the SQL DDL parser and catalog persistence.

Yongfeng Li is a former member of CLAIMS, who participated in designing and implementing the catalog model.

Lin Gu is responsible for designing the demo cases of CLAIMS.

## Publications

1. Li Wang, Minqi Zhou, Zhenjie Zhang, Yin Yang, Ming-chien Shan, Aoying Zhou. Elastic Pipelining in In-Memory Database Cluster. Submitted to SIGMOD 2015.
2. Li Wang, Minqi Zhou, Zhenjie Zhang, Ming-chien Shan, Aoying Zhou. NUMA-aware Scalable and Efficient Aggregation on Large Domains. Accepted by TKDE.
3. Li Wang, Lei Zhang, Chengcheng Yu, Aoying Zhou. Optimizing Pipelined Execution for Distributed In-memory OLAP System. In: DaMen 2014. Springer. 2014. pp. 35-56.
4. Lan Huang, Ke Xun, Xiaozhou Chen, Minqi Zhou. In-memory Cluster Computing: Interactive Data Analysis. Journal of East China Normal University, 2014.
5. Shaochan Dong, Minqi Zhou, Rong Zhang, Aoying Zhou. In-Memory Index: Performance Enhancement Leveraging on Processors. Journal of East China Normal University, 2014.
6. Xinzhou Zhang, Minqi Zhou. A Survey of Fault Tolerance and Fault Recovery Technique in Large Distributed Parallel Computing System. Journal of East China Normal University, 2014.
7. Lei Zhang, Minqi Zhou, Li Wang. LCDJ: Locality Conscious Join Algorithm for In-memory Cluster Computing. Journal of East China Normal University, 2014.
8. Lei Zhang, Zhuhe Fang, Minqi Zhou, Lan Huang. Join Algorithms Towards In-memory Computing. Journal of East China Normal University, 2014.
9. Yongfeng Li, Minqi Zhou, Hualiang Hu. Survey of resource uniform management and scheduling in cluster. Journal of East China Normal University, 2014.

## Quick Start

To try CLAIMS, follow the Quick Start guide.

