# Adaptive Partitioning for Very Large RDF Data

## Introduction

State-of-the-art distributed RDF systems partition data across multiple compute nodes (workers). Some systems perform cheap hash partitioning, which may result in expensive query evaluation, while others apply heuristics that aim to minimize inter-node communication during query evaluation. The latter requires an expensive data pre-processing phase, leading to high startup costs for very large RDF knowledge bases. A priori knowledge of the query workload has also been used to create partitions; these partitions, however, are static and do not adapt to workload changes. As a result, inter-node communication cannot be consistently avoided for queries that are not favored by the initial data partitioning.

We propose AdHash, a distributed RDF system that addresses the shortcomings of previous work. First, AdHash applies lightweight partitioning to the initial data, distributing triples by hashing on their subjects; this keeps its startup overhead low. At the same time, the locality-aware query optimizer of AdHash takes full advantage of the partitioning to (i) support fully parallel processing of join patterns on subjects and (ii) minimize data communication for general queries by hash-distributing intermediate results instead of broadcasting them, wherever possible. Second, AdHash monitors the data access patterns and dynamically redistributes and replicates the instances of the most frequent patterns among workers. As a result, the communication cost for future queries is drastically reduced or even eliminated. To control replication, AdHash implements an eviction policy for the redistributed patterns. Our experiments with synthetic and real data verify that AdHash (i) starts faster than all existing systems, (ii) processes thousands of queries before other systems come online, and (iii) gracefully adapts to the query load, evaluating queries on billion-scale RDF data in sub-second time.
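
As a rough illustration of the initial placement, the sketch below splits an N-Triples file into one part per worker by hashing on each triple's subject. This is not AdHash's actual loader (AdHash performs the partitioning internally); the worker count, the input file data.nt, and the cksum-based hash are placeholders:

```
# Sketch: assign each triple to one of W workers by hashing its subject.
# In N-Triples the subject is the first whitespace-delimited token, so all
# triples sharing a subject end up in the same part file.
W=4                                    # placeholder worker count
while IFS=' ' read -r subj rest; do
  h=$(( $(printf '%s' "$subj" | cksum | cut -d' ' -f1) % W ))
  printf '%s %s\n' "$subj" "$rest" >> "part-$h.nt"
done < data.nt
```

Because all triples with the same subject land on the same worker, star-shaped join patterns on the subject can be evaluated by each worker independently, without inter-node communication.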

For more details, visit http://cloud.kaust.edu.sa/Pages/adhash.aspx

## License

AdHash is released under the Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.

If you use AdHash in your research, please cite our paper:

```
@article{DBLP:journals/pvldb/HarbiAKM15,
  author    = {Razen Harbi and
               Ibrahim Abdelaziz and
               Panos Kalnis and
               Nikos Mamoulis},
  title     = {Evaluating {SPARQL} Queries on Massive {RDF} Datasets},
  journal   = {{PVLDB}},
  volume    = {8},
  number    = {12},
  pages     = {1848--1859},
  year      = {2015},
  url       = {http://www.vldb.org/pvldb/vol8/p1848-harbi.pdf},
  timestamp = {Fri, 14 Aug 2015 15:24:11 +0200},
  biburl    = {http://dblp.uni-trier.de/rec/bib/journals/pvldb/HarbiAKM15},
  bibsource = {dblp computer science bibliography, http://dblp.org}
}
```

## Building

AdHash was tested on 64-bit Ubuntu Linux 12.04 (Precise), 14.04 (Trusty), and Linux Mint 17.1 (Rebecca).

## Dependencies

There are a few dependencies which must be satisfied in order to compile and run AdHash.

* build-essential and g++ (>= 4.4.7) [Required]
  * Needed for compiling AdHash.
* openssh-server [Required]
  * Required to initialize MPI and establish connections among compute nodes.
* MPICH2 [Required]
  * AdHash uses MPI for inter-node communication. Open MPI has not been tested with AdHash.
* Boost (>= 1.46) [Required]
  * AdHash requires the headers of the header-only Boost libraries (libboost-dev) as well as several compiled Boost libraries. Specifically, AdHash uses Boost's iostreams, system, and filesystem libraries.
* Qt (>= 5.4) [Required]
  * AdHash has a simple Qt-based GUI for managing, partitioning, and querying RDF datasets. AdHash links dynamically against Qt.
* OpenGL (GL) library [Required]
  * Needed by Qt for GUI rendering.
* gnuplot (>= 4.4) [Required]
  * Needed to plot some images within the AdHash GUI.

## Satisfying Dependencies on Ubuntu and Linux Mint

Except for Qt, all dependencies can be installed from the Ubuntu/Linux Mint repositories:

```
sudo apt-get update
sudo apt-get install gcc g++ build-essential openssh-server libmpich2-dev libboost-dev libboost-iostreams-dev libboost-system-dev libboost-filesystem-dev libglu1-mesa-dev gnuplot git
```

Download Qt from http://www.qt.io/download/. Both the online and offline installers were tested with AdHash.

Note: for MPI to work, all workers need to be able to ssh to each other without passwords.
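
For example, passwordless SSH can be set up by generating a key pair once and copying the public key to every worker; the hosts.txt file listing the worker hostnames is a placeholder:

```
# Create a key pair with no passphrase (skip if one already exists).
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
# Copy the public key to each worker listed in hosts.txt.
while read -r host; do
  ssh-copy-id "$host"
done < hosts.txt
```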

## Downloading AdHash

You can download AdHash from GitHub, either by cloning the repository with git or by downloading it as a zip archive from the website.

To clone from GitHub using git, execute the following command:

```
git clone https://github.com/razen-alharbi/AdHash.git
```

## Compiling and Running

* Change your current working directory to the downloaded directory. This directory is referred to as ADHASH_HOME hereafter.

```
cd ADHASH_HOME
```

* Generate the make files by running qmake:

```
qmake AdHashGUI.pro -r -spec linux-g++
```

* Compile AdHash. You can run multiple parallel build tasks; we used 4:

```
make -j4
```

* Run the AdHash management console by executing the command below. AdHash has a simple, self-explanatory GUI for loading, partitioning, and querying RDF datasets:

```
./Release/mgmt
```

## Running Issues

If your system's locale is not set properly, AdHash may fail to run. This was noticed specifically after installing a desktop environment on Ubuntu servers. To work around this, execute the following command:

```
export LC_ALL="en_US.UTF-8"
```
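
To make the setting persistent across sessions, you can also append it to your shell profile (assuming bash and ~/.bashrc):

```
echo 'export LC_ALL="en_US.UTF-8"' >> ~/.bashrc
```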


If you encounter issues, please send an email to razen.harbi@kaust.edu.sa and ibrahim.abdelaziz@kaust.edu.sa.

## Tested Datasets
The following datasets/benchmarks were successfully tested with AdHash:
* LUBM Benchmark (http://swat.cse.lehigh.edu/projects/lubm/).
* WatDiv Benchmark (http://dsg.uwaterloo.ca/watdiv/).
* YAGO2 (http://yago-knowledge.org/).
* Bio2RDF (http://www.bio2rdf.org/).
