SharpLu/h5spark


#H5Spark

  1. Supports the Hierarchical Data Format (HDF5/NetCDF4) and a rich parallel I/O interface in Spark
  2. Optimizes I/O performance on HPC systems with Lustre filesystem tuning

Input and Output

  1. Input is a tuple of (pathname or filename, variablename, numpartitions)
  2. Output is a single RDD in which each element is one row of the original file(s)
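To make the contract above concrete, here is a minimal stdlib-only sketch (no Spark or HDF5 required) of how a row-per-element dataset could be divided across `numpartitions` contiguous partitions; `split_rows` is a hypothetical helper for illustration, not part of the h5spark API:

```python
def split_rows(rows, numpartitions):
    """Split a list of rows into `numpartitions` contiguous chunks,
    mimicking how a row-per-element RDD is spread over partitions.
    Earlier partitions absorb the remainder rows (illustrative only)."""
    size, extra = divmod(len(rows), numpartitions)
    parts, start = [], 0
    for i in range(numpartitions):
        end = start + size + (1 if i < extra else 0)
        parts.append(rows[start:end])
        start = end
    return parts

# A stand-in 5x3 dataset: each inner list plays the role of one RDD element (row).
dataset = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15]]
parts = split_rows(dataset, 2)
```

Concatenating the partitions in order recovers the original rows, which is the property the output RDD preserves with respect to the source file(s).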

#Download and Compile H5Spark

  1. git clone https://github.com/valiantljk/h5spark.git
  2. cd h5spark
  3. sbt package

#Use in Pyspark Scripts

Add this to your Python path: export PYTHONPATH=path/to/h5spark/src/main/python/:$PYTHONPATH

Then import it in python like so:

  1. from h5spark import read
  2. from pyspark import SparkContext
  3. sc = SparkContext()
  4. rdd = read.h5read(sc, file_list_or_txt_file, mode='multi', partitions=2000)
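In 'multi' mode, the second argument can be either a Python list of files or the path to a text file listing them. A small stdlib sketch for generating such a list file; `write_file_list`, the sample paths, and the one-path-per-line format are all assumptions for illustration, not documented h5spark behavior:

```python
import os
import tempfile

def write_file_list(h5_paths, list_path):
    """Write one HDF5 file path per line (assumed format), producing a
    text file that could be passed as `file_list_or_txt_file` above."""
    with open(list_path, "w") as f:
        for p in h5_paths:
            f.write(p + "\n")
    return list_path

tmpdir = tempfile.mkdtemp()
list_file = write_file_list(
    ["/data/run-000.h5", "/data/run-001.h5"],  # hypothetical input files
    os.path.join(tmpdir, "files.txt"),
)
with open(list_file) as f:
    lines = f.read().splitlines()
```

Generating the list programmatically keeps long multi-file runs reproducible instead of hand-editing the text file.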

#Use in Scala Codes

  1. export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:your_project_dir/lib
  2. cp h5spark/target/scala-2.10/h5spark_2.10-1.0.jar your_project_dir/lib/
  3. cp h5spark/lib/* your_project_dir/lib/
  4. add this line to your code: import org.nersc.io._

  • Load as an array: val rdd = read.h5read(sc, inputpath, variablename, partition)
  • Load as an indexed vector: val rdd = read.h5read_vec(sc, inputpath, variablename, partition)
  • Load as an indexed row: val rdd = read.h5read_irow(sc, inputpath, variablename, partition)
  • Load as an indexed matrix: val rdd = read.h5read_imat(sc, inputpath, variablename, partition)

#Sample Batch Job Script for testing on Cori

  1. Python version: sbatch spark-python.sh
  2. Scala version: sbatch spark-scala.sh
