- Supports the Hierarchical Data Format (HDF5/NetCDF4) and a rich parallel I/O interface in Spark
- Optimizes I/O performance on HPC systems through Lustre filesystem tuning
- Input: a tuple of (pathname or filename, variable name, number of partitions)
- Output: a single RDD in which each element is one row of the original file(s)
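To make the input/output contract above concrete, here is a minimal plain-Python sketch (no Spark or HDF5 required) of how a 2-D variable read with a given number of partitions ends up as row-wise elements; the toy dataset and the helper name `rows_to_partitions` are illustrative, not part of the h5spark API:

```python
# Illustrative sketch: how (filename, variablename, numpartitions) maps to
# a row-wise collection split across partitions. The nested list below
# stands in for a 2-D HDF5 variable.

def rows_to_partitions(dataset, numpartitions):
    """Split a 2-D dataset row-wise into `numpartitions` chunks,
    mimicking 'each RDD element is one row of the original file'."""
    rows = [tuple(row) for row in dataset]   # one element per row
    size = -(-len(rows) // numpartitions)    # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]

# A toy 4x3 "variable", as if requested with numpartitions=2
dataset = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
partitions = rows_to_partitions(dataset, 2)
```

Each partition holds a contiguous block of rows, so no row is split across partitions.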
# Download and Compile H5Spark
- git clone https://github.com/valiantljk/h5spark.git
- cd h5spark
- sbt package
# Use in PySpark Scripts
Add this to your Python path:
- export PYTHONPATH=path/to/h5spark/src/main/python/:$PYTHONPATH
Then import it in Python like so:
- from h5spark import read
- from pyspark import SparkContext
- sc = SparkContext()
- rdd = read.h5read(sc, file_list_or_txt_file, mode='multi', partitions=2000)
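The snippet above passes a list of files (or a text file listing them). As a hedged sketch of what a multi-file mode might do, the plain-Python helper below distributes file paths across a fixed number of buckets so each task would read its own files; the function `split_files` and the round-robin policy are assumptions for illustration, not the documented h5spark implementation:

```python
# Hypothetical sketch: spreading a list of HDF5 file paths across
# `partitions` buckets, one bucket per Spark task. Names and policy
# here are illustrative only.

def split_files(file_list, partitions):
    """Round-robin the file list across `partitions` buckets."""
    buckets = [[] for _ in range(partitions)]
    for i, path in enumerate(file_list):
        buckets[i % partitions].append(path)
    return buckets

files = ["run-%03d.h5" % i for i in range(5)]
buckets = split_files(files, 2)
```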
# Use in Scala Code
- export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:your_project_dir/lib
- cp h5spark/target/scala-2.10/h5spark_2.10-1.0.jar your_project_dir/lib/
- cp h5spark/lib/* your_project_dir/lib/
- Add this line to your code: import org.nersc.io._
** Load as an array: val rdd = read.h5read(sc, inputpath, variablename, partition)
** Load as an indexed vector: val rdd = read.h5read_vec(sc, inputpath, variablename, partition)
** Load as an indexed row: val rdd = read.h5read_irow(sc, inputpath, variablename, partition)
** Load as an indexed matrix: val rdd = read.h5read_imat(sc, inputpath, variablename, partition)
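The difference between the plain and "indexed" load variants above is whether each row is paired with its row index. A minimal plain-Python sketch of the two shapes, using a toy list in place of an HDF5 variable (the variable name and values are illustrative):

```python
# Sketch of the return shapes: a plain row-wise load vs. an indexed load
# where each element carries its row number, as the indexed vector/row/
# matrix variants do.

dataset = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

as_array = [row for row in dataset]                       # plain rows
as_indexed = [(i, row) for i, row in enumerate(dataset)]  # (index, row) pairs
```

The (index, row) pairing is what lets downstream code build distributed linear-algebra structures such as indexed row matrices.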
# Sample Batch Job Script for Testing on Cori
- Python version: sbatch spark-python.sh
- Scala version: sbatch spark-scala.sh