- Supports the Hierarchical Data Format (HDF5/NetCDF4) and a rich parallel I/O interface in Spark
- Optimizes I/O performance on HPC systems through Lustre filesystem tuning
- Input: a tuple of (pathname or filename, variable name, number of partitions)
- Output: a single RDD in which each element is one row of the original file(s)
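To make the input/output contract above concrete, here is a minimal plain-Python sketch (no Spark or HDF5 required) of how a 2-D variable read with a given number of partitions ends up as row-wise elements; the toy dataset and the helper name `rows_to_partitions` are illustrative, not part of the h5spark API:

```python
# Illustrative sketch: how (filename, variablename, numpartitions) maps to
# a row-wise collection split across partitions. The nested list below
# stands in for a 2-D HDF5 variable.

def rows_to_partitions(dataset, numpartitions):
    """Split a 2-D dataset row-wise into `numpartitions` chunks,
    mimicking 'each RDD element is one row of the original file'."""
    rows = [tuple(row) for row in dataset]   # one element per row
    size = -(-len(rows) // numpartitions)    # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]

# A toy 4x3 "variable", as if requested with numpartitions=2
dataset = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
partitions = rows_to_partitions(dataset, 2)
```

Each partition holds a contiguous block of rows, so no row is split across partitions.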
# Download and Compile H5Spark
- git clone https://github.com/valiantljk/h5spark.git
- cd h5spark
- sbt package
# Use in PySpark Scripts
Add this to your Python path:
- export PYTHONPATH=path/to/h5spark/src/main/python/:$PYTHONPATH
Then import it in Python like so:
- from h5spark import read
- from pyspark import SparkContext
- sc = SparkContext()
- rdd = read.h5read(sc, file_list_or_txt_file, mode='multi', partitions=2000)
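The snippet above passes a list of files (or a text file listing them). As a hedged sketch of what a multi-file mode might do, the plain-Python helper below distributes file paths across a fixed number of buckets so each task would read its own files; the function `split_files` and the round-robin policy are assumptions for illustration, not the documented h5spark implementation:

```python
# Hypothetical sketch: spreading a list of HDF5 file paths across
# `partitions` buckets, one bucket per Spark task. Names and policy
# here are illustrative only.

def split_files(file_list, partitions):
    """Round-robin the file list across `partitions` buckets."""
    buckets = [[] for _ in range(partitions)]
    for i, path in enumerate(file_list):
        buckets[i % partitions].append(path)
    return buckets

files = ["run-%03d.h5" % i for i in range(5)]
buckets = split_files(files, 2)
```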
# Use in Scala Code
- export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:your_project_dir/lib
- cp h5spark/target/scala-2.10/h5spark_2.10-1.0.jar your_project_dir/lib/
- cp h5spark/lib/* your_project_dir/lib/
- Add this line to your code: import org.nersc.io._
** Load as an array: val rdd = read.h5read(sc, inputpath, variablename, partition)
** Load as an indexed vector: val rdd = read.h5read_vec(sc, inputpath, variablename, partition)
** Load as an indexed row: val rdd = read.h5read_irow(sc, inputpath, variablename, partition)
** Load as an indexed matrix: val rdd = read.h5read_imat(sc, inputpath, variablename, partition)
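The difference between the plain and "indexed" load variants above is whether each row is paired with its row index. A minimal plain-Python sketch of the two shapes, using a toy list in place of an HDF5 variable (the variable name and values are illustrative):

```python
# Sketch of the return shapes: a plain row-wise load vs. an indexed load
# where each element carries its row number, as the indexed vector/row/
# matrix variants do.

dataset = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

as_array = [row for row in dataset]                       # plain rows
as_indexed = [(i, row) for i, row in enumerate(dataset)]  # (index, row) pairs
```

The (index, row) pairing is what lets downstream code build distributed linear-algebra structures such as indexed row matrices.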
# Sample Batch Job Script for Testing on Cori
- Python version: sbatch spark-python.sh
- Scala version: sbatch spark-scala.sh