GitHub - benjamingr/CAPS: Communication-Avoiding Parallel Strassen

You have downloaded the source code for CAPS, Communication-Avoiding Parallel Strassen. It provides very high performance for parallel dense square matrix multiplication. It is best suited to massively parallel computation, with thousands to millions of cores. This is research code, and is not suitable for production use. The algorithm is described in:
http://arxiv.org/abs/1202.3173
and further details and performance benchmarks are given in:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-90.html

Compiling:
To compile CAPS, you must create a make.inc file. Examples are provided in the make.incs directory: one for GCC, Open MPI, and ATLAS on Linux, one for a Cray XT4, one for a Cray XE6, and one for an IBM BG/P. This file sets the compiler and compiler options. Additionally, there are three preprocessor directives to set here:
THREADS specifies how many threads to use per MPI process. For best performance it should match the number of cores per process at run time.
SANITY enables several sanity checks in the code to abort on invalid input.
DAXPY uses the BLAS DAXPY routine instead of OpenMP to perform additions. Try both set and unset to find the best performance.
After placing a make.inc file in the base directory, to compile simply type 'make'.

Testing:
Use the executable 'fromfile1' to test the correctness of the code. It reads a matrix from a file on one process, distributes the data to the others, performs the multiplication, then collects the data and verifies or prints the output. The distribution of the data from one process to the rest is not efficiently implemented, and this executable should not be used for large problems where performance is important.
Required flags:
  -i <input file>
  -b <block size>
      1 is the recommended choice
  -r <recursive steps>
Optional flags:
  -o <output file>
      specifies a file to write the computed answer to
  -c <check file>
      specifies a file to compare the computed answer to
      if neither -c nor -o is given, the answer is printed
  -m <megabytes available>
  -k <kilobytes available>
      amount of memory the computation is allowed to use
      used to determine the pattern of BFS, DFS, and hybrid steps
  -p <pattern>
      execution pattern, as a string of 'b', 'd', or 'h'; its length
      should match the number of recursive steps, and the number of
      'b's should match the number of powers of 7 in the number of
      processes
The tests directory contains several tests of integer matrices (so the answer should be exactly correct). Each test has a .in file (to be used as the argument of -i), a .correct file (to be used as the argument of -c) and a .params file that specifies all the valid sets of parameters for this size. More tests can be generated by 'generateTest.py'. If a test fails, check if the parameters are among those listed in the .params file, or recompile with SANITY enabled to abort on invalid inputs.

Benchmarking:
Use the executable 'randombenchmark' to benchmark the code. It randomly generates a matrix locally on each process, performs the multiplication, then prints performance results. The parameters are somewhat restricted; requirements are listed below.
An example is:
mpirun -n 7 ./randombenchmark -s 7168 -r 3 -p bdd
Required flags:
  -s <matrix dimension>
Optional flags:
  -b <block size>
      1 is the recommended choice, and the default
  -r <recursive steps>
  -m <megabytes available>
  -k <kilobytes available>
      amount of memory the computation is allowed to use
      used to determine the pattern of BFS, DFS, and hybrid steps
  -p <pattern>
      execution pattern, as a string of 'b', 'd', or 'h'; its length
      should match the number of recursive steps, and the number of
      'b's should match the number of powers of 7 in the number of
      processes
      note that the pattern will determine the amount of memory used
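The two rules stated for -p can be checked before launching a job. The following is a minimal sketch in Python, not part of CAPS: the helper names split_process_count and pattern_problems are hypothetical, and the script checks only the two rules written above (pattern length and number of 'b's), not anything else the code may require.

```python
def split_process_count(processes):
    """Write processes as f * 7**b with f not divisible by 7; return (f, b)."""
    if processes < 1:
        raise ValueError("the number of processes must be positive")
    b = 0
    while processes % 7 == 0:
        processes //= 7
        b += 1
    return processes, b


def pattern_problems(pattern, recursive_steps, processes):
    """Return a list of rule violations; an empty list means none were found."""
    problems = []
    # Only 'b', 'd', and 'h' are listed as valid pattern characters.
    bad = sorted(set(pattern) - set("bdh"))
    if bad:
        problems.append("unexpected characters: " + "".join(bad))
    # The pattern length should match the number of recursive steps (-r).
    if len(pattern) != recursive_steps:
        problems.append("pattern length %d does not match %d recursive steps"
                        % (len(pattern), recursive_steps))
    # The number of 'b's should match the number of factors of 7
    # in the number of processes.
    _, sevens = split_process_count(processes)
    if pattern.count("b") != sevens:
        problems.append("pattern has %d 'b' steps but the process count has %d factors of 7"
                        % (pattern.count("b"), sevens))
    return problems


if __name__ == "__main__":
    # The example command above: 7 processes, -r 3, -p bdd.
    print(pattern_problems("bdd", 3, 7))
    # Two 'b' steps would need a process count divisible by 49.
    print(pattern_problems("bbd", 3, 7))
```

For the example command above (7 processes, -r 3, -p bdd) the sketch reports no violations, since 7 contains exactly one factor of 7 and the pattern has one 'b' and three characters.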
Output:
The root process will print the amount of memory used, the pattern of BFS, DFS, and hybrid steps, the time to perform the multiplication, and actual and effective performance per process. Additionally, each process will print a breakdown of its time between communication, matrix addition, base-case matrix multiplication, and re-ordering of matrix elements.
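This README does not define "effective performance". In the literature on Strassen-like algorithms the term usually means the classical operation count 2n^3 divided by the run time, which makes fast and classical algorithms directly comparable. The sketch below assumes that convention and may not match the exact quantity randombenchmark prints; the function name and the 1.0 second timing are hypothetical.

```python
def effective_gflops_per_process(n, seconds, processes):
    """Effective rate per process, assuming the 2*n**3 classical flop count."""
    if n < 1 or processes < 1 or seconds <= 0:
        raise ValueError("n and processes must be positive, seconds must be > 0")
    classical_flops = 2 * n ** 3  # flops of the classical algorithm, not Strassen's
    return classical_flops / seconds / processes / 1e9


if __name__ == "__main__":
    # Hypothetical timing of 1.0 s for n = 7168 on 7 processes.
    print(effective_gflops_per_process(7168, 1.0, 7))
```

Under this convention the effective rate can exceed the hardware peak, because Strassen-based multiplication performs fewer than 2n^3 operations.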

Requirements on the matrix dimension and number of processes:
The matrix dimension must be a multiple of:
(2^r)*(7^ceil(b/2))*f
where r is the number of recursive steps, and b is the number of BFS steps, and the number of processes is:
f*7^b,
where f is 1, 2, 3, 4, 5, or 6.
Settings that violate these requirements may still run, especially if the SANITY option is disabled, but will not give the correct answer.
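As a worked check of the rule above, the following Python sketch computes the required multiple from the number of recursive steps and the number of processes. It is not part of CAPS, and the function names are hypothetical. It assumes b is the exponent of 7 in the process count, which follows from writing the number of processes as f*7^b with f between 1 and 6.

```python
def required_multiple(recursive_steps, processes):
    """Return (2^r) * (7^ceil(b/2)) * f for processes = f * 7^b, 1 <= f <= 6."""
    if recursive_steps < 0 or processes < 1:
        raise ValueError("recursive_steps must be >= 0 and processes >= 1")
    f, b = processes, 0
    while f % 7 == 0:  # strip factors of 7 to recover b and f
        f //= 7
        b += 1
    if f > 6:
        raise ValueError("processes must be f * 7**b with f between 1 and 6")
    ceil_half_b = (b + 1) // 2  # integer form of ceil(b/2)
    return (2 ** recursive_steps) * (7 ** ceil_half_b) * f


def dimension_is_valid(n, recursive_steps, processes):
    """True if n is a positive multiple of the required value."""
    return n > 0 and n % required_multiple(recursive_steps, processes) == 0


if __name__ == "__main__":
    # 7 processes and 3 recursive steps: f = 1, b = 1, so 2^3 * 7^1 * 1 = 56.
    print(required_multiple(3, 7))
    print(dimension_is_valid(7168, 3, 7))  # 7168 = 56 * 128
```

With 7 processes and 3 recursive steps the required multiple is 56, so a dimension of 7168 = 56 * 128 satisfies the rule.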

Use as a library:
Use as a library is not recommended. No efficient routines are provided for transforming between standard data layouts and those required by CAPS.

Extensions:
To see similar ideas applied to other algorithms, see the subdirectories. Detailed compilation and running instructions are not provided; see the Makefile and the main files in each directory.
cholesky/: Recursive Cholesky factorization. The algorithm is described in Chapter 5 of http://www.eecs.berkeley.edu/Pubs/TechRpts/2013/EECS-2013-100.pdf
rect-class/: Recursive rectangular classical matrix multiplication. The algorithm and benchmarking data are described in http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-205.pdf
sparse/: Several sparse matrix multiplication algorithms are implemented. The algorithms are described in http://www.eecs.berkeley.edu/Pubs/TechRpts/2013/EECS-2013-13.pdf and benchmarking results are in Chapter 4 of http://www.eecs.berkeley.edu/Pubs/TechRpts/2013/EECS-2013-100.pdf