albertzaharovits/dsymv_benchmarking

BLAS DSYMV benchmarking on the Nehalem, Opteron, and Quad architectures.

Albert Zaharovits albert.zaharovits@gmail.com
7.4.2013

Directory structure :
        - out/ contains the output files of the run.sh script
        - Makefiles for each architecture: Makefile_<architecture>
        - run scripts for each architecture: exec_script_<architecture>
        - charts for each individual architecture, for matrix sizes up to 16k or 36k
Source structure :
        - dsymv_benchmark.c is the main source; it calls all of the DSYMV implementations
        - my_dsymv.c is the straightforward implementation, compiled without compiler optimizations (a minimal sketch of this approach follows below)
        - mypp_dsymv.c is the improved implementation: the matrix A is traversed only once, plus compiler optimizations -O3 -msse3
        - mymm_dsymv.c is another improved implementation: the inner loop is parallelized using SSE registers, plus compiler optimizations -O3 -msse3
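
For reference, here is a minimal sketch of what the straightforward version computes
(y = alpha*A*x + beta*y with A symmetric). It assumes the lower triangle of A is stored
and N == LDA, matching the restriction noted in the observations below; the function name
and layout are illustrative, not the repository's exact code:

    void naive_dsymv(int n, double alpha, const double *A,
                     const double *x, double beta, double *y)
    {
        for (int i = 0; i < n; i++) {
            double sum = 0.0;
            for (int j = 0; j < n; j++) {
                /* Only one triangle is stored: read the mirrored element
                 * when (i, j) falls in the unstored triangle. This is the
                 * branch that the first optimization hoists out. */
                double aij = (j <= i) ? A[i * n + j] : A[j * n + i];
                sum += aij * x[j];
            }
            y[i] = alpha * sum + beta * y[i];
        }
    }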

Observations :
        I implemented two optimizations. The first traverses the matrix only once and
hoists the branch out of the inner loop; the second parallelizes the inner loop using
SSE registers, although to achieve this the matrix has to be traversed twice. GCC does
not auto-vectorize this loop with SSE (checked with -ftree-vectorizer-verbose=10), so the
vectorization is done by hand; a sketch of the idea follows below.
The implementations are more restrictive than the BLAS interface, but could be extended
without significant overhead. The biggest limitation is N == LDA, i.e. the matrix must be
a square symmetric matrix stored without padding.
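
As a hedged sketch of the hand-vectorized inner loop (SSE2 double-pair intrinsics; the
function is illustrative, not the repository's exact code), the dot product consumes two
doubles per iteration in one XMM register:

    #include <emmintrin.h>  /* SSE2 intrinsics for pairs of doubles */

    static double sse_dot(const double *a, const double *b, int n)
    {
        __m128d acc = _mm_setzero_pd();
        int j = 0;
        for (; j + 2 <= n; j += 2) {
            __m128d va = _mm_loadu_pd(a + j);   /* unaligned load of a[j], a[j+1] */
            __m128d vb = _mm_loadu_pd(b + j);
            acc = _mm_add_pd(acc, _mm_mul_pd(va, vb));
        }
        double tmp[2];
        _mm_storeu_pd(tmp, acc);                /* spill the two partial sums */
        double sum = tmp[0] + tmp[1];
        for (; j < n; j++)                      /* scalar tail for odd n */
            sum += a[j] * b[j];
        return sum;
    }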

The graphs illustrate the performance of the three relevant implementations: the standard
cblas version, the inner-branch optimization, and the SSE parallelization. The
straightforward implementation is far too slow and would have broken the graph layout.
Judging by /proc/cpuinfo on each chassis, runtimes should be ordered Nehalem < Opteron < Quad:
Nehalem has a huge 12000KB cache compared to the 512KB of the Opteron architecture, with
the added benefit of a 3000MHz CPU clock against the Opteron's 2600MHz.
Each test is run five times and the average runtime is reported.
To measure only the running time, and not waste time on error checking, the NDEBUG flag
is defined.
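
An illustrative harness for this methodology (the driver names here are assumptions, not
the repository's actual code): each kernel runs five times, the wall-clock average is
reported, and compiling with -DNDEBUG strips the assert() result check so only the kernel
itself is timed.

    #include <assert.h>
    #include <time.h>

    extern int results_match(void);  /* hypothetical correctness checker */

    #define RUNS 5

    static double bench(void (*kernel)(void))
    {
        double total = 0.0;
        for (int r = 0; r < RUNS; r++) {
            clock_t t0 = clock();
            kernel();
            total += (double)(clock() - t0) / CLOCKS_PER_SEC;
            assert(results_match());  /* compiled out when NDEBUG is defined */
        }
        return total / RUNS;          /* average over the five runs */
    }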

For matrix sizes up to 16K, the original cblas implementation is the fastest;
I suspect it handles the cache very well, performing prefetches.
Optimizations of this kind are difficult to implement, however, and this remains an
unchecked suspicion. As the graphs show, for large matrices these prefetch schemes stop
helping, and I suspect they are also responsible for the spikes the cblas implementation
exhibits on all three architectures.
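
For illustration only (the actual cblas internals were not inspected), a software-prefetch
scheme of the suspected kind would hint the cache to pull in the next row while the
current one is being processed:

    #include <stddef.h>
    #include <xmmintrin.h>  /* _mm_prefetch */

    static void hint_next_row(const double *A, int n, int i)
    {
        if (i + 1 < n)  /* request row i+1 into the L1 cache ahead of use */
            _mm_prefetch((const char *)&A[(size_t)(i + 1) * n], _MM_HINT_T0);
    }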

About

BLAS DSYMV benchmarking on three architectures (Nehalem, Opteron, and Quad), with graphs
and the sources of two personal implementations. A continuation of a school assignment.
