GitHub - hcho3/dstedc_mgpu: Memory-efficient divide-and-conquer with automatic profiling

This repository has been archived by the owner on Jun 24, 2020. It is now read-only.

Name		Name	Last commit message	Last commit date
Latest commit History 32 Commits
c		c
cuda-old		cuda-old
magma		magma
matio		matio
matlab		matlab
prof		prof
testmat		testmat
.gitignore		.gitignore
LICENSE		LICENSE
Makefile		Makefile
README		README
cfg.cu		cfg.cu
dlaed0_m.cu		dlaed0_m.cu
dlaed1_cpu.cu		dlaed1_cpu.cu
dlaed1_gpu.cu		dlaed1_gpu.cu
dlaed1_ph2.cu		dlaed1_ph2.cu
dlaed2.cu		dlaed2.cu
dlaed3_cpu.cu		dlaed3_cpu.cu
dlaed3_gpu.cu		dlaed3_gpu.cu
dlaed3_ph2.cu		dlaed3_ph2.cu
dlaed4_cpu.cu		dlaed4_cpu.cu
dlaed4_gpu.cu		dlaed4_gpu.cu
dlamrg.cu		dlamrg.cu
dlapy2.cu		dlapy2.cu
dstedc.h		dstedc.h
initial_guess_cpu.cu		initial_guess_cpu.cu
initial_guess_gpu.cu		initial_guess_gpu.cu
main.cu		main.cu
make.inc		make.inc
middle_way_cpu.cu		middle_way_cpu.cu
middle_way_gpu.cu		middle_way_gpu.cu
nvtx.cu		nvtx.cu
nvtx.h		nvtx.h
params.cfg		params.cfg
safety.cu		safety.cu
safety.h		safety.h
tester.sh		tester.sh
timer.cu		timer.cu
timer.h		timer.h
workspace.cu		workspace.cu

Repository files navigation

Memory-efficient divide-and-conquer with automatic profiling

SYNOPSIS
    Divide-and-conquer algorithm is a numerically stable and efficient
    algorithm that computes the eigenvalues and eigenvectors of a symmetric
    tridiagonal matrix. We often face the situation where the input matrix fits
    into the main memory but not into the on-chip memory of a GPU device. We
    present an out-of-core implementation where only part of the input matrix
    is resident in GPU memory at any point in time. It works independently of
    the physical size of GPU memory, handling any size of input as long as it
    fits into the main memory. Work is dynamically allocated to multiple GPUs
    and CPU cores, taking account of available workspaces and progress of the
    algorithm. In addition, it delivers a performance comparable to that of
    conventional multi-GPU implementations for cases where workspaces fit into
    the GPU memory.

DOCUMENTATION
    See http://www.hyunsu-cho.org/dstedc.html.

DEPENDENCIES
    1. ATLAS (Automatically Tuned Linear Algebra Software)
       http://math-atlas.sourceforge.net/

    2. CUDA extention library (included in CUDA toolkit)
    
    3. gcc with OpenMP support

HOW TO COMPILE
    1. Open make.inc and edit the system paths (lines 2-4) as necessary.
       Also revise the compilation options (lines 7-8); if your NVIDIA GPU is
       Fermi or older, use sm_25 or lower. Adjust the flags (lines 14-16) if
       you'd like.

    2. Run make. It will build all the components.

HOW TO RUN
    A crucial component of this package is automatic profiling of the current
    machine configuration. The main program depends on the performance
    parameters that are automatically detected by the profiling component. Make
    sure to run the profiler first to obtain the needed parameters.

    1. (First time) Run ./profile. This may take some time. Performance
       parameters are saved to params.cfg. This step is needed only for the
       first time use.

    2. Run the main program ./dstedc. Command examples are found in tester.sh.

    3. Both inputs and outputs are stored as *.bin files. To read and write
       *.bin matrices from MATLAB, add matio/ to the working path and use
       read_bin() and write_bin().

       Example (MATLAB):
           addpath('./matio');
           D = read_bin('testmat/dlaed1/D_32768.bin');
           E = read_bin('testmat/dlaed1/E_32768.bin');
           A = diag(D) + diag(E,1) + diag(E,-1);
           write_bin('A.bin', A);


BINARY MATRIX FILE FORMAT
    The *.bin file format follows a very simplistic layout:

    * First 8 bytes: number of rows in the matrix
    * Next 8 bytes: number of columns in the matrix
    * All the following bytes: all entries of the matrix laid out in
      column-major format.

    The current implementation does not consider endian compatibility. All the
    *.bin files included in the package were generated in a little-endian
    machine and thus are incompatible with big-endian machines.


FUNCTION SUMMARY
    The organization follows that of LAPACK. We direct the reader to
    http://www.netlib.org/lapack/lawnspdf/lawn69.pdf for more information.
    - dlaed0: the entry point of the divide-and-conquer eigensolver
    - dlaed1: Call dlaed2 and dlaed3 so as to merge the eigen-decompositions of 
              two adjacent submatrices. Also back-transform the eigenvectors
              returned by dlaed3.
    - dlaed2: Perform deflation.
    - dlaed3: Call dlaed4 and then solve an inverse eigenvalue problem to obtain
              a set of eigenvectors.
    - dlaed4: Compute each eigenvalue in the merged eigendecomposition by
              solving the secular equation.