SCR caches checkpoint data in storage on the compute nodes of a Linux cluster to provide a fast, scalable checkpoint / restart capability for MPI codes.

========================================================================
Scalable Checkpoint / Restart (SCR) Library
========================================================================

The Scalable Checkpoint / Restart (SCR) library enables MPI applications
to utilize distributed storage on Linux clusters to attain high file I/O
bandwidth for checkpointing and restarting large-scale jobs. With SCR,
jobs run more efficiently, lose less work upon a failure, and reduce
load on critical shared resources such as the parallel file system and
the network infrastructure.

In the current SCR implementation, application checkpoint files are
cached in storage local to the compute nodes, and a redundancy scheme is
applied such that the cached files can be recovered even after a failure
disables part of the system. SCR supports the use of spare nodes such
that it is possible to restart a job directly from its cached
checkpoint, provided the redundancy scheme holds and provided there are
sufficient spares.
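
As a generic illustration of how a redundancy scheme lets a lost file be
rebuilt, the sketch below applies simple XOR parity to equal-sized
buffers.  It is not SCR's implementation: the function names and the
in-memory buffers are assumptions made for the example, and it only
recovers a single missing block.  The schemes SCR actually applies are
described in the manuals.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: XOR `count` equal-sized blocks into `parity`.
 * blocks[i] points to `len` bytes; a NULL entry (a lost block) is
 * skipped. */
void xor_parity(const unsigned char *const *blocks, size_t count,
                size_t len, unsigned char *parity)
{
    assert(blocks != NULL && parity != NULL);
    memset(parity, 0, len);
    for (size_t i = 0; i < count; i++) {
        if (blocks[i] == NULL) {
            continue;
        }
        for (size_t j = 0; j < len; j++) {
            parity[j] ^= blocks[i][j];
        }
    }
}

/* Rebuild the single missing block: XOR the surviving blocks together,
 * then XOR the result with the stored parity. */
void xor_rebuild(const unsigned char *const *blocks, size_t count,
                 size_t len, const unsigned char *parity,
                 unsigned char *rebuilt)
{
    assert(parity != NULL && rebuilt != NULL);
    xor_parity(blocks, count, len, rebuilt);
    for (size_t j = 0; j < len; j++) {
        rebuilt[j] ^= parity[j];
    }
}
```

Here the parity is computed while all blocks are present; after one
block is lost, passing NULL in its slot to xor_rebuild() reproduces its
contents.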

On current platforms, SCR scales linearly with the number of compute
nodes. It has been benchmarked at 1216 GB/s on 992 nodes of Coastal at
Lawrence Livermore National Laboratory (LLNL). At this scale, it is two
orders of magnitude faster than the parallel file system.

The SCR library is written in C, and it currently only provides a C
interface. It has been used in production at LLNL since late 2007 using
RAM disk on Linux / x86-64 / Infiniband clusters. Production runs are
now underway using solid-state drives on the same cluster architecture.
The implementation assumes SLURM is used as the resource manager. The
library is designed and intended to be ported to other platforms and
resource managers. SCR is an open source project under a BSD license
hosted at: http://scalablecr.sourceforge.net.

Detailed usage is provided in the user manual (scr_users_manual.pdf),
and implementation details are in the developer's manual
(scr_developers_manual.pdf).

-----------------------------
Dependencies
-----------------------------

SCR depends on a number of other software packages.

Required:

  pdsh -- parallel remote shell command (pdsh, dshbak)
    http://sourceforge.net/projects/pdsh
    e.g.,
      wget http://downloads.sourceforge.net/project/pdsh/pdsh/pdsh-2.18/pdsh-2.18.tar.bz2
      bunzip2 pdsh-2.18.tar.bz2
      tar -xf pdsh-2.18.tar
      pushd pdsh-2.18
      ./configure --prefix=/tmp
      make
      make install

  Date::Manip -- Perl module for date/time interpretation
    http://search.cpan.org/~sbeck/Date-Manip-5.54/lib/Date/Manip.pod

  SLURM -- resource manager
    http://sourceforge.net/projects/slurm

Optional:

  libyogrt -- your one get remaining time library
    Not currently available outside of LLNL.
    Enables a running job to determine how much time it has left
    in its resource allocation.

  io-watchdog -- tool to catch and terminate hanging jobs
    http://code.google.com/p/io-watchdog
    Spawns a thread that can kill an MPI process (and hence
    the job) when it detects that the process has not written
    data in some specified amount of time.  This is very useful
    to catch and kill hanging jobs, so that SCR can restart the
    job or fetch the latest checkpoint.

  MySQL -- for logging events
    http://www.mysql.com

SCR is designed to be ported to other resource managers, so with some
work, one may replace SLURM with something else.  Also, libyogrt is
optional and can similarly be replaced with something else.

-----------------------------
Configuration
-----------------------------

Most of the default settings in the SCR library can be adjusted.  One
may customize default settings via configure options, compile-time
defines, or a system configuration file.

In particular, the following parameters likely need to be adjusted for
other platforms:

Cache and control directories:
  It is critical to configure SCR with directory paths it should use
  for cache and control directories.  Each of these directories should
  correspond to a path leading to a storage device that is local to the
  compute node.  For more information, see the SCR directories section
  below.

scr.conf:
  A number of options can, and in some cases must, be specified for an
  SCR installation.  A convenient method to set these parameters is to
  create a system configuration file.  To help you get started, an
  example config file is provided in scr.conf.  Copy this file,
  customize it with your desired settings, and tell your SCR build where
  to look for it using the --with-scr-config-file configure option,
  e.g.,

  configure --with-scr-config-file=/etc/scr.conf ...

libyogrt support:
  If libyogrt is available, specify --with-yogrt[=PATH] during the
  configure.

  The function of libyogrt is to enable a job to determine how much
  time it has left in its current allocation.  While libyogrt is not
  available outside of LLNL, other computer centers often have some
  mechanism that provides similar functionality.  It should be
  straightforward to replace calls to libyogrt with something else.
  One must simply modify the scr_env_seconds_remaining() implementation
  in src/scr_env.c to return the number of seconds left in the job
  allocation.
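
  One possible replacement is sketched below, assuming the site's batch
  system exports the allocation end time as a Unix timestamp in an
  environment variable.  The variable name JOB_END_TIME and all three
  helper names are hypothetical; the real signature of
  scr_env_seconds_remaining() should be taken from src/scr_env.c.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <time.h>

/* Parse a decimal Unix timestamp.  Returns 0 on success, -1 if the
 * string is missing or malformed. */
int parse_epoch(const char *s, long long *out)
{
    assert(out != NULL);
    if (s == NULL || *s == '\0') {
        return -1;
    }
    errno = 0;
    char *endp = NULL;
    long long value = strtoll(s, &endp, 10);
    if (errno != 0 || *endp != '\0') {
        return -1;
    }
    *out = value;
    return 0;
}

/* Seconds left between `now` and `end`, clamped at 0 once expired. */
long seconds_between(long long now, long long end)
{
    return (end > now) ? (long) (end - now) : 0;
}

/* Hypothetical integration point: JOB_END_TIME is an assumed variable
 * name, not something SLURM or SCR defines.  Returns -1 if unknown. */
long seconds_remaining_from_env(void)
{
    long long end = 0;
    if (parse_epoch(getenv("JOB_END_TIME"), &end) != 0) {
        return -1;
    }
    return seconds_between((long long) time(NULL), end);
}
```

  Returning -1 for an unknown end time keeps "no information" distinct
  from "no time left"; whether the library expects that convention is
  also an assumption.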

MySQL DB Logging:
  SCR can log events to a MySQL database.  However, this feature is
  still in development.  To enable logging to MySQL, configure with
  --with-mysql. If you would like to give it a try, see the
  scr.mysql file for further instructions.

The compile-time defines are specified in src/scr_conf.h.  One may
modify the settings in this file before building to set different
defaults.  In many cases, this method avoids the need to create a
system configuration file.

-----------------------------
Build and install
-----------------------------

Use the standard configure / make / make install sequence to
configure, build, and install the library, e.g.:

  ./configure \
      --prefix=/usr/local/tools/scr-1.1 \
      --with-scr-config-file=/etc/scr.conf
  make
  make install

To uninstall:

  make uninstall

SCR is layered on top of MPI, so it must be built for the MPI library
installed on the system.  It uses the standard mpicc compiler wrapper.
If there are multiple MPI libraries, a separate SCR library must be
built for each MPI library.

If you are using the scr.conf file, you'll also need to edit this file
to match the settings on your system.  Please open this file and make
any necessary changes -- it is self-documented with comments.  After
modifying this file, copy it to the location specified in your configure
step.  The configure option simply informs the SCR install where to
look for the file; it does not modify or install the file.

-----------------------------
SCR directories
-----------------------------

There are three types of directories where SCR manages files.

(1) Cache directory
SCR instructs the application to write its checkpoint files to a cache
directory, and SCR computes and stores redundancy data here.  SCR can
be configured to use more than one cache directory in a single run.
The cache directory must be local to the compute node.

The cache directory must be large enough to hold at least one full
application checkpoint.  However, it is better if the cache is large
enough to store at least two checkpoints.  When using storage local
to the compute node, the cache must be roughly 0.5x-3x the size of
the node's memory to be effective.
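
The sizing guidance above can be turned into a quick check.  The sketch
below is only arithmetic on byte counts; the function names are
hypothetical, and the per-node checkpoint size is assumed to be known by
the caller.

```c
#include <assert.h>
#include <stdint.h>

/* Number of full per-node checkpoints that fit in the cache.
 * 0 means the cache is too small; 2 or more is the preferred case. */
uint64_t checkpoints_that_fit(uint64_t cache_bytes, uint64_t ckpt_bytes)
{
    assert(ckpt_bytes > 0);
    return cache_bytes / ckpt_bytes;
}

/* Cache size as a multiple of node memory, for comparison against the
 * rough 0.5x-3x range mentioned above. */
double cache_to_memory_ratio(uint64_t cache_bytes, uint64_t mem_bytes)
{
    assert(mem_bytes > 0);
    return (double) cache_bytes / (double) mem_bytes;
}
```

For example, a 16 GiB cache on a node with 32 GiB of memory gives a
ratio of 0.5 and holds two 6 GiB checkpoints.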

LLNL currently uses RAM disk for the cache directory, and LLNL is now
exploring SSDs.

By default, the library is compiled to use /tmp as the cache
directory.  To change this location, do any of the following:
  * change the SCR_CACHE_BASE value in src/scr_conf.h,
  * specify the new path during configure using the
     --with-scr-cache-base=PATH option,
  * or set the value in a system configuration file.
To enable multiple cache directories, one must use a system
configuration file.

(2) Control directory
SCR stores files to track its internal state in the control directory.
It also uses this directory to read commands from the user.  Currently,
this directory must be local to the compute node.  Files written to the
control directory are small, and they are read and written frequently,
so RAM disk works well for this.  There are a variety of files kept in
the control directory.  For some of these files, all processes maintain
their own file, and for others, only rank 0 accesses the file.

By default, the library is compiled to use /tmp as the control
directory.  To change this location, do any of the following:
  * change the SCR_CNTL_BASE value in src/scr_conf.h,
  * specify the new path during configure using the
     --with-scr-cntl-base=PATH option,
  * or set the value in a system configuration file.

(3) Parallel file system prefix directory
This is the directory where SCR will fetch and flush checkpoint sets
to the parallel file system.

The prefix directory defaults to the current working directory of the
application.  To override this, the user can set the SCR_PREFIX
parameter at run time.

-----------------------------
RAM disk
-----------------------------

On systems where compute nodes are diskless, RAM disk is often the only
option for a checkpoint cache.  When using RAM disk, one should
configure the hard limit to be around two-thirds of the amount of memory
on a node.  This is a high limit; however, Linux only allocates space as
needed, so it is safe to use a high setting.  Jobs which do not use RAM
disk will not be negatively affected by using a high setting.  This
limit must be set by the system administrators.
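
To estimate the two-thirds figure on a given node, one option is to read
the MemTotal line from /proc/meminfo on Linux.  The sketch below assumes
that line has the usual "MemTotal: <n> kB" form; the function names are
hypothetical, and how the limit is then applied is site-specific and not
covered here.

```c
#include <assert.h>
#include <stdio.h>

/* Parse a "MemTotal: <n> kB" line as found in Linux /proc/meminfo.
 * Returns 0 and stores the value in kB on success, -1 otherwise. */
int parse_memtotal_kb(const char *line, unsigned long long *kb)
{
    assert(line != NULL && kb != NULL);
    return (sscanf(line, "MemTotal: %llu kB", kb) == 1) ? 0 : -1;
}

/* Suggested RAM disk hard limit: about two-thirds of node memory.
 * Dividing first avoids overflow and rounds the result down. */
unsigned long long ramdisk_limit_kb(unsigned long long mem_kb)
{
    return mem_kb / 3 * 2;
}
```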

-----------------------------
SCR library and commands
-----------------------------

There are two primary components that make up SCR.

(1) Library
First, there's the library the application calls.  This library is
written in C and it makes calls to MPI and POSIX I/O functions to
manage the cached checkpoint files on the cluster.  The library informs
the application where it should write checkpoint files, and then the
library applies the redundancy scheme to those checkpoint files after
the application has written them.  It also supports a rebuild of lost
files upon a restart, and it supports flushes and fetches between the
cache and the prefix directory on the parallel file system.

(2) Commands
Second, there are commands that must be run before and after the job.
After a resource allocation has been granted but before a run is
launched, these commands check that the cache and control directories
are available and healthy on each node.  Nodes can be excluded or the
job can exit if this is not the case.  If multiple runs are attempted
within the same resource allocation (e.g., start run / node fails /
start second run in what's left of the allocation), it's best to do
this check of the cache and control directories before each run is
started to check that a node that used to be healthy hasn't gone bad.
At the end of a job, commands must be run to drain the latest
checkpoint from the cache directory to the parallel file system.
This is necessary since the library may not have flushed its latest
checkpoint before the last run was killed.

The commands are just as vital as the library in order for SCR to
function.  In particular, after a node failure, the commands must be
able to access the cache and control directories of the nodes which
are still healthy.  Thus, the resource manager must be configured to
enable these commands to run on the remaining nodes of an allocation,
which may have sustained one or more failed nodes. 

-----------------------------
Nodeset format
-----------------------------

Several of the commands process the nodeset allocated to the job.  The
node names are expected to be of the form
  clustername#
where clustername is a string of letters common for all nodes,
and # is the node number, like atlas35 and atlas173.

The scr_hostlist.pm Perl module compresses lists of such node names
into a form like so:
  atlas[31-43,45,48-51]
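
The compression can be sketched in C as follows.  This is an
illustration of the format only, not a port of scr_hostlist.pm: the
function name is hypothetical, the node numbers are assumed to be
sorted and unique, and a single node is written as "atlas[35]", which
may differ from what the Perl module emits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Compress sorted, unique node numbers into "prefix[a-b,c,...]".
 * Returns 0 on success, -1 if n is 0 or the buffer is too small. */
int compress_nodeset(const char *prefix, const int *nums, size_t n,
                     char *buf, size_t buflen)
{
    assert(prefix != NULL && nums != NULL && buf != NULL);
    if (n == 0) {
        return -1;
    }
    int w = snprintf(buf, buflen, "%s[", prefix);
    if (w < 0 || (size_t) w >= buflen) {
        return -1;
    }
    size_t used = (size_t) w;

    size_t i = 0;
    while (i < n) {
        /* Extend the run while the numbers stay consecutive. */
        size_t j = i;
        while (j + 1 < n && nums[j + 1] == nums[j] + 1) {
            j++;
        }
        const char *sep = (i == 0) ? "" : ",";
        if (j > i) {
            w = snprintf(buf + used, buflen - used, "%s%d-%d",
                         sep, nums[i], nums[j]);
        } else {
            w = snprintf(buf + used, buflen - used, "%s%d",
                         sep, nums[i]);
        }
        if (w < 0 || (size_t) w >= buflen - used) {
            return -1;
        }
        used += (size_t) w;
        i = j + 1;
    }

    w = snprintf(buf + used, buflen - used, "]");
    if (w < 0 || (size_t) w >= buflen - used) {
        return -1;
    }
    return 0;
}
```

Called with the prefix "atlas" and the numbers 31 through 43, 45, and
48 through 51, it fills the buffer with the string shown above.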

-----------------------------
IPv4
-----------------------------

Currently, SCR expects to run on a system whose nodes have IPv4
addresses, and each node should have a unique IPv4 address.  The
library uses this address to determine which processes are running
on the same compute node.
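
A minimal sketch of that grouping is shown below.  It assumes the IPv4
address of every rank has already been gathered into one array (for
example with an MPI collective, which is omitted here), and it picks
the lowest rank on each node as that node's master.  The function name
and the choice of master are assumptions for illustration; the library
may select it differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* For each rank i, store in master[i] the lowest rank that shares
 * rank i's IPv4 address, i.e. the first rank found on the same node.
 * This simple version is quadratic in the number of ranks. */
void find_node_masters(const uint32_t *addr, size_t nranks,
                       size_t *master)
{
    assert(addr != NULL && master != NULL);
    for (size_t i = 0; i < nranks; i++) {
        master[i] = i;
        for (size_t j = 0; j < i; j++) {
            if (addr[j] == addr[i]) {
                master[i] = j;
                break;
            }
        }
    }
}
```

Two ranks are on the same node exactly when their master entries are
equal, which is why each node needs a unique address.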

-----------------------------
SLURM dependencies
-----------------------------

The current implementation assumes it is running on a system using
SLURM as the resource manager.  However, it should be straightforward
to extend SCR to work with other resource managers.  Here are the
locations where SCR currently depends on SLURM.

  scripts/scr_env.in
    --nodes option
      reads SLURM_NODELIST to get the nodeset allocated to a job
    --jobid
      reads SLURM_JOBID to get jobid
    --down
      reads SLURM_NODELIST and invokes "sinfo" to identify known down
      nodes

  scripts/scr_halt.in
    --all and --user options
      invokes "squeue" to determine jobids
    --immediate option
      calls "scontrol" and "scancel" to determine SLURM jobstep and to
      stop job
    all options
      call "squeue" to verify jobid is really running
      call "squeue" to get nodeset for jobid to run pdsh

  scripts/scr_srun.in
    calls "srun" to run the job
    attempts to avoid running on bad nodes via "srun -x"
    attempts to ensure rank 0 runs on the same node via "srun -w"

  src/scr_env.c
    scr_env_jobid() - reads SLURM_JOBID to get jobid string
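
  When porting to another resource manager, a lookup of this kind can
  be generalized by trying several environment variables in turn.  The
  sketch below is not the code in scr_env_jobid(); the helper names are
  hypothetical, and PBS_JOBID is used only as an example of a variable
  set by a different resource manager.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Return the first non-empty value among the named environment
 * variables, or NULL if none is set. */
const char *first_env(const char *const *names, size_t count)
{
    assert(names != NULL);
    for (size_t i = 0; i < count; i++) {
        const char *value = getenv(names[i]);
        if (value != NULL && *value != '\0') {
            return value;
        }
    }
    return NULL;
}

/* Job ID lookup: SLURM_JOBID first, then PBS_JOBID as an assumed
 * fallback for a non-SLURM system. */
const char *job_id(void)
{
    static const char *const names[] = { "SLURM_JOBID", "PBS_JOBID" };
    return first_env(names, sizeof names / sizeof names[0]);
}
```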

Additionally, the current implementation depends on SLURM for the
following features:
  * To clean the cache and control directories between job allocations
    via an rm command executed in the SLURM prolog or epilog.
  * To kill a job which has suffered a failure.
  * To allow the batch script to continue running in an allocation
    which has lost one or more nodes.

-----------------------------
pdsh dependencies
-----------------------------

The current implementation uses pdsh in order to run commands
on the nodes allocated to the job.  One could replace pdsh
with some work.  Here are the locations where pdsh is used.

  scripts/scr_flush.in
    invokes pdsh to run "scr_copy" on each node to copy files from
    cache to the parallel file system

  scripts/scr_halt.in
    invokes pdsh to run "scr_halt_cntl" on each node to update halt
    file

  scripts/scr_inspect.in
    invokes pdsh to run "scr_inspect_cache" on each node to identify
    which checkpoints need to be flushed, if any

  scripts/scr_list_down_nodes.in
    invokes pdsh to run "scr_check_node" on each node

-----------------------------
Dependency on files in common directory
-----------------------------

Some commands must be able to read / write files also accessed by
the library.

  scripts/scr_env.in
    --runnodes option
      reads "nodes.scrinfo" from control directory to report the number
      of nodes used in the last run

  src/scr_flush_file.c
    reads the "flush.scrinfo" file from the control directory to
    identify whether any checkpoints need to be flushed from cache

  src/scr_inspect_cache.c
    reads "filemap.scrinfo" and "filemap_#.scrinfo" files to locate
    files in cache

  src/scr_copy.c
    reads "filemap.scrinfo" and "filemap_#.scrinfo" files to locate
    files in cache

  src/scr_halt_cntl.c
    reads and writes "halt.scrinfo"

  src/scr_retries_halt.c
    reads "halt.scrinfo" file to determine whether the current run
    should halt

  src/scr_transfer.c
    reads and writes "transfer.scrinfo" to get list of files for
    asynchronous flush

Here is how the control directory files are currently accessed.

  File                 Library                   Commands
  ---------------------------------------------------------------------
  halt.scrinfo         rank 0                    scr_halt_cntl,
                                                 scr_retries_halt
  nodes.scrinfo        master rank on each node  scr_env
  flush.scrinfo        master rank on each node  scr_flush_file
  filemap.scrinfo      master rank on each node  scr_inspect_cache,
                                                 scr_copy
  filemap_#.scrinfo    every rank                scr_inspect_cache,
                                                 scr_copy
  transfer.scrinfo     master rank on each node  scr_transfer

-----------------------------
Run location and program language
-----------------------------

Different commands may run in different places.

Compute node only:
  scripts/scr_check_node.in      perl
  src/scr_inspect_cache.c        C
  src/scr_copy.c                 C
  src/scr_halt_cntl.c            C
  src/scr_transfer.c             C

Batch script only:
  scripts/scr_prerun.in          bash
  scripts/scr_srun.in            bash
  scripts/scr_postrun.in         bash
  scripts/scr_flush.in           perl
  scripts/scr_env.in             perl
  scripts/scr_cntl_dir.in        perl
  scripts/scr_list_down_nodes.in perl
  src/scr_log_event.c            C
  src/scr_log_transfer.c         C
  src/scr_retries_halt.c         C
  src/scr_flush_file.c           C
  src/scr_nodes_file.c           C

Within or external to batch script:
  scripts/scr_halt.in            perl
  scripts/scr_glob_hosts.in      perl
  src/scr_index.c                C
  src/scr_rebuild_xor.c          C
  src/scr_crc32.c                C
  src/scr_print_hash_file.c      C
