🎨 Buttery Smooth Automatic Differentiation in C++


An automatic differentiation library that uses reverse-mode differentiation (backpropagation) to differentiate recurrent neural networks, or most mathematical expressions, through control flow, while loops, and recursion.

This is a reimagination of Andrej Karpathy's recurrentJS (GitHub) in C++. It has similar API names, but the backbone uses MShadow and the C++11 standard library.

@authors Jonathan Raiman and Szymon Sidor


Why not use Theano?

Theano is a fantastic tensor and automatic differentiation library, with excellent packages for Deep Learning. Unfortunately, it cannot differentiate through control flow, and computation graphs with many nodes and recurrence require long compilation times (this may change somewhat with the arrival of John Schulman's Computation Graph Toolkit). Long compilation times can be alleviated by moving most operations out of scan loops, but this strongly limits expressivity or complicates the code. Finally, because of the separation between the computation and the mathematical description, debugging can be hard.

(Note: Hypergrad offers gradients through control flow, but does not match the performance of Theano.)

Why not use Torch?

Torch has excellent community support and a wide variety of packages for Deep Learning, including the popular NN and NN Graph packages, which permit automatic differentiation of Torch Tensors. However, use of these packages requires the definition of forward and backward passes, module / param cloning (see the Torch utilities inside Andrej Karpathy's Char-RNN code), pre-allocation of memory when performing recurrence, and the concatenation of all parameters when optimizing in Optim, the de facto solver for Torch. Additionally, transferring computation to the GPU demands pre-allocation of memory, which can be problematic in the case of memory-hungry tasks. Finally, running computations in parallel (Hogwild or otherwise) is tricky in Lua / Torch.


Run a super duper simple example

Create two 3x3 matrices filled with uniform random noise between -2 and 2:

Mat<float> A(3,3, weights<float>::uniform(-2.0, 2.0));
Mat<float> B(3,3, weights<float>::uniform(-2.0, 2.0));

Now let's multiply them:

auto C = A * B;

Now let's take the gradient of the sum of the squared entries of this product:

auto error = (C ^ 2).sum();

And get the gradient of error with respect to A and B:


error.grad();
graph::backward();

auto A_gradient = A.dw();
auto B_gradient = B.dw();

Behind the scenes:

Each matrix has a companion matrix called dw that holds its elementwise gradients. When we multiply the matrices together we create a new output matrix called C, and we also add this operation to our computational graph (held by a thread-local variable in graph::tape). When we reach (C ^ 2).sum(), the squaring and the sum are added to the graph as well.

Computing the gradient is done in two steps. First we tell our graph what the objective function is:

error.grad();

error needs to be a scalar (a 1x1 matrix in this implementation) to use grad(). The second step is to call graph::backward(), which goes through every operation executed so far in reverse order using graph::tape's record. As we run through the operations backward, we update the gradients of each intermediate object until the dw of A and the dw of B get updated. Those are the gradients we were looking for.

Run a simple (yet advanced) example

Let's run a simple example. We will use data from Paul Graham's blog to train a language model. This way we can generate random pieces of startup wisdom at will! After about 5-10 minutes of training time you should see it generate sentences that sort of make sense. To do this, go to the build folder and run:

examples/language_model --flagfile ../flags/language_model_simple.flags

  • A more extensive example for training a language model can be found under: examples/language_model.cpp.
  • For a more in-depth description of usage, see the character model tutorial.
  • For a funny example where you teach stacked LSTMs about multiplication, subtraction, and addition, check this out.


Get GFlags, HiRedis, Clang, and protobuf, then head to the build folder and use cmake to configure and create the appropriate Makefiles.

You need the latest version of Clang (>= 3.6.0).

1. Dependency Installation

1.a on Mac OSX

brew install cmake
brew install gflags
HOMEBREW_CC=clang HOMEBREW_CXX=clang++ brew install protobuf
brew install libev
HOMEBREW_CC=clang HOMEBREW_CXX=clang++ brew install hiredis
cmake ..

1.b on Fedora Linux

yum install make cmake
yum install blas blas-devel
yum install openblas openblas-devel
yum install clang
yum install gflags gflags-devel
yum install sqlite-devel
yum install protobuf protobuf-devel protobuf-compiler
yum install libev libev-devel
yum install hiredis hiredis-devel

If cblas.h is not found during compilation, installing the ATLAS SSE2 development package fixes the problem:

yum install atlas-sse2-devel

2. Compilation

Then use cmake to create the make targets, and run make to compile the code:

With CUDA (if available):

git submodule init
git submodule update
cd build
cmake ..
make -j 9

Without CUDA:

git submodule init
git submodule update
cd build_cpu
cmake .. -DWITH_CUDA=false
make -j 9

That's it. The built examples are stored in build/examples. For instance, a character prediction model using stacked LSTMs is built at build/examples/character_prediction.


To compile and run the tests you need Google Test (gtest); download it first.

1. Compile and run tests

From the build (or build_cpu) folder do the following:

cmake ..
make -j 9 run_tests

2.a Install Gtest on Mac OSX

Homebrew does not offer a way of installing gtest, but you can get it running in a few steps:

cd gtest-1.7.0
mkdir mybuild
cd mybuild
cmake ..
make -j 9
cp libgtest_main.a /usr/local/lib/libgtest_main.a
cp libgtest.a /usr/local/lib/libgtest.a
cp -R ../include/* /usr/local/include/
cd ../..
rm -rf gtest-1.7.0

2.b Install Gtest on Fedora Linux

Using yum it's a piece of cake:

sudo yum install gtest gtest-devel

Latest Clang compiler on Mac OSX

Until Apple decides to fully embrace the thread_local abstraction, we are sadly forced to update our compilers manually (and no, replacing it with __thread is not enough...). Here are the steps for updating your compiler:

# Go to
# Download "Clang for OSX" (tarball). Use version
# 3.6.0 or above
# Unpack .tar.xz (which will by default be in ~/Downloads)
tar xf CLANG.tar.xz
# Then cd into clang and copy to /usr/local:
cp -R ./* /usr/local/


In the utilities namespace you will find several tools to make data processing and saving easier.

To create folders similar to how os.makedirs works in Python, you can do:


Random integer between 0 and 2 (included):

utils::randint(0, 2);

Check whether a file is gzipped:


Sort the arguments of a list np.argsort style:

auto sorted_lengths = utils::argsort(lengths);

Future steps

  • Add ImageNet, Caffe loading, broader ConvNet support (currently have conv2d and conv1d, but no pooling)
  • Web interface for managing experiments (today Dali-visualizer only shows progress and sample predictions)
  • Web interface for visualizing network activity
  • Add some mathematical expressions from DeepMind's Torch Cephes module
  • Distribute training over multiple machines
  • Ensure feature parity with Python extension
  • Implement multi-GPU support with Fast Asynchronous Parallel SGD
  • Make it brew, yum/dnf and apt-get installable

Additional Notes

Debugging Assertion Failures

You can use gdb to debug assertion failures in Dali. The majority of the assertions in Dali use utils::assert2 instead of the usual assert macro to provide more informative error messages. It is easy to catch and trace these errors using gdb:

gdb --args example/dali_code.o arg1 arg2
catch throw

A stack trace for the assertion error should now appear.

Theme song

Suggested theme song


