eanger/byfl-eaudit

Byfl: Compiler-based Application Analysis

Description

Byfl helps application developers understand code performance in a hardware-independent way. It instruments your code at compile time, then gathers and reports data at run time. For example, suppose you wanted to know how many bytes are accessed by the following C code:

double array[100000][100];
volatile double sum = 0.0;

for (int row=0; row<100000; row++)
  sum += array[row][0];

Reading the hardware performance counters (e.g., using PAPI) can be misleading. The performance counters on most processors tally not the number of bytes but rather the number of cache-line accesses. Because the array is stored in row-major order, each access to array will presumably reference a different cache line while each access to sum will presumably reference the same cache line.

Byfl does the equivalent of transforming the code into the following:

unsigned long int bytes_accessed = 0;
double array[100000][100];
volatile double sum = 0.0;

for (int row=0; row<100000; row++) {
  sum += array[row][0];
  bytes_accessed += 3*sizeof(double);
}

In the above, one can consider the bytes_accessed variable as a "software performance counter," as it is maintained entirely by software.
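As a self-contained illustration of that idea, the sketch below wraps the transformed loop in a function and returns the counter. The function name `sum_first_column` and its `rows` parameter are illustrative assumptions, not part of Byfl, and the counter is written by hand here purely to show what a software performance counter tallies.

```c
#include <assert.h>
#include <stddef.h>

#define ROWS 100000
#define COLS 100

/* Zero-initialized array with the same shape as in the example above. */
static double array[ROWS][COLS];

/* Sum the first column of the first `rows` rows and return the number of
   bytes the loop accessed, tallied by a hand-written software counter. */
unsigned long sum_first_column(size_t rows)
{
    unsigned long bytes_accessed = 0;
    volatile double sum = 0.0;

    assert(rows <= ROWS);
    for (size_t row = 0; row < rows; row++) {
        sum += array[row][0];
        /* One load of array[row][0], one load of sum, one store of sum. */
        bytes_accessed += 3 * sizeof(double);
    }
    return bytes_accessed;
}
```

On a platform with 8-byte doubles, `sum_first_column(100000)` returns 2,400,000: 24 bytes per iteration, no matter how many cache lines those bytes span.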

In practice, however, Byfl doesn't do source-to-source transformations (unlike, for example, ROSE) as implied by the preceding code sample. Instead, it integrates into the LLVM compiler infrastructure as an LLVM compiler pass. If your application compiles with LLVM's Clang or any of the GNU compilers, you can instrument it with Byfl.

Because Byfl instruments code in LLVM's intermediate representation (IR), not native machine code, it outputs the same counter values regardless of target architecture. In contrast, binary-instrumentation tools such as Pin may tally operations differently on different platforms.

The name "Byfl" comes from "bytes/flops". The very first version of the code counted only bytes and floating-point operations (flops).

Installation

Automatic installation

Byfl relies on LLVM, Clang, and DragonEgg. These are huge and must be built from trunk (i.e., the post-3.2-release development code). The build-llvm-byfl script automatically downloads all of these plus Byfl, configures them, builds them, and installs the result into a directory you specify.

Byfl also relies on GCC. You should already have GCC installed and in your path before running build-llvm-byfl. The LLVM developers currently seem to do most of their testing with GCC 4.6, so that's your best bet for having everything work.

build-llvm-byfl takes one required argument, which is the root of the installation directory (e.g., /usr/local or /opt/byfl or whatnot). The following optional arguments can appear before the required argument:

-b build_dir
Build Byfl and its dependencies in directory build_dir.
Default: ./byfl-build.random/
-j parallelism
Specify the maximum number of processes to use for compilation, passed directly to make -j.
Default: number of entries in /proc/cpuinfo
-d
Download Byfl and its dependencies into build_dir, but don't configure, build, or install them.
Default: off
-c
Configure, build, and install Byfl and its dependencies without re-downloading them into build_dir.
Default: off
-t
Display progress textually instead of with a GUI progress bar (Zenity).
Default: GUI display if available

Manual installation

If, for whatever reason, you're unable to run the automatic build script, you can always manually build and install Byfl and its prerequisites. Byfl depends on LLVM (the compiler infrastructure), Clang (an LLVM-based C/C++ compiler), and DragonEgg (a technically optional but strongly recommended tool for using GCC compilers as LLVM front ends). See each project's own documentation for build instructions.

I use the following configure line in my top-level LLVM directory:

./configure --enable-optimized --enable-debug-runtime --enable-debug-symbols --disable-assertions CC=gcc CXX=g++ REQUIRES_RTTI=1

Run make to build LLVM and Clang and make install to install them. Then, with the LLVM bin directory in your path, run make in the DragonEgg directory. Copy dragonegg.so to your LLVM lib directory.

The following steps can then be used to build and install Byfl:

cd autoconf
yes $HOME/llvm | ./AutoRegen.sh
mkdir ../build
cd ../build
../configure --disable-assertions --enable-optimized --enable-debug-runtime --enable-debug-symbols DRAGONEGG=/usr/local/lib/dragonegg.so CXX=g++ CXXFLAGS="-g -O2 -std=c++0x" --with-llvmsrc=$HOME/llvm --with-llvmobj=$HOME/llvm
make
make install

The $HOME/llvm lines in the above refer to your LLVM source (not installation) directory. Also, be sure to adjust the location of dragonegg.so as appropriate.

Usage

Basic usage

Byfl comes with a set of wrapper scripts that simplify instrumentation. bf-gcc, bf-g++, bf-gfortran, and bf-gccgo wrap, respectively, the GNU C, C++, Fortran, and Go compilers. bf-mpicc, bf-mpicxx, bf-mpif90, and bf-mpif77 further wrap the similarly named Open MPI and MPICH wrapper scripts to use the Byfl compiler scripts instead of the default C, C++, and Fortran compilers. Use any of these scripts as you would the underlying compiler. When you run your code, Byfl will output a sequence of BYFL-prefixed lines to the standard output device:

BYFL_SUMMARY: -----------------------------------------------------------------
BYFL_SUMMARY:                     1,280 bytes (512 loaded + 768 stored)
BYFL_SUMMARY:                        32 flops
BYFL_SUMMARY:                       576 integer ops
BYFL_SUMMARY:                        67 memory ops (2 loads + 65 stores)
BYFL_SUMMARY: -----------------------------------------------------------------
BYFL_SUMMARY:                    10,240 bits (4,096 loaded + 6,144 stored)
BYFL_SUMMARY:                     6,144 flop bits
BYFL_SUMMARY:                    85,024 integer op bits
BYFL_SUMMARY: -----------------------------------------------------------------
BYFL_SUMMARY:                    0.6667 bytes loaded per byte stored
BYFL_SUMMARY:                  288.0000 integer ops per load instruction
BYFL_SUMMARY:                  152.8358 bits loaded/stored per memory op
BYFL_SUMMARY: -----------------------------------------------------------------
BYFL_SUMMARY:                   40.0000 bytes per flop
BYFL_SUMMARY:                    1.6667 bits per flop bit
BYFL_SUMMARY:                    2.2222 bytes per integer op
BYFL_SUMMARY:                    0.1204 bits per integer op bit
BYFL_SUMMARY: -----------------------------------------------------------------

"Bits" are simply bytes*8. "Flop bits" are the total number of bits in all inputs and outputs of each floating-point operation. As motivation, consider the operation A = B + C, where A, B, and C reside in memory. This operation consumes 12 bytes per flop if the arguments are all single-precision but 24 bytes per flop if the arguments are all double-precision. Similarly, A = -B consumes either 8 or 16 bytes per flop based on the argument type. However, all of these examples consume one bit per flop bit regardless of numerical precision: every bit loaded or stored either enters or exits the floating-point unit. Bit:flop-bit ratios above 1.0 imply that more memory is moved than is fed into the floating-point unit; ratios below 1.0 imply register reuse.
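The arithmetic in the preceding paragraph can be reproduced with a few lines of C. This is only a sketch: the `op_tally` structure and both function names are made up for illustration and do not correspond to anything in Byfl's sources.

```c
#include <assert.h>

/* Tallies for a run of floating-point work.  All names are illustrative. */
struct op_tally {
    unsigned long mem_operands;  /* values loaded from or stored to memory */
    unsigned long fp_operands;   /* inputs plus outputs of all flops */
    unsigned long flops;         /* floating-point operations performed */
    unsigned operand_bytes;      /* 4 for single precision, 8 for double */
};

/* Bytes moved to or from memory per floating-point operation. */
double bytes_per_flop(struct op_tally t)
{
    assert(t.flops > 0);
    return (double)(t.mem_operands * t.operand_bytes) / (double)t.flops;
}

/* Bits moved to or from memory per bit entering or leaving the FPU. */
double bits_per_flop_bit(struct op_tally t)
{
    unsigned long memory_bits = t.mem_operands * t.operand_bytes * 8;
    unsigned long flop_bits = t.fp_operands * t.operand_bytes * 8;

    assert(flop_bits > 0);
    return (double)memory_bits / (double)flop_bits;
}
```

For A = B + C with everything in memory, `{3, 3, 1, 4}` gives 12 bytes per flop and `{3, 3, 1, 8}` gives 24, while both give exactly 1.0 bits per flop bit. If only one of the three operands touched memory (`{1, 3, 1, 8}`), the bit ratio would drop to about 0.33, which is the register-reuse case.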

The Byfl wrapper scripts accept a number of options to provide more information about your program at a cost of increased execution times. The following can be specified either on the command line or within the BF_OPTS environment variable. (The former takes precedence.)

-bf-types
Tally type-specific loads and stores of register-friendly types. The current set of included types is: single- and double-precision floating-point values; 8-, 16-, 32-, and 64-bit integer values; and pointers. Remaining types are categorized as other types.
-bf-inst-mix
Track the overall instruction mix of the program. This counts the number of times each instruction in the intermediate representation is issued and produces a histogram output at the end of program execution. Details on the intermediate language can be found in the LLVM Language Reference Manual.
-bf-every-bb
Output counters for every basic block executed.
-bf-merge-bb=number
When used with -bf-every-bb, merge every number basic-block readings into a single line of output. (I typically specify -bf-merge-bb=1000000.)
-bf-vectors
Report statistics on vector operations (element sizes and number of elements). Unfortunately, at the time of this writing (July 2012), LLVM's autovectorizer is extremely limited and is unable to manipulate arbitrary-length vectors—even though the IR supports them.
-bf-by-func
Output counters for every function executed.
-bf-call-stack
When used with -bf-by-func, distinguish functions by call path. That is, if function f calls functions g and h, -bf-by-func by itself will output counts for each of the three functions while including -bf-call-stack will output counts for the two call stacks f→g and f→h.
-bf-include=function1[,function2,…]
Instrument only the named functions. function can be a symbol name (as reported by nm), a demangled C++ symbol name (as reported by nm -C), or @filename, in which case a list of functions is read from file filename, one function per line.
-bf-exclude=function1[,function2,…]
Instrument all but the named functions. function can be a symbol name (as reported by nm), a demangled C++ symbol name (as reported by nm -C), or @filename, in which case a list of functions is read from file filename, one function per line.
-bf-thread-safe
Indicate that the application is multithreaded (e.g., with Pthreads or OpenMP) so Byfl should protect all counter updates.
-bf-unique-bytes
Keep track of unique memory locations accessed. For example, if a program accesses 8 bytes at address A, then at B, then at A again, Byfl will report this as 24 bytes but only 16 unique bytes.
-bf-mem-footprint
Output the program's memory footprint in terms of the amount of memory needed to represent various fractions of the total number of memory accesses.
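To make the distinction between bytes and unique bytes in -bf-unique-bytes concrete, here is a minimal sketch of a tracker that keeps one bit per byte address. It assumes a toy address space of `TRACKED_BYTES` addresses so that a flat bit vector suffices; the structure and function names are hypothetical and are not Byfl's own.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TRACKED_BYTES 4096  /* size of the toy address space (assumption) */

struct byte_tracker {
    uint64_t total_bytes;              /* every byte accessed, with repeats */
    uint64_t unique_bytes;             /* distinct byte addresses accessed */
    uint8_t  seen[TRACKED_BYTES / 8];  /* one bit per byte address */
};

/* Reset all counters and clear the bit vector. */
void tracker_init(struct byte_tracker *t)
{
    memset(t, 0, sizeof *t);
}

/* Record an access of `nbytes` bytes starting at toy address `addr`. */
void tracker_access(struct byte_tracker *t, size_t addr, size_t nbytes)
{
    assert(addr <= TRACKED_BYTES && nbytes <= TRACKED_BYTES - addr);
    for (size_t a = addr; a < addr + nbytes; a++) {
        uint8_t mask = (uint8_t)(1u << (a % 8));
        if (!(t->seen[a / 8] & mask)) {
            t->seen[a / 8] |= mask;   /* first time this byte is touched */
            t->unique_bytes++;
        }
        t->total_bytes++;
    }
}
```

Recording 8 bytes at address 0, 8 bytes at address 64, and 8 bytes at address 0 again leaves `total_bytes` at 24 and `unique_bytes` at 16, matching the A, B, A example in the option description.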

Almost all of the options listed above incur a cost in execution time and memory footprint. -bf-unique-bytes is very slow and very memory-hungry: It performs a hash-table lookup and a bit-vector write (and multiple of those if used with -bf-by-func) for every byte read or written by the program. -bf-mem-footprint is likewise very slow and very memory-hungry: It updates a 32-bit counter (accessed via a hash-table lookup) for every byte read or written by the program, implying that it requires 4x the memory of the uninstrumented code.
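One plausible reading of the -bf-mem-footprint report is: sort the per-byte access counters from most to least frequently accessed, then ask how many bytes are needed to account for a given fraction of all accesses. The sketch below implements that reading. It is an assumption about the report's meaning, not a description of Byfl's code, and `bytes_to_cover` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator: larger access counts first. */
static int by_count_descending(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *)a;
    uint32_t y = *(const uint32_t *)b;
    return (x < y) - (x > y);
}

/* Given one access count per byte touched, return how many of the most
   frequently accessed bytes are needed to cover `percent` percent of all
   accesses.  Returns 0 for an empty input or if allocation fails. */
size_t bytes_to_cover(const uint32_t *counts, size_t nbytes, unsigned percent)
{
    assert(percent <= 100);
    if (nbytes == 0)
        return 0;

    uint32_t *sorted = malloc(nbytes * sizeof *sorted);
    if (sorted == NULL)
        return 0;
    memcpy(sorted, counts, nbytes * sizeof *sorted);
    qsort(sorted, nbytes, sizeof *sorted, by_count_descending);

    uint64_t total = 0;
    for (size_t i = 0; i < nbytes; i++)
        total += sorted[i];

    /* Integer arithmetic avoids rounding trouble at exact thresholds. */
    uint64_t covered = 0;
    size_t needed = 0;
    while (needed < nbytes && covered * 100 < total * percent)
        covered += sorted[needed++];

    free(sorted);
    return needed;
}
```

With 512 bytes accessed twice each and 256 bytes accessed once (1,280 accesses in all), this returns 512 for 80 percent and 768 for 100 percent.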

The following represents some sample output from a code instrumented with Byfl and most of the preceding options:

BYFL_INFO: Byfl command line: -bf-inst-mix -bf-by-func -bf-vectors -bf-unique-bytes -bf-every-bb -bf-types -bf-mem-footprint
BYFL_BB_HEADER:             LD_bytes             ST_bytes               LD_ops               ST_ops                Flops              FP_bits              Int_ops          Int_op_bits
BYFL_BB:                           0                    0                    0                    0                    0                    0                    4                  192
BYFL_BB:                           0                   16                    0                    2                    2                  192                   20                 2626
BYFL_BB:                           0                   16                    0                    2                    2                  192                   20                 2626
BYFL_BB:                           0                   16                    0                    2                    2                  192                   20                 2626
BYFL_BB:                           0                   16                    0                    2                    2                  192                   20                 2626
BYFL_BB:                           0                   16                    0                    2                    2                  192                   20                 2626
BYFL_BB:                           0                   16                    0                    2                    2                  192                   20                 2626
BYFL_BB:                           0                   16                    0                    2                    2                  192                   20                 2626
BYFL_BB:                           0                   16                    0                    2                    2                  192                   20                 2626
BYFL_BB:                           0                   16                    0                    2                    2                  192                   20                 2626
BYFL_BB:                           0                   16                    0                    2                    2                  192                   20                 2626
BYFL_BB:                           0                   16                    0                    2                    2                  192                   20                 2626
BYFL_BB:                           0                   16                    0                    2                    2                  192                   20                 2626
BYFL_BB:                           0                   16                    0                    2                    2                  192                   20                 2626
BYFL_BB:                           0                   16                    0                    2                    2                  192                   20                 2626
BYFL_BB:                           0                   16                    0                    2                    2                  192                   20                 2626
BYFL_BB:                           0                   16                    0                    2                    2                  192                   20                 2626
BYFL_BB:                           0                   16                    0                    2                    2                  192                   20                 2626
BYFL_BB:                           0                   16                    0                    2                    2                  192                   20                 2626
BYFL_BB:                           0                   16                    0                    2                    2                  192                   20                 2626
BYFL_BB:                           0                   16                    0                    2                    2                  192                   20                 2626
BYFL_BB:                           0                   16                    0                    2                    2                  192                   20                 2626
BYFL_BB:                           0                   16                    0                    2                    2                  192                   20                 2626
BYFL_BB:                           0                   16                    0                    2                    2                  192                   20                 2626
BYFL_BB:                           0                   16                    0                    2                    2                  192                   20                 2626
BYFL_BB:                           0                   16                    0                    2                    2                  192                   20                 2626
BYFL_BB:                           0                   16                    0                    2                    2                  192                   20                 2626
BYFL_BB:                           0                   16                    0                    2                    2                  192                   20                 2626
BYFL_BB:                           0                   16                    0                    2                    2                  192                   20                 2626
BYFL_BB:                           0                   16                    0                    2                    2                  192                   20                 2626
BYFL_BB:                           0                   16                    0                    2                    2                  192                   20                 2626
BYFL_BB:                           0                   16                    0                    2                    2                  192                   20                 2626
BYFL_BB:                           0                   16                    0                    2                    2                  192                   20                 2626
BYFL_BB:                         512                  256                    2                    1                   32                 6144                  102                10656
BYFL_FUNC_HEADER:             LD_bytes             ST_bytes               LD_ops               ST_ops                Flops              FP_bits              Int_ops          Int_op_bits           Uniq_bytes             Cond_brs          Invocations Function
BYFL_FUNC:                         512                  768                    2                   65                   96                12288                  746                94880                  768                   32                    1 main
BYFL_CALLEE_HEADER:   Invocations Byfl Function
BYFL_CALLEE:                    1 No   __printf_chk
BYFL_VECTOR_HEADER:             Elements             Elt_bits Type                Tally Function
BYFL_VECTOR:                          32                   64 FP                      1 main
BYFL_SUMMARY: -----------------------------------------------------------------
BYFL_SUMMARY:                     1,280 bytes (512 loaded + 768 stored)
BYFL_SUMMARY:                       768 unique bytes
BYFL_SUMMARY:                        96 flops
BYFL_SUMMARY:                       549 integer ops
BYFL_SUMMARY:                        67 memory ops (2 loads + 65 stores)
BYFL_SUMMARY:                        34 branch ops (1 unconditional and direct + 32 conditional or indirect + 1 other)
BYFL_SUMMARY:                       746 TOTAL OPS
BYFL_SUMMARY: -----------------------------------------------------------------
BYFL_SUMMARY:                         2 loads of vectors of 64-bit floating-point values
BYFL_SUMMARY:                        64 stores of 64-bit floating-point values
BYFL_SUMMARY:                         1 stores of vectors of 64-bit floating-point values
BYFL_SUMMARY: -----------------------------------------------------------------
BYFL_SUMMARY:                    10,240 bits (4,096 loaded + 6,144 stored)
BYFL_SUMMARY:                     6,144 unique bits
BYFL_SUMMARY:                    12,288 flop bits
BYFL_SUMMARY:                    94,880 op bits (excluding memory ops)
BYFL_SUMMARY: -----------------------------------------------------------------
BYFL_SUMMARY:                         1 vector operations (FP & int)
BYFL_SUMMARY:                   32.0000 elements per vector
BYFL_SUMMARY:                   64.0000 bits per element
BYFL_SUMMARY: -----------------------------------------------------------------
BYFL_SUMMARY:                        96 Add            instructions executed
BYFL_SUMMARY:                        67 BitCast        instructions executed
BYFL_SUMMARY:                        65 Store          instructions executed
BYFL_SUMMARY:                        64 SIToFP         instructions executed
BYFL_SUMMARY:                        64 Trunc          instructions executed
BYFL_SUMMARY:                        64 GetElementPtr  instructions executed
BYFL_SUMMARY:                        64 SRem           instructions executed
BYFL_SUMMARY:                        64 Mul            instructions executed
BYFL_SUMMARY:                        33 Br             instructions executed
BYFL_SUMMARY:                        32 ICmp           instructions executed
BYFL_SUMMARY:                        32 Shl            instructions executed
BYFL_SUMMARY:                        32 PHI            instructions executed
BYFL_SUMMARY:                         3 Alloca         instructions executed
BYFL_SUMMARY:                         2 Call           instructions executed
BYFL_SUMMARY:                         2 Load           instructions executed
BYFL_SUMMARY:                         2 ExtractElement instructions executed
BYFL_SUMMARY:                         1 Ret            instructions executed
BYFL_SUMMARY:                         1 FMul           instructions executed
BYFL_SUMMARY:                       688 TOTAL          instructions executed
BYFL_SUMMARY: -----------------------------------------------------------------
BYFL_SUMMARY:                       512 bytes cover  80.0% of memory accesses
BYFL_SUMMARY:                       768 bytes cover 100.0% of memory accesses
BYFL_SUMMARY: -----------------------------------------------------------------
BYFL_SUMMARY:                    0.6667 bytes loaded per byte stored
BYFL_SUMMARY:                  373.0000 ops per load instruction
BYFL_SUMMARY:                  152.8358 bits loaded/stored per memory op
BYFL_SUMMARY:                    3.0000 flops per conditional/indirect branch
BYFL_SUMMARY:                   23.3125 ops per conditional/indirect branch
BYFL_SUMMARY:                    0.0312 vector ops (FP & int) per conditional/indirect branch
BYFL_SUMMARY:                    0.0104 vector ops (FP & int) per flop
BYFL_SUMMARY:                    0.0013 vector ops (FP & int) per op
BYFL_SUMMARY:                    1.0843 ops per instruction
BYFL_SUMMARY: -----------------------------------------------------------------
BYFL_SUMMARY:                   13.3333 bytes per flop
BYFL_SUMMARY:                    0.8333 bits per flop bit
BYFL_SUMMARY:                    1.7158 bytes per op
BYFL_SUMMARY:                    0.1079 bits per (non-memory) op bit
BYFL_SUMMARY: -----------------------------------------------------------------
BYFL_SUMMARY:                    8.0000 unique bytes per flop
BYFL_SUMMARY:                    0.5000 unique bits per flop bit
BYFL_SUMMARY:                    1.0295 unique bytes per op
BYFL_SUMMARY:                    0.0648 unique bits per (non-memory) op bit
BYFL_SUMMARY:                    1.6667 bytes per unique byte
BYFL_SUMMARY: -----------------------------------------------------------------

The Byfl options listed above are accepted directly by the Byfl compiler pass. In addition, the Byfl wrapper scripts (but not the compiler pass) accept the following options:

-bf-verbose
Output all helper commands executed by the wrapper script.
-bf-static
Instead of instrumenting the code, merely output counts of the number of instructions of each type.
-bf-disable=feature
Disable various pieces of the instrumentation process. This can be useful for performance comparisons and troubleshooting. The following are acceptable values for -bf-disable:
none (default)
Don't disable anything; run with regular Byfl instrumentation.
byfl
Disable the Byfl compiler pass, but retain all of the internal manipulation of LLVM file types (i.e., bitcode).
bitcode
Process the code with LLVM and DragonEgg but disable LLVM bitcode and use exclusively native object files.
dragonegg
Use a GNU compiler directly, disabling all wrapper-script functionality.

Environment variables

BF_OPTS
The Byfl wrapper scripts (bf-gcc, bf-g++, and bf-gfortran) treat the BF_OPTS environment variable as a list of command-line options. This lets users control the type of Byfl instrumentation used without having to edit Makefiles or other build scripts.
BF_PREFIX
Byfl-instrumented executables expand the BF_PREFIX environment variable, honoring POSIX shell-style variable expansions, and prefix every line of Byfl output with the result. For example, if BF_PREFIX is set to the string Rank ${OMPI_COMM_WORLD_RANK}, then a line that would otherwise begin with BYFL_SUMMARY: will instead begin with Rank 3 BYFL_SUMMARY:, assuming that the OMPI_COMM_WORLD_RANK environment variable has the value 3.
Although the characters |, &, ;, <, >, (, ), {, and } are not normally allowed within BF_PREFIX, BF_PREFIX does support backquoted-command evaluation, and the child command can contain those characters, as in BF_PREFIX='`if true; then (echo YES; echo MAYBE); else echo NO; fi`' (which prefixes each line with "YES MAYBE").

As a special case, if BF_PREFIX expands to a string that begins with "/" or "./", it is treated not as a prefix but as a filename. The Byfl-instrumented executable will redirect all of its Byfl output to that file instead of to the standard output device.
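The expansion rules described for BF_PREFIX closely resemble those of the POSIX wordexp() function, which expands shell-style variables and backquoted commands and rejects unquoted |, &, ;, <, >, (, ), {, and }. The following sketch shows how such a prefix could be computed with wordexp(). Whether Byfl does it this way is an assumption, and the function names `expand_prefix` and `prefix_from_env` are hypothetical. The filename special case is not handled.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wordexp.h>

/* Expand shell-style constructs in `raw` and write the resulting words,
   joined by single spaces, into `out`.  Returns 0 on success, or -1 if
   the expansion fails or the result does not fit in `out_size` bytes. */
int expand_prefix(const char *raw, char *out, size_t out_size)
{
    wordexp_t words;
    size_t used = 0;
    int status = 0;

    assert(raw != NULL && out != NULL && out_size > 0);
    out[0] = '\0';
    if (wordexp(raw, &words, 0) != 0)
        return -1;   /* e.g., an unquoted '|' or ';' in the input */

    for (size_t i = 0; i < words.we_wordc; i++) {
        size_t len = strlen(words.we_wordv[i]);
        size_t sep = (i > 0) ? 1 : 0;
        if (used + sep + len + 1 > out_size) {
            status = -1;   /* output buffer too small */
            break;
        }
        if (sep)
            out[used++] = ' ';
        memcpy(out + used, words.we_wordv[i], len);
        used += len;
        out[used] = '\0';
    }
    wordfree(&words);
    return status;
}

/* Read BF_PREFIX from the environment and expand it.  An unset variable
   yields an empty prefix. */
int prefix_from_env(char *out, size_t out_size)
{
    const char *raw = getenv("BF_PREFIX");
    return expand_prefix(raw != NULL ? raw : "", out, out_size);
}
```

With OMPI_COMM_WORLD_RANK set to 3 and BF_PREFIX set to `Rank ${OMPI_COMM_WORLD_RANK}`, `prefix_from_env` produces `Rank 3`. An input such as `a|b` is rejected, in line with the character restrictions described above.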

Advanced usage

Under the covers, the Byfl wrapper scripts are using GCC to compile code to GCC IR and DragonEgg to convert GCC IR to LLVM IR, which is output in LLVM bitcode format. The wrapper scripts then run LLVM's opt command, specifying Byfl's bytesflops plugin as an additional compiler pass. The resulting instrumented bitcode is then converted to native machine code using Clang. The following is an example of how to manually instrument myprog.c without using the wrapper scripts:

$ gcc -g -fplugin=/usr/local/lib/dragonegg.so -fplugin-arg-dragonegg-emit-ir -O3 -Wall -Wextra -S myprog.c
$ opt -O3 myprog.s -o myprog.opt.bc
$ opt -load /usr/local/lib/bytesflops.so -bytesflops -bf-unique-bytes -bf-by-func -bf-call-stack -bf-vectors -bf-every-bb myprog.opt.bc -o myprog.inst.bc
$ opt -O3 myprog.inst.bc -o myprog.bc
$ clang myprog.bc -o myprog -L/usr/lib/gcc/x86_64-linux-gnu/4.7/ -L/usr/lib/gcc/x86_64-linux-gnu/4.7/../../../x86_64-linux-gnu/ -L/usr/lib/gcc/x86_64-linux-gnu/4.7/../../../../lib/ -L/lib/x86_64-linux-gnu/ -L/lib/../lib/ -L/usr/lib/x86_64-linux-gnu/ -L/usr/lib/../lib/ -L/usr/lib/gcc/x86_64-linux-gnu/4.7/../../../ -L/lib/ -L/usr/lib/ -L/usr/local/lib -Wl,--allow-multiple-definition -lm /usr/local/lib/libbyfl.bc -lstdc++ -lm

The bf-inst script makes these steps slightly simpler:

$ gcc -g -fplugin=/usr/local/lib/dragonegg.so -fplugin-arg-dragonegg-emit-ir -O3 -Wall -Wextra -S myprog.c
$ opt -O3 myprog.s -o myprog.bc
$ bf-inst -bf-unique-bytes -bf-by-func -bf-call-stack -bf-vectors -bf-every-bb myprog.bc
$ clang myprog.bc -o myprog -L/usr/lib/gcc/x86_64-linux-gnu/4.7/ -L/usr/lib/gcc/x86_64-linux-gnu/4.7/../../../x86_64-linux-gnu/ -L/usr/lib/gcc/x86_64-linux-gnu/4.7/../../../../lib/ -L/lib/x86_64-linux-gnu/ -L/lib/../lib/ -L/usr/lib/x86_64-linux-gnu/ -L/usr/lib/../lib/ -L/usr/lib/gcc/x86_64-linux-gnu/4.7/../../../ -L/lib/ -L/usr/lib/ -L/usr/local/lib -Wl,--allow-multiple-definition /usr/local/lib/libbyfl.bc -lstdc++ -lm

If GCC and DragonEgg are not required, Byfl instrumentation is even easier to apply manually:

$ clang -O3 -c -emit-llvm myprog.c -o myprog.bc
$ bf-inst -bf-unique-bytes -bf-by-func -bf-call-stack -bf-vectors -bf-every-bb myprog.bc
$ clang myprog.bc -o myprog /usr/local/lib/libbyfl.bc -lstdc++ -lm

This basic approach can be useful for instrumenting code in languages other than C, C++, and Fortran. For example, code compiled with any of the other GCC frontends can be instrumented as above. Also, recent versions of the Glasgow Haskell Compiler can compile directly to LLVM bitcode.

Postprocessing Byfl output

Byfl installs two scripts to convert Byfl output (lines beginning with BYFL) into formats readable by various GUIs. bf2cgrind converts Byfl output into KCachegrind input, and bf2hpctk converts Byfl output into HPCToolkit input. (The latter program is more robust and appears to be more actively maintained.) Run each of those scripts with no arguments to see the usage text.

In addition, Byfl includes a script called bfmerge, which merges multiple Byfl output files by computing statistics, across all of the files, for each data value encountered. These output files might represent multiple runs of a sequential application or multiple processes from a single run of a parallel application. Currently, the set of statistics includes the sum, minimum, maximum, median, median absolute deviation, mean, and standard deviation. Thus, bfmerge facilitates quantifying the similarities and differences across applications or processes.
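Of the statistics listed, the median and the median absolute deviation are probably the least familiar. The sketch below computes both for one data value gathered from several output files. It is only an illustration of the two definitions, not bfmerge's implementation, and the function names are made up.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator: ascending order of doubles. */
static int compare_doubles(const void *a, const void *b)
{
    double x = *(const double *)a;
    double y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Return the median of n values.  Sorts `v` in place. */
static double median_in_place(double *v, size_t n)
{
    qsort(v, n, sizeof *v, compare_doubles);
    return (n % 2) ? v[n / 2] : (v[n / 2 - 1] + v[n / 2]) / 2.0;
}

/* Compute the median and the median absolute deviation of n > 0 values.
   Returns 0 on success or -1 if memory could not be allocated. */
int median_and_mad(const double *values, size_t n, double *median, double *mad)
{
    assert(values != NULL && n > 0);
    double *work = malloc(n * sizeof *work);
    if (work == NULL)
        return -1;

    memcpy(work, values, n * sizeof *work);
    *median = median_in_place(work, n);

    /* Replace each value with its absolute distance from the median. */
    for (size_t i = 0; i < n; i++) {
        double d = values[i] - *median;
        work[i] = (d < 0) ? -d : d;
    }
    *mad = median_in_place(work, n);

    free(work);
    return 0;
}
```

For five hypothetical readings of 1, 2, 3, 4, and 100, the median is 3 and the median absolute deviation is 1. Unlike the mean (22), neither is dragged upward by the single outlier, which is what makes them useful for spotting one process that behaves differently from the rest.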

License

Los Alamos National Security, LLC (LANS) owns the copyright to Byfl, which it identifies internally as LA-CC-12-039. The license is BSD-ish with a "modifications must be indicated" clause. See LICENSE.md for the full text.

Authors

Scott Pakin, pakin@lanl.gov
Pat McCormick, pat@lanl.gov

About

byfl augmented with calls to eaudit
