Byfl helps application developers understand code performance in a hardware-independent way. The idea is that it instruments your code at compile time then gathers and reports data at run time. For example, suppose you wanted to know how many bytes are accessed by the following C code:
double array[100000][100];
volatile double sum = 0.0;
for (int row=0; row<100000; row++)
sum += array[row][0];
Reading the hardware performance counters (e.g., using PAPI) can be misleading. The performance counters on most processors tally not the number of bytes but rather the number of cache-line accesses. Because the array is stored in row-major order, each access to array will presumably reference a different cache line while each access to sum will presumably reference the same cache line.
Byfl does the equivalent of transforming the code into the following:
unsigned long int bytes_accessed = 0;
double array[100000][100];
volatile double sum = 0.0;
for (int row=0; row<100000; row++) {
sum += array[row][0];
bytes_accessed += 3*sizeof(double);
}
In the above, one can consider the bytes_accessed variable as a "software performance counter," as it is maintained entirely by software.
In practice, however, Byfl doesn't do source-to-source transformations (unlike, for example, ROSE) as implied by the preceding code sample. Instead, it integrates into the LLVM compiler infrastructure as an LLVM compiler pass. If your application compiles with LLVM's Clang or any of the GNU compilers, you can instrument it with Byfl.
Because Byfl instruments code in LLVM's intermediate representation (IR), not native machine code, it outputs the same counter values regardless of target architecture. In contrast, binary-instrumentation tools such as Pin may tally operations differently on different platforms.
The name "Byfl" comes from "bytes/flops". The very first version of the code counted only bytes and floating-point operations (flops).
Byfl relies on LLVM, Clang, and DragonEgg. These are huge and must be built from trunk (i.e., the post-3.2-release development code). The build-llvm-byfl script automatically downloads all of these plus Byfl, configures them, builds them, and installs the result into a directory you specify.
Byfl also relies on GCC. You should already have GCC installed and in your path before running build-llvm-byfl. The LLVM developers currently seem to do most of their testing with GCC 4.6, so that's your best bet for having everything work.
build-llvm-byfl takes one required argument, which is the root of the installation directory (e.g., /usr/local or /opt/byfl or whatnot). The following optional arguments can appear before the required argument:
-b build_dir
    Build Byfl and its dependencies in directory build_dir.
    Default: ./byfl-build.random/
-j parallelism
    Specify the maximum number of processes to use for compilation, passed directly to make -j.
    Default: number of entries in /proc/cpuinfo
-d
    Download Byfl and its dependencies into build_dir, but don't configure, build, or install them.
    Default: off
-c
    Configure, build, and install Byfl and its dependencies without re-downloading them into build_dir.
    Default: off
-t
    Display progress textually instead of with a GUI progress bar (Zenity).
    Default: GUI display if available
If, for whatever reason, you're unable to run the automatic build script, you can always manually build and install Byfl and its prerequisites. Byfl depends on LLVM (the compiler infrastructure), Clang (an LLVM-based C/C++ compiler), and DragonEgg (a technically optional but strongly recommended tool for using GCC compilers as LLVM front ends). See the following URLs for instructions on building each of these:
- DragonEgg: http://dragonegg.llvm.org/
I use the following configure line in my top-level LLVM directory:
./configure --enable-optimized --enable-debug-runtime --enable-debug-symbols --disable-assertions CC=gcc CXX=g++ REQUIRES_RTTI=1
Run make to build LLVM and Clang and make install to install them. Then, with the LLVM bin directory in your path, run make in the DragonEgg directory. Copy dragonegg.so to your LLVM lib directory.
The following steps can then be used to build and install Byfl:
cd autoconf
yes $HOME/llvm | ./AutoRegen.sh
mkdir ../build
cd ../build
../configure --disable-assertions --enable-optimized --enable-debug-runtime --enable-debug-symbols DRAGONEGG=/usr/local/lib/dragonegg.so CXX=g++ CXXFLAGS="-g -O2 -std=c++0x" --with-llvmsrc=$HOME/llvm --with-llvmobj=$HOME/llvm
make
make install
The $HOME/llvm paths in the above refer to your LLVM source (not installation) directory. Also, be sure to adjust the location of dragonegg.so as appropriate.
Byfl comes with a set of wrapper scripts that simplify instrumentation. bf-gcc, bf-g++, bf-gfortran, and bf-gccgo wrap, respectively, the GNU C, C++, Fortran, and Go compilers. bf-mpicc, bf-mpicxx, bf-mpif90, and bf-mpif77 further wrap the similarly named Open MPI and MPICH wrapper scripts to use the Byfl compiler scripts instead of the default C, C++, and Fortran compilers. Use any of these scripts as you would the underlying compiler. When you run your code, Byfl will output a sequence of BYFL-prefixed lines to the standard output device:
BYFL_SUMMARY: -----------------------------------------------------------------
BYFL_SUMMARY: 1,280 bytes (512 loaded + 768 stored)
BYFL_SUMMARY: 32 flops
BYFL_SUMMARY: 576 integer ops
BYFL_SUMMARY: 67 memory ops (2 loads + 65 stores)
BYFL_SUMMARY: -----------------------------------------------------------------
BYFL_SUMMARY: 10,240 bits (4,096 loaded + 6,144 stored)
BYFL_SUMMARY: 6,144 flop bits
BYFL_SUMMARY: 85,024 integer op bits
BYFL_SUMMARY: -----------------------------------------------------------------
BYFL_SUMMARY: 0.6667 bytes loaded per byte stored
BYFL_SUMMARY: 288.0000 integer ops per load instruction
BYFL_SUMMARY: 152.8358 bits loaded/stored per memory op
BYFL_SUMMARY: -----------------------------------------------------------------
BYFL_SUMMARY: 40.0000 bytes per flop
BYFL_SUMMARY: 1.6667 bits per flop bit
BYFL_SUMMARY: 2.2222 bytes per integer op
BYFL_SUMMARY: 0.1204 bits per integer op bit
BYFL_SUMMARY: -----------------------------------------------------------------
"Bits" are simply bytes*8. "Flop bits" are the total number of bits in all inputs and outputs to each floating-point function. As motivation, consider the operation A = B + C
, where A, B, and C reside in memory. This operation consumes 12 bytes per flop if the arguments are all single-precision but 24 bytes per flop if the arguments are all double-precision. Similarly, A = -B consumes either 8 or 16 bytes per flop based on the argument type. However, all of these examples consume one bit per flop bit regardless of numerical precision: every bit loaded or stored either enters or exits the floating-point unit. Bit:flop-bit ratios above 1.0 imply that more memory is moved than fed into the floating-point unit; bit:flop-bit ratios below 1.0 imply register reuse.
The Byfl wrapper scripts accept a number of options to provide more information about your program at a cost of increased execution times. The following can be specified either on the command line or within the BF_OPTS environment variable. (The former takes precedence.)
-bf-types
    Tally type-specific loads and stores of register-friendly types. The current set of included types is: single- and double-precision floating-point values; 8-, 16-, 32-, and 64-bit integer values; and pointers. Remaining types are categorized as other types.
-bf-inst-mix
    Track the overall instruction mix of the program. This counts the number of times each instruction in the intermediate representation is issued and produces a histogram at the end of program execution. Details on the intermediate language can be found in the LLVM Language Reference Manual.
-bf-every-bb
    Output counters for every basic block executed.
-bf-merge-bb=number
    When used with -bf-every-bb, merge every number basic-block readings into a single line of output. (I typically specify -bf-merge-bb=1000000.)
-bf-vectors
    Report statistics on vector operations (element sizes and number of elements). Unfortunately, at the time of this writing (July 2012), LLVM's autovectorizer is extremely limited and is unable to manipulate arbitrary-length vectors, even though the IR supports them.
-bf-by-func
    Output counters for every function executed.
-bf-call-stack
    When used with -bf-by-func, distinguish functions by call path. That is, if function f calls functions g and h, -bf-by-func by itself will output counts for each of the three functions while including -bf-call-stack will output counts for the two call stacks f → g and f → h.
-bf-include=function1[,function2,…]
    Instrument only the named functions. function can be a symbol name (as reported by nm), a demangled C++ symbol name (as reported by nm -C), or @filename, in which case a list of functions is read from file filename, one function per line.
-bf-exclude=function1[,function2,…]
    Instrument all but the named functions. function can be a symbol name (as reported by nm), a demangled C++ symbol name (as reported by nm -C), or @filename, in which case a list of functions is read from file filename, one function per line.
-bf-thread-safe
    Indicate that the application is multithreaded (e.g., with Pthreads or OpenMP) so Byfl should protect all counter updates.
-bf-unique-bytes
    Keep track of unique memory locations accessed. For example, if a program accesses 8 bytes at address A, then at B, then at A again, Byfl will report this as 24 bytes but only 16 unique bytes.
-bf-mem-footprint
    Output the program's memory footprint in terms of the amount of memory needed to represent various fractions of the total number of memory accesses.
Almost all of the options listed above incur a cost in execution time and memory footprint. -bf-unique-bytes is very slow and very memory-hungry: it performs a hash-table lookup and a bit-vector write (and multiple of those if used with -bf-by-func) for every byte read or written by the program. -bf-mem-footprint is likewise both very slow and very memory-hungry: it updates a 32-bit counter (accessed via a hash-table lookup) for every byte read or written by the program, implying that it requires 4x the memory of the uninstrumented code.
The following represents some sample output from a code instrumented with Byfl and most of the preceding options:
BYFL_INFO: Byfl command line: -bf-inst-mix -bf-by-func -bf-vectors -bf-unique-bytes -bf-every-bb -bf-types -bf-mem-footprint
BYFL_BB_HEADER: LD_bytes ST_bytes LD_ops ST_ops Flops FP_bits Int_ops Int_op_bits
BYFL_BB: 0 0 0 0 0 0 4 192
BYFL_BB: 0 16 0 2 2 192 20 2626
BYFL_BB: 0 16 0 2 2 192 20 2626
BYFL_BB: 0 16 0 2 2 192 20 2626
BYFL_BB: 0 16 0 2 2 192 20 2626
BYFL_BB: 0 16 0 2 2 192 20 2626
BYFL_BB: 0 16 0 2 2 192 20 2626
BYFL_BB: 0 16 0 2 2 192 20 2626
BYFL_BB: 0 16 0 2 2 192 20 2626
BYFL_BB: 0 16 0 2 2 192 20 2626
BYFL_BB: 0 16 0 2 2 192 20 2626
BYFL_BB: 0 16 0 2 2 192 20 2626
BYFL_BB: 0 16 0 2 2 192 20 2626
BYFL_BB: 0 16 0 2 2 192 20 2626
BYFL_BB: 0 16 0 2 2 192 20 2626
BYFL_BB: 0 16 0 2 2 192 20 2626
BYFL_BB: 0 16 0 2 2 192 20 2626
BYFL_BB: 0 16 0 2 2 192 20 2626
BYFL_BB: 0 16 0 2 2 192 20 2626
BYFL_BB: 0 16 0 2 2 192 20 2626
BYFL_BB: 0 16 0 2 2 192 20 2626
BYFL_BB: 0 16 0 2 2 192 20 2626
BYFL_BB: 0 16 0 2 2 192 20 2626
BYFL_BB: 0 16 0 2 2 192 20 2626
BYFL_BB: 0 16 0 2 2 192 20 2626
BYFL_BB: 0 16 0 2 2 192 20 2626
BYFL_BB: 0 16 0 2 2 192 20 2626
BYFL_BB: 0 16 0 2 2 192 20 2626
BYFL_BB: 0 16 0 2 2 192 20 2626
BYFL_BB: 0 16 0 2 2 192 20 2626
BYFL_BB: 0 16 0 2 2 192 20 2626
BYFL_BB: 0 16 0 2 2 192 20 2626
BYFL_BB: 0 16 0 2 2 192 20 2626
BYFL_BB: 512 256 2 1 32 6144 102 10656
BYFL_FUNC_HEADER: LD_bytes ST_bytes LD_ops ST_ops Flops FP_bits Int_ops Int_op_bits Uniq_bytes Cond_brs Invocations Function
BYFL_FUNC: 512 768 2 65 96 12288 746 94880 768 32 1 main
BYFL_CALLEE_HEADER: Invocations Byfl Function
BYFL_CALLEE: 1 No __printf_chk
BYFL_VECTOR_HEADER: Elements Elt_bits Type Tally Function
BYFL_VECTOR: 32 64 FP 1 main
BYFL_SUMMARY: -----------------------------------------------------------------
BYFL_SUMMARY: 1,280 bytes (512 loaded + 768 stored)
BYFL_SUMMARY: 768 unique bytes
BYFL_SUMMARY: 96 flops
BYFL_SUMMARY: 549 integer ops
BYFL_SUMMARY: 67 memory ops (2 loads + 65 stores)
BYFL_SUMMARY: 34 branch ops (1 unconditional and direct + 32 conditional or indirect + 1 other)
BYFL_SUMMARY: 746 TOTAL OPS
BYFL_SUMMARY: -----------------------------------------------------------------
BYFL_SUMMARY: 2 loads of vectors of 64-bit floating-point values
BYFL_SUMMARY: 64 stores of 64-bit floating-point values
BYFL_SUMMARY: 1 stores of vectors of 64-bit floating-point values
BYFL_SUMMARY: -----------------------------------------------------------------
BYFL_SUMMARY: 10,240 bits (4,096 loaded + 6,144 stored)
BYFL_SUMMARY: 6,144 unique bits
BYFL_SUMMARY: 12,288 flop bits
BYFL_SUMMARY: 94,880 op bits (excluding memory ops)
BYFL_SUMMARY: -----------------------------------------------------------------
BYFL_SUMMARY: 1 vector operations (FP & int)
BYFL_SUMMARY: 32.0000 elements per vector
BYFL_SUMMARY: 64.0000 bits per element
BYFL_SUMMARY: -----------------------------------------------------------------
BYFL_SUMMARY: 96 Add instructions executed
BYFL_SUMMARY: 67 BitCast instructions executed
BYFL_SUMMARY: 65 Store instructions executed
BYFL_SUMMARY: 64 SIToFP instructions executed
BYFL_SUMMARY: 64 Trunc instructions executed
BYFL_SUMMARY: 64 GetElementPtr instructions executed
BYFL_SUMMARY: 64 SRem instructions executed
BYFL_SUMMARY: 64 Mul instructions executed
BYFL_SUMMARY: 33 Br instructions executed
BYFL_SUMMARY: 32 ICmp instructions executed
BYFL_SUMMARY: 32 Shl instructions executed
BYFL_SUMMARY: 32 PHI instructions executed
BYFL_SUMMARY: 3 Alloca instructions executed
BYFL_SUMMARY: 2 Call instructions executed
BYFL_SUMMARY: 2 Load instructions executed
BYFL_SUMMARY: 2 ExtractElement instructions executed
BYFL_SUMMARY: 1 Ret instructions executed
BYFL_SUMMARY: 1 FMul instructions executed
BYFL_SUMMARY: 688 TOTAL instructions executed
BYFL_SUMMARY: -----------------------------------------------------------------
BYFL_SUMMARY: 512 bytes cover 80.0% of memory accesses
BYFL_SUMMARY: 768 bytes cover 100.0% of memory accesses
BYFL_SUMMARY: -----------------------------------------------------------------
BYFL_SUMMARY: 0.6667 bytes loaded per byte stored
BYFL_SUMMARY: 373.0000 ops per load instruction
BYFL_SUMMARY: 152.8358 bits loaded/stored per memory op
BYFL_SUMMARY: 3.0000 flops per conditional/indirect branch
BYFL_SUMMARY: 23.3125 ops per conditional/indirect branch
BYFL_SUMMARY: 0.0312 vector ops (FP & int) per conditional/indirect branch
BYFL_SUMMARY: 0.0104 vector ops (FP & int) per flop
BYFL_SUMMARY: 0.0013 vector ops (FP & int) per op
BYFL_SUMMARY: 1.0843 ops per instruction
BYFL_SUMMARY: -----------------------------------------------------------------
BYFL_SUMMARY: 13.3333 bytes per flop
BYFL_SUMMARY: 0.8333 bits per flop bit
BYFL_SUMMARY: 1.7158 bytes per op
BYFL_SUMMARY: 0.1079 bits per (non-memory) op bit
BYFL_SUMMARY: -----------------------------------------------------------------
BYFL_SUMMARY: 8.0000 unique bytes per flop
BYFL_SUMMARY: 0.5000 unique bits per flop bit
BYFL_SUMMARY: 1.0295 unique bytes per op
BYFL_SUMMARY: 0.0648 unique bits per (non-memory) op bit
BYFL_SUMMARY: 1.6667 bytes per unique byte
BYFL_SUMMARY: -----------------------------------------------------------------
The Byfl options listed above are accepted directly by the Byfl compiler pass. In addition, the Byfl wrapper scripts (but not the compiler pass) accept the following options:
-bf-verbose
    Output all helper commands executed by the wrapper script.
-bf-static
    Instead of instrumenting the code, merely output counts of the number of instructions of various types.
-bf-disable=feature
    Disable various pieces of the instrumentation process. This can be useful for performance comparisons and troubleshooting. The following are acceptable values for -bf-disable:
    none (default)
        Don't disable anything; run with regular Byfl instrumentation.
    byfl
        Disable the Byfl compiler pass, but retain all of the internal manipulation of LLVM file types (i.e., bitcode).
    bitcode
        Process the code with LLVM and DragonEgg but disable LLVM bitcode and use exclusively native object files.
    dragonegg
        Use a GNU compiler directly, disabling all wrapper-script functionality.
BF_OPTS
    The Byfl wrapper scripts (bf-gcc, bf-g++, and bf-gfortran) treat the BF_OPTS environment variable as a list of command-line options. This lets users control the type of Byfl instrumentation used without having to edit Makefiles or other build scripts.
BF_PREFIX
    Byfl-instrumented executables expand the BF_PREFIX environment variable, honoring POSIX shell-style variable expansions, and prefix every line of Byfl output with the result. For example, if BF_PREFIX is set to the string "Rank ${OMPI_COMM_WORLD_RANK}", then a line that would otherwise begin with "BYFL_SUMMARY:" will instead begin with "Rank 3 BYFL_SUMMARY:", assuming that the OMPI_COMM_WORLD_RANK environment variable has the value 3.

    Although the characters |, &, ;, <, >, (, ), {, and } are not normally allowed within BF_PREFIX, BF_PREFIX does support backquoted-command evaluation, and the child command can contain those characters, as in BF_PREFIX='`if true; then (echo YES; echo MAYBE); else echo NO; fi`' (which prefixes each line with "YES MAYBE").

    As a special case, if BF_PREFIX expands to a string that begins with "/" or "./", it is treated not as a prefix but as a filename. The Byfl-instrumented executable will redirect all of its Byfl output to that file instead of to the standard output device.
Under the covers, the Byfl wrapper scripts are using GCC to compile code to GCC IR and DragonEgg to convert GCC IR to LLVM IR, which is output in LLVM bitcode format. The wrapper scripts then run LLVM's opt command, specifying Byfl's bytesflops plugin as an additional compiler pass. The resulting instrumented bitcode is then converted to native machine code using Clang. The following is an example of how to manually instrument myprog.c without using the wrapper scripts:
$ gcc -g -fplugin=/usr/local/lib/dragonegg.so -fplugin-arg-dragonegg-emit-ir -O3 -Wall -Wextra -S myprog.c
$ opt -O3 myprog.s -o myprog.opt.bc
$ opt -load /usr/local/lib/bytesflops.so -bytesflops -bf-unique-bytes -bf-by-func -bf-call-stack -bf-vectors -bf-every-bb myprog.opt.bc -o myprog.inst.bc
$ opt -O3 myprog.inst.bc -o myprog.bc
$ clang myprog.bc -o myprog -L/usr/lib/gcc/x86_64-linux-gnu/4.7/ -L/usr/lib/gcc/x86_64-linux-gnu/4.7/../../../x86_64-linux-gnu/ -L/usr/lib/gcc/x86_64-linux-gnu/4.7/../../../../lib/ -L/lib/x86_64-linux-gnu/ -L/lib/../lib/ -L/usr/lib/x86_64-linux-gnu/ -L/usr/lib/../lib/ -L/usr/lib/gcc/x86_64-linux-gnu/4.7/../../../ -L/lib/ -L/usr/lib/ -L/usr/local/lib -Wl,--allow-multiple-definition -lm /usr/local/lib/libbyfl.bc -lstdc++ -lm
The bf-inst script makes these steps slightly simpler:
$ gcc -g -fplugin=/usr/local/lib/dragonegg.so -fplugin-arg-dragonegg-emit-ir -O3 -Wall -Wextra -S myprog.c
$ opt -O3 myprog.s -o myprog.bc
$ bf-inst -bf-unique-bytes -bf-by-func -bf-call-stack -bf-vectors -bf-every-bb myprog.bc
$ clang myprog.bc -o myprog -L/usr/lib/gcc/x86_64-linux-gnu/4.7/ -L/usr/lib/gcc/x86_64-linux-gnu/4.7/../../../x86_64-linux-gnu/ -L/usr/lib/gcc/x86_64-linux-gnu/4.7/../../../../lib/ -L/lib/x86_64-linux-gnu/ -L/lib/../lib/ -L/usr/lib/x86_64-linux-gnu/ -L/usr/lib/../lib/ -L/usr/lib/gcc/x86_64-linux-gnu/4.7/../../../ -L/lib/ -L/usr/lib/ -L/usr/local/lib -Wl,--allow-multiple-definition /usr/local/lib/libbyfl.bc -lstdc++ -lm
If GCC and DragonEgg are not required, Byfl instrumentation is even easier to apply manually:
$ clang -O3 -c -emit-llvm myprog.c -o myprog.bc
$ bf-inst -bf-unique-bytes -bf-by-func -bf-call-stack -bf-vectors -bf-every-bb myprog.bc
$ clang myprog.bc -o myprog /usr/local/lib/libbyfl.bc -lstdc++ -lm
This basic approach can be useful for instrumenting code in languages other than C, C++, and Fortran. For example, code compiled with any of the other GCC frontends can be instrumented as above. Also, recent versions of the Glasgow Haskell Compiler can compile directly to LLVM bitcode.
Byfl installs two scripts to convert Byfl output (lines beginning with BYFL) into formats readable by various GUIs. bf2cgrind converts Byfl output into KCachegrind input, and bf2hpctk converts Byfl output into HPCToolkit input. (The latter program is more robust and appears to be more actively maintained.) Run each of those scripts with no arguments to see the usage text.
In addition, Byfl includes a script called bfmerge, which merges multiple Byfl output files by computing statistics across all of the files of each data value encountered. These output files might represent multiple runs of a sequential application or multiple processes from a single run of a parallel application. Currently, the set of statistics includes the sum, minimum, maximum, median, median absolute deviation, mean, and standard deviation. Thus, bfmerge facilitates quantifying the similarities and differences across applications or processes.
Los Alamos National Security, LLC (LANS) owns the copyright to Byfl, which it identifies internally as LA-CC-12-039. The license is BSD-ish with a "modifications must be indicated" clause. See LICENSE.md for the full text.
Scott Pakin, pakin@lanl.gov
Pat McCormick, pat@lanl.gov