Byfl helps application developers understand code performance in a hardware-independent way. The idea is that it instruments your code at compile time then gathers and reports data at run time. For example, suppose you wanted to know how many bytes are accessed by the following C code:
double array[100000][100];
volatile double sum = 0.0;
for (int row=0; row<100000; row++)
sum += array[row][0];
Reading the hardware performance counters (e.g., using PAPI) can be misleading. The performance counters on most processors tally not the number of bytes but rather the number of cache-line accesses. Because the array is stored in row-major order, each access to array
will presumably reference a different cache line while each access to sum
will presumably reference the same cache line.
Byfl does the equivalent of transforming the code into the following:
unsigned long int bytes accessed = 0;
double array[100000][100];
volatile double sum = 0.0;
for (int row=0; row<100000; row++) {
sum += array[row][0];
bytes_accessed += 3*sizeof(double);
}
In the above, one can consider the bytes_accessed
variable as a "software performance counter," as it is maintained entirely by software.
In practice, however, Byfl doesn't do source-to-source transformations (unlike, for example, ROSE) as implied by the preceding code sample. Instead, it integrates into the LLVM compiler infrastructure as an LLVM compiler pass. If your application compiles with LLVM's Clang or any of the GNU compilers, you can instrument it with Byfl.
Because Byfl instruments code in LLVM's intermediate representation (IR), not native machine code, it outputs the same counter values regardless of target architecture. In contrast, binary-instrumentation tools such as Pin may tally operations differently on different platforms.
The name "Byfl" comes from "bytes/flops". The very first version of the code counted only bytes and floating-point operations (flops).
If you have reasonably recent versions of GCC, LLVM, Clang, and DragonEgg, you should be able to perform the usual
./configure
make
make install
procedure. See INSTALL.md for a more complete explanation.
Byfl comes with a set of wrapper scripts that simplify instrumentation. bf-clang
and bf-clang++
wrap, respectively, the Clang C and C++ compilers. bf-gcc
, bf-g++
, bf-gfortran
, and bf-gccgo
wrap, respectively, the GNU C, C++, Fortran, and Go compilers. bf-mpicc
, bf-mpicxx
, bf-mpif90
, and bf-mpif77
further wrap the similarly named Open MPI and MPICH wrapper scripts to use the Byfl compiler scripts instead of the default C, C++, and Fortran compilers. Use any of these scripts as you would the underlying compiler. When you run your code, Byfl will output a sequence of BYFL
-prefixed lines to the standard output device and a superset of the data to a binary file (called filename.byfl
by default):
BYFL_INFO: Byfl command line:
BYFL_SUMMARY: -----------------------------------------------------------------
BYFL_SUMMARY: 1,280 bytes (512 loaded + 768 stored)
BYFL_SUMMARY: 96 flops
BYFL_SUMMARY: 549 integer ops
BYFL_SUMMARY: 67 memory ops (2 loads + 65 stores)
BYFL_SUMMARY: 35 branch ops (1 unconditional and direct + 32 conditional or indirect + 2 function calls or returns + 0 other)
BYFL_SUMMARY: 747 TOTAL OPS
BYFL_SUMMARY: -----------------------------------------------------------------
BYFL_SUMMARY: 10,240 bits (4,096 loaded + 6,144 stored)
BYFL_SUMMARY: 12,288 flop bits
BYFL_SUMMARY: 94,880 op bits (excluding memory ops)
BYFL_SUMMARY: -----------------------------------------------------------------
BYFL_SUMMARY: 0.6667 bytes loaded per byte stored
BYFL_SUMMARY: 373.0000 ops per load instruction
BYFL_SUMMARY: 152.8358 bits loaded/stored per memory op
BYFL_SUMMARY: 3.0000 flops per conditional/indirect branch
BYFL_SUMMARY: 23.3125 ops per conditional/indirect branch
BYFL_SUMMARY: -----------------------------------------------------------------
BYFL_SUMMARY: 13.3333 bytes per flop
BYFL_SUMMARY: 0.8333 bits per flop bit
BYFL_SUMMARY: 1.7158 bytes per op
BYFL_SUMMARY: 0.1079 bits per (non-memory) op bit
BYFL_SUMMARY: -----------------------------------------------------------------
"Bits" are simply bytes*8. "Flop bits" are the total number of bits in all inputs and outputs to each floating-point function. As motivation, consider the operation A = B + C
, where A
, B
, and C
reside in memory. This operation consumes 12 bytes per flop if the arguments are all single-precision but 24 bytes per flop if the arguments are all double-precision. Similarly, A = -B
consumes either 8 or 16 bytes per flop based on the argument type. However, all of these examples consume one bit per flop bit regardless of numerical precision: every bit loaded or stored either enters or exits the floating-point unit. Bit:flop-bit ratios above 1.0 imply that more memory is moved than fed into the floating-point unit; Bit:flop-bit ratios below 1.0 imply register reuse.
The Byfl wrapper scripts accept a number of options to provide more information about your program at a cost of increased execution times. These can be specified either on the command line or within the BF_OPTS
environment variable. (The former takes precedence.) See the bf-clang
, bf-clang++
, bf-gcc
, bf-g++
, bf-gfortran
, or bf-gccgo
man page for a description of all of the information Byfl can report.
The following represents some sample output from a code instrumented with Byfl and most of the available options:
BYFL_INFO: Byfl command line: -bf-unique-bytes -bf-vectors -bf-every-bb -bf-mem-footprint -bf-types -bf-inst-mix -bf-by-func -bf-call-stack
BYFL_FUNC_HEADER: LD_bytes ST_bytes LD_ops ST_ops Flops FP_bits Int_ops Int_op_bits Uniq_bytes Cond_brs Invocations Function
BYFL_FUNC: 512 768 2 65 96 12288 549 94880 768 32 1 main
BYFL_CALLEE_HEADER: Invocations Byfl Function
BYFL_CALLEE: 1 No __printf_chk
BYFL_VECTOR_HEADER: Elements Elt_bits Type Tally Function
BYFL_VECTOR: 32 64 FP 1 main
BYFL_SUMMARY: -----------------------------------------------------------------
BYFL_SUMMARY: 1,280 bytes (512 loaded + 768 stored)
BYFL_SUMMARY: 768 unique bytes
BYFL_SUMMARY: 512 addresses cover 50% of all dynamic loads and stores
BYFL_SUMMARY: 96 flops
BYFL_SUMMARY: 549 integer ops
BYFL_SUMMARY: 67 memory ops (2 loads + 65 stores)
BYFL_SUMMARY: 35 branch ops (1 unconditional and direct + 32 conditional or indirect + 2 function calls or returns + 0 other)
BYFL_SUMMARY: 747 TOTAL OPS
BYFL_SUMMARY: -----------------------------------------------------------------
BYFL_SUMMARY: 10,240 bits (4,096 loaded + 6,144 stored)
BYFL_SUMMARY: 6,144 unique bits
BYFL_SUMMARY: 12,288 flop bits
BYFL_SUMMARY: 94,880 op bits (excluding memory ops)
BYFL_SUMMARY: -----------------------------------------------------------------
BYFL_SUMMARY: 1 vector operations (FP & int)
BYFL_SUMMARY: 32.0000 elements per vector
BYFL_SUMMARY: 64.0000 bits per element
BYFL_SUMMARY: -----------------------------------------------------------------
BYFL_SUMMARY: 2 loads of vectors of 64-bit floating-point values
BYFL_SUMMARY: 64 stores of 64-bit floating-point values
BYFL_SUMMARY: 1 stores of vectors of 64-bit floating-point values
BYFL_SUMMARY: -----------------------------------------------------------------
BYFL_SUMMARY: 96 Add instructions executed
BYFL_SUMMARY: 67 BitCast instructions executed
BYFL_SUMMARY: 65 Store instructions executed
BYFL_SUMMARY: 64 GetElementPtr instructions executed
BYFL_SUMMARY: 64 Trunc instructions executed
BYFL_SUMMARY: 64 SRem instructions executed
BYFL_SUMMARY: 64 SIToFP instructions executed
BYFL_SUMMARY: 64 Mul instructions executed
BYFL_SUMMARY: 33 Br instructions executed
BYFL_SUMMARY: 32 PHI instructions executed
BYFL_SUMMARY: 32 ICmp instructions executed
BYFL_SUMMARY: 32 Shl instructions executed
BYFL_SUMMARY: 3 Alloca instructions executed
BYFL_SUMMARY: 2 ExtractElement instructions executed
BYFL_SUMMARY: 2 Call instructions executed
BYFL_SUMMARY: 2 Load instructions executed
BYFL_SUMMARY: 1 Ret instructions executed
BYFL_SUMMARY: 1 FMul instructions executed
BYFL_SUMMARY: 688 TOTAL instructions executed
BYFL_SUMMARY: -----------------------------------------------------------------
BYFL_SUMMARY: 512 bytes cover 80.0% of memory accesses
BYFL_SUMMARY: 768 bytes cover 100.0% of memory accesses
BYFL_SUMMARY: -----------------------------------------------------------------
BYFL_SUMMARY: 0.6667 bytes loaded per byte stored
BYFL_SUMMARY: 373.0000 ops per load instruction
BYFL_SUMMARY: 152.8358 bits loaded/stored per memory op
BYFL_SUMMARY: 3.0000 flops per conditional/indirect branch
BYFL_SUMMARY: 23.3125 ops per conditional/indirect branch
BYFL_SUMMARY: 0.0312 vector ops (FP & int) per conditional/indirect branch
BYFL_SUMMARY: 0.0104 vector ops (FP & int) per flop
BYFL_SUMMARY: 0.0013 vector ops (FP & int) per op
BYFL_SUMMARY: 1.0843 ops per instruction
BYFL_SUMMARY: -----------------------------------------------------------------
BYFL_SUMMARY: 13.3333 bytes per flop
BYFL_SUMMARY: 0.8333 bits per flop bit
BYFL_SUMMARY: 1.7158 bytes per op
BYFL_SUMMARY: 0.1079 bits per (non-memory) op bit
BYFL_SUMMARY: -----------------------------------------------------------------
BYFL_SUMMARY: 8.0000 unique bytes per flop
BYFL_SUMMARY: 0.5000 unique bits per flop bit
BYFL_SUMMARY: 1.0295 unique bytes per op
BYFL_SUMMARY: 0.0648 unique bits per (non-memory) op bit
BYFL_SUMMARY: 1.6667 bytes per unique byte
BYFL_SUMMARY: -----------------------------------------------------------------
Under the covers, the Byfl wrapper scripts are using GCC to compile code to GCC IR and DragonEgg to convert GCC IR to LLVM IR, which is output in LLVM bitcode format. The wrapper scripts then run LLVM's opt
command, specifying Byfl's bytesflops
plugin as an additional compiler pass. The resulting instrumented bitcode is then converted to native machine code using Clang. The following is an example of how to manually instrument myprog.c
without using the wrapper scripts:
$ gcc -g -fplugin=/usr/local/lib/dragonegg.so -fplugin-arg-dragonegg-emit-ir -O3 -Wall -Wextra -S myprog.c
$ opt -O3 myprog.s -o myprog.opt.bc
$ opt -load /usr/local/lib/bytesflops.so -bytesflops -bf-unique-bytes -bf-by-func -bf-call-stack -bf-vectors -bf-every-bb myprog.opt.bc -o myprog.inst.bc
$ opt -O3 myprog.inst.bc -o myprog.bc
$ clang myprog.bc -o myprog -L/usr/lib/gcc/x86_64-linux-gnu/4.7/ -L/usr/lib/gcc/x86_64-linux-gnu/4.7/../../../x86_64-linux-gnu/ -L/usr/lib/gcc/x86_64-linux-gnu/4.7/../../../../lib/ -L/lib/x86_64-linux-gnu/ -L/lib/../lib/ -L/usr/lib/x86_64-linux-gnu/ -L/usr/lib/../lib/ -L/usr/lib/gcc/x86_64-linux-gnu/4.7/../../../ -L/lib/ -L/usr/lib/ -L/usr/local/lib -Wl,--allow-multiple-definition -lm /usr/local/lib/libbyfl.bc -lstdc++ -lm
The bf-inst
script makes these steps slightly simpler:
$ gcc -g -fplugin=/usr/local/lib/dragonegg.so -fplugin-arg-dragonegg-emit-ir -O3 -Wall -Wextra -S myprog.c
$ opt -O3 myprog.s -o myprog.bc
$ bf-inst -bf-unique-bytes -bf-by-func -bf-call-stack -bf-vectors -bf-every-bb myprog.bc
$ clang myprog.bc -o myprog -L/usr/lib/gcc/x86_64-linux-gnu/4.7/ -L/usr/lib/gcc/x86_64-linux-gnu/4.7/../../../x86_64-linux-gnu/ -L/usr/lib/gcc/x86_64-linux-gnu/4.7/../../../../lib/ -L/lib/x86_64-linux-gnu/ -L/lib/../lib/ -L/usr/lib/x86_64-linux-gnu/ -L/usr/lib/../lib/ -L/usr/lib/gcc/x86_64-linux-gnu/4.7/../../../ -L/lib/ -L/usr/lib/ -L/usr/local/lib -Wl,--allow-multiple-definition /usr/local/lib/libbyfl.bc -lstdc++ -lm
If GCC and DragonEgg are not required, Byfl instrumentation is even easier to apply manually:
$ clang -O3 -c -emit-llvm myprog.c -o myprog.bc
$ bf-inst -bf-unique-bytes -bf-by-func -bf-call-stack -bf-vectors -bf-every-bb myprog.bc
$ clang myprog-bc -o myprog /usr/local/lib/libbyfl.bc -lstdc++ -lm
This basic approach can be useful for instrumenting code in languages other than C, C++, and Fortran. For example, code compiled with any of the other GCC frontends can be instrumented as above. Also, recent versions of the Glasgow Haskell Compiler can compile directly to LLVM bitcode, although Byfl has not yet been successfully applied to GHC-generated code.
Byfl provides a set of programs for converting Byfl binary output (*.byfl
files) into various other formats. Currently, these include the following:
bfbin2csv
- Output Byfl data in comma-separated value (CSV) format, suitable for processing with a scripting language (e.g., AWK or Perl)
bfbin2xmlss
- Output Byfl data in XML Spreadsheet format, suitable for processing with many spreadsheet programs, including LibreOffice, Microsoft Excel, and Numbers for Mac
bfbin2sqlite3
- Output Byfl data as a SQLite3 database, suitable for processing with SQLite command-line tools or importing into other database management systems
bfbin2hdf5
- Output Byfl data as a Hierarchical Data Format (HDF5) file for longer-term storage and processing (e.g., with HDFView)
Developers can write additional conversion tools using a simple API (installed as bfbin.h
and libbfbin.a
) that parses .byfl
files into tables and invokes program-specified callback functions for each table component.
In addition to the above, Byfl installs two scripts to convert Byfl textual output (lines beginning with BYFL
) into formats readable by various GUIs. bf2cgrind
converts Byfl output into KCachegrind input, and bf2hpctk
converts Byfl output into HPCToolkit input. (The latter program is more robust and appears to be more actively maintained.) Run each of those scripts with no arguments to see the usage text. These scripts will eventually be replaced by programs based on the Byfl parsing API.
Los Alamos National Security, LLC (LANS) owns the copyright to Byfl, which it identifies internally as LA-CC-12-039. The license is BSD-ish with a "modifications must be indicated" clause. See LICENSE.md for the full text.
Scott Pakin, pakin@lanl.gov
Pat McCormick, pat@lanl.gov
Rob Aulwes rta@lanl.gov
Eric Anger, eanger@gmail.com
Christine Sweeney, cahrens@lanl.gov