Simple benchmarking program with fancy graphs. It measures the time an algorithm takes to process a varying amount of data, then plots time against the amount of data processed.
This project is written for (self)educational purposes and fun. Some things may look a bit awkward.
Dependencies: Linux, C++ compiler with C++11 support (tested with g++ 4.8.4, clang 3.6.0), CUDA (optional).
git clone https://github.com/yekm/bench
cd bench
git submodule update --init
mkdir build
cd build
cmake ..
make
Make a separate folder and run bench from there
mkdir tmp && cd tmp
Listing all tasks and algorithms
Task 0: popcnt perfomance
32bit SWAR popcnt
Brian Kernighan popcnt
Simple popcnt
generalized 32 bit SWAR popcnt
generalized 64 bit SWAR popcnt
intrinsics _mm_popcnt_u64 manual asm popcnt
intrinsics _mm_popcnt_u64 popcnt
intrinsics _mm_popcnt_u64 unrolled popcnt
table lookup popcnt
thrust popcnt
Task 1: sorting algorithms
Insertion sort n^2
Introsort std::sort n*log(n)
Merge sort n*log(n)
Selection sort n^2
Shell sort n*log^2(n)
swenson binary insertion sort n^2
swenson grail sort
swenson heapsort n*log(n)
swenson mergesort n*log(n)
swenson quiksort n*log(n)
swenson selection sort n^2
swenson sell sort n*log^2(n)
swenson sqrt sort
swenson timsort n*log(n)
thrust::sort
Task 2: Sorting algorithms, partially sorted data, 1000000 elements
Insertion sort n^2
Introsort std::sort n*log(n)
Merge sort n*log(n)
Shell sort n*log^2(n)
swenson timsort n*log(n)
thrust::sort
Task 3: Threaded counting, 1e7 elements (false sharing)
false sharing, default alignment
fixed false sharing, 64 bytes alignment
Task 4: Threaded counting, 1e7 elements (atomic vs mutex)
atomic
mutex
Task 5: 1000000 vector lengths
handmade unrolling
loop unrolling
template unrolling
Quick run
$ ../bench -t 0.1
Skip some tasks. The final listing will still include the skipped tasks, and previously measured performance data is preserved between runs.
$ ../bench -t 1 -s 2
Each algorithm runs a times (3 by default, set with -a) on the same data to minimize measurement error.
It is possible to regenerate the data and run the algorithm a times again. The number of data-regeneration iterations is specified by -b
$ ../bench -t 1 -s 0,2 -a 2 -b 3
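The repetition scheme described above can be sketched roughly as follows. This is only an illustration, not the project's actual code: `time_once`, `measure` and `best_of` are hypothetical names, the data type is fixed to `std::vector<int>` for brevity, and taking the minimum of the timings is an assumption about how repeated measurements might be combined.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper: time a single call of `work`, in seconds.
inline double time_once(const std::function<void()>& work) {
    auto start = std::chrono::steady_clock::now();
    work();
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}

// Hypothetical driver: regenerate the data `b` times and, for each data set,
// run the algorithm `a` times on an identical copy. Returns all a*b timings.
inline std::vector<double> measure(
        std::size_t a, std::size_t b,
        const std::function<std::vector<int>()>& generate,
        const std::function<void(std::vector<int>&)>& algorithm) {
    assert(generate && algorithm);
    std::vector<double> timings;
    for (std::size_t i = 0; i < b; ++i) {
        const std::vector<int> data = generate();  // one data regeneration
        for (std::size_t j = 0; j < a; ++j) {
            std::vector<int> copy = data;  // same input for every repetition
            timings.push_back(time_once([&] { algorithm(copy); }));
        }
    }
    return timings;
}

// Assumption: the smallest timing is taken as the least noisy measurement.
inline double best_of(const std::vector<double>& timings) {
    assert(!timings.empty());
    return *std::min_element(timings.begin(), timings.end());
}
```

With a = 2 and b = 3, as in the command above, this sketch would produce six timings per data size.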
For now, viewing the graphs is rather inconvenient: you need a web server, a couple of symlinks, and a modern web browser with JavaScript Promise support.
$ for f in ../../html/d3/*; do ln -s $f $(basename $f); done
$ python3 -m http.server 8082
$ $BROWSER http://localhost:8082
Adding algorithms for benchmarking is easy (at least I've tried to make it easy).
Run ./create_task.sh new_task_name_here from the tasks/template directory. Explore the new tasks/new_task_name_here directory.
Decide what type of data your algorithm should process. Derive a class from GenericData<T> or use common::RandomData<T> and others. GenericData<T> has two useful functions: T & get_mutable() and const T & get_const().
If an algorithm does not modify its data, it should ask for const data. In that case the next algorithm
gets exactly the same data. Otherwise the modified data is discarded and new data is
generated (which should be identical to the previous data, though this can depend on the actual
implementation).
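One way to picture the const/mutable distinction is the sketch below. It is a hypothetical stand-in, not the contents of genericdata.hpp: the class name `SketchData`, the `dirty_` flag and `needs_regeneration()` are all invented for illustration, and the real GenericData<T> may track modification differently or not at all.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical stand-in for GenericData<T>; the real class may differ.
template <typename T>
class SketchData {
public:
    explicit SketchData(T data) : data_(std::move(data)) {}

    // Read-only access: the data stays valid for the next algorithm.
    const T& get_const() const { return data_; }

    // Mutable access: assume the caller modifies the data, so mark it dirty.
    T& get_mutable() {
        dirty_ = true;
        return data_;
    }

    // A driver could check this to decide whether to generate fresh data.
    bool needs_regeneration() const { return dirty_; }

private:
    T data_;
    bool dirty_ = false;
};

// Usage example for the sketch above.
inline void sketch_usage() {
    SketchData<std::vector<int>> d(std::vector<int>{3, 1, 2});
    assert(d.get_const().size() == 3);
    assert(!d.needs_regeneration());  // const access leaves the data reusable
    d.get_mutable().push_back(4);
    assert(d.needs_regeneration());   // mutable access invalidates it
}
```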
Derive from Task and implement prepare_data(size_t n). Here you create and return your data of size n. For example:
return std::make_shared<common::RandomData<int>>(n);
RandomData<T> is derived from GenericData<std::vector<T>>, so get_mutable() will return std::vector<T>&.
The amount of data is doubled on each run of the algorithm. Testing
stops when execution time exceeds 60 sec (by default) or the amount of data
exceeds memory capacity (std::bad_alloc is thrown). The timeout can be changed with
command line arguments. Data growth can be changed by reimplementing the virtual
function Task::get_n().
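The growth and stopping rules can be sketched as a simple loop. This is a simplified illustration under stated assumptions, not the project's driver: `run_until_timeout` and `run_with_n` are hypothetical names, the starting size of 1 is arbitrary, and recording the measurement that exceeded the timeout before stopping is a guess.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <limits>
#include <new>
#include <utility>
#include <vector>

// Hypothetical driver: call `run_with_n` with n = start_n, 2*start_n, ...
// and collect (n, seconds) points until a run exceeds the timeout or
// std::bad_alloc signals that the data no longer fits in memory.
inline std::vector<std::pair<std::size_t, double>> run_until_timeout(
        const std::function<void(std::size_t)>& run_with_n,
        double timeout_seconds, std::size_t start_n = 1) {
    assert(start_n > 0);
    std::vector<std::pair<std::size_t, double>> points;
    std::size_t n = start_n;
    while (true) {
        double seconds = 0.0;
        try {
            auto start = std::chrono::steady_clock::now();
            run_with_n(n);
            seconds = std::chrono::duration<double>(
                std::chrono::steady_clock::now() - start).count();
        } catch (const std::bad_alloc&) {
            break;  // out of memory: stop without recording this size
        }
        points.emplace_back(n, seconds);
        if (seconds > timeout_seconds) break;  // too slow: stop growing
        if (n > std::numeric_limits<std::size_t>::max() / 2) break;  // overflow guard
        n *= 2;  // doubling; a custom growth rule would replace this line
    }
    return points;
}
```

The resulting (n, seconds) pairs are the kind of points that end up on the time-versus-size plot.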
Derive from Algorithm and implement do_run(TaskData & td, std::unique_ptr<AResult> &). Here you cast the passed TaskData to the GenericData<T> you picked earlier and do some processing. The running time of this function is measured. For example, in the case of RandomData<int>:
std::vector<int> &d = static_cast<GenericData<std::vector<int>>&>(td).get_mutable();
std::sort(d.begin(), d.end());
Write a cpp file that defines a static struct. In its constructor, create the Task and all Algorithms, add the Algorithms to the Task, and add the Task to TaskCollection. Since your struct is static and TaskCollection is a singleton, all algorithms and tasks are created and registered in TaskCollection automatically at program launch, so there is no need to change any existing code.
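The self-registration pattern might look roughly like the sketch below. All types here (`SketchAlgorithm`, `SketchTask`, `SketchCollection`, `RegisterSorting`) are hypothetical, heavily simplified stand-ins; the project's real Task, Algorithm and TaskCollection classes have different interfaces.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-ins for Algorithm and Task.
struct SketchAlgorithm { std::string name; };

struct SketchTask {
    std::string name;
    std::vector<SketchAlgorithm> algorithms;
};

// Hypothetical stand-in for TaskCollection, as a function-local-static singleton.
class SketchCollection {
public:
    static SketchCollection& instance() {
        static SketchCollection collection;  // constructed on first use
        return collection;
    }
    void add(SketchTask task) { tasks_.push_back(std::move(task)); }
    const std::vector<SketchTask>& tasks() const { return tasks_; }

private:
    SketchCollection() = default;
    std::vector<SketchTask> tasks_;
};

namespace {
// The static struct: its constructor runs during static initialization,
// before main(), so the task registers itself without touching other files.
struct RegisterSorting {
    RegisterSorting() {
        SketchTask task{"sorting algorithms", {}};
        task.algorithms.push_back({"std::sort"});
        task.algorithms.push_back({"insertion sort"});
        SketchCollection::instance().add(std::move(task));
    }
} register_sorting;
}  // namespace

// Usage example: by the time this runs, the task is already registered.
inline std::size_t registered_task_count() {
    assert(!SketchCollection::instance().tasks().empty());
    return SketchCollection::instance().tasks().size();
}
```

Because the collection lives in a function-local static, it is created on first access, so the registering struct can use it safely regardless of the order in which translation units are initialized.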
This description is quite messy. You should look at the actual code; it is simple and clear.
Aka don't try this at home work.
- static struct for initialisation/registering objects in collection (tasks/*/*.cpp)
- dumb/useless singletons (taskcollection.hpp)
- protected data members (genericdata.hpp)
- loose JSON handling (utils/output/jsonoutput.cpp)
- lack of comments
https://git-scm.com/book/en/v2/Git-Tools-Advanced-Merging
git remote add swenson_sort https://github.com/swenson/sort
git remote add despacer_remote https://github.com/lemire/despacer
git remote add pdqsort_remote https://github.com/orlp/pdqsort
git checkout -b swenson_branch swenson_sort/master
git checkout -b despacer_branch despacer_remote/master
git checkout -b pdqsort_branch pdqsort_remote/master
git read-tree --prefix=tasks/simple_sorting/swenson-sort -u swenson_branch
git read-tree --prefix=tasks/despace/despacer -u despacer_branch
git read-tree --prefix=tasks/simple_sorting/pdqsort -u pdqsort_branch
MIT. However, this project uses various sources from OpenSSL and OpenSSH. I haven't read their licenses thoroughly, but I guess they somehow restrict which license I can use in derived projects. Any suggestions are welcome.