algorithm

Some tests and benchmarks of common algorithms

Build

Linux

To build the code in this repository, you need:

cmake 2.8 or later
gcc or clang. gcc should be >= 4.8; clang should be >= 3.3.

git clone https://github.com/snnn/algorithm/trunk
cd algorithm
mkdir build
cd build
cmake ..
make
src/all_unitest

Windows

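To build the code in this repository, you need:

cmake 2.8 or later
Visual Studio 2012 or later

First check out the code from https://github.com/snnn/algorithm/trunk with git or svn, then generate the project files with cmake, and then build.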

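Directory layout

/src Implementations of some common algorithms, each with unit tests.
/common A thin OS abstraction layer written to support both Linux and Windows: mainly mutexes, condition variables, a thread pool, snprintf, and the like.
/btree The btree and hash index code extracted from an old version of Berkeley DB. It is too dated to be of practical use; it was tidied up mainly for reading and study.
/google_benchmark A C++ benchmark framework from Google, with a few small modifications of my own.
/gtest Google's C++ unit test framework.
/zlib Third-party library, needed for PNG support.
/libpng Third-party library.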

The common directory

include/slib/threadpool.h and common/threadpool.cpp: a thread pool that supports timed tasks
include/slib/mutex_pthread.h, include/slib/mutex_win.h, common/mutex_pthread.cpp, common/mutex_win.cpp: mutex and condition variable

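Algorithm index (incomplete)

src/salamin_pi.cpp: computes pi with the Brent-Salamin algorithm. I have not yet worked out the error bound, so the stopping condition is flawed.
src/heap.h: a heap data structure. It follows the STL interface style and implements iterators and Allocator support.
src/fib.cpp: computes the n-th Fibonacci number, both iteratively and by computing an n-th power with divide and conquer.
src/random.h: a random number generator, somewhat better than the one in stdlib. It also provides a wrapper around the Intel hardware random number generator.
src/sort.h: several sorting algorithms (bubble sort, selection sort, heap sort, two-way merge sort) and binary search
src/quick_sort.h: quick sort
src/insert_sort.h: insertion sort
src/merge_sort_list_unitest.cpp: merge sort on a linked list
src/draw_binary_tree.h: draws a binary tree with the TR algorithm
src/topk.h: finds the k-th largest number
src/maxsubarray.h: maximum subarray sum, and the longest subarray without repeated elements
src/lis.h: Longest strictly increasing subsequence (LIS)
src/lcs.cpp: Longest common subsequence (LCS)
src/edit_distance.cpp: Edit distance
src/hash.cpp: a simple hash function, char[] -> uint32_t
src/TASLock.h: TAS spin-lock algorithm
src/TTASLock.h: TTAS spin-lock algorithm
src/stackword.h: generates all stack words (see the discussion of stack words in TAOCP)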
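
To illustrate the two approaches listed for src/fib.cpp, here is a minimal sketch. It is not the repository's code: the names fib_iter, fib_pow, Mat and mul are hypothetical, and the choice of a 2x2 matrix as the thing being raised to the n-th power is an assumption about what "n-th power" refers to.

```cpp
#include <array>
#include <cstdint>

// Hypothetical names; this is NOT the code in src/fib.cpp.
// Results wrap modulo 2^64 once n exceeds 93.

// Iterative version: O(n) additions.
std::uint64_t fib_iter(unsigned n) {
  std::uint64_t a = 0, b = 1;  // F(0), F(1)
  for (unsigned i = 0; i < n; ++i) {
    const std::uint64_t next = a + b;
    a = b;
    b = next;
  }
  return a;
}

// Row-major 2x2 matrix: {m00, m01, m10, m11}.
using Mat = std::array<std::uint64_t, 4>;

static Mat mul(const Mat& x, const Mat& y) {
  return {x[0] * y[0] + x[1] * y[2], x[0] * y[1] + x[1] * y[3],
          x[2] * y[0] + x[3] * y[2], x[2] * y[1] + x[3] * y[3]};
}

// Divide-and-conquer version: O(log n) matrix multiplications, using
// [[1,1],[1,0]]^n = [[F(n+1), F(n)], [F(n), F(n-1)]].
std::uint64_t fib_pow(unsigned n) {
  Mat result = {1, 0, 0, 1};  // identity matrix
  Mat base = {1, 1, 1, 0};
  while (n > 0) {
    if (n & 1) result = mul(result, base);  // fold in the current bit
    base = mul(base, base);                 // square for the next bit
    n >>= 1;
  }
  return result[1];
}
```

Both functions should agree for every n; the second one trades additions for a logarithmic number of small matrix products.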

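Notes on benchmarking

A problem can often be solved by several algorithms, and an algorithm often has several implementations; which one runs faster depends on the specific environment. So the best approach is to implement all of them, actually run them, and then draw a conclusion using statistical hypothesis testing.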
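
The text does not say which test to use. As one possible choice, the sketch below computes Welch's t statistic for two sets of timing samples; picking Welch's test, and the names WelchResult and welch_t, are assumptions made for illustration. Turning the statistic into a p-value still requires a t-distribution table or a statistics library.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper types and names, for illustration only.
struct WelchResult {
  double t;   // test statistic
  double df;  // approximate degrees of freedom (Welch-Satterthwaite)
};

static double mean(const std::vector<double>& v) {
  double sum = 0.0;
  for (double x : v) sum += x;
  return sum / v.size();
}

// Unbiased sample variance (divides by n - 1).
static double sample_variance(const std::vector<double>& v, double m) {
  double ss = 0.0;
  for (double x : v) ss += (x - m) * (x - m);
  return ss / (v.size() - 1);
}

// Compares the means of two independent samples without assuming equal
// variances. Each sample needs at least two values, and at least one of
// them must have non-zero variance.
WelchResult welch_t(const std::vector<double>& a, const std::vector<double>& b) {
  assert(a.size() >= 2 && b.size() >= 2);
  const double ma = mean(a);
  const double mb = mean(b);
  const double va = sample_variance(a, ma) / a.size();  // s_a^2 / n_a
  const double vb = sample_variance(b, mb) / b.size();  // s_b^2 / n_b
  const double t = (ma - mb) / std::sqrt(va + vb);
  const double df = (va + vb) * (va + vb) /
                    (va * va / (a.size() - 1) + vb * vb / (b.size() - 1));
  return {t, df};
}
```

Here a and b would hold repeated timings of two implementations; a large absolute value of t relative to the critical value for df suggests that the difference in means is not just noise.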

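My benchmark framework code comes mainly from Google's open-source project google/benchmark, which targets C/C++ programs. For Java programs I recommend caliper, which is also open-sourced by Google; I have used it for a long time and found it quite good. That said, micro-benchmarking is easier in C/C++ than in Java because there are fewer sources of interference.

Benchmarks usually track two times: CPU usage time and wall clock time.

Sources of CPU time:

int getrusage(int who, struct rusage *usage); CPU time is ru_utime + ru_stime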
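
A minimal sketch of reading CPU time this way on a POSIX system. The function name process_cpu_seconds is hypothetical, and using RUSAGE_SELF (the whole calling process) is an assumption about what is being measured.

```cpp
#include <sys/resource.h>
#include <sys/time.h>

// Returns user + system CPU time consumed so far by the calling process,
// in seconds, or -1.0 if getrusage fails.
double process_cpu_seconds() {
  struct rusage usage;
  if (getrusage(RUSAGE_SELF, &usage) != 0) return -1.0;
  const double user = usage.ru_utime.tv_sec + usage.ru_utime.tv_usec / 1e6;
  const double sys = usage.ru_stime.tv_sec + usage.ru_stime.tv_usec / 1e6;
  return user + sys;
}
```

Calling it before and after the code under test and subtracting gives the CPU time spent in between.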

Sources of wall clock time:

rdtsc
gettimeofday
clock_gettime
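
As an example of the last option, the sketch below reads a clock with clock_gettime. The function name wall_clock_ns is hypothetical, and choosing CLOCK_MONOTONIC (which is not affected by adjustments to the system date) is an assumption; the text does not name a clock id.

```cpp
#include <cstdint>
#include <time.h>

// Returns a monotonic timestamp in nanoseconds, or -1 if clock_gettime fails.
// Only differences between two readings are meaningful.
std::int64_t wall_clock_ns() {
  struct timespec ts;
  if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0) return -1;
  return static_cast<std::int64_t>(ts.tv_sec) * 1000000000LL + ts.tv_nsec;
}
```

On older glibc versions, linking with -lrt may be needed for clock_gettime.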

Things to take care of before benchmarking:

Disable CPU frequency scaling: /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sched_setscheduler: set the scheduling policy to FIFO; requires root privilege.
Align the test data to cache lines, so allocate dynamic memory with posix_memalign.
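
For the last item, a minimal sketch of a cache-line-aligned allocation. The name alloc_aligned_ints is hypothetical, and the 64-byte line size is an assumption that holds for typical x86 CPUs; check the actual value on the target machine.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdlib.h>

// Assumption: 64-byte cache lines.
const std::size_t kCacheLine = 64;

// Allocates room for n ints starting on a cache-line boundary.
// Returns nullptr on failure. Release the memory with free().
int* alloc_aligned_ints(std::size_t n) {
  void* p = nullptr;
  if (posix_memalign(&p, kCacheLine, n * sizeof(int)) != 0) return nullptr;
  return static_cast<int*>(p);
}
```

posix_memalign requires the alignment to be a power of two and a multiple of sizeof(void*), which 64 satisfies.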

Some results

Hardware environment 1: Intel(R) Core(TM) i5-3210M CPU @ 2.50GHz

Software environment 1: Ubuntu 14.04 LTS, llvm/clang 3.4 (without polly)

Compiler flags: -std=c++11 -pthread -Wall -Wextra -Wno-unused-parameter -stdlib=libc++ -O3 -mtune=native -march=native -flto -DNDEBUG -g3

Time units: 1 second = 1,000,000 microseconds = 1,000,000,000 nanoseconds. Nanoseconds are abbreviated as ns below.

Sorting

Commands

Heap sort (my own implementation) on uniformly random 32-bit integers: ./mybenchmark --benchmark_filter='BM_heap_sort' --benchmark_min_time=2

std::sort (libc++ STL) on uniformly random 32-bit integers: ./mybenchmark --benchmark_filter='BM_std_sort' --benchmark_min_time=2

qsort (FreeBSD libc) on uniformly random 32-bit integers: ./mybenchmark --benchmark_filter='qsort' --benchmark_min_time=2

Results

Benchmark	Time(ns)	CPU(ns)	Iterations
BM_heap_sort<int>/16	342	606	3300111
BM_heap_sort<int>/64	2352	2723	734670
BM_heap_sort<int>/512	29660	31726	63047
BM_heap_sort<int>/4k	308012	323559	6182
BM_heap_sort<int>/32k	3075102	3212472	623
BM_heap_sort<int>/256k	31398673	32604226	62
BM_heap_sort<int>/1024k	172382331	178276917	12
BM_std_sort<int>/16	184	452	4430101
BM_std_sort<int>/64	1210	1566	1277377
BM_std_sort<int>/512	15298	17196	116342
BM_std_sort<int>/4k	169104	183394	10906
BM_std_sort<int>/32k	1705724	1826278	1096
BM_std_sort<int>/256k	16603514	17616035	114
BM_std_sort<int>/1024k	76316858	80586680	25
BM_qsort_int/16	558	1152	1736306
BM_qsort_int/64	3123	4774	418924
BM_qsort_int/512	36114	48319	41413
BM_qsort_int/4k	376951	473376	4226
BM_qsort_int/32k	3716338	4515307	443
BM_qsort_int/256k	36471804	43464255	47
BM_qsort_int/1024k	174555536	204984700	10

The number after the "/" is the length of the array being sorted. For example, 4k means 4*1024 ints.

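Conclusions

With or without LTO (link-time optimization), qsort is slow. This may not be because the algorithm is poor, but because of the limitations of implementing it in C (no templates and less type information, which reduces the opportunities for inlining).

Heap sort and quick sort have the same average-case time complexity, but in practice heap sort is slower than quick sort by a constant factor. (Textbooks say the same.)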
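
The inlining argument is easier to see with the two call styles side by side. This is a minimal sketch, not the benchmark code from the repository; compare_int, sort_with_qsort and sort_with_std_sort are hypothetical names. qsort receives its comparator as a function pointer that takes void* arguments, whereas std::sort is a template instantiated for int, so the comparison is visible to the compiler at the point of use.

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>

// Comparator for qsort: reached through a function pointer, with the
// element type erased to void*.
static int compare_int(const void* a, const void* b) {
  const int x = *static_cast<const int*>(a);
  const int y = *static_cast<const int*>(b);
  return (x > y) - (x < y);  // avoids the overflow risk of x - y
}

// C style: element size and comparator are passed at run time.
void sort_with_qsort(std::vector<int>& v) {
  if (!v.empty()) std::qsort(v.data(), v.size(), sizeof(int), compare_int);
}

// C++ style: element type and comparison (operator<) are known at
// compile time.
void sort_with_std_sort(std::vector<int>& v) {
  std::sort(v.begin(), v.end());
}
```

Both functions produce the same ordering; only the way the comparison reaches the sorting routine differs.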

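Integer increment

When multiple threads need to read and write the same integer, some synchronization strategy is required. You can use a mutex (pthread_mutex_t) or the CPU's atomic instructions.

With the CPU's atomic instructions on an x86 CPU, i++ becomes a single lock xadd instruction, while ++i becomes a lock xadd followed by an ordinary inc instruction (which has nothing to do with thread synchronization).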
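
The two strategies can be sketched in C++11 as follows. This is an illustration rather than the benchmark code itself: MutexCounter and AtomicCounter are hypothetical names, and std::mutex stands in for pthread_mutex_t. The instructions actually generated depend on the compiler and its flags.

```cpp
#include <atomic>
#include <mutex>

// Strategy 1: protect a plain int with a mutex.
struct MutexCounter {
  int increment() {
    std::lock_guard<std::mutex> lock(m);  // unlocked when lock goes out of scope
    return ++value;
  }
  std::mutex m;
  int value = 0;
};

// Strategy 2: let the CPU perform the read-modify-write atomically.
struct AtomicCounter {
  // Like i++: returns the value before the increment.
  int post_increment() { return value.fetch_add(1); }
  // Like ++i: the same atomic add, then an ordinary +1 on the returned copy.
  int pre_increment() { return value.fetch_add(1) + 1; }
  std::atomic<int> value{0};
};
```

The extra +1 in pre_increment operates on a local copy of the old value, which is why it needs no synchronization of its own.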

Commands

./mybenchmark --benchmark_filter='Int.*_single_thread' --benchmark_iterations=1000

Results

Benchmark	Time(ns)	CPU(ns)	Iterations
BM_Int_Inc_std_mutex_single_thread/8	197	205	1000
BM_Int_Inc_std_mutex_single_thread/64	1582	1593	1000
BM_Int_Inc_std_mutex_single_thread/64k	1544026	1560324	1000
BM_Int_Inc_atomic_int_single_thread/8	67	73	1000
BM_Int_Inc_atomic_int_single_thread/64	453	460	1000
BM_Int_Inc_atomic_int_single_thread/64k	478326	485313	1000

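The second column is wall clock time and the third column is CPU time.

Conclusion

On hardware environment 1, tested with a single thread, an atomic instruction takes about 7 ns and a mutex about 22 ns. The latter is about 3 times the former.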
