GitHub - AlbertVeli/fast-grep: Minimal multi-threaded grep

Minimal multi-threaded grep.

No bells, no whistles, no case insensitivity, no regular expressions. Just plain text search. Works similarly to:

```
sed -n '/needle/p' haystack.txt
```

This prints all lines containing the string needle in the file haystack.txt. Because the search is multi-threaded, the order of the printed lines is not deterministic.
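The multi-threaded search can be sketched as follows. This is a minimal illustration under stated assumptions, not code taken from fast-grep.c: the names `count_matching_lines`, `scan_chunk`, `line_contains` and `MAX_THREADS` are hypothetical, the input is assumed to be already in memory, and the workers count matching lines instead of printing them. Each worker gets a byte range whose end is moved forward to the next newline, so no line is split between two workers.

```c
#include <pthread.h>
#include <stddef.h>
#include <string.h>

#define MAX_THREADS 16   /* hypothetical upper bound for this sketch */

struct chunk {
    const char *start;   /* first byte of this worker's range */
    const char *end;     /* one past the last byte of the range */
    const char *needle;
    size_t needle_len;
    size_t matches;      /* result: number of matching lines */
};

/* Returns 1 if needle (length n) occurs in [line, line_end). */
static int line_contains(const char *line, const char *line_end,
                         const char *needle, size_t n)
{
    if (n == 0)
        return 1;
    while ((size_t)(line_end - line) >= n) {
        /* Only look at positions where a full needle still fits. */
        const char *p = memchr(line, needle[0],
                               (size_t)(line_end - line) - n + 1);
        if (p == NULL)
            return 0;
        if (memcmp(p, needle, n) == 0)
            return 1;
        line = p + 1;
    }
    return 0;
}

/* Worker: walk the range line by line and count matching lines. */
static void *scan_chunk(void *arg)
{
    struct chunk *c = arg;
    const char *line = c->start;

    while (line < c->end) {
        const char *nl = memchr(line, '\n', (size_t)(c->end - line));
        const char *line_end = nl ? nl : c->end;

        if (line_contains(line, line_end, c->needle, c->needle_len))
            c->matches++;   /* a real tool would print the line here */
        if (nl == NULL)
            break;
        line = nl + 1;
    }
    return NULL;
}

/* Splits buf into up to nthreads line-aligned ranges and scans them in
 * parallel. Error handling for pthread_create is omitted for brevity. */
size_t count_matching_lines(const char *buf, size_t len,
                            const char *needle, int nthreads)
{
    pthread_t tid[MAX_THREADS];
    struct chunk chunks[MAX_THREADS];
    const char *end = buf + len;
    const char *start = buf;
    size_t total = 0;
    int i, used = 0;

    if (nthreads < 1) nthreads = 1;
    if (nthreads > MAX_THREADS) nthreads = MAX_THREADS;

    for (i = 0; i < nthreads && start < end; i++) {
        /* Tentative boundary, then extend it to the end of a line. */
        const char *stop = (i == nthreads - 1)
                         ? end : buf + (len / nthreads) * (size_t)(i + 1);
        if (stop < start)
            stop = start;
        if (stop < end) {
            const char *nl = memchr(stop, '\n', (size_t)(end - stop));
            stop = nl ? nl + 1 : end;
        }
        chunks[used] = (struct chunk){ start, stop, needle, strlen(needle), 0 };
        pthread_create(&tid[used], NULL, scan_chunk, &chunks[used]);
        used++;
        start = stop;
    }
    for (i = 0; i < used; i++) {
        pthread_join(tid[i], NULL);
        total += chunks[i].matches;
    }
    return total;
}
```

If each worker printed its lines as soon as it found them, instead of counting, lines from different ranges would interleave according to thread scheduling, which is why the output order varies between runs. Build with `-pthread`.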

EXAMPLE

Search for lines containing the string '.se-' in the file cred, but skip lines containing the string '-|-|--':

```
./fast-grep -v '-|-|--' '.se-' cred
```
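The selection rule in this example (keep a line if it contains the search string and does not contain the `-v` string) can be written as a small predicate. This is a sketch only: `line_selected` is a hypothetical helper, not a function from fast-grep.c, and it assumes NUL-terminated lines.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not taken from fast-grep.c.
 * Returns 1 if `line` contains `needle` and does not contain `exclude`.
 * Pass exclude = NULL to disable the exclusion test.
 * Both strings are matched literally, so '.' only matches a dot. */
static int line_selected(const char *line, const char *needle,
                         const char *exclude)
{
    if (strstr(line, needle) == NULL)
        return 0;
    if (exclude != NULL && strstr(line, exclude) != NULL)
        return 0;
    return 1;
}
```

Because the match is literal, a line such as `xse-` is not selected for the search string `.se-`, whereas the regular expression `.se-` would match it.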

LIMITATIONS

Only tested with GNU/Linux and OS X.
Fails for large files (> 2 GB) on 32-bit operating systems. This could be fixed by using LFS and calling mmap64 repeatedly, mapping 1 or 2 GB in each iteration, but that is currently not implemented.
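The windowed mapping described above could look roughly like the sketch below. It is not part of fast-grep: `count_newlines_windowed` is a hypothetical helper, and it defines `_FILE_OFFSET_BITS` as 64 and calls plain `mmap` (which then takes a 64-bit `off_t`) instead of calling `mmap64` directly. To keep it short it only counts newline bytes; a real search would also have to handle lines that straddle two windows.

```c
#define _FILE_OFFSET_BITS 64   /* request a 64-bit off_t on 32-bit systems (LFS) */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: counts '\n' bytes in `path`, mapping at most
 * `window` bytes at a time. `window` must be a non-zero multiple of the
 * page size so that every mapping offset stays page-aligned.
 * Returns -1 on error. */
long long count_newlines_windowed(const char *path, size_t window)
{
    struct stat st;
    long long count = 0;
    off_t offset = 0;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    while (offset < st.st_size) {
        off_t left = st.st_size - offset;
        size_t n = left < (off_t)window ? (size_t)left : window;
        const char *p = mmap(NULL, n, PROT_READ, MAP_PRIVATE, fd, offset);
        const char *q, *end;

        if (p == MAP_FAILED) {
            close(fd);
            return -1;
        }
        end = p + n;
        /* Scan only the currently mapped window. */
        for (q = p; (q = memchr(q, '\n', (size_t)(end - q))) != NULL; q++)
            count++;
        munmap((void *)p, n);   /* release the window before mapping the next */
        offset += (off_t)n;
    }
    close(fd);
    return count;
}
```

A call such as `count_newlines_windowed("cred", (size_t)1 << 30)` would map the file 1 GB at a time, keeping each mapping within a 32-bit address space.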

BENCHMARK

Searching a 9.3 GB ASCII text file named cred for the string ".se-" (406006 matching lines).

Computers:

HP 8560w laptop (Core i7m, Linux), GNU sed 4.2.1, GNU grep 2.14, perl v5.12.4, Python 2.7.5
Mac (Core i5, OS X), BSD sed (version?), BSD grep 2.5.1, perl v5.16.2, Python 2.7.5
Ubuntu (Xeon W3550), GNU sed 4.2.1, GNU grep 2.10, perl v5.14.2, Python 2.7.3

Commands:

Sed
```
time sed -n '/\.se-/p' cred > se.txt
```

Perl

```
time perl -e 'while (<>) { /\.se-/ && print; }' cred > se.txt
```

Python

```
time python2 -c 'for line in open("cred"):
    if ".se-" in line:
        print line.rstrip()' > se.txt
```

Grep
```
time grep '\.se-' cred > se.txt
```

Fast-grep
```
time ./fast-grep '.se-' cred > se.txt
```

Results:

| Computer | sed | perl | python | grep | fast-grep |
|---|---|---|---|---|---|
| HP laptop | 1m15s | 1m7s | 0m57s | 0m19s | 0m5s |
| Mac OS X | 3m2s | 0m45s | 0m33s | 2m29s | 0m5s |
| Ubuntu | 1m13s | 1m14s | 1m13s | 1m11s | 1m17s (1m10s without threading) |

Note: fast-grep runs considerably slower the first time. Once the file contents are cached in memory it performs well. This requires a computer with sufficient RAM; otherwise the filesystem becomes the bottleneck. None of the benchmarks in the table are from the first run. All programs had an equal chance to cache the file contents.

Note2: Sed and grep run faster on Linux, while perl and python run faster on OS X. To be completely fair, the tests should be re-run on the same hardware. The sed and grep differences can be explained by the different implementations (BSD/GNU), while the file reading seems to be faster on the Linux machine (ext4/HFS+).

Note3: The Ubuntu machine didn't have enough RAM to cache the whole file in memory, so all programs had to do a lot of disk accesses. In this case it really measures the speed of the disk rather than the speed of the program.

LICENSE

~~=) All Rights Reversed - No Rights Reserved (=~~

Setting Orange, the 28th day of The Aftermath in the YOLD 3179

Albert Veli
