AlbertVeli/fast-grep

Minimal multi-threaded grep.

No bells, no whistles, no case insensitivity, no regular expressions. Just plain text search. Works similarly to:

sed -n '/needle/p' haystack.txt

This will print all lines containing the string needle in the file haystack.txt. Because the search is multi-threaded, matching lines may be printed in a different order than they appear in the file.
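The following is a minimal sketch of how a multi-threaded plain-text line search can be structured. It is not taken from the fast-grep source. The names find_bytes, struct job, search_chunk, parallel_grep and MAX_THREADS are hypothetical, and the buffer is assumed to be already in memory (for example via mmap). Each thread handles the lines that start inside its own byte range, which is also why the output order depends on thread scheduling.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define MAX_THREADS 16  /* arbitrary cap chosen for this sketch */

/* Return a pointer to the first occurrence of needle (m bytes) in s (n bytes), or NULL. */
static const char *find_bytes(const char *s, size_t n, const char *needle, size_t m)
{
    if (m == 0)
        return s;
    while (n >= m) {
        const char *p = memchr(s, needle[0], n - m + 1);
        if (p == NULL)
            return NULL;
        if (memcmp(p, needle, m) == 0)
            return p;
        n -= (size_t)(p - s) + 1;
        s = p + 1;
    }
    return NULL;
}

/* Hypothetical per-thread work item. */
struct job {
    const char *buf;        /* whole haystack */
    size_t len;             /* total haystack length */
    size_t begin, end;      /* this thread owns lines starting in [begin, end) */
    const char *needle;
    FILE *out;
    pthread_mutex_t *lock;  /* keeps each printed line in one piece */
    size_t matches;         /* result: number of matching lines */
};

static void *search_chunk(void *arg)
{
    struct job *j = arg;
    size_t m = strlen(j->needle);
    size_t pos = j->begin;

    /* A line belongs to the chunk in which it starts: skip a partial first line. */
    if (pos > 0 && j->buf[pos - 1] != '\n') {
        const char *nl = memchr(j->buf + pos, '\n', j->len - pos);
        pos = nl ? (size_t)(nl - j->buf) + 1 : j->len;
    }
    while (pos < j->end) {
        const char *line = j->buf + pos;
        const char *nl = memchr(line, '\n', j->len - pos);
        size_t n = nl ? (size_t)(nl - line) : j->len - pos;

        if (find_bytes(line, n, j->needle, m) != NULL) {
            pthread_mutex_lock(j->lock);
            fwrite(line, 1, n, j->out);
            fputc('\n', j->out);
            pthread_mutex_unlock(j->lock);
            j->matches++;
        }
        pos += n + 1;  /* step past the newline */
    }
    return NULL;
}

/* Search buf with nthreads threads; returns the number of matching lines. */
size_t parallel_grep(const char *buf, size_t len, const char *needle,
                     int nthreads, FILE *out)
{
    pthread_t tid[MAX_THREADS];
    struct job jobs[MAX_THREADS];
    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    size_t total = 0;

    assert(buf != NULL && needle != NULL && out != NULL);
    if (nthreads < 1)
        nthreads = 1;
    if (nthreads > MAX_THREADS)
        nthreads = MAX_THREADS;

    size_t chunk = len / (size_t)nthreads;
    for (int i = 0; i < nthreads; i++) {
        jobs[i].buf = buf;
        jobs[i].len = len;
        jobs[i].begin = chunk * (size_t)i;
        jobs[i].end = (i == nthreads - 1) ? len : chunk * (size_t)(i + 1);
        jobs[i].needle = needle;
        jobs[i].out = out;
        jobs[i].lock = &lock;
        jobs[i].matches = 0;
        pthread_create(&tid[i], NULL, search_chunk, &jobs[i]);
    }
    for (int i = 0; i < nthreads; i++) {
        pthread_join(tid[i], NULL);
        total += jobs[i].matches;
    }
    return total;
}
```

Calling parallel_grep(buf, len, "needle", 4, stdout) prints every matching line exactly once, but lines found by different threads can interleave in any order.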

EXAMPLE

Search for lines containing the string '.se-' in the file cred, but skip lines containing the string '-|-|--':

./fast-grep -v '-|-|--' '.se-' cred
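The selection rule in this example can be written as a small predicate. This is only an illustration of the behaviour described above, not the actual fast-grep code. The function name line_selected and its parameters are hypothetical, and the line is assumed to be a NUL-terminated string.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: decide whether a line should be printed.
 * A line is selected if it contains needle and, when an exclude
 * string is given (the argument to -v), does not contain it.
 */
bool line_selected(const char *line, const char *needle, const char *exclude)
{
    assert(line != NULL && needle != NULL);

    if (strstr(line, needle) == NULL)
        return false;   /* needle missing: never printed */
    if (exclude != NULL && strstr(line, exclude) != NULL)
        return false;   /* contains the excluded string: skipped */
    return true;
}
```

With needle '.se-' and exclude '-|-|--', a line such as "user@example.se-abc" is selected, while a line containing both strings is skipped.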

LIMITATIONS

  • Only tested with GNU/Linux and OS X.
  • Will fail for large files (> 2 GB) on a 32-bit OS. This could be fixed by using LFS (Large File Support) and calling mmap64 repeatedly, mapping 1 or 2 GB in each iteration, but that is currently not implemented.
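The windowed mapping idea from the second limitation could look roughly like the sketch below. It is not part of fast-grep. The function name count_in_windows and its parameters are hypothetical, and for brevity it counts occurrences of the needle instead of printing lines. It assumes that defining _FILE_OFFSET_BITS as 64 before any include gives a 64-bit off_t, which is how LFS makes mmap behave like mmap64 on glibc. Each window is mapped with a few extra bytes so that a match straddling two windows is still found exactly once.

```c
#define _FILE_OFFSET_BITS 64  /* LFS: 64-bit off_t on 32-bit systems */
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Count occurrences of needle in the file behind fd, mapping at most
 * window (+ overlap) bytes at a time. Returns -1 on error.
 */
long long count_in_windows(int fd, const char *needle, size_t window)
{
    assert(needle != NULL);

    size_t m = strlen(needle);
    long page = sysconf(_SC_PAGESIZE);
    struct stat st;
    long long count = 0;

    if (m == 0 || page <= 0)
        return -1;
    /* mmap offsets must be page-aligned, so round the window up. */
    window = (window + (size_t)page - 1) / (size_t)page * (size_t)page;
    if (window == 0 || fstat(fd, &st) < 0)
        return -1;

    for (off_t off = 0; off < st.st_size; off += (off_t)window) {
        off_t left = st.st_size - off;
        size_t want = window + m - 1;  /* overlap of m-1 bytes into the next window */
        size_t map_len = left < (off_t)want ? (size_t)left : want;

        char *p = mmap(NULL, map_len, PROT_READ, MAP_PRIVATE, fd, off);
        if (p == MAP_FAILED) {
            perror("mmap");
            return -1;
        }

        /* Every match found here starts inside [off, off + window). */
        const char *s = p;
        size_t n = map_len;
        while (n >= m) {
            const char *hit = memchr(s, needle[0], n - m + 1);
            if (hit == NULL)
                break;
            if (memcmp(hit, needle, m) == 0)
                count++;
            n -= (size_t)(hit - s) + 1;
            s = hit + 1;
        }
        munmap(p, map_len);
    }
    return count;
}
```

A line-printing version would additionally need to handle lines that cross a window boundary, for example by extending the overlap to the next newline.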

BENCHMARK

Searching a 9.3G ASCII text file named cred for the search string ".se-" (406006 matching lines).

Computers:

  • HP 8560w laptop (Core i7m, Linux), GNU sed 4.2.1, GNU grep 2.14, perl v5.12.4, Python 2.7.5
  • Mac (Core i5, OS X), BSD sed (version?), BSD grep 2.5.1, perl v5.16.2, Python 2.7.5
  • Ubuntu (Xeon W3550), GNU sed 4.2.1, GNU grep 2.10, perl v5.14.2, Python 2.7.3

Commands:

  1. Sed

    time sed -n '/\.se-/p' cred > se.txt

  2. Perl

    time perl -e 'while (<>) { /\.se-/ && print; }' cred > se.txt

  3. Python

    time python2 -c 'for line in open("cred"):
        if ".se-" in line:
            print line.rstrip()' > se.txt

  4. Grep

    time grep '\.se-' cred > se.txt

  5. Fast-grep

    time ./fast-grep '.se-' cred > se.txt

Results:

Computer    1 sed    2 perl   3 python   4 grep   5 fast-grep
HP laptop   1m15s    1m7s     0m57s      0m19s    0m5s
Mac OS X    3m2s     0m45s    0m33s      2m29s    0m5s
Ubuntu      1m13s    1m14s    1m13s      1m11s    1m17s (1m10s w/o threading)

Note: fast-grep runs considerably slower the first time. Once the file contents are cached in memory it performs well. This requires a computer with sufficient RAM, otherwise the filesystem becomes a bottleneck. None of the benchmarks in the table are from the first run. All programs had an equal chance to cache the file contents.

Note2: Sed and grep run faster on Linux, while perl and python run faster on OS X. To be completely fair, the tests should be re-run on the same hardware. The sed and grep differences can be explained by the different implementations (BSD/GNU), while the file reading seems to be faster on the Linux machine (ext4/HFS+).

Note3: The Ubuntu machine didn't have enough RAM to cache the whole file in memory, so all programs had to do a lot of disk accesses. In this case it really measures the speed of the disk rather than the speed of the program.

LICENSE

~~=) All Rights Reversed - No Rights Reserved (=~~

Setting Orange, the 28th day of The Aftermath in the YOLD 3179

Albert Veli
