
data-search/c-bfscan

 
 


The data is located at: /fs/clip-twitter/ylwang/bfscan-statistics/All/

Positional representation

The positional representation is simply a sequence of termids, one for each document position:

352
931
64
352
846
...

That is, term 352 appears in position 0, term 931 appears in position 1, etc.
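As a minimal sketch of what this layout implies, the function below counts the occurrences of one term in a positionally stored document by walking every position. The function name `count_term_pos` and the use of plain `int` termids are assumptions made for illustration, not taken from the repository.

```c
#include <assert.h>

/* Count how often `termid` occurs in one positionally stored document.
 * `doc` holds one termid per position; `doc_len` is the number of positions.
 * The name and the int-based layout are illustrative assumptions. */
int count_term_pos(const int *doc, int doc_len, int termid)
{
    assert(doc_len >= 0);
    int count = 0;
    for (int pos = 0; pos < doc_len; pos++) {
        if (doc[pos] == termid)
            count++;
    }
    return count;
}
```

For a toy document made of only the five termids listed above, counting term 352 yields 2, because it sits at positions 0 and 3.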

TF representation

The TF representation uses two parallel arrays, one of termids and one of term frequencies, and in essence "pre-aggregates" the TF within each document:

352  2
931  1
64   1
846  1
...

In other words, term 352 appears twice (i.e., has tf = 2).
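The relationship between the two representations can be sketched as a conversion routine. This is only an illustration: the name `pos_to_tf`, the output arrays `termids` and `tfs`, and the simple quadratic search are assumptions, and the repository may build its TF data differently.

```c
#include <assert.h>

/* Convert one positional document into two parallel arrays:
 * termids[k] is the k-th distinct term (in order of first appearance)
 * and tfs[k] is its frequency. Both output arrays must have room for
 * doc_len entries. Returns the number of distinct terms.
 * All names here are hypothetical. */
int pos_to_tf(const int *doc, int doc_len, int *termids, int *tfs)
{
    assert(doc_len >= 0);
    int num_terms = 0;
    for (int pos = 0; pos < doc_len; pos++) {
        /* Linear search for the term among those seen so far. */
        int k = 0;
        while (k < num_terms && termids[k] != doc[pos])
            k++;
        if (k == num_terms) {
            /* First occurrence: open a new slot. */
            termids[k] = doc[pos];
            tfs[k] = 0;
            num_terms++;
        }
        tfs[k]++;
    }
    return num_terms;
}
```

Applied to the five termids from the positional example, this produces the four (termid, tf) pairs listed above.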

Implementation 1

For each topic:

  • Loop over all documents
  • Loop over terms in document
  • Loop over all query terms
  • Compute score
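The loop nesting of Implementation 1 can be sketched for the positional representation as follows. Several things are assumed here: the collection is one flat array of termids with a separate `doc_lens` array, the function name is made up, and the score is a placeholder (the number of document positions that match any query term), since the scoring function itself is not described in this README.

```c
#include <assert.h>

/* Implementation 1 nesting: documents -> document terms -> query terms.
 * `collection` is a flat array of termids; doc_lens[d] is the length of
 * document d. The layout, the name, and the match-count score are all
 * illustrative assumptions. */
void scan_pos_v1_sketch(const int *collection, const int *doc_lens,
                        int num_docs, const int *query, int query_len,
                        int *scores)
{
    assert(num_docs >= 0 && query_len >= 0);
    const int *doc = collection;
    for (int d = 0; d < num_docs; d++) {              /* all documents  */
        int score = 0;
        for (int i = 0; i < doc_lens[d]; i++) {       /* document terms */
            for (int j = 0; j < query_len; j++) {     /* query terms    */
                if (doc[i] == query[j])
                    score++;
            }
        }
        scores[d] = score;                            /* placeholder score */
        doc += doc_lens[d];                           /* next document     */
    }
}
```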

Implementation 2

For each topic:

  • Loop over all documents
  • Loop over all query terms
  • Loop over terms in document
  • Compute score

Note that implementations 1 and 2 differ only in the nesting order of the document-term loop and the query-term loop.
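For comparison, here is a sketch of the Implementation 2 nesting, this time on a single document in the TF representation. The function name and the placeholder score (the sum of the tf values of matching query terms) are assumptions. The early `break` also assumes that each termid occurs at most once per document, which is what pre-aggregation implies.

```c
#include <assert.h>

/* Implementation 2 nesting for one TF document: query terms outside,
 * document terms inside. termids[] and tfs[] are the two parallel
 * arrays. The name and the summed-tf score are hypothetical. */
int score_doc_tf_v2_sketch(const int *termids, const int *tfs, int num_terms,
                           const int *query, int query_len)
{
    assert(num_terms >= 0 && query_len >= 0);
    int score = 0;
    for (int j = 0; j < query_len; j++) {          /* query terms    */
        for (int k = 0; k < num_terms; k++) {      /* document terms */
            if (termids[k] == query[j]) {
                score += tfs[k];
                break;  /* assumed: a termid appears once per TF document */
            }
        }
    }
    return score;
}
```

With the query terms in the outer loop, the inner loop can stop at the first match, which is not possible in the same way when the document terms are outermost.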

Implementation 3

For each topic:

  • Loop over all documents
  • Separate code path for queries of different lengths
  • Loop over terms in document
  • Compute score
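One way to read "separate code path for queries of different lengths" is a `switch` on the query length, with the query-term loop unrolled by hand in each case. The sketch below does this for one positional document. The function name, the cut-off at two terms, and the match-count score are assumptions for illustration only.

```c
#include <assert.h>

/* Implementation 3 idea for one positional document: pick a specialised
 * path by query length so the query-term loop disappears for short
 * queries. Name, cut-off, and score are hypothetical. */
int score_doc_v3_sketch(const int *doc, int doc_len,
                        const int *query, int query_len)
{
    assert(doc_len >= 0 && query_len >= 0);
    int score = 0;
    switch (query_len) {
    case 1: {                                   /* one-term queries */
        const int q0 = query[0];
        for (int i = 0; i < doc_len; i++)
            score += (doc[i] == q0);
        break;
    }
    case 2: {                                   /* two-term queries */
        const int q0 = query[0], q1 = query[1];
        for (int i = 0; i < doc_len; i++)
            score += (doc[i] == q0) + (doc[i] == q1);
        break;
    }
    default:                                    /* general fallback */
        for (int i = 0; i < doc_len; i++)
            for (int j = 0; j < query_len; j++)
                score += (doc[i] == query[j]);
        break;
    }
    return score;
}
```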

Implementation 4

For each topic:

  • Loop over all documents
  • Separate code path for queries of different lengths
  • Use a dispatch table to dispatch to a separate function to process documents of different lengths
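The dispatch-table idea can be sketched as an array of function pointers indexed by document length, where each entry is a fully unrolled scan for that length. To keep the example short it covers only the one-term-query code path, unrolls only lengths 0 to 3, and falls back to a generic loop for longer documents. Every name, the cut-off, and the match-count score are assumptions rather than the repository's actual code.

```c
#include <assert.h>

#define MAX_UNROLLED_LEN 3   /* hypothetical cut-off */

typedef int (*doc_scan_fn)(const int *doc, int term);

/* One fully unrolled scan per document length. */
static int scan_len0(const int *doc, int term) { (void)doc; (void)term; return 0; }
static int scan_len1(const int *doc, int term) { return doc[0] == term; }
static int scan_len2(const int *doc, int term)
{
    return (doc[0] == term) + (doc[1] == term);
}
static int scan_len3(const int *doc, int term)
{
    return (doc[0] == term) + (doc[1] == term) + (doc[2] == term);
}

/* Dispatch table: entry n handles documents of exactly n terms. */
static const doc_scan_fn dispatch[MAX_UNROLLED_LEN + 1] = {
    scan_len0, scan_len1, scan_len2, scan_len3
};

/* Score one positional document against a one-term query. */
int score_doc_v4_sketch(const int *doc, int doc_len, int term)
{
    assert(doc_len >= 0);
    if (doc_len <= MAX_UNROLLED_LEN)
        return dispatch[doc_len](doc, term);
    /* Generic fallback for documents longer than the table covers. */
    int score = 0;
    for (int i = 0; i < doc_len; i++)
        score += (doc[i] == term);
    return score;
}
```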

Implementation 5

For each topic:

  • Loop over an array of function pointers that point to functions that process documents of different lengths
  • Separate code path for queries of different lengths
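One possible reading of Implementation 5 is that documents are grouped by length ahead of time, and each function in the array scans every document of one particular length. The sketch below follows that reading for the one-term-query code path and only two lengths. The bucketed layout, the `length_bucket` struct, and all function names are assumptions, so treat this as an illustration of the control flow rather than a description of the actual code.

```c
#include <assert.h>

#define NUM_LENGTHS 2   /* hypothetical: only lengths 1 and 2 are covered */

/* A scanner scores all documents of one fixed length. */
typedef void (*bucket_scan_fn)(const int *terms, int num_docs,
                               int term, int *scores);

static void scan_docs_len1(const int *terms, int num_docs, int term, int *scores)
{
    for (int d = 0; d < num_docs; d++)
        scores[d] = (terms[d] == term);
}

static void scan_docs_len2(const int *terms, int num_docs, int term, int *scores)
{
    for (int d = 0; d < num_docs; d++)
        scores[d] = (terms[2 * d] == term) + (terms[2 * d + 1] == term);
}

/* All documents of one length, stored back to back (assumed layout). */
struct length_bucket {
    const int *terms;
    int num_docs;
    int *scores;
};

static const bucket_scan_fn scanners[NUM_LENGTHS] = {
    scan_docs_len1, scan_docs_len2
};

/* The outer loop runs over the function pointers, not over documents. */
void scan_v5_sketch(const struct length_bucket *buckets, int term)
{
    assert(buckets != 0);
    for (int k = 0; k < NUM_LENGTHS; k++)
        scanners[k](buckets[k].terms, buckets[k].num_docs,
                    term, buckets[k].scores);
}
```

Under this reading the per-document dispatch of Implementation 4 disappears, because the document length is fixed inside each scanner.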

So we have:

  • bfscan_pos_v1, bfscan_pos_v2, bfscan_pos_v3, bfscan_pos_v4, bfscan_pos_v5
  • bfscan_tf_v1, bfscan_tf_v2, bfscan_tf_v3, bfscan_tf_v4, bfscan_tf_v5

Each of these processes a single query at a time; separate versions process different numbers of queries at a time.

Performance

The table below shows the latency per query, in milliseconds, for each representation (rows) and each implementation (columns).

            v1   v2   v3   v4   v5
positional  815  630  430
tf          750  670  405  610  615

Baselines (milliseconds per query):

Lucene              200
raw_scan            240
raw_scan_24bits     350

About

Implementations of brute force scans for document retrieval in C
