GitHub - data-search/c-bfscan: Implementations of brute force scans for document retrieval in C

The data is located at: /fs/clip-twitter/ylwang/bfscan-statistics/All/

Positional representation

The positional representation is simply a sequence of termids, one for each document position:

That is, term 352 appears in position 0, term 931 appears in position 1, etc.
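
As a minimal sketch of this layout, assume all documents are concatenated into one flat array of term IDs, with a parallel array of document lengths. The names `collection_pos`, `doc_lengths`, and `count_term_pos` are hypothetical, and every value other than terms 352 and 931 at positions 0 and 1 is illustrative rather than taken from the repository:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: documents stored back to back as term IDs,
   one entry per document position. */
const uint32_t collection_pos[] = {
    352, 931, 352, 18,   /* document 0: term 352 at position 0, 931 at position 1 */
    77, 931              /* document 1 (illustrative values) */
};
const uint32_t doc_lengths[] = { 4, 2 };   /* positions per document */
const size_t num_docs = 2;

/* Count how often `term` occurs in one positionally stored document. */
uint32_t count_term_pos(const uint32_t *doc, uint32_t len, uint32_t term)
{
    uint32_t tf = 0;
    for (uint32_t i = 0; i < len; i++) {
        if (doc[i] == term) {
            tf++;
        }
    }
    return tf;
}
```

With this layout, a term frequency has to be recomputed by scanning every position of the document each time it is needed.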

TF representation

The TF representation requires two parallel arrays, and in essence "pre-aggregates" the TF within each document:

In other words, term 352 appears twice (i.e., has tf = 2).
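
A minimal sketch of the two parallel arrays, assuming one array of distinct term IDs and one array of matching counts per document. The names `tf_terms`, `tf_counts`, and `lookup_tf` are hypothetical, and only the fact that term 352 has tf = 2 comes from the text above:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout for a single document: tf_terms[i] is a distinct
   term ID and tf_counts[i] is how many times that term occurs. */
const uint32_t tf_terms[]  = { 352, 931, 18 };
const uint32_t tf_counts[] = {   2,   1,  1 };
const size_t   tf_unique   = 3;   /* number of distinct terms */

/* Return the pre-aggregated term frequency, or 0 if the term is absent. */
uint32_t lookup_tf(const uint32_t *terms, const uint32_t *counts,
                   size_t n_unique, uint32_t term)
{
    for (size_t i = 0; i < n_unique; i++) {
        if (terms[i] == term) {
            return counts[i];
        }
    }
    return 0;
}
```

Compared with the positional layout, the scan visits each distinct term once and reads its frequency directly, at the cost of discarding position information.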

Implementation 1

For each topic:

Loop over all documents
  Loop over terms in document
    Loop over all query terms
      Compute score

Implementation 2

For each topic:

Loop over all documents
  Loop over all query terms
    Loop over terms in document
      Compute score

Note that implementations 1 and 2 differ only in the nesting order of the document-term and query-term loops.
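
The two loop orders can be sketched for a single positional document as follows. This is an illustration, not the repository's code: the function names are hypothetical, and the scoring is a placeholder that adds a per-query-term weight on every match, since the actual scoring function is not described here.

```c
#include <stddef.h>
#include <stdint.h>

/* Implementation 1 ordering: document terms outside, query terms inside.
   Placeholder scoring: each match adds that query term's weight. */
float score_doc_v1(const uint32_t *doc, size_t doc_len,
                   const uint32_t *query, const float *weights, size_t q_len)
{
    float score = 0.0f;
    for (size_t i = 0; i < doc_len; i++) {
        for (size_t j = 0; j < q_len; j++) {
            if (doc[i] == query[j]) {
                score += weights[j];
            }
        }
    }
    return score;
}

/* Implementation 2 ordering: query terms outside, document terms inside. */
float score_doc_v2(const uint32_t *doc, size_t doc_len,
                   const uint32_t *query, const float *weights, size_t q_len)
{
    float score = 0.0f;
    for (size_t j = 0; j < q_len; j++) {
        for (size_t i = 0; i < doc_len; i++) {
            if (doc[i] == query[j]) {
                score += weights[j];
            }
        }
    }
    return score;
}
```

Both functions visit the same (document term, query term) pairs; only the memory access pattern differs, which is what the latency comparison is measuring.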

Implementation 3

For each topic:

Loop over all documents
  Separate code path for queries of different lengths
    Loop over terms in document
      Compute score
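
One way to read "separate code path for queries of different lengths" is a switch on the query length, with the query-term loop written out by hand for short queries. The sketch below assumes that reading; `score_doc_v3` is a hypothetical name and the additive weight scoring is a placeholder:

```c
#include <stddef.h>
#include <stdint.h>

/* Specialised paths for 1- and 2-term queries; generic loop otherwise. */
float score_doc_v3(const uint32_t *doc, size_t doc_len,
                   const uint32_t *q, const float *w, size_t q_len)
{
    float score = 0.0f;
    switch (q_len) {
    case 1:
        for (size_t i = 0; i < doc_len; i++) {
            if (doc[i] == q[0]) score += w[0];
        }
        break;
    case 2:
        for (size_t i = 0; i < doc_len; i++) {
            if (doc[i] == q[0]) score += w[0];
            if (doc[i] == q[1]) score += w[1];
        }
        break;
    default:
        /* Fallback for longer (or empty) queries. */
        for (size_t i = 0; i < doc_len; i++) {
            for (size_t j = 0; j < q_len; j++) {
                if (doc[i] == q[j]) score += w[j];
            }
        }
        break;
    }
    return score;
}
```

Removing the inner query-term loop from the short-query paths gives the compiler a fixed number of comparisons per document position.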

Implementation 4

For each topic:

Loop over all documents
  Separate code path for queries of different lengths
    Use a dispatch table to dispatch to a separate function to process documents of different lengths
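
A dispatch table of this kind might look like the sketch below, which covers only the single-term query path and document lengths 1 to 4, with a generic fallback for anything longer. All names (`scan_fn`, `DEFINE_SCAN`, `dispatch`, `score_doc_v4`) and the placeholder scoring are assumptions made for illustration:

```c
#include <stddef.h>
#include <stdint.h>

typedef float (*scan_fn)(const uint32_t *doc, uint32_t q0, float w0);

/* Generate one function per document length so each loop has a
   compile-time trip count. */
#define DEFINE_SCAN(N)                                                    \
    static float scan_len_##N(const uint32_t *doc, uint32_t q0, float w0) \
    {                                                                     \
        float score = 0.0f;                                               \
        for (int i = 0; i < (N); i++) {                                   \
            if (doc[i] == q0) score += w0;                                \
        }                                                                 \
        return score;                                                     \
    }

DEFINE_SCAN(1)
DEFINE_SCAN(2)
DEFINE_SCAN(3)
DEFINE_SCAN(4)

#define MAX_DISPATCH_LEN 4

/* dispatch[len - 1] handles documents of exactly `len` positions. */
static const scan_fn dispatch[MAX_DISPATCH_LEN] = {
    scan_len_1, scan_len_2, scan_len_3, scan_len_4
};

float score_doc_v4(const uint32_t *doc, size_t doc_len, uint32_t q0, float w0)
{
    if (doc_len >= 1 && doc_len <= MAX_DISPATCH_LEN) {
        return dispatch[doc_len - 1](doc, q0, w0);
    }
    /* Generic fallback for lengths outside the table (including 0). */
    float score = 0.0f;
    for (size_t i = 0; i < doc_len; i++) {
        if (doc[i] == q0) score += w0;
    }
    return score;
}
```

The table lookup replaces a length-dependent loop bound with an indirect call per document, which is the trade-off the v4 timings reflect.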

Implementation 5

For each topic:

Loop over an array of function pointers that points to functions that process documents of different lengths
  Separate code path for queries of different lengths
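
One possible reading of this design is that documents are grouped by length, and the outer loop walks an array of function pointers, each of which scans every document of one length. The sketch below assumes that grouping, covers only the single-term query path and lengths 1 to 3, and uses hypothetical names (`bucket_t`, `bucket_scanners`, `scan_all_v5`) with placeholder scoring:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bucket: all documents of one length stored back to back. */
typedef struct {
    const uint32_t *terms;   /* n_docs * length term IDs */
    size_t n_docs;
} bucket_t;

typedef void (*bucket_fn)(const bucket_t *b, uint32_t q0, float w0,
                          float *scores);

/* Generate one scanner per document length; it scores every document
   in the bucket for that length. */
#define DEFINE_BUCKET_SCAN(N)                                          \
    static void scan_bucket_##N(const bucket_t *b, uint32_t q0,        \
                                float w0, float *scores)               \
    {                                                                  \
        for (size_t d = 0; d < b->n_docs; d++) {                       \
            const uint32_t *doc = b->terms + d * (N);                  \
            float s = 0.0f;                                            \
            for (int i = 0; i < (N); i++) {                            \
                if (doc[i] == q0) s += w0;                             \
            }                                                          \
            scores[d] = s;                                             \
        }                                                              \
    }

DEFINE_BUCKET_SCAN(1)
DEFINE_BUCKET_SCAN(2)
DEFINE_BUCKET_SCAN(3)

#define NUM_LENGTHS 3

/* bucket_scanners[k] handles documents of length k + 1. */
static const bucket_fn bucket_scanners[NUM_LENGTHS] = {
    scan_bucket_1, scan_bucket_2, scan_bucket_3
};

/* Scores are written bucket by bucket; returns the number of documents scored. */
size_t scan_all_v5(const bucket_t buckets[NUM_LENGTHS], uint32_t q0,
                   float w0, float *scores)
{
    size_t out = 0;
    for (size_t k = 0; k < NUM_LENGTHS; k++) {
        bucket_scanners[k](&buckets[k], q0, w0, scores + out);
        out += buckets[k].n_docs;
    }
    return out;
}
```

Under this reading, the indirect call happens once per document length rather than once per document, unlike the per-document dispatch of implementation 4.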

So we have:

bfscan_pos_v1, bfscan_pos_v2, bfscan_pos_v3, bfscan_pos_v4, bfscan_pos_v5
bfscan_tf_v1, bfscan_tf_v2, bfscan_tf_v3, bfscan_tf_v4, bfscan_tf_v5

All of these process a single query at a time; separate versions process different numbers of queries at a time.

Performance

Here we show the latency per query, in milliseconds, for the two representations under each implementation.

            v1   v2   v3   v4   v5
positional  815  630  430
tf          750  670  405  610  615

Baselines (milliseconds per query):

Lucene              200
raw_scan            240
raw_scan_24bits     350

