TopSig is a signature indexing and retrieval platform based around storing compact document signatures and enabling comparisons and retrieval using these signatures.
./topsig invocation-mode [options...]
The invocation mode specifies the overall operation TopSig is being instructed to run. The options are used to provide all other configuration information, including the paths to input and output files.
- index: When invoked in this mode, TopSig will read in the collection specified through -target-path, generate document signatures for each document in the collection and write the results to the signature file specified with -signature-path.
- query: When invoked in this mode, TopSig will take a single text query through the configuration option -query-text and will use this query to search the signature file specified with -signature-path.
- topic: When invoked in this mode, TopSig will take a file containing one or more queries through the configuration option -topic-path and will search the collection specified through -signature-path with these queries.
- termstats: When invoked in this mode, TopSig will read in the collection specified through -target-path and will write out term statistics to the path specified through -termstats-path. These term statistics can then be provided to TopSig when run in other invocation modes (such as index, query and topic) with the -termstats-path option.
- create-issl: When invoked in this mode, TopSig will read in the signature file specified through -signature-path and write out an inverted signature slice list (ISSL) table to the path specified through -issl-path.
- search-issl: When invoked in this mode, TopSig will read in the ISSL table and signature file specified with the -issl-path and -signature-path options and perform accelerated pairwise searches against signatures in this collection.
Configuration options can be passed to TopSig through command-line arguments, configuration files or both.
When configuration options are passed through the command line, each option name is prefixed with a - and followed by the value that the option is to be set to. For instance, to create a signature file named output.sig from the files inside a directory named collection, the following invocation would be used:
./topsig index -target-path collection -signature-path output.sig
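For scripted runs, the command-line form above can be assembled from a mapping of option names to values. The following is a minimal sketch, not part of TopSig: the helper name build_invocation and the MODE_OPTIONS table are hypothetical, and the table only records the options that each mode description mentions. Treating them as mandatory on the command line is an assumption, since TopSig can also pick options up from configuration files.

```python
# Options each invocation mode is described as reading (from the mode list).
MODE_OPTIONS = {
    "index": ["target-path", "signature-path"],
    "query": ["query-text", "signature-path"],
    "topic": ["topic-path", "signature-path"],
    "termstats": ["target-path", "termstats-path"],
    "create-issl": ["signature-path", "issl-path"],
    "search-issl": ["issl-path", "signature-path"],
}


def build_invocation(mode, options, binary="./topsig"):
    """Return an argv list of the form: ./topsig mode -name value ..."""
    if mode not in MODE_OPTIONS:
        raise ValueError(f"unknown invocation mode: {mode}")
    # Assumption: every option is passed on the command line, so the
    # options named in the mode description must all be present.
    missing = [name for name in MODE_OPTIONS[mode] if name not in options]
    if missing:
        raise ValueError(f"{mode} is missing options: {', '.join(missing)}")
    argv = [binary, mode]
    for name, value in options.items():
        argv += [f"-{name}", str(value)]  # each option name gets a "-" prefix
    return argv


if __name__ == "__main__":
    print(" ".join(build_invocation(
        "index", {"target-path": "collection", "signature-path": "output.sig"})))
```

The returned list can be handed to subprocess.run unchanged, which avoids shell quoting problems when a query text contains spaces.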
Configuration options can also be passed through configuration files, which have one configuration option per line (without the - prefix), followed by an =, followed by the value that the configuration option is to be set to.
By default, TopSig will read configuration information from a file named config.txt in the working directory TopSig is invoked from, providing it exists. Additional configuration files can also be specified through the -config configuration option. So, for example, the above invocation could also be reproduced by running TopSig with just the invocation mode:
./topsig index
...if the following config.txt file is also present:
target-path = collection
signature-path = output.sig
If the settings are in a different file, e.g. settings.cfg, the path will need to be provided through -config:
./topsig index -config settings.cfg
This option lets you keep several configuration files around, which makes it easier to track the settings used to obtain different results.
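To make the relationship between the two forms concrete, here is a small sketch that reads the "name = value" configuration format and produces the equivalent command-line arguments. It is illustrative only: the function names are hypothetical, and skipping blank or malformed lines is an assumption, since how TopSig itself treats such lines is not described here.

```python
def parse_config(text):
    """Parse 'name = value' lines into a dict of option names to values."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # assumption: blank or malformed lines are skipped
        name, value = line.split("=", 1)  # split on the first '=' only
        options[name.strip()] = value.strip()
    return options


def config_to_args(options):
    """Turn parsed options into the equivalent '-name value' arguments."""
    args = []
    for name, value in options.items():
        args += [f"-{name}", value]
    return args


if __name__ == "__main__":
    config_text = "target-path = collection\nsignature-path = output.sig\n"
    print(config_to_args(parse_config(config_text)))
```

Applied to the config.txt shown earlier, config_to_args yields the same options that were passed explicitly in the first index invocation.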
These options provide information about the data collection used in indexing. These options are used when TopSig is invoked in the index and termstats modes.
The -target-path option points to the collection, which can be either a file or a directory. If it is a directory, every file within the directory will be processed.
The -target-path-2 option points to a secondary collection file or directory that will be processed after the one pointed to by -target-path and added to the same signature file. Any number of additional collection files or directories can be added through the -target-path-3, -target-path-4, etc. options.
The -target-format option specifies the format that the collection files are in. If a directory was passed to -target-path, this option specifies the format of the files inside that directory.
Valid options include:
- file (default): File contains a single document.
- tar: File is a TAR format archive containing multiple files, each of which is a document.
- wsj: File is a TREC format SGML file consisting of multiple documents delimited with <DOC> </DOC> tags and with document filenames specified through <DOCNO> </DOCNO> tags.
- warc: File is a WARC format collection file, as used in ClueWeb09 and ClueWeb12, and consists of multiple documents, with document filenames specified with WARC-TREC-ID: labels.
- newline: File is a text file, with each line considered to be a separate document. Files are named by line number.
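As an illustration of what the multi-document formats imply, the sketch below separates a wsj-style file into (document ID, body) pairs and does the same for the newline format. It is not TopSig's parser: the function names are hypothetical, the regular expressions assume well-formed, non-nested tags, and numbering lines from 1 is an assumption because the starting line number is not stated above.

```python
import re

DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
DOCNO_RE = re.compile(r"<DOCNO>\s*(.*?)\s*</DOCNO>", re.DOTALL)


def split_trec(text):
    """Return (docno, body) pairs for each <DOC>...</DOC> block in text."""
    documents = []
    for match in DOC_RE.finditer(text):
        body = match.group(1)
        docno = DOCNO_RE.search(body)
        if docno is None:
            continue  # assumption: blocks without a <DOCNO> are skipped
        documents.append((docno.group(1), body))
    return documents


def split_newline(text):
    """Treat every line as a document named by its line number."""
    # Assumption: numbering starts at 1.
    return [(str(number), line)
            for number, line in enumerate(text.splitlines(), start=1)]


if __name__ == "__main__":
    sample = "<DOC>\n<DOCNO> WSJ-1 </DOCNO>\nfirst\n</DOC>\n"
    print(split_trec(sample))
    print(split_newline("first document\nsecond document"))
```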
This option specifies the compression mode used to compress the files pointed to by -target-path.
Valid options include:
- none (default): File is uncompressed.
- gz: File is compressed with gzip.
- bz2: File is compressed with bzip2.
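The three compression values map naturally onto a table of file openers. The sketch below shows that idea using Python's standard gzip and bz2 modules. It only illustrates the semantics of the option values and says nothing about how TopSig reads compressed input internally; the function name read_collection_file is hypothetical.

```python
import bz2
import gzip

# One opener per compression value listed above.
OPENERS = {"none": open, "gz": gzip.open, "bz2": bz2.open}


def read_collection_file(path, compression="none"):
    """Return the decompressed bytes of a collection file."""
    try:
        opener = OPENERS[compression]
    except KeyError:
        raise ValueError(f"unknown compression mode: {compression}") from None
    with opener(path, "rb") as handle:
        return handle.read()
```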
This option specifies a filter that the documents are run through when determining which parts of the document consist of text (which should be indexed) and which parts consist of markup (which should be ignored).
Valid options include:
- none (default): No filter is used.
- xml: A basic filter that strips XML tags is used. This is useful for cleaning up collections such as Wikipedia, which contain a lot of formatting information, most of which is not useful.
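A tag-stripping filter of the kind described can be approximated in a few lines. This is a rough sketch under stated assumptions rather than TopSig's filter: it treats anything between < and > as markup, replaces it with a space so that adjacent words do not run together, and leaves entities such as &amp; untouched.

```python
import re

TAG_RE = re.compile(r"<[^>]*>")


def strip_xml_tags(text):
    """Replace tags with spaces, then collapse runs of whitespace."""
    without_tags = TAG_RE.sub(" ", text)
    return " ".join(without_tags.split())


if __name__ == "__main__":
    print(strip_xml_tags("<page><title>Signature</title> <b>files</b></page>"))
```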
When documents are indexed into signatures, they are given an identifier based on the original document, which can be used to identify the document later on. Some formats (-target-format wsj and warc) provide the document ID explicitly. When the format is file or tar, the -docid-format option selects how TopSig will determine the document ID.
Valid options include:
- path (default): The path to the file, including the file extension. For TAR archives this is the full path stored in the TAR; for files and directories, this is the path as provided to TopSig. For instance, if -target-path is specified as collection and collection is a directory containing the files 1.txt, 2.txt and 3.txt, the document IDs of those files will be stored as collection/1.txt, collection/2.txt and collection/3.txt.
- basename: The directory path (everything up to the last directory separator) and the extension will be stripped. With the file structure used in the previous example, this would mean document IDs of 1, 2 and 3.
- basename.ext: The directory path will be stripped from the filename, but the extension will be kept: 1.txt, 2.txt, 3.txt.
- xmlfield: The filename will be ignored, and the document ID will instead be pulled out of an XML field inside the file. The XML field in question must be specified through the -xml-docid-field option. For example, if a document is indexed with the options -docid-format xmlfield -xml-docid-field docname, the document ID will be pulled from between the <docname> </docname> tags in the file.
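The four document ID formats can be summarised as a single function. The sketch below is illustrative and not taken from TopSig: document_id and its parameters are hypothetical names, paths are assumed to use / as the separator, only the final extension is stripped for basename, and the XML field is located with a simple regular expression.

```python
import posixpath
import re


def document_id(path, docid_format="path", content="", xml_field=None):
    """Derive a document ID from a file path (or its content for xmlfield)."""
    if docid_format == "path":
        return path  # the path exactly as provided
    if docid_format == "basename.ext":
        return posixpath.basename(path)  # drop the directory, keep extension
    if docid_format == "basename":
        # Drop the directory and the final extension (assumption: only one).
        return posixpath.splitext(posixpath.basename(path))[0]
    if docid_format == "xmlfield":
        if not xml_field:
            raise ValueError("xmlfield requires an XML field name")
        tag = re.escape(xml_field)
        match = re.search(rf"<{tag}>\s*(.*?)\s*</{tag}>", content, re.DOTALL)
        if match is None:
            raise ValueError(f"no <{xml_field}> field found in document")
        return match.group(1)
    raise ValueError(f"unknown document ID format: {docid_format}")


if __name__ == "__main__":
    for fmt in ("path", "basename", "basename.ext"):
        print(fmt, document_id("collection/1.txt", fmt))
```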
This option tells TopSig to split documents that are too long into multiple signatures. This can be desirable for two reasons:
- Signature quality tends to deteriorate when a document contains too many terms for the signature size. Splitting avoids this: documents that are larger than the specified threshold are split into several signatures, and searches continue to work.
- For certain modes of retrieval it is beneficial to be able to operate at the passage or sentence level, rather than at whole document level. TopSig records the starting and ending offsets of each signature within the original file, which makes focused search possible when splitting is used.
Valid options include:
none
(default): No splitting takes place.
- hard: Documents are split immediately upon reaching the term threshold specified with the -split-max option (default: 512).
- sentence: Documents are split either upon reaching the -split-max threshold, or after reaching the -split-min threshold (default: 256) and encountering a full stop character (.).
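The difference between the splitting modes is easiest to see on a short term list. The sketch below is a simplified model, not TopSig's implementation: it assumes terms are whitespace-separated tokens, that a full stop is detected at the end of a token, and it ignores the file offsets that TopSig records. The function and parameter names are hypothetical.

```python
def split_terms(terms, mode="none", split_max=512, split_min=256):
    """Group a list of terms into chunks, one chunk per signature."""
    if mode == "none":
        return [list(terms)] if terms else []
    if mode not in ("hard", "sentence"):
        raise ValueError(f"unknown split mode: {mode}")
    chunks, current = [], []
    for term in terms:
        current.append(term)
        at_max = len(current) >= split_max
        # Sentence mode may split early: once split_min terms have been
        # collected, the next term ending in a full stop closes the chunk.
        at_sentence_end = (mode == "sentence"
                           and len(current) >= split_min
                           and term.endswith("."))
        if at_max or at_sentence_end:
            chunks.append(current)
            current = []
    if current:
        chunks.append(current)  # keep the final partial chunk
    return chunks


if __name__ == "__main__":
    terms = "a b c. d e f g h".split()
    print(split_terms(terms, "hard", split_max=3))
    print(split_terms(terms, "sentence", split_max=5, split_min=2))
```

With the small thresholds used in the demo, hard mode cuts every three terms regardless of punctuation, while sentence mode closes the first chunk at "c." and falls back to the maximum for the rest.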