This is a repo for a future math-aware search engine.
Don't worry, "the day after tomorrow" is a temporary name; I will come up with a catchier name for this project someday.
This project originated from my previous math-only search engine prototype. Some promising new ideas have come up since then, so I decided to rewrite most of the code. This time, however, the project aims to become a production-level search engine with improved efficiency and, most importantly, to handle queries that contain both math expressions and general text terms. In other words, to be "math-aware".
This project is still in early development, so there is not much to show here yet. But keep your eyes open, and pull requests are appreciated!
The meaning behind "the day after tomorrow"? Reinventing a search engine in C is a long road, but as the saying goes:
Today is hard, tomorrow will be worse, but the day after tomorrow will be sunshine.
(Jack Ma)
$ git clone --depth=1 https://github.com/t-k-/the-day-after-tomorrow.git
The following instructions assume you have cloned this project into the directory $PROJECT.
Other than common built-in system libraries (pthread, libz, libm, libstdc++), there are some external dependencies you may need to download and install into your system environment:
For Debian/Ubuntu users, you can instead type the following commands to install the above dependencies automatically:
$ sudo apt-get update
$ sudo apt-get install bison flex python3-pip python3-dev \
    libtokyocabinet-dev libbz2-dev libz-dev
$ sudo pip3 install jieba
Lemur/Indri is not likely to be in your distribution's official software repository, so you may need to build and manually specify its library path (see the next step).
The Lemur/Indri library is an important dependency: this project currently relies on it for full-text indexing, so that we avoid reinventing the wheel and can focus on the math search implementation. To combine the two search engines, we simply merge their results and weight their scores accordingly.
After downloading the Indri tarball (indri-5.9, for example), build its libraries:
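The merge-and-weight idea can be pictured with a minimal Python sketch. This is only an illustration, not this project's code: the function name `merge_results`, the weights `w_text` and `w_math`, and the `{doc_id: score}` layout are all assumptions made for the example.

```python
def merge_results(text_hits, math_hits, w_text=0.5, w_math=0.5):
    """Combine two {doc_id: score} maps into one ranked list.

    All names and the default weights are hypothetical; the real project
    may combine full-text and math scores differently.
    """
    combined = {}
    for doc_id, score in text_hits.items():
        combined[doc_id] = combined.get(doc_id, 0.0) + w_text * score
    for doc_id, score in math_hits.items():
        combined[doc_id] = combined.get(doc_id, 0.0) + w_math * score
    # Highest combined score first; ties are broken by document ID.
    return sorted(combined.items(), key=lambda pair: (-pair[1], pair[0]))


# Made-up scores: document 2 is matched by both engines.
text_hits = {1: 2.0, 2: 1.0}
math_hits = {2: 4.0, 3: 1.0}
ranked = merge_results(text_hits, math_hits)
print(ranked)
```

A document found by both engines accumulates both weighted scores, so it tends to rank above documents matched by only one of them.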
$ (cd indri-5.9 && chmod +x configure && ./configure && make)
If Indri reports undefined reference to ... when building/linking, install that library and rerun configure:
After installing the zlib-devel package you must rerun configure so that it correctly finds it and adds the library to the ld command. (see https://sourceforge.net/p/lemur/discussion/546028/thread/e67752b2)
Our project uses dep-*.mk files to configure most dependency paths (or CFLAGS and LDFLAGS). If you have installed the above dependency libraries in your system environment, chances are you can just leave these dep-*.mk files untouched.
One dependency path you probably have to specify manually is the Lemur/Indri library. If you have downloaded and compiled the Lemur/Indri source code at ~/indri-5.9, type:
$ ./configure --indri-path=~/indri-5.9
to set up the build configuration. This configure script also checks the libraries necessary for building. If configure reports any library that cannot be located by the linker, you may need to double check and install the missing dependency before building.
Typing make at the project top level (i.e. $PROJECT) will do the job.
This project is still at an early stage, so there is not much to show yet. However, you can play with some existing commands:
- Run our TeX parser to see the corresponding operator tree of a math expression
$ ./tex-parser/run/test-tex-parser.out
edit: a+b/c
return code:SUCC
Operator tree:
 └──(plus) 2 son(s), token=ADD, path_id=1, ge_hash=0000, fr_hash=c301.
     │──[`a'] token=VAR, path_id=1, ge_hash=c401, fr_hash=8703.
     └──(frac) 2 son(s), token=FRAC, path_id=2, ge_hash=05f7, fr_hash=7203.
         │──#1[`b'] token=VAR, path_id=2, ge_hash=c501, fr_hash=7403.
         └──#2[`c'] token=VAR, path_id=3, ge_hash=c601, fr_hash=7503.
Subpaths (leaf-root paths/total subpaths = 3/4):
* VAR(0)/ADD(2)/[path_id=1: type=normal, leaf symbol=`a', fr_hash=8703]
* VAR(0)/rank1(1)/FRAC(2)/ADD(2)/[path_id=2: type=normal, leaf symbol=`b', fr_hash=7403]
* FRAC(2)/ADD(2)/[path_id=2: type=gener, ge_hash=05f7, fr_hash=7203]
* VAR(0)/rank2(1)/FRAC(2)/ADD(2)/[path_id=3: type=normal, leaf symbol=`c', fr_hash=7503]
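To make the leaf-root paths in the output above easier to read, here is a small Python sketch that rebuilds the operator tree of a+b/c by hand and walks it from each leaf up to the root. It is not the parser's implementation: the `Node` class, its `ranked` flag and the `leaf_root_paths` function are hypothetical, and the sketch leaves out the hashes, the numbers in parentheses and the `gener` subpath shown in the real output.

```python
class Node:
    """Hypothetical operator-tree node, for illustration only."""

    def __init__(self, token, symbol=None, children=(), ranked=False):
        self.token = token          # e.g. "ADD", "FRAC", "VAR"
        self.symbol = symbol        # leaf symbol such as "a"
        self.children = list(children)
        self.ranked = ranked        # True when operand order matters


def leaf_root_paths(node, above=()):
    """Yield (leaf symbol, path) pairs, tokens listed from leaf to root."""
    trail = (node.token,) + above
    if not node.children:
        yield node.symbol, "/".join(trail)
        return
    for rank, child in enumerate(node.children, start=1):
        child_above = trail
        if node.ranked:
            # Assumed reading of the output: operands of an order-sensitive
            # operator get a rankN marker, so b/c differs from c/b.
            child_above = ("rank%d" % rank,) + trail
        yield from leaf_root_paths(child, child_above)


# Operator tree of a+b/c, built by hand instead of by the TeX parser.
tree = Node("ADD", children=[
    Node("VAR", symbol="a"),
    Node("FRAC", ranked=True, children=[
        Node("VAR", symbol="b"),
        Node("VAR", symbol="c"),
    ]),
])
paths = list(leaf_root_paths(tree))
for symbol, path in paths:
    print(symbol, path)
```

The three paths printed correspond to the three `type=normal` subpaths reported by the parser.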
- Index a corpus/collection and see its index statistics
$PROJECT/indexer/test-doc includes a mini test corpus. Optionally, you may download a slightly larger plain text corpus (e.g. Reuters-21578 and Ohsumed from University of Trento CATEGORIZATION CORPORA) for performance evaluation. For a non-trivial (reasonably large) corpus, you will have the chance to observe the index merging process under the default generated index directory ($PROJECT/indexer/tmp).
  - cd $PROJECT/indexer and run run/test-txt-indexer.out -p ./test-doc to index corpus files recursively from our mini test corpus directory.
  - run ../term-index/run/test-read.out -s -p $PROJECT/indexer/tmp to take a peek at the index (termN, docN, avgDocLen etc.) you just built. (Pass the -h argument to see more options for the test-read.out program.)
- Test searching
So far, there are two search programs you can play with. One is term search (or full-text search), which uses the Okapi BM25 scoring method and highlights the keywords in the results directly in the terminal. To perform a full-text search, pass the -p option to specify the index path you want to search, e.g. for AND merge:
$ $PROJECT/searchd/run/test-term-search.out -p ./tmp -t 'hacker' -t 'news' -o AND
for OR merge:
$ $PROJECT/searchd/run/test-term-search.out -p ./tmp -t 'hello' -t 'nick' -t 'wilde' -o OR
Alternatively, you can try math search by running:
$ $PROJECT/searchd/run/test-math-search.out -p ./tmp -t 'a+b'
For a single TeX query, math search always performs an AND merge.
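For readers unfamiliar with the terms, the following Python sketch shows one common way to combine AND/OR merging with Okapi BM25 scoring over a toy in-memory index. It is an assumption-laden illustration rather than what test-term-search.out does internally: the function names, the index layout, the parameters `k1=1.2` and `b=0.75`, and the particular IDF variant are all choices made for this example.

```python
import math


def bm25(tf, df, doc_len, avg_doc_len, doc_n, k1=1.2, b=0.75):
    """Okapi BM25 score of one term in one document (a common variant)."""
    idf = math.log((doc_n - df + 0.5) / (df + 0.5) + 1.0)
    length_norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / (tf + length_norm)


def term_search(index, doc_lens, terms, op="AND"):
    """Rank documents for the query terms.

    index:    {term: {doc_id: term frequency}}  (hypothetical layout)
    doc_lens: {doc_id: document length in terms}
    op:       "AND" keeps documents containing every term,
              "OR" keeps documents containing at least one.
    """
    if not terms or not doc_lens:
        return []
    doc_n = len(doc_lens)
    avg_doc_len = sum(doc_lens.values()) / doc_n
    postings = [index.get(term, {}) for term in terms]
    doc_sets = [set(p) for p in postings]
    if op == "AND":
        candidates = set.intersection(*doc_sets)
    else:
        candidates = set.union(*doc_sets)
    scored = []
    for doc_id in candidates:
        score = 0.0
        for p in postings:
            if doc_id in p:
                score += bm25(p[doc_id], len(p), doc_lens[doc_id],
                              avg_doc_len, doc_n)
        scored.append((doc_id, score))
    # Best score first; ties are broken by document ID.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Toy index with made-up term frequencies and document lengths.
index = {"hacker": {1: 2, 2: 1}, "news": {2: 3, 3: 1}}
doc_lens = {1: 10, 2: 12, 3: 8}
and_hits = term_search(index, doc_lens, ["hacker", "news"], op="AND")
or_hits = term_search(index, doc_lens, ["hacker", "news"], op="OR")
print([doc_id for doc_id, _ in and_hits])
print([doc_id for doc_id, _ in or_hits])
```

With an AND merge only the document containing both terms survives, while an OR merge keeps every document that contains either term and lets the BM25 score decide the order.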
(boxes are external dependencies, circles are internal modules)
To generate this module dependency graph, run the commands below in the project top-level directory:
$ mkdir -p tmp
$ python3 proj-dep.py --targets > targets.mk
$ python3 proj-dep.py --dot > tmp/dep.dot
$ dot -Tpng tmp/dep.dot > tmp/dep.png
MIT
Anyone who wants efficient math-aware search to become a reality and is interested in making digital math-related content (e.g. a math Q&A website) searchable is invited to contribute to this project. Start contributing by opening an issue, contributing a line of code, or fixing a typo.