A dictionary search engine
This is a kind of search engine for dictionaries. It allows searching words in a dictionary with approximate matching (aka fuzzy matching) and pattern matching. In response to a query, it returns a list of matching words, sorted by decreasing similarity to the query. Informations necessary for paginating results efficiently are also provided to the caller.
Several fuzzy matching metrics are supported:
- Levenshtein distance
- Damerau-Levenshtein distance
- Longest common substring
- Longest common subsequence
The library is available in source form, as an amalgamation. Compile
volubile.c
together with your source code, and use the interface described in
volubile.h
.
The following must also be compiled with your code:
These are C11 source files, so you must use a modern C compiler, which means either GCC or CLang on Unix.
A Lua binding is also available. See the file README.md
in the lua
directory
for instructions about how to build and use it.
There is a concrete usage example in the file
example.c
in this directory. It searches the lexicon encoded in example_lexicon.dat
.
Compile the example program with make
, and use it like this:
$ ./example '@redundant' # find words similar to "redundant"
redundant
redundancy
redundantly
redounding
refunding
=> [76155 3]
Only the five words most similar to "redundant
" are shown, plus two numbers
that encode informations necessary to access the next results page. We can
display the next five words by issuing the same query together with these
numbers:
$ ./example '@redundant' 76155 3
reluctant
remnant
repentant
repugnant
resonant
=> [77312 3]
This process can be repeated again to access the remaining matching words.
To search inside a lexicon, you must first encode it as a numbered automaton in
the mini
format. This can be done
with the mini
command-line tool, or through the corresponding programmatic
interface. With the command-line tool, you can create a lexicon with a command
like the following:
$ mini create -t numbered lexicon.dat < /usr/share/dict/words
Note the use of the -t
switch.
Several string matching modes are available. When the selected one is VB_AUTO
,
a few characters have a special meaning when they appear at the beginning of a
query:
#apple find words that have "apple" as a substring; equivalent to the
pattern *apple*
+apple find words that have the longest possible substring in common with
"apple"
@apple find words that resemble "apple" (using the Damerau-Levenshtein
distance as metric)
The Levenshtein distance and longest common subsequence metrics can only be selected programmatically.
Glob matching is case-sensitive, and is performed over whole words. The supported syntax is as follows:
? matches a single character
* matches zero or more characters
[abc] matches any of the characters a, b, or c
[^abc] matches any character but a, b, and c
Glob characters are only given a special meaning when the selected matching made
is VB_AUTO
or VB_GLOB
. Otherwise, they are interpreted as literals.
The characters [
, ?
, and *
are interpreted as literals when in a group.
The character ]
, to be used in a group, must be placed in first position. The
character ^
, if included in a group and intended to be interpreted as a
literal, must not be placed at the beginning of the group. The character ]
, if
not preceded by [
, is interpreted as a literal.