- Approximate matching and string distance calculations for R.
- All distance and matching operations are system- and encoding-independent.
The package offers the following main functions:
- `stringdist` computes pairwise distances between two input character vectors (the shorter one is recycled).
- `stringdistmatrix` computes the distance matrix for one or two vectors.
- `stringsim` computes a string similarity between 0 and 1, based on `stringdist`.
- `amatch` is a fuzzy matching equivalent of R's native `match` function.
- `ain` is a fuzzy matching equivalent of R's native `%in%` operator.
- `seq_dist`, `seq_distmatrix`, `seq_amatch` and `seq_ain` compute distances between, and match, integer sequences (see also the hashr package).
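As a quick orientation, the following sketch calls the main functions on a few toy strings. The input values are made up for illustration, and the default method (optimal string alignment) is assumed throughout.

```r
library(stringdist)

# Pairwise distances; the shorter argument ("hello") is recycled.
d <- stringdist("hello", c("hallo", "hello", "help"))
d  # 1 0 2

# Distance matrix between two character vectors (2 rows, 3 columns).
m <- stringdistmatrix(c("foo", "bar"), c("fo", "baz", "boo"))

# Similarity scaled to the interval [0, 1].
s <- stringsim("hello", "hallo")
s  # 0.8

# Fuzzy counterparts of match() and %in%; maxDist sets the largest
# distance that still counts as a match.
i <- amatch("leia", c("uhura", "leela"), maxDist = 2)
found <- ain("leia", c("uhura", "leela"), maxDist = 2)
```

Here `amatch` returns the position of `"leela"` in the lookup table, and `ain` reports whether any match was found.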
These functions are built on C code that re-implements some common (weighted) string distance functions. Distance functions include:
- Hamming distance;
- Levenshtein distance (weighted);
- Restricted Damerau-Levenshtein distance (weighted, a.k.a. Optimal String Alignment);
- Full Damerau-Levenshtein distance (weighted);
- Longest Common Substring distance;
- Q-gram distance;
- Cosine distance for q-gram count vectors (= 1-cosine similarity);
- Jaccard distance for q-gram count vectors (= 1-Jaccard similarity);
- Jaro and Jaro-Winkler distance;
- Soundex-based string distance.
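The distance is selected with the `method` argument of `stringdist`. The sketch below uses small, made-up inputs to show how the choice of method changes the result; the weights and the `p` value are illustrative assumptions, not recommendations.

```r
library(stringdist)

# Hamming distance counts differing positions in equal-length strings.
d_hamming <- stringdist("abc", "acb", method = "hamming")

# Levenshtein, restricted (OSA) and full Damerau-Levenshtein can disagree:
# only the full variant may edit a substring after transposing it.
d_lv  <- stringdist("ca", "abc", method = "lv")
d_osa <- stringdist("ca", "abc", method = "osa")
d_dl  <- stringdist("ca", "abc", method = "dl")

# Weighted OSA: penalties for deletion, insertion, substitution, transposition.
d_weighted <- stringdist("ab", "ba", method = "osa",
                         weight = c(d = 1, i = 1, s = 1, t = 0.5))

# Q-gram based distances take the q-gram size q.
d_qgram   <- stringdist("hello", "hallo", method = "qgram", q = 2)
d_cosine  <- stringdist("hello", "hallo", method = "cosine", q = 2)
d_jaccard <- stringdist("hello", "hallo", method = "jaccard", q = 2)

# Jaro distance, and Jaro-Winkler via a nonzero prefix factor p.
d_jaro <- stringdist("martha", "marhta", method = "jw")
d_jw   <- stringdist("martha", "marhta", method = "jw", p = 0.1)

# Soundex distance is 0 when both strings share a soundex code.
d_soundex <- stringdist("Robert", "Rupert", method = "soundex")
```

For `"ca"` versus `"abc"`, the Levenshtein and OSA distances are both 3, while the full Damerau-Levenshtein distance is 2.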
Also, there are some utility functions:
- `qgrams()` tabulates the q-grams in one or more `character` vectors.
- `seq_qgrams()` tabulates the q-grams (sometimes called n-grams) in one or more `integer` vectors.
- `phonetic()` computes phonetic codes of strings (currently only soundex).
- `printable_ascii()` is a utility function that detects non-printable ASCII or non-ASCII characters.
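A minimal sketch of the utility functions, using made-up inputs:

```r
library(stringdist)

# Tabulate 2-grams of a character string; the result is a table of counts.
tab <- qgrams("hello world", q = 2)

# The same idea for an integer sequence.
seq_tab <- seq_qgrams(c(1L, 2L, 3L, 2L, 3L), q = 2)

# Soundex codes; both names map to the same code.
codes <- phonetic(c("Robert", "Rupert"))
codes  # "R163" "R163"

# TRUE where a string consists of printable ASCII characters only.
ok <- printable_ascii(c("hello", "caf\u00e9"))
ok  # TRUE FALSE
```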
To install the latest release from CRAN, open an R terminal and type
install.packages('stringdist')
Beta versions are released through my drat repository. These versions build and pass all current tests on Linux, but the builds have not been tested on all architectures that CRAN supports. Windows users will also need to have Rtools installed.
drat::addRepo("markvanderloo")
install.packages("stringdist")
To obtain the package from the very latest source code, open a `bash` terminal (or `git bash` if you work under Windows with `msysgit`) and type
git clone https://github.com/markvanderloo/stringdist.git
cd stringdist
bash ./build.bash
R CMD INSTALL output/stringdist_*.tar.gz
Warning: the GitHub version can change at any time and may not even build properly. As most of the code is written in C, the development version may crash your R session.
Parallelization used to be based on R's `parallel` package, which works by spawning several R sessions in the background. As of version 0.9.0, `stringdist` uses the more efficient OpenMP protocol to parallelize everything under the hood.
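The number of threads can be set per call through the `nthread` argument, whose default is read from the `sd_num_thread` option. The sketch below uses toy inputs and checks that the result does not depend on the thread count.

```r
library(stringdist)

x <- c("foo", "bar", "baz")

# Same computation with one and with two threads.
d1 <- stringdist(x, "boo", nthread = 1)
d2 <- stringdist(x, "boo", nthread = 2)
same <- identical(d1, d2)

# Default thread count used when nthread is not given.
default_threads <- getOption("sd_num_thread")
```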
The following arguments have become obsolete and will be removed sometime in 2016:
- Argument `cluster` for function `stringdistmatrix`.
- Argument `maxDist` for functions `stringdist` and `stringdistmatrix` (not `amatch`).
- Argument `ncores` for function `stringdistmatrix`.
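Code that relied on these arguments could be adapted roughly as follows. This is a sketch under assumptions: the `cutoff` variable and the choice of `Inf` as a marker are hypothetical stand-ins for the old `maxDist` behaviour, not part of the package.

```r
library(stringdist)

a <- c("foo", "bar", "baz")
b <- c("fo", "boo")

# Threading is controlled with nthread rather than cluster or ncores.
d <- stringdistmatrix(a, b, method = "lv", nthread = 2)

# Hypothetical replacement for maxDist: flag distances above a cutoff
# after they have been computed.
cutoff <- 1
d[d > cutoff] <- Inf
```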