Package-level declarations

Types

class Cosine(k: Int = 3)

Implements Cosine Similarity between strings.


Damerau-Levenshtein distance with transposition (unrestricted Damerau-Levenshtein distance).

class Jaccard(k: Int = 3)

Each input string is converted into a set of n-grams; the Jaccard index is then computed as |A ∩ B| / |A ∪ B|.
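The formula above can be sketched in a few lines. This is a minimal illustration, not this library's implementation: `shingleSet` and `jaccardIndex` are hypothetical helper names, and the sketch assumes overlapping character n-grams of length `k` with no preprocessing of the input.

```kotlin
// Hypothetical helper: the set of overlapping k-character n-grams of a string.
// Strings shorter than k yield an empty set.
fun shingleSet(s: String, k: Int = 3): Set<String> = s.windowed(k).toSet()

// Jaccard index |A ∩ B| / |A ∪ B| over the two n-gram sets.
// Two empty sets are treated as identical (assumption of this sketch).
fun jaccardIndex(a: String, b: String, k: Int = 3): Double {
    val setA = shingleSet(a, k)
    val setB = shingleSet(b, k)
    if (setA.isEmpty() && setB.isEmpty()) return 1.0
    val intersection = (setA intersect setB).size
    val union = (setA union setB).size
    return intersection.toDouble() / union
}
```

For "ABCDE" and "ABCDF" with k = 3, the sets share two of four distinct n-grams, so the sketch returns 0.5.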

class JaroWinkler(threshold: Double = 0.7)

The Jaro–Winkler distance metric is designed for, and best suited to, short strings such as person names, and to detecting typos; it is (roughly) a variation of Damerau-Levenshtein in which the substitution of two close characters is considered less important than the substitution of two characters that are far from each other.


The Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one string into the other.
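The definition above maps onto the classic dynamic-programming recurrence. The following is a minimal sketch under that standard formulation, not the code of this library; `levenshteinDistance` is a hypothetical name, and every edit is assumed to cost 1.

```kotlin
// Two-row dynamic programming: prev[j] holds the distance between the first
// i-1 characters of a and the first j characters of b.
fun levenshteinDistance(a: String, b: String): Int {
    var prev = IntArray(b.length + 1) { it } // distance from "" to b[0..j)
    var curr = IntArray(b.length + 1)
    for (i in 1..a.length) {
        curr[0] = i // deleting i characters of a yields ""
        for (j in 1..b.length) {
            val substitution = if (a[i - 1] == b[j - 1]) 0 else 1
            curr[j] = minOf(
                prev[j] + 1,                // deletion
                curr[j - 1] + 1,            // insertion
                prev[j - 1] + substitution  // match or substitution
            )
        }
        val tmp = prev; prev = curr; curr = tmp
    }
    return prev[b.length]
}
```

Keeping only two rows reduces memory to O(|b|) while the running time stays O(|a| · |b|).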


The longest common subsequence (LCS) problem consists in finding the longest subsequence common to two (or more) sequences. It differs from problems of finding common substrings: unlike substrings, subsequences are not required to occupy consecutive positions within the original sequences.
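To make the subsequence/substring distinction concrete, here is a minimal sketch of the textbook dynamic-programming table for the LCS length of two strings. `lcsLength` is a hypothetical name, and this is not necessarily how the library computes it.

```kotlin
// table[i][j] = length of the LCS of the first i characters of a
// and the first j characters of b.
fun lcsLength(a: String, b: String): Int {
    val table = Array(a.length + 1) { IntArray(b.length + 1) }
    for (i in 1..a.length) {
        for (j in 1..b.length) {
            table[i][j] = if (a[i - 1] == b[j - 1]) {
                table[i - 1][j - 1] + 1 // extend the common subsequence
            } else {
                maxOf(table[i - 1][j], table[i][j - 1]) // skip one character
            }
        }
    }
    return table[a.length][b.length]
}
```

For "AGCAT" and "GAC" the result is 2 (for example "AC"), even though the matched characters are not adjacent in "AGCAT".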

class MetricLCS

Distance metric based on Longest Common Subsequence.

class NGram(n: Int = 2)

N-Gram Similarity as defined by Kondrak, "N-Gram Similarity and Distance", String Processing and Information Retrieval, Lecture Notes in Computer Science Volume 3772, 2005, pp 115-126.


This distance is computed as the Levenshtein distance divided by the length of the longest string. The resulting value is always in the interval [0, 1].


Used to indicate the cost of character operations (add, replace, delete). The cost should always be in the range [0, 1].


Implementation of the Optimal String Alignment (sometimes called the restricted edit distance) variant of the Damerau-Levenshtein distance.
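As an illustration of what "restricted" means here, the sketch below follows the usual textbook recurrence for Optimal String Alignment: Levenshtein plus a transposition of two adjacent characters, with no substring edited more than once. `osaDistance` is a hypothetical name and unit costs are assumed; it is not taken from this library.

```kotlin
fun osaDistance(a: String, b: String): Int {
    // d[i][j] = distance between the first i chars of a and first j chars of b.
    val d = Array(a.length + 1) { IntArray(b.length + 1) }
    for (i in 0..a.length) d[i][0] = i
    for (j in 0..b.length) d[0][j] = j
    for (i in 1..a.length) {
        for (j in 1..b.length) {
            val cost = if (a[i - 1] == b[j - 1]) 0 else 1
            d[i][j] = minOf(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            // Adjacent transposition: "xy" -> "yx" counts as one edit.
            if (i > 1 && j > 1 && a[i - 1] == b[j - 2] && a[i - 2] == b[j - 1]) {
                d[i][j] = minOf(d[i][j], d[i - 2][j - 2] + 1)
            }
        }
    }
    return d[a.length][b.length]
}
```

The restriction shows on "CA" and "ABC": this sketch returns 3, whereas the unrestricted Damerau-Levenshtein distance is 2 (transpose to "AC", then insert "B").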

class QGram(k: Int = 3)

Q-gram distance, as defined by Ukkonen in "Approximate string-matching with q-grams and maximal matches". The distance between two strings is defined as the L1 norm of the difference of their profiles (the number of occurrences of each n-gram).
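The profile-based definition can be sketched as follows. `qgramProfile` and `qgramDistance` are hypothetical names, and the sketch assumes overlapping n-grams of length `k` taken from the raw input, which may differ from any preprocessing the library applies.

```kotlin
import kotlin.math.abs

// Profile: how many times each k-character n-gram occurs in the string.
fun qgramProfile(s: String, k: Int = 3): Map<String, Int> =
    s.windowed(k).groupingBy { it }.eachCount()

// L1 norm of the difference between the two profiles.
fun qgramDistance(a: String, b: String, k: Int = 3): Int {
    val profileA = qgramProfile(a, k)
    val profileB = qgramProfile(b, k)
    return (profileA.keys + profileB.keys).sumOf { gram ->
        abs((profileA[gram] ?: 0) - (profileB[gram] ?: 0))
    }
}
```

With k = 2, "ABCD" and "ABCE" differ only in "CD" versus "CE", so the distance is 2.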


The Ratcliff/Obershelp algorithm computes the similarity of two strings as the doubled number of matching characters divided by the total number of characters in the two strings. Matching characters are those in the longest common substring plus, recursively, the matching characters in the unmatched regions on either side of that substring.
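The recursion is easier to follow in code. This minimal sketch anchors on the longest contiguous common run (the longest common substring), which is how the algorithm is usually stated; all function names are hypothetical and ties between equally long runs are broken by taking the first one found.

```kotlin
// Longest contiguous common run: (start in a, start in b, length).
fun longestCommonRun(a: String, b: String): Triple<Int, Int, Int> {
    var best = Triple(0, 0, 0)
    val run = Array(a.length + 1) { IntArray(b.length + 1) }
    for (i in 1..a.length) {
        for (j in 1..b.length) {
            if (a[i - 1] == b[j - 1]) {
                run[i][j] = run[i - 1][j - 1] + 1
                if (run[i][j] > best.third) {
                    best = Triple(i - run[i][j], j - run[i][j], run[i][j])
                }
            }
        }
    }
    return best
}

// Matching characters: the anchor run plus, recursively, the matches
// to its left and to its right.
fun matchingCharacters(a: String, b: String): Int {
    if (a.isEmpty() || b.isEmpty()) return 0
    val (startA, startB, length) = longestCommonRun(a, b)
    if (length == 0) return 0
    return length +
        matchingCharacters(a.substring(0, startA), b.substring(0, startB)) +
        matchingCharacters(a.substring(startA + length), b.substring(startB + length))
}

// Similarity = 2 * matches / (|a| + |b|); two empty strings count as identical.
fun ratcliffObershelpSimilarity(a: String, b: String): Double {
    if (a.isEmpty() && b.isEmpty()) return 1.0
    return 2.0 * matchingCharacters(a, b) / (a.length + b.length)
}
```

For "WIKIMEDIA" and "WIKIMANIA" the anchor is "WIKIM" (5 characters), the right-hand remainder contributes "IA" (2 characters), and the similarity is 2 · 7 / 18.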

class SorensenDice(val k: Int = 3)

Sorensen-Dice coefficient, aka Sørensen index, Dice's coefficient or Czekanowski's binary (non-quantitative) index.


Implementation of Levenshtein that allows defining different weights for different character substitutions.
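One way to picture per-substitution weights is a Levenshtein recurrence that takes the substitution cost from a caller-supplied function. The sketch below is an assumption-laden illustration, not this class's API: `weightedLevenshtein` and its `substitutionCost` parameter are hypothetical, and insertions and deletions are fixed at cost 1.0.

```kotlin
// substitutionCost(x, y) is only consulted when x != y.
fun weightedLevenshtein(
    a: String,
    b: String,
    substitutionCost: (Char, Char) -> Double
): Double {
    var prev = DoubleArray(b.length + 1) { it.toDouble() }
    var curr = DoubleArray(b.length + 1)
    for (i in 1..a.length) {
        curr[0] = i.toDouble()
        for (j in 1..b.length) {
            val sub = if (a[i - 1] == b[j - 1]) 0.0 else substitutionCost(a[i - 1], b[j - 1])
            curr[j] = minOf(
                prev[j] + 1.0,     // deletion (fixed cost in this sketch)
                curr[j - 1] + 1.0, // insertion (fixed cost in this sketch)
                prev[j - 1] + sub  // weighted substitution
            )
        }
        val tmp = prev; prev = curr; curr = tmp
    }
    return prev[b.length]
}

// Example weighting: 't' and 'r' are adjacent keys, so swapping them is cheap.
val keyboardCost: (Char, Char) -> Double = { x, y ->
    if (x == 't' && y == 'r') 0.5 else 1.0
}
```

With `keyboardCost`, "String1" versus "Srring2" costs 0.5 for t→r plus 1.0 for 1→2, giving 1.5.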