String metric

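In mathematics and computer science, a string metric (also known as a string similarity metric or string distance function) is a metric that measures the distance between two text strings for approximate string matching or comparison. A requirement for a string metric (in contrast to, for example, string matching) is fulfillment of the triangle inequality. For example, the strings "Sam" and "Samuel" can be considered to be close.^[1] A string metric returns a number that gives an algorithm-specific indication of distance.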
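For reference, the standard conditions for a distance function to be a metric can be written out as follows. The symbols $d$, $x$, $y$ and $z$ are chosen here for illustration, with $x$, $y$, $z$ ranging over strings; the last line is the triangle inequality.

$$
\begin{aligned}
d(x, y) &\ge 0, \\
d(x, y) &= 0 \iff x = y, \\
d(x, y) &= d(y, x), \\
d(x, z) &\le d(x, y) + d(y, z).
\end{aligned}
$$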

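The most widely known string metric is a rudimentary one called the Levenshtein distance (also known as edit distance).^[2] It operates on two input strings and returns the number of insertions, deletions and substitutions needed to transform one string into the other. Simplistic string metrics such as Levenshtein distance have since been extended to include phonetic, token, grammatical and character-based methods of statistical comparison.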
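As a rough illustration of how an edit distance is computed, the following is a minimal sketch of the Levenshtein distance using the Wagner–Fischer dynamic-programming scheme with two rows. The function name `levenshtein` and the unit cost for every operation are assumptions made for this example, not a reference implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions and substitutions
    needed to turn `a` into `b` (unit cost per operation is assumed)."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the shorter string in the inner loop

    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca by cb, or keep it
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
    print(levenshtein("Sam", "Samuel"))
```

Keeping only two rows reduces memory use to the length of the shorter string, at the cost of not being able to recover the edit script itself.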

String metrics are used heavily in areas including DNA analysis, RNA analysis, image analysis, evidence-based machine learning, database data deduplication, data mining, incremental search, data integration, malware detection,^[3] and semantic knowledge integration.

List of string metrics

Levenshtein distance, or its generalization edit distance
Damerau–Levenshtein distance
Sørensen–Dice coefficient
City block distance
Hamming distance
Simple matching coefficient (SMC)
Tanimoto coefficient
Tversky index
Overlap coefficient
Variational distance^[4]
Hellinger distance or Bhattacharyya distance
Information radius (Jensen–Shannon divergence)
Skew divergence^[4]
Confusion probability^[4]
Tau metric, an approximation of the Kullback–Leibler divergence
Fellegi and Sunter metric (SFS)^[4]
Maximal matches^[4]
Grammar-based distance^[5]
TFIDF distance metric^[6]
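Several entries in the list above are token-based: they compare sets of tokens rather than individual edit operations. The sketch below assumes character bigrams as tokens and plain sets (not multisets); both are choices made for this example, and the function names are illustrative. On sets, the Tanimoto coefficient coincides with the Jaccard index.

```python
def bigrams(s: str) -> set:
    """Set of adjacent character pairs in s (the tokenization is an assumption)."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice(a: str, b: str) -> float:
    """Sørensen–Dice coefficient: 2|X ∩ Y| / (|X| + |Y|)."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0  # convention chosen here for two token-free strings
    return 2 * len(x & y) / (len(x) + len(y))


def tanimoto(a: str, b: str) -> float:
    """Tanimoto coefficient on sets: |X ∩ Y| / |X ∪ Y|."""
    x, y = bigrams(a), bigrams(b)
    union = x | y
    if not union:
        return 1.0
    return len(x & y) / len(union)


def overlap(a: str, b: str) -> float:
    """Overlap coefficient: |X ∩ Y| / min(|X|, |Y|)."""
    x, y = bigrams(a), bigrams(b)
    if not x or not y:
        return 0.0  # convention chosen here when either token set is empty
    return len(x & y) / min(len(x), len(y))


if __name__ == "__main__":
    # "night" and "nacht" share only the bigram "ht".
    print(dice("night", "nacht"), tanimoto("night", "nacht"), overlap("night", "nacht"))
```

These functions return similarities in the range 0 to 1; a corresponding dissimilarity is usually taken as one minus the similarity.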

There also exist functions that measure a dissimilarity between strings but do not necessarily fulfill the triangle inequality, and as such are not metrics in the mathematical sense. An example of such a function is the Jaro–Winkler distance.
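The failure of the triangle inequality can be checked on a small example. The sketch below assumes the common formulation of Jaro similarity (match window of floor(max(|s1|, |s2|) / 2) − 1) and the usual Winkler adjustment with scaling factor 0.1 over at most four prefix characters; the function names and the three test strings are chosen for illustration only.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1], using the common match-window formulation."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    m = 0  # number of matching characters
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # t is half the number of matched characters that appear in a different order.
    seq1 = [c for c, hit in zip(s1, matched1) if hit]
    seq2 = [c for c, hit in zip(s2, matched2) if hit]
    t = sum(x != y for x, y in zip(seq1, seq2)) / 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for a shared prefix (p and max_prefix are assumed defaults)."""
    sim = jaro(s1, s2)
    prefix = 0
    for x, y in zip(s1, s2):
        if x != y or prefix == max_prefix:
            break
        prefix += 1
    return sim + prefix * p * (1 - sim)


def jaro_winkler_distance(s1: str, s2: str) -> float:
    return 1 - jaro_winkler(s1, s2)


if __name__ == "__main__":
    x, z, y = "ab", "abab", "ba"
    direct = jaro_winkler_distance(x, y)
    detour = jaro_winkler_distance(x, z) + jaro_winkler_distance(z, y)
    # A metric would require direct <= detour; here the direct distance is larger.
    print(direct, detour, direct > detour)
```

Under these assumptions "ab" and "ba" share no characters inside the match window, so their distance is 1, while both are fairly close to "abab"; the detour through "abab" is therefore shorter than the direct distance.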

Examples of selected string measures

Name	Description	Example
Hamming distance	Defined only for strings of the same length. Counts the positions at which the characters differ.	"karolin" and "kathrin" have a distance of 3.
Levenshtein distance and Damerau–Levenshtein distance	Generalization of Hamming distance that allows strings of different lengths and, with Damerau, transpositions.	"kitten" and "sitting" have a distance of 3: kitten → sitten (substitution of "s" for "k"), sitten → sittin (substitution of "i" for "e"), sittin → sitting (insertion of "g" at the end).
Jaro–Winkler distance		The Jaro similarity of "MARTHA" and "MARHTA" is $d_{j}={\frac {1}{3}}\left({\frac {m}{|s_{1}|}}+{\frac {m}{|s_{2}|}}+{\frac {m-t}{m}}\right)={\frac {1}{3}}\left({\frac {6}{6}}+{\frac {6}{6}}+{\frac {6-{\frac {2}{2}}}{6}}\right)=0.944$, where $m$ is the number of matching characters and $t$ is half the number of matching characters that appear in a different order (here "T" and "H" are swapped, so $t = 2/2 = 1$). With the Winkler prefix adjustment (common prefix "MAR", scaling factor 0.1) the Jaro–Winkler similarity is 0.961.
Most frequent k characters		MostFreqKeySimilarity('research', 'seeking', 2) = 2
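The Hamming row and the transposition remark in the table above can be sketched in a few lines. The second function implements optimal string alignment, a restricted variant of Damerau–Levenshtein distance in which no substring is edited more than once; it is used here as a simpler stand-in for the unrestricted distance, and both function names are illustrative.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: Levenshtein operations plus
    transposition of two adjacent characters, each at unit cost."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition: a ends in "xy" where b ends in "yx".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


if __name__ == "__main__":
    print(hamming("karolin", "kathrin"))
    print(osa_distance("kitten", "sitting"))
    print(osa_distance("ca", "ac"))
```

For "ca" and "ac" a single transposition suffices, whereas plain Levenshtein distance needs two operations; for "kitten" and "sitting" no transposition helps and the result matches the Levenshtein distance.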

References

[1] S2CID 2091942.

[2] S2CID 207551224.

[3] Dolev, Shlomi; Ghanayim, Mohammad; Binun, Alexander; Frenkel, Sergey; Sun, Yeali S. (2017). "Relationship of Jaccard and edit distance in malware clustering and online identification". 16th IEEE International Symposium on Network Computing and Applications: 369–373.

[4] Sam's String Metrics - Computational Linguistics and Phonetics

[5] Russell, David J., et al. "A grammar-based distance metric enables fast and accurate clustering of large sets of 16S sequences." BMC Bioinformatics 11.1 (2010): 1–14.

[6] Cohen, William; Ravikumar, Pradeep; Fienberg, Stephen (2003-08-01). "A Comparison of String Distance Metrics for Name-Matching Tasks": 73–78.

External links

String Similarity Metrics for Information Integration: a fairly complete overview (archive index at the Wayback Machine)

Carnegie Mellon University open source library

StringMetric project: a Scala library of string metrics and phonetic algorithms

Natural project: a JavaScript natural language processing library that includes implementations of popular string metrics

