String metric

Source: Wikipedia, the free encyclopedia.

In

string matching) is fulfillment of the triangle inequality. For example, the strings "Sam" and "Samuel" can be considered to be close.[1]
A string metric provides a number indicating an algorithm-specific indication of distance.

The most widely known string metric is a rudimentary one called the

token
, grammatical and character-based methods of statistical comparisons.

String metrics are used heavily in

.

List of string metrics

There also exist functions which measure a dissimilarity between strings, but do not necessarily fulfill the triangle inequality, and as such are not metrics in the mathematical sense. An example of such function is the Jaro–Winkler distance.

Selected string measures examples

Name Description Example
Hamming distance Only for strings of the same length. Number of changed characters. "karolin" and "kathrin" is 3.
Levenshtein distance and Damerau–Levenshtein distance Generalization of Hamming distance that allows for different length strings, and (with Damerau) for transpositions kitten and sitting have a distance of 3.
  1. kittensitten (substitution of "s" for "k")
  2. sittensittin (substitution of "i" for "e")
  3. sittinsitting (insertion of "g" at the end).
Jaro–Winkler distance JaroWinklerDist("MARTHA","MARHTA") =
  • is the number of matching characters;
  • is half the number of transpositions("MARTHA"[3]!=H, "MARHTA"[3]!=T).
Most frequent k characters MostFreqKeySimilarity('research', 'seeking', 2) = 2

References

  1. S2CID 2091942
    .
  2. .
  3. ^ Shlomi Dolev; Mohammad, Ghanayim; Alexander, Binun; Sergey, Frenkel; Yeali, S. Sun (2017). "Relationship of Jaccard and edit distance in malware clustering and online identification". 16th IEEE International Symposium on Network Computing and Applications: 369–373.
  4. ^ a b c d e Sam's String Metrics - Computational Linguistics and Phonetics
  5. ^ Russell, David J., et al. "A grammar-based distance metric enables fast and accurate clustering of large sets of 16S sequences." BMC bioinformatics 11.1 (2010): 1-14.
  6. ^ Cohen, William; Ravikumar, Pradeep; Fienberg, Stephen (2003-08-01). "A Comparison of String Distance Metrics for Name-Matching Tasks": 73–78. {{cite journal}}: Cite journal requires |journal= (help)

External links