Normalized compression distance
Normalized compression distance (NCD) is a way of measuring the similarity between two objects, be it two documents, two letters, two emails, two music scores, two languages, two programs, two pictures, two systems, two genomes, to name a few. Such a measurement should not be application dependent or arbitrary. A reasonable definition for the similarity between two objects is how difficult it is to transform them into each other.
It can be used in information retrieval and data mining for cluster analysis.
Information distance
We assume that the objects one talks about are finite strings of 0s and 1s; thus, object similarity means similarity between strings, and every computer file is such a string. The information distance E(x, y) between strings x and y is the length of the shortest program that computes x from y and also y from x. It was shown[1] that

E(x, y) = max{K(x|y), K(y|x)},

up to logarithmic additive terms which can be ignored, where K(x|y) denotes the conditional Kolmogorov complexity of x given y. This information distance is shown to be a metric, up to the same logarithmic additive terms.
Normalized information distance (similarity metric)
The information distance is absolute, but if we want to express similarity, then we are more interested in relative ones. For example, if two strings of length 1,000,000 differ by 1000 bits, then we consider that those strings are relatively more similar than two strings of 1000 bits that differ by 1000 bits. Hence we need to normalize to obtain a similarity metric. This way one obtains the normalized information distance (NID),
NID(x, y) = max{K(x|y), K(y|x)} / max{K(x), K(y)},

where K(x|y) is the algorithmic information of x given y as input. The NID is called "the similarity metric" because it has been shown to satisfy the basic requirements of a metric distance measure. However, it is not computable, since the Kolmogorov complexity K is not computable.
Normalized compression distance
While the NID metric is not computable, it has an abundance of applications. Simply approximating K by a real-world compressor, with Z(x) the binary length of the file x compressed with compressor Z (for example "gzip", "bzip2", "PPMZ"), yields the normalized compression distance (NCD):

NCD_Z(x, y) = (Z(xy) − min{Z(x), Z(y)}) / max{Z(x), Z(y)},

where Z(xy) is the compressed length of the concatenation of x and y.
The NCD is actually a family of distances parametrized with the compressor Z. The better Z is, the closer the NCD approaches the NID, and the better the results are.[3]
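The formula above can be computed directly with any off-the-shelf compressor standing in for Z. The following sketch uses Python's zlib as the compressor; the example strings are made up for illustration:

```python
import zlib

def z_len(data: bytes) -> int:
    """Z(x): length in bytes of the input compressed with zlib (our choice of Z)."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance NCD_Z(x, y)."""
    zx, zy, zxy = z_len(x), z_len(y), z_len(x + y)
    return (zxy - min(zx, zy)) / max(zx, zy)

# Illustrative inputs: two near-identical strings and one unrelated string.
a = b"the quick brown fox jumps over the lazy dog " * 20
b = b"the quick brown fox jumps over the lazy dog " * 19 + b"lorem ipsum dolor"
c = b"unrelated bytes: 0123456789 abcdef ghijkl mnop " * 20

print(ncd(a, a))  # near 0: identical strings
print(ncd(a, b))  # small: mostly shared content
print(ncd(a, c))  # larger: little shared content
```

Because real compressors are imperfect, the result can slightly exceed 1 or fall below 0; only the relative ordering of distances matters in practice.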
Applications
The normalized compression distance has been used to fully automatically reconstruct language and phylogenetic trees.[2][3] It can also be used for new applications of general clustering and classification of natural data in arbitrary domains,[3] for clustering of heterogeneous data,[3] and for anomaly detection across domains.[5] The NID and NCD have been applied to numerous subjects, including music classification,[3] analysis of network traffic and clustering of computer worms and viruses,[6] authorship attribution,[7] gene expression dynamics,[8] prediction of useful versus useless stem cells,[9] critical networks,[10] image registration,[11] and question-answer systems.[12]
Performance
Researchers from the data-mining community use NCD and variants as "parameter-free, feature-free" data-mining tools.[4]
NCD has the advantage of being robust to noise.[13] However, although NCD appears to be "parameter-free", practical questions remain, such as which compressor to use in computing the NCD.[14]
Comparison with the Normalized Relative Compression (NRC)
To measure the information of one string relative to another, one can rely on relative semi-distances such as the normalized relative compression (NRC).[15] These measures need not satisfy the symmetry and triangle-inequality properties of a distance. Although the NCD and the NRC seem very similar, they address different questions: the NCD measures how similar two strings are, mostly using their information content, while the NRC indicates the fraction of a target string that cannot be constructed using information from another string. For a comparison, with application to the evolution of primate genomes, see.[16]
Normalized Google distance
Objects can be given literally, like the literal four-letter genome of a mouse, or the literal text of War and Peace by Tolstoy. Objects can also be given by name, like "the genome of a mouse" or "the text of War and Peace". There are even objects that cannot be given literally, but only by name, and that acquire their meaning from their contexts in the background common knowledge of humankind, like "home" or "red". Using the code-word lengths obtained from the page-hit counts returned by a search engine, one obtains a semantic distance between names, in analogy with the NID formula: the normalized Google distance (NGD),

NGD(x, y) = (max{log f(x), log f(y)} − log f(x, y)) / (log N − min{log f(x), log f(y)}),
where f(x) denotes the number of pages containing the search term x, and f(x, y) denotes the number of pages containing both x and y, as returned by Google or any search engine capable of returning an aggregate page count. The number N can be set to the number of pages indexed, although it is more proper to count each page according to the number of search terms or phrases it contains; as a rule of thumb one can multiply the number of pages by, say, a thousand.[17]
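Given the page counts, the NGD formula is a one-line computation. The sketch below uses made-up counts purely for illustration; real applications would obtain f(x), f(y), f(x, y), and N from a search engine:

```python
import math

def ngd(fx: float, fy: float, fxy: float, n: float) -> float:
    """Normalized Google distance from page-hit counts.

    fx, fy: pages containing each search term; fxy: pages containing both;
    n: total number of pages indexed. All counts here are hypothetical.
    """
    lx, ly, lxy, ln = math.log(fx), math.log(fy), math.log(fxy), math.log(n)
    return (max(lx, ly) - lxy) / (ln - min(lx, ly))

# Illustrative (made-up) counts, not real search-engine results:
print(ngd(fx=8_000_000, fy=5_000_000, fxy=3_000_000, n=10**10))
```

Terms that always co-occur give an NGD of 0, while terms that rarely appear on the same page give a large NGD; like the NCD, the value can exceed 1 in practice.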
See also
References
- C.H. Bennett, P. Gacs, M. Li, P.M.B. Vitányi, and W. Zurek, Information Distance, IEEE Trans. Inform. Theory, IT-44:4(1998) 1407–1423
- S2CID 221927.
- S2CID 911.
- S2CID 10831035.
- S2CID 1616057.
- S. Wehner, Analyzing worms and network traffic using compression, Journal of Computer Security, 15:3(2007), 303–320. Retrieved 2012-11-03.
- PMID 18250330.
- S2CID 18652440.
- S2CID 5760862.
- S2CID 12549455.
- S2CID 3040254.
- S2CID 15691266.
- M. Cebrián, M. Alfonseca, and A. Ortega, Common pitfalls using the normalized compression distance: what to watch out for in a compressor, Commun. Inf. Syst. 5:4(2005), 367–384. Retrieved 2012-11-03.
- PMID 33265483. Material was copied from this source, which is available under a Creative Commons Attribution 4.0 International License.
- S2CID 59777.
External links
- Efficient Estimation of Word Representations in Vector Space
- P. Vitanyi, An Introduction to Kolmogorov Complexity and Its Applications, Springer-Verlag, New York, 4th Edition 2019