Charset detection
Character encoding detection, charset detection, or code page detection is the process of heuristically guessing the character encoding of a series of bytes that represent text. The technique is recognised to be unreliable and is used only when specific metadata, such as an HTTP Content-Type header, is either unavailable or assumed to be untrustworthy.
Detection usually involves statistical analysis of byte patterns, such as the frequency distribution of trigraphs of various languages as encoded in each candidate code page.
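The statistical approach can be illustrated with a toy Python sketch. It is not the method of any particular detector: the reference text, the trigraph profile, and the names `trigraphs`, `score` and `guess_encoding` are all assumptions made for illustration. Each candidate encoding is used to decode the bytes, and the decoding whose character trigraphs best match a small language profile wins.

```python
from collections import Counter

# Hypothetical reference text standing in for a real language corpus.
REFERENCE_TEXT = "die größe der straße und die schönen bäume über der brücke"


def trigraphs(text):
    """Count overlapping three-character sequences in lower-cased text."""
    text = text.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


# The "profile" is simply the set of trigraphs seen in the reference text.
PROFILE = set(trigraphs(REFERENCE_TEXT))


def score(decoded, profile):
    """Number of trigraph occurrences in the decoded text found in the profile."""
    return sum(n for tri, n in trigraphs(decoded).items() if tri in profile)


def guess_encoding(data, candidates, profile=PROFILE):
    """Return the candidate whose decoding scores highest, or None."""
    best, best_score = None, -1
    for enc in candidates:
        try:
            decoded = data.decode(enc)
        except UnicodeDecodeError:
            continue  # bytes are not valid in this encoding
        s = score(decoded, profile)
        if s > best_score:
            best, best_score = enc, s
    return best


sample = "schöne grüße über die brücke".encode("iso-8859-1")
guess = guess_encoding(sample, ["iso-8859-5", "iso-8859-1"])
```

Real detectors use much larger profiles, one per language and encoding pair, and typically report a confidence value rather than a single answer.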
In general, incorrect charset detection leads to mojibake.
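As a minimal Python illustration of mojibake (the sample string is an arbitrary choice), UTF-8 bytes decoded as ISO-8859-1 produce two wrong characters for every accented letter:

```python
text = "naïve café"

# Encode correctly as UTF-8, then decode with the wrong charset.
raw = text.encode("utf-8")
wrong = raw.decode("iso-8859-1")

# Here the damage is reversible, because ISO-8859-1 maps every byte
# to exactly one character and back.
repaired = wrong.encode("iso-8859-1").decode("utf-8")
```

In this case `wrong` is "naÃ¯ve cafÃ©". The round trip only recovers the text because no bytes were lost; a wrong guess followed by a lossy conversion is generally not recoverable.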
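One of the few cases where charset detection works reliably is detecting UTF-8. UTF-8 multi-byte sequences must follow a strict pattern, so text in another encoding that uses bytes with the high bit set is very unlikely to pass a UTF-8 validity check.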
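A UTF-8 validity check can be sketched in Python using the built-in strict decoder. The function name `looks_like_utf8` is an assumption for illustration, not part of any library.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes form a valid UTF-8 sequence."""
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


utf8_bytes = "café".encode("utf-8")         # b'caf\xc3\xa9'
latin1_bytes = "café".encode("iso-8859-1")  # b'caf\xe9'
```

Here `looks_like_utf8(utf8_bytes)` is true and `looks_like_utf8(latin1_bytes)` is false, because the lone byte 0xE9 starts a multi-byte sequence that is never completed. Pure ASCII input also passes, since ASCII is a subset of UTF-8, so a positive result on such input says nothing about which ASCII-compatible encoding was intended.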
Charset detection is particularly unreliable in Europe, in an environment of mixed ISO-8859 encodings. These are closely related eight-bit encodings whose lower half is identical to ASCII and in which every arrangement of bytes is valid. There is no technical way to tell these encodings apart; recognizing them relies on identifying language features, such as letter frequencies or spellings.
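The ambiguity can be seen in a short Python sketch. The byte values and the list of candidate encodings are arbitrary choices for illustration. The same three bytes decode without error under every candidate, so validity alone gives a detector nothing to work with.

```python
raw = bytes([0xE4, 0xF6, 0xFC])

candidates = ["iso-8859-1", "iso-8859-2", "iso-8859-5", "iso-8859-15"]

# Every decode succeeds; only the resulting characters differ.
decodings = {enc: raw.decode(enc) for enc in candidates}

for enc, text in decodings.items():
    print(enc, text)
```

Under ISO-8859-1 the bytes read as "äöü", while ISO-8859-5 yields Cyrillic letters. Deciding which result is plausible requires knowledge of the expected language, which is why detectors fall back on letter frequencies or spellings.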
Due to the unreliability of heuristic detection, it is better to label datasets properly with the correct encoding. See Character encodings in HTML § Specifying the document's character encoding. Even though UTF-8 and UTF-16 are easy to detect, some systems require documents in UTF encodings to be explicitly labelled with a byte order mark (BOM) prefix.
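Checking for a byte order mark is straightforward. The following Python sketch uses the BOM constants from the standard `codecs` module; the function name `sniff_bom` and the ordering of the table are assumptions for illustration.

```python
import codecs

# UTF-32 must be tested before UTF-16: the UTF-32 little-endian BOM
# (FF FE 00 00) begins with the UTF-16 little-endian BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(data: bytes):
    """Return a Python codec name implied by a leading BOM, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


labelled = "hi".encode("utf-16")  # Python prepends a BOM for "utf-16"
codec = sniff_bom(labelled)
decoded = labelled.decode(codec) if codec else None
```

The returned codec names ("utf-8-sig", "utf-16", "utf-32") consume the BOM when decoding, so it does not appear in the resulting text. A return value of `None` means only that no BOM was present, not that the data is in a non-Unicode encoding.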
See also
- International Components for Unicode – a library that can perform charset detection
- Language identification
- Content sniffing
- Browser sniffing – a similar heuristic technique for determining the capabilities of a web browser, before serving content to it
External links
- IMultiLanguage2::DetectInputCodepage
- API reference for ICU charset detection
- Reference for cpdetector charset detection
- Mozilla Charset Detectors
- Java port of Mozilla Charset Detectors
- Delphi/Pascal port of Mozilla Charset Detectors
- uchardet, C++ fork of Mozilla Charset Detectors; includes Bash command-line tool
- C# port of Mozilla Charset Detectors
- HEBCI, a technique for detecting the character set used in form submissions
- Frequency distributions of English trigraphs