Charset detection

Source: Wikipedia, the free encyclopedia.

Character encoding detection, charset detection, or code page detection is the process of heuristically guessing the character encoding of a series of bytes that represent text. The technique is recognised to be unreliable and is only used when specific metadata, such as an HTTP Content-Type header, is either unavailable or assumed to be untrustworthy.
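
As an illustration of preferring declared metadata over guessing, the following Python sketch reads the charset parameter from a Content-Type value and falls back to a detector only when none is declared. The function names and the `detect` callback are hypothetical, chosen for this example; the parsing relies on the standard library's `email.message.Message`.

```python
from email.message import Message
from typing import Callable, Optional


def declared_charset(content_type: Optional[str]) -> Optional[str]:
    """Return the charset parameter of a Content-Type value, if any."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the lower-cased charset, or None if absent.
    return msg.get_content_charset()


def choose_encoding(
    content_type: Optional[str],
    body: bytes,
    detect: Callable[[bytes], str],  # hypothetical heuristic detector
) -> str:
    """Prefer the declared charset; guess only when nothing is declared."""
    return declared_charset(content_type) or detect(body)
```

Whether a declared charset should be trusted at all is a policy decision that this sketch leaves to the caller.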

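This algorithm usually involves statistical analysis of byte patterns, such as the frequency distribution of byte sequences typical of the languages written in each candidate encoding; the same kind of analysis can also be used for language detection. This process is not foolproof because it depends on statistical data.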

In general, incorrect charset detection leads to mojibake.
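
A short Python example shows how mojibake arises: the bytes are correct UTF-8, but decoding them with the wrong encoding raises no error and silently yields different characters. The sample word is chosen only for illustration.

```python
text = "München"
raw = text.encode("utf-8")  # ü becomes the two bytes 0xC3 0xBC

# ISO-8859-1 assigns a character to every byte, so decoding never fails;
# each of the two bytes is simply shown as its own character.
wrong = raw.decode("iso-8859-1")
print(wrong)  # MÃ¼nchen

# The damage is reversible only while the original bytes can be recovered.
repaired = wrong.encode("iso-8859-1").decode("utf-8")
print(repaired)  # München
```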

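One of the few cases where charset detection works reliably is detecting UTF-8. A large proportion of byte sequences are invalid in UTF-8, so text in any other encoding that uses bytes with the high bit set is extremely unlikely to pass a UTF-8 validity test. Badly written detection routines, however, do not run this test first and may mistake UTF-8 for some other encoding, for example by deciding the text is in an ISO-8859 encoding before (or without) even testing to see if it was UTF-8.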
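
A UTF-8 validity test can be sketched in Python by attempting a strict decode. The function name is hypothetical; note that pure ASCII also passes, since ASCII is a subset of UTF-8.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data is a well-formed UTF-8 byte sequence."""
    try:
        data.decode("utf-8")  # strict mode: any malformed sequence raises
    except UnicodeDecodeError:
        return False
    return True


print(looks_like_utf8("München".encode("utf-8")))       # True
print(looks_like_utf8("München".encode("iso-8859-1")))  # False: 0xFC is never valid in UTF-8
```

Running this check before any statistical guess avoids the mistake of labelling UTF-8 text as an ISO-8859 encoding.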

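Detecting UTF-16 by a validity test alone is not reliable, because ordinary text in another encoding can happen to form valid UTF-16. A well-known example is the ASCII phrase "Bush hid the facts", which Windows misdetected as Chinese text in UTF-16LE, since all the byte pairs matched assigned Unicode characters in UTF-16LE.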
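
The Python sketch below first shows that the ASCII bytes of "Bush hid the facts" decode as UTF-16LE without error, then outlines one possible stricter heuristic: text dominated by Latin characters has a NUL byte in every second position when encoded as UTF-16. The function name and the 0.4 threshold are assumptions made for this example, and the heuristic does not apply to text written mostly in non-Latin scripts.

```python
from typing import Optional

ascii_bytes = b"Bush hid the facts"
# Every byte pair is an assigned code point, so a validity test alone passes.
misread = ascii_bytes.decode("utf-16-le")
print(len(misread))  # 9 characters, all in the CJK ideograph range


def guess_utf16(data: bytes, threshold: float = 0.4) -> Optional[str]:
    """Guess the UTF-16 byte order from where NUL bytes fall (illustrative)."""
    if len(data) < 2 or len(data) % 2:
        return None
    pairs = len(data) // 2
    even_nuls = data[0::2].count(0)  # first byte of each pair
    odd_nuls = data[1::2].count(0)   # second byte of each pair
    if odd_nuls / pairs >= threshold and even_nuls == 0:
        return "utf-16-le"  # high bytes come second
    if even_nuls / pairs >= threshold and odd_nuls == 0:
        return "utf-16-be"  # high bytes come first
    return None
```

With this check the ASCII phrase is rejected, because it contains no NUL bytes at all.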

Charset detection is particularly unreliable in Europe, in an environment of mixed ISO-8859 encodings. These are closely related eight-bit encodings that share their lower half with ASCII, and in which every arrangement of bytes is valid. There is no technical way to tell these encodings apart; recognizing them relies on identifying language features, such as letter frequencies or spellings.
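
A toy Python sketch of the language-feature approach: decode the bytes under each candidate encoding and score how many of the resulting non-ASCII characters are letters common in the language that encoding is usually used for. The letter tables are small hand-picked assumptions for this example, not real frequency data, and the candidate list is limited to three encodings.

```python
# Hypothetical, hand-picked tables of common non-ASCII letters.
COMMON_LETTERS = {
    "iso-8859-1": set("àâäçèéêëîïñóôöùûüß"),  # Western European
    "iso-8859-5": set("оеаинтсрвлкмдпу"),      # Cyrillic
    "iso-8859-7": set("αοιετσνηυρπκμλως"),     # Greek
}


def guess_iso8859(data: bytes) -> str:
    """Pick the candidate whose decoding looks most like natural language."""

    def score(encoding: str) -> float:
        # errors="replace" because a few byte values are unassigned
        # in some of these code pages.
        text = data.decode(encoding, errors="replace").lower()
        high = [ch for ch in text if ord(ch) > 127]
        if not high:
            return 0.0
        hits = sum(ch in COMMON_LETTERS[encoding] for ch in high)
        return hits / len(high)

    return max(COMMON_LETTERS, key=score)


print(guess_iso8859("καλημέρα κόσμε".encode("iso-8859-7")))  # iso-8859-7
print(guess_iso8859("привет мир".encode("iso-8859-5")))      # iso-8859-5
```

The wrong candidates still score well above zero on these samples, which illustrates why short inputs are easily misclassified.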

Due to the unreliability of heuristic detection, it is better to properly label datasets with the correct encoding (see Character encodings in HTML, "Specifying the document's character encoding"). Even though UTF-8 and UTF-16 are easy to detect, some systems require documents in UTF encodings to be explicitly labelled with a prefixed byte order mark (BOM).
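
Checking for a BOM can be sketched in Python with the constants from the standard `codecs` module. The function name is hypothetical, and the sketch covers only UTF-8 and UTF-16; a UTF-32LE BOM begins with the same two bytes as the UTF-16LE BOM and would need to be tested first if UTF-32 were in scope.

```python
import codecs
from typing import Optional


def encoding_from_bom(data: bytes) -> Optional[str]:
    """Return a Python codec name if data starts with a UTF-8 or UTF-16 BOM."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"  # this codec strips the BOM while decoding
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"     # this codec reads the BOM to pick the byte order
    return None             # no BOM: the encoding is still unknown


labelled = codecs.BOM_UTF8 + "München".encode("utf-8")
print(labelled.decode(encoding_from_bom(labelled)))  # München
```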
