UTF-32

Source: Wikipedia, the free encyclopedia.

UTF-32 (32-bit Unicode Transformation Format) is a fixed-length encoding of Unicode code points that uses exactly 32 bits (four bytes) per code point. Because there are far fewer than 2^32 Unicode code points, only 21 bits are actually needed, and the leading 11 bits are always zero.[1]
UTF-32 is a fixed-length encoding, in contrast to all other Unicode transformation formats, which are variable-length encodings. Each 32-bit value in UTF-32 represents one Unicode code point and is exactly equal to that code point's numerical value.
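
The equality between code units and code point values can be checked with a minimal Python sketch. The helper name `utf32_code_units` and the sample string are illustrative assumptions, not part of any standard API; the codec name `utf-32-be` is Python's built-in big-endian UTF-32 codec.

```python
import struct

def utf32_code_units(text: str) -> list[int]:
    """Return the 32-bit code units of text encoded as big-endian UTF-32."""
    # An explicit endianness ("-be") means no byte order mark is prepended.
    data = text.encode("utf-32-be")
    return list(struct.unpack(f">{len(data) // 4}I", data))

# 'a', 'é' (U+00E9), '€' (U+20AC), and an emoji outside the BMP (U+1F600)
sample = "a\u00e9\u20ac\U0001F600"
units = utf32_code_units(sample)
print([hex(u) for u in units])
```

Each unit should match `ord()` of the corresponding character, and the top 11 bits of every unit should be zero.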

The main advantage of UTF-32 is that the Unicode code points are directly indexed. Finding the Nth code point in a sequence of code points is a constant-time operation. In contrast, a variable-length code requires linear time to count N code points from the start of the string. This makes UTF-32 a simple replacement in code that uses integers that are incremented by one to examine each location in a string, as was commonly done for ASCII. However, Unicode code points are rarely processed in complete isolation: combining character sequences and emoji, for example, often span several code points.[2]
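
The difference in indexing cost can be sketched as follows. Both function names are hypothetical, the UTF-32 input is assumed to be little-endian without a byte order mark, and the UTF-8 input is assumed to be well-formed.

```python
def nth_code_point_utf32(data: bytes, n: int) -> int:
    """Constant time: the n-th code point starts at byte offset 4 * n."""
    if not 0 <= n < len(data) // 4:
        raise IndexError("code point index out of range")
    return int.from_bytes(data[4 * n : 4 * n + 4], "little")

def nth_code_point_utf8(data: bytes, n: int) -> int:
    """Linear time: step over n encoded code points before decoding one."""
    index = 0
    for _ in range(n):
        index += 1
        # Skip continuation bytes (10xxxxxx) belonging to the same code point.
        while index < len(data) and data[index] & 0xC0 == 0x80:
            index += 1
    if n < 0 or index >= len(data):
        raise IndexError("code point index out of range")
    end = index + 1
    while end < len(data) and data[end] & 0xC0 == 0x80:
        end += 1
    return ord(data[index:end].decode("utf-8"))

text = "h\u00e9llo \U0001F600!"
as_utf32 = text.encode("utf-32-le")
as_utf8 = text.encode("utf-8")
print(hex(nth_code_point_utf32(as_utf32, 6)), hex(nth_code_point_utf8(as_utf8, 6)))
```

The UTF-32 lookup does one slice regardless of `n`, while the UTF-8 lookup has to walk over every preceding code point.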

The main disadvantage of UTF-32 is that it is space-inefficient, using four bytes per code point, including 11 bits that are always zero. Characters beyond the BMP are relatively rare in most texts (except, for example, texts with some popular emojis) and can typically be ignored for sizing estimates. This makes UTF-32 close to twice the size of UTF-16. It can be up to four times the size of UTF-8, depending on how many of the characters are in the ASCII subset.[2]
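
These size ratios can be illustrated with a short sketch that encodes a few sample strings. The helper `encoded_sizes` and the samples are assumptions chosen for illustration; byte order marks are excluded by using explicit little-endian codecs.

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Return the encoded length in bytes for each Unicode encoding form."""
    codecs = (("UTF-8", "utf-8"), ("UTF-16", "utf-16-le"), ("UTF-32", "utf-32-le"))
    return {name: len(text.encode(codec)) for name, codec in codecs}

ascii_text = "hello"                 # ASCII only
cjk_text = "\u65e5\u672c\u8a9e"      # three BMP ideographs
emoji_text = "\U0001F600"            # one code point outside the BMP

for sample in (ascii_text, cjk_text, emoji_text):
    print(encoded_sizes(sample))
```

For ASCII text, UTF-32 is four times the size of UTF-8 and twice the size of UTF-16; for a code point outside the BMP, all three forms use four bytes.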

History

The original ISO/IEC 10646 standard defines a 32-bit encoding form called UCS-4, in which each code point in the Universal Character Set (UCS) is represented by a 31-bit value from 0 to 0x7FFFFFFF (the sign bit was unused and zero). In November 2003, Unicode was restricted by RFC 3629 to match the constraints of the UTF-16 encoding: explicitly prohibiting code points greater than U+10FFFF (and also the high and low surrogates U+D800 through U+DFFF). This limited subset defines UTF-32.[3][1] Although the ISO standard had (as of 1998 in Unicode 2.1) "reserved for private use" 0xE00000 to 0xFFFFFF and 0x60000000 to 0x7FFFFFFF,[4] these areas were removed in later versions. Because the Principles and Procedures document of ISO/IEC JTC 1/SC 2 Working Group 2 states that all future assignments of code points will be constrained to the Unicode range, UTF-32 will be able to represent all UCS code points, and UTF-32 and UCS-4 are identical.[5]
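
The two restrictions described above (an upper limit of U+10FFFF and the exclusion of surrogates) can be expressed as a small validity check. The function names are hypothetical; this is a sketch of the rule, not an excerpt from any library.

```python
def is_valid_utf32_unit(value: int) -> bool:
    """True if value is a well-formed UTF-32 code unit (a Unicode scalar value)."""
    in_unicode_range = 0 <= value <= 0x10FFFF
    is_surrogate = 0xD800 <= value <= 0xDFFF
    return in_unicode_range and not is_surrogate

def is_valid_original_ucs4_unit(value: int) -> bool:
    """True if value fits the original 31-bit UCS-4 range described above."""
    return 0 <= value <= 0x7FFFFFFF

for candidate in (0x41, 0xD800, 0x10FFFF, 0x110000):
    print(hex(candidate), is_valid_utf32_unit(candidate))
```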

Utility of fixed width

A fixed number of bytes per code point has a number of theoretical advantages, but each of these has problems in reality:

  • Truncation becomes easier, but not significantly so compared to UTF-8 and UTF-16 (both of which can search backwards for the point to truncate by looking at 2–4 code units at most).[a][citation needed]
  • Finding the Nth character in a string. For fixed width, this is simply an O(1) problem, while it is an O(n) problem in a variable-width encoding.[6] Amateur programmers often vastly overestimate how useful this is: in reality, algorithms that know n without first examining the n-1 characters before it (an O(n) problem) are very rare or non-existent.[citation needed] In addition, Unicode code points are often not equivalent to what the user thinks is a "character"; for instance, both of these emoji are 3 code points: "👨‍🦲 Man: Bald"[7] and "👩‍🦰 Woman: Red Hair".[8][9]
  • Quickly knowing the "width" of a string. In practice, even with a "fixed width" font and restricting the characters to the BMP, finding the string width from a count of code points is impossible. There are combining forms like 'é' expressed using the two code points 'e' + ' ́ ', a "fixed width" font may assign a width of 2 to CJK ideographs, and some code points take multiple character positions per code point ("grapheme clusters" for CJK).[6]
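
The gap between code points, user-perceived characters, and display width can be illustrated with Python's standard `unicodedata` module. The constant names and the `naive_column_width` heuristic are assumptions made for this sketch; the heuristic is deliberately simplistic and still mishandles cases such as emoji joined by a zero-width joiner.

```python
import unicodedata

BALD_MAN = "\U0001F468\u200D\U0001F9B2"        # man + zero-width joiner + bald
RED_HAIR_WOMAN = "\U0001F469\u200D\U0001F9B0"  # woman + zero-width joiner + red hair
E_ACUTE_DECOMPOSED = "e\u0301"                 # 'e' + combining acute accent

def code_point_count(text: str) -> int:
    """Number of code points, which is also the number of UTF-32 code units."""
    return len(text)

def naive_column_width(text: str) -> int:
    """Rough width guess: 0 for combining marks, 2 for wide CJK, 1 otherwise."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks attach to the previous character
        wide = unicodedata.east_asian_width(ch) in ("W", "F")
        width += 2 if wide else 1
    return width

print(code_point_count(BALD_MAN), code_point_count(RED_HAIR_WOMAN))
print(code_point_count(E_ACUTE_DECOMPOSED), naive_column_width(E_ACUTE_DECOMPOSED))
print(code_point_count("\u65e5\u672c"), naive_column_width("\u65e5\u672c"))
```

In each printed pair the code point count differs from what a user would call the number of characters or columns, so a fixed-width encoding alone does not answer either question.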

Use

The main use of UTF-32 is in internal APIs where the data is single code points or glyphs, rather than strings of characters. For instance, in modern text rendering, it is common[citation needed] that the last step is to build a list of structures each containing coordinates (x,y), attributes, and a single UTF-32 code point identifying the glyph to draw. Often non-Unicode information is stored in the "unused" 11 bits of each word.[citation needed]
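
As a minimal sketch of storing extra information in the unused 11 bits, the following packs a small flags value above a 21-bit code point. The names `pack_glyph` and `unpack_glyph` and the meaning of the flags are purely hypothetical; no particular text renderer is being described.

```python
CODE_POINT_BITS = 21
CODE_POINT_MASK = (1 << CODE_POINT_BITS) - 1   # 0x1FFFFF
MAX_FLAGS = (1 << (32 - CODE_POINT_BITS)) - 1  # 11 spare bits: 0x7FF

def pack_glyph(code_point: int, flags: int) -> int:
    """Combine a code point and up to 11 bits of flags into one 32-bit word."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of Unicode range")
    if not 0 <= flags <= MAX_FLAGS:
        raise ValueError("flags do not fit in 11 bits")
    return (flags << CODE_POINT_BITS) | code_point

def unpack_glyph(word: int) -> tuple[int, int]:
    """Split a packed 32-bit word back into (code_point, flags)."""
    return word & CODE_POINT_MASK, word >> CODE_POINT_BITS

word = pack_glyph(0x1F600, 0b101)
print(hex(word), unpack_glyph(word))
```

A word packed this way is no longer valid UTF-32, so it is only suitable for internal structures that unpack it before handing the code point to other code.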

Use of UTF-32 strings on Windows (where wchar_t is 16 bits) is almost non-existent. On Unix systems, UTF-32 strings are sometimes, but rarely, used internally by applications, due to the type wchar_t being defined as 32 bits. Python versions up to 3.2 can be compiled to use them instead of UTF-16; from version 3.3 onward all Unicode strings are stored in UTF-32 but with leading zero bytes optimized away "depending on the [code point] with the largest Unicode ordinal (1, 2, or 4 bytes)" to make all code points that size.[10] The Seed7[11] and Lasso[citation needed] programming languages encode all strings with UTF-32, in the belief that direct indexing is important, whereas the Julia programming language moved away from built-in UTF-32 support with its 1.0 release, simplifying the language to having only UTF-8 strings (with all the other encodings considered legacy and moved out of the standard library to a package[12]) following the "UTF-8 Everywhere Manifesto".[13]
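
The flexible representation in Python 3.3 and later can be observed indirectly by measuring how much a string's memory footprint grows per added code point. The helper `bytes_per_code_point` is a hypothetical name, and the sketch assumes the CPython implementation, since `sys.getsizeof` results are implementation-specific.

```python
import sys

def bytes_per_code_point(ch: str) -> int:
    """Estimate per-code-point storage by comparing two repeat lengths (CPython)."""
    short, long = ch * 1000, ch * 2000
    # Fixed header overhead cancels out when taking the difference.
    return (sys.getsizeof(long) - sys.getsizeof(short)) // 1000

for label, ch in (("ASCII", "a"), ("Latin-1", "\u00e9"),
                  ("BMP", "\u20ac"), ("non-BMP", "\U0001F600")):
    print(label, bytes_per_code_point(ch))

# A single non-BMP code point widens the storage of the whole string.
mixed = "a" * 1000 + "\U0001F600"
print(sys.getsizeof(mixed) > 4000)
```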

Variants

Though technically invalid, the surrogate halves are often encoded and allowed. This allows invalid UTF-16 (such as Windows filenames) to be translated to UTF-32, similar to how the WTF-8 variant of UTF-8 works. Sometimes paired surrogates are encoded instead of non-BMP characters, similar to CESU-8. Due to the large number of unused 32-bit values, it is also possible to preserve invalid UTF-8 by using non-Unicode values to encode UTF-8 errors, though there is no standard for this.
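
Python's `surrogatepass` error handler gives one way to experiment with the first variant, carrying an unpaired surrogate from invalid UTF-16 into UTF-32. This is a sketch of the idea using Python's built-in codecs, not a description of how any particular system implements it; the sample bytes are made up.

```python
# Invalid UTF-16: the letter 'A' followed by a lone high surrogate (0xD800).
invalid_utf16 = b"\x41\x00\x00\xd8"

# "surrogatepass" lets the codec keep the unpaired surrogate as a code point.
text = invalid_utf16.decode("utf-16-le", "surrogatepass")
lenient_utf32 = text.encode("utf-32-le", "surrogatepass")
print(lenient_utf32.hex(" ", 4))

# The default strict mode rejects the surrogate, as the standard requires.
try:
    text.encode("utf-32-le")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
print(strict_ok)
```

The lenient output round-trips back to the same invalid UTF-16 bytes, which is the property that makes this variant useful for data such as filenames.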

Notes

  a. ^ For UTF-8: Select the point to truncate at. If the byte before it is 0x00-0x7F, or the byte after it is anything other than the continuation bytes 0x80-0xBF, the string can be truncated at that point. Otherwise, search up to 3 bytes backwards for such a point and truncate there. If none is found, truncate at the original position. This works even if there are encoding errors in the UTF-8. UTF-16 is trivial and only has to back up one word at most.
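
The UTF-8 truncation procedure in this note can be sketched in Python as follows. The function names are hypothetical, and "the byte after" the truncation point is interpreted here as the first byte that would be cut off.

```python
def _is_safe_cut(data: bytes, pos: int) -> bool:
    """True if cutting before data[pos] does not split a multi-byte sequence."""
    if pos <= 0 or pos >= len(data):
        return True
    previous_is_ascii = data[pos - 1] <= 0x7F
    next_is_continuation = 0x80 <= data[pos] <= 0xBF
    return previous_is_ascii or not next_is_continuation

def truncate_utf8(data: bytes, limit: int) -> bytes:
    """Truncate UTF-8 bytes to at most limit bytes without splitting a code point."""
    if limit >= len(data):
        return data
    if limit <= 0:
        return b""
    # Try the requested position, then up to 3 bytes backwards.
    for back in range(4):
        pos = limit - back
        if pos < 0:
            break
        if _is_safe_cut(data, pos):
            return data[:pos]
    # No safe point nearby (malformed input): cut at the original position.
    return data[:limit]

encoded = "a\u00e9\u20ac\U0001F600".encode("utf-8")  # 1 + 2 + 3 + 4 bytes
print([len(truncate_utf8(encoded, n)) for n in range(len(encoded) + 1)])
```

Because the search inspects at most four positions, the cost does not depend on the length of the string, which is the point the note makes about truncation being cheap in UTF-8.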

References

  1. ^ a b Constable, Peter (2001-06-13). "Mapping codepoints to Unicode encoding forms". Computers and Writing Systems - SIL International. Retrieved 2022-10-03.
  2. ^ a b "FAQ - UTF-8, UTF-16, UTF-32 & BOM". Unicode. Retrieved 2022-09-04.
  3. ^ "Publicly Available Standards - ISO/IEC 10646:2020". ISO Standards. Retrieved 2021-10-12. Clause 9.4: "Because surrogate code points are not UCS scalar values, UTF-32 code units in the range 0000 D800-0000 DFFF are ill-formed". Clause 4.57: "[UCS codespace] consisting of the integers from 0 to 10 FFFF (hexadecimal)". Clause 4.58: "[UCS scalar value] any UCS code point except high-surrogate and low-surrogate code points".
  4. ^ "Annex B - The Universal Character Set (UCS)". DKUUG Standardizing. Archived from the original on Jan 22, 2022. Retrieved 2022-10-03.
  5. ^ "It [UCS-4] is now treated simply as a synonym for UTF-32, and is considered the canonical form for representation of characters in 10646."
  6. ^ a b Goregaokar, Manish (January 14, 2017). "Let's Stop Ascribing Meaning to Code Points". In Pursuit of Laziness. Retrieved 2020-06-14. Folks start implying that code points mean something, and that O(1) indexing or slicing at code point boundaries is a useful operation.
  7. ^ "👨‍🦲 Man: Bald Emoji". Emojipedia. Retrieved 2021-10-12.
  8. ^ "👩‍🦰 Woman: Red Hair Emoji". Emojipedia. Retrieved 2021-10-12.
  9. ^ "↔️ Emoji ZWJ (Zero Width Joiner) Sequences". emojipedia.org. Retrieved 2021-10-12.
  10. ^ Löwis, Martin. "PEP 393 -- Flexible String Representation". python.org. Python. Retrieved 26 October 2014.
  11. ^ "The usage of UTF-32 has several advantages".
  12. ^ JuliaStrings/LegacyStrings.jl: Legacy Unicode string types, JuliaStrings, 2019-05-17, retrieved 2019-10-15
  13. ^ "UTF-8 Everywhere Manifesto".

This page is based on the copyrighted Wikipedia article "UTF-32". The article is available under the CC BY-SA 3.0 license; additional terms may apply.