UTF-32

UTF-32 (32-bit Unicode Transformation Format) is a fixed-length encoding of Unicode code points that uses exactly 32 bits (four bytes) per code point. Because there are far fewer than 2³² Unicode code points, only 21 bits are actually needed and the leading bits must be zero.^[1]

UTF-32 is a fixed-length encoding, in contrast to all other Unicode transformation formats, which are variable-length encodings. Each 32-bit value in UTF-32 represents one Unicode code point and is exactly equal to that code point's numerical value.
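As a minimal sketch of this one-to-one correspondence (Python is used purely for illustration, and the helper name `utf32_code_units` is hypothetical), the following encodes a short string as big-endian UTF-32 and unpacks the resulting 32-bit code units, which should equal the code point values:

```python
import struct

def utf32_code_units(text):
    """Return the 32-bit code units of `text` encoded as big-endian UTF-32."""
    data = text.encode("utf-32-be")   # the explicit -be codec writes no byte order mark
    count = len(data) // 4            # exactly four bytes per code point
    return list(struct.unpack(f">{count}I", data))

sample = "A\u20ac\U0001F600"          # U+0041, U+20AC, U+1F600
units = utf32_code_units(sample)
code_points = [ord(ch) for ch in sample]

# Each code unit is numerically identical to its code point.
print([f"U+{u:04X}" for u in units], units == code_points)
```

The byte order is fixed to big-endian here only so that the bytes read in the same order as the hexadecimal code point; the little-endian codec would behave the same way with `"<"` in the format string.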

The main advantage of UTF-32 is that the Unicode code points are directly indexed. Finding the Nth code point in a sequence of code points is a constant-time operation. In contrast, a variable-length code requires linear time to count N code points from the start of the string. This makes UTF-32 a simple replacement in code that uses integers that are incremented by one to examine each location in a string, as was commonly done for ASCII. However, Unicode code points are rarely processed in complete isolation: combining character sequences and emoji, for example, span several code points.^[2]
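The difference in indexing cost can be sketched as follows. Both function names are hypothetical, and little-endian UTF-32 without a byte order mark is assumed: the UTF-32 lookup reads four bytes at a computed offset, while the UTF-8 lookup has to walk the bytes and count lead bytes.

```python
def nth_code_point_utf32(data, n):
    """Constant time: code point n occupies bytes 4n to 4n+3."""
    if not 0 <= n < len(data) // 4:
        raise IndexError("code point index out of range")
    return int.from_bytes(data[4 * n : 4 * n + 4], "little")

def nth_code_point_utf8(data, n):
    """Linear time: scan from the start, counting bytes that begin a code point."""
    starts = 0
    begin = None
    for i, byte in enumerate(data):
        if byte & 0xC0 != 0x80:          # not a continuation byte (0x80-0xBF)
            if begin is not None:        # reached the start of the following code point
                return ord(data[begin:i].decode("utf-8"))
            if starts == n:
                begin = i
            starts += 1
    if begin is None:
        raise IndexError("code point index out of range")
    return ord(data[begin:].decode("utf-8"))

text = "na\u00efve \U0001F600!"
as_utf32 = text.encode("utf-32-le")
as_utf8 = text.encode("utf-8")
print(hex(nth_code_point_utf32(as_utf32, 6)), hex(nth_code_point_utf8(as_utf8, 6)))
```

The UTF-8 scan assumes well-formed input; it is meant to show the shape of the work, not to serve as a validating decoder.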

The main disadvantage of UTF-32 is that it is space-inefficient, using four bytes per code point, including 11 bits that are always zero. Characters beyond the BMP are relatively rare in most texts (except, for example, texts with some popular emojis) and can typically be ignored for sizing estimates. This makes UTF-32 close to twice the size of UTF-16. It can be up to four times the size of UTF-8, depending on how many of the characters are in the ASCII subset.^[2]
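These ratios can be checked with a small, illustrative comparison. The function name `encoded_sizes` and the sample strings are assumptions chosen for the example; the explicit little-endian codecs are used so that no byte order mark is counted.

```python
def encoded_sizes(text):
    """Return the encoded length in bytes of `text` in each encoding form."""
    return {name: len(text.encode(name)) for name in ("utf-8", "utf-16-le", "utf-32-le")}

ascii_sizes = encoded_sizes("hello world")        # 11 ASCII characters
cjk_sizes = encoded_sizes("\u65e5\u672c\u8a9e")   # 3 BMP characters outside ASCII
emoji_sizes = encoded_sizes("\U0001F600")         # 1 character beyond the BMP

for label, sizes in [("ASCII", ascii_sizes), ("CJK", cjk_sizes), ("emoji", emoji_sizes)]:
    print(label, sizes)
```

For the ASCII sample UTF-32 is four times the size of UTF-8 and twice the size of UTF-16; for the single non-BMP character all three encodings take four bytes.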

History

The original ISO/IEC 10646 standard defines a 32-bit encoding form called UCS-4, in which each code point in the Universal Character Set (UCS) is represented by a 31-bit value from 0 to 0x7FFFFFFF (the sign bit was unused and zero). In November 2003, Unicode was restricted by RFC 3629 to match the constraints of the UTF-16 encoding: explicitly prohibiting code points greater than U+10FFFF (and also the high and low surrogates U+D800 through U+DFFF). This limited subset defines UTF-32.^[3]^[1] Although the ISO standard had (as of 1998 in Unicode 2.1) "reserved for private use" 0xE00000 to 0xFFFFFF and 0x60000000 to 0x7FFFFFFF,^[4] these areas were removed in later versions. Because the Principles and Procedures document of ISO/IEC JTC 1/SC 2 Working Group 2 states that all future assignments of code points will be constrained to the Unicode range, UTF-32 will be able to represent all UCS code points, and UTF-32 and UCS-4 are identical.^[5]
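The resulting restriction amounts to a simple range check on each 32-bit value. The sketch below states it directly; the function name `is_valid_utf32_code_unit` is hypothetical.

```python
MAX_CODE_POINT = 0x10FFFF                         # upper limit shared with UTF-16
SURROGATE_FIRST, SURROGATE_LAST = 0xD800, 0xDFFF  # high and low surrogates

def is_valid_utf32_code_unit(value):
    """True if `value` is a code point that well-formed UTF-32 may contain."""
    if not 0 <= value <= MAX_CODE_POINT:
        return False                              # beyond the Unicode range
    return not SURROGATE_FIRST <= value <= SURROGATE_LAST

for candidate in (0x41, 0xD800, 0x10FFFF, 0x110000, 0x7FFFFFFF):
    print(f"0x{candidate:X}", is_valid_utf32_code_unit(candidate))
```

Values such as 0x7FFFFFFF, which fit the original 31-bit range, are rejected by this check.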

Utility of fixed width

A fixed number of bytes per code point has a number of theoretical advantages, but each of these has problems in reality:

Truncation becomes easier, but not significantly so compared to UTF-8 and UTF-16 (both of which can search backwards for the point to truncate by looking at 2–4 code units at most).^[a]^{[citation needed]}
Finding the Nth character in a string. For fixed width, this is simply an O(1) problem, while it is an O(n) problem in a variable-width encoding.^[6] Amateur programmers often vastly overestimate how useful this is: in reality, algorithms that know n without first examining the n−1 characters before it (an O(n) problem) are very rare or non-existent.^{[citation needed]} In addition, Unicode code points are often not equivalent to what the user thinks of as a "character"; for instance, both of these emoji are 3 code points: "👨‍🦲 Man: Bald"^[7] and "👩‍🦰 Woman: Red Hair".^[8]^[9]
Quickly knowing the "width" of a string. In practice, even with a "fixed width" font and restricting the characters to the BMP, finding the string width from a count of code points is impossible. There are combining forms such as 'é' expressed as two code points ('e' + ' ́ '), a "fixed width" font may assign a width of 2 to CJK ideographs, and some code points take multiple character positions ("grapheme clusters" for CJK).^[6]
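The gap between code points and user-perceived characters can be illustrated with Python's standard `unicodedata` module. This is only a sketch; the variable names are arbitrary and the emoji is written with escapes so that the code point sequence is explicit.

```python
import unicodedata

# "Man: Bald" is MAN + ZERO WIDTH JOINER + BALD component: one glyph, three code points.
bald_man = "\U0001F468\u200D\U0001F9B2"
bald_man_code_points = len(bald_man)
bald_man_utf32_bytes = len(bald_man.encode("utf-32-le"))

# 'é' can be one precomposed code point or 'e' followed by a combining acute accent.
composed = "\u00E9"
decomposed = "e\u0301"
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed

# A CJK ideograph is classified as wide, so it typically takes two terminal columns.
cjk_width_class = unicodedata.east_asian_width("\u65E5")

print(bald_man_code_points, bald_man_utf32_bytes, len(composed), len(decomposed),
      same_after_nfc, cjk_width_class)
```

Even in UTF-32, then, indexing by code unit gives code points rather than what a reader would count as characters or columns.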

Use

The main use of UTF-32 is in internal APIs where the data is single code points or glyphs, rather than strings of characters. For instance, in modern text rendering, it is common^{[citation needed]} that the last step is to build a list of structures each containing coordinates (x,y), attributes, and a single UTF-32 code point identifying the glyph to draw. Often non-Unicode information is stored in the "unused" 11 bits of each word.^{[citation needed]}

Use of UTF-32 strings on Windows (where wchar_t is 16 bits) is almost non-existent. On Unix systems, UTF-32 strings are sometimes, but rarely, used internally by applications, due to the type wchar_t being defined as 32 bits. Python versions up to 3.2 can be compiled to use them instead of UTF-16; from version 3.3 onward, all Unicode strings are stored in UTF-32 but with leading zero bytes optimized away "depending on the [code point] with the largest Unicode ordinal (1, 2, or 4 bytes)" to make all code points that size.^[10] The Seed7^[11] and Lasso^{[citation needed]} programming languages encode all strings with UTF-32, in the belief that direct indexing is important, whereas the Julia programming language moved away from built-in UTF-32 support with its 1.0 release, simplifying the language to having only UTF-8 strings (with all the other encodings considered legacy and moved out of the standard library to a package^[12]) following the "UTF-8 Everywhere Manifesto".^[13]
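The flexible representation described for Python 3.3 and later can be observed indirectly, as in the sketch below. It assumes the CPython implementation, where `sys.getsizeof` reflects the string's internal storage; other Python implementations may report different numbers, and the helper name `bytes_per_code_point` is hypothetical.

```python
import sys

def bytes_per_code_point(ch, n=1000):
    """Estimate storage per code point from how the object size grows (CPython-specific)."""
    # Doubling the length cancels the fixed object header, leaving only per-character cost.
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n

widths = {
    label: bytes_per_code_point(ch)
    for label, ch in [("ASCII", "a"), ("BMP", "\u20ac"), ("non-BMP", "\U0001F600")]
}
print(widths)
```

On CPython the three samples are expected to show 1, 2 and 4 bytes per code point respectively, matching the "1, 2, or 4 bytes" wording quoted above.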

Variants

Though technically invalid, the surrogate halves are often encoded and allowed. This allows invalid UTF-16 (such as Windows filenames) to be translated to UTF-32, similar to how the WTF-8 variant of UTF-8 works. Sometimes paired surrogates are encoded instead of non-BMP characters, similar to CESU-8. Due to the large number of unused 32-bit values, it is also possible to preserve invalid UTF-8 by using non-Unicode values to encode UTF-8 errors, though there is no standard for this.
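Python's codecs offer one concrete illustration of the first variant: the strict UTF-32 codec rejects surrogate code points, while the `surrogatepass` error handler lets them through. The two helper names below are hypothetical, and this shows only how Python behaves, not a standardized format.

```python
def encode_strict(text):
    """Well-formed UTF-32 only: raises UnicodeEncodeError on surrogate code points."""
    return text.encode("utf-32-le")

def encode_lenient(text):
    """Variant that passes surrogate halves through as ordinary 32-bit values."""
    return text.encode("utf-32-le", errors="surrogatepass")

lone_surrogate = "\ud800"          # half of a UTF-16 pair, e.g. from a malformed filename

try:
    encode_strict(lone_surrogate)
    strict_accepted = True
except UnicodeEncodeError:
    strict_accepted = False

lenient_bytes = encode_lenient(lone_surrogate)
round_tripped = lenient_bytes.decode("utf-32-le", errors="surrogatepass")
print(strict_accepted, lenient_bytes, round_tripped == lone_surrogate)
```

Data written this way is not valid UTF-32, so it should be decoded with the same lenient handler rather than handed to a strict consumer.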

Notes

^ For UTF-8: Select point to truncate at. If the byte before it is 0-0x7F, or the byte after it is anything other than the continuation bytes 0x80-0xBF, the string can be truncated at that point. Otherwise search up to 3 bytes backwards for such a point and truncate at that. If not found, truncate at the original position. This works even if there are encoding errors in the UTF-8. UTF-16 is trivial and only has to back up one word at most.
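The UTF-8 procedure in this note can be sketched as follows. The function names are hypothetical, and "the byte after" the truncation point is read as the first byte that would be cut off.

```python
def is_safe_cut(data, pos):
    """True if cutting `data` at `pos` does not split a multi-byte sequence."""
    if pos <= 0 or pos >= len(data):
        return True
    before, after = data[pos - 1], data[pos]
    # Safe if the previous byte is ASCII, or the next byte is not a continuation byte.
    return before <= 0x7F or not 0x80 <= after <= 0xBF

def truncate_utf8(data, limit):
    """Truncate UTF-8 bytes to at most `limit` bytes, backing up at most 3 bytes."""
    limit = max(limit, 0)
    if limit >= len(data):
        return data
    for back in range(4):              # the original position, then up to 3 bytes back
        pos = limit - back
        if pos < 0:
            break
        if is_safe_cut(data, pos):
            return data[:pos]
    return data[:limit]                # malformed input: fall back to the original position

encoded = "h\u00e9llo".encode("utf-8")   # b'h\xc3\xa9llo'
print(truncate_utf8(encoded, 2), truncate_utf8(encoded, 3))
```

Because the search is bounded, the function still returns a result for input that contains encoding errors, as the note describes.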

References

^ ^a ^b Constable, Peter (2001-06-13). "Mapping codepoints to Unicode encoding forms". Computers and Writing Systems - SIL International. Retrieved 2022-10-03.
^ ^a ^b "FAQ - UTF-8, UTF-16, UTF-32 & BOM". Unicode. Retrieved 2022-09-04.
^ "Publicly Available Standards - ISO/IEC 10646:2020". ISO Standards. Retrieved 2021-10-12. Clause 9.4: "Because surrogate code points are not UCS scalar values, UTF-32 code units in the range 0000 D800-0000 DFFF are ill-formed". Clause 4.57: "[UCS codespace] consisting of the integers from 0 to 10 FFFF (hexadecimal)". Clause 4.58: "[UCS scalar value] any UCS code point except high-surrogate and low-surrogate code points".
^ "Annex B - The Universal Character Set (UCS)". DKUUG Standardizing. Archived from the original on Jan 22, 2022. Retrieved 2022-10-03.
^ ISBN 978-1-936213-01-6. It [UCS-4] is now treated simply as a synonym for UTF-32, and is considered the canonical form for representation of characters in 10646.

^ ^a ^b Goregaokar, Manish (January 14, 2017). "Let's Stop Ascribing Meaning to Code Points". In Pursuit of Laziness. Retrieved 2020-06-14. Folks start implying that code points mean something, and that O(1) indexing or slicing at code point boundaries is a useful operation.

^ "👨‍🦲 Man: Bald Emoji". Emojipedia. Retrieved 2021-10-12.

^ "👩‍🦰 Woman: Red Hair Emoji". Emojipedia. Retrieved 2021-10-12.

^ "↔️ Emoji ZWJ (Zero Width Joiner) Sequences". emojipedia.org. Retrieved 2021-10-12.

^ Löwis, Martin. "PEP 393 -- Flexible String Representation". python.org. Python. Retrieved 26 October 2014.

^ "The usage of UTF-32 has several advantages".

^ JuliaStrings/LegacyStrings.jl: Legacy Unicode string types, JuliaStrings, 2019-05-17, retrieved 2019-10-15

^ "UTF-8 Everywhere Manifesto".

External links

The Unicode Standard 5.0.0, chapter 3 – formally defines UTF-32 in § 3.9, D90 (PDF page 40) and § 3.10, D99-D101 (PDF page 45)

Unicode Standard Annex #19 – formally defined UTF-32 for Unicode 3.x (March 2001; last updated March 2002)

Registration of new charsets: UTF-32, UTF-32BE, UTF-32LE – announcement of UTF-32 being added to the IANA charset registry (April 2002)
