Unicode control characters

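Many Unicode characters are used to control the interpretation or display of text, but these characters themselves have no visual or spatial representation. For example, the null character (U+0000 NULL) is used in C-programming application environments to indicate the end of a string of characters. In this way, these programs only require a single starting memory address for a string (as opposed to a starting address and a length), since the string ends once the program reads the null character.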
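The null-terminated convention can be sketched in a few lines of Python. This is only an illustration: `read_c_string` is a hypothetical helper, and the `bytes` object stands in for raw memory.

```python
def read_c_string(memory: bytes, start: int) -> bytes:
    """Return the bytes from `start` up to (not including) the next NUL byte."""
    # Only a starting offset is needed; the terminator marks the end.
    end = memory.index(0, start)  # raises ValueError if no terminator exists
    return memory[start:end]


# Two strings stored back to back, each ended by U+0000 (byte 0x00).
memory = b"hello\x00world\x00"
print(read_c_string(memory, 0))
print(read_c_string(memory, 6))
```

No length is stored anywhere; the reader simply stops at the first null byte after the starting offset.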

In the narrowest sense, a control code is a character with the general category Cc, which comprises the C0 and C1 control codes; their most common interpretation is defined in ISO/IEC 6429. Control codes are handled distinctly from ordinary Unicode characters, for example, by not being assigned character names (although they are assigned normative formal aliases).^[1] In a broader sense, other non-printing format characters, such as those used in bidirectional text, are also referred to as control characters by software;^[2] these are mostly assigned to the general category Cf (format), used for format effectors introduced and defined by Unicode itself.
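The distinction between the two categories can be inspected with Python's standard `unicodedata` module. The sketch below assumes the Unicode database bundled with the running interpreter; `describe` is a hypothetical helper name.

```python
import unicodedata


def describe(ch: str) -> tuple:
    """Return (general category, character name or None) for one character."""
    # Cc control codes have no character name, so a default is supplied.
    return unicodedata.category(ch), unicodedata.name(ch, None)


for ch in ("\u0000", "\u0009", "\u200e", "\u202e", "A"):
    category, name = describe(ch)
    print(f"U+{ord(ch):04X} {category} {name or '<no character name>'}")
```

The control codes report category Cc and no name, while the bidirectional format characters report Cf and carry ordinary character names.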

Category "Cc" control codes (C0 and C1)

The control code ranges 0x00–0x1F ("C0") and 0x7F originate from the 1967 edition of US-ASCII. ISO/IEC 2022 (ECMA-35) extends ASCII with a secondary "C1" range of 8-bit control codes, 0x80–0x9F. The most common interpretation of both ranges is specified in ISO/IEC 6429 (ECMA-48).

Unicode adopted its first and second blocks (comprising U+0000 through U+00FF) from ASCII and ISO/IEC 8859-1, thus incorporating the C0 and C1 control code ranges (U+0000–U+001F, U+007F–U+009F) as general category "Cc". It does not assign normative names to these control codes, though it does assign them normative aliases.^[1]
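As a quick check of the ranges just given, the following sketch scans every code point and collects those whose general category is Cc. It assumes only Python's standard `unicodedata` module; the variable names are illustrative.

```python
import sys
import unicodedata

# Every code point whose general category is Cc.
cc = [cp for cp in range(sys.maxunicode + 1)
      if unicodedata.category(chr(cp)) == "Cc"]

# The ranges stated in the text: U+0000-U+001F and U+007F-U+009F.
expected = list(range(0x00, 0x20)) + list(range(0x7F, 0xA0))

print(len(cc))         # 32 C0 codes + DEL + 32 C1 codes
print(cc == expected)
```

The scan finds 65 code points, matching the two ranges exactly.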

Category "Cc" control codes can serve a variety of purposes, not limited to format effectors: for example, the default ASCII C0 set includes six format effectors (BS, HT, LF, VT, FF and CR), ten transmission controls, four device controls, four information separators and eight other control codes.^[4] Most of these characters play no explicit role in Unicode text handling, and are used only by higher-level protocols such as those used by terminal emulators. Certain characters are commonly used for formatting or sentinel purposes:

U+0000 NULL (used in null-terminated strings)
U+0009 HORIZONTAL TABULATION (HT) (inserted by the tab key)
U+000A LINE FEED (LF) (used as a line break)
U+000C FORM FEED (FF) (denotes a page break in a plain text file)
U+000D CARRIAGE RETURN (CR) (used in some line-breaking conventions)
U+0085 NEXT LINE (NEL) (sometimes used as a line break in text transcoded from EBCDIC)

Unicode only specifies semantics for U+0009–U+000D, U+001C–U+001F, and U+0085 (the ASCII format effectors except for BS, plus the ASCII information separators and the C1 NEL). The rest of the "Cc" control codes are transparent to Unicode and their meanings are left to higher-level protocols, although interpretation as defined in ISO/IEC 6429 is suggested as a default.^[5] Furthermore, certain specialised higher-level protocols, such as transcoded Teletext, may include a different interpretation of the entire C0 control code range.^[6]
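One visible trace of this split is the bidirectional class recorded for each control code in the Unicode Character Database. The sketch below is an illustration using Python's `unicodedata`; the names `WITH_SEMANTICS`, `ALL_CC` and `bidi_classes` are hypothetical, and the output depends on the database version bundled with the interpreter.

```python
import unicodedata

# Code points for which the text above says Unicode specifies semantics.
WITH_SEMANTICS = set(range(0x09, 0x0E)) | set(range(0x1C, 0x20)) | {0x85}
ALL_CC = list(range(0x00, 0x20)) + list(range(0x7F, 0xA0))


def bidi_classes() -> dict:
    """Map each Cc code point to its Bidi_Class as reported by unicodedata."""
    return {cp: unicodedata.bidirectional(chr(cp)) for cp in ALL_CC}


for cp, bidi in bidi_classes().items():
    marker = "*" if cp in WITH_SEMANTICS else " "
    print(f"{marker} U+{cp:04X} {bidi:>3} isspace={chr(cp).isspace()}")
```

The starred code points carry a separator or whitespace class (S, B or WS) and are treated as whitespace by `str.isspace`, while the remaining control codes are classed BN (boundary neutral).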

Unicode introduced separators

In an attempt to simplify the several newline characters used in legacy text, Unicode introduces its own newline characters to separate either lines or paragraphs: U+2028 LINE SEPARATOR (abbreviated LS or LSEP) and U+2029 PARAGRAPH SEPARATOR (abbreviated PS or PSEP).

Like CR and LF, LS and PS are effectors for text formatting; unlike CR and LF, they are not treated as "control codes" for ECMA-48 purposes (category Cc), rather having semantics defined entirely by Unicode itself. They are assigned to sui generis Unicode categories Zl and Zp respectively, under the major category Z (separator) used for certain whitespace characters.
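The categories and the line-breaking behaviour can be observed from Python, whose `str.splitlines` treats both separators as line boundaries. The helper `split_paragraphs` is a hypothetical name used only for this sketch.

```python
import unicodedata

LS, PS = "\u2028", "\u2029"  # LINE SEPARATOR, PARAGRAPH SEPARATOR


def split_paragraphs(text: str) -> list:
    """Split on PS into paragraphs, then on LS into lines within each."""
    return [paragraph.split(LS) for paragraph in text.split(PS)]


text = f"first line{LS}second line{PS}new paragraph"
print(unicodedata.category(LS), unicodedata.category(PS))
print(text.splitlines())
print(split_paragraphs(text))
```

`splitlines` flattens both separators into plain line breaks, whereas `split_paragraphs` keeps the two levels apart, which is the distinction the separate characters are meant to preserve.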

Language tags

Unicode previously included 128 characters, now deprecated, for language tags. These characters essentially mirrored the 128 ASCII characters but were used to identify the subsequent text as belonging to a particular language according to BCP 47. For example, to indicate subsequent text as the variant of English as written in the United States, the sequence U+E0001 LANGUAGE TAG, U+E0065 TAG LATIN SMALL LETTER E, U+E006E TAG LATIN SMALL LETTER N, U+E002D TAG HYPHEN-MINUS, U+E0075 TAG LATIN SMALL LETTER U and U+E0073 TAG LATIN SMALL LETTER S would have been used.
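The mapping from ASCII to tag characters is a fixed offset of 0xE0000, so the sequence in the example can be reproduced mechanically. The sketch below is for illustration only, since the mechanism is deprecated for language tagging; `to_language_tag` is a hypothetical helper.

```python
TAG_OFFSET = 0xE0000          # tag characters mirror ASCII at this offset
LANGUAGE_TAG = "\U000E0001"   # U+E0001 LANGUAGE TAG


def to_language_tag(bcp47: str) -> str:
    """Encode an ASCII language identifier as a (deprecated) tag sequence."""
    if not all(0x20 <= ord(c) <= 0x7E for c in bcp47):
        raise ValueError("only printable ASCII can be mirrored as tag characters")
    return LANGUAGE_TAG + "".join(chr(TAG_OFFSET + ord(c)) for c in bcp47)


sequence = to_language_tag("en-us")
print(" ".join(f"U+{ord(c):05X}" for c in sequence))
```

For the input `en-us` this yields the same six code points listed in the example above.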

These language tag characters would not be displayed themselves. However, they would provide information for text processing or even for the display of other characters. For example, the display of Unihan ideographs might have substituted different glyphs when the language tags indicated Korean than when they indicated Japanese. As another example, the tags might have influenced how the decimal digits 0 through 9 were displayed, depending on the language in which they appeared.

The tag characters U+E0001 LANGUAGE TAG and U+E007F CANCEL TAG were deprecated in Unicode 5.1 (2008) and should not be used for language information.^[7] The characters U+E0020–U+E007E were also deprecated, but were restored with the release of Unicode 8.0 (2015). The change was made "to clear the way for the potential future use of tag characters for a purpose other than to represent language tags".^[8] Unicode states that "the use of tag characters to represent language tags in a plain text stream is still a deprecated mechanism for conveying language information about text".^[8]

Interlinear annotation

Three formatting characters provide support for interlinear annotation (U+FFF9 INTERLINEAR ANNOTATION ANCHOR, U+FFFA INTERLINEAR ANNOTATION SEPARATOR, U+FFFB INTERLINEAR ANNOTATION TERMINATOR). This may be used for providing notes that would typically be displayed between the lines of other text. Unicode considers such annotation to be rich text and recommends using other protocols for such annotation. The W3C Ruby markup recommendation is an example of an alternate protocol supporting more advanced interlinear annotation.
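A minimal sketch of how the three characters bracket an annotation, assuming the layout anchor, annotated text, separator, annotation, terminator. The helpers `annotate` and `strip_annotations` are hypothetical, and nesting is deliberately not handled.

```python
import re

ANCHOR, SEPARATOR, TERMINATOR = "\uFFF9", "\uFFFA", "\uFFFB"

# Captures the annotated text; the annotation itself is discarded.
_ANNOTATION = re.compile("\uFFF9(.*?)\uFFFA.*?\uFFFB", re.DOTALL)


def annotate(base: str, note: str) -> str:
    """Wrap `base` with an interlinear annotation carrying `note`."""
    return f"{ANCHOR}{base}{SEPARATOR}{note}{TERMINATOR}"


def strip_annotations(text: str) -> str:
    """Remove annotations, keeping only the annotated base text."""
    return _ANNOTATION.sub(r"\1", text)


marked = "see " + annotate("\u6771\u4eac", "Tokyo") + " now"
print([f"U+{ord(c):04X}" for c in marked if c in (ANCHOR, SEPARATOR, TERMINATOR)])
print(strip_annotations(marked))
```

Stripping the annotation is a reasonable fallback for plain-text consumers that cannot render notes between lines.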

Bidirectional text control

Unicode supports standard bidirectional text without any special characters. In other words, Unicode-conforming software should display right-to-left characters such as Hebrew letters as right-to-left simply from the properties of those characters. Similarly, Unicode handles the mixture of left-to-right text alongside right-to-left text without any special characters. For example, one can quote Arabic (“بسم الله”) (translated into English as "Bismillah") right alongside English, and the Arabic letters will flow from right to left and the Latin letters from left to right.
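The per-character property that drives this behaviour can be read with `unicodedata.bidirectional`. The sample characters below are chosen for illustration and are not taken from the text.

```python
import unicodedata

samples = {
    "a": "Latin letter",
    "\u05D0": "Hebrew letter alef",
    "\u0628": "Arabic letter beh",
    "1": "European digit",
    " ": "space",
}

for ch, label in samples.items():
    # Bidi_Class: L = left-to-right, R / AL = right-to-left, EN, WS, ...
    print(f"U+{ord(ch):04X} {unicodedata.bidirectional(ch):>2} {label}")
```

Letters carry a strong direction (L, R or AL), while digits and spaces have weak or neutral classes whose direction is resolved from their surroundings.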

However, directionality may not be detected correctly if left-to-right text is quoted at the beginning of a right-to-left paragraph (or vice versa),^[2] and the support for bidirectional text becomes even more complicated when text flowing in opposite directions is embedded hierarchically, for example if an English text quotes an Arabic phrase that in turn quotes an English phrase. Other situations may also complicate this, such as when an author wants the left-to-right characters overridden so that they flow from right-to-left. While these situations are fairly rare, Unicode provides twelve characters to help control these embedded bidirectional text levels up to 125 levels deep:^[9]

U+061C ARABIC LETTER MARK
U+200E LEFT-TO-RIGHT MARK
U+200F RIGHT-TO-LEFT MARK
U+202A LEFT-TO-RIGHT EMBEDDING
U+202B RIGHT-TO-LEFT EMBEDDING
U+202C POP DIRECTIONAL FORMATTING
U+202D LEFT-TO-RIGHT OVERRIDE
U+202E RIGHT-TO-LEFT OVERRIDE
U+2066 LEFT-TO-RIGHT ISOLATE
U+2067 RIGHT-TO-LEFT ISOLATE
U+2068 FIRST STRONG ISOLATE
U+2069 POP DIRECTIONAL ISOLATE
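The twelve characters listed above can be examined, and one common use sketched, as follows. The helper `isolate` is hypothetical; wrapping text in FIRST STRONG ISOLATE and POP DIRECTIONAL ISOLATE is one way to keep an inserted string from disturbing the direction of its surroundings, and the output assumes an interpreter whose Unicode database includes the isolate characters.

```python
import unicodedata

BIDI_CONTROLS = [0x061C, 0x200E, 0x200F, 0x202A, 0x202B, 0x202C,
                 0x202D, 0x202E, 0x2066, 0x2067, 0x2068, 0x2069]

FSI, PDI = "\u2068", "\u2069"  # FIRST STRONG ISOLATE, POP DIRECTIONAL ISOLATE


def isolate(text: str) -> str:
    """Wrap text so its direction is resolved independently of its context."""
    return f"{FSI}{text}{PDI}"


for cp in BIDI_CONTROLS:
    ch = chr(cp)
    print(f"U+{cp:04X} {unicodedata.category(ch)} "
          f"{unicodedata.bidirectional(ch):>3} {unicodedata.name(ch)}")

greeting = "Hello, " + isolate("\u05E9\u05DC\u05D5\u05DD") + "!"
print(len(greeting))
```

All twelve are format characters (category Cf), each with its own bidirectional class.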

Variation selectors

Many characters map to alternate glyphs depending on the context. For example, Arabic and Latin cursive characters substitute different glyphs to connect glyphs together depending on whether the character is the initial character in a word, the final character, a medial character or an isolated character. These types of glyph substitution are easily handled by the context of the character with no other authoring input involved. Authors may also use special-purpose characters such as joiners and non-joiners to force an alternate form of glyph where it would not otherwise appear. Ligatures are similar instances where glyphs may be substituted simply by turning ligatures on or off as a rich text attribute.

However, for other glyph substitutions, the author's intent may need to be encoded with the text and cannot be determined contextually. This is the case with characters/glyphs referred to as gaiji, where different glyphs are used for the same character, either historically or for ideographs in family names. This is one of the gray areas in distinguishing between a glyph and a character: if a family name differs slightly from the ideograph character it derives from, is that a simple glyph variant or a character variant? Unicode 3.2 and 4.0 together introduced 256 variation selectors, combining mark characters that can select from up to 256 possible character/glyph variations for the preceding character.
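The 256 selectors occupy two ranges, U+FE00–U+FE0F and U+E0100–U+E01EF (these ranges are not stated in the text above and are supplied here). A short sketch, with `with_variation` as a hypothetical helper:

```python
import unicodedata

# VARIATION SELECTOR-1..16 and VARIATION SELECTOR-17..256.
VARIATION_SELECTORS = ([chr(cp) for cp in range(0xFE00, 0xFE10)]
                       + [chr(cp) for cp in range(0xE0100, 0xE01F0)])


def with_variation(base: str, n: int) -> str:
    """Append VARIATION SELECTOR-n (1 to 256) to a base character."""
    if not 1 <= n <= 256:
        raise ValueError("variation selectors are numbered 1 to 256")
    return base + VARIATION_SELECTORS[n - 1]


print(len(VARIATION_SELECTORS))
print(unicodedata.name(VARIATION_SELECTORS[0]),
      unicodedata.category(VARIATION_SELECTORS[0]))
print([f"U+{ord(c):04X}" for c in with_variation("\u2764", 16)])
```

Whether a given base-plus-selector pair actually changes the rendering depends on the sequence being defined and on font support; an unsupported selector is simply ignored in display.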

Control pictures

Unicode provides graphic characters for representing the C0 control codes, as well as the space and delete characters.

Control Pictures^[1]^[2] Official Unicode Consortium code chart (PDF)
	0	1	2	3	4	5	6	7	8	9	A	B	C	D	E	F
U+240x	␀	␁	␂	␃	␄	␅	␆	␇	␈	␉	␊	␋	␌	␍	␎	␏
U+241x	␐	␑	␒	␓	␔	␕	␖	␗	␘	␙	␚	␛	␜	␝	␞	␟
U+242x	␠	␡	␢	␣	␤	␥	␦
U+243x
1. As of Unicode version 15.1.
2. Blank cells indicate unassigned code points.
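Because the first 32 pictures sit at a fixed offset from the codes they depict, making control codes visible is a simple mapping. The sketch below uses a hypothetical helper, `to_control_pictures`, and leaves ordinary spaces untouched for readability.

```python
def to_control_pictures(text: str) -> str:
    """Replace C0 control codes and DEL with their Control Pictures symbols."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x20:
            out.append(chr(0x2400 + cp))  # U+2400..U+241F mirror U+0000..U+001F
        elif cp == 0x7F:
            out.append("\u2421")          # SYMBOL FOR DELETE
        else:
            out.append(ch)
    return "".join(out)


print(to_control_pictures("a\tb\r\n\x00\x7f"))
```

This is handy when logging or debugging text in which stray control codes would otherwise be invisible.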

References

[1] "Name Aliases". Unicode Character Database. Unicode Consortium.

[2] Segan, Danilo. "Towards a localised desktop". "For some cases where automatic decision making doesn't work, you can manually add specific direction markers by right-clicking the text field, choosing "Insert Unicode control character" from the menu, and selecting appropriate direction mark. This would allow you, for instance, to start your RTL text with an otherwise LTR word (such as "GNOME")."

[3] ISO/IEC FDIS 8859-1:1998; JTC1/SC2/N2988; WG3/N411. "This set of coded graphic characters may be regarded as a version of an 8-bit code according to ISO/IEC 2022 or ISO/IEC 4873 at level 1. […] The shaded positions in the code table correspond to bit combinations that do not represent graphic characters. Their use is outside the scope of ISO/IEC 8859; it is specified in other International Standards, for example ISO/IEC 6429."

[4] ISO-IR-1.

[5] ISBN 978-1-936213-22-1.

[6] Unicode Technical Committee] and Script Ad Hoc who provided the guidance to the group writing the Symbols for Legacy Computing proposal (and there is a second on the way) that 0x00 through 0x1F in the original teletext set should map to U+0000 through U+001F when converting to Unicode.

[7] doi:10.17487/RFC6082.

[8] "Unicode 8.0.0, Implications for Migration". Unicode Consortium.

[9] "UAX #9: Unicode Bidirectional Algorithm". Unicode Consortium. 2018-05-09.

