Extended Unix Code

EUC-JIS-2004
Alias(es)	EUC-JISx0213
Language(s)	Japanese, Ainu, English, Russian
Standard	JIS X 0213
Classification	Extended ASCII, variable-length encoding, CJK encoding, EUC
Extends	ASCII
Transforms / Encodes	JIS X 0213, JIS X 0201 (Kana)
Preceded by	EUC-JP

EUC-JP
	ISO 646:JP
Transforms / Encodes	JIS X 0208, JIS X 0212, JIS X 0201
Succeeded by	EUC-JISx0213

EUC-CN
Language(s)	Simplified Chinese, English, Russian
Standard	GB 2312 (1980)
Classification	Extended ASCII, variable-length encoding, CJK encoding, EUC
Extends	ASCII
Extensions	748, GBK, GB 18030, x-mac-chinesesimp
Transforms / Encodes	GB 2312
Succeeded by	GBK, GB 18030

Extended Unix Code (EUC) is a multibyte character encoding system used primarily for Japanese, Korean, and simplified Chinese (characters).

The most commonly used EUC codes are

EUC-TW

can take up to four bytes.

Modern applications are more likely to use

EUC-KR

for South Korea.

Encoding structure

The structure of EUC is based on the

space and delete character and 0xA0 and 0xFF were unused, later editions of ISO/IEC 2022 allowed the use of the bytes 0xA0 and 0xFF (or 0x20 and 0x7F) within sets under certain circumstances, allowing the inclusion of 96-character sets. The ranges 0x00–1F and 0x80–9F are used for C0 and C1 control codes.

EUC is a family of 8-bit profiles of ISO/IEC 2022, as opposed to 7-bit profiles such as

yen sign in EUC-JP (see below) and a won sign in EUC-KR.

The other code sets are invoked over GR (i.e. with the most significant bit set). Hence, to get the EUC form of a character, the most significant bit of each coding byte is set (equivalent to adding 128 to each 7-bit coding byte, or adding 160 to each number in the

character string belongs to the ISO 646 code or the extended code. Characters in code sets 2 and 3 are prefixed with the control codes SS2 (0x8E) and SS3 (0x8F) respectively, and invoked over GR. Besides the initial shift code, any byte outside of the range 0xA0–0xFF appearing in a character from code sets 1 through 3 is not a valid EUC code.^[1]
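
The byte-level rules above can be illustrated with a short Python sketch. It is not taken from any EUC specification: the function names and the `widths` parameter are hypothetical, and the default widths assume the EUC-JP layout (other EUC variants use different character widths per code set). Returning stray C1 control bytes as standalone one-byte items is likewise an assumption made for the example.

```python
SS2, SS3 = 0x8E, 0x8F  # single-shift control codes used as prefixes


def to_gr(seven_bit: bytes) -> bytes:
    """Set the most significant bit of each 7-bit coding byte (adds 128)."""
    return bytes(b | 0x80 for b in seven_bit)


def split_packed_euc(data: bytes, widths=(1, 2, 1, 2)):
    """Split packed-format EUC bytes into (code_set, raw_bytes) pairs.

    widths[n] is the assumed number of bytes per character in code set n,
    not counting the SS2/SS3 prefix. The default matches EUC-JP.
    Other C1 control bytes are returned with code_set None (an assumption).
    """
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                        # GL: ISO 646 / ASCII, one byte
            out.append((0, data[i:i + 1]))
            i += 1
            continue
        if b < 0xA0 and b not in (SS2, SS3):
            out.append((None, data[i:i + 1]))   # some other C1 control
            i += 1
            continue
        if b == SS2:
            code_set, start = 2, i + 1
        elif b == SS3:
            code_set, start = 3, i + 1
        else:                               # 0xA0-0xFF: code set 1 over GR
            code_set, start = 1, i
        end = start + widths[code_set]
        body = data[start:end]
        # Apart from the shift byte, every byte must lie in 0xA0-0xFF.
        if len(body) != widths[code_set] or any(x < 0xA0 for x in body):
            raise ValueError(f"invalid EUC sequence at offset {i}")
        out.append((code_set, data[i:end]))
        i = end
    return out
```

For instance, `split_packed_euc(b"A\xa4\xa2\x8e\xb1")` returns one item each for code sets 0, 1 and 2, while a GR lead byte followed by an ASCII byte raises `ValueError`.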

The EUC code itself does not make use of the announcement and designation sequences from ISO 2022.^[1] However, the code specification is equivalent to the following sequence of four ISO 2022 announcement sequences, with meanings breaking down as follows.^[1]

Individual sequence	Hexadecimal	Feature of EUC denoted
`ESC SP C`	`1B 20 43`	ISO-8 (8-bit, G0 in GL, G1 in GR)
`ESC SP Z`	`1B 20 5A`	G2 accessed using SS2
`ESC SP [`	`1B 20 5B`	G3 accessed using SS3
`ESC SP \`	`1B 20 5C`	Single-shifts invoke over GR

Fixed-length format

The ISO-2022-based variable-length encoding described above is sometimes referred to as the EUC packed format, which is the encoding format usually labeled as EUC. However, internal processing of EUC data may make use of a fixed-length transformation format called the EUC complete two-byte format. This represents:^[2]

Code set 0 as two bytes in the range 0x21–0x7E (except that the first may be 0x00).
Code set 1 as two bytes in the range 0xA0–0xFF (except that the first may be 0x80).
Code set 2 as a byte in the range 0x21–0x7E (or 0x00) followed by a byte in the range 0xA0–0xFF.
Code set 3 as a byte in the range 0xA0–0xFF (or 0x80) followed by a byte in the range 0x21–0x7E.

Initial bytes of 0x00 and 0x80 are used in cases where the code set uses only one byte. There is also a four-byte fixed-length format.^[2] These fixed-length encoding formats are suited to internal processing and are not usually encountered in interchange.
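
The four byte patterns listed above differ only in the high bit of each byte, which the following Python sketch makes explicit. The names `HIGH_BITS`, `fixed_width_code_set` and `to_two_byte` are hypothetical, and the conversion is an illustration derived from the ranges given here rather than a reference implementation.

```python
# High bit of (first byte, second byte) for each code set in the
# complete two-byte format, following the list above.
HIGH_BITS = {0: (0, 0), 1: (1, 1), 2: (0, 1), 3: (1, 0)}


def fixed_width_code_set(b1: int, b2: int) -> int:
    """Recover the code set of a two-byte unit from its two high bits."""
    pattern = (b1 >> 7, b2 >> 7)
    for code_set, bits in HIGH_BITS.items():
        if bits == pattern:
            return code_set
    raise ValueError("byte values must be in the range 0-255")


def to_two_byte(code_set: int, payload: bytes) -> bytes:
    """Convert one packed-format character to the two-byte format.

    payload is the character's packed bytes WITHOUT any SS2/SS3 prefix.
    """
    hi1, hi2 = HIGH_BITS[code_set]
    if len(payload) == 1:
        # One-byte characters get an initial byte of 0x00 or 0x80.
        b1 = hi1 << 7
        b2 = (payload[0] & 0x7F) | (hi2 << 7)
    elif len(payload) == 2:
        b1 = (payload[0] & 0x7F) | (hi1 << 7)
        b2 = (payload[1] & 0x7F) | (hi2 << 7)
    else:
        raise ValueError("needs the four-byte fixed-length format")
    return bytes([b1, b2])
```

Under these assumptions, a code set 3 character whose packed payload is `B0 A1` becomes `B0 21`, and an ASCII `A` becomes `00 41`.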

EUC-JP is registered with the IANA in both formats, the packed format as "EUC-JP" or "csEUCPkdFmtJapanese" and the fixed width format as "csEUCFixWidJapanese".^[3] Only the packed format is included in the WHATWG Encoding Standard used by HTML5.^[4]

EUC-CN

EUC-CN

USENET

.

An ASCII character is represented in its usual encoding. A character from GB 2312 is represented by two bytes, both from the range 0xA1–0xFE.
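
As a small illustration, the Python sketch below uses the standard library's `gb2312` codec to encode a string as EUC-CN and then recovers the GB 2312 row and cell numbers by subtracting 0xA0 from each byte. The helper name `euc_cn_positions` is hypothetical, and the sketch rejects anything outside the two byte ranges described above.

```python
def euc_cn_positions(data: bytes):
    """Map EUC-CN bytes to ASCII characters or GB 2312 (row, cell) pairs."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                         # ASCII: usual one-byte encoding
            out.append(chr(b))
            i += 1
        elif (0xA1 <= b <= 0xFE and i + 1 < len(data)
              and 0xA1 <= data[i + 1] <= 0xFE):
            out.append((b - 0xA0, data[i + 1] - 0xA0))
            i += 2
        else:
            raise ValueError(f"not plain EUC-CN at offset {i}")
    return out


encoded = "A中".encode("gb2312")    # Python's codec name for EUC-CN
print(encoded.hex(" "))             # 41 d6 d0
print(euc_cn_positions(encoded))    # ['A', (54, 48)]
```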

748 code

An encoding related to EUC-CN is the "748" code used in the WITS typesetting system developed by Beijing's Founder Technology (now obsoleted by its newer FITS typesetting system). The 748 code contains all of GB 2312, but is not ISO 2022–compliant and therefore not a true EUC code. (It uses an 8-bit lead byte but distinguishes between a second byte with its most significant bit set and one with its most significant bit cleared, and is, therefore, more similar in structure to Big5 and other non–ISO 2022–compliant DBCS encoding systems.) The non-GB2312 portion of the 748 code contains traditional and Hong Kong characters and other glyphs used in newspaper typesetting.

IBM code pages 1380, 1381, 1382 and 1383

IBM code page 1381 (CCSID 1381) comprises the single-byte code page 1115 (CPGID 1115 as CCSID 1115) and the double-byte code page 1380 (CPGID 1380 as CCSID 1380),^[7] which encodes GB 2312 the same way as EUC-CN, but deviates from the EUC structure by extending the lead byte range back to 0x8C, adding 31 IBM-selected characters in 0x8CE0 through 0x8CFE and adding 1880 user-defined characters with lead bytes 0x8D through 0xA0.^[8]

IBM code page 1383 (CCSID 1383) comprises the single-byte code page 367 and the double-byte code page 1382 (CPGID 1382 as CCSID 1382),^[9] which differs by conforming to the EUC structure, adding the 31 IBM-selected characters in 0xFEE0 through 0xFEFE instead, and including only 1360 user-defined characters, interspersed in the positions not used by GB 2312.^[10] The alternative CCSID 5479^[11] is used for the pure EUC-CN code page: it uses CCSID 9574 as its double-byte set, which uses CPGID 1382 but excludes the IBM-selected and user-defined characters.^[12]

GBK and GB 18030

traditional Chinese characters and characters used only in Japanese. It is not, however, a true EUC code, because ASCII bytes may appear as trail bytes (and C1 bytes, not limited to the single shifts, may appear as lead or trail bytes), due to a larger encoding space being required.

Variants of GBK are implemented by Windows code page 936 (the Microsoft Windows code page for simplified Chinese), and by IBM's code page 1386.

The Unicode-based

Unicode transformation formats such as UTF-8

.

Mac OS Chinese Simplified

Other EUC-CN variants deviating from the EUC mechanism include the

trademark sign (™) and the ellipsis (…) respectively.^[6]

This differs in what is regarded as a single-byte character versus the first byte of a two-byte character from both EUC (where, of those, 0xFD and 0xFE are defined as lead bytes) and GBK (where, of those, 0x81, 0x82, 0xFD and 0xFE are defined as lead bytes).

This use of 0xA0, 0xFD, 0xFE and 0xFF matches Apple's Shift_JIS variant.

Besides these changes to the lead byte range, the other distinctive feature of the double-byte portion of Mac OS Chinese Simplified is the inclusion of two extensions to the basic GB 2312-80 set in rows 6 and 8.

GB/T 12345 (the traditional Chinese variant of GB 2312),^[14] and both extensions are included by GB 18030 (the successor to GB 2312).^[15]

EUC-JP

EUC-JP is a variable-length encoding used to represent the elements of three Japanese character set standards, namely JIS X 0208, JIS X 0212, and JIS X 0201. Other names for this encoding include Unixized JIS (or UJIS) and AT&T JIS.^[2] Since September 2022, 0.1% of all web pages have used EUC-JP,^[16] while 3.0% of websites in Japanese use this encoding^[17] (less than either Shift JIS or UTF-8). It is called Code page 954 by IBM.^[18]^[19] Microsoft has two code page numbers for this encoding (51932 and 20932).

This encoding scheme allows the easy mixing of 7-bit ASCII and 8-bit Japanese without the need for the escape characters employed by ISO-2022-JP, which is based on the same character set standards, and without ASCII bytes appearing as trail bytes (unlike Shift JIS).

A related and partially compatible encoding, called EUC-JISx0213 or EUC-JIS-2004, encodes

Shift_JISx0213

, its Shift_JIS-based counterpart).

Compared to EUC-CN or EUC-KR, EUC-JP did not become as widely adopted on PC and Macintosh systems in Japan, which used Shift JIS or its extensions (MacJapanese on classic Mac OS), although it became heavily used by Unix or Unix-like operating systems (except for HP-UX). Therefore, whether Japanese websites use EUC-JP or Shift_JIS often depends on what OS the author uses.

Characters are encoded as follows:

As an EUC/ISO 2022 compliant encoding, the C0 control characters, space, and DEL are represented as in ASCII.

A graphical character from
Yen sign by certain Japanese-locale fonts, e.g. on Microsoft Windows, for compatibility with the lower half of JIS X 0201.^[23]^[24]

A character from JIS X 0208 (code set 1) is represented by two bytes, both in the range 0xA1–0xFE. This differs from the ISO-2022-JP representation by having the high bit set. This code set may also contain vendor extensions in some EUC-JP variants. In EUC-JIS-2004, the first plane of JIS X 0213 is encoded here, which is effectively a superset of standard JIS X 0208.^[20]
A character from the upper half of JIS X 0201 (half-width kana, code set 2) is represented by two bytes, the first being 0x8E, the second being the usual JIS X 0201 representation in the range 0xA1 – 0xDF. This set may contain IBM vendor extensions in some variants.
A character from JIS X 0212 (code set 3) is represented in EUC-JP by three bytes, the first being 0x8F, the following two being in the range 0xA1–0xFE, i.e. with the high bit set. In addition to standard JIS X 0212, code set 3 of some EUC-JP variants may also contain extensions in rows 83 and 84 to represent characters from IBM's Shift JIS extensions which lack standard JIS X 0212 mappings, which may be coded in either of two layouts, one defined by IBM themselves and one defined by the OSF.^[25]^[26] In EUC-JIS-2004, the second plane of JIS X 0213 is encoded here,^[20] which does not collide with the allocated rows in standard JIS X 0212.^[27] Some implementations of EUC-JIS-2004, such as the one used by Python, allow both JIS X 0212 and JIS X 0213 plane 2 characters in this set.^[27]
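
The four cases above can be observed with Python's built-in `euc_jp` codec, as in the following sketch. The helper `euc_jp_code_set` is a hypothetical name, and the sample characters are chosen on the assumption that the codec places them in the standard sets described here (hiragana in JIS X 0208, half-width katakana in JIS X 0201, and U+4E02 in JIS X 0212).

```python
def euc_jp_code_set(ch: str) -> int:
    """Return the EUC code set that Python's euc_jp codec uses for ch."""
    first = ch.encode("euc_jp")[0]
    if first < 0x80:
        return 0            # ASCII range, one byte
    if first == 0x8E:
        return 2            # SS2 + half-width kana byte
    if first == 0x8F:
        return 3            # SS3 + two GR bytes (JIS X 0212)
    return 1                # two GR bytes (JIS X 0208)


# "A", hiragana A, half-width katakana A, and a JIS X 0212 kanji
for ch in ("A", "\u3042", "\uff71", "\u4e02"):
    encoded = ch.encode("euc_jp")
    print(f"U+{ord(ch):04X}", encoded.hex(" "), "code set", euc_jp_code_set(ch))
```

The encoded length follows from the code set: one byte for code set 0, two bytes for code sets 1 and 2, and three bytes for code set 3.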

Vendor extensions to EUC-JP (from, for example, the Open Software Foundation, IBM or NEC) were often allocated within the individual code sets,^[25]^[26] as opposed to using invalid EUC sequences (as in popular extensions of EUC-CN and EUC-KR).

However, some vendor-specific encodings are partially compatible with EUC-JP, due to encoding JIS X 0208 over GR, but do not follow the packed EUC structure. Often, these do not include use of the single shifts from EUC-JP, and are thus not straight extensions of EUC-JP, with the exception of Super DEC Kanji.

DEC Kanji

Digital Equipment Corporation defines two variants of EUC-JP only partly conforming to the EUC packed format, but also bearing some resemblance to the complete two-byte format. The overall format of the "DEC Kanji" encoding mostly corresponds to fixed-length (complete two-byte) EUC; however, code set 0 is not required to be left-padded with null bytes (similarly to the packed format).^[28] JIS X 0208 is, as usual, used for code set 1; code set 2 (half-width katakana) is absent; code set 3 is encoded like the two-byte fixed width format (i.e. without a shift byte and with only the first high bit set), but used for two-byte user defined characters rather than being specified for JIS X 0212.^[28] In the basic "DEC Kanji" encoding, only the first 31 rows of code set 3 are used for user-defined characters: rows 32 through 94 are reserved, similarly to the unused rows in code set 1.^[29]

The "Super DEC Kanji" encoding accepts codes both from the "DEC Kanji" encoding and from packed-format EUC, for a total of five code-sets.^[28] It also allows the entire user defined code set, and the unused rows at the ends of the JIS X 0208 and JIS X 0212 code sets (rows 85–94 and 78–94 respectively), to be used for user-defined characters.^[29]

HP-16

Hewlett-Packard defines an encoding referred to as "HP-16". This accompanies their "HP-15" encoding, which is a variant of Shift JIS. HP-16 encodes JIS X 0208 using the same bytes as in EUC-JP, but does not use the single shift codes (thus omitting code sets 2 and 3), and adds three user-defined regions which do not follow the packed-format EUC structure:^[28]

Lead bytes 0xA1–C2, trail bytes 0x21–7E
Lead bytes 0xC3–E3, trail bytes 0x21–3F
Lead bytes 0xC3–E1, trail bytes 0x40–64
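
A minimal Python sketch of a membership test for these three regions follows. The table and function names are hypothetical, and the ranges are copied directly from the list above rather than from Hewlett-Packard documentation.

```python
# ((lead_low, lead_high), (trail_low, trail_high)) for each user-defined region
HP16_USER_REGIONS = (
    ((0xA1, 0xC2), (0x21, 0x7E)),
    ((0xC3, 0xE3), (0x21, 0x3F)),
    ((0xC3, 0xE1), (0x40, 0x64)),
)


def hp16_is_user_defined(lead: int, trail: int) -> bool:
    """True if the byte pair falls in one of the HP-16 user-defined regions."""
    return any(
        lead_lo <= lead <= lead_hi and trail_lo <= trail <= trail_hi
        for (lead_lo, lead_hi), (trail_lo, trail_hi) in HP16_USER_REGIONS
    )
```

Because every trail byte in these regions has its high bit clear, such pairs can be told apart from JIS X 0208 characters, whose trail bytes lie in 0xA1–0xFE.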

IKIS

The IKIS (Interactive Kanji Information System) encoding used by Data General resembles EUC-JP without single shifts, i.e. with only code sets 0 and 1. Half-width katakana are instead included in row 8 of JIS X 0208 (colliding with the box-drawing characters added to the standard in 1983). JIS X 0208 rows 9 through 12 are used for user-defined characters.^[28]^[29]

Adaptations of EUC-JP for EBCDIC

KEIS (Kanji-processing Extended Information System) is an EBCDIC encoding used by Hitachi,^[29] with double-byte characters (a DBCS-Host encoding) included using shifting sequences, making it a stateful encoding. Specifically, the sequence 0x0A 0x41 switches to single-byte mode and the sequence 0x0A 0x42 switches to double-byte mode.^[b] However, JIS X 0208 characters are encoded using the same byte sequences used to encode them in EUC-JP. This results in duplicate encodings for the ideographic space—0x4040 per the DBCS-Host code structure, and 0xA1A1 as in EUC-JP. This differs from IBM's DBCS-Host encoding for Japanese, the layout of which builds on versions which predate JIS X 0208 altogether. The lead byte range is extended back to 0x59, out of which the lead bytes 0x81–A0 are designated for user-defined characters,^[28] and the remainder are used for corporate-defined characters, including both kanji and non-kanji.^[29]
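
The stateful switching described above can be sketched in Python as follows. The function `split_keis_runs` is hypothetical, it only separates a byte stream into single-byte and double-byte runs without mapping anything to characters, and starting in single-byte mode is an assumption that the text does not confirm.

```python
TO_SINGLE = b"\x0a\x41"   # switches to single-byte mode
TO_DOUBLE = b"\x0a\x42"   # switches to double-byte mode


def split_keis_runs(data: bytes, mode: str = "single"):
    """Split KEIS data into (mode, bytes) runs at the shifting sequences."""
    runs = []
    buf = bytearray()
    i = 0
    while i < len(data):
        pair = data[i:i + 2]
        if pair in (TO_SINGLE, TO_DOUBLE):
            if buf:                          # close the run in progress
                runs.append((mode, bytes(buf)))
                buf.clear()
            mode = "single" if pair == TO_SINGLE else "double"
            i += 2
        elif mode == "double":
            buf += pair                      # consume one two-byte character
            i += 2
        else:
            buf.append(data[i])              # consume one EBCDIC byte
            i += 1
    if buf:
        runs.append((mode, bytes(buf)))
    return runs
```

In this sketch both 0x4040 and 0xA1A1 would simply appear inside a double-byte run; recognising them as the same ideographic space is left to a later mapping step.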

JEF (Japanese-processing Extended Feature)

kuten purposes, although row 162 (lead byte 0x7E) is unused.^[28]^[29] Rows 101 through 148 are used for extended kanji, while rows 149 through 163 are used for extended non-kanji.^[29]

EUC-KR

EUC-KR
Mac OS Korean, IBM-949, Unified Hangul Code (Windows-949)
Transforms / Encodes	KS X 1001
Succeeded by	Unified Hangul Code (web standards)

EUC-KR is a

RFC 1557

dubbed it as EUC-KR.

A character drawn from KS X 1001 (G1, code set 1) is encoded as two bytes in GR (0xA1–0xFE) and a character from KS X 1003 or ASCII (G0, code set 0) takes one byte in GL (0x21–0x7E).
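
The following Python sketch checks whether a byte string uses only these two forms, which distinguishes plain EUC-KR data from data relying on the non-EUC extensions discussed below. The function name `is_plain_euc_kr` is hypothetical, and accepting every byte below 0x80 (including control characters and space) as code set 0 is an assumption.

```python
def is_plain_euc_kr(data: bytes) -> bool:
    """True if data consists only of one-byte GL and two-byte GR characters."""
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                              # code set 0, one byte
            i += 1
        elif (0xA1 <= b <= 0xFE and i + 1 < len(data)
              and 0xA1 <= data[i + 1] <= 0xFE):   # code set 1, two GR bytes
            i += 2
        else:
            return False
    return True


encoded = "한글 OK".encode("euc_kr")
print(encoded.hex(" "))            # c7 d1 b1 db 20 4f 4b
print(is_plain_euc_kr(encoded))    # True
```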

It is usually referred to as Wansung (

Republic of Korea. IBM refers to the double-byte component as Code page 971,^[34] and to EUC-KR with ASCII as Code page 970.^[35]^[36]^[37] It is implemented as Code page 20949 ("Korean Wansung")^[38]^[39] and Code page 51949 ("EUC Korean") by Microsoft.^[38]

As of April 2024, less than 0.08% of all web pages globally use EUC-KR,^[40] but 4.6% of South Korean web pages use it.^[41] Including extensions, it is the most widely used legacy character encoding in Korea on all three major platforms (macOS, other Unix-like OSes, and Windows), but its use has been slowly shifting to UTF-8 as the latter gains popularity, especially on Linux and macOS.

As with most other encodings, UTF-8 is now preferred for new use, solving problems with consistency between platforms and vendors.

Unified Hangul Code

A common extension of EUC-KR is the Unified Hangul Code (통합형 한글 코드, Tonghabhyeong Hangeul Kodeu,^[42] or 통합 완성형, Tonghab Wansunghyung), which is the default Korean codepage on Microsoft Windows. It is given the code page number 949 by Microsoft, and 1261^[43] or 1363^[44] by IBM. IBM's code page 949 is a different, unrelated, EUC-KR extension.

Unified Hangul Code extends EUC-KR by using codes that do not conform to the EUC structure to incorporate additional syllable blocks, completing the coverage of the composed syllable blocks available in

W3C/WHATWG Encoding Standard used by HTML5 incorporates the Unified Hangul Code extensions into its definition of EUC-KR.^[45]

Mac OS Korean (HangulTalk)

Other encodings incorporating EUC-KR as a subset include the Mac OS Korean script (known as Code page 10003 or x-mac-korean),^[13] which was used by HangulTalk (MacOS-KH), the Korean localization of the classic Mac OS. It was developed by Elex Computer (일렉스), who were at the time the authorised distributor of Apple Macintosh computers in South Korea.^[46]^[29]

HangulTalk adds extension characters with lead bytes between 0xA1 and 0xAD, both in unused space within the EUC-KR GR plane (trail bytes 0xA1–0xFE), and using non-EUC codes outside of it (trail bytes 0x41–0xA0). Some of these characters are font-style-independent stylized

private-use character as a modifier for round-trip purposes, or to private-use characters.^[47]

Apple also uses certain single-byte codes outside of the EUC-KR plane for additional characters: 0x80 for a

copyright sign (©), 0x84 for a wide underscore (＿) and 0xFF for an ellipsis (…).^[47] Although none of these additional single-byte codes are within the lead byte range of plain EUC-KR (unlike Apple's extensions to EUC-CN, see above), some are within the lead byte range of Unified Hangul Code (specifically, 0x81, 0x82, 0x83 and 0x84).

EUC-KP

Similarly to KS X 1001, the North Korean KPS 9566 standard is typically used in EUC form; in these contexts, it is sometimes referred to as EUC-KP.^[48] More recent editions of the standard extend the EUC representation with characters using non-EUC two-byte codes, in a similar manner to Unified Hangul Code.^[49]

EUC-TH

Although certain single-byte encodings such as the

TIS-620.^[50]

EUC-TW

EUC-TW is a

hanzi, while UTF-8 is becoming more common.

As an EUC/ISO 2022 encoding, the C0 control characters, ASCII space, and DEL are encoded as in ASCII.

A graphical character from ASCII (G0, code set 0) is encoded in GL as its usual single-byte representation (0x21–0x7E).

A character from CNS 11643 plane 1 (code set 1) is encoded as two bytes in GR (0xA1–0xFE).

A character in planes 1 through 16 of CNS 11643 (code set 2) is encoded as four bytes:

The first byte is always 0x8E (Single Shift 2).
The second byte (0xA1–0xB0) indicates the plane, the number of which is obtained by subtracting 0xA0 from that byte.
The third and fourth bytes are in GR (0xA1–0xFE).

Note that plane 1 of CNS 11643 is therefore encoded twice: once as code set 1 and again as part of code set 2.
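
The layout above can be sketched as a small Python parser that turns EUC-TW bytes into CNS 11643 (plane, row, cell) triples. Python's standard library has no EUC-TW codec, so the sketch works on raw bytes only; the function names are hypothetical, and row and cell numbers are obtained by subtracting 0xA0 from each GR byte.

```python
SS2 = 0x8E  # Single Shift 2, the prefix of every four-byte character


def _is_gr(b: int) -> bool:
    return 0xA1 <= b <= 0xFE


def parse_euc_tw(data: bytes):
    """Return ASCII characters as str and CNS 11643 characters as
    (plane, row, cell) tuples."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                                   # code set 0
            out.append(chr(b))
            i += 1
        elif b == SS2:                                 # code set 2, four bytes
            unit = data[i:i + 4]
            if (len(unit) != 4 or not 0xA1 <= unit[1] <= 0xB0
                    or not (_is_gr(unit[2]) and _is_gr(unit[3]))):
                raise ValueError(f"invalid four-byte sequence at offset {i}")
            out.append((unit[1] - 0xA0, unit[2] - 0xA0, unit[3] - 0xA0))
            i += 4
        elif _is_gr(b) and i + 1 < len(data) and _is_gr(data[i + 1]):
            out.append((1, b - 0xA0, data[i + 1] - 0xA0))   # code set 1
            i += 2
        else:
            raise ValueError(f"invalid EUC-TW byte at offset {i}")
    return out
```

With this sketch, the two-byte form `A4 A1` and the four-byte form `8E A1 A4 A1` both parse to `(1, 4, 1)`, showing the double encoding of plane 1.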

Notes

ISO-2022-CN (with shift codes) and ISO-2022-JP-2 (without shift codes), both of which also support other non-ASCII sets.

[b] These sequences match the hexadecimal forms shown by DEC^[30] and the decimal forms (10 65 and 10 66) listed by Lunde.^[28] Lunde lists the hexadecimal forms for both as 0xA0 0x42, seemingly in error.

References

[1] IBM. "Character Data Representation Architecture (CDRA)". IBM. pp. 157–162.

[2] ISBN 9780596800925.

[3] "Character Sets". IANA.

[4] "4.2. Names and labels". Encoding Standard. WHATWG.

[5] doi:10.17487/RFC1922. RFC 1922. Informational. sec. 2.1: CN-GB.

[6] Apple, Inc.

[7] "S-Ch PC Data mixed (IBM GB) including 1880 UDC, 31 IBM selected characters and 5 SAA SB characters". IBM Globalization: Coded character set identifiers. IBM. Archived from the original on 2016-03-26.

[8] "IBM Simplified Chinese Graphic Character Set" (PDF). IBM. 1993. C-H 3-3220-130 1993-11.

[9] "CCSID 1383: S-Ch EUC G0 set, ASCII G1 set, GB 2312-80 set (1382)". IBM Globalization: Coded character set identifiers. IBM. Archived from the original on 2016-03-28.

[10] "IBM Simplified Chinese Graphic Character Set for Extended UNIX Code (EUC)" (PDF). IBM. 1994. C-H 3-3220-132 1994-06.

[11] "CCSID 5479: S-Ch EUC G0 set, ASCII G1 set, GB 2312-80 set (5478)". IBM Globalization: Coded character set identifiers. IBM. Archived from the original on 2016-03-27.

[12] "CCSID 9574: S-Ch DBCS PC GB 2312-80 set, excluding 31 IBM selected and 1360 UDC. Also used in T-Ch 2022-CN TCP". IBM Globalization: Coded character set identifiers. IBM. Archived from the original on 2016-03-27.

[13] "Encoding.WindowsCodePage Property - .NET Framework (current version)". MSDN. Microsoft.

[14] ISBN 9781565922242.

[15] Standardization Administration of China (SAC) (2005-11-18). GB 18030-2005: Information Technology—Chinese coded character set.

[16] "Historical trends in the usage of character encodings for websites". W3Techs.

[17] "Distribution of Character Encodings among websites that use Japanese". w3techs.com. Retrieved 2023-11-01.

[18] "CCSID 954 information document". Archived from the original on 2016-03-27.

[19] International Components for Unicode (ICU), ibm-954_P101-2007.ucm, 2002-12-03.

[20] "JIS X 0213 Code Mapping Tables". x0213.org.

[21] "Ambiguities in conversion from Japanese EUC to Unicode (Non-Normative)". XML Japanese Profile. W3C.

[22] "EUC-JP decoder". Encoding Standard. WHATWG. "If the byte is an ASCII byte, return a code point whose value is a byte."

[23] "3.1.1 Details of Problems". Problems and Solutions for Unicode and User/Vendor Defined Characters. The Open Group Japan. Archived from the original on 1999-02-03. Retrieved 2019-08-14.

[24] Kaplan, Michael S. (2005-09-17). "When is a backslash not a backslash?".

[25] "4.2 Review Process of Rules for Code Set Conversion Between eucJP-open and UCS". Problems and Solutions for Unicode and User/Vendor Defined Characters. The Open Group Japan. Archived from the original on 1999-02-03. Retrieved 2019-08-14.

[26] ISBN 978-0-596-51447-1.

[27] Chang, Hyeshik (8 December 2021). "Readme for CJKCodecs". cPython. Python Software Foundation.

[28] ISBN 978-0-596-51447-1.

[29] ISBN 978-0-596-51447-1.

[30] "2: Codesets and Codeset Conversion". DIGITAL UNIX Technical Reference for Using Japanese Features. Digital Equipment Corporation, Compaq. [dead link]

[31] "KS X 1001:1992" (PDF).

[32] ISO-IR-149.

[33] ISBN 978-0596514471.

[34] "IBM Globalization - Coded character set identifiers - CCSID 971". Archived from the original on 2014-11-30. Retrieved 2021-09-03.

[35] "CCSID 970". IBM Globalization. IBM. Archived from the original on 2014-12-01.

[36] "ibm-970_P110_P110-2006_U2 (alias euc-kr)". Converter Explorer - ICU Demonstration. International Components for Unicode.

[37] International Components for Unicode (ICU), ibm-970_P110_P110-2006_U2.ucm, 2002-12-03.

[38] "Code Page Identifiers". Windows Dev Center. Microsoft. 7 January 2021.

[39] Julliard, Alexandre. "dump_krwansung_codepage: build Korean Wansung table from the KSX1001 file". make_unicode: Generate code page .c files from ftp.unicode.org descriptions. Wine Project.

[40] "Usage Statistics and Market Share of EUC-KR for Websites, April 2024". w3techs.com. Retrieved 2024-04-09.

[41] "Distribution of Character Encodings among websites that use .kr". w3techs.com. Retrieved 2024-04-09.

[42] "한글 코드에 대하여" (in Korean). W3C. Archived from the original on 2013-05-24. Retrieved 2019-01-07.

[43] In ucnv_lmb.cpp, a file originating from IBM and included in the International Components for Unicode source tree, the lead byte 0x11 is commented as referring to "Korean: ibm-1261" after the definition of ULMBCS_GRP_KO, and is mapped to the "windows-949" ICU codec in the OptGroupByteToCPName array later in the file.

[44] "Coded character set identifiers - CCSID 1363". IBM Globalization. IBM. Archived from the original on 2014-11-29.

[45] "5. Indexes (§ index EUC-KR)". Encoding Standard. WHATWG.

[46] Gil, Hojin. "HangulTalk: De facto standard Hangul environment for Mac". Guide to using Hangul on Macintosh.

[47] Apple (2005-04-05). "Map (external version) from Mac OS Korean encoding to Unicode 3.2 and later". Unicode Consortium.

[48] Kim, Kyongsok (2002-11-30). "3-way cross-reference tables - KS X 1001, KPS 9566, and UCS" (PDF). ISO/IEC JTC 1/SC 2/WG 2 N2564.

[49] UTC L2/18-011.

[50] IBM (2001-05-07). "solaris-eucTH-2.7". icu-data. Unicode Consortium/International Components for Unicode.

External links

EUC-JP codeset table (minus the ASCII and half-width parts)

Code Page Identifiers

GB18030-2000 – The New Chinese National Standard (since updated to GB18030-2022, which is (slightly) incompatible)

The New Generation of Pre-Press Software in China – mentions the 748 code

Description of the EUC-TW code (in Chinese)

Manual page of EUC-JISX0213 in the Perl Encode module

International Register of Coded Character Sets to be Used With Escape Sequences – section 2.4 (p. 14f.) with the coded character sets of China, Japan, South Korea, North Korea and Taiwan (ISO/IEC)

Chinese, Japanese, and Korean character set standards and encoding systems



