Shift JIS
MIME / IANA | Shift_JIS |
---|---|
Alias(es) | MS_Kanji, Windows-31J (web) |
Shift JIS (Shift Japanese Industrial Standards, also SJIS, MIME name Shift_JIS, known as PCK in Solaris contexts)[2][3] is a character encoding for the Japanese language, originally developed by the Japanese company ASCII Corporation[b] in conjunction with Microsoft and standardized as JIS X 0208 Appendix 1.
Shift JIS is based on character sets defined within
As of April 2024, 0.3% of surveyed web pages used Shift JIS (actually decoded as its superset, the Windows-31J encoding), a decline from 1.3% in July 2014.[4] Shift JIS is the second-most declared character encoding for Japanese websites, used by 5.2% of sites in the .jp domain, while UTF-8 is used by 94.8% of Japanese websites.[5][6]
Structure
Shift JIS is an extension of the single-byte encoding JIS X 0201:1997 that uses code points left unassigned in JIS X 0201 to encode the double-byte JIS X 0208:1997 character set. The lead bytes for the double-byte characters are "shifted" around the 63 halfwidth katakana characters in the single-byte range 0xA1 to 0xDF.
The single-byte characters 0x00 to 0x7F match the ASCII encoding, except for a yen sign (U+00A5) at 0x5C and an overline (U+203E) at 0x7E in place of the ASCII character set's backslash and tilde respectively (these deviations from ASCII align with JIS X 0201). The single-byte characters from 0xA1 to 0xDF map to the half-width katakana characters found in JIS X 0201.
For double-byte characters, the first byte is always in the range 0x81 to 0x9F or the range 0xE0 to 0xEF (these ranges are unassigned in JIS X 0201). Each first byte covers two rows of JIS X 0208: for a character in an odd-numbered row, the second byte is in the range 0x40 to 0x9E (but cannot be 0x7F); for a character in an even-numbered row, the second byte is in the range 0x9F to 0xFC.
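The byte ranges described in this section can be sketched as a small structural checker. This is a minimal illustration for standard Shift JIS only (no vendor extensions); the function names are hypothetical, and the check looks only at byte structure, not at whether a given code point is actually assigned.

```python
def is_single_byte(b: int) -> bool:
    """JIS X 0201 single-byte ranges: Roman set and halfwidth katakana."""
    return b <= 0x7F or 0xA1 <= b <= 0xDF


def is_lead_byte(b: int) -> bool:
    """Bytes that can start a double-byte character in standard Shift JIS."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xEF


def is_trail_byte(b: int) -> bool:
    """Bytes that can be the second byte of a double-byte character."""
    return 0x40 <= b <= 0xFC and b != 0x7F


def is_well_formed(data: bytes) -> bool:
    """Walk the buffer one character at a time and check its byte structure."""
    i = 0
    while i < len(data):
        b = data[i]
        if is_single_byte(b):
            i += 1
        elif is_lead_byte(b) and i + 1 < len(data) and is_trail_byte(data[i + 1]):
            i += 2
        else:
            return False  # unused first byte, truncated pair, or bad trail byte
    return True
```

For instance, `is_well_formed("日本語".encode("shift_jis"))` is true, while a lone lead byte such as `b"\x93"` is rejected.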
Shift JIS only guarantees that the first byte of a two-byte character has its high bit set (0x80–0xFF); the second byte can be either high or low. Because second bytes in the range 0x40–0x7E coincide with ASCII characters, reliable detection of Shift JIS is difficult. String searching is also difficult: the same byte value can be either a first or a second byte, so a simple search can match the second byte of one character together with the first byte of the next, which is not a valid Shift JIS character. String-searching algorithms must therefore be tailored to Shift JIS.
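One way to avoid such false matches is to step through the buffer a whole character at a time and compare only at character boundaries. The sketch below assumes the standard lead-byte ranges and uses Python's built-in `shift_jis` codec to produce sample data; `find_char_aligned` is a hypothetical helper, not a library function.

```python
def is_lead_byte(b: int) -> bool:
    """Lead-byte ranges of standard Shift JIS."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xEF


def find_char_aligned(haystack: bytes, needle: bytes) -> int:
    """Return the first offset where needle starts on a character boundary, or -1."""
    i = 0
    while i <= len(haystack) - len(needle):
        if haystack.startswith(needle, i):
            return i
        # Skip one whole character so the next comparison never starts on a trail byte.
        i += 2 if is_lead_byte(haystack[i]) else 1
    return -1


text = "表示".encode("shift_jis")        # 表 is encoded as 0x95 0x5C
naive = text.find(b"\\")                 # byte search hits the trail byte of 表
aligned = find_char_aligned(text, b"\\") # boundary-aware search finds nothing
```

Here the naive byte search reports a backslash inside 表, while the boundary-aware search correctly reports no match.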
Compatibility
Shift JIS is fully
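Double-byte characters in JIS X 0208 need to be transformed in order to be encoded in Shift JIS. For a double-byte JIS X 0208 sequence $j_1 j_2$,[c] the transformation to the corresponding Shift JIS bytes $s_1 s_2$ is:

$$s_1 = \begin{cases} \left\lfloor \frac{j_1 + 1}{2} \right\rfloor + 112 & \text{if } 33 \le j_1 \le 94 \\ \left\lfloor \frac{j_1 + 1}{2} \right\rfloor + 176 & \text{if } 95 \le j_1 \le 126 \end{cases}$$

$$s_2 = \begin{cases} j_2 + 31 + \left\lfloor \frac{j_2}{96} \right\rfloor & \text{if } j_1 \text{ is odd} \\ j_2 + 126 & \text{if } j_1 \text{ is even} \end{cases}$$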
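The usual arithmetic for this transformation can be sketched in Python as follows. The function name `jis_to_sjis` is hypothetical; the constants are the commonly documented ones for standard Shift JIS and should be treated as an illustration rather than a reference implementation.

```python
def jis_to_sjis(j1: int, j2: int) -> tuple[int, int]:
    """Map a JIS X 0208 byte pair (each in 0x21..0x7E) to Shift JIS bytes."""
    if not (0x21 <= j1 <= 0x7E and 0x21 <= j2 <= 0x7E):
        raise ValueError("j1 and j2 must each be in 0x21..0x7E")
    # Two JIS rows share one lead byte; the offset jumps over 0xA0..0xDF.
    s1 = (j1 + 1) // 2 + (112 if j1 <= 94 else 176)
    if j1 % 2:
        # Odd row: trail bytes 0x40..0x9E, skipping 0x7F.
        s2 = j2 + 31 + j2 // 96
    else:
        # Even row: trail bytes 0x9F..0xFC.
        s2 = j2 + 126
    return s1, s2
```

As a check against Python's built-in codec, `bytes(jis_to_sjis(0x30, 0x21)).decode("shift_jis")` gives 亜, the character at JIS X 0208 code 0x3021.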
The competing 8-bit format EUC-JP, which does not support single-byte halfwidth katakana, allows for a cleaner and more direct conversion to and from JIS X 0208 code points, since every byte with the high bit set is part of a double-byte character and every byte in the ASCII range represents a single-byte character.
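For comparison, the EUC-JP form of a JIS X 0208 code point is obtained simply by setting the high bit of each byte. The helper names below are hypothetical, and the sketch covers only the two-byte JIS X 0208 case.

```python
def jis_to_eucjp(j1: int, j2: int) -> bytes:
    """Set the high bit of each JIS X 0208 byte to get the EUC-JP bytes."""
    return bytes([j1 | 0x80, j2 | 0x80])


def eucjp_to_jis(pair: bytes) -> tuple[int, int]:
    """Clear the high bit of each byte to recover the JIS X 0208 bytes."""
    return pair[0] & 0x7F, pair[1] & 0x7F
```

For example, `jis_to_eucjp(0x30, 0x21)` equals `"亜".encode("euc_jp")`, with no row-parity or range cases to handle.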
Usage
HTML written in Shift JIS can still be interpreted to some extent when incorrectly tagged as ASCII, and when the charset tag is at the top of the document itself, since the important start and end of HTML tags and fields (<, >, /, ", &, ;) are encoded as the same bytes as in ASCII, and those bytes do not appear in two-byte sequences.
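The reason these bytes are safe is that all of them lie below 0x40, the lowest value a Shift JIS second byte can take. A short Python check illustrates this; the variable names are hypothetical and the sample string is only for demonstration.

```python
MARKUP = b'<>/"&;'
MIN_TRAIL_BYTE = 0x40  # lowest possible second byte of a double-byte character

# Every markup byte is below the trail-byte range, so none can be part of a pair.
markup_is_safe = all(b < MIN_TRAIL_BYTE for b in MARKUP)

html = "<p>表示</p>".encode("shift_jis")
# Decoding as ASCII damages the Japanese text but leaves the tags readable.
read_as_ascii = html.decode("ascii", errors="replace")
```

The Japanese characters are lost or garbled in `read_as_ascii`, but the opening and closing tags survive unchanged.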
Shift JIS can be used in
Multiple versions
Many different versions of Shift JIS exist. There are two areas for expansion:
Firstly, JIS X 0208 does not fill the whole 94×94 space encoded for it in Shift JIS, so there is room for more characters there; these are really extensions to JIS X 0208 rather than to Shift JIS itself.
Secondly, Shift JIS has more encoding space than is needed for JIS X 0201 and JIS X 0208 (see § Shift JIS byte map below), and this space can be, and is, used for yet more characters (as either single-byte or double-byte characters).
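The size of both areas can be worked out from the byte ranges given in the Structure section. The following Python sketch does the arithmetic; the variable names are hypothetical.

```python
# Lead and trail bytes used for JIS X 0208 in standard Shift JIS.
lead_bytes = list(range(0x81, 0xA0)) + list(range(0xE0, 0xF0))
trail_bytes = [b for b in range(0x40, 0xFD) if b != 0x7F]

# Number of double-byte positions available to JIS X 0208.
double_byte_slots = len(lead_bytes) * len(trail_bytes)

# First-byte values used neither as single-byte characters nor as lead bytes.
single_bytes = set(range(0x00, 0x80)) | set(range(0xA1, 0xE0))
unused_first_bytes = sorted(set(range(0x100)) - single_bytes - set(lead_bytes))
```

The 47 lead bytes and 188 trail bytes give 8836 positions, which is exactly 94×94; the first-byte values left over are 0x80, 0xA0 and 0xF0 to 0xFF.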
Windows-932 / Windows-31J
The most popular extension is
Windows-31J assigns 0x5C to U+005C REVERSE SOLIDUS (the
Windows codepage 932 is the version used in the
MacJapanese
The version of Shift-JIS originating from the
However, certain Mac OS typefaces used other variants. Sai Mincho and Chu Gothic use a "
Shift_JISx0213 and Shift_JIS-2004
Alias(es) | Shift_JISx0213 |
---|---|
Language(s) | Japanese, Ainu, English, Russian |
Standard | JIS X 0213 |
Extends | Shift_JIS (1997), JIS X 0201 (8-bit) |
Transforms / Encodes | JIS X 0213 |
Preceded by | Shift_JIS (1997) |
The newer JIS X 0213 standard defines an extended variant of Shift_JIS referred to as Shift_JISx0213 (in a previous version of the standard) or Shift_JIS-2004. It is a superset of standard Shift JIS.[20]
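In order to represent the allocated rows on both planes of JIS X 0213, Shift_JIS-2004 uses the following method of mapping codepoints.[21]

$$s_1 = \begin{cases} \left\lfloor \frac{k + 257}{2} \right\rfloor & \text{if } m = 1 \text{ and } 1 \le k \le 62 \\ \left\lfloor \frac{k + 385}{2} \right\rfloor & \text{if } m = 1 \text{ and } 63 \le k \le 94 \\ \left\lfloor \frac{k + 479}{2} \right\rfloor - \left\lfloor \frac{k}{8} \right\rfloor \times 3 & \text{if } m = 2 \text{ and } k \in \{1, 3, 4, 5, 8, 12, 13, 14, 15\} \\ \left\lfloor \frac{k + 411}{2} \right\rfloor & \text{if } m = 2 \text{ and } 78 \le k \le 94 \end{cases}$$

$$s_2 = \begin{cases} t + 63 & \text{if } k \text{ is odd and } 1 \le t \le 63 \\ t + 64 & \text{if } k \text{ is odd and } 64 \le t \le 94 \\ t + 158 & \text{if } k \text{ is even} \end{cases}$$

In the above, $s_1 s_2$ is a two-byte Shift_JIS-2004 sequence, $m$ is the plane (面, men, surface) number (1 or 2), $k$ is the row (区, ku, ward) number (1–94) and $t$ is the cell (点, ten, point) number (1–94). The ku and ten numbers are equivalent to $j_1 - 32$ and $j_2 - 32$ respectively, where $j_1 j_2$ is a two-byte JIS sequence referencing a given plane.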
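The plane, row and cell mapping can be sketched in Python as follows. The function name and the `PLANE2_LOW_ROWS` constant are hypothetical; the constants follow the commonly documented Shift_JIS-2004 layout, in which plane 2 allocates only rows 1, 3, 4, 5, 8, 12 to 15 and 78 to 94.

```python
PLANE2_LOW_ROWS = {1, 3, 4, 5, 8, 12, 13, 14, 15}


def men_ku_ten_to_sjis2004(m: int, k: int, t: int) -> tuple[int, int]:
    """Map plane m, row k, cell t of JIS X 0213 to Shift_JIS-2004 bytes."""
    if not 1 <= t <= 94:
        raise ValueError("cell number must be in 1..94")
    if m == 1 and 1 <= k <= 62:
        s1 = (k + 257) // 2            # lead bytes 0x81..0x9F
    elif m == 1 and 63 <= k <= 94:
        s1 = (k + 385) // 2            # lead bytes 0xE0..0xEF
    elif m == 2 and k in PLANE2_LOW_ROWS:
        s1 = (k + 479) // 2 - (k // 8) * 3   # lead bytes 0xF0..0xF4
    elif m == 2 and 78 <= k <= 94:
        s1 = (k + 411) // 2            # lead bytes 0xF4..0xFC
    else:
        raise ValueError("row is not allocated on this plane")
    if k % 2:
        s2 = t + 63 if t <= 63 else t + 64   # 0x40..0x9E, skipping 0x7F
    else:
        s2 = t + 158                         # 0x9F..0xFC
    return s1, s2
```

As a check, plane 1, row 16, cell 1 yields the bytes 0x88 0x9F, which Python's `shift_jis_2004` codec decodes as 亜.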
The same set of characters can be represented by
Some of the additions collide with popular Shift JIS extensions, including Windows codepage 932 which is used in web standards (see above). For example, compare plane 1 row 89 in JIS X 0213 (beginning 硃, 硎, 硏...)[22] to row 89 in the JIS X 0208 variant defined in web standards (beginning 纊, 褜, 鍈...).[23] In addition, some of the characters map to Unicode characters beyond the BMP.
Other variants
The space with lead bytes 0xF5 to 0xF9 (beyond the region used for JIS X 0208) is used by Japanese
Beyond even this, there have been numerous minor variations made on Shift JIS, with individual characters here and there altered. Most of these extensions and variants have no IANA registration, so there is much scope for confusion if the extensions are used.
A further variant must be used when Shift JIS text is embedded in source code strings of C and similar programming languages. Because 0x5C is the beginning of an escape sequence in those languages, this variant doubles the byte 0x5C if it appears as the second byte of a two-byte character, but not if it appears as a single "¥" (ASCII: "\") character. The best way of handling this is a special editor which encodes Shift JIS this way.
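The doubling rule can be sketched as a small preprocessing step. The function below is a hypothetical illustration in Python, assuming standard Shift JIS lead-byte ranges; it is not taken from any particular editor or compiler toolchain.

```python
def is_lead_byte(b: int) -> bool:
    """Lead-byte ranges of standard Shift JIS."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xEF


def escape_for_c_literal(sjis: bytes) -> bytes:
    """Double 0x5C only where it is the second byte of a two-byte character."""
    out = bytearray()
    i = 0
    while i < len(sjis):
        b = sjis[i]
        if is_lead_byte(b) and i + 1 < len(sjis):
            trail = sjis[i + 1]
            out += bytes([b, trail])
            if trail == 0x5C:
                # The compiler's escape processing turns 0x5C 0x5C back into one 0x5C.
                out.append(0x5C)
            i += 2
        else:
            # Single-byte 0x5C is left alone so escapes such as \n keep working.
            out.append(b)
            i += 1
    return bytes(out)
```

With this sketch, the bytes for 表 (0x95 0x5C) become 0x95 0x5C 0x5C, while a single-byte sequence such as `\n` is passed through unchanged.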
Shift JIS byte map
As defined in JIS X 0208:1997
The chart below gives the detailed meaning of each byte in a stream encoded in standard Shift JIS (conforming to JIS X 0208:1997).
[Chart: byte map of standard Shift JIS as defined in JIS X 0208:1997]
With vendor or JIS X 0213 extensions
Some of the bytes which are not used for single-byte codes or initial bytes in JIS X 0208:1997 are used by certain extensions, resulting in the layout detailed in the chart below.
[Chart: byte map of Shift JIS with vendor or JIS X 0213 extensions]
Footnotes
- ^ Not in the strictest sense of the term, as ASCII bytes can appear as trail bytes.
- ^ The ASCII Corporation should not be confused with the ASCII encoding used elsewhere in this article.
- ^ In JIS X 0208, j1 and j2 are each in the range 33 (0x21) to 126 (0x7e) inclusive (i.e., 7-bit character values excluding control characters (0–31 (0x1f) and 127 (0x7f)) and space).
References
- ^ a b c "Character Sets". IANA.
- ^ a b "convutf8.c". OpenSolaris. Line 305. 2008-11-12.
- ^ a b "Additional Japanese iconv Modules". What's New in the Solaris 9 9/04 Operating Environment. Oracle Corporation.
- ^ "Historical trends in the usage of character encodings for websites, April 2024". w3techs.com. Retrieved 2024-04-09.
- ^ "Distribution of Character Encodings among websites that use .jp". w3techs.com. Retrieved 2024-04-09.
- ^ "Distribution of Character Encodings among websites that use Japanese". w3techs.com. Retrieved 2024-04-09.
- ^ a b "Encoding.WindowsCodePage Property – .NET Framework (current version)". MSDN. Microsoft.
- ^ "Code Page Identifiers". Windows Dev Center. Microsoft. 7 January 2021.
- ^ "IBM-943 and IBM-932". IBM Knowledge Center. IBM.
- ^ "CP932.TXT". Unicode Consortium.
- ^ "3.1.1 Details of Problems". Problems and Solutions for Unicode and User/Vendor Defined Characters. The Open Group Japan. Archived from the original on 1999-02-03.
- ^ Kaplan, Michael S. (2005-09-17). "When is a backslash not a backslash?".
- ^ Kaplan, Michael S (2007-05-26). "The PUA outside of Unicode". Sorting it all out.
- ^ "5. Indexes (§ Index jis0208)". Encoding Standard. WHATWG.
- ^ "4.2. Names and labels". Encoding Standard. WHATWG.
- ^ a b c "JAPANESE.TXT: Map (external version) from Mac OS Japanese encoding to Unicode 2.1 and later". Apple Computer, Inc.; Unicode Consortium.
- Adobe Inc.
- ^ "Encoding Variants for MacJapanese". Apple Developer Documentation. Apple.
- ISBN 9780596514471.
- ^ "JIS X 0213 Code Mapping Tables". x0213.org.
- ^ "JIS X 0213の代表的な符号化方式 § Shift_JIS-2004" (in Japanese). Hexadecimal numbers in the source have been converted to decimal for display.
- ISO-IR-233.
- ^ "Index jis0208 visualization". Encoding Standard. WHATWG.
- ^ "Original Emoji from DoCoMo". FileFormat.info.
- ^ "Original Emoji from KDDI". FileFormat.info.
External links
- Shift-JIS Kanji Table – a table of the non-ASCII part of the codeset
- "Windows Codepage 932". Microsoft. May 1, 2005. Archived from the original on 2008-03-07. – Microsoft's definition
- Forms of Shift-JIS in ICU (International Components for Unicode)