Article provided by Wikipedia


Extended Unix Code (EUC) is a multibyte character encoding system used primarily for Japanese, Korean, and simplified Chinese.

The structure of EUC is based on the ISO-2022 standard, which specifies a way to represent character sets containing a maximum of 94 characters, or 8836 (94²) characters, or 830584 (94³) characters, as sequences of 7-bit codes. Only ISO-2022 compliant character sets can have EUC forms. Up to four coded character sets (referred to as G0, G1, G2, and G3 or as code sets 0, 1, 2, and 3) can be represented with the EUC scheme.

G0 is almost always an ISO-646 compliant coded character set such as US-ASCII, ISO 646:KR (KS X 1003) or ISO 646:JP (the lower half of JIS X 0201) that is invoked on GL (i.e. with the most significant bit cleared). One deviation from US-ASCII is that 0x5C (backslash in US-ASCII) is often used to represent a Yen sign in EUC-JP (see below) and a Won sign in EUC-KR.

To get the EUC form of an ISO-2022 character, the most significant bit of each 7-bit byte of the original ISO 2022 codes is set (by adding 128 to each of these original 7-bit codes); this allows software to easily distinguish whether a particular byte in a character string belongs to the ISO-646 code or the ISO-2022 (EUC) code.
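
The bit-setting step described above can be sketched in a few lines of Python. This is a minimal illustration, not a full ISO-2022 converter: the helper name `iso2022_to_euc` is hypothetical, and the sample value 0x24 0x22 is the JIS X 0208 code for the hiragana letter あ.

```python
def iso2022_to_euc(seven_bit: bytes) -> bytes:
    """Set the most significant bit of each 7-bit code (equivalent to adding 128)."""
    if any(b > 0x7F for b in seven_bit):
        raise ValueError("expected 7-bit codes only")
    return bytes(b | 0x80 for b in seven_bit)


jis_code = bytes([0x24, 0x22])      # 7-bit JIS X 0208 code for HIRAGANA LETTER A
euc = iso2022_to_euc(jis_code)

print(euc.hex())                    # a4a2
print(euc.decode("euc_jp"))         # あ
```

Because every byte of the result has its high bit set, a scanner can tell it apart from ISO-646 bytes with a single comparison against 0x80.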

The most commonly used EUC codes are variable-width encodings in which a character belonging to G0 (an ISO-646 compliant coded character set) takes one byte and a character belonging to G1 (a 94x94 coded character set) takes two bytes. The EUC-CN form of GB2312 and EUC-KR are examples of such two-byte EUC codes. EUC-JP includes characters represented by up to three bytes, whereas a single character in EUC-TW can take up to four bytes.

Modern applications are more likely to use UTF-8, which supports all of the characters of the EUC codes and more, and is generally more portable, with fewer vendor deviations and errors.
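
As a rough illustration of the variable-width behaviour, the sketch below uses Python's built-in `gb2312`, `euc_kr` and `euc_jp` codecs to count the bytes each character occupies. The helper `byte_lengths` and the sample strings are illustrative choices, not part of any standard.

```python
def byte_lengths(text: str, codec: str) -> list[int]:
    """Return the encoded length, in bytes, of each character in text."""
    return [len(ch.encode(codec)) for ch in text]


samples = {
    "gb2312": "A中",   # Python's name for the EUC-CN form of GB 2312
    "euc_kr": "A한",
    "euc_jp": "Aあ",
}

for codec, text in samples.items():
    # The G0 (ASCII) character takes one byte, the G1 character two.
    print(codec, byte_lengths(text, codec))
```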

EUC-CN

EUC-CN
MIME / IANA: GB2312
Alias(es): csGB2312
Standard: GB 2312 (1980)
Language(s): Simplified Chinese, English, Russian
Classification: Extended ASCII, Variable-width encoding, CJK encoding, EUC
Extends: US-ASCII
Extensions: 748, GBK, GB18030, x-mac-chinesesimp
Transforms / Encodes: GB 2312
Succeeded by: GBK, GB18030

EUC-CN[1] is the usual way to use the GB2312 standard for simplified Chinese characters. Unlike the case of Japanese, the ISO-2022 form of GB2312 is not normally used, though a variant form called HZ was sometimes used on USENET. An ASCII character is represented in its usual encoding. A character from GB 2312 is represented by two bytes, both in the range 0xA1–0xFE.
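
Since each byte of a two-byte EUC-CN character lies in 0xA1–0xFE, subtracting 0xA0 from each byte recovers a position in the 94x94 grid. The sketch below assumes that row/cell interpretation; the function name `euc_cn_row_cell` is hypothetical.

```python
def euc_cn_row_cell(pair: bytes) -> tuple[int, int]:
    """Map a two-byte EUC-CN sequence to its (row, cell) position, each in 1..94."""
    if len(pair) != 2 or not all(0xA1 <= b <= 0xFE for b in pair):
        raise ValueError("not a two-byte EUC-CN sequence")
    return pair[0] - 0xA0, pair[1] - 0xA0


encoded = "啊".encode("gb2312")      # Python's codec name for EUC-CN
print(encoded.hex())                # b0a1
print(euc_cn_row_cell(encoded))     # (16, 1)
```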

Related encoding systems

748 code

An encoding related to EUC-CN is the "748" code used in the WITS typesetting system developed by Beijing's Founder Technology (now obsoleted by its newer FITS typesetting system). The 748 code contains all of GB2312, but is not ISO 2022–compliant and therefore not a true EUC code. (It uses an 8-bit lead byte but distinguishes between a second byte with its most significant bit set and one with its most significant bit cleared, and is therefore more similar in structure to Big5 and other non–ISO 2022–compliant DBCS encoding systems.) The non-GB2312 portion of the 748 code contains traditional and Hong Kong characters and other glyphs used in newspaper typesetting.

GBK and GB18030

GBK is an extension to GB2312. It defines an extended form of the EUC-CN encoding capable of representing a larger array of CJK characters sourced largely from Unicode 1.1, including traditional Chinese characters and characters used only in Japanese. It is not, however, a true EUC code, because ASCII bytes may appear as trail bytes (and C1 bytes, not limited to the single shifts, may appear as lead or trail bytes), due to a larger encoding space being required.

Variants of GBK are implemented by Windows code page 936 (the Microsoft Windows code page for simplified Chinese), and by IBM's code page 1386.
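
The way GBK departs from the EUC structure can be observed with Python's built-in `gbk` codec, which is itself one implementation of GBK. This is a small sketch: the helper `has_ascii_trail` is hypothetical, and U+4E02 (丂) is used as an example of a character that GBK adds beyond GB 2312.

```python
def has_ascii_trail(ch: str, codec: str = "gbk") -> bool:
    """True if ch encodes to two bytes whose second (trail) byte is below 0x80."""
    data = ch.encode(codec)
    return len(data) == 2 and data[1] < 0x80


# A GB 2312 character keeps both bytes in 0xA1-0xFE, as in EUC-CN.
print(has_ascii_trail("中"))         # False
# A GBK-only character may use a trail byte from the ASCII range.
print(has_ascii_trail("\u4e02"))     # True
```

A byte-oriented search for an ASCII character can therefore match the second half of a GBK character, which cannot happen in a true EUC code.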

The Unicode-based GB18030 character encoding defines an extension of GBK capable of encoding the entirety of Unicode. However, Unicode encoded as GB18030 is a variable-width encoding which may use up to four bytes per character, due to an even larger encoding space being required. Being an extension of GBK, it is a superset of EUC-CN but is not itself a true EUC code. Being a Unicode encoding, its repertoire is identical to that of other Unicode transformation formats such as UTF-8.
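
A brief sketch using Python's built-in `gb18030` codec shows the one-, two- and four-byte forms side by side. The sample characters are arbitrary choices: an ASCII letter, a GB 2312 character, and a character outside the Basic Multilingual Plane.

```python
samples = ["A", "中", "\U0001F600"]

for ch in samples:
    data = ch.encode("gb18030")
    # Round-tripping confirms that the codec covers the character.
    assert data.decode("gb18030") == ch
    print(f"U+{ord(ch):04X} -> {len(data)} byte(s): {data.hex()}")

lengths = [len(ch.encode("gb18030")) for ch in samples]
print(lengths)                       # [1, 2, 4]
```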

Others

Other EUC-CN extensions include the Mac OS Chinese Simplified script[1] (known as Code page 10008 or x-mac-chinesesimp).[2]

EUC-JP

EUC-JP
MIME / IANA: EUC-JP
Alias(es): Unixized JIS (UJIS), csEUCPkdFmtJapanese
Language(s): Japanese, English, Russian
Classification: Extended ISO 646, Variable-width encoding, CJK encoding, EUC
Extends: US-ASCII or ISO 646:JP
Transforms / Encodes: JIS X 0208, JIS X 0212, JIS X 0201
Succeeded by: EUC-JISx0213

EUC-JIS-2004
Alias(es): EUC-JISx0213
Standard: JIS X 0213
Language(s): Japanese, Ainu, English, Russian
Classification: Extended ASCII, Variable-width encoding, CJK encoding, EUC
Extends: US-ASCII
Transforms / Encodes: JIS X 0213, JIS X 0201 (Kana)
Preceded by: EUC-JP

EUC-JP is a variable-width encoding used to represent the elements of three Japanese character set standards, namely JIS X 0208, JIS X 0212, and JIS X 0201. As of January 2016, 0.3% of all web pages used EUC-JP.[3] Other names for this encoding include Unixized JIS (or UJIS) and AT&T JIS.[4] It is called Code page 954 by IBM.

This encoding scheme allows the easy mixing of 7-bit ASCII and 8-bit Japanese without the need for the escape characters employed by ISO-2022-JP, which is based on the same character set standards, and without ASCII bytes appearing as trail bytes (unlike Shift JIS).
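
The contrast with ISO-2022-JP and Shift JIS can be checked with Python's built-in codecs. This is an illustrative sketch; the katakana ソ is chosen because its Shift JIS trail byte is 0x5C, the ASCII backslash.

```python
text = "ソ"

sjis = text.encode("shift_jis")
euc = text.encode("euc_jp")
iso = text.encode("iso2022_jp")

# ISO-2022-JP switches character sets with escape sequences (ESC = 0x1B).
has_escape = 0x1B in iso
# Shift JIS may reuse ASCII byte values as trail bytes.
sjis_has_ascii_byte = any(b < 0x80 for b in sjis)
# In EUC-JP every byte of a Japanese character has its high bit set.
euc_has_ascii_byte = any(b < 0x80 for b in euc)

print(sjis.hex(), sjis_has_ascii_byte)   # 835c True
print(euc.hex(), euc_has_ascii_byte)     # a5bd False
print(has_escape)                        # True
```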

A related and partially compatible encoding, called EUC-JISx0213 or EUC-JIS-2004, encodes JIS X 0201 and JIS X 0213[5] (similarly to Shift_JISx0213, its Shift_JIS-based counterpart).

Compared to EUC-CN or EUC-KR, EUC-JP did not become as widely adopted on PC and Macintosh systems in Japan, which used Shift JIS or its extensions (Windows code page 932 on Microsoft Windows, and MacJapanese on classic Mac OS), although it became heavily used on Unix or Unix-like operating systems (except for HP-UX). Therefore, whether Japanese web sites use EUC-JP or Shift_JIS often depends on what OS the author uses.

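Vendor extensions to EUC-JP were usually allocated within the individual code sets,[6] as opposed to using invalid EUC sequences (as in popular extensions of EUC-CN and EUC-KR). Characters are encoded as follows: a character from US-ASCII or the lower half of JIS X 0201 (code set 0) is represented by one byte in the range 0x00–0x7F; a character from JIS X 0208 (code set 1) by two bytes, both in the range 0xA1–0xFE; a character from the upper half of JIS X 0201 (half-width kana, code set 2) by two bytes, the first being 0x8E and the second in the range 0xA1–0xDF; and a character from JIS X 0212 (code set 3) by three bytes, the first being 0x8F and the following two in the range 0xA1–0xFE.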
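
The four EUC-JP code sets can be observed through Python's built-in `euc_jp` codec. In this sketch the helper `describe` is hypothetical, and the sample characters are assumptions chosen to land in each code set: an ASCII letter, a JIS X 0208 hiragana, a half-width katakana from JIS X 0201, and U+4E02 (丂) from JIS X 0212.

```python
def describe(ch: str) -> tuple[int, int]:
    """Return (encoded length, lead byte) of ch in EUC-JP."""
    data = ch.encode("euc_jp")
    return len(data), data[0]


examples = {
    "A": "code set 0 (ASCII)",
    "あ": "code set 1 (JIS X 0208)",
    "\uff71": "code set 2 (half-width kana, lead byte 0x8E)",
    "\u4e02": "code set 3 (JIS X 0212, lead byte 0x8F)",
}

for ch, label in examples.items():
    length, lead = describe(ch)
    print(f"{label}: {length} byte(s), lead byte 0x{lead:02X}")
```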

EUC-KR

EUC-KR
[Figure: EUC-KR code structure]
MIME / IANA: EUC-KR
Alias(es): Wansung, IBM-970
Standard: KS X 2901 (KS C 5861)
Language(s): Korean, English, Russian
Classification: Extended ISO 646, Variable-width encoding, CJK encoding, EUC
Extends: US-ASCII or ISO 646:KR
Extensions: Mac OS Korean, IBM-949, Unified Hangul Code (Windows-949)
Transforms / Encodes: KS X 1001
Succeeded by: Unified Hangul Code (web standards)

EUC-KR is a variable-width encoding to represent Korean text using two coded character sets, KS X 1001 (formerly KS C 5601)[12][13] and either ISO 646:KR (KS X 1003, formerly KS C 5636) or US-ASCII, depending on variant. KS X 2901 (formerly KS C 5861) stipulates the encoding and RFC 1557 dubbed it EUC-KR. When used with ASCII, it is called Code page 970 by IBM.[14][15]

A character drawn from KS X 1001 (G1, code set 1) is encoded as two bytes in GR (0xA1–0xFE) and a character from KS X 1003 or US-ASCII (G0, code set 0) takes one byte in GL (0x21–0x7E).

As of April 2016, 0.3% of all web pages used EUC-KR.[3] Including extensions, it is the most widely used legacy character encoding in Korea on all three major platforms (Unix-like operating systems, Windows and Mac), but its use has been slowly decreasing as UTF-8 gains popularity, especially on Linux and Mac OS X. It is usually referred to as Wansung (완성) in the Republic of Korea. The default Korean code page for Windows, code page 949 (IBM's 1363), is a proprietary but upward-compatible extension of EUC-KR referred to as Unified Hangul Code (통합 완성형, Tonghab Wansunghyung). Mac Korean, used in classic Mac OS, is also compatible with EUC-KR.
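
The upward compatibility of code page 949 with EUC-KR can be sketched with Python's built-in `euc_kr` and `cp949` codecs. The helper `is_two_byte_gr` is hypothetical, and the syllable U+B620 (똠) is assumed here as an example of a Hangul syllable that lies outside the two-byte GR area of EUC-KR.

```python
def is_two_byte_gr(data: bytes) -> bool:
    """True if data is exactly two bytes, both in the GR range 0xA1-0xFE."""
    return len(data) == 2 and all(0xA1 <= b <= 0xFE for b in data)


common = "한글"
# Text within KS X 1001 has the same bytes in both encodings.
same_bytes = common.encode("euc_kr") == common.encode("cp949")

extra = "\ub620".encode("cp949")     # a syllable added by the extension
print(same_bytes)                    # True
print(is_two_byte_gr("한".encode("cp949")))   # True
print(is_two_byte_gr(extra))         # False: uses bytes outside 0xA1-0xFE
```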

As with most other encodings, UTF-8 is now preferred for new use, solving problems with consistency between platforms and vendors.

EUC-TW

EUC-TW is a variable-width encoding that supports US-ASCII and 16 planes of CNS 11643, each of which is 94x94. It is a rarely used encoding for traditional Chinese characters as used in Taiwan. Big5 is much more common.

Note that plane 1 of CNS 11643 is encoded twice: once as code set 1 and once as part of code set 2.
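
The double encoding of plane 1 can be sketched as follows. Python ships no EUC-TW codec, so this is hand-written arithmetic under stated assumptions: code set 1 is taken to be two bytes in 0xA1–0xFE, and code set 2 is taken to be 0x8E, then a plane byte of 0xA0 plus the plane number, then the same two bytes. The function name `euc_tw_bytes` is hypothetical.

```python
def euc_tw_bytes(plane: int, row: int, cell: int, force_long: bool = False) -> bytes:
    """Build the EUC-TW bytes for a CNS 11643 position (assumed layout, see above)."""
    if not (1 <= plane <= 16 and 1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("plane must be 1..16, row and cell 1..94")
    pair = bytes([0xA0 + row, 0xA0 + cell])
    if plane == 1 and not force_long:
        return pair                              # code set 1: two bytes
    return bytes([0x8E, 0xA0 + plane]) + pair    # code set 2: four bytes


# The same plane 1 position has a short form and a long form.
print(euc_tw_bytes(1, 36, 1).hex())                   # c4a1
print(euc_tw_bytes(1, 36, 1, force_long=True).hex())  # 8ea1c4a1
print(euc_tw_bytes(2, 36, 1).hex())                   # 8ea2c4a1
```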

UTF-8 is becoming more common than EUC-TW, as with most code pages.

Packed versus fixed length form

The encodings described above (using bytes in 0x21–0x7E for code set 0, bytes in 0xA1–0xFE for code set 1, 0x8E followed by bytes in 0xA1–0xFE for code set 2 and 0x8F followed by bytes in 0xA1–0xFE for code set 3) are in a variable-width form referred to as the EUC packed format. This is the form usually labelled as EUC.[4]
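
A scanner for the packed format only needs to look at the lead byte to find the code set. The sketch below is a simplified illustration, not a validating decoder: the function name `split_packed_euc` is hypothetical, and the number of bytes following 0x8E and 0x8F is left as a parameter because it differs between EUC encodings (the defaults assume EUC-JP).

```python
def split_packed_euc(data: bytes, cs2_tail: int = 1, cs3_tail: int = 2):
    """Split packed-format EUC bytes into (code_set, bytes) pairs."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            code_set, size = 0, 1               # code set 0: single GL byte
        elif lead == 0x8E:
            code_set, size = 2, 1 + cs2_tail    # single shift 2
        elif lead == 0x8F:
            code_set, size = 3, 1 + cs3_tail    # single shift 3
        elif 0xA1 <= lead <= 0xFE:
            code_set, size = 1, 2               # code set 1: two GR bytes
        else:
            raise ValueError(f"unexpected lead byte 0x{lead:02X} at offset {i}")
        chunk = data[i:i + size]
        # Every byte after the lead byte must lie in 0xA1-0xFE.
        if len(chunk) != size or not all(0xA1 <= b <= 0xFE for b in chunk[1:]):
            raise ValueError(f"malformed sequence at offset {i}")
        out.append((code_set, chunk))
        i += size
    return out


sample = b"A\xa4\xa2\x8e\xb1\x8f\xb0\xa1"
for code_set, chunk in split_packed_euc(sample):
    print(code_set, chunk.hex())
```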

Internal processing may make use of a fixed-length alternative form called the EUC complete two-byte format, in which every character occupies two bytes.[4]

Initial bytes of 0x00 and 0x80 are used in cases where the code set uses only one byte. There is also a four-byte fixed-length format.[4] These fixed-length forms are suited to internal processing and are not usually encountered in interchange.

EUC-JP is registered with the IANA in both formats, the packed format as "EUC-JP" or "csEUCPkdFmtJapanese" and the fixed width format as "csEUCFixWidJapanese".[16] Only the packed format is included in the WHATWG Encoding Standard used by HTML5.[17]

References

  1. "Map (external version) from Mac OS Chinese Simplified encoding to Unicode 3.0 and later". Apple, Inc.
  2. "Encoding.WindowsCodePage Property - .NET Framework (current version)". MSDN. Microsoft.
  3. "Historical trends in the usage of character encodings for websites". W3Techs.
  4. Lunde, Ken (2008). CJKV Information Processing: Chinese, Japanese, Korean, and Vietnamese Computing. O'Reilly. pp. 242–244. ISBN 9780596800925.
  5. "JIS X 0213 Code Mapping Tables". x0213.org.
  6. "4.2 Review Process of Rules for Code Set Conversion Between eucJP-open and UCS". Problems and Solutions for Unicode and User/Vendor Defined Characters. The Open Group Japan. Archived from the original on 1999-02-03.
  7. "Ambiguities in conversion from Japanese EUC to Unicode (Non-Normative)". XML Japanese Profile. W3C.
  8. "EUC-JP decoder". Encoding Standard. WHATWG. "If byte is an ASCII byte, return a code point whose value is byte."
  9. "3.1.1 Details of Problems". Problems and Solutions for Unicode and User/Vendor Defined Characters. The Open Group Japan. Archived from the original on 1999-02-03.
  10. Kaplan, Michael S. (2005-09-17). "When is a backslash not a backslash?".
  11. Chang, Hyeshik. "Readme for CJKCodecs". CPython. Python Software Foundation.
  12. "KS X 1001:1992" (PDF).
  13. "KS C 5601:1987" (PDF). 1988-10-01.
  14. "CCSID 970". IBM Globalization. IBM.
  15. "ibm-970_P110_P110-2006_U2 (alias euc-kr)". Converter Explorer - ICU Demonstration. International Components for Unicode.
  16. "Character Sets". IANA.
  17. "4.2. Names and labels". Encoding Standard. WHATWG.
