Extended Unix Code

Extended Unix Code (EUC) is a variable-length character encoding system used primarily for Japanese, Korean, and simplified Chinese.

The structure of EUC is based on the ISO 2022 standard, which specifies a way to represent character sets containing up to 94, 8836 (94×94), or 830584 (94×94×94) characters as sequences of 7-bit codes. Only ISO 2022–compliant character sets can have EUC forms. Up to four coded character sets (referred to as G0, G1, G2, and G3) can be represented with the EUC scheme. G0 is almost always an ISO 646–compliant coded character set (e.g. US-ASCII or KS X 1003/ISO 646:KR in EUC-KR, and US-ASCII or the lower half of JIS X 0201 in EUC-JP) that is invoked on GL (i.e. with the most significant bit cleared).

To get the EUC form of an ISO 2022 character, the most significant bit of each 7-bit byte of the original ISO 2022 codes is set (by adding 128 to each of these original 7-bit codes); this allows software to easily distinguish whether a particular byte in a character string belongs to the ISO 646 code or the ISO 2022 (EUC) code.
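The transformation described above can be sketched in a few lines of Python. The function names `iso2022_to_euc` and `is_g0_byte` are hypothetical and chosen only for illustration. The sample value 0x24 0x22 is the 7-bit JIS X 0208 code for the hiragana あ, whose EUC-JP form is 0xA4 0xA2.

```python
def iso2022_to_euc(code: bytes) -> bytes:
    """Set the most significant bit of each 7-bit byte (i.e. add 128)."""
    if any(b < 0x21 or b > 0x7E for b in code):
        raise ValueError("expected 7-bit bytes in the range 0x21-0x7E")
    return bytes(b | 0x80 for b in code)


def is_g0_byte(b: int) -> bool:
    """A byte with the most significant bit cleared belongs to the ISO 646 (G0) set."""
    return b < 0x80


euc = iso2022_to_euc(b"\x24\x22")
print(euc.hex())                             # a4a2
print(euc.decode("euc_jp") == "\u3042")      # True (U+3042 is the hiragana "a")
print(is_g0_byte(0x41), is_g0_byte(euc[0]))  # True False
```

Because every byte of a multi-byte character has its most significant bit set, a single bit test is enough to tell the two kinds of byte apart.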

The most commonly used EUC codes are variable-length encodings in which a character belonging to G0 (an ISO 646–compliant coded character set) takes one byte and a character belonging to G1 (a 94×94 coded character set) takes two bytes. The EUC-CN form of GB2312 and EUC-KR are examples of such two-byte EUC codes. EUC-JP includes characters represented by up to three bytes, whereas a single character in EUC-TW can take up to four bytes.
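As a rough illustration of the one-byte and two-byte cases, the sketch below uses Python's standard `gb2312` codec (which produces the EUC-CN form) and its `euc_kr` codec. The helper name `byte_lengths` and the sample strings are assumptions made for this example.

```python
def byte_lengths(text: str, codec: str) -> list:
    """Return the encoded length in bytes of each character in text."""
    return [len(ch.encode(codec)) for ch in text]


samples = {
    "gb2312": "A\u4e2d",  # 'A' followed by the Chinese character U+4E2D
    "euc_kr": "A\ud55c",  # 'A' followed by the hangul syllable U+D55C
}

for codec, text in samples.items():
    print(codec, text.encode(codec).hex(), byte_lengths(text, codec))
# gb2312 41d6d0 [1, 2]
# euc_kr 41c7d1 [1, 2]
```

In both cases the ASCII letter stays a single byte below 0x80, while the G1 character becomes two bytes that both have the most significant bit set.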

EUC-CN

EUC-CN is the usual way to use the GB2312 standard for simplified Chinese characters. Unlike the case of Japanese, the ISO-2022 form of GB2312 is not normally used, though a variant form called HZ was sometimes used on USENET.

EUC-CN is also the usual way to use the GB18030 standard, which includes traditional characters. Because GB18030 is backward-compatible with GB2312, and because modern Unicode-based operating systems allow users to (sometimes accidentally) use characters outside GB2312, nowadays it is more correct to say that EUC-CN is a form of GB18030. However, GB18030 in EUC-CN form is a variable-length code because GB18030 contains more than 8836 (94×94) characters.

Related encoding systems

An encoding related to EUC-CN is the "748" code used in the WITS typesetting system developed by Beijing's Founder Technology (now obsoleted by its newer FITS typesetting system). The 748 code contains all of GB2312, but is not ISO 2022–compliant and therefore not a true EUC code. (It uses an 8-bit lead byte but distinguishes between a second byte with its most significant bit set and one with its most significant bit cleared, and is therefore more similar in structure to Big5 and other non–ISO 2022–compliant DBCS encoding systems.) The non-GB2312 portion of the 748 code contains traditional and Hong Kong characters and other glyphs used in newspaper typesetting.

EUC-JP

EUC-JP is a variable-length character encoding used to encode characters from three coded character sets: JIS X 0208, JIS X 0212, and JIS X 0201 (the lower half of which is identical to US-ASCII except for the yen sign in place of the backslash and the overline in place of the tilde). A character from JIS X 0208 is represented as two bytes, each in the range 0xA1-0xFE (G1 invoked on GR). A character from JIS X 0212 takes three bytes: the first is always 0x8F and the remaining two are in the range 0xA1-0xFE (G3 invoked on GR). A character from the upper half of JIS X 0201 (half-width katakana) is represented as two bytes, the first of which is always 0x8E (G2 invoked on GR), while a character from the lower half of JIS X 0201 or US-ASCII is represented as a single byte between 0x21 and 0x7E (G0 invoked on GL). (ISO-2022-JP is another way to use the same set of standards.)
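The byte ranges above are enough to split an EUC-JP byte string into characters by looking only at each lead byte. The following is a minimal sketch, not a full validator: the function name `split_euc_jp` is hypothetical, and trailing bytes are only checked against the broad range 0xA1-0xFE. The sample contains an ASCII letter, a JIS X 0208 character (0xA4 0xA2), a half-width katakana (0x8E 0xB1), and a three-byte JIS X 0212 sequence (0x8F 0xB0 0xA1).

```python
def split_euc_jp(data: bytes) -> list:
    """Split EUC-JP bytes into per-character byte sequences."""
    chars = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead == 0x8F:             # G3: JIS X 0212, three bytes
            length = 3
        elif lead == 0x8E:           # G2: half-width katakana, two bytes
            length = 2
        elif 0xA1 <= lead <= 0xFE:   # G1: JIS X 0208, two bytes
            length = 2
        else:                        # G0 (and control characters): one byte
            length = 1
        chunk = data[i:i + length]
        # Every byte after the lead byte must lie in GR.
        if len(chunk) != length or any(not 0xA1 <= b <= 0xFE for b in chunk[1:]):
            raise ValueError(f"malformed sequence at offset {i}")
        chars.append(chunk)
        i += length
    return chars


sample = b"A\xa4\xa2\x8e\xb1\x8f\xb0\xa1"
print([c.hex() for c in split_euc_jp(sample)])
# ['41', 'a4a2', '8eb1', '8fb0a1']
```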

This encoding scheme allows the easy mixing of 7-bit ASCII and 8-bit Japanese without the need for the escape sequences employed by ISO-2022-JP.

In Japan, the encoding is heavily used on Unix and Unix-like operating systems (except HP-UX), while Shift_JIS or its extensions (Windows code page 932 and MacJapanese) are used on other platforms. Therefore, whether a Japanese web site uses EUC-JP or Shift_JIS often depends on the operating system that hosts it.

EUC-JISX0213 is similar to EUC-JP, except that the two planes of JIS X 0213 take the place of JIS X 0208 and JIS X 0212. Shift_JISX0213 has a similar relationship to Shift_JIS.

EUC-KR

EUC-KR is a variable-length character encoding that represents Korean text using two coded character sets: KS X 1001 (formerly KS C 5601) and KS X 1003 (formerly KS C 5636)/ISO 646:KR/US-ASCII. KS X 2901 (formerly KS C 5861) stipulates the encoding, and RFC 1557 dubbed it EUC-KR. A character drawn from KS X 1001 is encoded as two bytes in GR (0xA1-0xFE), and a character from KS X 1003/US-ASCII takes one byte in GL (0x21-0x7E). It is the most widely used legacy character encoding in Korea on all three major platforms (Unix-like operating systems, Windows, and Mac), but its use has been decreasing, albeit very slowly, as UTF-8 gains popularity, especially on Linux and Mac OS X. It is usually referred to as Wansung (완성) in South Korea. The default Korean code page for Windows (code page 949) is a proprietary but upward-compatible extension of EUC-KR referred to as Unified Hangul Code (통합 완성형, Tonghab Wansunghyeong). Mac Korean, used in classic Mac OS, is also compatible with EUC-KR.

EUC-TW

EUC-TW is a rarely used encoding for traditional Chinese characters as used in Taiwan. Big5 is much more common.

Unlike normal EUC codes (the EUC-CN form of GB2312, plus EUC-JP and EUC-KR), EUC-TW is a 4-byte code; this difference arises from the fact that CNS11643 contains more than 8836 (i.e., 94×94) characters. However, instead of using a three-byte code as one might expect, an encoded character in EUC-TW has the following peculiar structure:

Lead byte: 0x8E
Second byte: 0xA0 + plane number
Third and fourth bytes: Two-byte EUC form of the code point in the specified plane
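
The four-byte structure listed above can be assembled as in the following sketch. The function name `euc_tw_four_byte` is hypothetical, the input is taken to be the 7-bit (ISO 2022) code of the character within its plane, and the limit of planes 1 to 16 is an assumption of this example rather than something stated here.

```python
def euc_tw_four_byte(plane: int, code: bytes) -> bytes:
    """Build the four-byte EUC-TW form from a plane number and a 7-bit code."""
    if not 1 <= plane <= 16:  # assumed plane range, see the note above
        raise ValueError("plane number out of range")
    if len(code) != 2 or any(not 0x21 <= b <= 0x7E for b in code):
        raise ValueError("expected two 7-bit bytes in the range 0x21-0x7E")
    return bytes([
        0x8E,            # lead byte
        0xA0 + plane,    # second byte selects the plane
        code[0] | 0x80,  # third byte: first byte of the two-byte EUC form
        code[1] | 0x80,  # fourth byte: second byte of the two-byte EUC form
    ])


print(euc_tw_four_byte(2, b"\x21\x21").hex())  # 8ea2a1a1
```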

External links

EUC-JP: a table of the non-ASCII part of the code set.
http://developers.sun.com/dev/gadc/technicalpublications/articles/gb18030.html — Information about GB18030
http://www.jagat.or.jp/asia/report/China3.htm mentions the 748 code
http://www.cns11643.gov.tw/web/word.jsp describes the EUC-TW code (in Chinese)
The manual page of EUC-JISX0213 in Perl Encode module
EUC-JP code range chart at Opengroup Japan