Universal Coded Character Set

The Universal Coded Character Set (often called the Universal Character Set, abbreviated UCS) is a character set defined in ISO 10646. UCS is kept synchronized, character by character, with Unicode. It has over a million code points, but most characters in common use fall within the first 65,536 (the Basic Multilingual Plane, or BMP); the remainder are used for such purposes as representing ancient Egyptian hieroglyphs or rare Chinese characters. Some code points, even in the BMP, are not assigned to characters.
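
As a rough illustration of how code points divide into the BMP and the planes beyond it, the following Python sketch computes the plane of a character. The names plane_of and in_bmp are hypothetical helpers introduced here, and the upper limit of 10FFFF hex is the one UCS shares with Unicode.

```python
BMP_LIMIT = 0x10000        # the BMP holds code points 0000..FFFF hex
MAX_CODE_POINT = 0x10FFFF  # highest code point shared with Unicode


def plane_of(ch: str) -> int:
    """Return the plane number of a single character (0 means the BMP)."""
    return ord(ch) >> 16


def in_bmp(ch: str) -> bool:
    """Return True if the character lies in the Basic Multilingual Plane."""
    return ord(ch) < BMP_LIMIT


print(MAX_CODE_POINT + 1)      # 1114112 code points in total
print(plane_of("A"))           # 0
print(plane_of("\U00013000"))  # 1 (an Egyptian hieroglyph)
```

Each plane holds 65,536 code points, so shifting the code point right by 16 bits yields the plane number directly.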

ISO 10646 defines several encoding schemes for the Universal Character Set. The simplest is UCS-2, which uses two bytes for each character. Every code point in the BMP that represents a character can be encoded in UCS-2 as a single two-byte value. Code points outside the BMP cannot be represented with UCS-2.
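
The two-byte rule can be sketched in Python as follows. Python has no built-in UCS-2 codec, so encode_ucs2_be is a hypothetical helper written for this illustration, and big-endian byte order is an assumption made here, not something the text above prescribes.

```python
import struct


def encode_ucs2_be(text: str) -> bytes:
    """Encode text as big-endian UCS-2, two bytes per character.

    Hypothetical helper: rejects anything UCS-2 cannot represent.
    """
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            raise ValueError(f"U+{cp:04X} is outside the BMP; no UCS-2 form")
        if 0xD800 <= cp <= 0xDFFF:
            # This range is not assigned to characters.
            raise ValueError(f"U+{cp:04X} does not represent a character")
        out += struct.pack(">H", cp)  # one unsigned 16-bit value
    return bytes(out)


print(encode_ucs2_be("A\u20ac").hex(" "))  # 00 41 20 ac
```

Passing a character beyond the BMP, such as the one at code point hex 10000, raises ValueError in this sketch.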

Another encoding is UCS-4, which uses four bytes for each character. This can represent every code point in the character set, including those outside the BMP. Like UCS-2, it encodes every character at the same length, which makes it simpler to manipulate than a variable-length encoding such as UTF-16; but it requires twice as much storage as UCS-2.
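
A minimal Python sketch of the fixed-width property follows. It assumes that Python's "utf-32-be" codec produces the same bytes as big-endian UCS-4 for valid code points, and that "utf-16-be" matches UCS-2 for BMP-only text; char_at is a hypothetical helper.

```python
def char_at(buf: bytes, index: int) -> str:
    """Return the character at a given index of a big-endian UCS-4 buffer.

    Because every character occupies exactly four bytes, the offset is
    simply 4 * index, with no scanning required.
    """
    start = 4 * index
    return chr(int.from_bytes(buf[start:start + 4], "big"))


text = "A\u20ac\U00010000"         # two BMP characters and one beyond the BMP
ucs4 = text.encode("utf-32-be")    # assumed equivalent to UCS-4 here
print(len(ucs4))                   # 12
print(hex(ord(char_at(ucs4, 2))))  # 0x10000

# Storage cost for BMP-only text: two bytes versus four bytes per character.
bmp_text = "A\u20ac"
print(len(bmp_text.encode("utf-16-be")), len(bmp_text.encode("utf-32-be")))  # 4 8
```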

UTF-16, a widely used encoding, is similar, but not identical, to UCS-2. UCS-2 precludes the use of code values in the range D800-DFFF, since the corresponding code points are not assigned to characters, while UTF-16 uses pairs of code values in that range to represent characters beyond the BMP. For example, the sequence D800 DC00 is illegal in UCS-2, while the same sequence in UTF-16 is a surrogate pair representing the character at code point hex 10000.
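
The surrogate pair arithmetic can be sketched in Python as follows. The function names are hypothetical; the constants are those of UTF-16, where the first code value of a pair lies in D800-DBFF and the second in DC00-DFFF.

```python
def decode_surrogate_pair(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into a single code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("first code value must be in D800-DBFF")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("second code value must be in DC00-DFFF")
    # Each surrogate carries 10 bits of the offset above hex 10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def encode_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a code point beyond the BMP into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points beyond the BMP use surrogates")
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


print(hex(decode_surrogate_pair(0xD800, 0xDC00)))      # 0x10000
print([hex(v) for v in encode_surrogate_pair(0x10000)])  # ['0xd800', '0xdc00']
```

The result agrees with Python's own "utf-16-be" codec, which encodes the character at code point hex 10000 as the bytes D8 00 DC 00.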

Articles about Unicode occasionally refer to UCS-2 as "UCS-16". This is incorrect: UCS-2 is the 16-bit character encoding and UCS-4 is the 32-bit character encoding; there is no UCS-16.

See ISO 10646, Unicode, UTF-8