UTF-8


UTF-8 (8-bit Unicode Transformation Format) is a lossless, variable-length character encoding. It uses sequences of bytes to represent the characters of the Unicode standard, which covers the alphabets of many of the world's languages. UTF-8 is especially useful for transmission over 8-bit mail systems.

Each character is encoded with a sequence of one to four bytes, depending on its Unicode value. For example, only one UTF-8 byte is needed to encode the 128 US-ASCII characters, which occupy the Unicode range U+0000 to U+007F.

While it may seem inefficient to represent some Unicode characters with as many as four bytes, this design allows legacy systems built around ASCII to transmit and process UTF-8 text, since UTF-8 is a superset of ASCII.

The IETF requires all Internet protocols to identify the encoding used for character data, and to support UTF-8 as at least one of those encodings.

Description

UTF-8 is currently standardized as RFC 3629 (UTF-8, a transformation format of ISO 10646).

In summary, the bits of a Unicode character's code point are divided into several groups, which are then distributed among the lower bit positions of the UTF-8 bytes.

Characters with values below 128 (decimal) are encoded with a single byte that contains their value: these correspond exactly to the 128 7-bit ASCII characters.

In other cases, up to four bytes are required. The uppermost bit of each of these bytes is 1, to prevent confusion with 7-bit ASCII characters, particularly with characters below 32 (decimal), traditionally called control characters (e.g., carriage return).

Code range (hexadecimal)   UTF-16 (binary)       UTF-8 (binary)                        Notes
000000 - 00007F            00000000 0xxxxxxx     0xxxxxxx                              ASCII equivalence range; byte begins with 0
000080 - 0007FF            00000xxx xxxxxxxx     110xxxxx 10xxxxxx                     first byte begins with 110, the following byte begins with 10
000800 - 00FFFF            xxxxxxxx xxxxxxxx     1110xxxx 10xxxxxx 10xxxxxx            first byte begins with 1110, the following bytes begin with 10
010000 - 10FFFF            110110xx xxxxxxxx     11110xxx 10xxxxxx 10xxxxxx 10xxxxxx   UTF-16 requires surrogates; an offset of 0x10000 is
                           110111xx xxxxxxxx                                           subtracted, so the bit pattern is not identical to UTF-8

For example, the character alef (א), which is Unicode 0x05D0, is encoded into UTF-8 in this way:

  • It falls into the range of 0x0080 to 0x07FF. The table shows it will be encoded using 2 bytes, 110xxxxx 10xxxxxx.
  • Hexadecimal 0x05D0 is equivalent to binary 101-1101-0000 (11 significant bits).
  • These 11 bits are placed, in order, into the positions marked "x": 11010111 10010000.
  • The final result is the two bytes 0xD7 0x90. That is the letter alef in UTF-8.
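
The same procedure can be written as a short routine. The following C sketch (the function name utf8_encode is ours, not from any standard library) packs a code point into one to four bytes following the table above; it is an illustration, not a full validator (for instance, it does not reject UTF-16 surrogate values):

  #include <stddef.h>

  /* Encode one Unicode code point (0x0000-0x10FFFF) as UTF-8.
   * Writes 1-4 bytes into out and returns the number written,
   * or 0 if cp lies outside the range allowed by RFC 3629. */
  size_t utf8_encode(unsigned long cp, unsigned char out[4])
  {
      if (cp < 0x80) {                      /* 0xxxxxxx */
          out[0] = (unsigned char)cp;
          return 1;
      } else if (cp < 0x800) {              /* 110xxxxx 10xxxxxx */
          out[0] = 0xC0 | (unsigned char)(cp >> 6);
          out[1] = 0x80 | (unsigned char)(cp & 0x3F);
          return 2;
      } else if (cp < 0x10000) {            /* 1110xxxx 10xxxxxx 10xxxxxx */
          out[0] = 0xE0 | (unsigned char)(cp >> 12);
          out[1] = 0x80 | (unsigned char)((cp >> 6) & 0x3F);
          out[2] = 0x80 | (unsigned char)(cp & 0x3F);
          return 3;
      } else if (cp < 0x110000) {           /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
          out[0] = 0xF0 | (unsigned char)(cp >> 18);
          out[1] = 0x80 | (unsigned char)((cp >> 12) & 0x3F);
          out[2] = 0x80 | (unsigned char)((cp >> 6) & 0x3F);
          out[3] = 0x80 | (unsigned char)(cp & 0x3F);
          return 4;
      }
      return 0;
  }

Called with 0x05D0, this produces exactly the bytes 0xD7 0x90 from the worked example.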

So the first 128 characters need one byte. The next 1920 characters need two bytes to encode. This includes Greek, Cyrillic, Coptic, Armenian, Hebrew, and Arabic characters. The rest of the UCS-2 characters use three bytes, and additional characters are encoded in 4 bytes. (An earlier UTF-8 specification allowed even higher code points to be represented, using 5 or 6 bytes, but this is no longer supported.)

In fact, UTF-8 as originally designed could use sequences of up to six bytes and cover the whole range 0x00-0x7FFFFFFF (31 bits), but in November 2003 RFC 3629 restricted UTF-8 to the area covered by the formal Unicode definition, 0x00-0x10FFFF. Before this restriction, only the bytes 0xFE and 0xFF never occurred in UTF-8-encoded text. After it was introduced, the number of byte values unused in a UTF-8 stream rose to 13: 0xC0, 0xC1, and 0xF5-0xFF. Even though the new definition limits the available encoding area severely, it helps eliminate the problem of overlong sequences (different ways of encoding the same character, which can be a security risk): an overlong two-byte sequence would begin with the unused byte 0xC0 or 0xC1 and so can never be valid, and RFC 3629 prohibits the remaining overlong forms as well.
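
That fixed set of never-valid bytes makes one class of checks trivial, as in this minimal C sketch (the helper name is ours). Note that rejecting these bytes is necessary but not sufficient: a full validator must also check continuation-byte placement and the remaining overlong and surrogate cases.

  /* True if b can never occur anywhere in well-formed UTF-8 (RFC 3629):
   * 0xC0 and 0xC1 could only start overlong two-byte sequences, and
   * 0xF5-0xFF could only start sequences beyond U+10FFFF. */
  static int utf8_forbidden_byte(unsigned char b)
  {
      return b == 0xC0 || b == 0xC1 || b >= 0xF5;
  }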

Advantages

  • Some Unicode symbols (including the Latin alphabet) in UTF-8 will take as little as 1 byte, although others may take up to 4. So UTF-8 will generally save space compared to UTF-16 (the simplest encoding of Unicode) in text where 7-bit ASCII characters are common.
  • Most existing computer software (including operating systems) was not written with Unicode in mind, and using Unicode with it might create compatibility issues. For example, the C standard library marks the end of a string with a character that has the one-byte code 0x00 (hexadecimal). In UTF-16-encoded Unicode the English letter "A" is coded as 0x0041; the library would treat the first byte, 0x00, as the end of the string and ignore anything after it. UTF-8, however, is designed so that encoded bytes never take on the values of special ASCII characters, preventing this and similar problems (see the sketch after this list).
  • UTF-8 strings can be sorted using standard byte-oriented sorting routines (the resulting order is by code point, however, which for characters above 127 need not match the alphabetical order of any particular language).
  • Since the top bit is always set in all the bytes of multiple-byte UTF-8 characters, most software designed to process ASCII or other 8-bit codes will never see any of them as a space, so white-space based tokenizing routines will continue to work correctly with UTF-8 encoded strings.
  • Although encoded characters are variable length, their encoding is such that their boundaries can be delineated without elaborate parsing.
  • UTF-8 is the default encoding for the XML format.
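
To make the null-terminator point concrete, here is a minimal sketch showing byte-oriented C library routines handling UTF-8 data unchanged (the string content is just an example of ours):

  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
      /* "A" followed by alef (U+05D0), written as explicit UTF-8 bytes. */
      const char *s = "A\xD7\x90";

      /* strlen counts bytes, not characters: it reports 3 here, and it
       * works at all only because UTF-8 never produces a 0x00 byte
       * (UTF-16 would encode "A" as 0x00 0x41 and truncate the string). */
      printf("%zu bytes\n", strlen(s));

      /* Byte-wise comparison and sorting also work unchanged. */
      printf("s sorts after \"A\": %d\n", strcmp(s, "A") > 0);
      return 0;
  }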

Disadvantages

  • UTF-8 is variable-length: different characters are encoded with sequences of different lengths, so, for example, reaching the Nth character of a string requires scanning from its start. The acuteness of this can be decreased, however, by creating an abstract interface for working with UTF-8 strings that makes the encoding transparent to the user.
  • A badly-written (and non-standard-compliant) UTF-8 parser could accept a number of different pseudo-UTF-8 representations and convert them to the same Unicode output. This provides a way for information to leak past validation routines designed to process data in its 8-bit representation; a compliant decoder must therefore reject such forms explicitly (see the sketch after this list).
  • Ideographs use three bytes in UTF-8 but only two in UTF-16, so Chinese, Japanese, and Korean text takes up more space when represented in UTF-8.
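
The following C sketch (the function name utf8_decode is ours) shows where those rejection checks belong in a compliant decoder; it is an illustration under the RFC 3629 rules, not production code:

  #include <stddef.h>

  /* Decode one UTF-8 sequence starting at s (n bytes available).
   * Stores the sequence length in *len and returns the code point,
   * or returns -1 on any malformed input. */
  long utf8_decode(const unsigned char *s, size_t n, size_t *len)
  {
      /* Smallest code point that legitimately needs each length;
       * anything below it is an overlong (pseudo-UTF-8) form. */
      static const long min_cp[] = { 0, 0, 0x80, 0x800, 0x10000 };
      size_t need, i;
      long cp;

      if (n == 0) return -1;
      if (s[0] < 0x80)      { cp = s[0];        need = 1; }
      else if (s[0] < 0xC0) return -1;          /* stray continuation byte */
      else if (s[0] < 0xE0) { cp = s[0] & 0x1F; need = 2; }
      else if (s[0] < 0xF0) { cp = s[0] & 0x0F; need = 3; }
      else if (s[0] < 0xF8) { cp = s[0] & 0x07; need = 4; }
      else return -1;                           /* 5- and 6-byte forms are no longer legal */

      if (n < need) return -1;
      for (i = 1; i < need; i++) {
          if ((s[i] & 0xC0) != 0x80) return -1; /* not a continuation byte */
          cp = (cp << 6) | (s[i] & 0x3F);
      }
      if (cp < min_cp[need]) return -1;         /* reject overlong forms */
      if (cp > 0x10FFFF) return -1;             /* beyond Unicode */
      if (cp >= 0xD800 && cp <= 0xDFFF) return -1; /* UTF-16 surrogates */

      *len = need;
      return cp;
  }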

History

UTF-8 was invented by Ken Thompson on September 2, 1992 in New Jersey, for use in the Plan 9 operating system. It was first officially presented at the USENIX conference in San Diego, January 25-29, 1993.


To be merged

UTF-8

Another common encoding is UTF-8, which is also a variable-length encoding. The first 128 code points are represented by one byte and are equivalent to ASCII. Representation of higher code points requires two to four bytes (up to six under the original, pre-RFC 3629 definition).

UTF-8 has several advantages, especially when adding support for Unicode to existing software. For one, no changes are required to support ASCII alone. Secondly, most functions from the standard library of the C programming language that have traditionally been used for character processing (such as strcmp for comparisons and trivial sorting) still work, because they operate on 8-bit values. (By contrast, supporting a 16- or 32-bit encoding such as UTF-16 or UCS-4 would require rewriting large parts of older software.) Third, for most texts that use relatively few non-ASCII characters (that is, texts in most Western languages), the encoding is very space-efficient, because it requires only slightly more than 8 bits per character.

The exact mechanics of UTF-8 are as follows (numbers prefixed with 0x are in hexadecimal notation):

  • For a scalar value less than 0x80, use one byte with the same scalar value.
  • For a scalar value less than 0x800, use two bytes, where the first is 0xC0 plus the number represented by the 7th-11th least significant bits, while the second is 0x80 plus the 1st-6th least significant bits.
  • For a scalar value less than 0x10000, use three bytes. The first is 0xE0 plus the 13th-16th LSBs, the second 0x80 plus the 7th-12th LSBs, and the third 0x80 plus the 1st-6th LSBs.
  • For a scalar value less than 0x200000, use four bytes, namely 0xF0 plus the 19th-21st LSBs; 0x80 plus the 13th-18th; 0x80 plus the 7th-12th; and 0x80 plus the 1st-6th LSBs.
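
As a concrete illustration of the three-byte rule above (the character choice is ours), take the euro sign € (U+20AC):

  • 0x20AC is binary 0010 000010 101100 (grouped here as 4 + 6 + 6 bits).
  • First byte: 0xE0 plus binary 0010 (0x02) gives 0xE2.
  • Second byte: 0x80 plus binary 000010 (0x02) gives 0x82.
  • Third byte: 0x80 plus binary 101100 (0x2C) gives 0xAC.

The euro sign is therefore encoded as the byte sequence 0xE2 0x82 0xAC.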

Currently no other sequences are legal, because no scalar values above 0x200000 have been assigned Unicode characters yet. The original design provided for sequences of up to six bytes, but RFC 3629 now caps UTF-8 at four bytes and the range 0x00-0x10FFFF.

As a consequence of the above details, the following properties of multi-byte sequences hold:

  • The most significant bit of a single-byte character is always 0.
  • The most significant bits of the first byte of a multi-byte sequence determine the length of the sequence. These most significant bits are 110 for two-byte sequences; 1110 for three-byte sequences, etc.
  • The remaining bytes in a multi-byte sequence have 10 as their two most significant bits.
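
These properties mean a program can locate the start of the character containing any given byte without decoding from the beginning of the string, as in this C sketch (the helper name is ours; it assumes a well-formed buffer):

  #include <stddef.h>

  /* Step backwards from position i to the first byte of the UTF-8
   * character containing s[i]. Continuation bytes always match the
   * bit pattern 10xxxxxx, so at most three steps are ever needed. */
  size_t utf8_char_start(const unsigned char *s, size_t i)
  {
      while (i > 0 && (s[i] & 0xC0) == 0x80)
          i--;
      return i;
  }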

UTF-8 was designed to satisfy these properties in order to guarantee that no byte sequence of one character is contained within a longer byte sequence of another character. This ensures that byte-wise sub-string matching can be applied to search for words or phrases within a text; some older variable-length 8-bit encodings (such as Shift-JIS) did not have this property and thus made string-matching algorithms rather complicated. Although it is argued that this property adds redundancy to UTF-8-encoded text, the advantages outweigh this concern; besides, data compression is not one of Unicode's aims and must be considered independently.
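
One concrete consequence, sketched below with example strings of our choosing: the byte-oriented strstr from the C standard library finds UTF-8 substrings correctly, because a byte-level match can never begin or end in the middle of a character:

  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
      /* "héllo wörld", written as explicit UTF-8 bytes
       * (é = 0xC3 0xA9, ö = 0xC3 0xB6). */
      const char *text = "h\xC3\xA9llo w\xC3\xB6rld";

      /* Plain byte-wise search: since no character's byte sequence can
       * occur inside another's, any hit is a genuine character match. */
      if (strstr(text, "w\xC3\xB6rld") != NULL)
          printf("substring found\n");
      return 0;
  }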