In UTF-8, each 16-bit Unicode character is encoded as a sequence of one, two, or three 8-bit bytes, depending on the value of the character. The following table shows the format of these byte sequences. The “free bits”, marked by x's in the table, are concatenated in the order shown, most significant bit first, to give the character value:

 Binary format of bytes in sequence:
                                        Number of    Maximum expressible
 1st byte     2nd byte    3rd byte      free bits:      Unicode value:

 0xxxxxxx                                  7           007F hex   (127)
 110xxxxx     10xxxxxx                  (5+6)=11       07FF hex  (2047)
 1110xxxx     10xxxxxx    10xxxxxx     (4+6+6)=16      FFFF hex (65535)
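
As an illustration of the table, the following is a minimal Python sketch that packs a 16-bit value into one, two, or three bytes. The function name encode_bmp is made up for this example, and the sketch makes no attempt to reject the surrogate range D800 to DFFF hex, which a strict encoder would refuse.

```python
def encode_bmp(code_point: int) -> bytes:
    """Encode one 16-bit value as UTF-8, following the table above."""
    if not 0 <= code_point <= 0xFFFF:
        raise ValueError("only 16-bit values are handled in this sketch")
    if code_point <= 0x7F:
        # 0xxxxxxx: 7 free bits
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 110xxxxx 10xxxxxx: 5 + 6 free bits
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    # 1110xxxx 10xxxxxx 10xxxxxx: 4 + 6 + 6 free bits
    return bytes([0xE0 | (code_point >> 12),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


print(encode_bmp(0x0041).hex())  # 41
print(encode_bmp(0x00E9).hex())  # c3a9
print(encode_bmp(0x20AC).hex())  # e282ac
```

For values outside the surrogate range, the result should agree with Python's built-in chr(code_point).encode("utf-8").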

The value of each individual byte indicates its UTF-8 function, as follows:

 00 to 7F hex   (0 to 127):  first and only byte of a sequence.
 80 to BF hex (128 to 191):  continuing byte in a multi-byte sequence.
 C2 to DF hex (194 to 223):  first byte of a two-byte sequence.
 E0 to EF hex (224 to 239):  first byte of a three-byte sequence.

Other byte values are either not used when encoding 16-bit Unicode characters (F0 to F4 hex, which begin four-byte sequences for characters above FFFF hex) or are not part of any well-formed UTF-8 sequence (C0, C1, and F5 to FF hex). See the links to UTF-8 standards documents below for further details.
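
The byte ranges listed above can be turned into a small classifier and decoder. The following Python sketch is only an illustration: the names byte_role and decode_bmp and the role labels are invented here. It checks the byte ranges and nothing more, so it does not catch every ill-formed sequence (for example, it accepts a three-byte sequence that encodes a value below 0800 hex, or one in the surrogate range).

```python
def byte_role(b: int) -> str:
    """Return the UTF-8 function of a single byte value (0 to 255)."""
    if not 0 <= b <= 0xFF:
        raise ValueError("not a byte value")
    if b <= 0x7F:
        return "single"        # 00 to 7F: first and only byte
    if b <= 0xBF:
        return "continuation"  # 80 to BF: continuing byte
    if b <= 0xC1:
        return "invalid"       # C0, C1: never well-formed
    if b <= 0xDF:
        return "lead2"         # C2 to DF: starts a two-byte sequence
    if b <= 0xEF:
        return "lead3"         # E0 to EF: starts a three-byte sequence
    if b <= 0xF4:
        return "lead4"         # F0 to F4: beyond the 16-bit range
    return "invalid"           # F5 to FF: never well-formed


def decode_bmp(data: bytes) -> list:
    """Decode UTF-8 bytes into a list of 16-bit values, or raise ValueError."""
    lengths = {"single": 1, "lead2": 2, "lead3": 3}
    masks = {1: 0x7F, 2: 0x1F, 3: 0x0F}  # free bits of the first byte
    out = []
    i = 0
    while i < len(data):
        role = byte_role(data[i])
        if role not in lengths:
            raise ValueError(f"unexpected byte {data[i]:02X} at offset {i}")
        n = lengths[role]
        tail = data[i + 1:i + n]
        if len(tail) != n - 1 or any(byte_role(t) != "continuation" for t in tail):
            raise ValueError(f"truncated or malformed sequence at offset {i}")
        value = data[i] & masks[n]
        for t in tail:
            # Each continuing byte contributes its low 6 bits.
            value = (value << 6) | (t & 0x3F)
        out.append(value)
        i += n
    return out


print([hex(v) for v in decode_bmp(b"\x41\xc3\xa9\xe2\x82\xac")])
# ['0x41', '0xe9', '0x20ac']
```

Treating F0 to F4 as an error is a choice made for this sketch, in keeping with the 16-bit scope of this page; a general-purpose decoder would read a four-byte sequence there instead.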

  • utf8.txt
  • Last modified: 2007-07-11 17:28