Talisman: UTF-8

(Partly from Silicon Graphics' Moving Worlds documentation)

The 2 byte (UCS-2) encoding of ISO 10646 is identical to the Unicode standard.

In order to allow standard ASCII text editors to continue to work with most VRML files, we have chosen to support the UTF-8 encoding of ISO 10646. This encoding allows ASCII text (0x0..0x7F) to appear without any changes and encodes all characters from 0x80..0x7FFFFFFF into a sequence of two to six bytes.

If the most significant bit of the first byte is 0, then the remaining seven bits are interpreted as an ASCII character. Otherwise, the number of leading 1 bits gives the total number of bytes in the sequence, counting the first byte itself. There is always a 0 bit between the count bits and any data.

The first byte can be one of the following. Each X marks a bit available to encode the character.

                                 max
                                 char
  byte one  total in character   bits  possible numeric range
  --------  -------------------- ----  ----------------------------------
  0XXXXXXX  only this byte        7    0..0x7F (ASCII)
  110XXXXX  two bytes            11    Maximum character value is 0x7FF
  1110XXXX  three bytes          16    Maximum character value is 0xFFFF
  11110XXX  four bytes           21    Maximum character value is 0x1FFFFF
  111110XX  five bytes           26    Maximum character value is 0x3FFFFFF
  1111110X  six bytes            31    Maximum character value is 0x7FFFFFFF
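
As an illustration of the table above, here is a minimal Python sketch that reads the total sequence length off the first byte. The function name sequence_length is an assumption made for this example and is not part of any standard API. A byte of the form 10XXXXXX never appears in the table as a first byte, so the sketch rejects it.

```python
def sequence_length(first_byte: int) -> int:
    """Return the total number of bytes in the sequence begun by first_byte."""
    if not 0 <= first_byte <= 0xFF:
        raise ValueError("not a byte value")
    if first_byte < 0x80:
        # 0XXXXXXX: plain ASCII, one byte in total
        return 1
    if first_byte < 0xC0:
        # 10XXXXXX is not a valid first byte
        raise ValueError("byte cannot start a sequence")
    # Count the leading 1 bits; that count is the total length.
    length = 0
    mask = 0x80
    while first_byte & mask:
        length += 1
        mask >>= 1
    if length > 6:
        # 0xFE and 0xFF have no row in the table
        raise ValueError("byte cannot start a sequence")
    return length
```

For example, sequence_length(0xC2) gives 2 and sequence_length(0xFC) gives 6.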

All bytes after the first have this format: 10xxxxxx. For example, a three byte character encoding can hold 16 bits of actual character data, spread through the x bits in 1110xxxx 10xxxxxx 10xxxxxx. Further, we know that the byte after the end of the sequence must not start with 10, if the text is to be valid UTF-8.
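
The bit layout can be sketched in Python as follows. This is only an illustration of the scheme in the table, including the five and six byte forms. The names encode_char and _RANGES are assumptions made for this example, and the sketch does no checking beyond the numeric range.

```python
# (maximum value, first-byte marker) for total lengths 1..6, from the table
_RANGES = [
    (0x7F, 0x00),
    (0x7FF, 0xC0),
    (0xFFFF, 0xE0),
    (0x1FFFFF, 0xF0),
    (0x3FFFFFF, 0xF8),
    (0x7FFFFFFF, 0xFC),
]


def encode_char(value: int) -> bytes:
    """Encode one character value using the shortest form that fits."""
    if value < 0:
        raise ValueError("value out of range")
    for total, (limit, marker) in enumerate(_RANGES, start=1):
        if value <= limit:
            break
    else:
        raise ValueError("value out of range")
    out = []
    # Fill the trailing bytes from the low end, six data bits at a time.
    for _ in range(total - 1):
        out.append(0x80 | (value & 0x3F))
        value >>= 6
    # Whatever remains fits in the data bits of the first byte.
    out.append(marker | value)
    return bytes(reversed(out))
```

For values that Python's own encoder accepts, the result should agree with it: encode_char(0x20AC) and chr(0x20AC).encode("utf-8") both give the three bytes E2 82 AC.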

Note that the Unicode UTF-8 standard does not, as of the time of this writing, include the high end of the possible range for the encoding described above.

A two-byte example with ®

The symbol for a registered trademark is "circled R registered sign" or 174 in both ISO/Latin-1 (8859/1) and ISO 10646. In UTF-8 it has the following two-byte encoding: 0xC2, 0xAE. Here's a rough idea of how that's generated:

®, the registered trademark symbol, HTML &#174;.
174 (decimal) / 0xAE (hexadecimal) / 10101110 (binary)
Since the binary value needs 8 bits, one more than the 7-bit maximum for one-byte encoding, two bytes of UTF-8 will be needed.
The two-byte template is 110xxxxx and 10xxxxxx.
The 8 bits to store will have the left side padded with zeros to fill the 11 target bits, as 00010101110.
The first five bits, 00010, fit into byte 1's 110xxxxx, yielding 11000010.
The latter six bits, 101110, fit into byte 2's 10xxxxxx, yielding 10101110.
In the resulting 11000010 10101110 (hex C2 AE), looking at the first two bits of each byte tells us that byte 1 starts the sequence and that byte 2 is a continuation byte. Further, the number of 1 bits at the beginning of byte 1 makes it clear that byte 2 is the last byte for this character. If byte 2 were found by itself, we would know that it was not a complete character. We could then skip forward or backward to look for a byte starting with binary 11 (the start of a multibyte character) or 0 (an ASCII character) to begin valid interpretation, and only a handful of bytes would need to be searched in either direction.
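
The backward search can be sketched in Python like this. The function name sync_start and the limit of five backward steps are assumptions made for this example. The limit follows from the six byte maximum, since at most five continuation bytes can sit between a starting byte and the end of its sequence.

```python
def sync_start(data: bytes, index: int) -> int:
    """Back up from index to the byte that starts the character there."""
    start = index
    # Continuation bytes look like 10xxxxxx; step back over them.
    while start > 0 and index - start < 5 and (data[start] & 0xC0) == 0x80:
        start -= 1
    if (data[start] & 0xC0) == 0x80:
        # Still on a continuation byte: no starting byte within reach.
        raise ValueError("no starting byte found")
    return start


# The ® example, placed between two ASCII letters: 61 C2 AE 62
sample = b"a\xc2\xaeb"
print(sync_start(sample, 2))  # index 2 is the lone 0xAE; the search lands on 0xC2
```

Landing in the middle of the ® sequence (index 2) moves back to index 1, where 0xC2 starts the character. Indexes 0 and 3 are ASCII bytes and are returned unchanged.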
