UTF-8

(From Silicon Graphics' Moving Worlds documentation)


The two-byte (UCS-2) encoding of ISO 10646 is identical to the Unicode standard.

In order to allow standard ASCII text editors to continue to work with most VRML files, we have chosen to support the UTF-8 encoding of ISO 10646. This encoding leaves ASCII text (0x0..0x7F) unchanged and encodes every character from 0x80 to 0x7FFFFFFF as a sequence of two to six bytes.

If the most significant bit of the first byte is 0, the remaining seven bits are interpreted as an ASCII character. Otherwise, the number of leading 1 bits gives the total number of bytes in the sequence, including the first byte. There is always a 0 bit between the count bits and any data.
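The first-byte rule described above can be sketched in a few lines of Python. This is only an illustration: the function name sequence_length is hypothetical, and it returns the total length of the sequence, counting the first byte.

```python
def sequence_length(first_byte: int) -> int:
    """Return the total number of bytes in the sequence that starts with first_byte."""
    if not 0 <= first_byte <= 0xFF:
        raise ValueError("not a byte value")
    if first_byte & 0x80 == 0:
        return 1  # 0XXXXXXX: a single ASCII byte

    # Count the leading 1 bits, stopping at the first 0 bit.
    count = 0
    mask = 0x80
    while first_byte & mask:
        count += 1
        mask >>= 1

    # One leading 1 bit (10XXXXXX) marks a following byte, not a first byte;
    # more than six leading 1 bits is not part of this scheme.
    if count == 1 or count > 6:
        raise ValueError("not a valid first byte")
    return count
```

For example, sequence_length(0x41) is 1 and sequence_length(0xC2) is 2.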

The first byte takes one of the following forms, where X marks the bits available to encode the character.

  0XXXXXXX  only one byte        0..0x7F (ASCII)
  110XXXXX  two bytes            Maximum character value is 0x7FF 
  1110XXXX  three bytes          Maximum character value is 0xFFFF
  11110XXX  four bytes           Maximum character value is 0x1FFFFF
  111110XX  five bytes           Maximum character value is 0x3FFFFFF
  1111110X  six bytes            Maximum character value is 0x7FFFFFFF

All following bytes have this format: 10XXXXXX
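A minimal sketch of an encoder that follows the table above, assuming the input is a single ISO 10646 character value given as an integer. The name encode_utf8 is hypothetical. Note that it implements the one-to-six-byte scheme as described here; Python's built-in codec follows the later, stricter definition of UTF-8, which stops at four bytes and 0x10FFFF.

```python
def encode_utf8(code: int) -> bytes:
    """Encode one character value using the one-to-six-byte scheme."""
    if not 0 <= code <= 0x7FFFFFFF:
        raise ValueError("character value out of range")
    if code <= 0x7F:
        return bytes([code])  # ASCII passes through unchanged

    # (maximum character value, total bytes, count bits of the first byte)
    limits = [
        (0x7FF, 2, 0xC0),
        (0xFFFF, 3, 0xE0),
        (0x1FFFFF, 4, 0xF0),
        (0x3FFFFFF, 5, 0xF8),
        (0x7FFFFFFF, 6, 0xFC),
    ]
    for maximum, length, prefix in limits:
        if code <= maximum:
            break

    # Peel off six bits at a time, lowest first, for the 10XXXXXX bytes.
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (code & 0x3F))
        code >>= 6
    tail.reverse()

    # Whatever remains fits in the X bits of the first byte.
    return bytes([prefix | code] + tail)
```

For characters that the built-in codec also accepts, the two should agree; for instance, encode_utf8(0x20AC) equals "\u20ac".encode("utf-8").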

A two-byte example: the symbol for a registered trademark is "circled R registered sign", character 174 in both ISO Latin-1 (8859-1) and ISO 10646. In hexadecimal, it is 0xAE. In HTML, it is &#174; (rendered as ®). In UTF-8 it has the following two-byte encoding: 0xC2, 0xAE.
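The arithmetic behind this example can be worked through step by step. The following Python sketch splits 0xAE by hand into the two-byte form; the variable names are illustrative only.

```python
code = 0xAE                  # 174, binary 10101110

# The low six bits go into the following byte, the rest into the first byte.
high = code >> 6             # binary 10
low = code & 0x3F            # binary 101110

first = 0xC0 | high          # 110XXXXX becomes 11000010 = 0xC2
second = 0x80 | low          # 10XXXXXX becomes 10101110 = 0xAE

encoded = bytes([first, second])
print(encoded.hex())         # prints c2ae

# Python's built-in UTF-8 codec produces the same two bytes.
assert encoded == chr(code).encode("utf-8")
```

The second byte happens to equal the original value 0xAE only because the top two bits of 0xAE are already 10; this is a coincidence of this character, not a general property.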

Send comments to www@talisman.org