UTF-8 |
|
(From Silicon Graphics' Moving Worlds documentation)

The 2-byte (UCS-2) encoding of ISO 10646 is identical to the Unicode standard. In order to allow standard ASCII text editors to continue to work with most VRML files, we have chosen to support the UTF-8 encoding of ISO 10646. This encoding allows ASCII text (0x0..0x7F) to appear without any changes and encodes all characters from 0x80..0x7FFFFFFF into a series of six or fewer bytes.

If the most significant bit of the first byte is 0, then the remaining seven bits are interpreted as an ASCII character. Otherwise, the number of leading 1 bits indicates the total number of bytes in the sequence, including the first byte. There is always a 0 bit between the count bits and any data.

The first byte can be one of the following. The X indicates bits available to encode the character.

  0XXXXXXX   one byte      0..0x7F (ASCII)
  110XXXXX   two bytes     Maximum character value is 0x7FF
  1110XXXX   three bytes   Maximum character value is 0xFFFF
  11110XXX   four bytes    Maximum character value is 0x1FFFFF
  111110XX   five bytes    Maximum character value is 0x3FFFFFF
  1111110X   six bytes     Maximum character value is 0x7FFFFFFF

All following bytes have this format: 10XXXXXX

A two-byte example: the symbol for a registered trademark is "circled R registered sign", or 174 in both ISO/Latin-1 (8859/1) and ISO 10646. In hexadecimal, it is 0xAE. In HTML, it is &reg; (displayed as ®). In UTF-8 it has the following two-byte encoding: 0xC2, 0xAE. |
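
The byte layout described above can be sketched in a few lines of Python. This is only an illustration of the one-to-six-byte scheme as the excerpt states it, not code from the Moving Worlds documentation; the function name utf8_encode and the layouts table are hypothetical names chosen for this example. Note that the current UTF-8 definition (RFC 3629) restricts the encoding to at most four bytes, so the five- and six-byte forms below exist only to mirror the table in the text.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one ISO 10646 value with the 1-to-6-byte layout from the text.

    Hypothetical helper for illustration; not part of any library.
    """
    if not 0 <= code_point <= 0x7FFFFFFF:
        raise ValueError("value outside 0..0x7FFFFFFF")

    # 0XXXXXXX: plain ASCII passes through unchanged.
    if code_point <= 0x7F:
        return bytes([code_point])

    # (maximum character value, total bytes, first-byte prefix bits)
    layouts = [
        (0x7FF,      2, 0xC0),  # 110XXXXX
        (0xFFFF,     3, 0xE0),  # 1110XXXX
        (0x1FFFFF,   4, 0xF0),  # 11110XXX
        (0x3FFFFFF,  5, 0xF8),  # 111110XX
        (0x7FFFFFFF, 6, 0xFC),  # 1111110X
    ]
    for max_value, length, prefix in layouts:
        if code_point <= max_value:
            break

    # Peel off six bits at a time, lowest bits first, into 10XXXXXX bytes.
    out = []
    value = code_point
    for _ in range(length - 1):
        out.append(0x80 | (value & 0x3F))
        value >>= 6

    # Whatever bits remain fit in the X positions of the first byte.
    out.append(prefix | value)
    return bytes(reversed(out))


if __name__ == "__main__":
    # The registered sign example from the text: 0xAE -> 0xC2, 0xAE
    print(utf8_encode(0xAE).hex())  # c2ae
```

For values that Python's own codec accepts, the result can be cross-checked against chr(n).encode("utf-8"); the built-in codec rejects anything above 0x10FFFF, which is why the sketch builds the bytes by hand.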