12.9 Representation of EUC Wide Characters

Unlike Unicode, the definition of EUC specifies only the external representation. The wide character codes assigned to the multibyte characters are not specified. UNIX systems supporting EUC have their own C data type, wchar_t, which stores a wide character, but the mapping between this type and the external representation is not standardized.

We have decided to use a custom-made mapping from the EUC encoding to the character code set, rather than the UNIX type wchar_t. This makes the code set machine independent and yields a compact representation of atoms.

EUC consists of four sub-code-sets, three of which can have a multibyte external representation. Sub-code-set 0 consists of the ASCII characters and is mapped one-to-one to codes 0..127. Sub-code-set 1 has an external representation of one to three bytes, each in the range 128..255, with the length determined by the locale. Sub-code-sets 2 and 3 are similar, but their external representation begins with a so-called single shift character code, known as SS2 and SS3, respectively. The following table shows the mapping from the EUC external encoding to SICStus Prolog character codes.

     Sub-
     code-set  External encoding                 Character code (binary)
     
      0        0xxxxxxx                          00000000 00000000 0xxxxxxx
     
      1        1xxxxxxx                          00000000 00000000 1xxxxxxx
               1xxxxxxx 1yyyyyyy                 00000000 xxxxxxx0 1yyyyyyy
               1xxxxxxx 1yyyyyyy 1zzzzzzz        0xxxxxxx yyyyyyy0 1zzzzzzz
     
      2        SS2 1xxxxxxx                      00000000 00000001 0xxxxxxx
               SS2 1xxxxxxx 1yyyyyyy             00000000 xxxxxxx1 0yyyyyyy
               SS2 1xxxxxxx 1yyyyyyy 1zzzzzzz    0xxxxxxx yyyyyyy1 0zzzzzzz
     
      3        SS3 1xxxxxxx                      00000000 00000001 1xxxxxxx
               SS3 1xxxxxxx 1yyyyyyy             00000000 xxxxxxx1 1yyyyyyy
               SS3 1xxxxxxx 1yyyyyyy 1zzzzzzz    0xxxxxxx yyyyyyy1 1zzzzzzz
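
The table can be read as a simple bit-packing rule: the 7-bit payload of the last external byte goes to bits 0..6 of the character code, the payload of the preceding byte (if any) to bits 9..15, and the one before that to bits 16..22, while bits 7 and 8 record the sub-code-set. The following Python sketch illustrates this reading. It is not the SICStus implementation: the names decode_one and TAG are hypothetical, the byte values 0x8E and 0x8F for SS2 and SS3 are the usual EUC values (the text above does not state them), and the error handling is an assumption.

```python
SS2, SS3 = 0x8E, 0x8F  # usual EUC single shift bytes (assumed, not from the text)

# Bits 8 and 7 of the character code identify the sub-code-set (see table).
TAG = {1: 0x080, 2: 0x100, 3: 0x180}

def decode_one(data, pos, lengths):
    """Decode one character of the bytes object `data` starting at `pos`.

    `lengths` maps sub-code-set 1, 2, 3 to its byte length, excluding
    the single shift byte. Returns (character_code, next_pos).
    """
    first = data[pos]
    if first < 0x80:                 # sub-code-set 0: plain ASCII
        return first, pos + 1
    if first == SS2:
        cset, pos = 2, pos + 1       # skip the single shift byte
    elif first == SS3:
        cset, pos = 3, pos + 1
    else:
        cset = 1
    n = lengths[cset]
    payload = data[pos:pos + n]
    # Hypothetical error handling: every payload byte must have its top bit set.
    if len(payload) != n or any(b < 0x80 for b in payload):
        raise ValueError("malformed EUC sequence")
    bits = [b & 0x7F for b in payload]
    code = bits[-1]                  # last byte  -> bits 0..6
    if n >= 2:
        code |= bits[-2] << 9        # previous   -> bits 9..15
    if n == 3:
        code |= bits[-3] << 16       # first      -> bits 16..22
    return code | TAG[cset], pos + n
```

For example, with the lengths 2, 1, 2 for sub-code-sets 1, 2, and 3, the call decode_one(b"\xa4\xa2", 0, {1: 2, 2: 1, 3: 2}) returns (0x48A2, 2), which corresponds to the second row of sub-code-set 1 in the table.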

For sub-code-sets other than 0, the sub-code-set length indicated by the locale determines which of the three mappings is used (but see the SP_CSETLEN environment variable below). When converting SICStus Prolog character codes to EUC on output, we ignore bits that have no significance in the mapping selected by the locale.
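
The output direction can be sketched in the same spirit. In the Python sketch below, bits 7 and 8 of the character code select the sub-code-set, the configured length selects how many 7-bit fields are written, and any higher fields are simply dropped, which is one way to read "ignoring bits that have no significance". The name encode_one is hypothetical, the values 0x8E and 0x8F for SS2 and SS3 are the usual EUC values rather than something stated here, and treating codes whose bits 7 and 8 are both zero as ASCII regardless of higher bits is our assumption.

```python
SS2, SS3 = 0x8E, 0x8F  # usual EUC single shift bytes (assumed, not from the text)

def encode_one(code, lengths):
    """Convert one character code to its EUC external bytes.

    `lengths` maps sub-code-set 1, 2, 3 to its byte length, excluding
    the single shift byte.
    """
    cset = (code >> 7) & 0x3         # bits 8 and 7: 00, 01, 10, 11 -> sets 0..3
    if cset == 0:
        # Assumption: for sub-code-set 0 only the low 7 bits are significant.
        return bytes([code & 0x7F])
    n = lengths[cset]
    # The three 7-bit fields of the character code, most significant first.
    fields = [(code >> 16) & 0x7F, (code >> 9) & 0x7F, code & 0x7F]
    # Keep only the last n fields; higher fields are not significant
    # for this length and are ignored.
    payload = [0x80 | f for f in fields[3 - n:]]
    prefix = {1: [], 2: [SS2], 3: [SS3]}[cset]
    return bytes(prefix + payload)
```

With lengths 2, 1, 2 for sub-code-sets 1, 2, and 3, both encode_one(0x48A2, {1: 2, 2: 1, 3: 2}) and encode_one(0x0548A2, {1: 2, 2: 1, 3: 2}) produce the two bytes 0xA4 0xA2, since bits 16..22 carry no meaning for a two-byte sub-code-set 1.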

The byte lengths associated with the EUC sub-code-sets are determined by using the csetlen() function. If this function is not available in the system configuration used, then Japanese Solaris lengths are assumed, namely 2, 1, 2 for sub-code-sets 1, 2, and 3, respectively (the lengths exclude the single shift character).

To allow experimentation with sub-code-set lengths differing from those of the locale, the length values can be overridden by setting the SP_CSETLEN environment variable to xyz, where x, y, and z are digits in the range 1..3. Sub-code-sets 1, 2, and 3 then have byte lengths x, y, and z, respectively.
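
As an illustration of the override, the following Python sketch parses a value of the form xyz into per-sub-code-set lengths. It is a hedged sketch only: the function name csetlen_from_env is hypothetical, the Japanese Solaris lengths stand in for whatever the locale would report, and rejecting malformed values with an exception is our assumption, since the text does not say how such values are treated.

```python
import os

# Stand-in for the locale-derived lengths: the Japanese Solaris values
# 2, 1, 2 for sub-code-sets 1, 2, and 3 (single shift byte excluded).
DEFAULT_LENGTHS = {1: 2, 2: 1, 3: 2}

def csetlen_from_env(environ=os.environ, fallback=DEFAULT_LENGTHS):
    """Return a dict mapping sub-code-set 1, 2, 3 to its byte length."""
    value = environ.get("SP_CSETLEN")
    if value is None:
        return dict(fallback)        # no override: keep the fallback lengths
    # Hypothetical validation: exactly three digits, each in 1..3.
    if len(value) != 3 or any(ch not in "123" for ch in value):
        raise ValueError("SP_CSETLEN must be three digits in the range 1..3")
    return {cset: int(ch) for cset, ch in zip((1, 2, 3), value)}
```

For instance, csetlen_from_env({"SP_CSETLEN": "332"}) yields lengths 3, 3, and 2 for sub-code-sets 1, 2, and 3, while an environment without SP_CSETLEN leaves the fallback lengths in place.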