Representation of EUC Wide Characters
Unlike Unicode, the definition of EUC specifies only the external
representation. The wide character codes assigned to the multibyte
characters are not specified. UNIX systems supporting EUC have their own
C data type, wchar_t, which stores a wide character, but the mapping
between this type and the external representation is not standardized.
We have decided to use a custom-made mapping from the EUC encoding to
the character code set, rather than the UNIX type wchar_t. This decision
makes the code set machine independent and results in a compact
representation of atoms.
EUC consists of four sub-code-sets, three of which can have a multibyte
external representation. Sub-code-set 0 consists of ASCII characters and
is mapped one-to-one to codes 0..127. Sub-code-set 1 has an external
representation of one to three bytes, each in the range 128..255, the
length being determined by the locale. Sub-code-sets 2 and 3 are similar,
but their external representation starts with a so-called single shift
character code, known as SS2 and SS3, respectively. The following table
shows the mapping from the EUC external encoding to SICStus Prolog
character codes.
Sub-code-set  External encoding               Character code (binary)
0             0xxxxxxx                        00000000 00000000 0xxxxxxx
1             1xxxxxxx                        00000000 00000000 1xxxxxxx
              1xxxxxxx 1yyyyyyy               00000000 xxxxxxx0 1yyyyyyy
              1xxxxxxx 1yyyyyyy 1zzzzzzz      0xxxxxxx yyyyyyy0 1zzzzzzz
2             SS2 1xxxxxxx                    00000000 00000001 0xxxxxxx
              SS2 1xxxxxxx 1yyyyyyy           00000000 xxxxxxx1 0yyyyyyy
              SS2 1xxxxxxx 1yyyyyyy 1zzzzzzz  0xxxxxxx yyyyyyy1 0zzzzzzz
3             SS3 1xxxxxxx                    00000000 00000001 1xxxxxxx
              SS3 1xxxxxxx 1yyyyyyy           00000000 xxxxxxx1 1yyyyyyy
              SS3 1xxxxxxx 1yyyyyyy 1zzzzzzz  0xxxxxxx yyyyyyy1 1zzzzzzz
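The table can be read as a single rule: the 7 payload bits of the last
byte go into bits 0..6 of the character code, the sub-code-set number
goes into bits 7..8, and the payload bits of any preceding bytes go into
bits 9 and up. The following C sketch decodes one character under that
reading. It is an illustration only, not the SICStus implementation: the
function name euc_decode, the cslen array layout, and the error
convention are hypothetical, and the single shift values 0x8E and 0x8F
are the usual EUC ones rather than something stated in this section.

```c
#include <assert.h>
#include <stddef.h>

#define EUC_SS2 0x8E  /* usual EUC single shift 2 value (assumption) */
#define EUC_SS3 0x8F  /* usual EUC single shift 3 value (assumption) */

/* Hypothetical helper: decode one EUC character from buf[0..n-1].
 * cslen[1..3] hold the byte lengths of sub-code-sets 1..3, excluding
 * the single shift byte; cslen[0] is unused.
 * Returns the number of bytes consumed, or 0 if the input is truncated
 * or a payload byte does not have its top bit set. */
size_t euc_decode(const unsigned char *buf, size_t n,
                  const int cslen[4], unsigned long *code)
{
    int set = 1, len, i;
    size_t pos = 0;
    unsigned long payload = 0;

    if (n == 0)
        return 0;
    if (buf[0] < 0x80) {            /* sub-code-set 0: ASCII, one-to-one */
        *code = buf[0];
        return 1;
    }
    if (buf[0] == EUC_SS2) {        /* a leading shift byte selects set 2 or 3 */
        set = 2;
        pos = 1;
    } else if (buf[0] == EUC_SS3) {
        set = 3;
        pos = 1;
    }
    len = cslen[set];
    assert(len >= 1 && len <= 3);
    if (n - pos < (size_t)len)
        return 0;                   /* truncated character */

    for (i = 0; i < len; i++) {     /* collect 7 payload bits per byte */
        unsigned char b = buf[pos + (size_t)i];
        if (b < 0x80)
            return 0;
        payload = (payload << 7) | (unsigned long)(b & 0x7F);
    }
    /* low 7 bits stay in place, the set number fills bits 7..8,
     * and the remaining payload bits move up to bit 9 and above */
    *code = ((payload >> 7) << 9) | ((unsigned long)set << 7)
            | (payload & 0x7F);
    return pos + (size_t)len;
}
```

For instance, with lengths 2, 1, 2 for sub-code-sets 1, 2, and 3, the
two bytes 0xA4 0xA2 decode to the code 0x48A2, which matches the second
row for sub-code-set 1 in the table.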
For sub-code-sets other than 0, the sub-code-set length indicated by the
locale determines which of the three mappings is used (but see the
SP_CSETLEN environment variable below). When converting SICStus Prolog
character codes to EUC on output, we ignore bits that have no
significance in the mapping selected by the locale.
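The output direction can be sketched in C as follows. This is a minimal
illustration under stated assumptions, not the actual SICStus code: the
name euc_encode and the cslen array layout are hypothetical, the single
shift values 0x8E and 0x8F are the usual EUC ones, and treating bits 7..8
of the code as the sub-code-set number is one reading of the mapping
table. Bits above those needed for the selected length are simply
dropped, which is how the sketch models "ignoring" insignificant bits.

```c
#include <assert.h>
#include <stddef.h>

#define EUC_SS2 0x8E  /* usual EUC single shift 2 value (assumption) */
#define EUC_SS3 0x8F  /* usual EUC single shift 3 value (assumption) */

/* Hypothetical helper: encode one character code as EUC into out[].
 * cslen[1..3] hold the byte lengths of sub-code-sets 1..3, excluding
 * the single shift byte; cslen[0] is unused.
 * Returns the number of bytes written (at most 4). */
size_t euc_encode(unsigned long code, const int cslen[4],
                  unsigned char out[4])
{
    int set = (int)((code >> 7) & 3);   /* bits 7..8 select the set */
    int len, i;
    size_t pos = 0;
    unsigned long payload;

    if (set == 0) {                     /* ASCII: one byte, high bits ignored */
        out[0] = (unsigned char)(code & 0x7F);
        return 1;
    }
    if (set == 2)
        out[pos++] = EUC_SS2;
    else if (set == 3)
        out[pos++] = EUC_SS3;

    len = cslen[set];
    assert(len >= 1 && len <= 3);

    /* rejoin the payload bits around the two set-number bits */
    payload = ((code >> 9) << 7) | (code & 0x7F);

    /* emit len bytes, last byte first; payload bits beyond 7*len are
     * never written, so they have no effect on the output */
    for (i = len - 1; i >= 0; i--) {
        out[pos + (size_t)i] = (unsigned char)(0x80 | (payload & 0x7F));
        payload >>= 7;
    }
    return pos + (size_t)len;
}
```

With a length of 2 for sub-code-set 1, the codes 0x48A2 and 0x148A2 both
produce the bytes 0xA4 0xA2, because bit 16 has no significance in the
two-byte mapping.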
The byte lengths associated with the EUC sub-code-sets are determined by
using the csetlen() function. If this function is not available in the
system configuration used, then Japanese Solaris lengths are assumed,
namely 2, 1, 2 for sub-code-sets 1, 2, and 3, respectively (the lengths
exclude the single shift character).
To allow experimentation with sub-code-set lengths differing from those
of the locale, the sub-code-set length values can be overridden by
setting the SP_CSETLEN environment variable to xyz, where x, y, and z
are digits in the range 1..3. Such a setting causes sub-code-sets 1, 2,
and 3 to have x, y, and z, respectively, as their byte lengths.
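As an illustration of the xyz format, the following C sketch parses an
SP_CSETLEN value into a table of lengths. The function name
parse_csetlen and the array layout are hypothetical. The sketch starts
from the Japanese Solaris defaults (2, 1, 2) instead of querying the
locale, and it keeps those defaults when the value is missing or
malformed; this section does not say how SICStus treats a malformed
value, so that behavior is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: fill cslen[1..3] with the byte lengths of
 * sub-code-sets 1..3. value is the SP_CSETLEN setting, or NULL if the
 * variable is unset.
 * Returns 1 if the override was applied, 0 if the defaults were kept. */
int parse_csetlen(const char *value, int cslen[4])
{
    int i;

    assert(cslen != NULL);

    /* Japanese Solaris defaults; the single shift byte is not counted */
    cslen[0] = 1;
    cslen[1] = 2;
    cslen[2] = 1;
    cslen[3] = 2;

    if (value == NULL || strlen(value) != 3)
        return 0;
    for (i = 0; i < 3; i++)          /* each digit must be in 1..3 */
        if (value[i] < '1' || value[i] > '3')
            return 0;                /* assumption: ignore bad values */
    for (i = 0; i < 3; i++)
        cslen[i + 1] = value[i] - '0';
    return 1;
}
```

A caller would typically pass getenv("SP_CSETLEN") (from stdlib.h) as
the first argument. Setting SP_CSETLEN to 312, for example, gives
sub-code-sets 1, 2, and 3 the lengths 3, 1, and 2.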