A Sample Wide Character Extension (WCX) Box

This example implements a WCX box supporting the use of four external encodings within the same SICStus Prolog invocation: ISO Latin1, ISO Latin2 (ISO 8859/2), UNICODE, and EUC. The code is included in the distribution as library(wcx_example).

The default encoding functions supplied in SICStus Prolog deal with a single encoding only. However, the interface does allow the implementation of WCX boxes supporting different encodings for different streams.

A basic assumption in SICStus Prolog is that there is a single character set. If we are to support multiple encodings, we have to map them into a single character set. For example, the single-byte character sets ISO Latin1 and ISO Latin2 can be easily mapped to the Unicode character set. On the other hand, there does not seem to be a simple mapping of the whole of the EUC character set to UNICODE or the other way round.

Therefore, in this example, we use a composite character set, which covers both EUC and Unicode, but does not deal with unifying the character codes of characters appearing in both character sets, except for the case of ASCII characters.

The figure below depicts the structure of the composite character set of the sample WCX box.

     .------------------.
     |  EUC             |
     |                  |
     |                  |
     |        .+++++++++++++++++++++++++++.
     |        +  ASCII  *  LATIN1  |      +
     .--------+=========*==========       +
              + LATIN2  *                 +
              +**********                 +
              +                           +
              +                           +
              +                 UNICODE   +
              .+++++++++++++++++++++++++++.
     

This character code set uses character codes up to 24 bits wide:

0 =< code =< 2^16-1
     A UNICODE character with the given code, including ASCII.
code = 2^16 + euc_code
     A non-ASCII EUC character with code euc_code (as described in Representation of EUC Wide Characters).
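
The arithmetic of the composite code set can be sketched in C as follows. This is a minimal illustration only, not the code shipped in library(wcx_example); the names EUC_BASE, is_unicode_code, euc_to_composite and composite_to_euc are hypothetical.

```c
#include <stdbool.h>

/* Offset at which non-ASCII EUC characters start in the composite set
   (hypothetical name; the value 2^16 comes from the description above). */
#define EUC_BASE 0x10000L

/* True if the composite code denotes a UNICODE character (ASCII included). */
bool is_unicode_code(long code) {
    return code >= 0 && code < EUC_BASE;
}

/* Build the composite code for a non-ASCII EUC wide character code. */
long euc_to_composite(long euc_code) {
    return EUC_BASE + euc_code;
}

/* Recover the EUC wide character code, or -1 if `code` lies in the
   UNICODE part of the composite set. */
long composite_to_euc(long code) {
    return code >= EUC_BASE ? code - EUC_BASE : -1;
}
```

With these helpers, an EUC wide character code such as 0x1234 would be stored internally as 0x11234, while ASCII and all other UNICODE codes are stored unchanged.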

The four external encodings supported by the sample WCX box can be specified on a stream-by-stream basis, by supplying a wcx(ENC) option to open/4, where ENC is one of the atoms latin1, latin2, unicode or euc.

The mapping of these external encodings to the composite character code set is done in the following way:

latin1
     is mapped one-to-one to UNICODE codes 0x0..0xff.
latin2
     is mapped to UNICODE codes 0x0..0x02dd, using an appropriate conversion table for the non-ASCII part.
unicode
     assumes UTF-8 external encoding and maps one-to-one to the 0x0..0xffff UNICODE range.
euc
     assumes EUC external encoding and maps sub-code-set 0 to UNICODE range 0x0..0x7f, and sub-code-sets 1-3 to internal codes above 0xffff, as shown above.
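
As an illustration of the unicode mapping, here is a minimal sketch of a UTF-8 decoder restricted to the 0x0..0xffff range. The function name utf8_get_bmp and the buffer-based interface are assumptions made for this sketch; the sample box reads from a stream, and its treatment of malformed input may differ. Here overlong forms and 4-byte sequences are simply rejected with -1, and surrogate codes are not checked.

```c
#include <stddef.h>

/* Decode one UTF-8 sequence from buf (len bytes available).
   On success store the character code in *code and return the number of
   bytes consumed. Return -1, leaving *code untouched, for malformed input
   or for a sequence that encodes a code above 0xffff. */
int utf8_get_bmp(const unsigned char *buf, size_t len, long *code) {
    long c;

    if (len == 0) return -1;
    if (buf[0] < 0x80) {                       /* 1 byte: ASCII */
        *code = buf[0];
        return 1;
    }
    if ((buf[0] & 0xE0) == 0xC0) {             /* 2 bytes: 0x80..0x7ff */
        if (len < 2 || (buf[1] & 0xC0) != 0x80) return -1;
        c = ((long)(buf[0] & 0x1F) << 6) | (buf[1] & 0x3F);
        if (c < 0x80) return -1;               /* overlong form */
        *code = c;
        return 2;
    }
    if ((buf[0] & 0xF0) == 0xE0) {             /* 3 bytes: 0x800..0xffff */
        if (len < 3 || (buf[1] & 0xC0) != 0x80 || (buf[2] & 0xC0) != 0x80)
            return -1;
        c = ((long)(buf[0] & 0x0F) << 12)
          | ((long)(buf[1] & 0x3F) << 6)
          | (buf[2] & 0x3F);
        if (c < 0x800) return -1;              /* overlong form */
        *code = c;
        return 3;
    }
    return -1;  /* 4-byte lead byte or stray continuation byte */
}
```

The latin1 case needs no such function, since each input byte is already the UNICODE code of the character.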

Note that in order to support this composite character code set, we had to give up the ability to read and write UTF-8-encoded files with character codes above 0xffff (this is possible using the built-in utf8 WCX mode of SICStus Prolog; see Prolog Level WCX Features).

The example uses a primitive character-type mapping: characters in the 0x80-0xff range are classified according to the latin1 encoding, while all characters above that range are considered small-letters. However, as an example of re-classification, code 0xa1 (inverted exclamation mark) is categorized as solo-char.
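
The classification rule just described might look roughly like the sketch below. Everything here is an assumption made for illustration: the enum char_class and the function classify are hypothetical names, not the constants or hook of the WCX interface, and the Latin1 ranges are a simplified reading of the usual ISO 8859/1 letter positions rather than the table used by the sample box. The sketch only covers codes from 0x80 upwards.

```c
/* Hypothetical character classes for this sketch. */
enum char_class {
    CC_LAYOUT,
    CC_SMALL_LETTER,
    CC_CAPITAL_LETTER,
    CC_SYMBOL_CHAR,
    CC_SOLO_CHAR
};

/* Classify a composite character code >= 0x80. */
enum char_class classify(long code) {
    if (code == 0xA1) return CC_SOLO_CHAR;       /* the re-classified code */
    if (code > 0xFF)  return CC_SMALL_LETTER;    /* everything above latin1 */

    /* Assumed Latin1 rules for 0x80..0xff follow. */
    if (code <= 0xA0) return CC_LAYOUT;          /* controls, no-break space */
    if (code == 0xAA || code == 0xBA)
        return CC_SMALL_LETTER;                  /* ordinal indicators */
    if (code == 0xD7 || code == 0xF7)
        return CC_SYMBOL_CHAR;                   /* multiply, divide signs */
    if (code >= 0xC0 && code <= 0xDE)
        return CC_CAPITAL_LETTER;
    if (code >= 0xDF)
        return CC_SMALL_LETTER;
    return CC_SYMBOL_CHAR;                       /* rest of 0xa2..0xbf */
}
```

Testing for 0xa1 first is what makes the re-classification take effect; without that line the code would fall into the symbol range of the assumed Latin1 rules.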

The default system encoding is used (truncation to 8 bits).

The box has to be initialized by calling the C function wcx_setup(), which first reads the environment variable WCX_TYPE, and uses its value as the default encoding. It then calls SP_set_wcx_hooks(), and initializes its own conversion tables. In a runtime system wcx_setup() should be called before SP_initialize(), so that it affects the standard streams created there. The second phase of initialization, wcx_init_atoms(), has to be called after SP_initialize(), to set up variables storing the atoms naming the external encodings.
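
The first step of wcx_setup(), choosing a default encoding from WCX_TYPE, can be sketched as below. This is not the code of the sample box: the enum wcx_encoding and the functions parse_encoding and default_encoding_from_env are hypothetical, and two details are assumed, namely that WCX_TYPE holds the same names as the encoding atoms and that latin1 is used when the variable is unset or unrecognized.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical identifiers for the four external encodings. */
enum wcx_encoding { ENC_LATIN1, ENC_LATIN2, ENC_UNICODE, ENC_EUC };

/* Map an encoding name to its identifier; return `fallback` when the
   name is NULL or not one of the four known names. */
enum wcx_encoding parse_encoding(const char *name, enum wcx_encoding fallback) {
    if (name == NULL)                 return fallback;
    if (strcmp(name, "latin1") == 0)  return ENC_LATIN1;
    if (strcmp(name, "latin2") == 0)  return ENC_LATIN2;
    if (strcmp(name, "unicode") == 0) return ENC_UNICODE;
    if (strcmp(name, "euc") == 0)     return ENC_EUC;
    return fallback;
}

/* Read WCX_TYPE from the environment; the latin1 fallback is an
   assumption of this sketch. */
enum wcx_encoding default_encoding_from_env(void) {
    return parse_encoding(getenv("WCX_TYPE"), ENC_LATIN1);
}
```

Because this step uses only the C library, it can safely run before SP_initialize(), whereas anything involving atoms has to wait for the second phase.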

In a development system the two initialization phases can be combined. This is implemented as wcx_init(), which is declared to be a foreign entry point in wcx.pl.

On any subsequent creation of a stream, the hook function my_wcx_open() is called. This sets the wide character get and put function pointers in the stream according to the atom supplied in the wcx(...) option, or according to the value of the Prolog flag wcx.

Within the put function, a character code may have to be output that the given encoding cannot accommodate (a non-ASCII Unicode character on an EUC stream, or vice versa). In such a case no bytes are output and -1 is returned as an error code.
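
The selection of a put function at open time, and the -1 convention for codes the encoding cannot represent, can be illustrated with the following sketch. It is deliberately independent of the SICStus stream structures: the type put_fn, the functions put_latin1, put_utf8_bmp and select_put, and the buffer-based interface returning a byte count are all assumptions of this example, and only two of the four encodings are shown.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical put function type: encode `code` into `out` and return the
   number of bytes written, or -1 (writing nothing) if the encoding cannot
   represent the code. */
typedef int (*put_fn)(long code, unsigned char *out);

int put_latin1(long code, unsigned char *out) {
    if (code < 0 || code > 0xFF) return -1;
    out[0] = (unsigned char)code;
    return 1;
}

int put_utf8_bmp(long code, unsigned char *out) {
    if (code < 0 || code > 0xFFFF) return -1;  /* e.g. a composite EUC code */
    if (code < 0x80) {
        out[0] = (unsigned char)code;
        return 1;
    }
    if (code < 0x800) {
        out[0] = (unsigned char)(0xC0 | (code >> 6));
        out[1] = (unsigned char)(0x80 | (code & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (code >> 12));
    out[1] = (unsigned char)(0x80 | ((code >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (code & 0x3F));
    return 3;
}

struct codec { const char *name; put_fn put; };

static const struct codec codecs[] = {
    { "latin1",  put_latin1 },
    { "unicode", put_utf8_bmp },
};

/* Stand-in for the open hook: use the name given in the stream option if
   there is one, otherwise the default name. Returns NULL if the chosen
   name is not in the table. */
put_fn select_put(const char *option, const char *default_name) {
    const char *wanted = option != NULL ? option : default_name;
    size_t i;

    if (wanted == NULL) return NULL;
    for (i = 0; i < sizeof codecs / sizeof codecs[0]; i++)
        if (strcmp(codecs[i].name, wanted) == 0)
            return codecs[i].put;
    return NULL;
}
```

Checking the range before touching the output buffer is what guarantees that nothing is emitted for a code the encoding cannot hold.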

There is an additional foreign C function implemented in the sample WCX box: wcx_set_encoding(), available from Prolog as set_encoding/2. This allows changing the encoding of an already open stream. This is used primarily for standard input-output streams, while experimenting with the box.