This example implements a WCX box supporting the use of four
external encodings within the same SICStus Prolog invocation: ISO
Latin1, ISO Latin2 (ISO 8859/2), UNICODE, and EUC. The code is included
in the distribution as library(wcx_example).
The default encoding functions supplied in SICStus Prolog deal with a single encoding only. However, the interface does allow the implementation of WCX boxes supporting different encodings for different streams.
A basic assumption in SICStus Prolog is that there is a single character set. If we are to support multiple encodings, we have to map them into a single character set. For example, the single-byte character sets ISO Latin1 and ISO Latin2 can be easily mapped to the Unicode character set. On the other hand, there does not seem to be a simple mapping of the whole of the EUC character set to UNICODE, or the other way round.
Therefore, in this example, we use a composite character set, which covers both EUC and Unicode, but does not deal with unifying the character codes of characters appearing in both character sets, except for the case of ASCII characters.
The figure below depicts the structure of the composite character set of the sample WCX box.
[Figure: structure of the composite character set, showing the EUC, ASCII, LATIN1, LATIN2 and UNICODE regions.]
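This character code set uses character codes up to 24 bits wide:

0 =< code =< 2^16-1
    A UNICODE character code (this range includes ASCII).
code = 2^16 + euc_code
    An EUC character whose wide character code is euc_code
    (as described in Representation of EUC Wide Characters).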
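The arithmetic of the composite code set can be sketched in C as follows. This is a minimal illustration, not the code of library(wcx_example): the helper names and the EUC_OFFSET macro are hypothetical, and only the ranges and the 2^16 offset are taken from the description above.

```c
/* Hypothetical helper names; only the ranges and the offset come from the text. */
#define EUC_OFFSET (1L << 16)   /* 2^16: first code outside the UNICODE range */
#define CODE_LIMIT (1L << 24)   /* composite codes are at most 24 bits wide */

/* Nonzero if code lies in the UNICODE part (0 .. 2^16-1, which includes ASCII). */
int is_unicode_code(long code)
{
    return code >= 0 && code < EUC_OFFSET;
}

/* Nonzero if code lies in the EUC part (2^16 and above, below 2^24). */
int is_euc_code(long code)
{
    return code >= EUC_OFFSET && code < CODE_LIMIT;
}

/* Composite code for a given EUC wide character code. */
long composite_from_euc(long euc_code)
{
    return EUC_OFFSET + euc_code;
}

/* Recover the EUC wide character code from a composite code in the EUC part. */
long euc_from_composite(long code)
{
    return code - EUC_OFFSET;
}
```

Under these assumptions a UNICODE code is its own composite code, so no conversion function is needed for that part.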
The four external encodings supported by the sample WCX box can be
specified on a stream-by-stream basis, by supplying a wcx(ENC)
option to open/4, where ENC is one of the atoms latin1, latin2,
unicode or euc.
The mapping of these external encodings to the composite character code set is done in the following way:
Note that in order to support this composite character code set,
we had to give up the ability to read and write UTF-8-encoded files with
character codes above 0xffff (which is possible using the built-in
utf8 WCX-mode of SICStus Prolog; see Prolog Level WCX Features).
The example uses a primitive character-type mapping: characters in
the 0x80-0xff range are classified according to the latin1
encoding, and all characters above that range are considered
small-letters. However, as an example of re-classification, code
0xa1 (inverted exclamation mark) is categorized as solo-char.
The default system encoding is used (truncation to 8 bits).
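The classification rule can be sketched in C as below. The enum and function names are hypothetical, and latin1_class() is a deliberately simplified stand-in for a real Latin1 classification table (an assumption, not the table used by the sample box). Treating ASCII codes as "left to the default" is likewise an assumption, since the text only describes codes from 0x80 upwards.

```c
/* Hypothetical class names; the real box uses the character-type codes of SICStus Prolog. */
enum char_class {
    CC_DEFAULT,          /* ASCII: left to the default classification (assumption) */
    CC_SMALL_LETTER,
    CC_CAPITAL_LETTER,
    CC_SYMBOL,
    CC_SOLO
};

/* Simplified stand-in for a Latin1 table covering 0x80-0xff (an assumption). */
enum char_class latin1_class(int code)
{
    if (code == 0xd7 || code == 0xf7)
        return CC_SYMBOL;                 /* multiplication and division signs */
    if (code >= 0xc0 && code <= 0xde)
        return CC_CAPITAL_LETTER;
    if (code >= 0xdf && code <= 0xff)
        return CC_SMALL_LETTER;
    return CC_SYMBOL;                     /* 0x80-0xbf lumped together in this sketch */
}

/* Classify a composite character code following the rule described above. */
enum char_class classify(long code)
{
    if (code < 0x80)
        return CC_DEFAULT;
    if (code == 0xa1)
        return CC_SOLO;                   /* the re-classified inverted exclamation mark */
    if (code <= 0xff)
        return latin1_class((int)code);
    return CC_SMALL_LETTER;               /* everything above 0xff */
}
```

The special case for 0xa1 is tested before the Latin1 lookup, which is one simple way to express a re-classification without editing the table itself.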
The box has to be initialized by calling the C function
wcx_setup(), which first reads the environment variable
WCX_TYPE, and uses its value as the default encoding. It then
calls SP_set_wcx_hooks(), and initializes its own conversion
tables. In a runtime system wcx_setup() should be called
before SP_initialize(), so that it affects the standard
streams created there. The second phase of initialization,
wcx_init_atoms(), has to be called after SP_initialize(),
to set up variables storing the atoms naming the external
encodings.
In a development system the two initialization phases can be put
together. This is implemented as wcx_init(), which is declared to be
a foreign entry point in wcx.pl.
On any subsequent creation of a stream, the hook function
my_wcx_open() is called. This sets the wide character get and put
function pointers in the stream according to the atom
supplied in the wcx(...) option, or according to the value of
the Prolog flag wcx.
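The selection step performed when a stream is opened might look roughly like the following C sketch. It is not the actual my_wcx_open(): the real hook works with Prolog atoms and stores function pointers in the stream, whereas this illustration uses plain strings and a hypothetical enum, and all names in it are assumptions.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical encoding tags, one per supported external encoding. */
enum encoding { ENC_LATIN1, ENC_LATIN2, ENC_UNICODE, ENC_EUC, ENC_UNKNOWN };

/* Map an encoding name to its tag; the real box compares atoms, not strings. */
enum encoding encoding_from_name(const char *name)
{
    static const char *const names[] = { "latin1", "latin2", "unicode", "euc" };
    size_t i;

    if (name == NULL)
        return ENC_UNKNOWN;
    for (i = 0; i < sizeof names / sizeof names[0]; i++)
        if (strcmp(name, names[i]) == 0)
            return (enum encoding)i;      /* table order matches the enum order */
    return ENC_UNKNOWN;
}

/* Use the wcx(...) option if one was given (non-NULL), else the wcx flag value. */
enum encoding select_encoding(const char *wcx_option, const char *wcx_flag)
{
    return encoding_from_name(wcx_option != NULL ? wcx_option : wcx_flag);
}
```

A real hook would then use the selected tag to install the matching pair of get and put functions in the stream.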
The put function may be asked to output a character code that the given encoding cannot accommodate (a non-ASCII Unicode character on an EUC stream or vice versa). In such a case no bytes are output and -1 is returned as an error code.
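As a minimal sketch of this behaviour, the following C fragment shows a put function for a latin1 stream and the acceptance test an EUC put function would need. The byte_sink type, the function names, and the return value 0 on success are assumptions made for illustration; only the "output nothing and return -1" rule comes from the text, and the real put functions operate on SICStus streams.

```c
#include <stddef.h>

/* Hypothetical byte sink standing in for the underlying stream. */
struct byte_sink {
    unsigned char buf[16];
    size_t len;
};

/* Put function for a latin1 stream: one byte per character, codes 0..0xff only.
   Returns 0 on success (an assumption) and -1 if the code cannot be encoded. */
int put_latin1(long code, struct byte_sink *out)
{
    if (code < 0 || code > 0xff)
        return -1;                        /* not representable: output nothing */
    if (out->len >= sizeof out->buf)
        return -1;                        /* sink full; an artifact of this sketch */
    out->buf[out->len++] = (unsigned char)code;
    return 0;
}

/* Check for an EUC stream: only ASCII and the EUC part of the composite set fit. */
int euc_stream_accepts(long code)
{
    return (code >= 0 && code < 0x80)
        || (code >= 0x10000L && code < 0x1000000L);
}
```

Performing the range check before touching the output is what guarantees that a rejected character leaves no partial byte sequence behind.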
There is an additional foreign C function implemented in the sample WCX
box: wcx_set_encoding(), available from Prolog as
set_encoding/2. This allows changing the encoding of an already
open stream. This is used primarily for standard input-output
streams, while experimenting with the box.