12.5 Selecting the WCX Mode Using Hooks

Users can have complete control over the way wide characters are handled by SICStus Prolog if they supply their own definitions of appropriate hook functions. A set of such functions, implementing a specific environment for handling wide characters, is called a WCX box. A sample WCX box is described below (see A Sample WCX Box).

The WCX hook functions are plugged in by calling

     void SP_set_wcx_hooks ( int usage,
                             SP_WcxOpenHook *wcx_open,
                             SP_WcxCloseHook *wcx_close,
                             SP_WcxCharTypeHook *wcx_chartype,
                             SP_WcxConvHook *wcx_from_os,
                             SP_WcxConvHook *wcx_to_os);

The effect of SP_set_wcx_hooks() is controlled by the value of usage. The remaining arguments are pointers to appropriate hook functions or NULL values, the latter implying that the hook should take some default value.

There are three independent aspects to be controlled, and usage should be supplied as a bitwise OR of the constant chosen for each aspect. The defaults have value 0, so they need not be included. The aspects are the following:

  1. decide on the default code-set

    This decides the default behavior of the wcx_open and wcx_chartype hook functions (if both are supplied by the user, the choice of the default is irrelevant). The possible values are:

    WCX_USE_LATIN1 (default)
    WCX_USE_UTF8
    WCX_USE_EUC
    Select the behavior described above under titles iso_8859_1, utf8, and euc, respectively; see WCX Environment Variables.
  2. decide on the default system encoding

    The flags below determine what function to use for conversion from/to the operating system encoding, if such functions are not supplied by the user through the wcx_from_os and wcx_to_os arguments (if both are supplied by the user, the choice of default is irrelevant).

    WCX_OS_8BIT
    (default) Select the “truncation to 8-bits” behavior.
    WCX_OS_UTF8
    Select the UTF-8 encoding to be used for all communication with the operating system.
  3. decide on the preservation of ASCII, i.e. the codes in 0..127

    This is important if some of the conversion functions (wcx_from_os, wcx_to_os, and wcx_getc, wcx_putc, see later) are user-defined. In such cases it may be beneficial for the user to inform SICStus Prolog whether the supplied encoding functions preserve ASCII characters. (The default encodings do preserve ASCII.)

    WCX_PRESERVES_ASCII
    (default) Declare that the encodings preserve all ASCII characters, i.e. getting or putting an ASCII character need not go through the conversion functions, and for strings containing ASCII characters only, the system encoding conversions need not be invoked.
    WCX_CHANGES_ASCII
    Force the system to use the conversion functions even for ASCII characters and strings.
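
As a minimal sketch of how the three aspects combine, the fragment below builds a usage value with bitwise OR. The numeric values are placeholders chosen for illustration (only the fact that each default is 0 comes from the text above); real code would take the constants from the SICStus Prolog header. The helper name wcx_usage_utf8 is hypothetical.

```c
/* Placeholder values for illustration only: real code gets these
   constants from the SICStus Prolog header. The only property taken
   from the text is that each default is 0. */
enum {
    WCX_USE_LATIN1 = 0, WCX_USE_UTF8 = 1, WCX_USE_EUC = 2,  /* aspect 1 */
    WCX_OS_8BIT = 0, WCX_OS_UTF8 = 4,                       /* aspect 2 */
    WCX_PRESERVES_ASCII = 0, WCX_CHANGES_ASCII = 8          /* aspect 3 */
};

/* Hypothetical helper: UTF-8 code-set, UTF-8 system encoding, and the
   default ASCII preservation (which, being 0, could be left out). */
int wcx_usage_utf8(void)
{
    return WCX_USE_UTF8 | WCX_OS_UTF8 | WCX_PRESERVES_ASCII;
}
```

With the real header, such a value would be passed as the first argument, for example SP_set_wcx_hooks(wcx_usage_utf8(), NULL, NULL, NULL, NULL, NULL), which leaves every hook at its default.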

We now describe the role of the arguments following usage in the argument list of SP_set_wcx_hooks().

SP_WcxOpenHook *wcx_open
where typedef void (SP_WcxOpenHook) (SP_stream *s, SP_atom option, int context);

This function is called by SICStus Prolog for each stream s opened, except when the encoding to be used for the stream is pre-specified (binary files, files opened using the wci option, and the C streams created with contexts SP_STREAMHOOK_WCI and SP_STREAMHOOK_BIN).

The main task of the wcx_open hook is to associate the two WCX-processing functions with the stream, by storing them in the appropriate fields of the SP_stream data structure:

          SP_WcxGetcHook *wcx_getc;
          SP_WcxPutcHook *wcx_putc;
         

These fields are pointers to the functions performing the external decoding and encoding as described below. They are initialized to functions that truncate to 8 bits on output and zero-extend to 31 bits on input.

SP_WcxGetcHook *wcx_getc
where typedef int (SP_WcxGetcHook) (int first_byte, SP_stream *s, long *pbyte_count);

This function is generally invoked whenever a character has to be read from a stream. Before invoking this function, however, a byte is read from the stream by SICStus Prolog itself. If the byte read is an ASCII character (its value is < 128), and WCX_PRESERVES_ASCII is in force, then the byte read is deemed to be the next character code, and wcx_getc is not invoked. Otherwise, wcx_getc is invoked with the byte and stream in question and is expected to return the next character code.

The wcx_getc function may need to read additional bytes from the stream, if first_byte signifies the start of a multi-byte character. A byte may be read from the stream s in the following way:

               byte = s->sgetc(s->user_handle);
                  

The wcx_getc function is expected to increment its *pbyte_count argument by 1 for each such byte read.
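
The following is a sketch of a user-defined wcx_getc that decodes UTF-8, reading continuation bytes as described above and counting each of them in *pbyte_count. The SP_stream structure here is a cut-down stand-in with only the two fields used (the real one comes from the SICStus header), the names my_utf8_getc, mem_src and mem_getc are hypothetical, and returning -1 for malformed input is an assumption, since the text does not specify an error value for wcx_getc.

```c
#include <stddef.h>

/* Stand-in for the real SP_stream: only the fields used below. */
typedef struct SP_stream {
    int (*sgetc)(void *handle);
    void *user_handle;
} SP_stream;

/* Decode one UTF-8 character whose first byte was already read. */
int my_utf8_getc(int first_byte, SP_stream *s, long *pbyte_count)
{
    int code, extra;

    if (first_byte < 0x80) return first_byte;   /* ASCII (or negative) as is */
    else if ((first_byte & 0xE0) == 0xC0) { code = first_byte & 0x1F; extra = 1; }
    else if ((first_byte & 0xF0) == 0xE0) { code = first_byte & 0x0F; extra = 2; }
    else if ((first_byte & 0xF8) == 0xF0) { code = first_byte & 0x07; extra = 3; }
    else return -1;                             /* assumption: -1 = malformed */

    while (extra-- > 0) {
        int b = s->sgetc(s->user_handle);
        if (b < 0) return -1;                   /* stream ended mid-character */
        (*pbyte_count)++;                       /* one more byte consumed */
        if ((b & 0xC0) != 0x80) return -1;      /* not a continuation byte */
        code = (code << 6) | (b & 0x3F);
    }
    return code;
}

/* Hypothetical in-memory byte source, used only to try the decoder. */
struct mem_src { const unsigned char *data; size_t len, pos; };

int mem_getc(void *handle)
{
    struct mem_src *m = (struct mem_src *)handle;
    return m->pos < m->len ? m->data[m->pos++] : -1;
}
```

For instance, with the byte 0xA9 waiting in the source, calling my_utf8_getc with first_byte 0xC3 yields the code 0xE9 and increments the byte count by one.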

The default wcx_open hook will install a wcx_getc function according to the usage argument. The three default external decoding functions are also available to users through the SP_wcx_getc() function (see WCX Utility Functions).

SP_WcxPutcHook *wcx_putc
where typedef int (SP_WcxPutcHook) (int char_code, SP_stream *s, long *pbyte_count);

This function is generally invoked whenever a character has to be written to a stream. However, if the character code to be written is an ASCII character (its value is < 128), and WCX_PRESERVES_ASCII is in force, then the code is written directly on the stream, and wcx_putc is not invoked. Otherwise, wcx_putc is invoked with the character code and stream in question and is expected to do whatever is needed to output the character code to the stream.

This will require outputting one or more bytes to the stream. A byte, byte, can be written to the stream s in the following way:

               return_code = s->sputc(byte,s->user_handle);
                  

The wcx_putc function is expected to return the return value of the last invocation of s->sputc, or -1 as an error code, if incapable of outputting the character code. The latter may be the case, for example, if the code to be output does not belong to the character code set in force. It is also expected to increment its *pbyte_count argument by 1 for each byte written.
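
A matching sketch for the output side: a user-defined wcx_putc that emits UTF-8, returns the result of the last s->sputc call, returns -1 for a code it cannot output, and counts each byte written. As before, SP_stream is a cut-down stand-in for the real structure, and my_utf8_putc, mem_sink and mem_putc are hypothetical names. Restricting output to codes up to 0x10FFFF and treating a negative sputc result as failure are assumptions made for this example.

```c
#include <stddef.h>

/* Stand-in for the real SP_stream: only the fields used below. */
typedef struct SP_stream {
    int (*sputc)(int byte, void *handle);
    void *user_handle;
} SP_stream;

/* Encode char_code as UTF-8 and write it to the stream. */
int my_utf8_putc(int char_code, SP_stream *s, long *pbyte_count)
{
    unsigned char buf[4];
    int n, i, rc = -1;

    if (char_code < 0 || char_code > 0x10FFFF) return -1;  /* cannot output */
    if (char_code < 0x80) {
        buf[0] = (unsigned char)char_code; n = 1;
    } else if (char_code < 0x800) {
        buf[0] = (unsigned char)(0xC0 | (char_code >> 6));
        buf[1] = (unsigned char)(0x80 | (char_code & 0x3F)); n = 2;
    } else if (char_code < 0x10000) {
        buf[0] = (unsigned char)(0xE0 | (char_code >> 12));
        buf[1] = (unsigned char)(0x80 | ((char_code >> 6) & 0x3F));
        buf[2] = (unsigned char)(0x80 | (char_code & 0x3F)); n = 3;
    } else {
        buf[0] = (unsigned char)(0xF0 | (char_code >> 18));
        buf[1] = (unsigned char)(0x80 | ((char_code >> 12) & 0x3F));
        buf[2] = (unsigned char)(0x80 | ((char_code >> 6) & 0x3F));
        buf[3] = (unsigned char)(0x80 | (char_code & 0x3F)); n = 4;
    }
    for (i = 0; i < n; i++) {
        rc = s->sputc(buf[i], s->user_handle);
        if (rc < 0) return -1;      /* assumption: negative means failure */
        (*pbyte_count)++;           /* one more byte written */
    }
    return rc;                      /* result of the last sputc call */
}

/* Hypothetical in-memory byte sink, used only to try the encoder. */
struct mem_sink { unsigned char data[16]; size_t len; };

int mem_putc(int byte, void *handle)
{
    struct mem_sink *m = (struct mem_sink *)handle;
    if (m->len >= sizeof m->data) return -1;
    m->data[m->len++] = (unsigned char)byte;
    return byte;
}
```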

The default wcx_open hook function will install a wcx_putc function according to the usage argument. The three default external encoding functions are also available to users through the SP_wcx_putc() function (see WCX Utility Functions).

In making a decision regarding the selection of these WCX-processing functions, the context and option arguments of the wcx_open hook can be used. The option argument is an atom. The context argument encodes the context of invocation. It is one of the following values:

SP_STREAMHOOK_STDIN
SP_STREAMHOOK_STDOUT
SP_STREAMHOOK_STDERR
for the three standard streams,
SP_STREAMHOOK_OPEN
for streams created by open/[3,4]
SP_STREAMHOOK_NULL
for streams created by open_null_stream/1
SP_STREAMHOOK_LIB
for streams created from the libraries
SP_STREAMHOOK_C, SP_STREAMHOOK_C+1, ...
for streams created from C code via SP_make_stream()

The option argument comes from the user and it can carry some WCX-related information to be associated with the stream opened. For example, this can be used to implement a scheme supporting multiple encodings, supplied on a stream-by-stream basis, as shown in the sample WCX-box (see A Sample WCX Box).

If the stream is opened from Prolog code, the option argument for this hook function is derived from the wcx(Option) option of open/4 and load_files/2. If this option is not present, or the stream is opened using some other built-in predicate, then the value of the wcx Prolog flag will be passed on to the open hook.

If the stream is opened from C, via SP_make_stream(), then the option argument will be the value of the Prolog flag wcx.

There is also a variant of SP_make_stream(), called SP_make_stream_context(), which takes two additional arguments, the option and the context, to be passed on to the wcx_open hook (see WCX Foreign Interface).

The wcx_open hook can associate the information derived from option with the stream in question using a new field in the SP_stream data structure: void *wcx_info, initialized to NULL. If there is more information than can be stored in this field, or if the encoding to be implemented requires keeping track of a state, then the wcx_open hook should allocate a sufficient amount of memory for storing the information and/or the state, using SP_malloc(), and deposit a pointer to that piece of memory in wcx_info.

The default wcx_open hook function sets the wcx_getc and wcx_putc stream fields to functions performing the external decoding and encoding according to option. Permitted values for option are the same as for the SP_CTYPE environment variable; see WCX Environment Variables. If the option argument is not supported then the usage argument of SP_set_wcx_hooks() will be used instead.

Note that if option or usage is euc, there will be no attempt to translate between Unicode code points and EUC code points. For this reason it is probably not meaningful to mix EUC with any of the other supported encodings. You should not rely on this behavior, since future versions of SICStus may do a proper translation of EUC to and from Unicode.

As an example, if SP_CTYPE is utf8, you can load an ISO 8859/1 encoded Prolog file using load_files('file.pl', [wcx(iso_8859_1)]).

SP_WcxCloseHook *wcx_close
where typedef void (SP_WcxCloseHook) (SP_stream *s);

This hook function is called whenever a stream is closed for which the wcx_open hook was invoked at creation. The argument s points to the stream being closed. The hook can be used to implement the closing activities related to external encoding, e.g. freeing any memory allocated in the wcx_open hook.

The default wcx_close hook function does nothing.
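
The pairing of wcx_open and wcx_close around the wcx_info field can be sketched as follows. The types and the SP_malloc/SP_free macros below are stand-ins so that the fragment compiles on its own (real code would use the definitions from the SICStus header); the state structure and the names my_wcx_open and my_wcx_close are hypothetical. A real open hook would also set the wcx_getc and wcx_putc fields, which is omitted here.

```c
#include <stdlib.h>

/* Stand-ins so the sketch compiles on its own. */
typedef unsigned long SP_atom;
typedef struct SP_stream { void *wcx_info; } SP_stream;
#define SP_malloc malloc
#define SP_free   free

/* Hypothetical per-stream state kept by this WCX box. */
struct my_wcx_state {
    SP_atom option;     /* the option atom given when the stream was opened */
    int shift_state;    /* room for a stateful encoding */
};

void my_wcx_open(SP_stream *s, SP_atom option, int context)
{
    struct my_wcx_state *st = SP_malloc(sizeof *st);

    (void)context;                  /* not used in this sketch */
    if (st != NULL) {
        st->option = option;
        st->shift_state = 0;
    }
    s->wcx_info = st;               /* stays NULL if allocation failed */
    /* A real hook would also set s->wcx_getc and s->wcx_putc here. */
}

void my_wcx_close(SP_stream *s)
{
    if (s->wcx_info != NULL) {
        SP_free(s->wcx_info);       /* release what my_wcx_open allocated */
        s->wcx_info = NULL;
    }
}
```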

SP_WcxCharTypeHook *wcx_chartype
where typedef int (SP_WcxCharTypeHook) (int char_code);

This function should be prepared to take any char_code >= 128 and return one of the following constants:

CHT_LAYOUT_CHAR
for additional characters in the syntactic category layout-char,
CHT_SMALL_LETTER
for additional characters in the syntactic category small-letter,
CHT_CAPITAL_LETTER
for additional characters in the syntactic category capital-letter,
CHT_SYMBOL_CHAR
for additional characters in the syntactic category symbol-char,
CHT_SOLO_CHAR
for additional characters in the syntactic category solo-char.

Regarding the meaning of these syntactic categories, see Token String.

The value returned by this function is not expected to change over time; therefore, for efficiency reasons, its behavior is cached. The cache is cleared by SP_set_wcx_hooks().

As a help in implementing this function, SICStus Prolog provides the function SP_latin1_chartype(), which returns the character type category for the codes 1..255 according to the ISO 8859/1 standard.

Note that if a character code >= 512 is categorized as a layout-char, and a character with this code occurs within an atom being written out in quoted form (e.g. using writeq) in native sicstus mode (as opposed to iso mode), then this code will be output as itself, rather than an octal escape sequence. This is because in sicstus mode escape sequences consist of at most 3 octal digits.
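
A wcx_chartype hook might look like the sketch below. The CHT_ values are placeholders (the real constants come from the SICStus header), the name my_wcx_chartype is hypothetical, and the classification itself is only an illustration: it is not claimed to match what SP_latin1_chartype() returns, and treating every code above 255 as a small letter is an arbitrary choice.

```c
/* Placeholder values; the real constants come from the SICStus header. */
enum {
    CHT_LAYOUT_CHAR = 1, CHT_SMALL_LETTER, CHT_CAPITAL_LETTER,
    CHT_SYMBOL_CHAR, CHT_SOLO_CHAR
};

/* Illustrative classification of codes >= 128. */
int my_wcx_chartype(int char_code)
{
    if (char_code <= 0xA0)                          /* C1 controls, no-break space */
        return CHT_LAYOUT_CHAR;
    if (char_code == 0xD7 || char_code == 0xF7)     /* multiplication, division signs */
        return CHT_SYMBOL_CHAR;
    if (char_code >= 0xC0 && char_code <= 0xDE)     /* accented capitals */
        return CHT_CAPITAL_LETTER;
    if (char_code >= 0xDF && char_code <= 0xFF)     /* accented small letters */
        return CHT_SMALL_LETTER;
    if (char_code < 0x100)                          /* 0xA1..0xBF: punctuation, signs */
        return CHT_SYMBOL_CHAR;
    return CHT_SMALL_LETTER;                        /* arbitrary choice above 255 */
}
```

Because the result is cached, the function should be a pure function of char_code, as it is here.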

SP_WcxConvHook *wcx_to_os
where typedef char* (SP_WcxConvHook) (char *string, int context);

This function is normally called each time SICStus Prolog wishes to communicate a string of possibly wide characters to the operating system. However, if the string in question consists of ASCII characters only, and WCX_PRESERVES_ASCII is in force, then wcx_to_os may not be called, and the original string may be passed to the operating system.

The first argument of wcx_to_os is a zero-terminated string, using the internal encoding of SICStus Prolog, namely UTF-8. The function is expected to convert the string to the form required by the operating system, in the context described by the second argument, context, and to return the converted string. If no conversion is needed, it should simply return its first argument. Otherwise, the conversion should be done in a memory area controlled by this function (preferably a static buffer, reused each time the function is called).

The second argument specifies the context of conversion. It can be one of the following integer values:

WCX_FILE
the string is a file-name,
WCX_OPTION
the string is a command, a command line argument or an environment variable,
WCX_WINDOW_TITLE
the string is a window title,
WCX_C_CODE
the string is a C identifier (used, e.g. in the glue code)

SICStus Prolog provides a utility function SP_wci_code(), see below, for obtaining a wide character code from a UTF-8 encoded string, which can be used to implement the wcx_to_os hook function.

The default of the wcx_to_os function depends on the usage argument of SP_set_wcx_hooks(). If the value of usage includes WCX_OS_UTF8, then the function does no conversion, as the operating system uses the same encoding as SICStus Prolog. If the value of usage includes WCX_OS_8BIT, then the function decodes the UTF-8 encoded string and converts this sequence of codes into a sequence of bytes by truncating each code to 8 bits.

Note that the default wcx_to_os functions ignore their context argument.
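
To illustrate the contract of wcx_to_os, here is a sketch that mimics the "truncation to 8 bits" behavior with a static buffer. It is not the library's implementation. It uses a small local UTF-8 decoder (decode_utf8, a hypothetical helper) rather than SP_wci_code(), whose signature is not given in this section; the name my_wcx_to_os and the buffer size are also assumptions. Input longer than the buffer is silently cut off in this sketch.

```c
#include <stddef.h>

#define TO_OS_BUF_SIZE 1024   /* arbitrary size chosen for this sketch */

/* Decode one UTF-8 character at p; store its code and return the number
   of bytes consumed. Malformed input is passed through byte by byte. */
static size_t decode_utf8(const unsigned char *p, int *code)
{
    int c = p[0], extra;
    size_t i;

    if (c < 0x80) { *code = c; return 1; }
    else if ((c & 0xE0) == 0xC0) { c &= 0x1F; extra = 1; }
    else if ((c & 0xF0) == 0xE0) { c &= 0x0F; extra = 2; }
    else if ((c & 0xF8) == 0xF0) { c &= 0x07; extra = 3; }
    else { *code = p[0]; return 1; }

    for (i = 1; i <= (size_t)extra; i++) {
        /* Also stops at the terminating zero byte. */
        if ((p[i] & 0xC0) != 0x80) { *code = p[0]; return 1; }
        c = (c << 6) | (p[i] & 0x3F);
    }
    *code = c;
    return (size_t)extra + 1;
}

char *my_wcx_to_os(char *string, int context)
{
    static char buf[TO_OS_BUF_SIZE];   /* reused on every call */
    const unsigned char *p = (const unsigned char *)string;
    const unsigned char *q;
    size_t out = 0;

    (void)context;                     /* ignored, like the defaults */
    for (q = p; *q != 0 && *q < 0x80; q++)
        ;
    if (*q == 0) return string;        /* ASCII only: no conversion needed */

    while (*p != 0 && out < TO_OS_BUF_SIZE - 1) {
        int code;
        p += decode_utf8(p, &code);
        buf[out++] = (char)(code & 0xFF);   /* truncate to 8 bits */
    }
    buf[out] = '\0';
    return buf;
}
```

Note that truncation maps any code that is a multiple of 256 to a zero byte, which would end the converted string early; a production hook would have to decide how to treat such codes.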

SP_WcxConvHook *wcx_from_os
where typedef char* (SP_WcxConvHook) (char *string, int context);

This function is called each time SICStus Prolog receives from the operating system a zero-terminated sequence of bytes possibly encoding a wide character string. The function is expected to convert the byte sequence, if needed, to a string in the internal encoding of SICStus Prolog (UTF-8), and return the converted string. The conversion should be done in a memory area controlled by this function (preferably a static buffer, reused each time the function is called, but different from the buffer used in wcx_to_os).

The second argument specifies the context of conversion, as in the case of wcx_to_os.

SICStus Prolog provides a utility function SP_code_wci(), see below, for converting a character code (up to 31 bits) into UTF-8 encoding, which can be used to implement the wcx_from_os hook function.

The default of the wcx_from_os function depends on the usage argument of SP_set_wcx_hooks(). If the value of usage includes WCX_OS_UTF8, then the function does no conversion. If the value of usage includes WCX_OS_8BIT, then the function transforms the string of 8-bit codes into a UTF-8 encoded string.

Note that the default wcx_from_os functions ignore their context argument.
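
The opposite direction can be sketched in the same way: the function below turns a string of 8-bit codes into UTF-8 in its own static buffer, as the WCX_OS_8BIT default is described as doing. It is an illustration, not the library's code; the name my_wcx_from_os and the buffer size are assumptions, and the encoding is written out by hand rather than with SP_code_wci(), whose signature is not given in this section.

```c
#include <stddef.h>

#define FROM_OS_BUF_SIZE 2048   /* arbitrary size chosen for this sketch */

char *my_wcx_from_os(char *string, int context)
{
    static char buf[FROM_OS_BUF_SIZE];  /* separate from any wcx_to_os buffer */
    const unsigned char *p = (const unsigned char *)string;
    size_t out = 0;

    (void)context;                      /* ignored, like the defaults */

    /* Keep room for a two-byte sequence plus the terminating zero. */
    while (*p != 0 && out + 2 < FROM_OS_BUF_SIZE) {
        unsigned c = *p++;
        if (c < 0x80) {
            buf[out++] = (char)c;                       /* ASCII unchanged */
        } else {
            buf[out++] = (char)(0xC0 | (c >> 6));       /* leading byte */
            buf[out++] = (char)(0x80 | (c & 0x3F));     /* continuation byte */
        }
    }
    buf[out] = '\0';
    return buf;
}
```

Each byte in 128..255 expands to two bytes, so the output buffer needs up to twice the input length plus one; longer input is cut off in this sketch.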