Concepts

First, let us introduce some notions concerning wide characters.

(Wide) character code
an integer, possibly outside the 0..255 range.

SICStus Prolog allows character codes in the range 0..2147483647 (= 2^31-1). Consequently, the built-in predicates for building atoms from character codes and decomposing them into character codes (e.g. atom_codes/2 and name/2) accept and produce lists of integers in the above range (excluding the 0 code).

Wide characters can be used in all contexts: in atoms (single quoted, or unquoted, depending on the character-type mapping), strings, character code notation (0'char), etc.

External (stream) encoding
a way of encoding sequences of wide characters as sequences of (8-bit) bytes, used in stream input and output.

SICStus Prolog has three built-in external stream encoding schemes, selectable through an environment variable. Furthermore, it provides hooks for users to plug in their own external stream encoding functions. The built-in predicates put_code/1, get_code/1, etc., accept and return wide character codes, converting the bytes written or read using the external encoding in force.

Note that an encoding need not be able to handle the whole range of character codes allowed by SICStus Prolog.

Character code set
a subset of the set {0, ..., 2^31-1} that can be handled by an external encoding. SICStus Prolog assumes that the character code set is an extension of the ASCII code set, i.e. it includes codes 0..127, and these codes are interpreted as ASCII characters. Note that ASCII characters can still have an arbitrary external encoding, cf. the usage flag WCX_CHANGES_ASCII; see WCX Hooks.

Character type mapping
a function mapping each element of the character code set to one of the character categories (layout, small-letter, symbol-char, etc.; see Token String). This is required for parsing tokens. The character-type mapping for non-ASCII characters is hookable in SICStus Prolog and has three built-in defaults, depending on the external encoding selected.

System encoding
a way of encoding wide character strings, used or required by the operating system environment in various contexts (e.g. file names in open/3, or command line options as returned by prolog_flag(argv, Flags)). The system encoding is hookable in SICStus Prolog and has two built-in defaults.

Internal encoding
a way of encoding wide character strings internally within the SICStus Prolog system. This is of interest to the user only if the foreign language interface is used in the program, or a system encoding hook function needs to be written. SICStus Prolog has a fixed internal encoding, which is UTF-8.
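
For readers writing foreign code, the following is a minimal sketch of how a single character code can be turned into UTF-8 bytes. It assumes the original UTF-8 scheme with sequences of up to 6 bytes, which covers the full 0..2^31-1 range; whether SICStus Prolog uses exactly this byte layout for codes above 0x10FFFF is an assumption here, and the function name utf8_encode is hypothetical, not part of any SICStus interface.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a SICStus API): encode one character code
   in the range 0..2^31-1 as UTF-8, using the original scheme of up to
   6 bytes. Writes the bytes to out and returns how many were written,
   or 0 if the code is out of range. */
size_t utf8_encode(uint32_t code, unsigned char out[6])
{
    size_t n, i;
    unsigned char lead;

    if (code < 0x80) {              /* ASCII: encoded as itself */
        out[0] = (unsigned char)code;
        return 1;
    }
    if      (code < 0x800)       { n = 2; lead = 0xC0; }
    else if (code < 0x10000)     { n = 3; lead = 0xE0; }
    else if (code < 0x200000)    { n = 4; lead = 0xF0; }
    else if (code < 0x4000000)   { n = 5; lead = 0xF8; }
    else if (code <= 0x7FFFFFFF) { n = 6; lead = 0xFC; }
    else return 0;                  /* beyond 2^31-1 */

    /* Fill the continuation bytes from the last one backwards,
       6 payload bits per byte. */
    for (i = n - 1; i > 0; i--) {
        out[i] = (unsigned char)(0x80 | (code & 0x3F));
        code >>= 6;
    }
    out[0] = (unsigned char)(lead | code);  /* remaining high bits */
    return n;
}
```

For instance, calling utf8_encode(0xE9, buf) stores the two bytes 0xC3 0xA9 in buf and returns 2, while every code in 0..127 yields a single byte equal to the code itself.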

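As discussed above, there are several points where the user can influence the behavior of SICStus Prolog. The user can decide on the character code set, the character-type mapping, the external encoding, and the system encoding.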

Let us call WCX mode a particular setting of these parameters.

Note that the selection of the character code set is conceptual only and need not be communicated to SICStus Prolog, as the decision materializes in the functions for the mapping and encodings.
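
To illustrate how the character code set is implied by the encoding functions rather than declared separately, here is a minimal sketch of a one-byte-per-character (Latin-1 style) external encoding. The names latin1_encode and latin1_decode are hypothetical; the actual hook function signatures are described in WCX Hooks and are not reproduced here.

```c
#include <stdint.h>

/* Hypothetical external encoder (not a SICStus API): one byte per
   character code. Returns 1 and stores the byte on success, or 0 if
   the code lies outside what this encoding can represent. */
int latin1_encode(uint32_t code, unsigned char *byte)
{
    if (code > 0xFF)
        return 0;               /* not in this encoding's code set */
    *byte = (unsigned char)code;
    return 1;
}

/* Hypothetical decoder: every byte maps to a code in 0..255. */
uint32_t latin1_decode(unsigned char byte)
{
    return byte;
}
```

Nothing in this sketch names the character code set, yet it is fully determined: the encoder accepts exactly the codes 0..255, which include the ASCII codes 0..127 as required.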