Text Stream Encodings

SICStus Prolog supports character codes up to 31 bits wide. Codes that fall within the Unicode range are interpreted as Unicode code points.

When a character code (a “code point” in Unicode terminology) is written to a stream, it must be encoded into a byte sequence; when it is read from a stream, it must be decoded from one. The method by which each character code is encoded to or decoded from a byte sequence is called a “character encoding”.
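As a worked illustration of what an encoding does, the following sketch computes the UTF-8 byte sequence for a character code using plain Prolog arithmetic. The predicates utf8_bytes/2 and utf8_bytes_/2 are hypothetical helpers written for this example, not part of SICStus Prolog, and the sketch only handles codes up to 0x10FFFF without checking for surrogates.

```prolog
% utf8_bytes(+Code, -Bytes)
% Bytes is the list of byte values that encode Code in UTF-8.
% Hypothetical helper for illustration only.
utf8_bytes(Code, Bytes) :-
    integer(Code),
    Code >= 0,
    utf8_bytes_(Code, Bytes).

% One byte: 0xxxxxxx
utf8_bytes_(C, [C]) :-
    C < 0x80, !.
% Two bytes: 110xxxxx 10xxxxxx
utf8_bytes_(C, [B1,B2]) :-
    C < 0x800, !,
    B1 is 0xC0 \/ (C >> 6),
    B2 is 0x80 \/ (C /\ 0x3F).
% Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
utf8_bytes_(C, [B1,B2,B3]) :-
    C < 0x10000, !,
    B1 is 0xE0 \/ (C >> 12),
    B2 is 0x80 \/ ((C >> 6) /\ 0x3F),
    B3 is 0x80 \/ (C /\ 0x3F).
% Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
utf8_bytes_(C, [B1,B2,B3,B4]) :-
    C < 0x110000,
    B1 is 0xF0 \/ (C >> 18),
    B2 is 0x80 \/ ((C >> 12) /\ 0x3F),
    B3 is 0x80 \/ ((C >> 6) /\ 0x3F),
    B4 is 0x80 \/ (C /\ 0x3F).
```

For example, the goal utf8_bytes(0xE9, Bytes) binds Bytes to [195,169], whereas the same character code occupies the single byte 233 in Latin 1.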

The following character encodings are currently supported by SICStus Prolog.

ANSI_X3.4-1968
The 7-bit subset of Unicode, commonly referred to as ASCII.
ISO-8859-1
The 8-bit subset of Unicode, commonly referred to as Latin 1.
ISO-8859-2
A variant of ISO-8859-1, commonly referred to as Latin 2.
ISO-8859-15
A variant of ISO-8859-1, commonly referred to as Latin 9.
windows 1252
The Microsoft Windows code page 1252.
UTF-8
UTF-16
UTF-16LE
UTF-16BE
UTF-32
UTF-32LE
UTF-32BE
The suffixes LE and BE denote little endian and big endian, respectively.

The UTF encodings can be auto-detected if a Unicode signature is present in a file opened for reading. A Unicode signature is also known as a byte order mark (BOM).

In addition, it is possible to use all alternative names defined by the IANA registry http://www.iana.org/assignments/character-sets.

All encodings in the table above, except the UTF-XXX encodings, support the reposition(true) option to open/4 (see mpg-ref-open).
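The following is a minimal sketch of how a repositionable text stream might be used with one of the fixed-width encodings. It assumes the built-in predicates stream_position/2 and set_stream_position/2 for saving and restoring the position; the predicate name first_code_twice/3 is hypothetical.

```prolog
% first_code_twice(+File, -C1, -C2)
% Reads the first character code of File, rewinds to the saved
% position, and reads it again, so C1 and C2 should be equal.
% Hypothetical helper for illustration only.
first_code_twice(File, C1, C2) :-
    open(File, read, Stream,
         [encoding('ISO-8859-1'), reposition(true)]),
    call_cleanup(( stream_position(Stream, Start),
                   get_code(Stream, C1),
                   set_stream_position(Stream, Start),
                   get_code(Stream, C2)
                 ),
                 close(Stream)).
```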

The encoding can be specified with the encoding/1 option to open/4 and similar predicates. When opening a file for input, the encoding can often be determined automatically. If no encoding is specified and none can be detected from the file contents, the default is ISO-8859-1.
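As a sketch of the encoding/1 option in use, the predicate below opens a file with an explicitly chosen encoding and collects its character codes. The names read_codes_with_encoding/3 and read_all_codes/2 are hypothetical helpers, not built-in predicates.

```prolog
% read_codes_with_encoding(+File, +Encoding, -Codes)
% Codes is the list of character codes in File, decoded
% using Encoding (an atom such as 'UTF-8').
% Hypothetical helper for illustration only.
read_codes_with_encoding(File, Encoding, Codes) :-
    open(File, read, Stream, [encoding(Encoding)]),
    call_cleanup(read_all_codes(Stream, Codes),
                 close(Stream)).

% Collect codes until get_code/2 returns -1 at end of file.
read_all_codes(Stream, Codes) :-
    get_code(Stream, C),
    (   C =:= -1 ->
        Codes = []
    ;   Codes = [C|Rest],
        read_all_codes(Stream, Rest)
    ).
```

A call such as read_codes_with_encoding('notes.txt', 'UTF-8', Codes) would then decode the file as UTF-8; the file name here is only a placeholder.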

The encoding used by a text stream can be queried using stream_property/2.
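For instance, the following sketch opens a file without specifying an encoding and reports which encoding the stream ended up with. It assumes that the relevant stream property has the form encoding(Encoding); the predicate name file_encoding/2 is hypothetical.

```prolog
% file_encoding(+File, -Encoding)
% Encoding is the encoding selected for File when it is opened
% for reading with no explicit encoding/1 option.
% Hypothetical helper for illustration only.
file_encoding(File, Encoding) :-
    open(File, read, Stream, []),
    call_cleanup(once(stream_property(Stream, encoding(Encoding))),
                 close(Stream)).
```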

See mpg-ref-open for details on how character encoding is auto-detected when opening text files.
