SICStus Prolog supports wide characters (up to 31 bits wide), interpreted as a superset of Unicode.
Each character in the code set has to be classified as belonging to one of the character categories, such as small-letter, digit, etc. This classification is called the character-type mapping, and it is used for defining the syntax of tokens.
Only character codes 0..255, i.e. the ISO-8859-1 (Latin 1) subset of Unicode, can be part of unquoted tokens[1], unless the Prolog flag legacy_char_classification is set; see ref-lps-flg. This restriction may be lifted in the future.
For quoted tokens, i.e. quoted atoms and strings, almost any sequence of code points assigned to non-private abstract characters in Unicode 5.0 is allowed. The only disallowed characters are those in the whitespace-char category, with the exception of space (character code 32), which is allowed even though it is a whitespace-char.
An additional restriction is that the sequence of characters that makes up a quoted token must be in Normalization Form C (NFC); see http://www.unicode.org/reports/tr15/. This is currently not enforced. A future release may enforce this restriction or perform the normalization automatically.
NFC is the normalization form used on the web (http://www.w3.org/TR/charmod/) and what most software can be expected to produce by default. Any sequence consisting of only characters from Latin 1 is already in NFC.
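As a small illustration of how a character outside Latin 1 can still be written in a quoted atom, the following sketch uses a hexadecimal escape sequence for GREEK SMALL LETTER ALPHA (code 945). The predicate names alpha_atom/1 and check_alpha/0 are hypothetical and exist only for this example.

```prolog
% alpha_atom(-Atom): Atom is the one-character atom whose only character
% is GREEK SMALL LETTER ALPHA, written with a hex escape sequence.
alpha_atom('\x3b1\').

% check_alpha: succeeds if the escaped atom has exactly one character
% and that character has code 16'3b1 (945 decimal).
check_alpha :-
    alpha_atom(Atom),
    atom_codes(Atom, [Code]),
    Code =:= 0x3b1,
    atom_length(Atom, 1).
```

Because the source text of this example contains only Latin 1 characters, it is already in NFC.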
When the Prolog flag legacy_char_classification is set, characters in the whitespace-char category are still treated as whitespace, but all other character codes outside the range 0..255 that are assigned to non-private abstract characters in Unicode 5.0 are treated as lower case. Such characters can therefore appear as themselves, without using escape sequences, in both quoted and unquoted tokens.
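A program that depends on this behavior may want to inspect the flag first. The sketch below is one possible way to do so; legacy_classification/1 is a hypothetical helper, and the call is wrapped in catch/3 on the assumption that a system without this flag either fails or raises an error when asked for it.

```prolog
% legacy_classification(-State): State is the current value of the
% legacy_char_classification flag, or the atom 'unknown' if the running
% system does not define that flag.
legacy_classification(State) :-
    (   catch(current_prolog_flag(legacy_char_classification, Value), _, fail)
    ->  State = Value
    ;   State = unknown
    ).
```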
Note: Any output produced by write_term/2 with the option quoted(true) will be in NFC. This includes output from writeq/[1,2] and write_canonical/[1,2].
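The following sketch shows which of the output predicates named in the note produce quoted output. It uses an atom that needs quotes because it contains a space. It does not demonstrate normalization itself, since the atom consists of Latin 1 characters only. The predicate show_quoting/0 is a hypothetical name.

```prolog
% show_quoting: write the same atom with and without quoting.
show_quoting :-
    Atom = 'hello world',
    write(Atom), nl,                       % prints: hello world
    writeq(Atom), nl,                      % prints: 'hello world'
    write_term(Atom, [quoted(true)]), nl,  % prints: 'hello world'
    write_canonical(Atom), nl.             % prints: 'hello world'
```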
If the Prolog flag legacy_char_classification (see ref-lps-flg) is set, then the small-letter set will also include almost every code point above 255 assigned to non-private abstract characters in Unicode 5.0.
The symbol-char category includes the characters:
+ - * / \ ^ < > = ~ : . ? @ # $ &
In addition, the non-ASCII character codes 161..169, 171..185, 187..191, 215, and 247 belong to this character type[2].
The punctuation-char category consists of the characters:
% ( ) , [ ] { | }
Other characters are unclassified and may only appear in comments and, to some extent, as discussed above, in quoted atoms and strings.
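A character-type mapping of this kind can be expressed directly as a predicate over character codes. The sketch below covers only the symbol-char category, assuming it is made up of the seventeen characters and the non-ASCII code ranges listed above. The names symbol_char_code/1 and ascii_symbol_codes/1 are hypothetical, and memberchk/2 is assumed to be available from library(lists).

```prolog
:- use_module(library(lists)).

% ascii_symbol_codes(-Codes): the codes of the ASCII symbol characters.
% Inside the quoted atom, \\ denotes a single backslash.
ascii_symbol_codes(Codes) :-
    atom_codes('+-*/\\^<>=~:.?@#$&', Codes).

% symbol_char_code(+Code): Code belongs to the symbol-char category.
symbol_char_code(Code) :-
    integer(Code),
    (   ascii_symbol_codes(Codes), memberchk(Code, Codes) -> true
    ;   Code >= 161, Code =< 169 -> true
    ;   Code >= 171, Code =< 185 -> true
    ;   Code >= 187, Code =< 191 -> true
    ;   Code =:= 215 -> true
    ;   Code =:= 247
    ).
```

For example, symbol_char_code(0'+) and symbol_char_code(247) succeed, while symbol_char_code(0'a) and symbol_char_code(170) fail.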
token ::= name
        | natural-number
        | unsigned-float
        | variable
        | string
        | punctuation-char
        | whitespace-text
        | full-stop

name ::= quoted-name
       | word
       | symbol
       | solo-char
       | [ ?whitespace-text ]
       | { ?whitespace-text }

word ::= small-letter ?alpha...

symbol ::= symbol-char...    { except in the case of a full-stop or where the first 2 chars are ‘/*’ }

natural-number ::= digit...
                 | base-prefix alpha...    { where each alpha must be a digit of the base indicated by base-prefix, treating a,b,... and A,B,... as 10,11,... }
                 | 0 ' char-item           { yielding the character code for char-item }

unsigned-float ::= simple-float
                 | simple-float exp exponent

simple-float ::= digit... . digit...

exp ::= e | E

exponent ::= digit... | sign digit...

sign ::= - | +

variable ::= underline ?alpha...
           | capital-letter ?alpha...

string ::= " ?string-item... "

string-item ::= quoted-char          { other than ‘"’ or ‘\’ }
              | ""
              | \ escape-sequence

quoted-atom ::= ' ?quoted-item... '

quoted-item ::= quoted-char          { other than ‘'’ or ‘\’ }
              | ''
              | \ escape-sequence

backquoted-atom ::= ` ?backquoted-item... `

backquoted-item ::= quoted-char      { other than ‘`’ or ‘\’ }
                  | ``
                  | \ escape-sequence

whitespace-text ::= whitespace-text-item...

whitespace-text-item ::= whitespace-char | comment

comment ::= /* ?char... */       { where ?char... must not contain ‘*/’ }
          | % ?char... <LFD>     { where ?char... must not contain <LFD> }

full-stop ::= .    { the following token, if any, must be whitespace-text }

char ::= whitespace-char
       | printing-char

printing-char ::= alpha
                | symbol-char
                | solo-char
                | punctuation-char
                | quote-char

alpha ::= capital-letter | small-letter | digit | underline

escape-sequence ::= b    { backspace, character code 8 }
                  | t    { horizontal tab, character code 9 }
                  | n    { newline, character code 10 }
                  | v    { vertical tab, character code 11 }
                  | f    { form feed, character code 12 }
                  | r    { carriage return, character code 13 }
                  | e    { escape, character code 27 }
                  | d    { delete, character code 127 }
                  | a    { alarm, character code 7 }
                  | other-escape-sequence

quoted-name ::= quoted-atom
              | backquoted-atom

base-prefix ::= 0b    { indicates base 2 }
              | 0o    { indicates base 8 }
              | 0x    { indicates base 16 }

char-item ::= quoted-item

other-escape-sequence ::= x alpha... \    { hex character code; each alpha must be in the range [0..15], treating a,b,... and A,B,... as 10,11,... }
                        | digit... \      { octal character code; each digit must be in the range [0..7] }
                        | <LFD>           { ignored }
                        | \               { stands for itself }
                        | '               { stands for itself }
                        | "               { stands for itself }
                        | `               { stands for itself }

quoted-char ::= <SPC>
              | printing-char
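A few of the productions above can be checked with ordinary goals. The sketch below is only an illustration under standard Prolog reading rules. The predicate names number_forms/0, doubled_quote/0 and escapes/0 are hypothetical, and each one exercises one group of productions.

```prolog
% natural-number: base-prefix forms and the 0'c character-code form.
% All four literals on the right denote the integers 255 or 97.
number_forms :-
    255 =:= 0xff,
    255 =:= 0o377,
    255 =:= 0b11111111,
    97  =:= 0'a.

% quoted-item: a doubled quote and an escaped quote both stand for a
% single quote character, so the two atoms below are identical.
doubled_quote :-
    atom_codes('don''t', Codes),
    atom_codes('don\'t', Codes),
    atom_length('don''t', 5).

% escape-sequence: named control characters, and hex and octal
% character codes, each terminated by a closing backslash.
escapes :-
    atom_codes('\t\n', [9, 10]),
    atom_codes('\x41\', [65]),
    atom_codes('\101\', [65]).
```

The examples avoid the e and d escapes, since they may not be accepted by every Prolog system.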
[1] Characters outside this range can still be included in quoted atoms and strings by using escape sequences (see ref-syn-syn-esc).
[2] In release 3 and 4.0.0 the lower case characters 170 and 186 were incorrectly classified as symbol-char. This was corrected in release 4.0.1.