SICStus Prolog supports wide characters (up to 31 bits wide), interpreted as a superset of Unicode.
Each character in the code set has to be classified as belonging to one of the character categories, such as small-letter, digit, etc. This classification is called the character-type mapping, and it is used for defining the syntax of tokens.
Only character codes 0..255, i.e. the ISO-8859-1 (Latin 1) subset of Unicode, can be part of unquoted tokens (3), unless the Prolog flag legacy_char_classification is set; see ref-lps-flg. This restriction may be lifted in the future.
For quoted tokens, i.e. quoted atoms and strings, almost any sequence of code points assigned to non-private abstract characters in Unicode 5.0 is allowed. The disallowed characters are those in the whitespace-char category except that space (character code 32) is allowed despite it being a whitespace-char.
An additional restriction is that the sequence of characters that makes up a quoted token must be in Normal Form C (NFC); see http://www.unicode.org/reports/tr15/. This is currently not enforced. A future release may enforce this restriction or perform this normalization automatically.
NFC is the normalization form used on the web (http://www.w3.org/TR/charmod/) and what most software can be expected to produce by default. Any sequence consisting of only characters from Latin 1 is already in NFC.
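The NFC requirement can be checked outside Prolog before text is pasted into a quoted atom or string. The following is a minimal sketch in Python using the standard `unicodedata` module; the helper name `is_nfc` is hypothetical, and Python's Unicode tables are newer than Unicode 5.0, so results for recently assigned characters may differ from what SICStus assumes.

```python
import unicodedata

def is_nfc(text):
    """Return True if text is unchanged by NFC normalization."""
    return unicodedata.normalize("NFC", text) == text

# 'e' followed by COMBINING ACUTE ACCENT (U+0301) is not in NFC ...
decomposed = "e\u0301"
# ... normalizing composes it into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

# Text made only of Latin 1 characters is already in NFC.
latin1_ok = all(is_nfc(chr(code)) for code in range(256))
```

A string for which `is_nfc` returns False can be passed through `unicodedata.normalize("NFC", ...)` before it is written into Prolog source.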
When the Prolog flag legacy_char_classification is set, characters in the whitespace-char category are still treated as whitespace but other character codes outside the range 0..255, assigned to non-private abstract characters in Unicode 5.0, are treated as lower case. Such characters can therefore appear as themselves, without using escape sequences, both in quoted and unquoted tokens.
Note: Any output produced by write_term/2 with the option quoted(true) will be in NFC. This includes output from writeq/[1,2] and write_canonical/[1,2].
whitespace-char
These are character codes 0..32, 127..160, 8206..8207, and 8232..8233. This includes ASCII characters such as TAB, LFD, and SPC, as well as all characters with Unicode property “Pattern_Whitespace” including the Unicode-specific LINE SEPARATOR (8232).
small-letter
These are character codes 97..122, i.e. the letters ‘a’ through ‘z’, as well as the non-ASCII character codes 170, 186, 223..246, and 248..255.
If the Prolog flag legacy_char_classification (see ref-lps-flg) is set, then the small-letter set will also include almost every code point above 255 assigned to non-private abstract characters in Unicode 5.0.
capital-letter
These are character codes 65..90, i.e. the letters ‘A’ through ‘Z’, as well as the non-ASCII character codes 192..214, and 216..222.
digit
These are character codes 48..57, i.e. the digits ‘0’ through ‘9’.
symbol-char
These are character codes 35, 36, 38, 42, 43, 45..47, 58, 60..64, 92, 94, and 126, i.e. the characters:
+ - * / \ ^ < > = ~ : . ? @ # $ &
In addition, the non-ASCII character codes 161..169, 171..185, 187..191, 215, and 247 belong to this character type (4).
solo-char
These are character codes 33 and 59, i.e. the characters ‘!’ and ‘;’.
punctuation-char
These are character codes 37, 40, 41, 44, 91, 93, and 123..125, i.e. the characters:
% ( ) , [ ] { | }
quote-char
These are character codes 34 and 39, i.e. the characters ‘"’ and ‘'’.
underline
This is character code 95, i.e. the character ‘_’.
Other characters are unclassified and may only appear in comments and, to some extent (as discussed above), in quoted atoms and strings.
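The character-type mapping described above can be written down as a small lookup table. The following Python sketch transcribes the listed code ranges; it assumes the legacy_char_classification flag is not set, and the names `CATEGORIES`, `_expand`, and `char_category` are hypothetical helpers for illustration, not part of SICStus.

```python
def _expand(*parts):
    """Expand single codes and inclusive (lo, hi) pairs into a set of codes."""
    codes = set()
    for part in parts:
        if isinstance(part, tuple):
            codes.update(range(part[0], part[1] + 1))
        else:
            codes.add(part)
    return codes

# Ranges copied from the category descriptions above.
CATEGORIES = [
    ("whitespace-char", _expand((0, 32), (127, 160), (8206, 8207), (8232, 8233))),
    ("small-letter", _expand((97, 122), 170, 186, (223, 246), (248, 255))),
    ("capital-letter", _expand((65, 90), (192, 214), (216, 222))),
    ("digit", _expand((48, 57))),
    ("symbol-char", _expand(35, 36, 38, 42, 43, (45, 47), 58, (60, 64), 92, 94,
                            126, (161, 169), (171, 185), (187, 191), 215, 247)),
    ("solo-char", _expand(33, 59)),
    ("punctuation-char", _expand(37, 40, 41, 44, 91, 93, (123, 125))),
    ("quote-char", _expand(34, 39)),
    ("underline", _expand(95)),
]

def char_category(code):
    """Return the category name for a character code, or 'unclassified'."""
    for name, codes in CATEGORIES:
        if code in codes:
            return name
    return "unclassified"
```

For example, `char_category(ord("a"))` gives `"small-letter"` and `char_category(8232)` gives `"whitespace-char"`; any code not listed in the table is reported as `"unclassified"`.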
token                 ::= name
                        | natural-number
                        | unsigned-float
                        | variable
                        | string
                        | punctuation-char
                        | whitespace-text
                        | full stop
name                  ::= quoted-name
                        | word
                        | symbol
                        | solo-char
                        | [ ?whitespace-text ]
                        | { ?whitespace-text }
word                  ::= small-letter ?alpha…
symbol                ::= symbol-char…  { except in the case of a full stop or where the first 2 chars are ‘/*’ }
natural-number        ::= digit…
                        | base-prefix alpha…  { where each alpha must be digits of the base indicated by base-prefix, treating a,b,… and A,B,… as 10,11,… }
                        | 0 ' char-item  { yielding the character code for char }
unsigned-float        ::= simple-float
                        | simple-float exp exponent
simple-float          ::= digit… . digit…
exp                   ::= e | E
exponent              ::= digit… | sign digit…
sign                  ::= - | +
variable              ::= underline ?alpha…
                        | capital-letter ?alpha…
string                ::= " ?string-item… "
string-item           ::= quoted-char  { other than ‘"’ or ‘\’ }
                        | ""
                        | \ escape-sequence
quoted-atom           ::= ' ?quoted-item… '
quoted-item           ::= quoted-char  { other than ‘'’ or ‘\’ }
                        | ''
                        | \ escape-sequence
whitespace-text       ::= whitespace-text-item…
whitespace-text-item  ::= whitespace-char | comment
comment               ::= /* ?char… */  { where ?char… must not contain ‘*/’ }
                        | % ?char… LFD  { where ?char… must not contain LFD }
full stop             ::= .  { the following token, if any, must be whitespace-text }
char                  ::= whitespace-char
                        | printing-char
printing-char         ::= alpha
                        | symbol-char
                        | solo-char
                        | punctuation-char
                        | quote-char
alpha                 ::= capital-letter | small-letter | digit | underline
escape-sequence       ::= b  { backspace, character code 8 }
                        | t  { horizontal tab, character code 9 }
                        | n  { newline, character code 10 }
                        | v  { vertical tab, character code 11 }
                        | f  { form feed, character code 12 }
                        | r  { carriage return, character code 13 }
                        | e  { escape, character code 27 }
                        | d  { delete, character code 127 }
                        | a  { alarm, character code 7 }
                        | other-escape-sequence
quoted-name           ::= quoted-atom
base-prefix           ::= 0b  { indicates base 2 }
                        | 0o  { indicates base 8 }
                        | 0x  { indicates base 16 }
char-item             ::= quoted-item
other-escape-sequence ::= x alpha… \  { treating a,b,… and A,B,… as 10,11,…, in the range [0..15], hex character code }
                        | digit… \  { in the range [0..7], octal character code }
                        | LFD  { ignored }
                        | \  { stands for itself }
                        | '  { stands for itself }
                        | "  { stands for itself }
                        | `  { stands for itself }
quoted-char           ::= SPC
                        | printing-char
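To make the natural-number and escape-sequence rules concrete, the following Python sketch computes the value denoted by a token. It is an illustration of the grammar as written above, not the SICStus reader: the function names are hypothetical, only ASCII digits and letters are accepted, and the character codes come from the escape-sequence table.

```python
import string

SIMPLE_ESCAPES = {"b": 8, "t": 9, "n": 10, "v": 11, "f": 12,
                  "r": 13, "e": 27, "d": 127, "a": 7}
BASE_PREFIXES = {"0b": 2, "0o": 8, "0x": 16}
HEX = set(string.hexdigits)
ALNUM = set(string.digits + string.ascii_letters)

def decode_escape(body):
    """Decode the text that follows a backslash.

    Returns a character code, or None for the ignored '\\ LFD' form.
    """
    if body in SIMPLE_ESCAPES:
        return SIMPLE_ESCAPES[body]
    if body == "\n":                       # backslash-LFD is ignored
        return None
    if body in ("\\", "'", '"', "`"):      # these stand for themselves
        return ord(body)
    if len(body) > 1 and body.endswith("\\"):
        digits = body[:-1]                 # drop the closing backslash
        if digits[0] == "x" and len(digits) > 1 and set(digits[1:]) <= HEX:
            return int(digits[1:], 16)     # x alpha... \  (hex code)
        if set(digits) <= set("01234567"):
            return int(digits, 8)          # digit... \    (octal code)
    raise ValueError("not an escape sequence: %r" % body)

def natural_number_value(token):
    """Return the integer denoted by a natural-number token."""
    if token.startswith("0'"):             # 0 ' char-item
        item = token[2:]
        if item == "''":                   # doubled quote denotes one quote
            return ord("'")
        if item.startswith("\\"):
            code = decode_escape(item[1:])
            if code is None:
                raise ValueError("escape denotes no character: %r" % token)
            return code
        if len(item) == 1 and item not in "'\\":
            return ord(item)
        raise ValueError("bad char-item: %r" % token)
    prefix, rest = token[:2], token[2:]
    if prefix in BASE_PREFIXES and rest and set(rest) <= ALNUM:
        return int(rest, BASE_PREFIXES[prefix])   # rejects out-of-range digits
    if token and set(token) <= set(string.digits):
        return int(token)
    raise ValueError("not a natural-number token: %r" % token)
```

Under these assumptions, `natural_number_value("0x1F")` yields 31 and `natural_number_value("0'a")` yields 97. The escape decoder expects the text after the backslash, including the closing backslash of the hex and octal forms.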
(3) Characters outside the range 0..255 can still be included in quoted atoms and strings by using escape sequences (see ref-syn-syn-esc).
(4) In releases 3 and 4.0.0 the lower case characters 170 and 186 were incorrectly classified as symbol-char. This was corrected in release 4.0.1.