4.1.7.5 Syntax of Tokens as Character Strings

SICStus Prolog supports wide characters (up to 31 bits wide), interpreted as a superset of Unicode.

Each character in the code set has to be classified as belonging to one of the character categories, such as small-letter, digit, etc. This classification is called the character-type mapping, and it is used for defining the syntax of tokens.

Only character codes 0..255, i.e. the ISO-8859-1 (Latin 1) subset of Unicode, can be part of unquoted tokens[1], unless the Prolog flag legacy_char_classification is set; see ref-lps-flg. This restriction may be lifted in the future.

For quoted tokens, i.e. quoted atoms and strings, almost any sequence of code points assigned to non-private abstract characters in Unicode 5.0 is allowed. The disallowed characters are those in the whitespace-char category except that space (character code 32) is allowed despite it being a whitespace-char.

An additional restriction is that the sequence of characters that makes up a quoted token must be in Normalization Form C (NFC); see http://www.unicode.org/reports/tr15/. This restriction is currently not enforced. A future release may enforce it or perform the normalization automatically.

NFC is the normalization form used on the web (http://www.w3.org/TR/charmod/) and what most software can be expected to produce by default. Any sequence consisting of only characters from Latin 1 is already in NFC.
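Since any Latin 1 sequence is already in NFC, checking that every character code is at most 255 is a cheap sufficient (but not necessary) test. The following is a minimal sketch of such a test; latin1_atom/1 and latin1_codes/1 are hypothetical helper names, not SICStus built-ins.

```prolog
% latin1_atom(+Atom): succeeds if every character code of Atom is in 0..255.
% Hypothetical helper. Success means the text is already in NFC;
% failure says nothing about normalization either way.
latin1_atom(Atom) :-
    atom_codes(Atom, Codes),
    latin1_codes(Codes).

latin1_codes([]).
latin1_codes([C|Cs]) :-
    C >= 0,
    C =< 255,
    latin1_codes(Cs).
```

An atom containing, say, code 945 (Greek small alpha) makes latin1_atom/1 fail, in which case a real normalization check would be needed.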

When the Prolog flag legacy_char_classification is set, characters in the whitespace-char category are still treated as whitespace, but other character codes outside the range 0..255 that are assigned to non-private abstract characters in Unicode 5.0 are treated as lower case. Such characters can therefore appear as themselves, without using escape sequences, in both quoted and unquoted tokens.
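Because the classification of characters above 255 depends on this flag, a program may want to inspect it before reading such text. The sketch below assumes the flag can be read with the standard current_prolog_flag/2; the text above does not list the flag's possible values, so the code only prints whatever value is reported. show_char_classification/0 is a hypothetical name.

```prolog
% Print the value of legacy_char_classification, if the running system
% knows such a flag. The catch/3 guards against systems that raise an
% error for unknown flag names.
show_char_classification :-
    catch(current_prolog_flag(legacy_char_classification, Value), _, fail),
    !,
    format('legacy_char_classification = ~w~n', [Value]).
show_char_classification :-
    format('legacy_char_classification is not available~n', []).
```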

Note: Any output produced by write_term/2 with the option quoted(true) will be in NFC. This includes output from writeq/[1,2] and write_canonical/[1,2].

whitespace-char
These are character codes 0..32, 127..160, 8206..8207, and 8232..8233. This includes ASCII characters such as <TAB>, <LFD>, and <SPC>, as well as all characters with Unicode property “Pattern_Whitespace” including the Unicode-specific <LINE SEPARATOR> (8232).
small-letter
These are character codes 97..122, i.e. the letters ‘a’ through ‘z’, as well as the non-ASCII character codes 170, 186, 223..246, and 248..255.

If the Prolog flag legacy_char_classification (see ref-lps-flg) is set then the small-letter set will also include almost every code point above 255 assigned to non-private abstract characters in Unicode 5.0.

capital-letter
These are character codes 65..90, i.e. the letters ‘A’ through ‘Z’, as well as the non-ASCII character codes 192..214, and 216..222.
digit
These are character codes 48..57, i.e. the digits ‘0’ through ‘9’.
symbol-char
These are character codes 35, 36, 38, 42, 43, 45..47, 58, 60..64, 92, 94, and 126, i.e. the characters:
          + - * / \ ^ < > = ~ : . ? @ # $ &

In addition, the non-ASCII character codes 161..169, 171..185, 187..191, 215, and 247 belong to this character type[2].

solo-char
These are character codes 33 and 59, i.e. the characters ‘!’ and ‘;’.
punctuation-char
These are character codes 37, 40, 41, 44, 91, 93, and 123..125, i.e. the characters:
          % ( ) , [ ] { | }

quote-char
These are character codes 34, 39, and 96, i.e. the characters ‘"’, ‘'’, and ‘`’.
underline
This is character code 95, i.e. the character ‘_’.

Other characters are unclassified and may only appear in comments and, to the extent discussed above, in quoted atoms and strings.
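The character-type mapping above can be transcribed into a small lookup table. The following is a sketch only: char_category/2 and category_range/3 are hypothetical names, the category atoms use underscores in place of hyphens, and the table assumes legacy_char_classification is not set, so codes that are not listed above are left unclassified and the lookup fails for them.

```prolog
% char_category(+Code, ?Category): category of a character code,
% transcribed from the ranges listed above. The ranges are disjoint,
% so the cut only discards useless choice points.
char_category(Code, Category) :-
    integer(Code),
    category_range(Category, Low, High),
    Code >= Low,
    Code =< High,
    !.

% category_range(Category, Low, High): codes Low..High belong to Category.
category_range(whitespace_char, 0, 32).
category_range(whitespace_char, 127, 160).
category_range(whitespace_char, 8206, 8207).
category_range(whitespace_char, 8232, 8233).
category_range(small_letter, 97, 122).
category_range(small_letter, 170, 170).
category_range(small_letter, 186, 186).
category_range(small_letter, 223, 246).
category_range(small_letter, 248, 255).
category_range(capital_letter, 65, 90).
category_range(capital_letter, 192, 214).
category_range(capital_letter, 216, 222).
category_range(digit, 48, 57).
category_range(symbol_char, 35, 36).
category_range(symbol_char, 38, 38).
category_range(symbol_char, 42, 43).
category_range(symbol_char, 45, 47).
category_range(symbol_char, 58, 58).
category_range(symbol_char, 60, 64).
category_range(symbol_char, 92, 92).
category_range(symbol_char, 94, 94).
category_range(symbol_char, 126, 126).
category_range(symbol_char, 161, 169).
category_range(symbol_char, 171, 185).
category_range(symbol_char, 187, 191).
category_range(symbol_char, 215, 215).
category_range(symbol_char, 247, 247).
category_range(solo_char, 33, 33).
category_range(solo_char, 59, 59).
category_range(punctuation_char, 37, 37).
category_range(punctuation_char, 40, 41).
category_range(punctuation_char, 44, 44).
category_range(punctuation_char, 91, 91).
category_range(punctuation_char, 93, 93).
category_range(punctuation_char, 123, 125).
category_range(quote_char, 34, 34).
category_range(quote_char, 39, 39).
category_range(quote_char, 96, 96).
category_range(underline, 95, 95).
```

For example, the query char_category(124, C) binds C to punctuation_char, since ‘|’ has code 124.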

token ::= name
| natural-number
| unsigned-float
| variable
| string
| punctuation-char
| whitespace-text
| full-stop

name ::= quoted-name
| word
| symbol
| solo-char
| [ ?whitespace-text ]
| { ?whitespace-text }

word ::= small-letter ?alpha...

symbol ::= symbol-char... { except in the case of a full-stop or where the first 2 chars are ‘/*’ }
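To illustrate the alternatives for name, the facts below hold one example of several kinds; each second argument is read as a single name token, that is, an atom. This is a sketch: name_example/2 and all_names_are_atoms/0 are hypothetical, and the symbol ‘***’ is chosen on the assumption that it is not declared as an operator.

```prolog
% name_example(Kind, Name): Name is read as one name token of the given kind.
name_example(word, foo_Bar1).              % small-letter followed by alpha...
name_example(symbol, ***).                 % a run of symbol-chars
name_example(solo_char, !).                % a solo-char on its own
name_example(braces, { }).                 % { ?whitespace-text } is the atom {}
name_example(quoted_name, 'hello world').  % quoted-atom

% Succeeds if every example above was read as an atom.
all_names_are_atoms :-
    \+ ( name_example(_, Term), \+ atom(Term) ).
```

By the same grammar rule, ‘[]’ (optionally with whitespace-text between the brackets) is also a name.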

natural-number ::= digit...
| base-prefix alpha... { where each alpha must be digits of the base indicated by base-prefix, treating a,b,... and A,B,... as 10,11,... }
| 0 ' char-item { yielding the character code for char-item }
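The facts below pair a few natural-number tokens with the decimal value each one denotes; number_example/2 and numbers_agree/0 are hypothetical names used only for this illustration.

```prolog
% number_example(Token, Value): Token is read as the integer Value.
number_example(0b1010, 10).    % base 2
number_example(0o17, 15).      % base 8
number_example(0xFF, 255).     % base 16, capital digits
number_example(0xff, 255).     % base 16, small digits
number_example(0'a, 97).       % character code of a
number_example(0'\n, 10).      % char-item given as an escape sequence

% Succeeds if every token was read as the value listed next to it.
numbers_agree :-
    \+ ( number_example(Token, Value), Token =\= Value ).
```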

unsigned-float ::= simple-float
| simple-float exp exponent

simple-float ::= digit... . digit...

exp ::= e | E

exponent ::= digit... | sign digit...

sign ::= - | +
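As an illustration of the unsigned-float rules, the facts below list tokens that match the grammar, followed by a comment naming some spellings that do not. float_example/1 and all_floats/0 are hypothetical names.

```prolog
% Tokens that match unsigned-float.
float_example(1.0).        % simple-float
float_example(3.14e2).     % simple-float exp exponent
float_example(2.5E-3).     % capital E, negative exponent
float_example(1.0e+10).    % explicit plus sign in the exponent

% Not unsigned-float tokens under this grammar, because simple-float
% requires digits on both sides of the dot:
%   1.      .5      1e10

% Succeeds if every example above was read as a float.
all_floats :-
    \+ ( float_example(X), \+ float(X) ).
```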

variable ::= underline ?alpha...
| capital-letter ?alpha...

string ::= " ?string-item... "

string-item ::= quoted-char { other than ‘"’ or ‘\’ }
| ""
| \ escape-sequence

quoted-atom ::= ' ?quoted-item... '

quoted-item ::= quoted-char { other than ‘'’ or ‘\’ }
| ''
| \ escape-sequence

backquoted-atom ::= ` ?backquoted-item... `

backquoted-item ::= quoted-char { other than ‘`’ or ‘\’ }
| ``
| \ escape-sequence
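The quoting rules above can be checked by comparing a quoted atom with the character codes it denotes. The sketch below covers quoted-atom only; string and backquoted-atom follow the same pattern with their own quote character. quoted_atom_codes/2 and check_quoted/0 are hypothetical names.

```prolog
% quoted_atom_codes(Atom, Codes): the quoted atom denotes exactly Codes.
quoted_atom_codes('it''s', [105,116,39,115]).   % '' stands for one '
quoted_atom_codes('a\\b', [97,92,98]).          % \\ stands for one \
quoted_atom_codes('say "hi"', [115,97,121,32,34,104,105,34]).
                                                % " needs no doubling here

% Succeeds if every atom above denotes the codes listed next to it.
check_quoted :-
    \+ ( quoted_atom_codes(Atom, Codes), \+ atom_codes(Atom, Codes) ).
```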

whitespace-text ::= whitespace-text-item...

whitespace-text-item ::= whitespace-char | comment

comment ::= /* ?char... */ { where ?char... must not contain ‘*/’ }
| % ?char... <LFD> { where ?char... must not contain <LFD> }
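A short illustration of both comment forms follows; answer/1 is a hypothetical predicate that exists only to give the comments something to sit next to. Note that, by the rule above, a bracketed comment ends at the first closing pair, so such comments do not nest.

```prolog
/* A bracketed comment may span
   several lines. */
answer(42).   % a line comment runs to the end of the line

/* An inner opener such as /* is plain text; the comment still ends here: */
answer(43).
```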

full-stop ::= . { the following token, if any, must be whitespace-text }

char ::= whitespace-char
| printing-char

printing-char ::= alpha
| symbol-char
| solo-char
| punctuation-char
| quote-char

alpha ::= capital-letter | small-letter | digit | underline

escape-sequence ::= b { backspace, character code 8 }
| t { horizontal tab, character code 9 }
| n { newline, character code 10 }
| v { vertical tab, character code 11 }
| f { form feed, character code 12 }
| r { carriage return, character code 13 }
| e { escape, character code 27 }
| d { delete, character code 127 }
| a { alarm, character code 7 }
| other-escape-sequence

quoted-name ::= quoted-atom
| backquoted-atom

base-prefix ::= 0b { indicates base 2 }
| 0o { indicates base 8 }
| 0x { indicates base 16 }

char-item ::= quoted-item

other-escape-sequence ::= x alpha... \ { treating a,b,... and A,B,... as 10,11,..., each alpha in the range [0..15], hex character code }
| digit... \ { each digit in the range [0..7], octal character code }
| <LFD> { ignored }
| \ { stands for itself }
| ' { stands for itself }
| " { stands for itself }
| ` { stands for itself }
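The facts below pair one-character quoted atoms, each written with an escape sequence, with the character code the escape denotes. This is a sketch; escape_example/2 and check_escapes/0 are hypothetical names. The escapes \e (code 27) and \d (code 127) from the list above follow the same pattern but are left out of the facts, on the assumption that not every Prolog system accepts them.

```prolog
% escape_example(Atom, Code): Atom consists of the single character Code.
escape_example('\a', 7).        % alarm
escape_example('\b', 8).        % backspace
escape_example('\t', 9).        % horizontal tab
escape_example('\n', 10).       % newline
escape_example('\v', 11).       % vertical tab
escape_example('\f', 12).       % form feed
escape_example('\r', 13).       % carriage return
escape_example('\x41\', 65).    % hex character code, closed by a backslash
escape_example('\101\', 65).    % octal character code, closed by a backslash
escape_example('\\', 92).       % backslash stands for itself
escape_example('\'', 39).       % quote stands for itself

% Succeeds if every atom above consists of exactly the listed code.
check_escapes :-
    \+ ( escape_example(Atom, Code), \+ atom_codes(Atom, [Code]) ).
```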

quoted-char ::= <SPC>
| printing-char

Footnotes

[1] Characters outside this range can still be included in quoted atoms and strings by using escape sequences (see ref-syn-syn-esc).

[2] In releases 3 and 4.0.0 the lower case characters 170 and 186 were incorrectly classified as symbol-char. This was corrected in release 4.0.1.


