4.1.7.5 Syntax of Tokens as Character Strings

SICStus Prolog supports wide characters (up to 31 bits wide), interpreted as a superset of Unicode.

Each character in the code set has to be classified as belonging to one of the character categories, such as small-letter, digit, etc. This classification is called the character-type mapping, and it is used for defining the syntax of tokens.

Only character codes 0..255, i.e. the ISO 8859/1 (Latin 1) subset of Unicode, can be part of unquoted tokens[1]. This restriction may be lifted in the future.

For quoted tokens, i.e. quoted atoms and strings, almost any sequence of code points assigned to non-private abstract characters in Unicode 5.0 is allowed. The only disallowed characters are those in the layout-char category, with the exception of space (character code 32), which is allowed even though it is a layout-char.
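As an illustration of the two paragraphs above, the following minimal sketch builds an atom from a code outside 0..255 and compares it with the same atom written as a quoted token using a hex escape. The predicate name latin1_restriction_example/0 is hypothetical and chosen only for this example.

```prolog
% Hypothetical example predicate, not part of SICStus Prolog.
latin1_restriction_example :-
    atom_codes(Lambda, [955]),   % code 955 (hex 3BB) lies outside 0..255
    atom(Lambda),
    atom_length(Lambda, 1),
    Lambda == '\x3bb\'.          % the same atom, written with a hex escape
```

The atom cannot be written as an unquoted token, but the quoted form with an escape sequence uses only ASCII characters in the source text.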

An additional restriction is that the sequence of characters that makes up a quoted token must be in Normal Form C (NFC); see http://www.unicode.org/reports/tr15/. This is currently (SICStus Prolog 4.0.1) not enforced. A future version of SICStus Prolog may enforce this restriction or perform the normalization automatically.

NFC is the normalization form used on the web (http://www.w3.org/TR/charmod/) and what most software can be expected to produce by default. Any sequence consisting of only characters from Latin 1 is already in NFC.

Note: Any output produced by write_term/2 with the option quoted(true) will be in NFC. This includes output from writeq/[1,2] and write_canonical/[1,2].

layout-char
These are character codes 0..32, 127..160, 8206..8207, and 8232..8233. This includes ASCII characters such as <TAB>, <LFD>, and <SPC>, as well as all characters with Unicode property “Pattern_White_Space” including the Unicode-specific <LINE SEPARATOR> (8232).
small-letter
These are character codes 97..122, i.e. the letters `a' through `z', as well as the non-ASCII character codes 170, 186, 223..246, and 248..255.
capital-letter
These are character codes 65..90, i.e. the letters `A' through `Z', as well as the non-ASCII character codes 192..214, and 216..222.
digit
These are character codes 48..57, i.e. the digits `0' through `9'.
symbol-char
These are character codes 35, 36, 38, 42, 43, 45..47, 58, 60..64, 92, 94, and 126, i.e. the characters:
          + - * / \ ^ < > = ~ : . ? @ # $ &
     

In addition, the non-ASCII character codes 161..169, 171..185, 187..191, 215, and 247 belong to this character type[2].

solo-char
These are character codes 33 and 59, i.e. the characters `!' and `;'.
punctuation-char
These are character codes 37, 40, 41, 44, 91, 93, and 123..125, i.e. the characters:
          % ( ) , [ ] { | }
     

quote-char
These are character codes 34, 39, and 96, i.e. the characters `"', `'', and ``'.
underline
This is character code 95, i.e. the character `_'.

Other characters are unclassified and may only appear in comments and, to some extent, as discussed above, in quoted atoms and strings.
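The character-type mapping listed above can be sketched as a lookup table. The predicates code_category/2 and category_range/3 and the category atoms are hypothetical names introduced for this illustration; they are not built-in predicates, and the ranges are simply copied from the category descriptions.

```prolog
% Hypothetical helper: code_category(+Code, ?Category).
% Fails for codes that the list above leaves unclassified.
code_category(Code, Category) :-
    integer(Code),
    category_range(Category, Low, High),
    Low =< Code, Code =< High,
    !.                          % the ranges are disjoint

% category_range(Category, Low, High): inclusive code ranges.
category_range(layout_char, 0, 32).
category_range(layout_char, 127, 160).
category_range(layout_char, 8206, 8207).
category_range(layout_char, 8232, 8233).
category_range(small_letter, 97, 122).
category_range(small_letter, 170, 170).
category_range(small_letter, 186, 186).
category_range(small_letter, 223, 246).
category_range(small_letter, 248, 255).
category_range(capital_letter, 65, 90).
category_range(capital_letter, 192, 214).
category_range(capital_letter, 216, 222).
category_range(digit, 48, 57).
category_range(symbol_char, 35, 36).
category_range(symbol_char, 38, 38).
category_range(symbol_char, 42, 43).
category_range(symbol_char, 45, 47).
category_range(symbol_char, 58, 58).
category_range(symbol_char, 60, 64).
category_range(symbol_char, 92, 92).
category_range(symbol_char, 94, 94).
category_range(symbol_char, 126, 126).
category_range(symbol_char, 161, 169).
category_range(symbol_char, 171, 185).
category_range(symbol_char, 187, 191).
category_range(symbol_char, 215, 215).
category_range(symbol_char, 247, 247).
category_range(solo_char, 33, 33).
category_range(solo_char, 59, 59).
category_range(punctuation_char, 37, 37).
category_range(punctuation_char, 40, 41).
category_range(punctuation_char, 44, 44).
category_range(punctuation_char, 91, 91).
category_range(punctuation_char, 93, 93).
category_range(punctuation_char, 123, 125).
category_range(quote_char, 34, 34).
category_range(quote_char, 39, 39).
category_range(quote_char, 96, 96).
category_range(underline, 95, 95).
```

For example, the goal code_category(0'a, C) binds C to small_letter, and code_category(955, C) fails because code 955 is unclassified.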

token ::= name
| natural-number
| unsigned-float
| variable
| string
| punctuation-char
| layout-text
| full-stop

name ::= quoted-name
| word
| symbol
| solo-char
| [ ?layout-text ]
| { ?layout-text }

word ::= small-letter ?alpha...

symbol ::= symbol-char... { except in the case of a full-stop or where the first 2 chars are `/*' }
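The alternatives of the name rule can be checked with atom/1, as in the following sketch. The predicate name name_token_examples/0 is hypothetical; operator atoms are wrapped in parentheses so that they are read as plain atoms.

```prolog
% Hypothetical example predicate; every goal is expected to succeed.
name_token_examples :-
    atom(foo),              % word: a small-letter, then nothing
    atom(fooBar_1),         % word: a small-letter followed by alpha...
    atom((+)),              % symbol made of one symbol-char
    atom((=..)),            % symbol made of several symbol-chars
    atom(!),                % solo-char
    atom((;)),              % solo-char
    atom([]),               % [ ?layout-text ]
    atom({}),               % { ?layout-text }
    atom('hello world').    % quoted-name
```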

natural-number ::= digit...
| base-prefix alpha... { where each alpha must be a digit of the base indicated by base-prefix, treating a,b,... and A,B,... as 10,11,... }
| 0 ' char-item { yielding the character code for char-item }
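A short sketch of the three natural-number forms follows. The predicate name number_token_examples/0 is hypothetical; each comparison checks that a token on the right reads as the decimal integer on the left.

```prolog
% Hypothetical example predicate; every goal is expected to succeed.
number_token_examples :-
    10  =:= 0b1010,     % base-prefix 0b, base 2
    15  =:= 0o17,       % base-prefix 0o, base 8
    255 =:= 0xff,       % base-prefix 0x, a..f stand for 10..15
    255 =:= 0xFF,       % upper-case digits are treated the same way
    97  =:= 0'a,        % 0 ' char-item: the character code of a
    48  =:= 0'0.        % the character code of the digit 0
```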

unsigned-float ::= simple-float
| simple-float exp exponent

simple-float ::= digit... . digit...

exp ::= e | E

exponent ::= digit... | sign digit...

sign ::= - | +
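The float rules above can be exercised as follows. This is a minimal sketch with a hypothetical predicate name, float_token_examples/0; the values are chosen so that they are exactly representable in binary floating point.

```prolog
% Hypothetical example predicate; every goal is expected to succeed.
float_token_examples :-
    X = 1.0,    float(X),                   % simple-float
    Y = 2.5e3,  float(Y), Y =:= 2500.0,     % exponent without a sign
    Z = 5.0E-1, float(Z), Z =:= 0.5,        % capital E, negative exponent
    W = 1.0e+2, float(W), W =:= 100.0.      % explicit positive sign
```

Read literally, the grammar requires digits on both sides of the dot, so character sequences such as `1.', `.5', and `1e10' do not match unsigned-float as written.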

variable ::= underline ?alpha...
| capital-letter ?alpha...
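The two alternatives of the variable rule can be contrasted with a word, as in this sketch. The predicate name variable_token_examples/0 is hypothetical.

```prolog
% Hypothetical example predicate; every goal is expected to succeed.
variable_token_examples :-
    var(X), var(Total_2),   % capital-letter followed by ?alpha...
    var(_acc), var(_),      % underline followed by ?alpha...
    atom(x),                % small-letter first: a word, not a variable
    X = Total_2.            % second use of each, avoids singleton warnings
```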

string ::= " ?string-item... "

string-item ::= quoted-char { other than `"' or `\' }
| ""
| \ escape-sequence

quoted-atom ::= ' ?quoted-item... '

quoted-item ::= quoted-char { other than `'' or `\' }
| ''
| \ escape-sequence

backquoted-atom ::= ` ?backquoted-item... `

backquoted-item ::= quoted-char { other than ``' or `\' }
| ``
| \ escape-sequence
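The quoting rules above can be illustrated as follows. This sketch uses a hypothetical predicate name, quoting_examples/0, and it assumes that double-quoted strings are read as lists of character codes (the double_quotes flag set to codes).

```prolog
% Hypothetical example predicate; every goal is expected to succeed,
% assuming strings are read as code lists.
quoting_examples :-
    atom_codes('don''t', [100, 111, 110, 39, 116]), % '' is one quote
    atom_length('don''t', 5),
    atom_length('say "hi"', 8),     % " needs no doubling inside '...'
    "a""b" = [97, 34, 98],          % "" inside "..." is one double quote
    atom_length('a\\b', 3).         % \\ is an escape for one backslash
```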

layout-text ::= layout-text-item...

layout-text-item ::= layout-char | comment

comment ::= /* ?char... */ { where ?char... must not contain `*/' }
| % ?char... <LFD> { where ?char... must not contain <LFD> }

full-stop ::= . { the following token, if any, must be layout-text }
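Both comment forms count as layout-text, so they can appear wherever layout is allowed. The following sketch, with the hypothetical predicate name comment_examples/0, shows each form inside a clause that ends in a full-stop followed by layout.

```prolog
% Hypothetical example predicate; the goal is expected to succeed.
comment_examples :-
    X = /* a bracketed comment is layout-text */ 1,
    Y = 2,      % a line comment runs up to the end of the line
    X + Y =:= 3.
```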

char ::= layout-char
| printing-char

printing-char ::= alpha
| symbol-char
| solo-char
| punctuation-char
| quote-char

alpha ::= capital-letter | small-letter | digit | underline

escape-sequence ::= b { backspace, character code 8 }
| t { horizontal tab, character code 9 }
| n { newline, character code 10 }
| v { vertical tab, character code 11 }
| f { form feed, character code 12 }
| r { carriage return, character code 13 }
| e { escape, character code 27 }
| d { delete, character code 127 }
| a { alarm, character code 7 }
| other-escape-sequence

quoted-name ::= quoted-atom
| backquoted-atom

base-prefix ::= 0b { indicates base 2 }
| 0o { indicates base 8 }
| 0x { indicates base 16 }

char-item ::= quoted-item

other-escape-sequence ::= x alpha... \ { treating a,b,... and A,B,... as 10,11,..., each alpha in the range [0..15], hex character code }
| o digit... \ { in the range [0..7], octal character code }
| <LFD> { ignored }
| \ { stands for itself }
| ' { stands for itself }
| " { stands for itself }
| ` { stands for itself }
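A few of the escape sequences from the grammar can be checked with atom_codes/2, as sketched below. The predicate name escape_examples/0 is hypothetical; the expected codes are the ones given in the escape-sequence rule.

```prolog
% Hypothetical example predicate; every goal is expected to succeed.
escape_examples :-
    atom_codes('\a', [7]),          % alarm
    atom_codes('\b', [8]),          % backspace
    atom_codes('\t', [9]),          % horizontal tab
    atom_codes('\n', [10]),         % newline
    atom_codes('\e', [27]),         % escape
    atom_codes('\d', [127]),        % delete
    atom_codes('\x41\', [65]),      % hex escape, closed by a backslash
    atom_codes('\\', [92]),         % backslash stands for itself
    atom_codes('\'', [39]),         % quote stands for itself
    10 =:= 0'\n.                    % a char-item may also be an escape
```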

quoted-char ::= <SPC>
| printing-char

Footnotes

[1] Characters outside this range can still be included in quoted atoms and strings by using escape sequences (see ref-syn-syn-esc).

[2] In SICStus Prolog 4.0.0 and in SICStus 3 the lower case characters 170 and 186 were incorrectly classified as symbol-char. This was corrected in SICStus Prolog 4.0.1.
