SICStus Prolog: ref-syn-syn-tok

4.1.7.5 Syntax of Tokens as Character Strings

SICStus Prolog supports wide characters (up to 31 bits wide), interpreted as a superset of Unicode.

Each character in the code set has to be classified as belonging to one of the character categories, such as small-letter, digit, etc. This classification is called the character-type mapping, and it is used for defining the syntax of tokens.

Only character codes 0..255, i.e. the ISO-8859-1 (Latin 1) subset of Unicode, can be part of unquoted tokens³, unless the Prolog flag legacy_char_classification is set; see ref-lps-flg. This restriction may be lifted in the future.
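As a minimal sketch of how a character outside 0..255 can still reach a quoted atom, the following uses a hex escape sequence for code 945 (GREEK SMALL LETTER ALPHA). The predicate names `alpha_atom/1` and `alpha_codes/1` are hypothetical and exist only for this illustration; the built-in used is the standard `atom_codes/2`.

```prolog
% alpha_atom(-Atom): hypothetical helper for this sketch.
% '\x3B1\' is a hex escape, closed by a backslash, for character code 945,
% which lies outside the Latin 1 range 0..255.
alpha_atom(Atom) :-
    Atom = '\x3B1\'.

% alpha_codes(-Codes): the atom consists of that single character code.
alpha_codes(Codes) :-
    alpha_atom(Atom),
    atom_codes(Atom, Codes).
```

Under these assumptions, the query `alpha_codes(Codes)` should bind `Codes` to `[945]`.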

For quoted tokens, i.e. quoted atoms and strings, almost any sequence of code points assigned to non-private abstract characters in Unicode 5.0 is allowed. The disallowed characters are those in the whitespace-char category, except that space (character code 32) is allowed even though it is a whitespace-char.

An additional restriction is that the sequence of characters that makes up a quoted token must be in Normal Form C (NFC; see http://www.unicode.org/reports/tr15/). This is currently not enforced. A future release may enforce this restriction or perform this normalization automatically.

NFC is the normalization form used on the web (http://www.w3.org/TR/charmod/) and what most software can be expected to produce by default. Any sequence consisting of only characters from Latin 1 is already in NFC.
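To illustrate why the normal form matters, the sketch below builds two atoms that display identically but consist of different code sequences. It assumes that no normalization is performed, as stated above; the predicate names are hypothetical and only the standard `atom_codes/2` is used.

```prolog
% composed_atom(-A): the single code point U+00E9 (Latin 1, already in NFC).
composed_atom(A) :-
    atom_codes(A, [233]).

% decomposed_atom(-A): the letter e followed by U+0301 COMBINING ACUTE ACCENT.
% This two-code sequence is not in NFC.
decomposed_atom(A) :-
    atom_codes(A, [101, 769]).

% same_atom: succeeds only if both spellings denote the same atom.
% With no normalization applied, the atoms differ and this goal fails.
same_atom :-
    composed_atom(A),
    decomposed_atom(B),
    A == B.
```

Keeping source text in NFC avoids this kind of accidental mismatch between atoms that look the same.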

When the Prolog flag legacy_char_classification is set, characters in the whitespace-char category are still treated as whitespace but other character codes outside the range 0..255, assigned to non-private abstract characters in Unicode 5.0, are treated as lower case. Such characters can therefore appear as themselves, without using escape sequences, both in quoted and unquoted tokens.

Note: Any output produced by write_term/2 with the option quoted(true) will be in NFC. This includes output from writeq/[1,2] and write_canonical/[1,2].

whitespace-char

These are character codes 0..32, 127..160, 8206..8207, and 8232..8233. This includes ASCII characters such as TAB, LFD, and SPC, as well as all characters with Unicode property “Pattern_Whitespace” including the Unicode-specific LINE SEPARATOR (8232).

small-letter

These are character codes 97..122, i.e. the letters ‘a’ through ‘z’, as well as the non-ASCII character codes 170, 186, 223..246, and 248..255.

If the Prolog flag legacy_char_classification (see ref-lps-flg) is set, then the small-letter set will also include almost every code point above 255 assigned to non-private abstract characters in Unicode 5.0.

capital-letter

These are character codes 65..90, i.e. the letters ‘A’ through ‘Z’, as well as the non-ASCII character codes 192..214, and 216..222.

digit

These are character codes 48..57, i.e. the digits ‘0’ through ‘9’.

symbol-char

These are character codes 35, 36, 38, 42, 43, 45..47, 58, 60..64, 92, 94, and 126, i.e. the characters:

+ - * / \ ^ < > = ~ : . ? @ # $ &

In addition, the non-ASCII character codes 161..169, 171..185, 187..191, 215, and 247 belong to this character type⁴.

solo-char

These are character codes 33 and 59, i.e. the characters ‘!’ and ‘;’.

punctuation-char

These are character codes 37, 40, 41, 44, 91, 93, and 123..125, i.e. the characters:

% ( ) , [ ] { | }

quote-char

These are character codes 34 and 39, i.e. the characters ‘"’ and ‘'’.

underline

This is character code 95, i.e. the character ‘_’.

Other characters are unclassified and may only appear in comments and, to the extent discussed above, in quoted atoms and strings.
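The character-type mapping listed above can be sketched as a lookup table. This is only an illustration of the default classification (with legacy_char_classification not set); `code_class/2` and `class_range/3` are hypothetical names, not SICStus built-ins.

```prolog
% code_class(+Code, -Class): Class is the category of character code Code.
% Fails for unclassified codes, e.g. 96 (backquote).
code_class(Code, Class) :-
    integer(Code),
    class_range(Class, Low, High),
    Code >= Low,
    Code =< High,
    !.

% class_range(Class, Low, High): inclusive code ranges taken from the lists above.
class_range('whitespace-char', 0, 32).
class_range('whitespace-char', 127, 160).
class_range('whitespace-char', 8206, 8207).
class_range('whitespace-char', 8232, 8233).
class_range('small-letter', 97, 122).
class_range('small-letter', 170, 170).
class_range('small-letter', 186, 186).
class_range('small-letter', 223, 246).
class_range('small-letter', 248, 255).
class_range('capital-letter', 65, 90).
class_range('capital-letter', 192, 214).
class_range('capital-letter', 216, 222).
class_range(digit, 48, 57).
class_range('symbol-char', 35, 36).
class_range('symbol-char', 38, 38).
class_range('symbol-char', 42, 43).
class_range('symbol-char', 45, 47).
class_range('symbol-char', 58, 58).
class_range('symbol-char', 60, 64).
class_range('symbol-char', 92, 92).
class_range('symbol-char', 94, 94).
class_range('symbol-char', 126, 126).
class_range('symbol-char', 161, 169).
class_range('symbol-char', 171, 185).
class_range('symbol-char', 187, 191).
class_range('symbol-char', 215, 215).
class_range('symbol-char', 247, 247).
class_range('solo-char', 33, 33).
class_range('solo-char', 59, 59).
class_range('punctuation-char', 37, 37).
class_range('punctuation-char', 40, 41).
class_range('punctuation-char', 44, 44).
class_range('punctuation-char', 91, 91).
class_range('punctuation-char', 93, 93).
class_range('punctuation-char', 123, 125).
class_range('quote-char', 34, 34).
class_range('quote-char', 39, 39).
class_range(underline, 95, 95).
```

For example, `code_class(97, C)` should give `C = 'small-letter'` and `code_class(124, C)` should give `C = 'punctuation-char'`, while `code_class(96, C)` fails because the backquote is unclassified.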

`token`	::= `name`
	\| `natural-number`
	\| `unsigned-float`
	\| `variable`
	\| `string`
	\| `punctuation-char`
	\| `whitespace-text`
	\| `full-stop`

`name`	::= `quoted-name`
	\| `word`
	\| `symbol`
	\| `solo-char`
	\| `[` `?whitespace-text` `]`
	\| `{` `?whitespace-text` `}`

`word`	::= `small-letter` `?alpha…`

`symbol`	::= `symbol-char…`	{ except in the case of a `full-stop` or where the first 2 chars are ‘`/*`’ }

`natural-number`	::= `digit…`
	\| `base-prefix` `alpha…`	{ where each `alpha` must be a digit of the base indicated by `base-prefix`, treating a,b,… and A,B,… as 10,11,… }
	\| `0` `'` `char-item`	{ yielding the character code for `char` }

`unsigned-float`	::= `simple-float`
	\| `simple-float` `exp` `exponent`

`simple-float`	::= `digit…` `.` `digit…`

`exp`	::= `e` \| `E`

`exponent`	::= `digit…` \| `sign` `digit…`

`sign`	::= `-` \| `+`

`variable`	::= `underline` `?alpha…`
	\| `capital-letter` `?alpha…`

`string`	::= `"` `?string-item…` `"`

`string-item`	::= `quoted-char`	{ other than ‘`"`’ or ‘`\`’ }
	\| `""`
	\| `\` `escape-sequence`

`quoted-atom`	::= `'` `?quoted-item…` `'`

`quoted-item`	::= `quoted-char`	{ other than ‘`'`’ or ‘`\`’ }
	\| `''`
	\| `\` `escape-sequence`

`whitespace-text`	::= `whitespace-text-item…`

`whitespace-text-item`	::= `whitespace-char` \| `comment`

`comment`	::= `/*` `?char…` `*/`	{ where `?char…` must not contain ‘`*/`’ }
	\| `%` `?char…` `LFD`	{ where `?char…` must not contain `LFD` }

`full-stop`	::= `.`	{ the following token, if any, must be `whitespace-text`}

`char`	::= `whitespace-char`
	\| `printing-char`

`printing-char`	::= `alpha`
	\| `symbol-char`
	\| `solo-char`
	\| `punctuation-char`
	\| `quote-char`

`alpha`	::= `capital-letter` \| `small-letter` \| `digit` \| `underline`

`escape-sequence`	::= `b`	{ backspace, character code 8 }
	\| `t`	{ horizontal tab, character code 9 }
	\| `n`	{ newline, character code 10 }
	\| `v`	{ vertical tab, character code 11 }
	\| `f`	{ form feed, character code 12 }
	\| `r`	{ carriage return, character code 13 }
	\| `e`	{ escape, character code 27 }
	\| `d`	{ delete, character code 127 }
	\| `a`	{ alarm, character code 7 }
	\| `other-escape-sequence`

`quoted-name`	::= `quoted-atom`

`base-prefix`	::= `0b`	{ indicates base 2 }
	\| `0o`	{ indicates base 8 }
	\| `0x`	{ indicates base 16 }

`char-item`	::= `quoted-item`

`other-escape-sequence`	::= `x` `alpha…` `\`	{ treating a,b,… and A,B,… as 10,11,…, with each `alpha` in the range [0..15]; hex character code }
	\| `digit…` `\`	{ with each `digit` in the range [0..7]; octal character code }
	\| `LFD`	{ ignored }
	\| `\`	{ stands for itself }
	\| `'`	{ stands for itself }
	\| `"`	{ stands for itself }
	\| `` ` ``	{ stands for itself }

`quoted-char`	::= `SPC`
	\| `printing-char`
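A few of the token forms defined by the grammar above can be exercised with the following sketch. The predicate names are hypothetical; the bodies rely only on the reader and on the standard built-ins `atom_codes/2` and `atom_length/2`.

```prolog
% number_tokens(-List): natural-number forms read as integers.
number_tokens([Code, Bin, Oct, Hex]) :-
    Code = 0'a,       % 0' char-item: character code of the letter a
    Bin  = 0b1010,    % base-prefix 0b, base 2
    Oct  = 0o17,      % base-prefix 0o, base 8
    Hex  = 0xFF.      % base-prefix 0x, base 16

% escape_codes(-Codes): escape sequences inside a quoted atom.
% The atom spells: a, \t (tab), b, \x41\ (hex), \101\ (octal), \\ (backslash).
escape_codes(Codes) :-
    atom_codes('a\tb\x41\\101\\\\', Codes).

% doubled_quote_length(-N): '' inside a quoted atom is one quote character.
doubled_quote_length(N) :-
    atom_length('don''t', N).
```

Under these assumptions, `number_tokens(L)` should give `L = [97,10,15,255]`, `escape_codes(Cs)` should give `Cs = [97,9,98,65,65,92]`, and `doubled_quote_length(N)` should give `N = 5`.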

Footnotes

(3)

Characters outside this range can still be included in quoted atoms and strings by using escape sequences (see ref-syn-syn-esc).

(4)

In releases 3 and 4.0.0 the lower case characters 170 and 186 were incorrectly classified as symbol-char. This was corrected in release 4.0.1.
