4.1.7.5 Syntax of Tokens as Character Strings

SICStus Prolog supports wide characters (up to 31 bits wide), interpreted as a superset of Unicode.

Each character in the code set has to be classified as belonging to one of the character categories, such as small-letter, digit, etc. This classification is called the character-type mapping, and it is used for defining the syntax of tokens.

Only character codes 0..255, i.e. the ISO-8859-1 (Latin 1) subset of Unicode, can be part of unquoted tokens (3), unless the Prolog flag legacy_char_classification is set; see ref-lps-flg. This restriction may be lifted in the future.

For quoted tokens, i.e. quoted atoms and strings, almost any sequence of code points assigned to non-private abstract characters in Unicode 5.0 is allowed. The disallowed characters are those in the whitespace-char category, except that space (character code 32) is allowed even though it is a whitespace-char.

An additional restriction is that the sequence of characters that makes up a quoted token must be in Normalization Form C (NFC); see http://www.unicode.org/reports/tr15/. This is currently not enforced. A future release may enforce this restriction or perform the normalization automatically.

NFC is the normalization form used on the web (http://www.w3.org/TR/charmod/) and what most software can be expected to produce by default. Any sequence consisting of only characters from Latin 1 is already in NFC.
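As an informal illustration of why the NFC requirement matters, the goals below compare a precomposed and a decomposed spelling of the same visible character. This is a sketch that assumes the reader performs no normalization, which is what the text above describes as the current behavior; the choice of characters is arbitrary.

```prolog
% U+00E9 (LATIN SMALL LETTER E WITH ACUTE) is a single code point,
% lies within Latin 1, and is therefore already in NFC.
?- atom_codes('\xE9\', Cs).       % Cs = [233]

% The same visible character spelled as 'e' followed by
% U+0301 (COMBINING ACUTE ACCENT). This sequence is NOT in NFC.
% Assumption: no normalization is applied when the atom is read.
?- atom_codes('e\x301\', Cs).     % Cs = [101,769]

% Under that assumption the two spellings denote different atoms.
?- '\xE9\' \== 'e\x301\'.
```

Because a future release may enforce or apply NFC, code should not rely on the second and third goals behaving this way.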

When the Prolog flag legacy_char_classification is set, characters in the whitespace-char category are still treated as whitespace, but other character codes outside the range 0..255 that are assigned to non-private abstract characters in Unicode 5.0 are treated as lower-case letters, i.e. as small-letter. Such characters can therefore appear as themselves, without using escape sequences, in both quoted and unquoted tokens.

Note: Any output produced by write_term/2 with the option quoted(true) will be in NFC. This includes output from writeq/[1,2] and write_canonical/[1,2].

whitespace-char

These are character codes 0..32, 127..160, 8206..8207, and 8232..8233. This includes ASCII characters such as TAB, LFD, and SPC, as well as all characters with the Unicode property “Pattern_White_Space”, including the Unicode-specific LINE SEPARATOR (8232).

small-letter

These are character codes 97..122, i.e. the letters ‘a’ through ‘z’, as well as the non-ASCII character codes 170, 186, 223..246, and 248..255.

If the Prolog flag legacy_char_classification (see ref-lps-flg) is set, then the small-letter set will also include almost every code point above 255 assigned to non-private abstract characters in Unicode 5.0.

capital-letter

These are character codes 65..90, i.e. the letters ‘A’ through ‘Z’, as well as the non-ASCII character codes 192..214, and 216..222.

digit

These are character codes 48..57, i.e. the digits ‘0’ through ‘9’.

symbol-char

These are character codes 35, 36, 38, 42, 43, 45..47, 58, 60..64, 92, 94, and 126, i.e. the characters:

+ - * / \ ^ < > = ~ : . ? @ # $ & 

In addition, the non-ASCII character codes 161..169, 171..185, 187..191, 215, and 247 belong to this character type (4).

solo-char

These are character codes 33 and 59, i.e. the characters ‘!’ and ‘;’.

punctuation-char

These are character codes 37, 40, 41, 44, 91, 93, and 123..125, i.e. the characters:

% ( ) , [ ] { | }

quote-char

These are character codes 34 and 39, i.e. the characters ‘"’ and ‘'’.

underline

This is character code 95, i.e. the character ‘_’.

Other characters are unclassified and may only appear in comments and, to some extent, as discussed above, in quoted atoms and strings.
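The character categories above decide where one token ends and the next begins. The following goals are an informal sketch of typical cases, not an excerpt from the system. The atom and variable names are arbitrary, and it is assumed that `&&` has not been declared as an operator.

```prolog
% Each goal is expected to succeed when read with the syntax
% described in this section.
?- atom(foo_Bar1).        % a small-letter followed by letters, digits, '_'
?- atom(&&).              % a run of symbol-chars forms a single atom
?- atom(!).               % a solo-char is an atom on its own
?- atom('Hello world').   % quoting allows SPC and a leading capital
?- var(_Tmp).             % starts with underline: a variable
?- var(Count).            % starts with a capital-letter: a variable

% Character code 233 (e with acute) is in the small-letter set, so it
% may start an unquoted atom; it is written here as an escape to
% keep the example ASCII-only.
?- atom_codes(A, [233,116,233]), atom(A).
```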

token::= name
| natural-number
| unsigned-float
| variable
| string
| punctuation-char
| whitespace-text
| full stop
name::= quoted-name
| word
| symbol
| solo-char
| [ ?whitespace-text ]
| { ?whitespace-text }
word::= small-letter ?alpha…
symbol::= symbol-char…{ except in the case of a full stop or where the first 2 chars are ‘/*’ }
natural-number::= digit…
| base-prefix alpha…{ where each alpha must be a digit of the base indicated by base-prefix, treating a,b,… and A,B,… as 10,11,… }
| 0 ' char-item{ yielding the character code for char }
unsigned-float::= simple-float
| simple-float exp exponent
simple-float::= digit… . digit…
exp::= e | E
exponent::= digit… | sign digit…
sign::= - | +
variable::= underline ?alpha…
| capital-letter ?alpha…
string::= " ?string-item… "
string-item::= quoted-char{ other than ‘"’ or ‘\’ }
| ""
| \ escape-sequence
quoted-atom::= ' ?quoted-item… '
quoted-item::= quoted-char{ other than ‘'’ or ‘\’ }
| ''
| \ escape-sequence
whitespace-text::= whitespace-text-item…
whitespace-text-item::= whitespace-char | comment
comment::= /* ?char… */{ where ?char… must not contain ‘*/’ }
| % ?char… LFD{ where ?char… must not contain LFD }
full stop::= .{ the following token, if any, must be whitespace-text}
char::= whitespace-char
| printing-char
printing-char::= alpha
| symbol-char
| solo-char
| punctuation-char
| quote-char
alpha::= capital-letter | small-letter | digit | underline
escape-sequence::= b{ backspace, character code 8 }
| t{ horizontal tab, character code 9 }
| n{ newline, character code 10 }
| v{ vertical tab, character code 11 }
| f{ form feed, character code 12 }
| r{ carriage return, character code 13 }
| e{ escape, character code 27 }
| d{ delete, character code 127 }
| a{ alarm, character code 7 }
| other-escape-sequence
quoted-name::= quoted-atom
base-prefix::= 0b{ indicates base 2 }
| 0o{ indicates base 8 }
| 0x{ indicates base 16 }
char-item::= quoted-item
other-escape-sequence::= x alpha… \{ treating a,b,… and A,B,… as 10,11,…, each alpha in the range [0..15], hex character code }
| digit… \{ in the range [0..7], octal character code }
| LFD{ ignored }
| \{ stands for itself }
| '{ stands for itself }
| "{ stands for itself }
| `{ stands for itself }
quoted-char::= SPC
| printing-char
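The number and escape-sequence rules in the grammar above can be tried out with goals such as the following. This is a minimal sketch with arbitrarily chosen values, and the results in the comments follow from the grammar rather than from a recorded session.

```prolog
% natural-number with a base-prefix
?- X is 0b1010.                 % X = 10   (base 2)
?- X is 0o17.                   % X = 15   (base 8)
?- X is 0xFF.                   % X = 255  (base 16)

% 0 ' char-item yields a character code
?- X is 0'a.                    % X = 97

% unsigned-float: digits are required on both sides of the '.'
?- X = 1.5e3, float(X).         % X is the float 1500.0

% escape sequences inside a quoted atom
?- atom_codes('a\tb', Cs).      % Cs = [97,9,98]  (\t is code 9)
?- atom_codes('\x41\', Cs).     % Cs = [65]       (hex, closed by \)
?- atom_codes('\101\', Cs).     % Cs = [65]       (octal, closed by \)

% a doubled quote stands for a single quote character
?- atom_codes('it''s', Cs).     % Cs = [105,116,39,115]
```

Note that both the hex and the octal forms need the closing backslash. Without it, the characters that follow would be read as further digits of the character code.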

Footnotes

(3)

Characters outside this range can still be included in quoted atoms and strings by using escape sequences (see ref-syn-syn-esc).

(4)

In releases 3 and 4.0.0, the lower-case characters 170 and 186 were incorrectly classified as symbol-char. This was corrected in release 4.0.1.
