SICStus Prolog


Syntax of Tokens as Character Strings

SICStus Prolog supports wide characters (up to 31 bits wide). The character code set is assumed to be an extension of 7-bit ASCII; that is, it includes the codes 0..127, and these codes are interpreted as ASCII characters.

Each character in the code set has to be classified as belonging to one of the character categories, such as small-letter, digit, etc. This classification is called the character-type mapping, and it is used for defining the syntax of tokens.

The user can select one of the three predefined wide character modes through the environment variable SP_CTYPE. These modes are iso_8859_1, utf8, and euc. The user can also define other wide character modes by plugging in appropriate hook functions; see Handling Wide Characters. In this case the user has to supply a character-type mapping for the codes greater than 127.

We first describe the character-type mapping for the fixed part of the code set, the 7-bit ASCII range.

layout-char: These are character codes 0..32 and 127. This includes characters such as <TAB>, <LFD>, and <SPC>.
small-letter: These are character codes 97..122, i.e. the letters a through z.
capital-letter: These are character codes 65..90, i.e. the letters A through Z.
digit: These are character codes 48..57, i.e. the digits 0 through 9.
symbol-char: These are character codes 35, 36, 38, 42, 43, 45..47, 58, 60..64, 92, 94, and 126, i.e. the characters:
+-*/\^<>=~:.?@#$&
In sicstus execution mode, character code 96 (`) is also a symbol-char.
solo-char: These are character codes 33 and 59, i.e. the characters ! and ;.
punctuation-char: These are character codes 37, 40, 41, 44, 91, 93, and 123..125, i.e. the characters %(),[]{|}.
quote-char: These are character codes 34 and 39, i.e. the characters " and '. In iso execution mode, character code 96 (`) is also a quote-char.
underline: This is character code 95, i.e. the character _.
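
The ASCII mapping above can be written out as a small lookup. The following Python function is only an illustrative sketch; the name `classify_ascii` and the `mode` parameter are assumptions made for this example and are not part of SICStus Prolog.

```python
def classify_ascii(code: int, mode: str = "sicstus") -> str:
    """Return the character category of a 7-bit ASCII code.

    `mode` is "sicstus" or "iso" (hypothetical parameter); it only
    affects the classification of code 96 (backquote).
    """
    if not 0 <= code <= 127:
        raise ValueError("only the fixed 7-bit ASCII part is handled here")
    if mode not in ("sicstus", "iso"):
        raise ValueError("mode must be 'sicstus' or 'iso'")
    if code <= 32 or code == 127:
        return "layout-char"
    if 97 <= code <= 122:
        return "small-letter"
    if 65 <= code <= 90:
        return "capital-letter"
    if 48 <= code <= 57:
        return "digit"
    if code == 95:
        return "underline"
    if code == 96:
        # Backquote: symbol-char in sicstus mode, quote-char in iso mode.
        return "symbol-char" if mode == "sicstus" else "quote-char"
    ch = chr(code)
    if ch in "+-*/\\^<>=~:.?@#$&":
        return "symbol-char"
    if ch in "!;":
        return "solo-char"
    if ch in "%(),[]{|}":
        return "punctuation-char"
    if ch in "\"'":
        return "quote-char"
    raise AssertionError("unreachable: every ASCII code is classified above")
```

For example, `classify_ascii(ord(";"))` gives `"solo-char"`, and code 96 is the only code whose category depends on the execution mode.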

We now provide the character-type mapping for the characters above the 7 bit ASCII range, for each of the built-in wide character modes.

The iso_8859_1 mode has the character set 0..255 and the following character-type mapping for the codes 128..255:

layout-char: the codes 128..160.
small-letter: the codes 223..246, and 248..255.
capital-letter: the codes 192..214, and 216..222.
symbol-char: the codes 161..191, 215, and 247.

The utf8 mode has the character set 0..(2^31-1). The character-type mapping for the codes 128..255 is the same as for the iso_8859_1 mode. All character codes above 255 are classified as small-letters.

The euc mode character set is described in Representation of EUC Wide Characters. All character codes above 127 are classified as small-letters.
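
The three built-in mappings for codes above 127 can be summarized in code. This is a sketch under stated assumptions: the function names are hypothetical, and no attempt is made to check that a code is a valid EUC character, since that representation is described elsewhere.

```python
def _classify_128_to_255(code: int) -> str:
    """Mapping for codes 128..255, shared by iso_8859_1 and utf8."""
    if 128 <= code <= 160:
        return "layout-char"
    if 223 <= code <= 246 or 248 <= code <= 255:
        return "small-letter"
    if 192 <= code <= 214 or 216 <= code <= 222:
        return "capital-letter"
    # Remaining codes: 161..191, 215, and 247.
    return "symbol-char"


def classify_wide(code: int, mode: str) -> str:
    """Classify a code above 127 in one of the predefined SP_CTYPE modes."""
    if code < 128:
        raise ValueError("codes 0..127 use the fixed ASCII mapping")
    if mode == "iso_8859_1":
        if code > 255:
            raise ValueError("iso_8859_1 has the character set 0..255")
        return _classify_128_to_255(code)
    if mode == "utf8":
        if code > 2**31 - 1:
            raise ValueError("utf8 has the character set 0..(2^31-1)")
        return _classify_128_to_255(code) if code <= 255 else "small-letter"
    if mode == "euc":
        # Validity of the code as an EUC character is not checked here.
        return "small-letter"
    raise ValueError("unknown mode: " + mode)
```

Note that in utf8 and euc modes every code above 255 (respectively 127) is a small-letter, so such characters can start an unquoted atom but not a variable.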

     token             --> name
                        |  natural-number
                        |  unsigned-float
                        |  variable
                        |  string
                        |  punctuation-char
                        |  layout-text
                        |  full-stop
     
     name              --> quoted-name
                        |  word
                        |  symbol
                        |  solo-char
                        |  [ ?layout-text ]
                        |  { ?layout-text }
     
     quoted-item       --> char  { other than ' or \ }
                        |  ''
                        |  \ escape-sequence  {unless character escapes have been switched off }
     
     word              --> small-letter ?alpha...
     
     symbol            --> symbol-char...
                              { except in the case of a full-stop
                                or where the first 2 chars are /* }
     
     natural-number    --> digit...
                        |  base-prefix alpha...
                              { where each alpha must be a digit of
                                the base indicated by base-prefix,
                                treating a,b,... and A,B,... as 10,11,... }
                        |  0 ' char-item
                              { yielding the character code for char }
     
     unsigned-float    --> simple-float
                        |  simple-float exp exponent
     
     simple-float      --> digit... . digit...
     
     exp               --> e  |  E
     
     exponent          --> digit... | sign digit...
     
     sign              --> - | +
     
     variable          --> underline ?alpha...
                        |  capital-letter ?alpha...
     
     string            --> " ?string-item... "
     
     string-item       --> char  { other than " or \ }
                        |  ""
                        |  \ escape-sequence {unless character escapes have been switched off }
     
     layout-text             --> layout-text-item...
     
     layout-text-item        --> layout-char | comment
     
     comment           --> /* ?char... */
                              { where ?char... must not contain */ }
                        |  % ?char... <LFD>
                              { where ?char... must not contain <LFD> }
     
     full-stop         --> .
                              { the following token, if any, must be layout-text}
     
     char              --> { any character, i.e. }
                           layout-char
                        |  alpha
                        |  symbol-char
                        |  solo-char
                        |  punctuation-char
                        |  quote-char
     
     alpha             --> capital-letter | small-letter | digit | underline
     
     escape-sequence   --> b        { backspace, character code 8 }
                        |  t        { horizontal tab, character code 9 }
                        |  n        { newline, character code 10 }
                        |  v        { vertical tab, character code 11 }
                        |  f        { form feed, character code 12 }
                        |  r        { carriage return, character code 13 }
                        |  e        { escape, character code 27 }
                        |  d        { delete, character code 127 }
                        |  a        { alarm, character code 7 }
                        |  other-escape-sequence
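
As an informal illustration of the word, variable, unsigned-float, and decimal natural-number rules above, the following Python sketch recognizes these tokens with regular expressions. It is restricted to the 7-bit ASCII character classes; the wide character modes add further letters that are not covered. The names `token_kind`, `WORD`, and so on are hypothetical, and base-prefixed numbers and the 0' form are deliberately left out because they depend on the execution mode.

```python
import re

# alpha --> capital-letter | small-letter | digit | underline (ASCII only)
ALPHA = r"[A-Za-z0-9_]"

WORD = re.compile(rf"[a-z]{ALPHA}*")          # small-letter ?alpha...
VARIABLE = re.compile(rf"[A-Z_]{ALPHA}*")     # underline or capital-letter first
# simple-float, optionally followed by exp and an optionally signed exponent
UNSIGNED_FLOAT = re.compile(r"[0-9]+\.[0-9]+(?:[eE][+-]?[0-9]+)?")
DECIMAL_NATURAL = re.compile(r"[0-9]+")       # digit...


def token_kind(text: str):
    """Return the token category matching the whole of `text`, or None."""
    candidates = (
        ("unsigned-float", UNSIGNED_FLOAT),
        ("natural-number", DECIMAL_NATURAL),
        ("variable", VARIABLE),
        ("word", WORD),
    )
    for kind, pattern in candidates:
        if pattern.fullmatch(text):
            return kind
    return None
```

Under these rules a float needs digits on both sides of the decimal point, so neither `1.` nor `1e10` is an unsigned-float.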

The syntax differs between iso and sicstus execution modes. The differences are expressed by giving separate rules, per mode, for certain syntactic categories.

`iso` execution mode rules

     quoted-name       --> ' ?quoted-item... '
                        |  backquoted-atom
     
     backquoted-atom   --> ` ?backquoted-item... `
     
     backquoted-item   --> char  { other than ` or \ }
                        |  ``
                        |  \ escape-sequence  {unless character escapes have been switched off }
     
     base-prefix       -->   0b { indicates base  2 }
                        |  0o { indicates base  8 }
                        |  0x { indicates base 16 }
     
     char-item         --> quoted-item
     
     other-escape-sequence  -->
                           x alpha... \
                                    { hex character code; each alpha must be in the
                                      range [0..15], treating a,b,... and A,B,... as 10,11,... }
                        |  o digit... \
                                    { in the range [0..7], octal character code }
                        |  c <LFD>    { ignored }
                        |  \
                        |  '
                        |  "
                        |  `
                                    { represent themselves }
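
To make the iso mode rules for base-prefix and for the numeric escape sequences concrete, here is a small Python sketch. The function names are hypothetical and chosen only for this example; the 0' character-code form and the non-numeric escapes are not handled.

```python
ISO_BASES = {"0b": 2, "0o": 8, "0x": 16}


def digit_value(ch: str):
    """Value of an alpha as a digit (a,b,... and A,B,... as 10,11,...), or None."""
    if "0" <= ch <= "9":
        return ord(ch) - ord("0")
    if "a" <= ch <= "z":
        return ord(ch) - ord("a") + 10
    if "A" <= ch <= "Z":
        return ord(ch) - ord("A") + 10
    return None  # underline is an alpha but never a digit


def digits_to_int(digits: str, base: int) -> int:
    """Convert a non-empty digit string, rejecting digits outside the base."""
    if not digits:
        raise ValueError("at least one digit is required")
    value = 0
    for ch in digits:
        d = digit_value(ch)
        if d is None or d >= base:
            raise ValueError(f"{ch!r} is not a digit of base {base}")
        value = value * base + d
    return value


def parse_iso_natural(text: str) -> int:
    """Parse `digit...` or `base-prefix alpha...` as written in iso mode."""
    prefix = text[:2]
    if prefix in ISO_BASES:
        return digits_to_int(text[2:], ISO_BASES[prefix])
    return digits_to_int(text, 10)


def decode_iso_numeric_escape(body: str) -> int:
    """Decode the text after the backslash, e.g. 'x41\\' or 'o101\\'."""
    if len(body) < 3 or not body.endswith("\\"):
        raise ValueError("expected x or o, digits, and a closing backslash")
    kind, digits = body[0], body[1:-1]
    if kind == "x":
        return digits_to_int(digits, 16)
    if kind == "o":
        return digits_to_int(digits, 8)
    raise ValueError("not a numeric escape sequence")
```

For instance, `parse_iso_natural("0xff")` yields 255, and both `\x41\` and `\o101\` denote character code 65.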

`sicstus` execution mode rules

     quoted-name       --> ' ?quoted-item... '
     
     base-prefix       --> base ' {indicates base base }
     
     base              --> digit...  { in the range [2..36] }
     
     char-item         --> char  { other than \ }
                        |  \ escape-sequence  {unless character escapes have been switched off }
     
     other-escape-sequence  -->
                           x alpha alpha escape-terminator
                                    { hex character code; each alpha must be in the
                                      range [0..15], treating a,b,... and A,B,... as 10,11,... }
                        |  digit ?digit ?digit escape-terminator
                                    { octal character code; each digit must be in the range [0..7] }
                        |  ^ ?      { delete, character code 127 }
                        |  ^ capital-letter
                        |  ^ small-letter
                                    { the control character alpha mod 32 }
                        |  c ?layout-char... { ignored }
                        |  layout-char  { ignored }
                        |  char    { other than the above, represents itself }
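
The sicstus mode rules for base-prefix and for the ^ control escapes can be sketched in Python as follows. The function names are hypothetical, only ASCII letters are considered, and the 0' character-code form is rejected rather than interpreted.

```python
def digit_value(ch: str):
    """Value of an alpha as a digit (a,b,... and A,B,... as 10,11,...), or None."""
    if "0" <= ch <= "9":
        return ord(ch) - ord("0")
    if "a" <= ch <= "z":
        return ord(ch) - ord("a") + 10
    if "A" <= ch <= "Z":
        return ord(ch) - ord("A") + 10
    return None


def is_decimal(text: str) -> bool:
    """True for a non-empty string of ASCII digits."""
    return text != "" and all(c in "0123456789" for c in text)


def parse_sicstus_natural(text: str) -> int:
    """Parse `digit...` or `base ' alpha...` as written in sicstus mode."""
    base_text, quote, digits = text.partition("'")
    if not is_decimal(base_text):
        raise ValueError("expected digits before any quote")
    if not quote:
        return int(base_text)
    base = int(base_text)
    if not 2 <= base <= 36:
        # This also rejects the 0' character-code form, which is out of scope.
        raise ValueError("base must be in the range [2..36]")
    if not digits:
        raise ValueError("at least one digit is required after the quote")
    value = 0
    for ch in digits:
        d = digit_value(ch)
        if d is None or d >= base:
            raise ValueError(f"{ch!r} is not a digit of base {base}")
        value = value * base + d
    return value


def decode_control_escape(body: str) -> int:
    """Decode the text after the backslash for `^ ?` and `^ letter`."""
    if len(body) != 2 or body[0] != "^":
        raise ValueError("expected ^ followed by one character")
    ch = body[1]
    if ch == "?":
        return 127  # delete
    if "a" <= ch <= "z" or "A" <= ch <= "Z":
        return ord(ch) % 32  # the control character: letter code mod 32
    raise ValueError("expected ? or an ASCII letter after ^")
```

For instance, `parse_sicstus_natural("16'ff")` yields 255, and `\^A` and `\^a` both denote character code 1.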
