Draft 2002-09-05

Chapter 1

Language Rules

C++ is a case-sensitive, freeform programming language. This chapter presents the lexical rules for the language.

Character Sets

The character sets that C++ uses at compile time and at runtime are implementation-defined. The compile-time character set, called the basic source character set, must include the characters listed in Table 1-1. The numeric values of these characters and the mapping from the characters found in a source file to the basic source character set are implementation-defined.

Table 1-1: Basic source character set
space character
horizontal tab control character
vertical tab
form feed
new line
a ... z
A ... Z
0 ... 9
_ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ~ ! = , \ " '

The runtime character set, called the execution character set, might be different from the source character set (though it is often the same). If the character sets are different, all character and string literals are automatically converted from the source character set to the execution character set. The basic execution character set includes all the characters in the basic source character set, plus the characters listed in Table 1-2. The execution character set is a superset of the basic execution character set; additional characters are implementation-defined and might vary depending on locale.

Table 1-2: Additional characters in the basic execution character set
alert
backspace
carriage return
null

Conceptually, source characters are mapped to Unicode (ISO/IEC 10646) and from Unicode to the execution character set. You can specify any Unicode character in the source file as a universal character, in the form \uXXXX (lowercase u) or \UXXXXXXXX (uppercase U), where 0000XXXX or XXXXXXXX is the hexadecimal value for the character. Note that you must use exactly four or eight hexadecimal digits. You cannot use a universal character to specify any character that is in the basic source character set, or any character in the range 0-0x20 or 0x7F-0x9F (inclusive). How universal characters are mapped to the execution character set is implementation-defined.

The numerical values for characters are implementation-defined, with the following restrictions:

The space, horizontal tab, vertical tab, form feed, and new line characters are called white space characters. In most cases, white space characters serve only to separate tokens and are otherwise ignored. (Comments are like white space; see the section, Comments, later in this chapter.)

Multibyte and Wide Characters

The execution character set is actually two different character sets: narrow and wide. Any character in the narrow execution character set can be widened (see the <locale> header in Chapter 13) to get its equivalent character in the execution wide-character set. Narrowing the wide character must yield the original narrow character.

A character set might be a multibyte character set, in which multiple adjacent physical characters represent a single logical character. For example, an implementation might use UTF-32 as the execution wide-character set, where the wchar_t type requires 32 bits, and the values are Unicode. The execution narrow-character set might be UTF-8, where the char type requires 8 bits, and a logical character can occupy from 1 to 4 physical characters. (Characters in the range 0-0x7F require 1 byte; characters in the range 0x80-0x7FF require 2 bytes, and so on.)

When converting a multibyte character string to a wide character string, the conversion process might need to keep track of which physical character in a multibyte sequence is being processed. This is known as a shift state. For example, when converting the UTF-8 character sequence 0xE1, 0xA0, 0xA0 to a wide character, a pointer to the narrow character string first touches 0xE1, which initializes the shift state. The next character is 0xA0. A pointer to this character alone is not enough information to tell whether the multibyte sequence has ended. The converter needs the shift state to know that one more character is needed. After reading the third narrow character, the converter can assemble the pieces and produce a single wide character with value 0x1820 (Mongolian letter a, if you must know).
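The walk through the three bytes can be sketched in code. The following is only an illustration, not how any particular library converts characters: the Utf8State structure and the feed function are hypothetical names, the sketch handles sequences of one to three bytes only, and it performs no validation of its input.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical shift state: what the converter must remember between bytes.
struct Utf8State {
    unsigned long value;  // bits collected so far
    int remaining;        // continuation bytes still expected
};

// Feed one byte to the converter. Returns true when state.value holds a
// complete character. Sketch only: 1- to 3-byte sequences, no validation.
bool feed(Utf8State& state, unsigned char byte)
{
    if (state.remaining == 0) {
        if (byte < 0x80) {                    // single-byte character
            state.value = byte;
        } else if ((byte & 0xE0) == 0xC0) {   // first byte of a 2-byte sequence
            state.value = byte & 0x1F;
            state.remaining = 1;
        } else {                              // assumed: first byte of a 3-byte sequence
            state.value = byte & 0x0F;
            state.remaining = 2;
        }
    } else {                                  // continuation byte: append 6 bits
        state.value = (state.value << 6) | (byte & 0x3F);
        --state.remaining;
    }
    return state.remaining == 0;
}

// Converts the sequence from the text and returns the wide character value.
unsigned long decode_example()
{
    const unsigned char bytes[] = { 0xE1, 0xA0, 0xA0 };
    Utf8State state = { 0, 0 };
    bool done = false;
    for (std::size_t i = 0; i < sizeof(bytes); ++i)
        done = feed(state, bytes[i]);
    assert(done);
    return state.value;
}
```

After the first two bytes, feed returns false because the state still expects more input; only the third byte completes the character.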

Other character sets do not require shift states. For example, some oriental character sets use two bytes to represent a single character. The most significant bit of each byte signifies whether the byte is the first or second byte of a double-byte character. Thus, no additional state information is required. Some of the locale and I/O functions behave differently depending on whether a shift state is required. (Details are in Chapter 13.)

Trigraphs

A few characters have an alternative representation, called trigraph sequences. A trigraph is a three-character sequence that represents a single character. The sequence always starts with two question marks. The third character determines which character the sequence represents. All the trigraph sequences are shown in Table 1-3. If the third character is not one of those in the table, the sequence is not a trigraph. For example, the characters ???- represent the two characters ?~. Note that trigraphs are expanded anywhere they appear, including strings, characters, and preprocessor directives.

Table 1-3: Trigraph sequences
Trigraph Replacement
??= #
??/ \
??' ^
??( [
??) ]
??! |
??< {
??> }
??- ~

Not all compilers support trigraphs. Some compilers require an extra switch or option. Others use a separate program to convert all trigraphs to their equivalent characters.
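Because compiler support varies, it can help to see the replacement rule spelled out as ordinary code. The following sketch mimics trigraph replacement on a string at runtime; replace_trigraphs is a hypothetical helper written for this illustration, not part of any library. The code itself spells question marks as \? in literals so that it reads the same whether or not the compiler expands trigraphs.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Hypothetical helper: applies the rules of Table 1-3 to a string.
std::string replace_trigraphs(const std::string& text)
{
    // Third character of each trigraph and its replacement, in table order.
    static const char third[]       = "=/'()!<>-";
    static const char replacement[] = "#\\^[]|{}~";

    std::string result;
    std::string::size_type i = 0;
    while (i < text.size()) {
        if (i + 2 < text.size() && text[i] == '\?' && text[i + 1] == '\?') {
            const char c = text[i + 2];
            const char* pos = (c != '\0') ? std::strchr(third, c) : 0;
            if (pos != 0) {
                result += replacement[pos - third];
                i += 3;               // consume the whole trigraph
                continue;
            }
        }
        result += text[i];            // not a trigraph: copy one character
        ++i;
    }
    return result;
}

void check_trigraphs()
{
    // Three question marks and a hyphen: only the last three form a trigraph.
    assert(replace_trigraphs("\?\?\?-") == "\?~");
}
```

Note how the first question mark is copied unchanged because the three characters starting there do not form a trigraph; the scan then resumes at the second question mark.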

Comments

Comments start with /* and end with */. These comments do not nest. A comment can also start with //, extending to the end of the line.

Within a /* and */ comment, // characters have no special meaning. Within a // comment, /* and */ have no special meaning. Thus, you can "nest" one kind of comment in the other kind.

A comment is treated as white space.
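A short sketch of these rules follows. The function names are made up for the example.

```cpp
#include <cassert>

/* A block comment can contain // without any effect. */
// A line comment can contain /* and */ without starting a block comment.

int answer()
{
    return 6 /* a comment acts as white space */ * 7;  // same as 6 * 7
}

void check_comments()
{
    assert(answer() == 42);
}
```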

Tokens

All input is divided into a stream of tokens. The compiler tries to collect as many contiguous characters as it can to build a valid token. It stops when the next character it would read cannot possibly be part of the token it is reading.
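As a small illustration of this greedy rule, consider the expression a---b. The compiler reads the longest token it can, so the first two minus signs form a decrement operator and the third is a subtraction. The function name below is hypothetical, and some compilers warn about code written this way.

```cpp
#include <cassert>

int munch_demo()
{
    int a = 5;
    int b = 2;
    int difference = a---b;  // tokens: a -- - b, that is, (a--) - b
    assert(a == 4);          // a was decremented after its value was used
    assert(b == 2);          // b is unchanged
    return difference;       // 5 - 2
}
```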

A token can be an identifier, a reserved keyword, a literal, or an operator or punctuation symbol. Each kind of token is described in this section in more detail.

Identifiers

An identifier is a name that the programmer defines or that is defined in a library. An identifier begins with a nondigit character and is followed by any number of digits and nondigits. A nondigit character is a letter, an underscore, or one of a set of universal characters. The exact set of nondigit universal characters is defined in the C++ standard and in ISO/IEC PDTR 10176. Briefly, it contains the universal characters that represent letters in various languages.

Certain identifiers are reserved for use by the standard library:

Keywords

A keyword is an identifier that is reserved in all contexts for special use by the language. Table 1-4 lists all the reserved keywords. (Note that some compilers do not implement all of the reserved keywords as keywords; these compilers will allow you to use certain keywords as identifiers. See the section, Alternative Tokens, at the end of this chapter for more information.)

Table 1-4: Reserved keywords
and compl export namespace return try xor
and_eq const extern new short typedef xor_eq
asm const_cast false not signed typeid  
auto continue float not_eq sizeof typename  
bitand default for operator static union  
bitor delete friend or static_cast unsigned  
bool do goto or_eq struct using  
break double if private switch virtual  
case dynamic_cast inline protected template void  
catch else int public this volatile  
char enum long register throw wchar_t  
class explicit mutable reinterpret_cast true while  

Literals

A literal is an integer, floating point, Boolean, character, or string constant.

Integer literals

An integer literal can be a decimal, octal, or hexadecimal constant. A prefix specifies the base or radix: 0x or 0X for hexadecimal, 0 for octal, and nothing for decimal. The literal can also have a suffix that is a combination of U and L, for unsigned and long, respectively. Each letter can be uppercase or lowercase, and the two can appear in either order, but neither can be repeated.

Following are some examples of integer literals:

314    // legal
314u   // legal
314LU  // legal
0xFeeL // legal
0ul    // legal
078    // illegal: 8 is not an octal digit
032UU  // illegal: cannot repeat a suffix
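One way to see how the prefix and suffix affect a literal is to let overload resolution report the literal's type. The sketch below uses a hypothetical set of integer_type_name overloads written for this illustration; the literals are small enough to fit in int on any implementation, so only the suffix determines the type.

```cpp
#include <cassert>
#include <string>

// Hypothetical overloads: the exact match reveals the type of the argument.
std::string integer_type_name(int)           { return "int"; }
std::string integer_type_name(unsigned)      { return "unsigned int"; }
std::string integer_type_name(long)          { return "long"; }
std::string integer_type_name(unsigned long) { return "unsigned long"; }

void check_integer_literals()
{
    // The suffix selects the type.
    assert(integer_type_name(314)    == "int");
    assert(integer_type_name(314u)   == "unsigned int");
    assert(integer_type_name(314LU)  == "unsigned long");
    assert(integer_type_name(0xFeeL) == "long");
    assert(integer_type_name(0ul)    == "unsigned long");

    // The prefix selects the radix.
    assert(0xFee == 4078);  // hexadecimal: 15*256 + 14*16 + 14
    assert(032 == 26);      // octal: 3*8 + 2
}
```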

Floating point literals

A floating point literal has an integer part, a decimal point, a fractional part, and an exponent part. You must include the decimal point, the exponent, or both. You must include the integer part, the fractional part, or both. The signed exponent is introduced by e or E. The literal's type is double unless there is a suffix: F for type float and L for long double. The suffix can be uppercase or lowercase.

Following are some examples of floating point literals:

3.14159    // legal
.314159F   // legal
314159E-5L // legal
314.       // legal
314E       // illegal: incomplete exponent
314f       // illegal: no decimal or exponent
.e24       // illegal: missing integer or fraction
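The type rules for the legal literals above can be checked with overload resolution. The floating_type_name overloads are hypothetical helpers written for this sketch.

```cpp
#include <cassert>
#include <string>

// Hypothetical overloads: the exact match reveals the type of the argument.
std::string floating_type_name(float)       { return "float"; }
std::string floating_type_name(double)      { return "double"; }
std::string floating_type_name(long double) { return "long double"; }

void check_floating_literals()
{
    assert(floating_type_name(3.14159)     == "double");       // no suffix
    assert(floating_type_name(.314159F)    == "float");        // F suffix
    assert(floating_type_name(314159E-5L)  == "long double");  // L suffix
    assert(floating_type_name(314.)        == "double");       // fraction omitted

    assert(314. == 314.0);  // omitting the fractional part does not change the value
    assert(1e2 == 100.0);   // omitting the decimal point is fine with an exponent
}
```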

Boolean literals

There are two Boolean literals, both keywords: true and false.

Character literals

Character literals are enclosed in single quotes. A character literal has type char. (Note that in C, a character literal has type int. You can often ignore the distinction, but not always.)

A character literal can be a plain character (e.g., 'x'), an escape sequence, or a universal character. Table 1-5 lists the possible escape sequences. Note that you must use an escape sequence for a backslash or single quote character literal. Using an escape for a double quote or question mark is optional. Only the characters shown in Table 1-5 are allowed in an escape sequence.

Table 1-5: Character escape sequences
Escape sequence Meaning
\\ \ character
\' ' character
\? ? character (used to avoid creating a trigraph, e.g., \?\?-)
\" " character
\a Alert or bell
\b Backspace
\f Form feed
\n New line
\r Carriage return
\t Horizontal tab
\v Vertical tab
\ooo Octal number of 1 to 3 digits
\xhh... Hexadecimal number of 1 or more digits

Wide character literals are enclosed in single quotes, with an initial L (must be uppercase), and have type wchar_t. In contrast to wide characters, ordinary characters are sometimes called narrow characters.

The value of a character literal depends on the execution character set. The wchar_t type is an integral type that is large enough to hold every character in the execution wide-character set for any supported locale. There is no guarantee that wchar_t is big enough for all universal characters (\uXXXX or \UXXXXXXXX) or even that wchar_t is bigger than char.
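The following sketch checks a few of the properties of character literals that do not depend on the execution character set. The char_type_name overloads are hypothetical helpers written for this illustration.

```cpp
#include <cassert>
#include <string>

// Hypothetical overloads: the exact match reveals the type of the argument.
std::string char_type_name(char)    { return "char"; }
std::string char_type_name(wchar_t) { return "wchar_t"; }
std::string char_type_name(int)     { return "int"; }

void check_character_literals()
{
    assert(char_type_name('x')  == "char");     // narrow literal: char, not int
    assert(char_type_name(L'x') == "wchar_t");  // wide literal
    assert(sizeof('x') == 1);                   // sizeof(char) is always 1

    assert('\x41' == '\101');  // hexadecimal and octal escapes for the same value
    assert('\0' == 0);         // the null character
    assert('\'' != '\\');      // escaped single quote and escaped backslash
}
```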

String literals

String literals are enclosed in double quotes. A string contains characters, similar to character literals: plain characters, escape sequences, and universal characters. A string cannot cross a line boundary, but it can contain escaped line endings (backslash followed by new line).

A wide string literal is prefaced with L (always uppercase). In a wide string literal, a single universal character always maps to a single wide character. In a narrow string literal, a universal character might map to multiple characters (called a multibyte character).

Two adjacent string literals (possibly separated by white space) are concatenated at compile time into a single string. This is often a convenient way to break a long string across multiple lines. Do not try to combine a narrow string with a wide string this way.

After concatenating adjacent strings, the character '\0' is automatically appended after the last character in the string literal.

Following are some examples of string literals:

"hello"
"hello, \
reader"
"hello, " "rea" "der"
"Alert: \a; ASCII tab: \011; portable tab: \t"
"illegal: unterminated string
L"string with \"quotes\""
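Concatenation and the automatic terminating null character can be verified with a few assertions. This is a minimal sketch; check_string_literals is a name made up for the example.

```cpp
#include <cassert>
#include <cstring>

void check_string_literals()
{
    // Adjacent literals are joined into one string at compile time.
    assert(std::strcmp("hello, " "rea" "der", "hello, reader") == 0);

    // The array includes the terminating '\0': 5 characters plus 1.
    assert(sizeof("hello") == 6);

    // One terminator is appended after concatenation, not one per piece.
    assert(sizeof("ab" "cd") == 5);

    // Escaped quotes count as single characters.
    assert(std::strlen("\"quotes\"") == 8);

    // A wide string literal is an array of wchar_t, also null-terminated.
    assert(sizeof(L"ab") == 3 * sizeof(wchar_t));
}
```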

Symbols

Non-alphabetic symbols are used as operators and as punctuation (e.g., statement terminators). Some symbols have multiple adjacent characters. Table 1-6 lists all the symbols used for operators and punctuation.

Table 1-6: Operator and punctuation symbols
{ # <: %: ... - ^ : ->* == <= <<= *= |=
} ## :> %:%: , * & :: ~ != >= >>= /= ^=
[ ( <% ; . / | .* ! < << -= %= ++
] ) %> : + % ? -> = > >> += &= --

You cannot insert white space between the symbol characters, and C++ always collects as many characters as it can to form a symbol before trying to interpret the symbol. Thus, an expression such as x+++y is read as x ++ + y. A common error when first using templates is to omit a space between closing angle brackets in a nested template instantiation, e.g.,

std::list<std::vector<int> > l;
//     note the space here^

The example is incorrect without the space character because the adjacent greater-than signs would be interpreted as a single right shift operator, not as two separate closing angle brackets. Another error, slightly less common, is instantiating a template with a template argument that uses the global scope operator:

::std::list< ::std::list<int> > l;
//          ^space here   and^ here

Again, a space is needed to prevent the compiler from accumulating too many characters into a single token. In this case, the token is <:, which is an alternative token, as described in the next section.
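Both declarations can be exercised in a short function. This is a sketch with a hypothetical function name; the spaces are written exactly as the text recommends.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <vector>

std::size_t nested_template_demo()
{
    // The space keeps the two > characters from forming a >> token.
    std::list<std::vector<int> > rows;
    rows.push_back(std::vector<int>(3, 0));

    // The space after < keeps <: from forming an alternative token.
    ::std::list< ::std::list<int> > lists;
    lists.push_back(::std::list<int>());

    return rows.front().size() + lists.size();  // 3 + 1
}
```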

Alternative Tokens

Some symbols have multiple representations, as shown in Table 1-7. Unlike trigraphs, the alternative tokens have no special meaning in a character or string literal. They are merely alternative spellings for common symbols.

Table 1-7: Alternative tokens
Alternative Primary token Alternative Primary token
<% { bitand &
%> } bitor |
<: [ compl ~
:> ] not !
%: # not_eq !=
%:%: ## or ||
and && or_eq |=
and_eq &= xor ^
    xor_eq ^=

Many compilers do not support all of the alternative tokens. In particular, some do not treat alternative keywords (and, or, etc.) as reserved keywords, but allow you to use them as identifiers. This is wrong, and clearly violates the C++ standard. Nonetheless, a few major compilers ignore this part of the standard, and you must be aware of the problem. Fortunately, this problem is becoming less common as more vendors hew closer to the standard.
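The following sketch uses several alternative tokens from Table 1-7. It assumes a compiler that follows the standard on this point; a compiler with the problem just described might reject it or need an extra option. The function name is made up for the example.

```cpp
#include <cassert>
#include <cstring>

int alternative_tokens_demo()
<%                                              // same as {
    int values<:3:> = <% 1, 2, 4 %>;            // int values[3] = { 1, 2, 4 };
    int mask = values<:0:> bitor values<:2:>;   // 1 | 4

    if (not (mask == 5) or mask not_eq 5)       // !(mask == 5) || mask != 5
        return -1;

    // Inside a string literal, <% is just two ordinary characters.
    assert(std::strcmp("<%", "{") != 0);

    return mask xor 1;                          // 5 ^ 1
%>
```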