UNB/ CS/ David Bremner/ teaching/ cs2613/ books/ mdn/ Reference/ Lexical grammar

This page describes JavaScript's lexical grammar. JavaScript source text is just a sequence of characters — in order for the interpreter to understand it, the string has to be parsed to a more structured representation. The initial step of parsing is called lexical analysis, in which the text gets scanned from left to right and is converted into a sequence of individual, atomic input elements. Some input elements are insignificant to the interpreter, and will be stripped after this step — they include white space and comments. The others, including identifiers, keywords, literals, and punctuators (mostly operators), will be used for further syntax analysis. Line terminators and multiline comments are also syntactically insignificant, but they guide the process for automatic semicolons insertion to make certain invalid token sequences become valid.

Format-control characters

Format-control characters have no visual representation but are used to control the interpretation of the text.

Code point Name Abbreviation Description
U+200C Zero width non-joiner \ Placed between characters to prevent being connected into ligatures in certain languages (Wikipedia).
U+200D Zero width joiner \ Placed between characters that would not normally be connected in order to cause the characters to be rendered using their connected form in certain languages (Wikipedia).
U+FEFF Byte order mark \ Used at the start of the script to mark it as Unicode and the text's byte order (Wikipedia).

In JavaScript source text, \ and \ are treated as identifier parts, while \ (also called a zero-width no-break space \ when not at the start of text) is treated as white space.

White space

White space characters improve the readability of source text and separate tokens from each other. These characters are usually unnecessary for the functionality of the code. Minification tools are often used to remove whitespace in order to reduce the amount of data that needs to be transferred.

Code point Name Abbreviation Description Escape sequence
U+0009 Character tabulation \ Horizontal tabulation \t
U+000B Line tabulation \ Vertical tabulation \v
U+000C Form feed \ Page breaking control character (Wikipedia). \f
U+0020 Space \ Normal space
U+00A0 No-break space \ Normal space, but no point at which a line may break
U+FEFF Zero-width no-break space \ When not at the start of a script, the BOM marker is a normal whitespace character.
Others Other Unicode space characters \ Characters in the "Space_Separator" general category

Note: Of those characters with the "White_Space" property but are not in the "Space_Separator" general category, U+0009, U+000B, and U+000C are still treated as white space in JavaScript; U+0085 NEXT LINE has no special role; others become the set of line terminators.

Note: Changes to the Unicode standard used by the JavaScript engine may affect programs' behavior. For example, ES2016 upgraded the reference Unicode standard from 5.1 to 8.0.0, which caused U+180E MONGOLIAN VOWEL SEPARATOR to be moved from the "Space_Separator" category to the "Format (Cf)" category, and made it a non-whitespace. Subsequently, the result of "\u180E".trim().length changed from 0 to 1.

Line terminators

In addition to white space characters, line terminator characters are used to improve the readability of the source text. However, in some cases, line terminators can influence the execution of JavaScript code as there are a few places where they are forbidden. Line terminators also affect the process of automatic semicolon insertion.

Outside the context of lexical grammar, white space and line terminators are often conflated. For example, String.prototype.trim removes all white space and line terminators from the beginning and end of a string. The \s character class escape in regular expressions matches all white space and line terminators.

Only the following Unicode code points are treated as line terminators in ECMAScript, other line breaking characters are treated as white space (for example, Next Line, NEL, U+0085 is considered as white space).

Code point Name Abbreviation Description Escape sequence
U+000A Line Feed \ New line character in UNIX systems. \n
U+000D Carriage Return \ New line character in Commodore and early Mac systems. \r
U+2028 Line Separator \ Wikipedia
U+2029 Paragraph Separator \ Wikipedia

Comments

Comments are used to add hints, notes, suggestions, or warnings to JavaScript code. This can make it easier to read and understand. They can also be used to disable code to prevent it from being executed; this can be a valuable debugging tool.

JavaScript has two long-standing ways to add comments to code: line comments and block comments. In addition, there's a special hashbang comment syntax.

Line comments

The first way is the // comment; this makes all text following it on the same line into a comment. For example:

function comment() {
  // This is a one line JavaScript comment
  console.log("Hello world!");
}
comment();

Block comments

The second way is the /* */ style, which is much more flexible.

For example, you can use it on a single line:

function comment() {
  /* This is a one line JavaScript comment */
  console.log("Hello world!");
}
comment();

You can also make multiple-line comments, like this:

function comment() {
  /* This comment spans multiple lines. Notice
     that we don't need to end the comment until we're done. */
  console.log("Hello world!");
}
comment();

You can also use it in the middle of a line, if you wish, although this can make your code harder to read so it should be used with caution:

function comment(x) {
  console.log("Hello " + x /* insert the value of x */ + " !");
}
comment("world");

In addition, you can use it to disable code to prevent it from running, by wrapping code in a comment, like this:

function comment() {
  /* console.log("Hello world!"); */
}
comment();

In this case, the console.log() call is never issued, since it's inside a comment. Any number of lines of code can be disabled this way.

Block comments that contain at least one line terminator behave like line terminators in automatic semicolon insertion.

Hashbang comments

There's a special third comment syntax, the hashbang comment. A hashbang comment behaves exactly like a single line-only (//) comment, except that it begins with #! and is only valid at the absolute start of a script or module. Note also that no whitespace of any kind is permitted before the #!. The comment consists of all the characters after #! up to the end of the first line; only one such comment is permitted.

Hashbang comments in JavaScript resemble shebangs in Unix which provide the path to a specific JavaScript interpreter that you want to use to execute the script. Before the hashbang comment became standardized, it had already been de-facto implemented in non-browser hosts like Node.js, where it was stripped from the source text before being passed to the engine. An example is as follows:

#!/usr/bin/env node

console.log("Hello world");

The JavaScript interpreter will treat it as a normal comment — it only has semantic meaning to the shell if the script is directly run in a shell.

Warning: If you want scripts to be runnable directly in a shell environment, encode them in UTF-8 without a BOM. Although a BOM will not cause any problems for code running in a browser — because it's stripped during UTF-8 decoding, before the source text is analyzed — a Unix/Linux shell will not recognize the hashbang if it's preceded by a BOM character.

You must only use the #! comment style to specify a JavaScript interpreter. In all other cases just use a // comment (or multiline comment).

Identifiers

An identifier is used to link a value with a name. Identifiers can be used in various places:

const decl = 1; // Variable declaration (may also be `let` or `var`)
function fn() {} // Function declaration
const obj = { key: "value" }; // Object keys
// Class declaration
class C {
  #priv = "value"; // Private property
}
lbl: console.log(1); // Label

In JavaScript, identifiers are commonly made of alphanumeric characters, underscores (_), and dollar signs ($). Identifiers are not allowed to start with numbers. However, JavaScript identifiers are not only limited to — many Unicode code points are allowed as well. Namely, any character in the ID_Start category can start an identifier, while any character in the ID_Continue category can appear after the first character.

Note: If, for some reason, you need to parse some JavaScript source yourself, do not assume all identifiers follow the pattern /[A-Za-z_$][\w$]*/ (i.e. ASCII-only)! The range of identifiers can be described by the regex /[$_\p{ID_Start}][$\u200c\u200d\p{ID_Continue}]*/u (excluding unicode escape sequences).

In addition, JavaScript allows using Unicode escape sequences in the form of \u0000 or \u{000000} in identifiers, which encode the same string value as the actual Unicode characters. For example, 你好 and \u4f60\u597d are the same identifiers:

const 你好 = "Hello";
console.log(\u4f60\u597d); // Hello

Not all places accept the full range of identifiers. Certain syntaxes, such as function declarations, function expressions, and variable declarations require using identifiers names that are not reserved words.

function import() {} // Illegal: import is a reserved word.

Most notably, private properties and object properties allow reserved words.

const obj = { import: "value" }; // Legal despite `import` being reserved
class C {
  #import = "value";
}

Keywords

Keywords are tokens that look like identifiers but have special meanings in JavaScript. For example, the keyword async before a function declaration indicates that the function is asynchronous.

Some keywords are reserved, meaning that they cannot be used as an identifier for variable declarations, function declarations, etc. They are often called reserved words. A list of these reserved words is provided below. Not all keywords are reserved — for example, async can be used as an identifier anywhere. Some keywords are only contextually reserved — for example, await is only reserved within the body of an async function, and let is only reserved in strict mode code, or const and let declarations.

Identifiers are always compared by string value, so escape sequences are interpreted. For example, this is still a syntax error:

const els\u{65} = 1;
// `els\u{65}` encodes the same identifier as `else`

Reserved words

These keywords cannot be used as identifiers for variables, functions, classes, etc. anywhere in JavaScript source.

The following are only reserved when they are found in strict mode code:

The following are only reserved when they are found in module code or async function bodies:

Future reserved words

The following are reserved as future keywords by the ECMAScript specification. They have no special functionality at present, but they might at some future time, so they cannot be used as identifiers.

These are always reserved:

The following are only reserved when they are found in strict mode code:

Future reserved words in older standards

The following are reserved as future keywords by older ECMAScript specifications (ECMAScript 1 till 3).

Identifiers with special meanings

A few identifiers have a special meaning in some contexts without being reserved words of any kind. They include:

Literals

Note: This section discusses literals that are atomic tokens. Object literals and array literals are expressions that consist of a series of tokens.

Null literal

See also null for more information.

null

Boolean literal

See also boolean type for more information.

true
false

Numeric literals

The Number and BigInt types use numeric literals.

Decimal

1234567890
42

Decimal literals can start with a zero (0) followed by another decimal digit, but if all digits after the leading 0 are smaller than 8, the number is interpreted as an octal number. This is considered a legacy syntax, and number literals prefixed with 0, whether interpreted as octal or decimal, cause a syntax error in strict mode — so, use the 0o prefix instead.

0888 // 888 parsed as decimal
0777 // parsed as octal, 511 in decimal
Exponential

The decimal exponential literal is specified by the following format: beN; where b is a base number (integer or floating), followed by an E or e character (which serves as separator or exponent indicator) and N, which is exponent or power number – a signed integer.

0e-5   // 0
0e+5   // 0
5e1    // 50
175e-2 // 1.75
1e3    // 1000
1e-3   // 0.001
1E3    // 1000

Binary

Binary number syntax uses a leading zero followed by a lowercase or uppercase Latin letter "B" (0b or 0B). Any character after the 0b that is not 0 or 1 will terminate the literal sequence.

0b10000000000000000000000000000000 // 2147483648
0b01111111100000000000000000000000 // 2139095040
0B00000000011111111111111111111111 // 8388607

Octal

Octal number syntax uses a leading zero followed by a lowercase or uppercase Latin letter "O" (0o or 0O). Any character after the 0o that is outside the range (01234567) will terminate the literal sequence.

0O755 // 493
0o644 // 420

Hexadecimal

Hexadecimal number syntax uses a leading zero followed by a lowercase or uppercase Latin letter "X" (0x or 0X). Any character after the 0x that is outside the range (0123456789ABCDEF) will terminate the literal sequence.

0xFFFFFFFFFFFFFFFFF // 295147905179352830000
0x123456789ABCDEF   // 81985529216486900
0XA                 // 10

BigInt literal

The BigInt type is a numeric primitive in JavaScript that can represent integers with arbitrary precision. BigInt literals are created by appending n to the end of an integer.

123456789123456789n     // 123456789123456789
0o777777777777n         // 68719476735
0x123456789ABCDEFn      // 81985529216486895
0b11101001010101010101n // 955733

BigInt literals cannot start with 0 to avoid confusion with legacy octal literals.

0755n; // SyntaxError: invalid BigInt syntax

For octal BigInt numbers, always use zero followed by the letter "o" (uppercase or lowercase):

0o755n;

For more information about BigInt, see also JavaScript data structures.

Numeric separators

To improve readability for numeric literals, underscores (_, U+005F) can be used as separators:

1_000_000_000_000
1_050.95
0b1010_0001_1000_0101
0o2_2_5_6
0xA0_B0_C0
1_000_000_000_000_000_000_000n

Note these limitations:

// More than one underscore in a row is not allowed
100__000; // SyntaxError

// Not allowed at the end of numeric literals
100_; // SyntaxError

// Can not be used after leading 0
0_1; // SyntaxError

String literals

A string literal is zero or more Unicode code points enclosed in single or double quotes. Unicode code points may also be represented by an escape sequence. All code points may appear literally in a string literal except for these code points:

Any code points may appear in the form of an escape sequence. String literals evaluate to ECMAScript String values. When generating these String values Unicode code points are UTF-16 encoded.

'foo'
"bar"

The following subsections describe various escape sequences (\ followed by one or more characters) available in string literals. Any escape sequence not listed below becomes an "identity escape" that becomes the code point itself. For example, \z is the same as z. There's a deprecated octal escape sequence syntax described in the Deprecated and obsolete features page. Many of these escape sequences are also valid in regular expressions — see Character escape.

Escape sequences

Special characters can be encoded using escape sequences:

Escape sequence Unicode code point
\0 null character (U+0000 NULL)
\' single quote (U+0027 APOSTROPHE)
\" double quote (U+0022 QUOTATION MARK)
\\ backslash (U+005C REVERSE SOLIDUS)
\n newline (U+000A LINE FEED; LF)
\r carriage return (U+000D CARRIAGE RETURN; CR)
\v vertical tab (U+000B LINE TABULATION)
\t tab (U+0009 CHARACTER TABULATION)
\b backspace (U+0008 BACKSPACE)
\f form feed (U+000C FORM FEED)
\ followed by a line terminator empty string

The last escape sequence, \ followed by a line terminator, is useful for splitting a string literal across multiple lines without changing its meaning.

const longString =
  "This is a very long string which needs \
to wrap across multiple lines because \
otherwise my code is unreadable.";

Make sure there is no space or any other character after the backslash (except for a line break), otherwise it will not work. If the next line is indented, the extra spaces will also be present in the string's value.

You can also use the + operator to append multiple strings together, like this:

const longString =
  "This is a very long string which needs " +
  "to wrap across multiple lines because " +
  "otherwise my code is unreadable.";

Both of the above methods result in identical strings.

Hexadecimal escape sequences

Hexadecimal escape sequences consist of \x followed by exactly two hexadecimal digits representing a code unit or code point in the range 0x0000 to 0x00FF.

"\xA9"; // "©"

Unicode escape sequences

A Unicode escape sequence consists of exactly four hexadecimal digits following \u. It represents a code unit in the UTF-16 encoding. For code points U+0000 to U+FFFF, the code unit is equal to the code point. Code points U+10000 to U+10FFFF require two escape sequences representing the two code units (a surrogate pair) used to encode the character; the surrogate pair is distinct from the code point.

See also String.fromCharCode and String.prototype.charCodeAt.

"\u00A9"; // "©" (U+A9)

Unicode code point escapes

A Unicode code point escape consists of \u{, followed by a code point in hexadecimal base, followed by }. The value of the hexadecimal digits must be in the range 0 and 0x10FFFF inclusive. Code points in the range U+10000 to U+10FFFF do not need to be represented as a surrogate pair.

See also String.fromCodePoint and String.prototype.codePointAt.

"\u{2F804}"; // CJK COMPATIBILITY IDEOGRAPH-2F804 (U+2F804)

// the same character represented as a surrogate pair
"\uD87E\uDC04";

Regular expression literals

Regular expression literals are enclosed by two forward slashes (/). The lexer consumes all characters up to the next unescaped forward slash or the end of the line, unless the forward slash appears within a character class ([]). Some characters (namely, those that are identifier parts) can appear after the closing slash, denoting flags.

The lexical grammar is very lenient: not all regular expression literals that get identified as one token are valid regular expressions.

See also RegExp for more information.

/ab+c/g
/[/]/

A regular expression literal cannot start with two forward slashes (//), because that would be a line comment. To specify an empty regular expression, use /(?:)/.

Template literals

One template literal consists of several tokens: `xxx${ (template head), }xxx${ (template middle), and }xxx` (template tail) are individual tokens, while any expression may come between them.

See also template literals for more information.

`string text`

`string text line 1
 string text line 2`

`string text ${expression} string text`

tag`string text ${expression} string text`

Automatic semicolon insertion

Some JavaScript statements' syntax definitions require semicolons (;) at the end. They include:

However, to make the language more approachable and convenient, JavaScript is able to automatically insert semicolons when consuming the token stream, so that some invalid token sequences can be "fixed" to valid syntax. This step happens after the program text has been parsed to tokens according to the lexical grammar. There are three cases when semicolons are automatically inserted:

1. When a token not allowed by the grammar is encountered, and it's separated from the previous token by at least one line terminator (including a block comment that includes at least one line terminator), or the token is "}", then a semicolon is inserted before the token.

{ 1
2 } 3

// is transformed by ASI into:

{ 1
;2 ;} 3;

// Which is valid grammar encoding three statements,
// each consisting of a number literal

The ending ")" of do...while is taken care of as a special case by this rule as well.

do {
  // ...
} while (condition) /* ; */ // ASI here
const a = 1

However, semicolons are not inserted if the semicolon would then become the separator in the for statement's head.

for (
  let a = 1 // No ASI here
  a < 10 // No ASI here
  a++
) {}

Semicolons are also never inserted as empty statements. For example, in the code below, if a semicolon is inserted after ")", then the code would be valid, with an empty statement as the if body and the const declaration being a separate statement. However, because automatically inserted semicolons cannot become empty statements, this causes a declaration to become the body of the if statement, which is not valid.

if (Math.random() > 0.5)
const x = 1 // SyntaxError: Unexpected token 'const'

2. When the end of the input stream of tokens is reached, and the parser is unable to parse the single input stream as a complete program, a semicolon is inserted at the end.

const a = 1 /* ; */ // ASI here

This rule is a complement to the previous rule, specifically for the case where there's no "offending token" but the end of input stream.

3. When the grammar forbids line terminators in some place but a line terminator is found, a semicolon is inserted. These places include:

Here ++ is not treated as a postfix operator applying to variable b, because a line terminator occurs between b and ++.

a = b
++c

// is transformed by ASI into

a = b;
++c;

Here, the return statement returns undefined, and the a + b becomes an unreachable statement.

return
a + b

// is transformed by ASI into

return;
a + b;

Note that ASI would only be triggered if a line break separates tokens that would otherwise produce invalid syntax. If the next token can be parsed as part of a valid structure, semicolons would not be inserted. For example:

const a = 1
(1).toString()

const b = 1
[1, 2, 3].forEach(console.log)

Because () can be seen as a function call, it would usually not trigger ASI. Similarly, [] may be a member access. The code above is equivalent to:

const a = 1(1).toString();

const b = 1[1, 2, 3].forEach(console.log);

This happens to be valid syntax. 1[1, 2, 3] is a property accessor with a comma-joined expression. Therefore, you would get errors like "1 is not a function" and "Cannot read properties of undefined (reading 'forEach')" when running the code.

Within classes, class fields and generator methods can be a pitfall as well.

class A {
  a = 1
  *gen() {}
}

It is seen as:

class A {
  a = 1 * gen() {}
}

And therefore will be a syntax error around {.

There are the following rules-of-thumb for dealing with ASI, if you want to enforce semicolon-less style:

  const a = b
  ++
  console.log(a) // ReferenceError: Invalid left-hand side expression in prefix operation
  const a = b++
  console.log(a)
  function foo() {
    return
      1 + 1 // Returns undefined; 1 + 1 is ignored
  }
  function foo() {
    return 1 + 1
  }

  function foo() {
    return (
      1 + 1
    )
  }
  outerBlock: {
    innerBlock: {
      break
        outerBlock // SyntaxError: Illegal break statement
    }
  }
  outerBlock: {
    innerBlock: {
      break outerBlock
    }
  }
  const foo = (a, b)
    => a + b
  const foo = (a, b) =>
    a + b
  async
  function foo() {}
  async function
  foo() {}
  // The () may be merged with the previous line as a function call
  (() => {
    // ...
  })()

  // The [ may be merged with the previous line as a property access
  [1, 2, 3].forEach(console.log)

  // The ` may be merged with the previous line as a tagged template literal
  `string text ${data}`.match(pattern).forEach(console.log)

  // The + may be merged with the previous line as a binary + expression
  +a.toString()

  // The - may be merged with the previous line as a binary - expression
  -a.toString()

  // The / may be merged with the previous line as a division expression
  /pattern/.exec(str).forEach(console.log)
  ;(() => {
    // ...
  })()
  ;[1, 2, 3].forEach(console.log)
  ;`string text ${data}`.match(pattern).forEach(console.log)
  ;+a.toString()
  ;-a.toString()
  ;/pattern/.exec(str).forEach(console.log)
  class A {
    a = 1
    [b] = 2
    *gen() {} // Seen as a = 1[b] = 2 * gen() {}
  }
  class A {
    a = 1;
    [b] = 2;
    *gen() {}
  }

Specifications

Browser compatibility

See also