Regular expressions

A regular expression (regex for short) allows developers to match strings against a pattern, extract submatch information, or simply test whether a string conforms to that pattern. Regular expressions are used in many programming languages, and JavaScript's syntax is inspired by Perl.

You are encouraged to read the regular expressions guide to get an overview of the available regex syntaxes and how they work.
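As a minimal sketch of the three uses named above (matching, extracting submatches, and testing), the snippet below uses a hypothetical date pattern; the variable names and the sample string are illustrative only.

```javascript
// Hypothetical pattern with three capturing groups: year, month, day.
const datePattern = /(\d{4})-(\d{2})-(\d{2})/;
const sample = "Released on 2024-03-15";

// Test whether the string contains text conforming to the pattern.
const looksLikeDate = datePattern.test(sample);

// Match and extract submatch information with exec().
const dateMatch = datePattern.exec(sample);
const [whole, year, month, day] = dateMatch;

console.log(looksLikeDate);           // true
console.log(whole, year, month, day); // 2024-03-15 2024 03 15
```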

Description

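Regular expressions are an important concept in formal language theory. They are a way to describe a possibly infinite set of character strings (called a language). A regular expression, at its core, needs the following features: an alphabet of characters that can be used in the language; concatenation (ab means "a followed by b"); union (a|b means "either a or b"); and the Kleene star (a* means "zero or more a characters").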

Assuming a finite alphabet (such as the 26 letters of the English alphabet, or the entire Unicode character set), all regular languages can be generated by the features above. Of course, many patterns are very tedious to express this way (such as "10 digits" or "a character that's not a space"), so JavaScript regular expressions include many shorthands, introduced below.
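The two tedious patterns mentioned above can be written compactly with shorthands. This is a sketch only; the variable names are illustrative, and anchoring with ^ and $ is an assumption made so that the whole string must match.

```javascript
// "10 digits": the \d shorthand plus a {10} quantifier replaces ten
// copies of (0|1|2|3|4|5|6|7|8|9).
const tenDigits = /^\d{10}$/;

// "A character that's not a space": a negated character class.
const notSpace = /^[^ ]$/;

console.log(tenDigits.test("5065551234")); // true
console.log(tenDigits.test("506555123"));  // false (only 9 digits)
console.log(notSpace.test("x"));           // true
console.log(notSpace.test(" "));           // false
```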

Note: JavaScript regular expressions are in fact not regular, due to the existence of backreferences (a regular language must be recognizable with a finite number of states). However, they are still a very useful feature.
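To illustrate the note, the sketch below uses a backreference to match strings that consist of some text repeated twice. Over an alphabet with at least two characters, that set of strings is a standard example of a language that is not regular. The name doubled is hypothetical.

```javascript
// (.+) captures some non-empty text; \1 requires the same text again.
const doubled = /^(.+)\1$/;

console.log(doubled.test("abcabc")); // true  ("abc" twice)
console.log(doubled.test("abcabd")); // false (second half differs)
console.log(doubled.test("aa"));     // true  ("a" twice)
```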

Creating regular expressions

A regular expression is typically created as a literal by enclosing a pattern in forward slashes (/):

const regex1 = /ab+c/g;

Regular expressions can also be created with the RegExp() constructor:

const regex2 = new RegExp("ab+c", "g");

The two forms produce equivalent regular expressions at runtime, although the choice between them can affect performance, static analyzability, and the ergonomics of escaping characters. For more information, see the RegExp reference.
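The escaping difference can be seen in a short sketch: inside a string passed to RegExp(), each backslash must itself be escaped. The names below (fromLiteral, fromConstructor, keyword, dynamic) are illustrative only.

```javascript
// The same pattern written both ways.
const fromLiteral = /\d+\.\d+/g;
const fromConstructor = new RegExp("\\d+\\.\\d+", "g"); // backslashes doubled

console.log(fromLiteral.source === fromConstructor.source); // true
console.log(fromLiteral.flags === fromConstructor.flags);   // true

// The constructor is useful when the pattern is built at runtime.
const keyword = "cat"; // hypothetical runtime value
const dynamic = new RegExp(`\\b${keyword}\\b`, "i");
console.log(dynamic.test("A Cat sat")); // true
```

Note that a runtime value containing regex syntax characters would need to be escaped before being interpolated this way.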

Regex flags

Flags are special parameters that can change the way a regular expression is interpreted or the way it interacts with the input text. Each flag corresponds to one accessor property on the RegExp object.

Flag  Corresponding property  Description
d     hasIndices              Generate indices for substring matches.
g     global                  Global search.
i     ignoreCase              Case-insensitive search.
m     multiline               Allows ^ and $ to match next to newline characters.
s     dotAll                  Allows . to match newline characters.
u     unicode                 "Unicode"; treat a pattern as a sequence of Unicode code points.
v     unicodeSets             An upgrade to the u mode with more Unicode features.
y     sticky                  Perform a "sticky" search that matches starting at the current position in the target string.
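The following sketch shows how a few of these flags surface as accessor properties and how they change matching. The patterns and strings are made up for illustration.

```javascript
// Flags are reflected as boolean accessor properties.
const flagged = /^b.d$/ims;
console.log(flagged.flags);      // "ims"
console.log(flagged.ignoreCase); // true
console.log(flagged.global);     // false

// m: ^ and $ also match at line boundaries.
const withM = /^b$/m.test("a\nb\nc");   // true
const withoutM = /^b$/.test("a\nb\nc"); // false

// s: . also matches newline characters.
const withS = /a.b/s.test("a\nb");      // true
const withoutS = /a.b/.test("a\nb");    // false

// g: find every match rather than only the first.
const allNumbers = "a1b22c333".match(/\d+/g); // ["1", "22", "333"]

// y: match only at lastIndex, not anywhere after it.
const sticky = /foo/y;
sticky.lastIndex = 3;
const stickyAtThree = sticky.test("barfoo"); // true
sticky.lastIndex = 0;
const stickyAtZero = sticky.test("barfoo");  // false
```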

The sections below list all available regex syntaxes, grouped by their syntactic nature.

Assertions

Assertions are constructs that test whether the string meets a certain condition at the specified position, but do not consume characters. Assertions cannot be quantified.
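The sketch below shows that assertions check a condition without consuming characters. It assumes the common assertion syntaxes (input boundaries, word boundaries, lookahead, and lookbehind); the sample strings are illustrative.

```javascript
// Lookahead: the digits must be followed by "px", but "px" is not
// part of the match.
const pixels = "width: 100px".match(/\d+(?=px)/)[0]; // "100"

// Lookbehind: the digits must be preceded by "$".
const dollars = "cost: $42".match(/(?<=\$)\d+/)[0];  // "42"

// Word boundary: "cat" must stand alone as a word.
const wholeWord = /\bcat\b/.test("a cat sat");       // true
const insideWord = /\bcat\b/.test("concatenate");    // false

// An assertion on its own matches the empty string at a position.
const zeroWidth = /(?=a)/.exec("banana");
console.log(zeroWidth[0] === "", zeroWidth.index);   // true 1
```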

Atoms

Atoms are the most basic units of a regular expression. Each atom consumes one or more characters in the string, and either fails the match or allows the pattern to continue matching with the next atom.
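As a rough illustration, the sketch below uses a few kinds of atoms: literal characters, the wildcard, a character class, and named capturing groups. This selection is an assumption for illustration, not an exhaustive list.

```javascript
// Wildcard: . consumes exactly one character.
const wildcard = /c.t/;
console.log(wildcard.test("cut")); // true
console.log(wildcard.test("ct"));  // false (nothing for . to consume)

// Character class: consumes one character from the listed set.
const charClass = /[bc]at/;
console.log(charClass.test("bat")); // true
console.log(charClass.test("rat")); // false

// Named capturing groups: consume and remember what they matched.
const grouped = /(?<hours>\d\d):(?<minutes>\d\d)/.exec("Starts at 09:30");
console.log(grouped.groups.hours, grouped.groups.minutes); // 09 30
```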

Other features

These features do not specify any pattern themselves, but are used to compose patterns.
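The text above does not enumerate these features; the sketch below assumes quantifiers and disjunction as representative examples, since both only combine or repeat other patterns.

```javascript
// Quantifier: {2,3} repeats the preceding atom two or three times.
const quantified = /^a{2,3}$/;
console.log(quantified.test("aa"));   // true
console.log(quantified.test("aaaa")); // false

// Disjunction: | chooses between alternatives.
const either = /^(?:cat|dog)s?$/;
console.log(either.test("dogs")); // true
console.log(either.test("bird")); // false

// Greedy versus lazy quantifiers.
const greedy = "<a><b>".match(/<.+>/)[0]; // "<a><b>"
const lazy = "<a><b>".match(/<.+?>/)[0];  // "<a>"
```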

Escape sequences

Escape sequences in regexes refer to any kind of syntax formed by \ followed by one or more characters. They may serve very different purposes depending on what follows \. Below is a list of all valid "escape sequences":

Escape sequence  Followed by                                                     Meaning
\B               None                                                            Non-word-boundary assertion
\D               None                                                            Character class escape representing non-digit characters
\P               {, a Unicode property and/or value, then }                      Unicode character class escape representing characters without the specified Unicode property
\S               None                                                            Character class escape representing non-white-space characters
\W               None                                                            Character class escape representing non-word characters
\b               None                                                            Word boundary assertion; inside character classes, represents U+0008 (BACKSPACE)
\c               A letter from A to Z or a to z                                  A character escape representing the control character with value equal to the letter's character value modulo 32
\d               None                                                            Character class escape representing digit characters (0 to 9)
\f               None                                                            Character escape representing U+000C (FORM FEED)
\k               <, an identifier, then >                                        A named backreference
\n               None                                                            Character escape representing U+000A (LINE FEED)
\p               {, a Unicode property and/or value, then }                      Unicode character class escape representing characters with the specified Unicode property
\q               {, a string, then a }                                           Only valid inside v-mode character classes; represents the string to be matched literally
\r               None                                                            Character escape representing U+000D (CARRIAGE RETURN)
\s               None                                                            Character class escape representing whitespace characters
\t               None                                                            Character escape representing U+0009 (CHARACTER TABULATION)
\u               4 hexadecimal digits; or {, 1 to 6 hexadecimal digits, then }   Character escape representing the character with the given code point
\v               None                                                            Character escape representing U+000B (LINE TABULATION)
\w               None                                                            Character class escape representing word characters (A to Z, a to z, 0 to 9, _)
\x               2 hexadecimal digits                                            Character escape representing the character with the given value
\0               None                                                            Character escape representing U+0000 (NULL)
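A few of the escape sequences listed above can be exercised as follows. This is a sketch with illustrative variable names; the u flag is used where the escape requires Unicode-aware mode.

```javascript
// \cJ: "J" has character value 74, and 74 modulo 32 is 10 (LINE FEED).
const controlJ = /\cJ/.test("\n");

// \x and \u: character escapes by hexadecimal value or code point.
const hexA = /\x41/.test("A");
const unicodeA = /\u0041/.test("A");
const astral = /^\u{1F600}$/u.test("\u{1F600}");

// \p and \P: Unicode property escapes (Lu = uppercase letter).
const upper = /\p{Lu}/u.test("\u00C9");    // U+00C9 is an uppercase letter
const notUpper = /\P{Lu}/u.test("\u00C9");

// \b inside a character class represents U+0008 (BACKSPACE).
const backspace = /[\b]/.test("\u0008");

// \k<name>: named backreference to an earlier named group.
const quoted = /(?<quote>['"]).*?\k<quote>/.test('say "hi"');

console.log(controlJ, hexA, unicodeA, astral); // true true true true
console.log(upper, notUpper);                  // true false
console.log(backspace, quoted);                // true true
```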

\ followed by any other digit character is either a backreference (when the pattern has a capturing group with that number) or a legacy octal escape sequence; legacy octal escapes are forbidden in Unicode-aware mode.
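A small sketch of the digit-escape behavior: the first pattern has no capturing groups, so \101 is read as a legacy octal escape (octal 101 is U+0041, "A"). The RegExp() constructor is used for the Unicode-aware case so that the syntax error can be caught at runtime. Variable names are illustrative.

```javascript
// Unicode-unaware mode: legacy octal escape for "A".
const octal = /\101/;
console.log(octal.test("A")); // true

// With a capturing group present, \1 is a backreference instead.
const backref = /(a)\1/;
console.log(backref.test("aa")); // true

// Unicode-aware mode rejects the legacy octal escape.
let octalError = null;
try {
  new RegExp("\\101", "u");
} catch (error) {
  octalError = error;
}
console.log(octalError instanceof SyntaxError); // true
```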

In addition, \ can be followed by some non-letter-or-digit characters, in which case the escape sequence is always a character escape representing the escaped character itself:

The other characters, namely the space character, ", ', _, and any letter character not mentioned above, are not valid escape sequences. In Unicode-unaware mode, escape sequences that are not one of the above become identity escapes: they represent the character that follows the backslash. For example, \a represents the character a. This behavior limits the ability to introduce new escape sequences without causing backward compatibility issues, and is therefore forbidden in Unicode-aware mode.
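The identity escape behavior can be checked with a short sketch. The constructor form is used for the Unicode-aware case so the syntax error surfaces at runtime rather than at parse time; variable names are illustrative.

```javascript
// Unicode-unaware mode: \a is an identity escape for "a".
const identity = /\a/.test("a");

// Escaping a syntax character such as $ is valid in both modes.
const dollarLegacy = /\$/.test("$");
const dollarUnicode = new RegExp("\\$", "u").test("$");

// Unicode-aware mode rejects the identity escape \a.
let identityError = null;
try {
  new RegExp("\\a", "u");
} catch (error) {
  identityError = error;
}

console.log(identity, dollarLegacy, dollarUnicode); // true true true
console.log(identityError instanceof SyntaxError);  // true
```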
