regexp(M)

regexp -- basic, extended, and shell regular expressions

Description

Regular expression handling is a pattern matching feature built into some UNIX commands and utilities. For example, you use regular expressions when searching for a word in the vi(C) editor, or when using grep(C) to find a string in a number of files. In both these cases, a supplied pattern is matched against a given set of strings; the utility selects those strings that conform to the pattern. The pattern must be in the form of a regular expression built from smaller elements of a language that has a well-defined grammar. The least complicated pattern that you could supply as a regular expression is a string; only an identical string will match exactly.

The type of expressions used in case statements and for matching filenames by the shells and by find(C), and by the commands used for accessing files (:e and :r) in vi, are known as shell regular expressions or case and file patterns. These differ from regular expressions; refer to ``Shell regular expressions'' for more information.

Traditionally, UNIX used a grammar that defined Simple Regular Expressions (SREs). Expressions that use this grammar are still supported. The ISO POSIX-2 DIS standard defines Basic Regular Expressions (BREs) and Extended Regular Expressions (EREs). Both of these regular expression types are internationalized versions of SREs:

they interpret a character set and its ordering (collation sequence) according to the current locale
they provide mechanisms to make most regular expressions invariant from one locale to another

EREs have a different and richer grammar than BREs. However, EREs are not a superset of BREs; some BREs will not work as EREs without modification.

The awk(C) and egrep(C) utilities use EREs; all other utilities that handle regular expressions, such as ed(C), ex(C), expr(C), grep(C), sed(C), and vi use BREs.

The following sections describe the elements that are used to construct regular expressions. Sections that apply to only one type of expression are marked (BRE) or (ERE).

Building regular expressions

The simplest form of regular expression is a string composed of characters that have been concatenated into a sequence. More complicated regular expressions generalize this by substituting one or more characters in a string with an operator and character sequence; such a regular expression will correspond to a larger number of matching strings.

A complication arises because the operators are themselves composed of characters; these are termed special characters. If you wish to search for one of the special characters as itself, see ``Searching for special characters''.

A BRE or ERE matches a string of zero or greater length where the characters in the string correspond to the pattern. The search begins at the start of a string and ends when the first matching sequence is found. If the expression allows there to be a variable number of characters in the matched string, the longest leftmost matching string is found.

Anchoring on the start of a line

The caret operator ``^'' is treated as an anchor when it occurs as the first character of a regular expression, or a BRE subexpression; the remainder of the expression only matches strings that occur at the start of a line.

Caret is treated as the literal caret character elsewhere in BREs, and always as an anchor in EREs. Note that caret is also used to begin non-matching lists in bracket expressions (see ``Matching one from a set of characters'').

For example, the regular expression ``^début'' matches the first string ``début'' on a line that begins with ``débutant''; it would not match on the line that begins as ``au début''.

Anchoring on the end of a line

The dollar operator ``$'' is treated as an anchor when it occurs at the last character of a regular expression or a BRE subexpression; the preceding part of the expression only matches strings that occur at the end of a line.

Dollar is treated as the literal dollar character elsewhere in BREs, and always as an anchor in EREs.

For example, the regular expression ``fin$'' matches ``fin'' on a line that ends with the string ``le fin''; it would not match on the line that ends in ``finale''.

Anchoring on both the start and end of a line

If both the caret and dollar anchors are used, the regular expression must match a complete line.

For example, the regular expression ``^begin middle end$'' matches the complete line ``begin middle end''; it would not match the line ``begin middling end''.
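The anchoring rules above can be tried out with grep, which treats its pattern as a BRE by default. This is a minimal sketch; bre_match is a hypothetical helper name, not part of any utility, and the test strings are ASCII variants of the examples in the text.

```shell
# bre_match PATTERN LINE (hypothetical helper): report whether the BRE
# matches anywhere in LINE.
bre_match() {
    if printf '%s\n' "$2" | grep -q -e "$1"; then
        echo "match"
    else
        echo "no match"
    fi
}

bre_match '^fin' 'finale'                            # match
bre_match '^fin' 'le fin'                            # no match
bre_match 'fin$' 'le fin'                            # match
bre_match 'fin$' 'finale'                            # no match
bre_match '^begin middle end$' 'begin middle end'    # match
bre_match '^begin middle end$' 'begin middling end'  # no match
```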

Grouping expressions

A BRE subexpression is a regular expression enclosed by the operators ``\('' and ``\)''.

An ERE group consists of a regular expression enclosed by the grouping operators ``('' and ``)''.

Subexpressions and groups match whatever the enclosed expression on its own would match. They are used to establish the expressions on which lower precedence operators should act (in the same way that parentheses are used in arithmetic). Any number of subexpressions or groups may be used, and they may be nested to any depth.

For example, the BRE ``\(dog\)matic'' matches the string ``dogmatic''. The equivalent ERE is ``(dog)matic''.
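The effect of grouping on a lower precedence operator can be sketched with sed, which uses BREs. The helper name mark is hypothetical; it brackets the first match so that the matched text is visible. The asterisk (zero or more occurrences) is used here only to show what the group changes.

```shell
# mark PATTERN LINE (hypothetical helper): enclose the first BRE match
# in square brackets. PATTERN must not contain a slash.
mark() {
    printf '%s\n' "$2" | sed "s/$1/[&]/"
}

mark '\(dog\)matic' 'dogmatic'   # [dogmatic]
mark '\(ab\)*c' 'ababc'          # [ababc]  the asterisk acts on the group
mark 'ab*c' 'ababc'              # ab[abc]  the asterisk acts on b only

# The equivalent ERE group, counted with grep -E:
printf 'dogmatic\n' | grep -E -c '(dog)matic'   # 1
```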

Referring to a previously matched subexpression (BRE)

A back-reference expression ``\#'' specifies a repetition of the string matched by the #th subexpression in the current regular expression. A null string (that has zero length) may be back-referenced. A back-reference may only specify subexpressions numbered 1 through 9; others are inaccessible.

For example, the expression ``\(more\) and \1'' matches the string ``more and more''.
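A back-reference can be tried out with grep. In this sketch the helper name bre_match is hypothetical, and the second pattern (any repeated lower case word) is an illustration that does not appear in the text above.

```shell
# bre_match PATTERN LINE (hypothetical helper): report whether the BRE
# matches anywhere in LINE.
bre_match() {
    if printf '%s\n' "$2" | grep -q -e "$1"; then
        echo "match"
    else
        echo "no match"
    fi
}

bre_match '\(more\) and \1' 'more and more'              # match
bre_match '\(more\) and \1' 'more and less'              # no match
# \1 must repeat whatever the subexpression actually matched:
bre_match '\([a-z][a-z]*\) and \1' 'again and again'     # match
bre_match '\([a-z][a-z]*\) and \1' 'more and less'       # no match
```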

Matching alternate expressions (ERE)

The alternation operator ``|'' allows matching of either of the two EREs that it separates. This operator is usually used to match one of two alternate groups (see ``Grouping expressions'').

For example, the expression ``((auto)|(dog))matic'' matches the strings ``automatic'' and ``dogmatic''.
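Alternation can be checked with grep -E, which accepts EREs. This is a small sketch; ere_match is a hypothetical helper name and ``dramatic'' is an extra test word chosen for illustration.

```shell
# ere_match PATTERN LINE (hypothetical helper): report whether the ERE
# matches anywhere in LINE.
ere_match() {
    if printf '%s\n' "$2" | grep -E -q -e "$1"; then
        echo "match"
    else
        echo "no match"
    fi
}

ere_match '((auto)|(dog))matic' 'automatic'   # match
ere_match '((auto)|(dog))matic' 'dogmatic'    # match
ere_match '((auto)|(dog))matic' 'dramatic'    # no match
```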

Matching a single character

A character matches itself unless it forms part of an operator.

For example, the expression ``explicit'' can only match itself.

Matching any single character

The dot operator ``.'' represents any single character except a null character.

For example, the expression ``..plicit'' can match ``explicit'' or ``implicit''.

Matching one from a set of characters

A set of characters to be matched is specified by enclosing a matching list or a non-matching list in the bracket expression operators ``['' and ``]''.

A matching list is used to specify a set of alternative characters that may be matched against a single character (or a sequence of several characters that are treated as a single character by the current collation sequence, see ``Matching multi-character collating elements'').

A non-matching list specifies a set of characters that may not be matched against a single character; that is, it will match any character except those specified. A non-matching list is indicated by a leading caret ``^''.

If a matching list includes a right bracket ``]'', it must be the first character in the list.

If a non-matching list includes a right bracket ``]'', it must be the first character following the initial ``^''.

As an example of a bracket expression, ``[Qq]werty'' matches ``Qwerty'' or ``qwerty''.

The bracket expression in the regular expression ``str[^ae]ng'' uses a non-matching list to eliminate some possible matches; ``string'', ``strong'', or ``strung'' matches, but ``strang'' and ``streng'' do not.
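The non-matching list in the example above can be exercised by filtering a list of words through grep. The function name pick is hypothetical.

```shell
# pick WORD... (hypothetical helper): print only the words that match
# the BRE str[^ae]ng, one per line.
pick() {
    printf '%s\n' "$@" | grep -e 'str[^ae]ng'
}

pick string strong strung strang streng
# string
# strong
# strung
```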

Matching zero or more occurrences

The asterisk operator ``*'' matches zero or more occurrences of the previous single character (including bracket expressions), BRE subexpression, ERE grouping, or BRE back-reference.

An asterisk is treated as itself if it occurs inside a bracket expression, as the first character of the regular expression (after any initial ``^'' anchor), as the first character of a BRE subexpression (after any initial anchor), as the first character of an ERE group, or if it is preceded by a single backslash ``\''.

For example, the expression ``a*'' matches ``aaaa'' in the string ``aaaaba'', and it matches the null string in ``bbbbcb''.

The pattern ``[ab]*'' matches ``aaab'' in ``aaabcbbb''.

Note that ``.*'' matches any string of characters, so ``sub.*'' would match ``subway'', ``submarine'', ``subjunctive'', and ``subsidy''.

The BRE ``\(a.*\)*cad\1'' matches ``abracadabra''; it also matches ``cad'' in the string ``academic'' since the first subexpression may be matched by a null string.
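The behavior of the asterisk operator, including a match against the null string, can be made visible with sed. This is a sketch only; the helper name mark is hypothetical. Because the search takes the leftmost match, a pattern that can match the null string matches at the very start of a line that does not begin with the repeated character.

```shell
# mark PATTERN LINE (hypothetical helper): enclose the first BRE match
# in square brackets. PATTERN must not contain a slash.
mark() {
    printf '%s\n' "$2" | sed "s/$1/[&]/"
}

mark 'a*' 'aaaaba'        # [aaaa]ba
mark 'a*' 'bbbbcb'        # []bbbbcb   null string matched at the start
mark '[ab]*' 'aaabcbbb'   # [aaab]cbbb
mark 'sub.*' 'subway'     # [subway]
```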

Matching one or more occurrences (ERE)

The plus sign operator ``+'' matches one or greater occurrences of the previous single character, or grouping.

A plus sign is treated as itself if it occurs inside a bracket expression, or if it is preceded by a single backslash ``\''.

For example, the expression ``a+'' matches ``aaaa'' in the string ``aaaaba'', but it does not match ``bbbbcb''.

The pattern ``e+p'' matches ``eep'' in ``sleep'', and ``ep'' in ``step''.
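The plus sign examples can be reproduced with awk, which uses EREs. In this sketch, first_match is a hypothetical helper that prints the text of the leftmost match using the standard awk match() function; the pattern is passed as a string, so it should not contain backslashes.

```shell
# first_match PATTERN LINE (hypothetical helper): print the leftmost
# ERE match in LINE, or nothing if there is no match.
first_match() {
    printf '%s\n' "$2" |
        awk -v re="$1" 'match($0, re) { print substr($0, RSTART, RLENGTH) }'
}

first_match 'a+' 'aaaaba'    # aaaa
first_match 'a+' 'bbbbcb'    # prints nothing
first_match 'e+p' 'sleep'    # eep
first_match 'e+p' 'step'     # ep
```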

Matching zero or one occurrences (ERE)

The question mark operator ``?'' matches zero or one occurrences of the previous single character, or grouping.

A question mark is treated as itself if it occurs inside a bracket expression, or if it is preceded by a single backslash ``\''.

For example, ``l?i'' matches ``li'' in ``slip'', and ``i'' in ``sip''.
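The question mark example can be sketched with sed. This assumes a sed that accepts the -E option to select EREs (common in current implementations, but not described in this manual page); ere_mark is a hypothetical helper name.

```shell
# ere_mark PATTERN LINE (hypothetical helper): enclose the first ERE
# match in square brackets. Assumes sed supports -E; PATTERN must not
# contain a slash.
ere_mark() {
    printf '%s\n' "$2" | sed -E "s/$1/[&]/"
}

ere_mark 'l?i' 'slip'   # s[li]p
ere_mark 'l?i' 'sip'    # s[i]p
```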

Matching a specified number of occurrences

A BRE interval expression has the syntax ``\{l\}'', ``\{l,\}'', or ``\{l,u\}''.

An ERE interval expression has the syntax ``{l}'', ``{l,}'', or ``{l,u}''.

An interval expression matches at least l, and at most u occurrences of the previous single character, subexpression, or back-reference. The lower limit, l, may not be less than 0. If l is specified without a trailing comma, exactly l occurrences are matched. The upper limit, u, must be less than or equal to 255 if it is specified. If u is omitted, but the comma is not, the upper limit on the number of occurrences is effectively infinite.

For example, the BRE expression ``e\{2\}'' matches ``ee'' in ``eleemosynary'', but finds no match in ``elementary''. The equivalent ERE expression is ``e{2}''.

The ERE expression ``(is{2}){2}'' matches ``ississ'' in ``Mississippi''.
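Interval expressions can be tried out in their BRE form with sed. This is a minimal sketch; bre_mark is a hypothetical helper name, and the last call is the BRE spelling of the ERE ``(is{2}){2}'' shown above.

```shell
# bre_mark PATTERN LINE (hypothetical helper): enclose the first BRE
# match in square brackets; the line is printed unchanged if nothing
# matches. PATTERN must not contain a slash.
bre_mark() {
    printf '%s\n' "$2" | sed "s/$1/[&]/"
}

bre_mark 'e\{2\}' 'eleemosynary'            # el[ee]mosynary
bre_mark 'e\{2\}' 'elementary'              # elementary
bre_mark '\(is\{2\}\)\{2\}' 'Mississippi'   # M[ississ]ippi
```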

Matching multi-character collating elements

Multi-character elements from a collation sequence are enclosed in collating symbol operators ``[.'' and ``.]''; these must in turn be part of a bracket expression.

Single-character elements are treated as themselves if specified as collating symbols; this is useful for representing characters such as dash ``-'' that have special meaning inside bracket expressions (see ``Specifying ranges of characters'').

For example, the expression ``[[.ij.]]'' matches only the collating element ``ij'' corresponding to the collating symbol <ij> in the current collation sequence; it is not the same as the bracket expression ``[ij]'' that matches ``i'' or ``j''.

Matching equivalent characters

Equivalent characters from a collation sequence may be matched by placing one of the elements of the equivalence class in equivalence class operators ``[='' and ``=]''; if the enclosed element is not from an equivalence class, the expression is interpreted as a collating symbol.

Note that only primary equivalence classes are recognized.

For example, if the characters ``e'', ``è'', ``é'', ``ê'', and ``ë'' are equivalent, then the bracket expression ``[d[=e=]f]'' is the same as ``[deèéêëf]''.

Matching classes of characters

Character classes are sets of characters as defined in the LC_CTYPE category in the current locale. Characters from a class may be matched by enclosing the class name in the operators ``[:'' and ``:]''.

The following character classes are supported:

alnum: letters and numeric digits; in the POSIX locale this includes alpha and digit
alpha: letters; in the POSIX locale this includes upper and lower
blank: blank characters; in the POSIX locale this consists of space and tab only
cntrl: control characters
digit: numeric digits; in the POSIX locale this consists of: 0 1 2 3 4 5 6 7 8 9
graph: printable characters not including the space character
lower: lower case letters; in the POSIX locale this consists of: a b c d e f g h i j k l m n o p q r s t u v w x y z
print: printable characters including the space character
punct: punctuation characters
space: whitespace characters; in the POSIX locale this consists of space, form feed, newline, carriage return, tab, and vertical tab
upper: upper case letters; in the POSIX locale this consists of: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
xdigit: hexadecimal digits; in the POSIX locale this consists of: 0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f
name: recognized if the name keyword has been defined as a charclass in the current locale

For example, in the POSIX locale, the bracket expression ``[^[:alnum:]]'' matches on a non-alphanumeric character.
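As a sketch of a character class inside a non-matching list, the following replaces every non-alphanumeric character with an underscore. The function name squash is hypothetical; LC_ALL=C is set so that the POSIX locale definitions listed above apply.

```shell
# squash STRING (hypothetical helper): replace each character that is
# not in the alnum class with an underscore, using the POSIX locale.
squash() {
    printf '%s\n' "$1" | LC_ALL=C sed 's/[^[:alnum:]]/_/g'
}

squash 'a-b c.d'     # a_b_c_d
squash 'file(1).txt' # file_1__txt
```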

Specifying ranges of characters

A range of characters is indicated as a starting point and an ending point from a collation sequence separated by a dash ``-''.

The starting point must occur before the ending point in the collation sequence.

Only collating elements or collating symbols may be used for starting and ending points. For example, ``[[=a=]-z]'' is invalid, but ``[[.sz.]-z]'' is valid if <sz> occurs earlier than ``z'' in the current collation sequence.

The ending point of one range cannot be used as the starting point of another range. For instance, in the POSIX locale, ``[a-mn-z]'' is allowed, but ``[a-m-z]'' is interpreted as ``[a-m[.-.]z]''.

A range must specify both a starting and an ending point; otherwise, the dash ``-'' character is treated as itself. ``[-dot]'' and ``[dot-]'' match any character from ``d'', ``o'', ``t'', and ``-''; ``[^-dot]'' and ``[^dot-]'' match any character but these.

To specify a dash character as the start of a range, specify it as the first character in a matching list, after the initial caret ``^'' in a non-matching list, or enclose it in collation symbol operators:

[.-.]

For example, in the POSIX locale, the expression ``[^[:digit:][.-.]-/]'' matches any character but the numeric digits and the symbols ``-'', ``.'', and ``/''. The expression ``[[.-.]-a]'' or ``[--a]'' matches characters in the range ``-'' to ``a''; ``[!--]'' or ``[!-[.-.]]'' matches characters from ``!'' to ``-''.

Note that using range expressions within applications may make them non-portable; the collation sequence may differ for locales in a way that will influence the order of execution or cause errors.

For example, in the POSIX locale, the expression ``[0-9]'' is identical to ``[0123456789]'' (and to ``[[:digit:]]'').
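The range rules can be sketched with sed in the POSIX locale. The helper name subst is hypothetical; it replaces every character matched by a bracket expression with ``X'' so that the members of the set are easy to see.

```shell
# subst BRACKET_EXPR LINE (hypothetical helper): replace every character
# matched by the bracket expression with X, using the POSIX locale.
subst() {
    printf '%s\n' "$2" | LC_ALL=C sed "s/$1/X/g"
}

subst '[0-9]' 'room 42'          # room XX
subst '[[:digit:]]' 'room 42'    # room XX
subst '[-dot]' 'a-b dot'         # aXb XXX   leading dash is literal
subst '[dot-]' 'a-b dot'         # aXb XXX   trailing dash is literal
subst '[^-dot]' 'a-b dot'        # X-XXdot
```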

Searching for special characters

The BRE special characters are:

. [ \ * ^ $

These characters are interpreted as themselves if they are used outside the context in which they are operators. You can force any of them to be interpreted as the character itself by preceding it with a backslash ``\''.

Note that the characters ( ) { } and the digits ``1'' through ``9'' have special meaning in BREs if they are preceded by a backslash.

The ERE special characters are:

. [ \ ( ) * + ? { | ^ $
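Escaping a special character with a backslash can be sketched with grep (BRE) and grep -E (ERE). The helper names bre_match and ere_match are hypothetical.

```shell
# bre_match / ere_match PATTERN LINE (hypothetical helpers): report
# whether the BRE or ERE matches anywhere in LINE.
bre_match() {
    if printf '%s\n' "$2" | grep -q -e "$1"; then
        echo "match"
    else
        echo "no match"
    fi
}
ere_match() {
    if printf '%s\n' "$2" | grep -E -q -e "$1"; then
        echo "match"
    else
        echo "no match"
    fi
}

bre_match 'a.b' 'axb'     # match     dot matches any single character
bre_match 'a\.b' 'axb'    # no match  escaped dot matches only a dot
bre_match 'a\.b' 'a.b'    # match
ere_match '1+1' '1+1'     # no match  plus is an operator in EREs
ere_match '1\+1' '1+1'    # match     escaped plus is a literal
```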

BRE precedence

This table shows the precedence of expressions in BREs from highest to lowest:

Expression type	BRE operators
equivalence class, character class, collation symbol	[==] [::] [..]
escaped special characters	\character
bracket expressions	[]
subexpressions, back-references	\(\) \#
zero or more occurrences, interval expression	* \{l,u\}
expression concatenation
start, end anchoring	^ $

ERE precedence

This table shows the precedence of expressions in EREs from highest to lowest:

Expression type	ERE operators
equivalence class, character class, collation symbol	[==] [::] [..]
escaped special characters	\character
bracket expressions	[]
grouping	()
zero or more, one or more, zero or one occurrences, interval expression	* + ? {l,u}
expression concatenation
start, end anchoring	^ $
alternation	|

Shell regular expressions

Shell regular expressions (case and file patterns) differ in several respects from regular expressions. A case pattern is matched with the value of a shell variable. Case patterns are used by case control statements in the shells. A file pattern is replaced by an alphabetically sorted list of filenames that match it. File patterns are used by the shells and find to generate a list of matching filenames. The vi family of editors (ex, edit(C), vi, view(C), and vedit(C)) also use file patterns for matching filenames when reading and writing files. Otherwise, they use BREs for pattern matching.

The following case and file pattern operators are all available in ksh, and the vi editors. csh, find, and sh allow the use of a more restricted set. (The utilities that you can use the patterns with are shown in parentheses; vi represents the vi family of editors.)

?: The equivalent of the regular expression operator ``.''; it stands for any single character within a filename. The shells prevent this operator matching ``.'' at the start of a filename (to conceal hidden files). For example, ls foo.? lists all the files with single character file extensions and whose names begin with ``foo.''. The commands ls ?profile and ls $HOME/?profile will not list the hidden file .profile even if it exists. (Available in csh, find, ksh, sh, and vi.)
\character: A special character (normally used to represent an operator) must be prefixed by a backslash ``\'' (escaped) if it is to stand for itself. The backslash is used to escape the following special characters (shell metacharacters) outside a bracket expression:

* ? ( | & ) [
Only the special character ``]'' must be escaped inside a bracket expression. (Available in csh, find, ksh, sh, and vi.)
[]: The bracket expression is the same as for regular expressions, and supports ranges. For example, ls [0-9]* lists all files that begin with a digit from ``0'' to ``9''. If a range is given in an order that does not correspond to the current collation sequence, the entire bracket expression is treated as a literal string to be searched for. (Available in csh, find, ksh, sh, and vi.)
[!]: The operator ! inside a bracket expression is the equivalent of the regular expression caret operator in forming non-matching lists. For example, ls [!A-Za-z]* lists all files that do not begin with an upper or lower case alphabetic character. Note that the shell suppresses the listing of hidden files (that have filenames starting with ``.'') even though they are non-matching. (Available in csh, find, ksh, sh, and vi.)
[..] [==] [::]: The operators [..] for collating symbol, [==] for equivalence class, and [::] for character class expressions may be used within bracket expressions only. (Available in find, ksh, and vi.)
*: This is the equivalent of the regular expression ``.*''; it stands for a string of any characters. The shells prevent matching with filenames that start with a dot ``.'' (to conceal hidden files). For example, ls *.c lists all files in the current directory that have the extension ``.c'', but it would not list a file named .foo.c. (Available in csh, find, ksh, sh, and vi.)
@(pattern[|pattern]...): Match exactly one occurrence of a group. This operator acts on groups only and must prefix the group. For example, ``@(abc)'' matches one occurrence of the string ``abc''. (Available in ksh, and vi.)
*(pattern[|pattern]...): Match zero or more of the given patterns. For example, ``*(xyz)'' matches zero or more occurrences of the string ``xyz''. (Available in ksh, and vi.)
+(pattern[|pattern]...): Match one or more of the given patterns. For example, ``+(abc)'' matches one or more occurrences of the string ``abc''. (Available in ksh, and vi.)
?(pattern[|pattern]...): Match zero or one of the given patterns. For example, ``?(def)'' matches zero or one occurrences of the string ``def''. (Available in ksh, and vi.)
!(pattern[|pattern]...): Match anything except the given patterns (zero occurrences of the group). For example, ``!(xyz)'' matches if the string ``xyz'' does not occur. (Available in ksh, and vi.)
{expr1,expr2[,expr3]...}: Generate a pattern for each expression listed (expr1, expr2 and so on). This is equivalent to a combination of the ERE group and alternation operators. For example, the expression ``a{b,c,l}e'' is equivalent to the ERE ``a((b)|(c)|(l))e''. In the Korn shell, the command ls a{b,c,l}e would be expanded as ls abe ace ale. (Available in csh, ksh, and vi.)
|: Equivalent to the ERE alternation operator. (Available in ksh, and vi.)
&: The conjunction operator is equivalent to regular expression concatenation. (Available in ksh, and vi.)

Case and file patterns do not recognize:

the caret operator ``^'' for beginning non-matching lists
BRE back-references
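Case patterns can be sketched with a case statement that uses only the operators listed above as available in sh. The function name classify and the sample names are hypothetical. The first pattern that matches is taken. Unlike file patterns, case patterns match against a string, so no filenames need to exist.

```shell
# classify NAME (hypothetical helper): describe NAME using shell case
# patterns; the first matching pattern wins.
classify() {
    case "$1" in
        *.c)        echo "C source" ;;
        foo.?)      echo "foo with a one-character extension" ;;
        [0-9]*)     echo "starts with a digit" ;;
        [!A-Za-z]*) echo "starts with a non-letter" ;;
        *)          echo "other" ;;
    esac
}

classify main.c   # C source
classify foo.h    # foo with a one-character extension
classify 9lives   # starts with a digit
classify _tmp     # starts with a non-letter
classify README   # other
```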

Limitations

Shell regular expressions differ from regular expressions; see ``Shell regular expressions'' here and in the Operating System User's Guide for further details.

No more than the first 9 sub-expressions may be back-referenced within a regular expression.

Interval expressions may not specify an upper limit greater than 255; if not specified, the limit is effectively infinite.

Note that a collation sequence is not necessarily equivalent to a collation order. See localedef(F) for more details.

Standards conformance

regexp is conformant with:

X/Open CAE Specification, Commands and Utilities, Issue 4, 1992.