regexp --
basic, extended, and shell regular expressions
Description
Regular expression handling is a pattern matching feature built into
some UNIX commands and utilities. For example, you use regular expressions
when searching for a word in the
vi(C)
editor, or when using
grep(C)
to find a string in a number of files.
In both these cases, a supplied pattern is matched against a given
set of strings; the utility selects those strings that conform to
the pattern.
The pattern must be in the form of a regular
expression built from smaller elements of a
language that has a well-defined grammar.
The least complicated pattern that you could supply as a regular
expression is a string; only an identical string will match exactly.
The type of expressions used in case statements
and for matching filenames by the shells and by
find(C),
and by the commands used for accessing files (:e and
:r) in
vi,
are known as shell regular expressions or case
and file patterns. These differ from regular expressions;
refer to
``Shell regular expressions''
for more information.
Traditionally, UNIX used a grammar that defined Simple
Regular Expressions (SREs). Expressions that use this
grammar are still supported. The ISO POSIX-2 DIS standard
defines Basic Regular Expressions (BREs) and
Extended Regular Expressions (EREs).
Both of these regular expression types are internationalized versions
of SREs:
they interpret a character set and its ordering (collation sequence)
according to the current locale
they provide mechanisms to make most regular expressions invariant
from one locale to another
EREs have a different and richer grammar than BREs.
However, EREs are not a superset of BREs; some
BREs will not work as EREs without modification.
The following sections describe the elements that are used to
construct regular expressions. Sections that apply to only one
type of expression are marked (BRE) or (ERE).
Building regular expressions
The simplest form of regular expression is a string composed of
characters that have been concatenated into a sequence. More
complicated regular expressions generalize this by
substituting one or more characters in a string with an operator and
character sequence; such a regular expression will correspond to a larger
number of matching strings.
A complication arises because the operators are themselves composed
of characters; these are termed special characters.
If you wish to search for one of the special characters as itself,
see ``Searching for special characters''.
A BRE or ERE matches a string of zero or
greater length where the characters in the string correspond to
the pattern. The search begins at the start of a string and
ends when the first matching sequence is found. If the expression
allows there to be a variable number of characters in the matched
string, the longest leftmost matching string is found.
Anchoring on the start of a line
The caret operator ``^'' is treated as an anchor
when it occurs as the first character of a regular expression, or a
BRE subexpression; the remainder of the expression only
matches strings that occur at the start of a line.
Caret is treated as the literal caret character elsewhere in
BREs, and always as an anchor in EREs.
Note that caret is also used to begin non-matching lists in bracket
expressions (see ``Matching one from a set of characters'').
For example, the regular expression ``^début'' matches the
first string ``début'' on a
line that begins with ``débutant''; it would not
match on the line that begins as ``au début''.
Anchoring on the end of a line
The dollar operator ``$'' is treated as an anchor
when it occurs at the last character of a regular expression or a
BRE subexpression;
the preceding part of the expression only matches strings
that occur at the end of a line.
Dollar is treated as the literal dollar character elsewhere in
BREs, and always as an anchor in EREs.
For example, the regular expression ``fin$'' matches
``fin'' on a
line that ends with the string ``le fin''; it would not
match on the line that ends in ``finale''.
Anchoring on both the start and end of a line
If both the caret and dollar anchors are used, the regular
expression must match a complete line.
For example, the regular expression ``^begin middle end$''
matches the complete line ``begin middle end''; it would
not match the line ``begin middling end''.
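The three anchoring cases above can be tried with grep, which
interprets its pattern as a BRE by default. This is a minimal sketch
rather than part of the original page: the helper name count_lines is
hypothetical, and the accented example is replaced by plain ASCII
``debut'' so that the result does not depend on the locale.

```shell
# count_lines is a hypothetical helper: it prints how many input lines
# contain a match for the BRE given as its first argument.
count_lines() {
    grep -c -- "$1"
}

# Start anchor: only the line that begins with "debut" matches.
start=$(printf '%s\n' 'debutant' 'au debut' | count_lines '^debut')

# End anchor: only the line that ends with "fin" matches.
end=$(printf '%s\n' 'le fin' 'finale' | count_lines 'fin$')

# Both anchors: the pattern must match the complete line.
whole=$(printf '%s\n' 'begin middle end' 'begin middling end' |
    count_lines '^begin middle end$')

echo "start=$start end=$end whole=$whole"
```

Each count should be 1, since exactly one line of each pair satisfies
the anchored pattern.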
Grouping expressions
A BRE subexpression is a regular expression enclosed
by the operators ``\('' and ``\)''.
An ERE group consists of a regular expression
enclosed by the grouping operators ``('' and ``)''.
Subexpressions and groups match whatever the enclosed
expression on its own would match. They are used to establish
the expressions on which lower precedence operators should act
(in the same way that parentheses are used in arithmetic).
Any number of subexpressions or groups may be used, and they
may be nested to any depth.
For example, the BRE ``\(dog\)matic'' matches the
string ``dogmatic''.
The equivalent ERE is ``(dog)matic''.
Referring to a previously matched subexpression (BRE)
A back-reference expression ``\#''
specifies a repetition of the string matched by the #th
subexpression in the current regular expression.
A null string (that has zero length) may be back-referenced.
A back-reference may only specify subexpressions numbered 1 through 9;
others are inaccessible.
For example, the expression ``\(more\) and \1''
matches the string ``more and more''.
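As an illustration, the back-reference example can be checked with
grep. The sketch below is an assumption-laden example, not taken from
the original page; the variable names and the second pattern are
introduced here only for demonstration.

```shell
# Only the line that repeats the captured string "more" matches.
repeated=$(printf '%s\n' 'more and more' 'more and less' |
    grep -c '\(more\) and \1')

# A back-reference to a more general subexpression: a run of lower case
# letters, a space, and then the same run of letters again.
doubled=$(printf '%s\n' 'it is is here' 'it is here' |
    grep -c '\([a-z][a-z]*\) \1')

echo "repeated=$repeated doubled=$doubled"
```

The back-reference matches the text that the subexpression actually
matched, not the subexpression's pattern, which is why ``more and
less'' is rejected.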
Matching alternate expressions (ERE)
The alternation operator ``|'' allows matching of either of
the two EREs that it separates. This operator is usually used
to match one of two alternate groups (see ``Grouping expressions'').
For example, the expression ``((auto)|(dog))matic''
matches the strings ``automatic'' and ``dogmatic''.
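A short sketch of alternation, assuming a grep that accepts the -E
option to select EREs (as specified by POSIX). The word ``dramatic''
is added here as a counter-example and is not part of the original
text.

```shell
# Two of the three lines contain "auto" or "dog" followed by "matic".
matched=$(printf '%s\n' 'automatic' 'dogmatic' 'dramatic' |
    grep -E -c '((auto)|(dog))matic')

echo "matched=$matched"
```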
Matching a single character
A character matches itself unless it forms part of an operator.
For example, the expression ``explicit'' can only match
itself.
Matching any single character
The dot operator ``.'' represents any single character
except a null character.
For example, the expression ``..plicit'' can match
``explicit'' or ``implicit''.
Matching one from a set of characters
A set of characters to be matched is specified by enclosing
a matching list or a non-matching list
in the bracket expression operators
``['' and ``]''.
A matching list is used to specify a set of alternative
characters that may be matched against a single character (or a
sequence of several characters that are treated as a single character
by the current collation sequence, see
``Matching multi-character collating elements'').
A non-matching list specifies a set of characters that may
not be matched against a single character; that is, it will match any
character except those specified. A non-matching list is indicated
by a leading caret ``^''.
If a matching list includes a right bracket ``]'', it must be
the first character in the list.
If a non-matching list includes a right bracket ``]'',
it must be the first character following the initial ``^''.
As an example of a bracket expression, ``[Qq]werty'' matches
``Qwerty'' or ``qwerty''.
The bracket expression in the regular expression ``str[^ae]ng''
uses a non-matching list to eliminate some possible matches;
``string'', ``strong'', or ``strung'' matches,
but ``strang'' and ``streng'' do not.
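The bracket expression examples can be reproduced with grep. This is
only a sketch; the upper case input ``QWERTY'' is an extra test line
introduced here to show a non-match.

```shell
# Matching list: either case of the first letter is accepted.
either=$(printf '%s\n' 'Qwerty' 'qwerty' 'QWERTY' | grep -c '[Qq]werty')

# Non-matching list: any vowel position except "a" or "e" is accepted.
kept=$(printf '%s\n' string strong strung strang streng |
    grep -c 'str[^ae]ng')

echo "either=$either kept=$kept"
```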
Matching zero or more occurrences
The asterisk operator ``*'' matches zero or greater
occurrences of the previous single character (including bracket
expressions), BRE subexpression,
ERE grouping, or BRE back-reference.
An asterisk is treated as itself if it occurs inside a bracket expression,
as the first character of the
regular expression (after any initial ``^'' anchor), as the first
character of a BRE subexpression (after any initial anchor),
as the first character of an ERE group,
or if it is preceded by a single backslash ``\''.
For example, the expression ``a*'' matches ``aaaa''
in the string ``aaaaba'', and it matches the null string
in ``bbbbcb''.
The pattern ``[ab]*'' matches ``aaab''
in ``daaabcbbb''.
Note that ``.*'' matches any string of characters, so
``sub.*'' would match ``subway'',
``submarine'', ``subjunctive'', and
``subsidy''.
The BRE ``\(a.\)cad\1'' matches
``abracadabra''; it also matches ``cad'' in the
string ``academic'' since the first subexpression may be
matched by a null string.
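One way to see which part of a string the asterisk operator selects is
to mark the match with sed, which uses BREs in its substitute command.
The sketch below is an illustration only: it relies on sed replacing
just the first match on each line and on ``&'' standing for the
matched text, so the brackets in the output surround the leftmost
longest match.

```shell
# "a*" takes the longest run of "a" at the leftmost position.
run=$(printf '%s\n' 'aaaaba' | sed 's/a*/[&]/')

# With no "a" at the start, "a*" matches the null string there.
null=$(printf '%s\n' 'bbbbcb' | sed 's/a*/[&]/')

# ".*" extends the match to the end of the line.
rest=$(printf '%s\n' 'subway' | sed 's/sub.*/[&]/')

echo "$run $null $rest"
```

The three results are ``[aaaa]ba'', ``[]bbbbcb'', and ``[subway]''.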
Matching one or more occurrences (ERE)
The plus sign operator ``+'' matches one or greater
occurrences of the previous single character, or grouping.
A plus sign is treated as itself if it
occurs inside a bracket expression, or if it is preceded by a
single backslash ``\''.
For example, the expression ``a+'' matches ``aaaa''
in the string ``aaaaba'', but it does not match
``bbbbcb''.
The pattern ``e+p'' matches ``eep'' in
``sleep'', and ``ep'' in ``step''.
Matching zero or one occurrences (ERE)
The question mark operator ``?'' matches zero or one
occurrences of the previous single character, or grouping.
A question mark is treated as itself if it
occurs inside a bracket expression, or if it is preceded by a
single backslash ``\''.
For example, ``l?i'' matches ``li'' in ``slip'',
and ``i'' in ``sip''.
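The plus sign and question mark examples can be marked in the same
way. This sketch assumes a sed that accepts the -E option to select
EREs, as GNU and BSD sed do; a sed without that option cannot run it.

```shell
# One or more "e" followed by "p".
sleep_m=$(printf '%s\n' 'sleep' | sed -E 's/e+p/[&]/')
step_m=$(printf '%s\n' 'step' | sed -E 's/e+p/[&]/')

# An optional "l" followed by "i".
slip_m=$(printf '%s\n' 'slip' | sed -E 's/l?i/[&]/')
sip_m=$(printf '%s\n' 'sip' | sed -E 's/l?i/[&]/')

echo "$sleep_m $step_m $slip_m $sip_m"
```

The brackets in the output surround the matched text:
``sl[eep]'', ``st[ep]'', ``s[li]p'', and ``s[i]p''.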
Matching a specified number of occurrences
A BRE interval expression has the syntax
``\{l\}'',
``\{l,\}'', or
``\{l,u\}''.
An ERE interval expression has the syntax
``{l}'',
``{l,}'', or
``{l,u}''.
An interval expression matches at least l, and
at most u occurrences of the previous single character,
subexpression, or back-reference. The lower limit, l, may
not be less than 0. If l is specified without a
trailing comma, exactly l occurrences are matched.
The upper limit, u, must be less than or equal to 255 if
it is specified. If u is omitted, but the comma is not,
the upper limit on the number of occurrences is effectively infinite.
For example, the BRE expression ``e\{2\}''
matches ``ee'' in ``eleemosynary'', but finds no
match in ``elementary''. The equivalent ERE
expression is ``e{2}''.
The ERE expression ``(is{2}){2}'' matches
``ississ'' in ``Mississippi''.
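The interval examples can be checked by marking the match with sed.
This is a sketch under two assumptions: sed uses BREs by default, and
the -E option (available in GNU and BSD sed) selects EREs.

```shell
# BRE interval: exactly two consecutive "e" characters.
two_e=$(printf '%s\n' 'eleemosynary' | sed 's/e\{2\}/[&]/')

# No "ee" occurs here, so the line is left unchanged.
none=$(printf '%s\n' 'elementary' | sed 's/e\{2\}/[&]/')

# ERE interval applied to a group: "iss" twice in succession.
group=$(printf '%s\n' 'Mississippi' | sed -E 's/(is{2}){2}/[&]/')

echo "$two_e $none $group"
```

The results are ``el[ee]mosynary'', ``elementary'', and
``M[ississ]ippi''.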
Matching multi-character collating elements
Multi-character elements from a collation sequence are enclosed
in collating symbol operators ``[.'' and ``.]'';
these must in turn be part of a bracket expression.
Single-character elements are treated as themselves if
specified as collating symbols; this is useful for representing characters
such as dash ``-'' that have special meaning inside bracket
expressions (see ``Specifying ranges of characters'').
For example, the expression ``[[.ij.]]'' matches only the
collating element ``ij'' corresponding to the collating
symbol <ij> in the current collation sequence; it is not the same
as the bracket expression ``[ij]'' that matches ``i'' or
``j''.
Matching equivalent characters
Equivalent characters from a collation sequence may be matched by
placing one of the elements of the equivalence class in
equivalence class operators ``[='' and
``=]''; if the enclosed element is not from an equivalence
class, the expression is interpreted as a collating symbol.
Note that only primary equivalence classes are recognized.
For example, if the characters ``e'', ``è'',
``é'', ``ê'', and ``ë''
are equivalent, then
the bracket expression ``[d[=e=]f]'' is the same as
``[deèéêëf]''.
Matching classes of characters
Character classes are sets of characters as defined in the
LC_CTYPE category in the current locale. Characters
from a class may be matched by enclosing the class name in
the operators ``[:'' and ``:]''.
The following character classes are supported:
alnum
letters and numeric digits;
in the POSIX locale
this includes alpha and digit
alpha
letters;
in the POSIX locale
this includes upper and lower
blank
blank characters;
in the POSIX locale
this consists of space and tab only
cntrl
control characters
digit
numeric digits;
in the POSIX locale
this consists of:
0 1 2 3 4 5 6 7 8 9
graph
printable characters not including the space character
lower
lower case letters;
in the POSIX locale
this consists of:
a b c d e f g h i j k l m n o p q r s t u v w x y z
print
printable characters including the space character
punct
punctuation characters
space
whitespace characters;
in the POSIX locale
this consists of space, form feed, newline, carriage return, tab,
and vertical tab.
upper
upper case letters;
in the POSIX locale
this consists of:
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
xdigit
hexadecimal digits;
in the POSIX locale
this consists of:
0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f
name
recognized if the name keyword has been defined as a
charclass in the current locale.
For example, in the POSIX locale, the bracket expression
``[^[:alnum:]]'' matches on a non-alphanumeric character.
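A brief sketch of character classes in use. LC_ALL=C is set for each
command so that the POSIX locale definitions listed above apply; the
input strings are invented for the example.

```shell
# Replace every non-alphanumeric character with an underscore.
cleaned=$(printf '%s\n' 'a-b c.d' | LC_ALL=C sed 's/[^[:alnum:]]/_/g')

# Count lines made of one upper case letter followed only by
# lower case letters.
capitalized=$(printf '%s\n' 'Hello' 'hello' 'HELLO' |
    LC_ALL=C grep -c '^[[:upper:]][[:lower:]]*$')

echo "$cleaned $capitalized"
```

Note that a class name such as [:alnum:] is valid only inside a
bracket expression, which is why the brackets appear doubled.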
Specifying ranges of characters
A range of characters is indicated as a starting point and an
ending point from a collation sequence separated by a dash ``-''.
The starting point must occur before the ending point in the
collation sequence.
Only collating elements or collating symbols
may be used for starting and ending points.
For example, ``[[=a=]-z]'' is invalid, but
``[[.sz.]-z]'' is valid if <sz> occurs earlier than ``z''
in the current collation sequence.
The ending point of one range cannot be used as the starting
point of another range. For instance, in the POSIX
locale, ``[a-mn-z]'' is allowed, but ``[a-m-z]''
is interpreted as ``[a-m[.-.]z]''.
A range must specify both a starting and an ending point;
otherwise, the dash ``-'' character is treated as itself.
``[-dot]'' and ``[dot-]'' match any character from
``d'', ``o'', ``t'', and ``-'';
``[^-dot]'' and ``[^dot-]'' match any character but
these.
To specify a dash character as the start of a range, specify it
as the first character in a matching list, after the initial caret
``^'' in a non-matching list, or enclose it in collation
symbol operators:
[.-.]
For example, in the POSIX locale,
the expression ``[^[:digit:][.-.]-/]'' matches any
character but the numeric digits and the symbols ``-'',
``.'', and ``/''.
The expression ``[[.-.]-a]'' or ``[--a]'' matches
characters in the range ``-'' to ``a'';
``[!--]'' or ``[!-[.-.]]'' matches characters
from ``!'' to ``-''.
Note that using range expressions within applications may make
them non-portable; the collation sequence may differ for
locales in a way that will influence the order of execution or cause
errors.
For example, in the POSIX locale, the expression
``[0-9]'' is identical to ``[0123456789]'' (and to
``[[:digit:]]'').
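The treatment of the dash can be observed with sed in the POSIX
locale. The following is a sketch with an invented input string; it
avoids collating symbols because support for them varies between
implementations.

```shell
in='a-d.o/t'

# A leading or trailing dash is an ordinary list member.
first=$(printf '%s\n' "$in" | LC_ALL=C sed 's/[-dot]/X/g')
last=$(printf '%s\n' "$in" | LC_ALL=C sed 's/[dot-]/X/g')

# The non-matching form replaces every other character instead.
others=$(printf '%s\n' "$in" | LC_ALL=C sed 's/[^-dot]/X/g')

# In the POSIX locale the range and the class select the same digits.
range=$(printf '%s\n' 'tel 555' | LC_ALL=C sed 's/[0-9]/N/g')
class=$(printf '%s\n' 'tel 555' | LC_ALL=C sed 's/[[:digit:]]/N/g')

echo "$first $last $others $range $class"
```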
Searching for special characters
The BRE special characters are:
. [ \ * ^ $
These characters are interpreted as themselves if they are used
outside the context in which they are operators. You can force
any of them to be interpreted as the character itself by
preceding it with a backslash ``\''.
Note that the characters ( ) { } and the digits ``1''
through ``9'' have special meaning in BREs if they are
preceded by a backslash.
The ERE special characters are:
. [ \ ( ) * + ? { | ^ $
These characters are interpreted as themselves if they are used
outside the context in which they are operators. You can force
any of them to be interpreted as the character itself by
preceding it with a backslash ``\''.
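The effect of escaping a special character can be seen by comparing
an escaped and an unescaped pattern on the same input. This is a
sketch with invented input lines; grep -E is assumed to select EREs.

```shell
# BRE: an unescaped dot matches any character, an escaped dot only ".".
any=$(printf '%s\n' 'a.b' 'axb' | grep -c 'a.b')
dot=$(printf '%s\n' 'a.b' 'axb' | grep -c 'a\.b')

# ERE: "+" is an operator unless it is escaped.
oper=$(printf '%s\n' 'a+b' 'aab' | grep -E 'a+b')
plus=$(printf '%s\n' 'a+b' 'aab' | grep -E 'a\+b')

echo "any=$any dot=$dot oper=$oper plus=$plus"
```

The unescaped ERE selects only ``aab'' (one or more ``a'' followed by
``b''), while the escaped form selects only the literal ``a+b''.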
BRE precedence
This table shows the precedence of expressions in
BREs from highest to lowest:
Expression type                                        BRE operators
equivalence class, character class, collation symbol   [==] [::] [..]
escaped special characters                             \character
bracket expressions                                    []
subexpressions, back-references                        \(\) \#
zero or more occurrences, interval expression          * \{l,u\}
expression concatenation
start, end anchoring                                   ^ $
ERE precedence
This table shows the precedence of expressions in
EREs from highest to lowest:
Expression type                                        ERE operators
equivalence class, character class, collation symbol   [==] [::] [..]
escaped special characters                             \character
bracket expressions                                    []
grouping                                               ()
zero or more, one or more, zero or one
  occurrences, interval expression                     * + ? {l,u}
expression concatenation
start, end anchoring                                   ^ $
alternation                                            |
Shell regular expressions
Shell regular expressions (case and file patterns)
differ in several respects from regular expressions.
A case pattern is matched with the value of a shell variable.
Case patterns are used by case control statements in
the shells.
A file pattern is replaced by an alphabetically sorted list of
filenames that match it.
File patterns are used by the shells and
find
to generate a list of matching filenames.
The vi
family of editors (
ex,
edit(C),
vi,
view(C),
and
vedit(C))
also use file patterns for matching filenames when reading and
writing files. Otherwise, they use BREs for pattern
matching.
The following case and file pattern operators are all available
in ksh, and the vi editors.
csh, find, and sh allow the use
of a more restricted set. (The utilities that you can use the
patterns with are shown in parentheses;
vi represents the vi family of editors.)
?
The equivalent of the regular expression operator ``.'';
it stands for any single character within a filename. The shells
prevent this operator matching ``.'' at the start of a filename
(to conceal hidden files). For example, ls foo.?
lists all the files with single character file extensions
and whose names begin with ``foo.''. The commands
ls ?profile and ls $HOME/?profile will
not list the hidden file .profile even if it exists.
(Available in csh, find,
ksh, sh, and vi.)
\character
A special character (normally used to
represent an operator) must be prefixed by a backslash ``\''
(escaped) if it is to stand for itself.
The backslash is used to escape the following special characters
(shell metacharacters) outside a bracket expression:
* ? ( | & ) [
Only the special character ``]'' must be escaped inside a
bracket expression.
(Available in csh, find,
ksh, sh, and vi.)
[]
The bracket expression is the same as for regular expressions,
and supports ranges.
For example, ls [0-9] lists all files that begin
with a digit from ``0'' to ``9''. If a range is given
in an order that does not correspond to the current collation
sequence, the entire bracket expression is treated as a literal
string to be searched for.
(Available in csh, find,
ksh, sh, and vi.)
[!]
The operator ! inside a bracket expression is the equivalent of
the regular expression caret operator in forming non-matching lists.
For example, ls [!A-Za-z] lists all files that do
not begin with an upper or lower case alphabetic character.
Note that the shell suppresses the listing of hidden files
(that have filenames starting with ``.'')
even though they are non-matching.
(Available in csh, find,
ksh, sh, and vi.)
[..] [==] [::]
The operators [..] for collating symbol,
[==] for equivalence class, and [::] for character class
expressions may be used within bracket expressions only.
(Available in find, ksh,
and vi.)
*
This is the equivalent of the regular expression ``.*'';
it stands for a string of any characters.
The shells prevent matching with filenames
that start with a dot ``.''
(to conceal hidden files).
For example, ls *.c
lists all files in the current directory
that have the extension ``.c'',
but it would not list a file named .foo.c.
(Available in csh, find,
ksh, sh, and vi.)
@(pattern[|pattern]...)
Match exactly one occurrence of a group.
This operator acts on groups only and must prefix the
group. For example, ``@(abc)'' matches one
occurrence of the string ``abc''.
(Available in ksh, and vi.)
*(pattern[|pattern]...)
Match zero or more of the given patterns.
For example, ``*(xyz)'' matches
zero or more occurrences of the string ``xyz''.
(Available in ksh, and vi.)
+(pattern[|pattern]...)
Match one or more of the given patterns.
For example, ``+(abc)'' matches one or more
occurrences of the string ``abc''.
(Available in ksh, and vi.)
?(pattern[|pattern]...)
Match zero or one of the given patterns.
For example, ``?(def)'' matches zero or one
occurrences of the string ``def''.
(Available in ksh, and vi.)
!(pattern[|pattern]...)
Match anything except the given patterns (zero occurrences of the group).
For example, ``!(xyz)''
matches if the string ``xyz'' does not occur.
(Available in ksh, and vi.)
{expr1,expr2[,expr3]...}
Generate a pattern for each expression listed
(expr1, expr2 and so on). This is equivalent
to a combination of the ERE group and alternation
operators. For example, the expression ``a{b,c,l}e'' is
equivalent to the ERE ``a((b)|(c)|(l))e''.
In the Korn shell, the command ls a{b,c,l}e would be
expanded as ls abe ace ale.
(Available in csh, ksh, and vi.)
|
Equivalent to the ERE alternation operator.
(Available in ksh, and vi.)
&
The conjunction operator is equivalent to regular expression
concatenation.
(Available in ksh, and vi.)
Case and file patterns do not recognize:
the caret operator ``^'' for beginning non-matching lists
BRE back-references
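The operators that every shell supports can be tried with a case
statement and with filename generation. The following is a sketch
only: kind is a hypothetical helper, the file names are invented, and
mktemp -d is assumed to be available for creating a scratch directory.

```shell
# kind is a hypothetical helper that classifies a name with case patterns.
# The first pattern that matches wins.
kind() {
    case $1 in
        *.c)        echo 'C source' ;;
        foo.?)      echo 'foo, one-character extension' ;;
        [0-9]*)     echo 'starts with a digit' ;;
        [!A-Za-z]*) echo 'starts with a non-letter' ;;
        *)          echo 'other' ;;
    esac
}

kind main.c
kind foo.h
kind 9lives
kind _tmp
kind README

# File patterns: use a scratch directory so that the result is predictable.
dir=$(mktemp -d)
touch "$dir/foo.c" "$dir/foo.h" "$dir/foo.txt" "$dir/xprofile" "$dir/.profile"

# "?" stands for exactly one character, so foo.txt is not selected.
short=$(cd "$dir" && echo foo.?)

# "?" does not match the leading dot of the hidden file .profile.
hidden=$(cd "$dir" && echo ?profile)

rm -rf "$dir"
echo "$short / $hidden"
```

The file patterns expand to ``foo.c foo.h'' and ``xprofile''; the
hidden file is not listed even though it exists.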
Limitations
Shell regular expressions differ from regular expressions;
see ``Shell regular expressions'' here and in the Operating System User's Guide for further details.
No more than the first 9 sub-expressions may be back-referenced
within a regular expression.
Interval expressions may not specify an upper limit greater than 255;
if not specified, the limit is effectively infinite.
Note that a collation sequence is not necessarily equivalent
to a collation order. See
localedef(F)
for more details.