awk is an interpreted pattern-matching language with a wide range of applications. See Chapter 13, ``Using awk'' in the SCO OpenServer Operating System User's Guide for a complete discussion of its use.
You can enter an awk program '<progtext>' directly from the command-line, enclosing it in single quotes to prevent interpretation by the shell.
You can specify multiple -e programs and -f files and can use both forms together; all of the specified programs will be concatenated together (with intervening newlines) to form the program that is executed. This is similar to the -e and -f options in sed(C).
awk -We srcfile [arg ...]
awk -f srcfile -- [arg ...]
awk, like many utilities, allows the use of "--" to terminate the option list.
-We is mostly used in an executable awk program file that uses the #! mechanism documented on the exec(M) manual page. For awk, the #! line typically looks like this:
#!/usr/bin/awk -f
Such a line passes the following arguments to awk:
-f srcfile [args ...]
where [args ...] are any arguments with which the program is invoked. If only filenames are passed, this works properly. However, if the first argument begins with '-', it appears to awk as though it is an argument that should be interpreted according to awk's command line syntax. This prevents an awk program from using POSIX-style options, which are introduced with '-'.
The -We construct avoids these problems. If a program begins with:
#!/usr/bin/awk -We
the following is passed to awk:
-We srcfile [arg ...]
which is equivalent to:
awk -f srcfile -- [arg ...]
The -We version is used because the #! mechanism does not allow for the latter syntax. Because of the implicit "--", awk does not attempt to interpret any arguments as options to itself.
For example, passing a -q option to "#!/usr/bin/awk -f" will abort with an error. But passing -q to the following runs correctly:
#!/usr/bin/awk -We
With the -We option, -q is stored in ARGV[1], where it can be interpreted by the awk program as required.
-Wexec is a synonym for -We.
The remaining command-line options to awk are:
awk -F t is a special case that sets the field separator to a tab. (The field separator can also be changed within an awk program using the variable FS.)
Variable assignments that follow a filename take place after that file has been processed and before the next specified file is read.
An assignment placed before the first file argument is processed after the BEGIN actions. An assignment placed after the final file argument is processed before the END actions.
If there are no input files, awk executes the assignments before processing the standard input.
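The timing of command-line assignments can be checked with a small experiment. The sketch below is only an illustration: the temporary files, their contents, and the variable name tag are invented for the example.

```sh
# Hypothetical data: two one-line files in a temporary directory.
dir=$(mktemp -d)
printf 'alpha\n' > "$dir/f1"
printf 'beta\n'  > "$dir/f2"

# tag is still unset in BEGIN, is set before each file is read,
# and the trailing assignment takes effect before END.
out=$(awk 'BEGIN { print "begin:" tag }
           { print tag ":" $0 }
           END { print "end:" tag }' tag=one "$dir/f1" tag=two "$dir/f2" tag=three)
rm -rf "$dir"
printf '%s\n' "$out"
```

With the assignment rules described above, the BEGIN line should show an empty value, each record should carry the value assigned just before its file, and the END line should show the last assignment.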
A pattern-action statement has the form:
pattern { action }
Either pattern or action may be omitted. If there is no action with a pattern, the matching line is printed. If there is no pattern with an action, the action is performed on every input line.
The opening brace ``{'' must be on the same line as the pattern for which the actions should be performed. Multiple action statements may appear on a single line if they are separated by semicolons ``;''.
A newline can be hidden with a backslash ``\'', so you can use backslash-newline to continue a long line.
Comments in awk are introduced by a number sign ``#'' and end with the end of the line. Comments can appear anywhere in a line.
Blank lines and whitespace (blanks and tabs) in an awk program are ignored.
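As a brief sketch of the pattern-action rules above, the following runs one program that has only a pattern and one that has only an action. The input lines are made up for the example.

```sh
# Pattern only: each matching line is printed.
p=$(printf 'apple\nbanana\ncherry\n' | awk '/an/')

# Action only: performed on every input line. Two statements share a
# line by means of ";", and a comment closes the program.
a=$(printf 'apple\nbanana\n' | awk '{ n++; print n ": " $0 }  # number the lines')

printf '%s\n%s\n' "$p" "$a"
```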
You can change the field separator on the command line, as discussed earlier, using the -F field_sep option. You can also reset the value of the input field separator variable FS from within your awk program. FS can be set to any regular expression. The following action is a special case that resets FS to its default behavior:
BEGIN { FS = " " }
The BEGIN in this example is a special pattern that matches before the first record is read; this is the mechanism awk provides for doing introductory processing.
Setting FS to a single blank is equivalent to:
BEGIN { FS = "[ \t]+" }
That is, setting FS to a single blank tells awk to regard any combination of blanks and tabs (any whitespace) as a field separator. Note that once you set the input field separator to something other than a single blank (that is, to all whitespace), leading whitespace (before the first field) is no longer ignored.
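The note about leading whitespace is easiest to see by counting fields. The following is a small sketch with made-up input: the same line, which starts with two blanks, is split first with the default separator and then with an explicit regular expression. A third command sets FS to a tab only.

```sh
line='  a b'   # sample input with two leading blanks

# Default FS: leading whitespace is ignored, so there are two fields.
default_nf=$(printf '%s\n' "$line" | awk '{ print NF }')

# Explicit regular expression: the leading blanks now delimit an
# empty first field, so there are three fields.
regex_nf=$(printf '%s\n' "$line" | awk 'BEGIN { FS = "[ \t]+" } { print NF }')

# FS set to a tab: the blank inside "x y" no longer separates fields.
tab_f1=$(printf 'x y\tz\n' | awk 'BEGIN { FS = "\t" } { print $1 }')

printf '%s %s %s\n' "$default_nf" "$regex_nf" "$tab_f1"
```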
awk is designed to consider each line of input as a complete record, but you can get awk to recognize multiline records by resetting the variable RS.
To get awk to recognize multiline records, set RS to the null string:
BEGIN { RS = "" }
Now, awk will presume that records are separated by one or more blank lines. When you reset RS like this to use multiline records, newline is always considered a field separator, no matter what the value of FS is. To restore the default record separator, reset RS to a newline:
{ RS = "\n" }
You can address any field in the input record using the syntax $1, $2, etc., where $1 is the first field in a record, $2 is the second field, and so on. The entire record is referred to as $0.
Fields can also be referred to in relation to the built-in field variables, for example, for a five-field record:
$(NF - 2)
would refer to the third field. The NF in this example is a built-in variable awk provides that counts the number of fields in a current record. (Thus, $NF refers to the last field in the current record.)
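The two ideas above, fields addressed relative to NF and multiline records selected with an empty RS, can be tried with a couple of one-line programs. This is a sketch only, and the input data is invented.

```sh
# A five-field record: $(NF - 2) is the third field, $NF the last.
rel=$(echo 'a b c d e' | awk '{ print $(NF - 2), $NF }')

# Two records separated by a blank line. Inside a record the newline
# acts as a field separator, so the first record has three fields.
recs=$(printf 'x y\nz\n\nw\n' | awk 'BEGIN { RS = "" } { print NR, NF, $NF }')

printf '%s\n%s\n' "$rel" "$recs"
```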
The following list shows all the built-in variables in awk:
Variable | Meaning |
---|---|
ARGC | number of command-line arguments plus 1 |
ARGV | array of command-line arguments (ARGV[0 ... ARGC-1]) |
CONVFMT | format for converting numbers to strings (default: "%.6g"; see printf(S)) |
ENVIRON | array of environment variables, indexed by the name of the variable |
FILENAME | name of current input file |
FNR | input record number in current file |
FS | input field separator (default: any whitespace) |
NF | number of fields in current input record |
NR | number of records read so far |
OFMT | output format for numbers (default: "%.6g"; see printf(S)) (Used by print.) |
OFS | output field separator (default: blank) |
ORS | output record separator (default: newline) |
RLENGTH | length of string matched by match |
RS | input record separator (default: newline) |
RSTART | index of first character matched by match |
SUBSEP | separates multiple subscripts in array elements (default: "\034") |
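To see how several of these variables change while input is read, the sketch below processes two small files and prints FILENAME, NR, FNR and NF for every record. The file names one and two and their contents are hypothetical.

```sh
dir=$(mktemp -d)
printf 'a b\nc\n' > "$dir/one"
printf 'd e f\n'  > "$dir/two"

# NR keeps counting across files; FNR restarts with each new file.
vars=$(cd "$dir" && awk '{ print FILENAME, NR, FNR, NF }' one two)
rm -rf "$dir"
printf '%s\n' "$vars"
```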
BEGIN and END match before the first line is read, and after the last line has been read, respectively.
All other patterns can contain extended regular expressions. See regexp(M) for the pattern-matching syntax of extended regular expressions. (In the following discussion, extended regular expressions will be referred to simply as regular expressions.)
You can create a string matching pattern using a regular expression in one of three ways:
Operator | Meaning |
---|---|
< | less than |
<= | less than or equal to |
> | greater than |
>= | greater than or equal to |
== | equal to |
!= | not equal to |
awk performs the comparison numerically if both operands are numeric, or if one is numeric and the other is a string with a numeric value. Otherwise, both arguments are converted to strings and compared character by character according to the sort (collation) order of the current locale (as determined by the environment variables LANG and LC_COLLATE). One string is less than another if it would appear earlier in the collation order.
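The difference between numeric and string comparison shows up when the same two fields are compared both ways. The following sketch uses made-up input and sets LC_ALL=C so that the collation order is predictable. Concatenating an empty string is used here to force a string comparison.

```sh
cmp=$(echo '10 9' | LC_ALL=C awk '{
    num = ($1 < $2)          # both fields are numeric strings: 10 < 9 is false
    str = ($1 "" < $2 "")    # compared as strings: "10" collates before "9"
    print num, str
}')
printf '%s\n' "$cmp"
```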
Patterns can be joined using the logical operators && (AND) and || (OR). When patterns are joined like this, the pattern matches the current record if the entire pattern evaluates to true (non-zero or non-null). A pattern can be negated using the ! logical NOT operator. Parentheses may be used for grouping patterns.
The following pattern-matching expressions are available:
Action statements can be made up of:
Values are assigned to variables in the usual way in awk. For example:
a = 100
creates a numeric variable a with the value ``100''. You can assign several variables in a single statement. For example:
water = oil = "wet"
creates two string variables, water and oil, and sets them both to contain the string ``wet''.
Assignment operators are evaluated from right to left.
The following assignment operators are available; the shorthand assignment notation is borrowed from the C programming language:
Operator | Meaning |
---|---|
a=b | set a equal to b |
a+=b | set a equal to a + b |
a-=b | set a equal to a - b |
a*=b | set a equal to a * b |
a/=b | set a equal to a / b |
a%=b | set a equal to a % b; a becomes the remainder of a divided by b |
a^=b | set a equal to a ^ b; a becomes raised to the power b |
awk offers the usual arithmetic operators: ``+'' (add), ``-'' (subtract), ``*'' (multiply), ``/'' (divide), ``%'' (modulo; divide and give remainder), ``^'' (exponentiation; ``**'' is a synonym). The unary ``+'' (plus) and ``-'' (minus) are also available.
All arithmetic in awk is done in floating point.
Relational expressions in action statements use the same operators as relational expressions in patterns; consult the relational operators table in ``Patterns'' above.
The logical AND and logical OR (&& and ||) are also available, as well as the logical NOT (!, as in !expr).
There is also a conditional operator: ``?'':
expression1 ? expression2 : expression3
expression1 is evaluated, and if it is non-empty and non-zero, then the whole expression has the value of expression2. Otherwise, it has the value of expression3.
Variables can be incremented using prefix or postfix notation, as in C. x++ and ++x are both equivalent to x = x + 1, and both x-- and --x are equivalent to x = x-1. The difference between prefix (++x) and postfix (x++) is when x assumes its new value. In prefix notation, x is immediately incremented; in postfix notation, the current value of x is used and then x is incremented.
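A short sketch of postfix versus prefix increment, together with the conditional operator described earlier, follows. The variable names are arbitrary.

```sh
incr=$(awk 'BEGIN {
    x = 5
    a = x++        # a receives the old value 5, then x becomes 6
    b = ++x        # x becomes 7 first, then b receives 7
    kind = (x % 2 ? "odd" : "even")
    print a, b, x, kind
}')
printf '%s\n' "$incr"
```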
Parentheses can be used to alter the order of evaluation in arithmetic and relational expressions.
The following table of precedence shows all the available action statement operators and the order in which they are evaluated. The table is in decreasing order of precedence; operators higher in the table are evaluated before operators lower in the table.
Operator | Meaning |
---|---|
$ | field |
++ -- | increment, decrement (prefix and postfix) |
^ | exponentiation (** is a synonym) |
! | logical negation |
+ - | unary plus, unary minus |
* / % | multiply, divide, mod |
+ - | add, subtract |
(no explicit operator) | string concatenation |
< <= > >= != == | relationals |
~ !~ | regular expression match, negated match |
in | array membership |
&& | logical AND |
|| | logical OR |
?: | conditional expression |
= += -= *= /= %= ^= | assignment |
All of these operators are evaluated from left to right (they are left-associative), except for the assignment operators, the conditional expression operator, and exponentiation, which are evaluated from right to left (they are right-associative).
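The precedence and associativity rules can be spot-checked with a few constant expressions. This is an illustrative sketch; the expected values in the comments follow from the table above.

```sh
prec=$(awk 'BEGIN {
    a = 2 ^ 3 ^ 2     # right-associative: 2 ^ (3 ^ 2) = 512
    b = 10 - 4 - 3    # left-associative: (10 - 4) - 3 = 3
    c = -2 ^ 2        # ^ binds tighter than unary minus: -(2 ^ 2) = -4
    d = 1 + 2 * 3     # * binds tighter than +: 7
    print a, b, c, d
}')
printf '%s\n' "$prec"
```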
awk allows you to use strings as array subscripts; arrays that do this are called associative arrays. This lets you group together data quite simply.
For example, a data file lists employee names, department names, and the number of sick days the employee has taken:
Steve Engineering 2
Chris Engineering 1
Susannah Documentation 0
Vipin Sales 2
Connie Marketing 3
Matt Documentation 1
Nancy Sales 1
Nigel Documentation 0
The first field, $1, contains the employee name; the second field, $2, contains the department, and the third field, $3, contains the number of sick days for that employee.
To accumulate the number of sick days in each department:
{ sickness[$2] += $3 }
This creates the array sickness, which uses the values in the second field (``Engineering'', ``Documentation'', ``Sales'', and ``Marketing'') as its subscripts. The sick day totals in the third field are then collected under the appropriate subscript.
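A complete run over the sample data might look like the following sketch. The END rule that prints the totals and the use of sort are additions for the example; sort is used because the order in which a for-in loop visits subscripts is not defined.

```sh
totals=$(printf '%s\n' \
    'Steve Engineering 2' 'Chris Engineering 1' 'Susannah Documentation 0' \
    'Vipin Sales 2' 'Connie Marketing 3' 'Matt Documentation 1' \
    'Nancy Sales 1' 'Nigel Documentation 0' |
  awk '{ sickness[$2] += $3 }
       END { for (dept in sickness) print dept, sickness[dept] }' |
  LC_ALL=C sort)
printf '%s\n' "$totals"
```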
awk does not support multi-dimensional arrays, but this can be simulated by using a list of subscripts; see Chapter 13, ``Using awk'' in the SCO OpenServer Operating System User's Guide for details.
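One common way to simulate a two-dimensional array with a subscript list is sketched below. The array name grid is hypothetical, and the sketch assumes the standard split function is available to take a combined subscript apart again.

```sh
multi=$(awk 'BEGIN {
    grid[1, 2] = "x"                 # stored under the key "1" SUBSEP "2"
    for (key in grid) {
        split(key, idx, SUBSEP)      # recover the two subscripts
        print idx[1], idx[2], grid[key]
    }
    if ((1, 2) in grid)
        print "present"
}')
printf '%s\n' "$multi"
```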
Each statement in a statement list should begin on a new line or after a semicolon.
The following constructs are available:
All three expressions are optional. This is often used to
go through a loop based on the value of a counter, where
expression1 is used to initialize a counter;
expression is the test; and expression2
increments the counter. While expression is non-empty and
non-zero, statement is executed.
do statement while (expression)
statement is repeatedly executed until expression becomes null or 0.
exit will go straight to the END statements, if
there are any. If exit occurs in an END
statement, the program itself exits. If a numeric expression is
given after exit, this expression is taken as the exit
status for the awk program.
print by itself is an abbreviation for print $0.
To print an empty line use:
print ""
Each successive expression replaces each formatting keyletter.
{ print $0 $2 > $3 }
This statement means ``print the record and then field 2 into a file named by field 3,'' while:
{ print $0 ($2 > $3) }
means ``print the record, followed by a 1 if field 2 is greater than field 3, or a 0 if it is not.''
printf keyletters are:
Keyletter | Prints expression as |
---|---|
%c | the ASCII character referred to by the least significant 8 bits of the numeric value of expression; truncates expression to the nearest integer. If the argument is a string, the first character of the string is used. |
%d | a decimal integer; truncates expression to the nearest integer |
%e | scientific notation using the form [-]d.dddddde[+-]dd |
%f | fixed-point decimal notation using the form [-]ddd.dddddd |
%g | the shorter of e or f conversion, with non-significant zeros suppressed |
%o | an unsigned octal number |
%s | a string |
%x | an unsigned hexadecimal number |
%% | prints a ``%'', no expression is converted |
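Several of the keyletters in the table can be exercised in a single printf call. The values below are arbitrary, and the | characters in the format string are there only to make the individual conversions easy to tell apart.

```sh
fmt=$(awk 'BEGIN {
    printf "%d|%o|%x|%c|%s|%.2f|%e|%%\n", 65.9, 8, 255, 65, "hi", 3.14159, 1234.5
}')
printf '%s\n' "$fmt"
```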
The following escape sequences are recognized within regular expressions and strings:
Escape sequence | Meaning |
---|---|
\\ | backslash |
\a | alert or bell |
\b | backspace |
\f | formfeed |
\n | newline |
\r | carriage return |
\t | horizontal tab |
\v | vertical tab |
\ddd | octal value ddd |
Backslash followed by any other character is replaced by that character.
Output can be redirected into files using:
> filename
and
>> filename
Files are opened only once using the redirection operator. The first form will overwrite whatever is in filename, if filename already exists, and will create filename if it does not exist. The second form will append output to filename.
To send output to a pipe, use:
| command-line
where command-line is the command line to which you want to send the output. Filenames and command lines can be expressions, variables, or literal filenames or command lines. If you want to use a literal filename or command line, you must enclose it in double quotes, otherwise awk will treat it as a variable.
There is a limit to how many files and pipes you can open in an
awk program (see ``Limitations'' below). Use the
close statement to close files or pipes:
close(filename)
close(command-line)
where filename or command-line is the open file or pipe.
To read input from a file until the file runs out, use:
while ( ( getline x < file ) > 0) { ... }
The ``> 0'' is needed so that the test catches a -1 error returned from getline. Otherwise, the while loop would read -1 as true, since it is non-zero.
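The following sketch puts the getline loop and close together. It counts the lines of a temporary file whose name is taken from ARGV[1], and then tries a path that is assumed not to exist to show that the ``> 0'' test stops the loop on error. The file contents and the nonexistent path are made up for the example.

```sh
tmp=$(mktemp)
printf 'one\ntwo\nthree\n' > "$tmp"

count=$(awk 'BEGIN {
    file = ARGV[1]
    while ((getline line < file) > 0)
        n++
    close(file)
    # getline returns -1 for a file that cannot be opened, so the
    # "> 0" test fails at once and the body never runs.
    while ((getline line < "/nonexistent/awk-demo") > 0)
        n++
    print n
}' "$tmp")
rm -f "$tmp"
printf '%s\n' "$count"
```

Because the program consists of a BEGIN action only, awk does not read the file named on the command line as ordinary input.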
Function | Returns |
---|---|
atan2(y,x) | arctangent of y/x in the range -pi to pi |
cos(x) | cosine of x, with x in radians |
exp(x) | exponential function of x, e^x |
int(x) | integer part of x; truncated toward 0 when x > 0 |
log(x) | natural (base e) logarithm of x |
rand() | random number r, where 0 <= r < 1 |
sin(x) | sine of x, with x in radians |
sqrt(x) | square root of x |
srand() | set the seed for rand() from the time of day |
srand(x) | x is new seed for rand() |
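A few of the arithmetic functions can be checked against values that are easy to verify by hand. This is a sketch; the seed 42 is arbitrary, and only the range of rand() is tested because the sequence itself differs between implementations.

```sh
arith=$(awk 'BEGIN {
    printf "%d %d %.4f %d %d\n", int(3.9), sqrt(16), atan2(0, -1), exp(0), log(1)
    srand(42)
    r = rand()
    ok = (r >= 0 && r < 1)    # rand() should fall in the range 0 <= r < 1
    print ok
}')
printf '%s\n' "$arith"
```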
The string functions are:
The & character has special meaning in the s substitution string: it is replaced with the text in the t string that is matched by the regular expression r. For example, if the var variable contains the string ``fob'', the following substitutions have these results:
Substitution string | Resulting value for var |
---|---|
sub (/[bo]/,"x&y",var) | fxoyb |
gsub(/[bo]/,"x&y",var) | fxoyxby |
To get a literal &, escape the & by preceding it with a backslash (\).
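The substitutions in the table, and the escaped form of &, can be reproduced with the sketch below. Note that inside an awk string constant the backslash itself must be doubled, so the literal & is written as "\\&".

```sh
subs=$(awk 'BEGIN {
    var = "fob"; sub(/[bo]/, "x&y", var);  first = var
    var = "fob"; gsub(/[bo]/, "x&y", var); all = var
    var = "fob"; n = gsub(/o/, "\\&", var)   # escaped: a literal & replaces the o
    print first, all, var, n
}')
printf '%s\n' "$subs"
```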
The & character has special meaning
in the s substitution string:
it is replaced with the text in the t string
that is matched by the regular expression r.
See the gsub description for examples.
This executes command-line and returns its exit status.
You can define your own functions in awk. The syntax for
this is:
function name(parameter-list) {
statements
}
name is the name of the function. Within the function, parameter-list is a comma-separated list of variable names which are the arguments with which the function was called: statements are action statements that make up the body of the function.
Function definitions can appear anywhere a pattern-action statement can appear. Recursion is permitted within user-defined functions; that is, a function may call itself directly or indirectly.
Variables passed to functions (as arguments) are copied, and a copy of the variable is manipulated by the function; that is, these variables are passed by value. The exception to this in awk is arrays, which are passed by reference, that is, the actual array elements are manipulated by the function, so array elements can be permanently altered, created, or deleted within a function.
Missing function arguments are set to null; extra arguments are ignored.
To define a return value for your function, you must include a
statement:
return expression
where expression is the value you want your function to return. expression here is optional; if you leave it out, control will be returned to the caller of the function, but the return value will be undefined. The return statement itself is optional as well.
The formal parameters of a function (the argument list) are local to
that function, but any other variables are global. You can use the
argument list as a way of creating variables local only to the
function; like other variables in awk these will be
automatically initialized with null values.
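The rules for user-defined functions can be seen together in a small sketch. The function names fill and bump are hypothetical. The extra spaces before i in the parameter list are a common convention for marking parameters that serve only as local variables.

```sh
funcs=$(awk '
function fill(arr, n,    i) {        # i is local: an extra, unpassed parameter
    for (i = 1; i <= n; i++)
        arr[i] = i * i               # arrays are passed by reference
}
function bump(x) { x++; return x }   # scalars are passed by value
BEGIN {
    i = 99; v = 1
    fill(squares, 3)
    w = bump(v)
    print squares[3], i, v, w        # the global i and v are unchanged
}')
printf '%s\n' "$funcs"
```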
In an assignment statement, such as v=e, the type of v becomes the type of e. When the context is ambiguous, awk determines the types when the program runs.
In comparisons, awk compares both operands as numbers if both are numeric, or if one is numeric and the other is a numeric string. Otherwise, the operands are compared as strings. (A string is greater than another string if it comes later in the sort sequence, and less than another string if it comes earlier in the sort sequence.)
All field variables are of type string; in addition, each field can be considered to have a numeric value (that is, the numeric value of a string). The numeric value of a string is the value of the longest prefix of a string that looks numeric. For example, if a field contains the string ``123abc'', the numeric value of this would be 123.
The value of a variable in awk is initially 0 or the string "".
You can force a variable of one type to become another type; this is
known as type coercion. To force a number to a string,
use:
number ""
This concatenates the null string to number.
To force a string to a number, use:
string + 0
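Both coercions can be tried on sample values, as in the sketch below. The input string comes from the example above; the number 0.1 + 0.2 is chosen because its string form is shortened by the default conversion format.

```sh
coerce=$(echo '123abc' | awk '{
    print $1 + 0          # numeric value of the longest numeric-looking prefix
    n = 0.1 + 0.2
    s = n ""              # concatenating "" yields the string form of n
    print s, length(s)
}')
printf '%s\n' "$coerce"
```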
For more information about variable types, see Chapter 13, ``Using awk'' in the SCO OpenServer Operating System User's Guide.
To print lines longer than 72 characters:
length > 72
To print only the first two fields in the opposite order:
{ print $2, $1 }
To print the same, with input fields separated by commas and/or blanks and tabs:
BEGIN { FS = ",[ \t]*|[ \t]+" }
{ print $2, $1 }
To add up the first column, and print the sum and the average:
{ s += $1 }
END { if (NR > 0) print "sum is", s, " average is", s/NR }
To print fields in reverse order (on separate lines):
{ for (i = NF; i > 0; --i) print $i }
To print all lines between start/stop pairs:
/start/, /stop/
To print all lines whose first field is different from the previous one:
$1 != prev { print; prev = $1 }
To simulate echo(C):
BEGIN {
        for (i = 1; i < ARGC; i++)
                printf "%s ", ARGV[i]
        printf "\n"
        exit
}
To simulate env(C):
BEGIN { for (e in ENVIRON) print e "=" ENVIRON[e] }
Numbers are limited to what can be represented on your machine; numbers outside this range will have string values only.
Input whitespace is not preserved on output if fields are involved.
func is an obsolete synonym for function.
The awk provided on SCO OpenServer systems is based on the so-called ``new awk'' described in the 1988 book entitled The AWK Programming Language. When it was introduced, some systems provided the ``new awk'' as nawk, and the older one as oawk. SCO OpenServer retains the nawk and oawk names, but they are both linked to awk to provide backward compatibility. We recommend that you use the awk name rather than nawk and oawk.
Known incompatibilities between the current version of awk and older awks include:
For example, the string:
123foo
does not have a numeric value in the old awk (and is treated as 0), but has the value 123 in the new awk.
{ $2 = $1; print }
produces different output if the input has only one field.
For example, in regular expressions, the character class:
[/]
is not legal in the new awk, but was in the old. The equivalent character class for the new awk is:
[\/]
However, this character class, when used with the old awk, is not equivalent to the original expression.
Chapter 13, ``Using awk'' in the SCO OpenServer Operating System User's Guide
Alfred V. Aho, Brian W. Kernighan, and Peter J. Weinberger,
The AWK Programming Language, Addison-Wesley, 1988.