Introduction to AWK
Overview
AWK is a gem of a language. It is small and simple enough to learn in about an hour, and many useful programs fit on a single line of code. AWK is included in most Linux distributions.
My next blog post is going to be about ensuring data quality with AWK, so I thought I would write this post as an introduction to the language.
The data
Most of the examples will use a “fictitious” dataset: a time series of daily rainfall and temperature data. The following is an excerpt:
1910,1,6,0.0,37.0,22.5
1910,1,7,4.1,36.3,21.6
1910,1,8,18.8,34.4,17.8
1910,1,9,0.0,-99.9,-99.9
1910,1,10,0.0,-99.9,-99.9
1910,1,11,11.4,17.1,16.2
1910,1,12,21.1,20.7,15.4
1910,1,13,0.3,22.9,16.2
1910,1,14,7.9,23.2,16.2
1910,1,15,15.5,28.1,17.5
1910,1,16,0.0,28.9,-99.9
1910,1,17,0.0,23.7,-99.9
The first column is the year followed by the month and day. The last three columns are rainfall (in mm), maximum temperature and minimum temperature (in degrees Celsius). Missing values are represented by the value -99.9.
You can download the full data set by visiting this link:
Structure of an AWK program
An AWK program consists of a list of patterns and actions. AWK will go through the input line by line and try to match each pattern against the line (I will soon explain what is meant by a pattern). If the pattern matches, it will execute the action.
So, a template for an AWK program looks like this:
pattern1 {action1}
pattern2 {action2}
pattern3 {action3}
...
The second thing to know about AWK is that it splits each line (record) that it reads into fields. By default, fields are separated by runs of whitespace (spaces and tabs), but we can specify a different separator. The first field is referred to as $1, the second field as $2 and so on. We can also refer to the whole record as $0.
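To make the template and the field variables concrete, here is a minimal sketch that runs a two-rule program over two records copied from the sample data above. The labels "dry" and "wet" and the shell variable names are made up for this illustration, and the -F, option (explained in the next section) tells AWK to split fields on commas.

```shell
# Two records copied from the sample data above.
records='1910,1,6,0.0,37.0,22.5
1910,1,7,4.1,36.3,21.6'

# Rule 1 fires when field 4 (rainfall) is zero and prints the date fields.
# Rule 2 fires when field 4 is positive and prints the whole record ($0).
labelled=$(printf '%s\n' "$records" | awk -F, '
$4 == 0 {print "dry:", $1, $2, $3}
$4 > 0  {print "wet:", $0}
')
echo "$labelled"
# dry: 1910 1 6
# wet: 1910,1,7,4.1,36.3,21.6
```

Every rule is tried against every record, so a record that matched both patterns would be printed twice.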
Our first AWK program
We will use the file acorn01.csv as our input. In our program we will print out all records for the year 1911. In other words, we want to print all records where $1 equals 1911. The pattern we will use is $1 == 1911 and the action will be print. Since this is such a simple program, we can write it all on the command line as follows:
$ awk -F, '$1 == 1911 {print}' acorn01.csv
We read this as follows:
awk
- Invoke the program awk.
-F,
- Use a comma (,) as the field separator.
$1 == 1911
- This is our pattern. When it matches, execute the action.
{print}
- Print the line. We could also have used print $0.
acorn01.csv
- This is our input file.
When we run the program we get the following output:
1911,1,1,0.0,28.9,-99.9
1911,1,2,0.0,32.3,-99.9
1911,1,3,0.0,36.6,8.8
...
1911,12,29,0.0,-99.9,9.1
1911,12,30,0.0,36.0,20.5
1911,12,31,0.0,28.6,20.1
Of course, I have left out a lot of the output. We can, however, use another Unix utility, wc, to count the lines of output. By default wc reports line, word and byte counts; by using wc -l we tell it to count only lines. We pipe the output of awk to wc.
$ awk -F, '$1 == 1911 {print}' acorn01.csv | wc -l
And this gives us:
365
which is what we expect, since 1911 was not a leap year.
Suppose we only want to see the values for January 1911. We can make our pattern more specific:
$ awk -F, '$1 == 1911 && $2 == 1 {print}' acorn01.csv
The && operator means "and": a record matches only when both conditions hold, that is, when the first field equals 1911 and the second field equals 1 (January).
When running the above command, we see:
1911,1,1,0.0,28.9,-99.9
1911,1,2,0.0,32.3,-99.9
1911,1,3,0.0,36.6,8.8
...
1911,1,29,0.0,27.9,14.7
1911,1,30,0.0,29.4,17.1
1911,1,31,2.8,31.3,16.2
If we don’t specify an action, the default action is to print the whole line whenever the pattern matches. This means that we can rewrite the above program as:
$ awk -F, '$1 == 1911 && $2 == 1' acorn01.csv
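As a quick check of the default action, the sketch below runs the same pattern with and without {print} on two records copied from the sample data at the top of the post. The year 1910 and day 17 are chosen only because they appear in that sample, and the shell variable names are hypothetical.

```shell
# Two records copied from the sample data at the top of the post.
records='1910,1,16,0.0,28.9,-99.9
1910,1,17,0.0,23.7,-99.9'

# Same pattern, once with an explicit action and once relying on the default.
with_action=$(printf '%s\n' "$records" | awk -F, '$1 == 1910 && $3 == 17 {print}')
without_action=$(printf '%s\n' "$records" | awk -F, '$1 == 1910 && $3 == 17')

echo "$with_action"      # 1910,1,17,0.0,23.7,-99.9
echo "$without_action"   # 1910,1,17,0.0,23.7,-99.9
```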
Now suppose we want to print the rainfall in 1911, but are only interested in days when the rainfall is above 20 mm. We could do it as follows:
$ awk -F, '$1 == 1911 && $4 > 20' acorn01.csv
We can improve the output. First, we will print only the date and the rainfall:
$ awk -F, '$1 == 1911 && $4 > 20 {print $1, $2, $3, $4}' acorn01.csv
Which gives us:
1911 1 13 26.7
1911 1 18 22.9
1911 2 14 36.1
1911 3 8 20.1
1911 3 14 21.8
1911 5 18 20.1
1911 6 19 29.0
1911 11 28 45.2
We can improve it further as follows:
$ awk -F, '$1 == 1911 && $4 > 20 {print $3 "/" $2 "/" $1, $4}' \
acorn01.csv
The backslash (\) at the end of the line tells the shell that the command continues on the next line.
This gives the following output:
13/1/1911 26.7
18/1/1911 22.9
14/2/1911 36.1
8/3/1911 20.1
14/3/1911 21.8
18/5/1911 20.1
19/6/1911 29.0
28/11/1911 45.2
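The difference between the last two commands comes down to how print joins its arguments. Items separated by commas are joined with the output field separator (a single space by default), while items written next to each other are concatenated with nothing in between. A small sketch, using one record copied from the sample data at the top of the post (the shell variable names are hypothetical):

```shell
# One record copied from the sample data at the top of the post.
record='1910,1,8,18.8,34.4,17.8'

# Commas between items: each item is separated by a single space.
with_commas=$(printf '%s\n' "$record" | awk -F, '{print $1, $2, $3, $4}')

# Adjacent items are concatenated; only the final comma adds a space.
concatenated=$(printf '%s\n' "$record" | awk -F, '{print $3 "/" $2 "/" $1, $4}')

echo "$with_commas"    # 1910 1 8 18.8
echo "$concatenated"   # 8/1/1910 18.8
```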
It is possible to improve the output further so that it looks like this:
1911-01-13 26.7
1911-01-18 22.9
1911-02-14 36.1
1911-03-08 20.1
1911-03-14 21.8
1911-05-18 20.1
1911-06-19 29.0
1911-11-28 45.2
but that is slightly beyond the scope of this tutorial.
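For readers who want a head start, one possible way to get zero-padded dates is AWK's printf statement, which accepts C-style format strings. This is only a sketch on a single record copied from the sample data at the top of the post, and not necessarily the approach a fuller treatment would take. The variable names are hypothetical.

```shell
# One record copied from the sample data at the top of the post.
record='1910,1,8,18.8,34.4,17.8'

# %02d pads the month and day with a leading zero to two digits.
# %s prints the rainfall field exactly as it appears in the input.
# Unlike print, printf does not add a newline, so we supply \n ourselves.
iso_line=$(printf '%s\n' "$record" | awk -F, '{printf "%d-%02d-%02d %s\n", $1, $2, $3, $4}')

echo "$iso_line"   # 1910-01-08 18.8
```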
BEGIN and END
There are a number of special patterns. Two of these are BEGIN and END. BEGIN matches before any line of text has been read. END matches after the last line of text has been read. Suppose we wanted a header and a total. We could do this using BEGIN and END as follows:
BEGIN {print "Date", "Rainfall"}
$1 == 1911 && $4 > 20 {print $3 "/" $2 "/" $1, $4; total += $4}
END {print "Total:", total}
Save the above in a file. Call it rain.awk. We can now run it with the following command:
$ awk -F, -f rain.awk acorn01.csv
And we will see the following:
Date Rainfall
13/1/1911 26.7
18/1/1911 22.9
14/2/1911 36.1
8/3/1911 20.1
14/3/1911 21.8
18/5/1911 20.1
19/6/1911 29.0
28/11/1911 45.2
Total: 221.9
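BEGIN is also a convenient place to set the field separator, because it runs before any input is read. The sketch below is a hypothetical variant of the program above that assigns the built-in variable FS instead of relying on -F, on the command line. To keep it self-contained it runs on three records copied from the sample data at the top of the post, so the year (1910) and the threshold (15 mm) have been changed to suit that sample.

```shell
# Three records copied from the sample data at the top of the post.
records='1910,1,8,18.8,34.4,17.8
1910,1,9,0.0,-99.9,-99.9
1910,1,12,21.1,20.7,15.4'

# FS is set in BEGIN, so no -F option is needed on the command line.
report=$(printf '%s\n' "$records" | awk '
BEGIN {FS = ","; print "Date", "Rainfall"}
$1 == 1910 && $4 > 15 {print $3 "/" $2 "/" $1, $4; total += $4}
END {print "Total:", total}
')
echo "$report"
# Date Rainfall
# 8/1/1910 18.8
# 12/1/1910 21.1
# Total: 39.9
```

Keeping the separator inside the program means the script file works on its own, without the caller having to remember the right command-line option.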