Linux text processing with awk: practical examples

Posted by john87423 on Sat, 22 Jan 2022 07:54:09 +0100

1. What is awk

  AWK is a domain-specific language designed for text processing, typically used for data extraction and reporting. Like sed and grep, it can filter text, and all three are standard tools on most Linux systems.

  AWK was created at Bell Labs in the 1970s. Its name comes from the initials of its three authors' surnames (Aho, Weinberger, and Kernighan). AWK is a data-driven language: a program is a series of operations applied to a text stream, which can come from files or from a pipeline. It is used to extract or transform text, for example to generate formatted reports. AWK makes heavy use of strings, associative arrays, and regular expressions.

2. Print the contents of different columns

2.1 Data file

  The input file for the examples is data.txt, whose contents are shown below. The first column is the material, the second the weight, the third the year of issue, the fourth the country of issue, and the rest of the line is the name of the coin:

gold 1 1908 Austria-Hungary Franz josef 100 Korona
gold 1 1985 Canada Maple leaf
silver 10 1981 USA ingot
gold 1 1983 RSA Krugerrand
gold 0.5 1981 RSA Krugerrand
gold 1 1986 USA American Eagle
silver 1 1986 USA Liberty dollar
silver 1 1986 USA Liberty 50-cent piece
gold 0.25 1986 USA Liberty 5-dollar piece
silver 1 1987 USA Constitution dollar
gold 0.1 1986 PRC Panda
gold 0.25 1987 USA Constitution 5-dollar piece
gold 1 1984 Switzerland ingot

  The coin name contains a varying number of spaces, so it spans a varying number of awk fields. This is a common difficulty in text processing.

2.2 Three kinds of awk blocks

  • BEGIN {statements executed once, before any input is read}
  • END {statements executed once, after all lines have been processed}
  • {statements executed for each input line}
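
  To make the three blocks concrete, here is a minimal sketch that uses all of them in one program. The file coins_sample.txt is a hypothetical three-line excerpt in the same layout as data.txt, created only for this illustration:

```shell
# Hypothetical three-line sample in the same layout as data.txt
cat > coins_sample.txt <<'EOF'
gold 1 1986 USA American Eagle
silver 10 1981 USA ingot
gold 0.5 1981 RSA Krugerrand
EOF

awk 'BEGIN { print "start" }     # runs once, before any input is read
     { total += $2 }             # runs for every input line
     END { print "lines:", NR, "total weight:", total }   # runs once, at the end
' coins_sample.txt
```

  With this sample, the last line printed should be "lines: 3 total weight: 11.5".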

2.3 Print all contents

  To print the entire contents of a file with awk:

awk '{print}' data.txt

  This command is equivalent to cat data.txt.

2.4 Print column N

  To print the first column:

awk '{print $1}' data.txt

  To print the Nth column, change the 1 in $1 to N. For example, the third column:

awk '{print $3}' data.txt

2.5 Print multiple columns

  The previous section printed a single column. If you know Python, you may wonder whether awk has something like slicing, that is, whether several columns can be extracted at once. There is no slice syntax, but you can list the fields you want.

  To print the first three columns together:

awk '{print $1, $2, $3}' data.txt

2.6 Print multiple columns, aligned

  You can align the output by putting a tab between the columns. Note that string literals in an awk program must be written in double quotation marks:

awk '{print $1 "\t" $2 "\t" $3}' data.txt
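
  Tabs only line up while every value is shorter than a tab stop. As an alternative sketch, awk's printf statement can pad each column to a fixed width. The widths 8 and 6 below are arbitrary choices for this sample input, not something the data requires:

```shell
# %-8s pads to 8 characters, left-aligned; %6s pads to 6 characters, right-aligned
printf 'gold 0.25 1986\nsilver 10 1981\n' |
awk '{ printf "%-8s%6s%6s\n", $1, $2, $3 }'
```

  Unlike print, printf does not add a newline by itself, which is why the format string ends in \n.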

2.7 Meaning of $0

  $0 is not a column: it holds the entire current line, that is, all of its columns. The following therefore prints every line in full:

awk '{print $0}' data.txt

3. Print line number and column number

  The line number (NR) is the number of the line currently being processed, and the column count (NF) is the total number of columns in that line.

3.1 Separators between columns

  Suppose the first string is s1 and the second is s2. In a print statement:

  • Space: string concatenation, i.e. s1s2
  • Comma: the values are joined by the output separator (a single space by default), i.e. s1 + space + s2
  • TAB ("\t"): the values are joined by a tab, i.e. s1 + TAB + s2
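
  The three cases can be compared side by side on a single input line. This is only a small sketch with a made-up line:

```shell
echo 'gold 1 1986' | awk '{
    print $1 $2        # adjacent expressions: concatenation, prints gold1
    print $1, $2       # comma: joined by the output separator, prints gold 1
    print $1 "\t" $2   # explicit tab string between the two values
}'
```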

3.2 Print the line number

  Join the line number and the original line with a space:

awk '{print NR, $0}' data.txt

  Join the line number and the original line with a tab:

awk '{print NR "\t" $0}' data.txt

3.3 Print the number of columns

  By default, awk splits each line into columns at runs of spaces or tabs. A coin name made of several words is therefore split into several columns, so different lines have different numbers of columns. To print the number of columns in front of each line:

awk '{print NF, $0}' data.txt
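
  One way to cope with a name that spans a varying number of columns is to rebuild it with a loop that runs up to NF. The sketch below assumes, as in data.txt, that the name starts at column 5. The variable name is only illustrative:

```shell
echo 'silver 1 1986 USA Liberty 50-cent piece' |
awk '{
    name = $5
    for (i = 6; i <= NF; i++)   # append the remaining words, if any
        name = name " " $i
    print $3 "\t" name          # year, a tab, then the full coin name
}'
```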

3.4 Print the Nth column from the end

  $NF is the last column, so to print it:

awk '{print $NF}' data.txt

  In general, the Nth column from the end is $(NF-N+1). For example, the second-to-last column is $(NF-1):

awk '{print $(NF-1)}' data.txt
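
  As a small sketch of counting columns from the end, the command below passes a hypothetical variable n on the command line with -v and prints the nth column from the end as $(NF - n + 1):

```shell
# The line has 8 columns, so with n=3 the expression is $(8 - 3 + 1) = $6
echo 'gold 1 1908 Austria-Hungary Franz josef 100 Korona' |
awk -v n=3 '{ print $(NF - n + 1) }'
```

  For this line the command prints josef, the third word from the end.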

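3.5 Print the last line

  In the END block, NR is the total number of lines and, in POSIX-conforming awk implementations, $0 still holds the last line read: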

awk 'END {print NR, $0}' data.txt

4. Modify input and output separators

  By default, the input separator is any run of spaces or tabs and the output separator is a single space. If the input is a CSV file, the input separator must be changed to a comma.

  With the default separator, a CSV line that contains no spaces is not split at all, so the whole line is treated as a single column.

  • Input separator: FS

  To change the input separator:

awk 'BEGIN{FS=","} {print $1, $2}' time_name_score.csv

  • Output separator: OFS

  To change the input and output separators together:

awk 'BEGIN{FS=","; OFS="\t"} {print $1, $2}' time_name_score.csv
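
  The same separators can also be set from the command line instead of a BEGIN block. The sketch below uses a hypothetical file scores_sample.csv with invented rows, since the contents of time_name_score.csv are not shown here:

```shell
# Hypothetical CSV created only for this illustration
printf '09:00,alice,90\n09:05,bob,85\n' > scores_sample.csv

# -F sets FS; -v OFS=... sets the output separator before any input is read
awk -F',' -v OFS='\t' '{ print $2, $3 }' scores_sample.csv

# Assigning to a field makes awk rebuild $0 with OFS, converting the whole line
awk -F',' -v OFS='\t' '{ $1 = $1; print }' scores_sample.csv
```

  Note that OFS only takes effect between comma-separated print arguments or when a line is rebuilt, so a plain print $0 without the assignment would still show commas.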

5. Enter multiple files

  List the file names in order at the end of the command. Note that awk processes the files one after another as a single stream of lines, so NR keeps counting across files:

awk '{print NR, $0}' data.txt data2.txt
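
  When the per-file line number matters, awk also provides FNR, which restarts at 1 for each file, and FILENAME, the name of the file being read. A minimal sketch with two hypothetical sample files:

```shell
# Two tiny files created only for this illustration
printf 'a\nb\n' > part1_sample.txt
printf 'c\n'    > part2_sample.txt

# NR counts across all files; FNR restarts for each file
awk '{ print FILENAME, NR, FNR, $0 }' part1_sample.txt part2_sample.txt
```

  The last line printed should be "part2_sample.txt 3 1 c": the third line overall, but the first line of the second file.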

6. Modify the value of a column

  Sometimes every value in a column needs to be set to the same value. For example, to set the first column of every line to silver:

awk '{$1 = "silver"; print $0}' data.txt
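
  A column can also be changed only on the lines that meet a condition. In the sketch below, the replacement value ZA is a made-up example, and the two input lines are taken from data.txt:

```shell
printf 'gold 1 1983 RSA Krugerrand\ngold 1 1985 Canada Maple leaf\n' |
awk '$4 == "RSA" { $4 = "ZA" }   # change the field only on matching lines
     { print }                   # print every line, changed or not
'
```

  Only the first line is changed, to "gold 1 1983 ZA Krugerrand"; the second is printed as it was.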

7. Print after condition filtering

  • Print all coins made of silver:
awk '$1=="silver" {print $0}' data.txt

  • Print all coins issued in 1987:
awk '$3==1987 {print $0}' data.txt

  • Print line 7:
awk 'NR==7 {print $0}' data.txt

  • Print all lines with 7 columns:
awk 'NF==7 {print $0}' data.txt
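
  Conditions can be combined with && (and), || (or), and ! (not). As a sketch, the command below keeps only gold coins weighing at least 1 and prints their year and country. The three input lines are taken from data.txt:

```shell
printf 'gold 1 1986 USA American Eagle\ngold 0.25 1986 USA Liberty 5-dollar piece\nsilver 10 1981 USA ingot\n' |
awk '$1 == "gold" && $2 >= 1 { print $3, $4 }'
```

  Only the first line passes both tests, so the output is "1986 USA".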

8. Operations

8.1 Arithmetic

  The commands below are given no input file, so awk reads from standard input. After entering a command, press Enter once more to feed it an empty line and see the result, then press Ctrl+C to exit:

  • Addition: awk '{a=1; b=2; print a+b}'
  • Subtraction: awk '{a=1; b=2; print a-b}'
  • Multiplication: awk '{a=1; b=2; print a*b}'
  • Division: awk '{a=1; b=2; print a/b}'
  • Remainder: awk '{a=1; b=2; print a%b}'
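
  If waiting for input is inconvenient, the same arithmetic can be placed in a BEGIN block, which runs before any input is read, so the command finishes on its own. A minimal sketch with the same values:

```shell
# A program with only a BEGIN block reads no input and exits right after the block
awk 'BEGIN { a = 1; b = 2; print a + b, a - b, a * b, a / b, a % b }'
```

  This prints "3 -1 2 0.5 1" on one line.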

8.2 String concatenation

  String concatenation: awk '{a=1; b=2; print a b}'

8.3 Mixed operations

  A concatenation can be grouped with parentheses and then used in arithmetic, as shown below:

awk '{a=1; b=2; c=3; print (a b)+c}'


  If a string begins with a number, that leading number is used as its value in arithmetic. Otherwise the string counts as 0, even if digits appear later in the string.
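
  The conversion rule can be checked with two made-up strings, one that starts with a digit and one that does not:

```shell
awk 'BEGIN {
    print "3abc" + 1    # the leading 3 is used: 3 + 1 = 4
    print "abc3" + 1    # no leading number, so the string counts as 0: 0 + 1 = 1
}'
```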

9. Regular expressions

  Write a regular expression between slashes, /regular expression/, to search the text. For example, to print every line that contains the string com:

awk '/com/{print $0}' data.txt
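
  A bare /regex/ is tested against the whole line. To test a single column instead, use the ~ operator (or !~ for "does not match"). The sketch below prints the material and country of coins whose country column starts with U, using three lines taken from data.txt:

```shell
printf 'gold 1 1986 USA American Eagle\ngold 1 1983 RSA Krugerrand\nsilver 1 1986 USA Liberty dollar\n' |
awk '$4 ~ /^U/ { print $1, $4 }'   # ^ anchors the match to the start of column 4
```

  This prints "gold USA" and "silver USA" and skips the RSA line.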

Topics: Linux