Chapter 6: Text processing tools for Shell programming (awk)

Posted by mightymaster on Sun, 13 Feb 2022 05:08:08 +0100

1, awk introduction

1. awk overview

awk is a programming language and command-line tool used mainly to process text and data on Linux/Unix. Its input can come from standard input, one or more files, or the output of other commands.
awk processes text line by line: by default it scans from the first line to the last, finds the lines that match a given pattern, and performs the requested action on them.
The name awk comes from the first letters of its three authors' surnames: Alfred Aho, Peter Weinberger and Brian Kernighan.
gawk is the GNU version of awk; it adds Bell Labs and GNU extensions.
The examples below use GNU gawk. On many Linux systems awk is a link to gawk, so everything below is written simply as awk.

2. What can awk do?

awk processes files and data; it is both a Unix tool and a programming language.
It can compute statistics, such as the number of visits to a website or the number of visits per IP.
It supports conditionals as well as for and while loops.

2, awk usage

1. Using command line mode

I. Syntax

awk [options] 'commands' filename


Note:
To reference a shell variable inside the commands, enclose it in double quotes

II. Introduction to common options

-F defines the field separator. The default separator is whitespace (spaces or tabs)
-v defines a variable and assigns it a value

III. Description of the command part

Regular expressions and address ranges

'/root/{awk statements}'				in sed: '/root/p'
'NR==1,NR==5{awk statements}'			in sed: '1,5p'
'/^root/,/^ftp/{awk statements}'  	in sed: '/^root/,/^ftp/p'

{awk statement 1; awk statement 2; ...}

'{print $0;print $1}'		in sed: 'p'
'NR==5{print $0}'				in sed: '5p'
Note: separate awk statements with semicolons

BEGIN...END...

'BEGIN{awk statements};{processing};END{awk statements}'
'BEGIN{awk statements};{processing}'
'{processing};END{awk statements}'

2. Using script mode

I. scripting

#!/bin/awk -f
# The line above is the shebang line (the path to awk may differ on your system).
# The commands below are what would go inside the quotes on the command line; in a script, do not wrap them in quotes. Separate multiple commands with semicolons or newlines
BEGIN{FS=":"}
NR==1,NR==3{print $1"\t"$NF}
...

II. Script execution

Method 1:
awk [options] -f awk-script-file file-to-process
awk -f awk.sh filename

sed -f sed.sh -i filename

Method 2:
./awk-script-file (or its absolute path)	file-to-process   (the script file must be executable)
./awk.sh filename

./sed.sh filename
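
A minimal end-to-end sketch of script mode, using Method 1 (awk -f). The temporary directory and the file names users.txt and first_last.awk are made up for illustration, and the sample lines only imitate /etc/passwd.

```shell
# Hypothetical file names; everything lives in a temporary directory.
workdir=$(mktemp -d)

# Sample colon-separated input in /etc/passwd style.
cat > "$workdir/users.txt" <<'EOF'
root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
adm:x:3:4:adm:/var/adm:/sbin/nologin
EOF

# The awk script itself: no surrounding quotes, one rule per line.
cat > "$workdir/first_last.awk" <<'EOF'
BEGIN { FS = ":" }                       # set the input field separator
NR == 1, NR == 3 { print $1 "\t" $NF }   # lines 1-3: first and last field
EOF

# Method 1: hand the script to awk with -f.
run_script() {
    awk -f "$workdir/first_last.awk" "$workdir/users.txt"
}
run_script
# rm -r "$workdir"   # clean up when finished
```

Only the first three lines are printed, each as the user name, a tab, and the login shell. Method 2 would additionally need a shebang line with the correct awk path and execute permission on the script.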

3, awk internal related variables

Variable	Description	Example / remarks
$0	The entire current record (line)
$1,$2,$3...$n	The individual fields of each line, split on the field separator	awk -F: '{print $1,$3}'
NF	Number of fields (columns) in the current record	awk -F: '{print NF}'
$NF	Last column	$(NF-1) is the second-to-last column
FNR/NR	Record (line) number: NR counts across all input files, FNR restarts for each file
FS	Input field separator	'BEGIN{FS=":"};{print $1,$3}'
OFS	Output field separator, default is a space	'BEGIN{OFS="\t"};{print $1,$3}'
RS	Input record separator, default is a newline	'BEGIN{RS="\t"};{print $0}'
ORS	Output record separator, default is a newline	'BEGIN{ORS="\n\n"};{print $1,$3}'
FILENAME	Name of the file currently being read
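
The variables in the table can be tried on a small made-up input. The sketch below assumes two sample lines in /etc/passwd style and two throwaway files (a.txt, b.txt, names chosen only for this example) to show how NR and FNR differ.

```shell
# NR, NF, $1, $(NF-1) and $NF on two sample records.
show_vars() {
    printf 'root:x:0:0:root:/root:/bin/bash\nlp:x:4:7:lp:/var/spool/lpd:/sbin/nologin\n' |
        awk -F: '{ print NR, NF, $1, $(NF-1), $NF }'
}
show_vars
# 1 7 root /root /bin/bash
# 2 7 lp /var/spool/lpd /sbin/nologin

# NR keeps counting across files, FNR restarts at 1 for each file.
nr_vs_fnr() {
    dir=$(mktemp -d)
    printf 'a1\na2\n' > "$dir/a.txt"
    printf 'b1\nb2\n' > "$dir/b.txt"
    awk '{ print $0, NR, FNR }' "$dir/a.txt" "$dir/b.txt"
    rm -r "$dir"
}
nr_vs_fnr
# a1 1 1
# a2 2 2
# b1 3 1
# b2 4 2
```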

1. Examples of common built-in variables

# awk -F: '{print $1,$(NF-1)}' 1.txt
# awk -F: '{print $1,$(NF-1),$NF,NF}' 1.txt
# awk '/root/{print $0}' 1.txt
# awk '/root/' 1.txt
# awk -F: '/root/{print $1,$NF}' 1.txt 
root /bin/bash
# awk -F: '/root/{print $0}' 1.txt      
root:x:0:0:root:/root:/bin/bash
# awk 'NR==1,NR==5' 1.txt 
# awk 'NR==1,NR==5{print $0}' 1.txt
# awk 'NR==1,NR==5;/^root/{print $0}' 1.txt 
root:x:0:0:root:/root:/bin/bash
root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
adm:x:3:4:adm:/var/adm:/sbin/nologin
lp:x:4:7:lp:/var/spool/lpd:/sbin/nologin

2. Examples of built-in variable separators

FS and OFS:
# awk 'BEGIN{FS=":"};/^root/,/^lp/{print $1,$NF}' 1.txt
# awk -F: 'BEGIN{OFS="\t\t"};/^root/,/^lp/{print $1,$NF}' 1.txt        
root            /bin/bash
bin             /sbin/nologin
daemon          /sbin/nologin
adm             /sbin/nologin
lp              /sbin/nologin
# awk -F: 'BEGIN{OFS="@@@"};/^root/,/^lp/{print $1,$NF}' 1.txt     
root@@@/bin/bash
bin@@@/sbin/nologin
daemon@@@/sbin/nologin
adm@@@/sbin/nologin
lp@@@/sbin/nologin

RS and ORS: 
Edit the source file and append tab-separated content to its first 2 lines:
vim 1.txt
root:x:0:0:root:/root:/bin/bash hello   world
bin:x:1:1:bin:/bin:/sbin/nologin        test1   test2

# awk 'BEGIN{RS="\t"};{print $0}' 1.txt
# awk 'BEGIN{ORS="\t"};{print $0}' 1.txt
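
To make the effect of RS and ORS visible without editing a file, here is a small sketch on made-up input. With RS set to a tab, each tab-separated piece becomes its own record; with ORS set to a tab, the output lines are joined by tabs instead of newlines.

```shell
# RS="\t": a tab ends a record, so one input line becomes three records.
split_on_tab() {
    printf 'a\tb\tc' | awk 'BEGIN{RS="\t"};{print $0}'
}
split_on_tab
# a
# b
# c

# ORS="\t": every print ends with a tab instead of a newline.
join_with_tab() {
    printf 'a\nb\nc\n' | awk 'BEGIN{ORS="\t"};{print $0}'
}
join_with_tab    # prints a<TAB>b<TAB>c<TAB> on a single line, with no trailing newline
```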

4, awk working principle

awk -F: '{print $1,$3}' /etc/passwd

awk reads one line of input at a time and assigns it to the internal variable $0. Each line is also called a record and ends with the record separator RS (a newline by default)
Each record is split into fields by the field separator (here ":"; the default is whitespace, that is, spaces or tabs), and each field is stored in a numbered variable, starting with $1

Q: how does awk know to split fields on whitespace?

A: the internal variable FS determines the field separator. Initially, FS is set to a single space, which awk treats as "any run of spaces or tabs"
awk uses print to output the fields. They are separated by a space because there is a comma between $1 and $3. The comma is special: it maps to another internal variable, the output field separator OFS, which defaults to a space
After awk finishes one line, it reads the next line from the file into $0, overwriting the previous content, then splits the new string into fields and processes it. This repeats until all lines have been processed
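
The steps above can be observed on a single made-up record. The sketch shows that a comma in print inserts OFS, that plain juxtaposition concatenates without a separator, and that the default FS ignores leading and repeated whitespace.

```shell
# Comma inserts OFS (a space); no comma means plain concatenation; $0 is untouched.
fields_demo() {
    printf 'root:x:0:0\n' | awk -F: '{ print $1, $3; print $1 $3; print $0 }'
}
fields_demo
# root 0
# root0
# root:x:0:0

# Default FS: runs of spaces and tabs separate fields, leading blanks are ignored.
whitespace_demo() {
    printf '  a   b\tc \n' | awk '{ print NF, $1, $3 }'
}
whitespace_demo
# 3 a c
```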

5, awk use advanced

1. Format output print and printf

print function		similar to echo "hello world"
# date |awk '{print "Month: "$2 "\nYear: "$NF}'
# awk -F: '{print "username is: " $1 "\t uid is: "$3}' /etc/passwd


printf function		similar to echo -n
# awk -F: '{printf "%-15s %-10s %-15s\n", $1,$2,$3}'  /etc/passwd
# awk -F: '{printf "|%15s| %10s| %15s|\n", $1,$2,$3}' /etc/passwd
# awk -F: '{printf "|%-15s| %-10s| %-15s|\n", $1,$2,$3}' /etc/passwd

awk 'BEGIN{FS=":"};{printf "%-15s %-15s %-15s\n",$1,$6,$NF}' a.txt

%s	string type, e.g. %-20s
%d	integer type
15	field width of 15 characters
-	left-align (the default is right-align)
printf does not add a newline at the end by default; append \n yourself
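
A short sketch of these format specifiers on made-up records, with the field widths marked by | so the padding is visible. The widths 6 and 3 are arbitrary choices for the example.

```shell
# %-6s left-aligns in 6 columns, %6s right-aligns, %3d right-aligns an integer.
printf_demo() {
    printf 'root:x:0\nbin:x:1\n' |
        awk -F: '{ printf "|%-6s|%6s|%3d|\n", $1, $1, $3 }'
}
printf_demo
# |root  |  root|  0|
# |bin   |   bin|  1|
```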

2. awk variable definition

# awk -v NUM=3 -F: '{ print $NUM }' /etc/passwd
# awk -v NUM=3 -F: '{ print NUM }' /etc/passwd
# awk -v num=1 'BEGIN{print num}' 
1
# awk -v num=1 'BEGIN{print $num}' 
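Note:
Variables defined with -v are referenced by name, without $. In awk, $ selects a field, so $NUM with NUM=3 means the third field, and $num in a BEGIN block prints an empty line because no record has been read yet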
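
Two common ways to get a shell variable into an awk program, sketched on made-up input. The shell variable col is hypothetical. The -v form is usually easier to read; the quoting form closes the single-quoted program, splices in the double-quoted shell variable, and reopens the quotes.

```shell
col=3   # hypothetical shell variable holding a column number

# Pass the value with -v; inside awk, n is the number and $n is that field.
pick_with_v() {
    printf 'root:x:0:0\nbin:x:1:1\n' | awk -F: -v n="$col" '{ print n, $n }'
}
pick_with_v
# 3 0
# 3 1

# Splice the shell variable into the program text; awk sees { print $3 }.
pick_with_quotes() {
    printf 'root:x:0:0\nbin:x:1:1\n' | awk -F: '{ print $'"$col"' }'
}
pick_with_quotes
# 0
# 1
```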

3. BEGIN... END in awk

① BEGIN: runs before any input is read

② END: runs after all input has been processed

③ usage: 'BEGIN{before processing};{processing};END{after processing}'

I. example 1

Print the last and penultimate columns (login shell and home directory)

awk -F: 'BEGIN{ print "Login_shell\t\tLogin_home\n*******************"};{print $NF"\t\t"$(NF-1)};END{print "************************"}' 1.txt

awk 'BEGIN{ FS=":";print "Login_shell\tLogin_home\n*******************"};{print $NF"\t"$(NF-1)};END{print "************************"}' 1.txt

Login_shell		Login_home
************************
/bin/bash		/root
/sbin/nologin		/bin
/sbin/nologin		/sbin
/sbin/nologin		/var/adm
/sbin/nologin		/var/spool/lpd
/bin/bash		/home/redhat
/bin/bash		/home/user01
/sbin/nologin		/var/named
/bin/bash		/home/u01
/bin/bash		/home/YUNWEI
************************************

II. Examples 2

Print the user name, home directory and login shell from /etc/passwd

u_name      h_dir       shell
***************************

***************************

awk -F: 'BEGIN{OFS="\t\t";print"u_name\t\th_dir\t\tshell\n***************************"};{printf "%-20s %-20s %-20s\n",$1,$(NF-1),$NF};END{print "****************************"}' /etc/passwd


# awk -F: 'BEGIN{print "u_name\t\th_dir\t\tshell" RS "*****************"}  {printf "%-15s %-20s %-20s\n",$1,$(NF-1),$NF}END{print "***************************"}'  /etc/passwd

Format output:
echo		print
echo -n	printf

{printf "%-15s %-20s %-20s\n",$1,$(NF-1),$NF}

4. Combining awk with regular expressions

Operator	Meaning
==	Equal to
!=	Not equal to
>	Greater than
<	Less than
>=	Greater than or equal to
<=	Less than or equal to
~	Matches (regular expression)
!~	Does not match
!	Logical NOT
&&	Logical AND
||	Logical OR

I. Examples

From the first line through the line beginning with lp
awk -F: 'NR==1,/^lp/{print $0 }' passwd  
From line 1 through line 5
awk -F: 'NR==1,NR==5{print $0 }' passwd
From the line beginning with lp through line 10
awk -F: '/^lp/,NR==10{print $0 }' passwd 
From the line beginning with root through the line beginning with lp
awk -F: '/^root/,/^lp/{print $0}' passwd
Lines beginning with root or beginning with lp
awk -F: '/^root/ || /^lp/{print $0}' passwd
awk -F: '/^root/;/^lp/{print $0}' passwd
Display lines 5-10
awk -F':' 'NR>=5 && NR<=10 {print $0}' /etc/passwd     
awk -F: 'NR<=10 && NR>=5 {print $0}' passwd 
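
The difference between a range pattern (two patterns joined by a comma) and a logical OR is easy to miss, so here is a small sketch on five made-up records. The range prints everything from the first match through the second; the OR prints only the lines that match one of the two expressions.

```shell
passwd_sample() {
    printf '%s\n' 'root:x:0' 'bin:x:1' 'daemon:x:2' 'lp:x:4' 'sync:x:5'
}

# Range: from the line starting with root through the line starting with lp.
range_demo() { passwd_sample | awk -F: '/^root/,/^lp/ { print NR, $1 }'; }
range_demo
# 1 root
# 2 bin
# 3 daemon
# 4 lp

# OR: only the lines that start with root or with lp.
either_demo() { passwd_sample | awk -F: '/^root/ || /^lp/ { print NR, $1 }'; }
either_demo
# 1 root
# 4 lp
```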

Print the lines among 30-39 that end with bash:
[root@MissHou shell06]# awk 'NR>=30 && NR<=39 && $0 ~ /bash$/{print $0}' passwd 
stu1:x:500:500::/home/stu1:/bin/bash
yunwei:x:501:501::/home/yunwei:/bin/bash
user01:x:502:502::/home/user01:/bin/bash
user02:x:503:503::/home/user02:/bin/bash
user03:x:504:504::/home/user03:/bin/bash

[root@MissHou shell06]# awk 'NR>=3 && NR<=8 && /bash$/' 1.txt  
stu7:x:1007:1007::/rhome/stu7:/bin/bash
stu8:x:1008:1008::/rhome/stu8:/bin/bash
stu9:x:1009:1009::/rhome/stu9:/bin/bash

Print the lines among 1-5 of the file that begin with root (and, with !~, those that do not):
[root@MissHou shell06]# awk 'NR>=1 && NR<=5 && $0 ~ /^root/{print $0}' 1.txt
root:x:0:0:root:/root:/bin/bash
[root@MissHou shell06]# awk 'NR>=1 && NR<=5 && $0 !~ /^root/{print $0}' 1.txt
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
adm:x:3:4:adm:/var/adm:/sbin/nologin
lp:x:4:7:lp:/var/spool/lpd:/sbin/nologin


Understand the difference between ; and ||:
[root@MissHou shell06]# awk 'NR>=3 && NR<=8 || /bash$/' 1.txt
[root@MissHou shell06]# awk 'NR>=3 && NR<=8;/bash$/' 1.txt
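
A sketch of the difference on four made-up lines: with || there is a single rule, so each line is printed at most once; with ; there are two independent rules, so a line that satisfies both is printed twice.

```shell
sample_lines() {
    printf 'l1:/bin/bash\nl2:/sbin/nologin\nl3:/bin/bash\nl4:/sbin/nologin\n'
}

# One rule: print a line if it is in lines 3-4 OR ends with bash.
with_or() { sample_lines | awk 'NR>=3 && NR<=4 || /bash$/'; }
with_or
# l1:/bin/bash
# l3:/bin/bash
# l4:/sbin/nologin

# Two rules: line 3 matches both, so it appears twice.
with_semicolon() { sample_lines | awk 'NR>=3 && NR<=4;/bash$/'; }
with_semicolon
# l1:/bin/bash
# l3:/bin/bash
# l3:/bin/bash
# l4:/sbin/nologin
```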


Print IP address
# ifconfig eth0|awk 'NR>1 {print $2}'|awk -F':' 'NR<2 {print $2}'    
# ifconfig eth0|grep Bcast|awk -F':' '{print $2}'|awk '{print $1}'
# ifconfig eth0|grep Bcast|awk '{print $2}'|awk -F: '{print $2}'


# ifconfig eth0|awk NR==2|awk -F '[ :]+' '{print $4RS$6RS$8}'
# ifconfig eth0|awk -F"[ :]+" '/inet addr:/{print $4}'
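
The last two commands rely on a regular expression as the field separator. The sketch below replays that on one made-up line in the older ifconfig output format that those commands assume (inet addr:...); newer tools print a different layout, so the field numbers are specific to this format.

```shell
# -F'[ :]+' splits on any run of spaces and colons.
# The leading spaces produce an empty $1, so the address is $4.
ip_fields() {
    printf '          inet addr:192.0.2.10  Bcast:192.0.2.255  Mask:255.255.255.0\n' |
        awk -F'[ :]+' '{ print $4; print $6; print $8 }'
}
ip_fields
# 192.0.2.10
# 192.0.2.255
# 255.255.255.0
```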

4. Practice cases

Display all information for users who can log in to the operating system: match lines whose 7th column ends with bash and output the whole line (all columns of the current line)

[root@MissHou ~] awk '/bash$/{print $0}'    /etc/passwd
[root@MissHou ~] awk '/bash$/{print $0}' /etc/passwd
[root@MissHou ~] awk '/bash$/' /etc/passwd
[root@MissHou ~] awk -F: '$7 ~ /bash/' /etc/passwd
[root@MissHou ~] awk -F: '$NF ~ /bash/' /etc/passwd
[root@MissHou ~] awk -F: '$0 ~ /bash/' /etc/passwd
[root@MissHou ~] awk -F: '$0 ~ /\/bin\/bash/' /etc/passwd

Display the names of users who can log in to the system

# awk -F: '$0 ~ /\/bin\/bash/{print $1}' /etc/passwd

Print out the UID and user name of ordinary users in the system

500	stu1
501	yunwei
502	user01
503	user02
504	user03


# awk -F: 'BEGIN{print "UID\tUSERNAME"} {if($3>=500 && $3 !=65534 ) {print $3"\t"$1} }' /etc/passwd
UID	USERNAME


# awk -F: '{if($3 >= 500 && $3 != 65534) print $1,$3}' a.txt 
redhat 508
user01 509
u01 510
YUNWEI 511

5. Script programming of awk

I. Flow control statements

① if structure

if statement (shell form, for comparison):

if [ xxx ];then
xxx
fi

Format:
awk [options] 'pattern{awk statements}'  filename

{ if(expression) {statement 1;statement 2;...} }

awk -F: '{if($3>=500 && $3<=60000) {print $1,$3} }' passwd

# awk -F: '{if($3==0) {print $1" is an administrator"}}' passwd 
root is an administrator

# awk 'BEGIN{if('$(id -u)'==0) {print "admin"} }'
admin

② if... else structure

if...else statement (shell form, for comparison):
if [ xxx ];then
	xxxxx
	
else
	xxx
fi

Format:
{if(expression) {statement;statement;...} else {statement;statement;...}}

awk -F: '{ if($3>=500 && $3 != 65534) {print $1,"is an ordinary user"} else {print $1,"is not an ordinary user"}}' passwd 

awk 'BEGIN{if( '$(id -u)'>=500 && '$(id -u)' !=65534 ) {print "is an ordinary user"} else {print "is not an ordinary user"}}'

③ if... elif... else structure

if [xxxx];then
	xxxx
elif [xxx];then
	xxx
....
else
...
fi


if...else if...else statement:

Format:
{ if(expression 1) {statement;statement;...} else if(expression 2) {statement;statement;...} else if(expression 3) {statement;statement;...} else {statement;statement;...} }

awk -F: '{ if($3==0) {print $1,": is an administrator"} else if($3>=1 && $3<=499 || $3==65534 ) {print $1,": is a system user"} else {print $1,": is an ordinary user"}}' passwd


awk -F: '{ if($3==0) {i++} else if($3>=1 && $3<=499 || $3==65534 ) {j++} else {k++}};END{print "The number of administrators is:"i "\n The number of system users is:"j"\n The number of ordinary users is:"k }' passwd


# awk -F: '{if($3==0) {print $1,"is admin"} else if($3>=1 && $3<=499 || $3==65534) {print $1,"is sys users"} else {print $1,"is general user"} }' a.txt 

root is admin
bin is sys users
daemon is sys users
adm is sys users
lp is sys users
redhat is general user
user01 is general user
named is sys users
u01 is general user
YUNWEI is general user

awk -F: '{  if($3==0) {print $1":is an administrator"} else if($3>=1 && $3<500 || $3==65534 ) {print $1":is a system user"} else {print $1":is an ordinary user"}}'   /etc/passwd


awk -F: '{if($3==0) {i++} else if($3>=1 && $3<500 || $3==65534){j++} else {k++}};END{print "The number of administrators is:" i RS "The number of system users is:"j RS "The number of ordinary users is:"k }' /etc/passwd
 The number of administrators is:1
 The number of system users is:28
 The number of ordinary users is:27


# awk -F: '{if($3==0) {print $1": administrator"} else if($3>=500 && $3!=65534) {print $1": ordinary user"} else {print $1": system user"}}' passwd 

awk -F: '{if($3==0){i++} else if($3>=500){k++} else{j++}} END{print i; print k; print j}' /etc/passwd

awk -F: '{if($3==0){i++} else if($3>999){k++} else{j++}} END{print "Number of administrators: "i; print "Number of ordinary: "k; print "System user: "j}' /etc/passwd 

For an ordinary user, print the default shell; for a system user, print the user name
# awk -F: '{if($3>=1 && $3<500 || $3 == 65534) {print $1} else if($3>=500 && $3<=60000 ) {print $NF} }' /etc/passwd

II. Loop statements

① for loop

Print 1~5
for ((i=1;i<=5;i++));do echo $i;done

# awk 'BEGIN { for(i=1;i<=5;i++) {print i} }'
Print the odd numbers from 1 to 10 (the first command pipes them into awk, which prints their sum)
# for ((i=1;i<=10;i+=2));do echo $i;done|awk '{sum+=$0};END{print sum}'
# awk 'BEGIN{ for(i=1;i<=10;i+=2) {print i} }'
# awk 'BEGIN{ for(i=1;i<=10;i+=2) print i }'

Compute the sum of 1 through 5
# awk 'BEGIN{sum=0;for(i=1;i<=5;i++) sum+=i;print sum}'
# awk 'BEGIN{for(i=1;i<=5;i++) (sum+=i);{print sum}}'
# awk 'BEGIN{for(i=1;i<=5;i++) (sum+=i);print sum}'

② while loop

Print 1-5
# i=1;while (($i<=5));do echo $i;let i++;done

# awk 'BEGIN { i=1;while(i<=5) {print i;i++} }'
Print the odd numbers from 1 to 10
# awk 'BEGIN{i=1;while(i<=10) {print i;i+=2} }'
Compute the sum of 1 through 5
# awk 'BEGIN{i=1;sum=0;while(i<=5) {sum+=i;i++}; print sum }'
# awk 'BEGIN {i=1;while(i<=5) {(sum+=i) i++};print sum }'

③ Nested loop

Nested loop:
#!/bin/bash
for ((y=1;y<=5;y++))
do
	for ((x=1;x<=$y;x++))
	do
		echo -n $x	
	done
echo
done

awk 'BEGIN{ for(y=1;y<=5;y++) {for(x=1;x<=y;x++) {printf x} ;print } }'


# awk 'BEGIN { for(y=1;y<=5;y++) { for(x=1;x<=y;x++) {printf x};print} }'
1
12
123
1234
12345

# awk 'BEGIN{ y=1;while(y<=5) { for(x=1;x<=y;x++) {printf x};y++;print}}'
1
12
123
1234
12345

Try printing the 9x9 multiplication table in three ways:
# awk 'BEGIN{for(y=1;y<=9;y++) { for(x=1;x<=y;x++) {printf x"*"y"="x*y"\t"};print} }'

# awk 'BEGIN{for(y=1;y<=9;y++) { for(x=1;x<=y;x++) printf x"*"y"="x*y"\t";print} }'
# awk 'BEGIN{i=1;while(i<=9){for(j=1;j<=i;j++) {printf j"*"i"="j*i"\t"};print;i++ }}'

# awk 'BEGIN{for(i=1;i<=9;i++){j=1;while(j<=i) {printf j"*"i"="i*j"\t";j++};print}}'

Loop control:
break		Exit the loop when the condition is met
continue	Skip the rest of the current iteration when the condition is met
# awk 'BEGIN{for(i=1;i<=5;i++) {if(i==3) break;print i} }'
1
2
# awk 'BEGIN{for(i=1;i<=5;i++){if(i==3) continue;print i}}'
1
2
4
5

6. awk arithmetic operation

+ - * / %(modulo) ^(power, e.g. 2^3)
You can perform calculations in patterns; awk performs arithmetic in floating point
# awk 'BEGIN{print 1+1}'
# awk 'BEGIN{print 1**1}'
# awk 'BEGIN{print 2**3}'
# awk 'BEGIN{print 2/3}'
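
For reference, a sketch of what such expressions print. It uses ^ for exponentiation, since ** is an extension that not every awk accepts. Integer results are printed as integers; other values go through the default output format, which keeps six significant digits.

```shell
arith_demo() {
    awk 'BEGIN { print 1+1; print 7%3; print 2^3; print 2/3; print 7/2 }'
}
arith_demo
# 2
# 1
# 8
# 0.666667
# 3.5
```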

6, awk statistical case

1. Count the login shell types on the system

# awk -F: '{ shells[$NF]++ };END{for (i in shells) {print i,shells[i]} }' /etc/passwd

books[linux]++
books[linux]=1
shells[/bin/bash]++
shells[/sbin/nologin]++

/bin/bash 5
/sbin/nologin 6

shells[/bin/bash]++			a
shells[/sbin/nologin]++		b
shells[/sbin/shutdown]++	c

books[linux]++
books[php]++
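
The notes above describe an associative array used as a set of counters: the last field of each line is the key, and ++ creates the entry on first use and increments it afterwards. A runnable sketch on five made-up /etc/passwd-style lines follows; the output is piped through sort because the order of for (key in array) is not defined.

```shell
count_shells() {
    printf '%s\n' \
        'root:x:0:0:root:/root:/bin/bash' \
        'bin:x:1:1:bin:/bin:/sbin/nologin' \
        'daemon:x:2:2:daemon:/sbin:/sbin/nologin' \
        'stu1:x:500:500::/home/stu1:/bin/bash' \
        'lp:x:4:7:lp:/var/spool/lpd:/sbin/nologin' |
        awk -F: '{ shells[$NF]++ } END { for (s in shells) print s, shells[s] }' |
        sort
}
count_shells
# /bin/bash 2
# /sbin/nologin 3
```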

2. Count website connection states

# ss -antp|grep 80|awk '{states[$1]++};END{for(i in states){print i,states[i]}}'
TIME_WAIT 578
ESTABLISHED 1
LISTEN 1

# ss -an |grep :80 |awk '{states[$2]++};END{for(i in states){print i,states[i]}}'
LISTEN 1
ESTAB 5
TIME-WAIT 25

# ss -an |grep :80 |awk '{states[$2]++};END{for(i in states){print i,states[i]}}' |sort -k2 -rn
TIME-WAIT 18
ESTAB 8
LISTEN 1

3. Count the number of connections from each IP accessing the website

# netstat -ant |grep :80 |awk -F: '{ip_count[$8]++};END{for(i in ip_count){print i,ip_count[i]} }' |sort


# ss -an |grep :80 |awk -F":" '!/LISTEN/{ip_count[$(NF-1)]++};END{for(i in ip_count){print i,ip_count[i]}}' |sort -k2 -rn |head

4. Count the PV in the website log

Count the PV for one day in an Apache/Nginx log <log statistics>
# grep '27/Jul/2017' mysqladmin.cc-access_log |wc -l
14519

Count the number of visits from each IP for one day in an Apache/Nginx log <log statistics>
# grep '27/Jul/2017' mysqladmin.cc-access_log |awk '{ips[$1]++};END{for(i in ips){print i,ips[i]} }' |sort -k2 -rn |head

# grep '07/Aug/2017' access.log |awk '{ips[$1]++};END{for(i in ips){print i,ips[i]} }' |awk '$2>100' |sort -k2 -rn
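
The same two statistics can be tried without a real log file. The sketch below uses four made-up lines in the common access-log layout (client IP in field 1, timestamp in field 4) and, as a variation on the grep-based commands above, does the date filtering inside awk with index(). The date, the addresses and the function names are assumptions for illustration only.

```shell
access_log() {   # hypothetical log lines
    printf '%s\n' \
        '192.0.2.1 - - [27/Jul/2017:10:00:01 +0800] "GET / HTTP/1.1" 200 512' \
        '192.0.2.2 - - [27/Jul/2017:10:00:02 +0800] "GET /a HTTP/1.1" 200 128' \
        '192.0.2.1 - - [27/Jul/2017:10:00:03 +0800] "GET /b HTTP/1.1" 404 64' \
        '192.0.2.1 - - [28/Jul/2017:09:00:00 +0800] "GET / HTTP/1.1" 200 512'
}

# PV for one day: count the lines whose timestamp field contains the date.
pv_for_day() {
    access_log | awk -v day='27/Jul/2017' 'index($4, day) { n++ } END { print n + 0 }'
}
pv_for_day
# 3

# Visits per IP for the same day, busiest first.
hits_per_ip() {
    access_log |
        awk -v day='27/Jul/2017' 'index($4, day) { ips[$1]++ } END { for (ip in ips) print ip, ips[ip] }' |
        sort -k2 -rn
}
hits_per_ip
# 192.0.2.1 2
# 192.0.2.2 1
```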

Explanation of terms:

Page views (PV)
Term: PV = Page View
Description: the number of page views, used to measure how many pages a site's users visit. Opening the same page multiple times adds to the total; each time a user opens a page counts as one PV.

Visits (VV)
Term: VV = Visit View
Description: everything from a visitor arriving at your site to finally closing it counts as one visit. If the visitor does not open or refresh a page for 30 consecutive minutes, or closes the browser, the visit is considered ended.

Unique visitors (UV)
Term: UV = Unique Visitor
Description: the same visitor visiting your site multiple times in one day counts as only one UV.

Independent IPs (IP)
Term: IP = number of independent IPs
Description: the number of distinct IP addresses that visit the site within one day. No matter how many pages the same IP accesses, it counts as 1 independent IP

7, Homework after class

Assignment 1:
1. Write a script that automatically checks disk usage and sends an email to the relevant personnel when usage exceeds 90%
2. Write a script to monitor system memory and swap partition usage

Assignment 2:
Read an IP address and check whether it is valid:
It must follow the IP address format. The 1st and 4th octets cannot start with 0, and no octet may be greater than 255 or less than 0

8, Practical cases of enterprises

1. Task / background

The web server cluster has 9 machines in total, each running Apache. As the business keeps growing, every machine produces a large volume of access logs each day. The Apache access logs of the last three days must be kept on each web server, and logs older than three days must be moved to a dedicated log server for later analysis. How can each server keep only the last 3 days of logs?

2. Specific requirements

Each web server's logs go into a corresponding directory on the log server. For example: web1 -> web1.log (on the log server)
Each web server keeps the access logs of the last three days; logs older than three days are moved to the log server at 5:03 a.m. every day
If the scripted move fails, operations staff must clean up the logs manually through the menu on the jump server

3. Knowledge points involved

Basic syntax structure of shell
File synchronization rsync
File lookup command find
Scheduled tasks with crontab
Apache log rotation
Other
