Process text files for text analysis
REFERENCE:
- https://www.cnblogs.com/ginvip/p/6352157.html
- https://www.runoob.com/linux/linux-comm-awk.html
The Syntax of awk
awk [Option parameters] 'script' var=value file(s)
or
awk [Option parameters] -f scriptfile var=value file(s)
Option parameters
- -F fs | --field-separator fs
  Specify the input field separator. fs is a string or a regular expression, such as `-F:`.
- -v var=value | --assign var=value
  Assign a user-defined variable before the program runs. Example: awk -F: -v i=5 '{ print $3,$(i-2) }' /etc/passwd prints the third field twice on each line (0 0, 1 1, 2 2, ...).
- -f scriptfile | --file scriptfile
  Read the awk program from a script file. Example: awk -f xxx.awk /etc/passwd
- -mf nnn and -mr nnn
  Set internal limits to nnn: the -mf option limits the maximum number of fields and the -mr option limits the maximum record size. These two options are extensions of the Bell Labs awk and do not apply to standard awk.
- -W compat | --compat, -W traditional | --traditional
  Run awk in compatibility mode: gawk behaves exactly like standard awk and all gawk extensions are ignored.
- -W copyleft | --copyleft, -W copyright | --copyright
  Print the short copyright information.
- -W help | --help, -W usage | --usage
  Print all awk options with a brief description of each.
- -W lint | --lint
  Print warnings about constructs that are dubious or not portable to other awk implementations.
- -W lint-old | --lint-old
  Print warnings about constructs that are not portable to the original version of Unix awk.
- -W posix
  Turn on compatibility mode with the following additional restrictions: \x escape sequences are not recognized; the func synonym for the function keyword is not recognized; a newline is not treated as a field separator when fs is a single space; the operators ** and **= cannot replace ^ and ^=; fflush is unavailable.
- -W re-interval | --re-interval
  Allow interval expressions in regular expressions, such as x{m,n}.
- -W source program-text | --source program-text
  Use program-text as awk source code. It can be mixed with the -f option.
- -W version | --version
  Print version and bug-reporting information.
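As a small illustration of the two most common options, the sketch below combines -F and -v. The sample data and the variable name min_uid are assumptions made up for this example; the data mimics the /etc/passwd format so the result does not depend on the host.

```shell
# Hypothetical sample data in /etc/passwd format (not read from the real file).
sample='root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin'

# -F: splits each line on ":"; -v min_uid=1 sets a variable before the program starts.
result=$(printf '%s\n' "$sample" | awk -F: -v min_uid=1 '$3 >= min_uid { print $1, $3 }')
printf '%s\n' "$result"
```

Only the accounts whose third field (the UID) is at least min_uid are printed.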
AWK principle
Print only lines 20 to 30 of the passwd file
awk '{ if( NR>=20 && NR<=30 ){print $0} }' /etc/passwd
mysql:x:27:27:MySQL Server:/var/lib/mysql:/bin/false
nginx:x:998:996:nginx user:/var/cache/nginx:/sbin/nologin
From the passwd file, extract the user name root and its command interpreter /bin/bash, and output root /bin/bash
awk -F ':' '{ if( NR==1 )print $1" "$7 }' /etc/passwd
root /bin/bash
BEGIN/END module
Count the number of accounts in /etc/passwd
awk '{count++} END{print "[END] The number of users is ",count}' /etc/passwd
[END] The number of users is 21
count is a user-defined variable. It is not initialized here; an uninitialized variable evaluates to 0 in a numeric context, but the safest approach is to initialize it explicitly in BEGIN
awk 'BEGIN{count=0} {count++} END{print "[END] The number of users is ",count}' /etc/passwd
[END] The number of users is 21
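The order in which the three kinds of blocks run can be seen in a minimal sketch like the following. The sample records are made up for this example.

```shell
# Hypothetical sample records in /etc/passwd format.
users='root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin
nginx:x:998:996:nginx user:/var/cache/nginx:/sbin/nologin'

summary=$(printf '%s\n' "$users" | awk -F: '
BEGIN { count = 0; print "[BEGIN] scanning" }   # runs once, before any input is read
      { count++ }                                # runs once for every record
END   { print "[END] users:", count }            # runs once, after the last record
')
printf '%s\n' "$summary"
```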
AWK operator
Category | Operators |
---|---|
assignment | = += -= *= /= %= ^= **= |
logic | \|\| && |
regular | ~ (matches regular expression), !~ (does not match regular expression) |
relationship | < > <= >= != == |
arithmetic | + - * / % (remainder); ^ ** (exponentiation); ++ -- (increment, decrement) |
other | in (whether a key exists in an array); $ (field reference); ?: (ternary conditional) |
assignment
awk 'BEGIN{ a=5;a+=5;print a }'
10
logic
awk 'BEGIN{ a=0;print ( a>-1||a<0 , a>-1&&a<0 ) }'
1 0
regular
awk 'BEGIN{ str="192,168,10,222";if( str~10 ){print "true"} }'
true
echo | awk 'BEGIN{ str="192,168,10,222" } str~10 {print "true"}'
true
relationship
awk 'BEGIN{ a=0;print (a<0,a==0,a>0) }'
0 1 0
< and > can compare both strings and numeric values.
awk 'BEGIN{ a="11";if(a>=9){print "true"} }' # No output: a is a string, so the comparison uses string (character code) order
awk 'BEGIN{ a=11;if(a>=9){print "true"} }'
true
arithmetic
Arithmetic operators automatically convert their operands to numbers. A string with no leading numeric part becomes 0, and a string such as "2b" is converted from its leading digits (2)
awk 'BEGIN{ a="b";b="2b";print a,b,a++,b++ }'
b 2b 0 2
Other: ternary operator
awk 'BEGIN{ a="b";print a=="b"?1:0 }'
1
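The difference between string and numeric comparison is easy to trip over, so here is a minimal sketch of one way to force a numeric comparison: adding 0 to the operand. The variable names r1 and r2 are only for illustration.

```shell
cmp=$(awk 'BEGIN {
    a = "11"                                   # a holds a string constant
    r1 = (a >= 9)     ? "true" : "false"       # string comparison: "11" sorts before "9"
    r2 = (a + 0 >= 9) ? "true" : "false"       # a + 0 is a number, so the comparison is numeric
    print r1, r2
}')
printf '%s\n' "$cmp"
```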
AWK built-in variables
Variable | Name | Description | Default |
---|---|---|---|
$0 | | Current record | |
$1~$n | | The nth field of the current record | |
FS | Field Separator | Input field separator | Space |
RS | Record Separator | Input record separator | \n |
NF | Number of Fields | Number of fields in the current record (column count) | |
NR | Number of Records | Number of records read so far (current line number) | |
OFS | Output Field Separator | Output field separator | Space |
ORS | Output Record Separator | Output record separator | \n |
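A short sketch of how these variables interact, using made-up input: FS controls how the input is split, while OFS and ORS only affect what print writes.

```shell
# Two hypothetical records with ":" as the input field separator.
vars=$(printf 'a:b:c\nd:e\n' |
    awk 'BEGIN { FS = ":"; OFS = "-"; ORS = ";\n" } { print NR, NF, $1, $NF }')
printf '%s\n' "$vars"
```

Each output line shows the record number, the field count, the first field, and the last field ($NF), joined by OFS and terminated by ORS.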
FS field separator
Tab
awk 'BEGIN{ FS="\t+" }{ print $0 }' xxx.md # One or more Tab delimiters
Space
awk -F '[[:space:]]+' '{ print $0 }' xxx.md # One or more whitespace characters
Multiple separators
awk -F '[ :/]' 'BEGIN{ OFS="\t" }{ print $2,$3,$9 }' /etc/passwd
x 0 bin
x 1 sbin
x 2 sbin
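RS record separator ⭐
Setting RS to the empty string turns on paragraph mode: records are separated by blank lines and a newline always acts as a field separator. /etc/passwd contains no blank lines, so the whole file is a single record
awk 'BEGIN{ RS="" }{ print $0 }' /etc/passwd
root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin
·················
mysql:x:27:27:MySQL Server:/var/lib/mysql:/bin/false
nginx:x:998:996:nginx user:/var/cache/nginx:/sbin/nologin
awk 'BEGIN{ RS="" }{ print $1 }' /etc/passwd
root:x:0:0:root:/root:/bin/bash
awk 'BEGIN{ RS="" }{ print $2 }' /etc/passwd
bin:x:1:1:bin:/bin:/sbin/nologin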
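Paragraph mode is more useful on input that really contains blank lines. The following is a minimal sketch on made-up data, with FS set to a newline so that each line of a paragraph becomes one field.

```shell
# Two hypothetical blank-line-separated paragraphs.
para=$(printf 'name: alice\nshell: bash\n\nname: bob\nshell: zsh\n' |
    awk 'BEGIN { RS = ""; FS = "\n" } { print NR ": " $1 " | " $2 }')
printf '%s\n' "$para"
```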
NF number of fields
awk -F "/" 'NF==5{print $0}' /etc/passwd # Print the lines that have 5 fields when split on /
adm:x:3:4:adm:/var/adm:/sbin/nologin
games:x:12:100:games:/usr/games:/sbin/nologin
ftp:x:14:50:FTP User:/var/ftp:/sbin/nologin
ntp:x:38:38::/etc/ntp:/sbin/nologin
NR record number
awk 'NR==1{print $0}' /etc/passwd # NR==1 selects the first line
root:x:0:0:root:/root:/bin/bash
OFS output field separator
Omitted.
ORS output record separator
Omitted.
IGNORECASE ignores case (gawk extension)
awk 'BEGIN{ IGNORECASE=1 } /user/' /etc/passwd
ftp:x:14:50:FTP User:/var/ftp:/sbin/nologin
polkitd:x:999:998:User for polkitd:/:/sbin/nologin
nginx:x:998:996:nginx user:/var/cache/nginx:/sbin/nologin
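Because IGNORECASE is specific to gawk, a portable alternative is to lowercase the record before matching. This is a sketch of that swapped-in technique on made-up input, not the method used above.

```shell
# tolower($0) folds the record to lowercase for the match only; the original line is printed.
matches=$(printf 'FTP User\nnginx user\nroot\n' | awk 'tolower($0) ~ /user/')
printf '%s\n' "$matches"
```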
AWK regular expression
Character | Function | Example | Interpretation |
---|---|---|---|
^ | Start-of-line anchor | /^root/ | Match lines starting with root |
$ | End-of-line anchor | /root$/ | Match lines ending with root |
. | Match any single character | /r..t/ | Match four-character strings that start with r and end with t |
* | Match the preceding character zero or more times | /roo*t/ | Match rot, root, rooot, ... |
+ | Match the preceding character one or more times | /ro+t/ | Match rot, root, rooot, ... |
? | Match the preceding character zero or one time | /r?oot/ | Match oot or root |
[] | Match any one character listed in [] | /^[abc]/ | Match lines starting with a, b, or c |
[^] | Match any one character not listed in [^] | /^[^ab]/ | Match lines that do not start with a or b |
() | Group a subexpression | /(root)+/ | Match one or more consecutive occurrences of root |
\| | Alternation (or) | /(root)\|(user)/ | Match lines containing root or user |
\ | Escape character | /a\// | Match a/ |
~ | Match operator | $1~/root/ | Match lines whose first field contains root |
!~ | Non-match operator | $1!~/root/ | Match lines whose first field does not contain root |
x{m} | x repeated exactly m times | /[rot]{4}/ | Match four consecutive characters, each one of r, o, t |
x{m,} | x repeated m or more times | | |
x{m,n} | x repeated m to n times | | |
Regular expression
awk '/REG/{ACTION}' FILE # /REG/ is a regular expression; records ($0) that match it are passed to ACTION
awk '/root/{print $0}' /etc/passwd # Match lines containing root
Boolean expression
awk 'BOOLEAN{ACTION}' FILE # ACTION runs only when BOOLEAN evaluates to true
awk -F: '$1=="root"{print $0}' /etc/passwd
root:x:0:0:root:/root:/bin/bash
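The pattern forms can be compared side by side. The sketch below runs three patterns over the same made-up sample; the variable names are assumptions for this example.

```shell
# Hypothetical sample data in /etc/passwd format.
data='root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin
nginx:x:998:996:nginx user:/var/cache/nginx:/sbin/nologin'

# Regular expression tested against the whole record ($0).
by_regex=$(printf '%s\n' "$data" | awk -F: '/nologin$/ { print $1 }')
# Regular expression tested against a single field with ~.
by_field=$(printf '%s\n' "$data" | awk -F: '$1 ~ /^(root|bin)$/ { print $1 }')
# Boolean expression combining a numeric test and a non-match.
by_bool=$(printf '%s\n' "$data" | awk -F: '$3 >= 100 && $7 !~ /bash/ { print $1 }')

printf '%s\n' "$by_regex" "$by_field" "$by_bool"
```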
Conditions and loops
if
if($1=="root"){ print $0 }
while
do while
count=1; do{ print $1 } while( count !=1 )
for
for( i=1;i<10;i++){ print $1 }
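The fragments above are not complete programs, so here is a minimal runnable sketch that uses all three loop forms on one made-up input line. The counter names i and n are only for illustration.

```shell
loops=$(echo 'a b c' | awk '{
    for (i = 1; i <= NF; i++) printf "%s ", $i      # for: walk the fields forward
    i = NF
    while (i >= 1) { printf "%s ", $i; i-- }        # while: walk the fields backward
    n = 0
    do { n++ } while (n < 0)                         # do-while: the body runs once before the test
    print "n=" n
}')
printf '%s\n' "$loops"
```

The do-while body executes even though its condition is false from the start, which is the difference from a plain while loop.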
array
Use awk to summarize the server's connection states
netstat -an|awk '/^tcp/{++s[$NF]}END{for(a in s)print a,s[a]}'
ESTABLISHED 1
LISTEN 20
Web log traffic statistics: for each requested page or image, print the number of accesses and the total size of its responses, then print the total traffic
awk '{a[$7]+=$10;++b[$7];total+=$10}END{for(x in a)print b[x],x,a[x]|"sort -rn -k1";print "total size is :"total}' /app/log/access_log
total size is :172230
21 /icons/poweredby.png 83076
14 / 70546
8 /icons/apache_pb.gif 18608
a[$7]+=$10 builds an array whose subscript is column 7 (the requested path) and accumulates column 10 (the response size), so a holds the total size for each path. ++b[$7] counts the accesses to each path. The trick in the for loop is that arrays a and b share the same subscripts, so a single for statement is enough. The total line appears first because the piped sort only writes its output once awk closes the pipe on exit.
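The same aggregation can be tried on a few made-up log lines. This sketch assumes the request path is in column 7 and the response size in column 10, as in the example above; the array names bytes and hits and the explicit close() call are additions for this illustration. Closing the pipe makes sort finish before the total is printed.

```shell
# Three hypothetical access-log lines (path in column 7, size in column 10).
log='10.0.0.1 - - [01/Jan/2024:00:00:00 +0000] "GET /index.html HTTP/1.1" 200 100
10.0.0.2 - - [01/Jan/2024:00:00:01 +0000] "GET /logo.png HTTP/1.1" 200 50
10.0.0.1 - - [01/Jan/2024:00:00:02 +0000] "GET /index.html HTTP/1.1" 200 200'

report=$(printf '%s\n' "$log" | awk '
{ bytes[$7] += $10; hits[$7]++; total += $10 }     # accumulate per-path size and hit count
END {
    for (p in bytes) print hits[p], p, bytes[p] | "sort -rn"
    close("sort -rn")                               # wait for sort so its output comes first
    print "total size is: " total
}')
printf '%s\n' "$report"
```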
String function
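Function | Description |
---|---|
gsub( Ere,Repl,[In] ) | Replace every match of Ere in In (default $0) with Repl; returns the number of replacements |
sub( Ere,Repl,[In] ) | Replace only the first match of Ere in In (default $0) with Repl |
index( String1,String2 ) | Position of the first occurrence of String2 in String1, or 0 if not found |
length[( String )] | String length |
blength[( String )] | String length in bytes |
substr( String,M,[N] ) | Substring of String starting at position M, at most N characters long |
match( String,Ere ) | Position of the first match of Ere in String, or 0 if there is none; sets RSTART and RLENGTH |
split( String,A,[Ere] ) | Split String into array A on separator Ere (default FS); returns the number of elements |
tolower( String ) | Convert String to lowercase |
toupper( String ) | Convert String to uppercase |
sprintf( Format,Expr,Expr,... ) | Return the Expr values formatted according to Format |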
gsub replacement
awk 'BEGIN{ str="abc123abc";gsub(/[0-9]+/,"!",str);print str }'
abc!abc
Find every substring of str that matches the regular expression, replace it with !, and store the result back in str
index lookup
awk 'BEGIN{ str="abc123abc";print index(str,"abc")?"true":"false" }' # Non-zero if found
true
match lookup
awk 'BEGIN{ str="abc123abc";print match(str,/[0-9]+/) }'
4
substr interception
awk 'BEGIN{ str="abc123abc";print substr(str,4,6) }'
123abc
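The functions from the table that have no example above can be sketched in one short program. The variable names s, t, n, and parts are only for illustration.

```shell
strs=$(awk 'BEGIN {
    s = "abc123abc"
    t = s; sub(/abc/, "X", t)                 # sub replaces only the first match
    n = split("a:b:c", parts, ":")            # split returns the number of pieces
    print t, n, parts[2], toupper(parts[3]), length(s), sprintf("%03d", 7)
}')
printf '%s\n' "$strs"
```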
Exercise
Format output
awk -F: '{printf "%-8s %-10s\n",$1,$6 }' /etc/passwd
root     /root
bin      /bin
daemon   /sbin
Operator: filter rows with the third column less than 3
awk -F: '$3<3' /etc/passwd
root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
Calculate file size
ls -l | awk '{ sum+=$5 } END{ print sum }'
1535
Find lines longer than 60 characters in the file
awk 'length>60' /etc/passwd
systemd-network:x:192:192:systemd Network Management:/:/sbin/nologin
sshd:x:74:74:Privilege-separated SSH:/var/empty/sshd:/sbin/nologin
Print the 9x9 multiplication table
seq 9 | sed 'H;g' | awk -v RS='' '{ for(i=1;i<=NF;i++)printf("%dx%d=%d%s",i,NR,i*NR,i==NR?"\n":"\t") }'
1x1=1
1x2=2 2x2=4
1x3=3 2x3=6 3x3=9
1x4=4 2x4=8 3x4=12 4x4=16
1x5=5 2x5=10 3x5=15 4x5=20 5x5=25
1x6=6 2x6=12 3x6=18 4x6=24 5x6=30 6x6=36
1x7=7 2x7=14 3x7=21 4x7=28 5x7=35 6x7=42 7x7=49
1x8=8 2x8=16 3x8=24 4x8=32 5x8=40 6x8=48 7x8=56 8x8=64
1x9=9 2x9=18 3x9=27 4x9=36 5x9=45 6x9=54 7x9=63 8x9=72 9x9=81
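The table can also be produced with nested for loops in a BEGIN block, without seq and sed. This is a sketch of that alternative, with the size passed in through a hypothetical variable n (set to 3 here to keep the output short).

```shell
table=$(awk -v n=3 'BEGIN {
    for (row = 1; row <= n; row++)
        for (col = 1; col <= row; col++)
            # a tab separates the cells of a row; a newline ends the row
            printf "%dx%d=%d%s", col, row, col * row, (col == row ? "\n" : "\t")
}')
printf '%s\n' "$table"
```

Running it with -v n=9 prints the full table shown above.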