Most of us have worked with XML files at some point. The format is very regular, but it is hard to read at a glance because of the many tags (<>), as in the following configuration:
<configuration>
    <artifactItems>
        <artifactItem>
            <groupId>zzz</groupId>
            <artifactld>aaa</artifactld>
        </artifactItem>
        <artifactItem>
            <groupId>xxx</groupId>
            <artifactld>yyy</artifactld>
        </artifactItem>
    </artifactItems>
</configuration>
The task in this case is to extract each groupId and artifactld from the XML text above and print them in the following format:
artifactItem:groupId:zzz
artifactItem:artifactld:aaa
artifactItem:groupId:xxx
artifactItem:artifactld:yyy
Point 1: XML basics
XML stands for Extensible Markup Language. Like HTML, it is a markup language, but XML is used primarily to carry and transmit data rather than to present it, so it is somewhat difficult to read.
The configuration files of many services are XML documents that define the corresponding settings, like the sample text in this case. XML's main job is to store data. Because the data is stored as plain text, the storage format is independent of any particular software or hardware, which makes it easier to create data that different applications can share. The format is fixed and can be read on any operating system, such as Windows, Linux, or macOS, so it is highly compatible.
One thing to know is that XML is a passive markup language: unlike HTML, which a browser parses and renders as a web page, XML does nothing by itself. It exists only to structure, store, and transfer information.
Point 2: Extract the lines between two keywords in a document
The requirement is to print the part of the text that lies between abc and 123, assuming that abc appears above 123. With sed, a single command does it:
# sed -n '/abc/,/123/p' 1.txt
The output still includes the abc and 123 lines themselves, but they are easy to remove:
# sed -n '/abc/,/123/p' 1.txt |sed '/abc/d;/123/d'
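As a side note, the filtering can also be folded into one sed invocation by grouping commands inside the range. The sketch below is only an illustration: the sample file contents and the function name between are made up for the demo, and it assumes a sed that accepts the { ... } grouping syntax shown (GNU sed and BSD sed both do).

```shell
#!/bin/bash
# Build a small sample file (hypothetical content, for illustration only).
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
cat > "$sample" <<'EOF'
header
start abc here
keep 1
keep 2
end 123 here
footer
EOF

# between FILE: print only the lines strictly between an abc line and the next 123 line.
between() {
    # Inside the /abc/,/123/ range, drop the two marker lines and print the rest.
    sed -n '/abc/,/123/{/abc/d;/123/d;p;}' "$1"
}

between "$sample"
```

With the sample above, only the two "keep" lines should be printed.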
If abc and 123 occur more than once in the text, every matching block is printed, with nothing to separate one block from the next. Here is a clumsier, more laborious way that is good practice for your logical thinking.
mysed.sh
#!/bin/bash
# Get the line numbers of the abc and 123 lines first
grep -En 'abc|123' 1.txt | awk -F ':' '{print $1}' > /tmp/line_number.txt
# Count the lines that contain abc or 123
n=$(wc -l /tmp/line_number.txt | awk '{print $1}')
# Calculate the number of abc/123 pairs
n2=$((n / 2))
for i in $(seq 1 "$n2")
do
    # Each iteration handles two entries: 1 and 2 the first time, 3 and 4 the second, and so on
    m1=$((i * 2 - 1))
    m2=$((i * 2))
    # Get the abc and 123 line numbers for this iteration
    nu1=$(sed -n "${m1}p" /tmp/line_number.txt)
    nu2=$(sed -n "${m2}p" /tmp/line_number.txt)
    # Line number of the line just below abc
    nu3=$((nu1 + 1))
    # Line number of the line just above 123
    nu4=$((nu2 - 1))
    # Print the lines between abc and 123 with sed
    sed -n "${nu3},${nu4}p" 1.txt
    # Add a separator line so the blocks are easy to tell apart
    echo "============="
done
Use the following text, 1.txt, for testing:
alskdfkjlasldkjfabalskdjflkajsd
asldkfjjk232k3jlk2
alskk2lklkkabclaksdj
skjjfk23kjalf09wlkjlah
lkaswlekjl9
aksjdf
123asd232323
aaaaaaaaaa
222222222222222222
abcabc12121212
fa2klj
slkj32k3j
22233232123
bbbbbbb
ddddddddddd
Processing with sed results in:
# sed -n '/abc/,/123/p' 1.txt |sed '/abc/d;/123/d'
skjjfk23kjalf09wlkjlah
lkaswlekjl9
aksjdf
fa2klj
slkj32k3j
With mysed.sh, the result is:
# sh mysed.sh
skjjfk23kjalf09wlkjlah
lkaswlekjl9
aksjdf
=============
fa2klj
slkj32k3j
=============
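For comparison, the same separated output can be produced in one pass with awk, using a flag that is switched on at an abc line and off at the next 123 line. This is a sketch rather than a replacement for the exercise above; the function name print_blocks and the temporary copy of the test text are assumptions made so the example runs on its own.

```shell
#!/bin/bash
# Recreate the test text in a temporary file so the demo is self-contained.
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
cat > "$sample" <<'EOF'
alskdfkjlasldkjfabalskdjflkajsd
asldkfjjk232k3jlk2
alskk2lklkkabclaksdj
skjjfk23kjalf09wlkjlah
lkaswlekjl9
aksjdf
123asd232323
aaaaaaaaaa
222222222222222222
abcabc12121212
fa2klj
slkj32k3j
22233232123
bbbbbbb
ddddddddddd
EOF

# print_blocks FILE: print the lines between each abc line and the next 123 line,
# followed by a separator line after every block.
print_blocks() {
    awk '
        # An abc line opens a block; the line itself is skipped.
        /abc/           { inside = 1; next }
        # A 123 line closes an open block; print the separator and skip the line.
        inside && /123/ { inside = 0; print "============="; next }
        # Any other line is printed only while a block is open.
        inside          { print }
    ' "$1"
}

print_blocks "$sample"
```

One design difference worth noting: the flag approach reads the file once and needs no /tmp/line_number.txt, but it treats every abc line as a block start, whereas mysed.sh pairs the matching lines strictly two at a time.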
Case analysis
1) First, locate the data segment between <artifactItem> and </artifactItem>; this is the part to analyze.
2) Find the line numbers of <artifactItem> and </artifactItem> in the XML document, then cut out the lines between them with sed.
3) Process the extracted segment with sed and awk to pull out each keyword and its value.
Reference script for this case
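Step 3 can be tried on a single line before the whole script is assembled. The sketch below assumes each element sits on its own line in the form <key>value</key>, as in the sample configuration; the function name parse_line is hypothetical. It treats both angle brackets as awk field separators, so the keyword and the value land in the second and third fields.

```shell
#!/bin/bash
# parse_line: read element lines on stdin and print "keyword value" pairs.
# Lines without a <key>value</key> shape (fewer than 4 fields) are ignored.
parse_line() {
    awk -F '[<>]' 'NF >= 4 { print $2, $3 }'
}

echo '    <groupId>zzz</groupId>' | parse_line
```

Lines that hold only a single tag, such as <artifactItem>, split into fewer fields and are skipped by the NF test.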
#!/bin/bash
# Output XML content in the required format. This script is tailored to this case and is not general-purpose.
# Author:
# Date:

# Assume the XML document to be processed is named test.xml
# Get the line numbers of <artifactItem> and </artifactItem>
grep -n 'artifactItem>' test.xml | awk -F ':' '{print $1}' > /tmp/line_number.txt
# Count how many line numbers were found
n=$(wc -l /tmp/line_number.txt | awk '{print $1}')

# Define a function that gets the keywords and their values
get_value() {
    # $1 and $2 are the function's two parameters: the line number just below <artifactItem>
    # and the line number just above </artifactItem> (both are computed further down)
    # Extract the content between <artifactItem> and </artifactItem>, then get each keyword
    # (such as groupId) and its value, and write them to /tmp/value.txt
    sed -n "${1},${2}p" test.xml | awk -F '<' '{print $2}' | awk -F '>' '{print $1,$2}' > /tmp/value.txt
    # Traverse /tmp/value.txt line by line
    while read -r line
    do
        # x is the keyword, such as groupId
        # y is the value of the keyword
        x=$(echo "$line" | awk '{print $1}')
        y=$(echo "$line" | awk '{print $2}')
        echo "artifactItem:$x:$y"
    done < /tmp/value.txt
}

# The entries in /tmp/line_number.txt come in pairs; n2 is the number of pairs
n2=$((n / 2))
# Print the keywords and their values for each pair
for j in $(seq 1 "$n2")
do
    # Each iteration handles two entries: 1 and 2 the first time, 3 and 4 the second, and so on
    m1=$((j * 2 - 1))
    m2=$((j * 2))
    # Get the line numbers of <artifactItem> and </artifactItem> for this iteration
    nu1=$(sed -n "${m1}p" /tmp/line_number.txt)
    nu2=$(sed -n "${m2}p" /tmp/line_number.txt)
    # Line number of the line just below <artifactItem>
    nu3=$((nu1 + 1))
    # Line number of the line just above </artifactItem>
    nu4=$((nu2 - 1))
    get_value "$nu3" "$nu4"
done
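As a cross-check on the reference script, the same output can be sketched in a single awk pass that needs no temporary line-number files. This is an alternative under stated assumptions, not the script above rewritten: it assumes one tag per line as in the sample configuration, and the function name list_items and the temporary file standing in for test.xml are made up for the demo.

```shell
#!/bin/bash
# A temporary stand-in for test.xml, using the sample configuration from this case.
xml=$(mktemp)
trap 'rm -f "$xml"' EXIT
cat > "$xml" <<'EOF'
<configuration>
    <artifactItems>
        <artifactItem>
            <groupId>zzz</groupId>
            <artifactld>aaa</artifactld>
        </artifactItem>
        <artifactItem>
            <groupId>xxx</groupId>
            <artifactld>yyy</artifactld>
        </artifactItem>
    </artifactItems>
</configuration>
EOF

# list_items FILE: print artifactItem:keyword:value for every element
# found between <artifactItem> and </artifactItem>.
list_items() {
    awk -F '[<>]' '
        # Opening tag: start collecting. It does not match <artifactItems>.
        /<artifactItem>/   { inside = 1; next }
        # Closing tag: stop collecting.
        /<\/artifactItem>/ { inside = 0; next }
        # Inside an item, field 2 is the keyword and field 3 is its value.
        inside && NF >= 4  { print "artifactItem:" $2 ":" $3 }
    ' "$1"
}

list_items "$xml"
```

For anything beyond simple one-tag-per-line files like this one, a real XML parser is the safer choice, since line-oriented tools cannot handle attributes, nesting on one line, or values that span several lines.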