Case 15: Format the output of an XML file

Posted by v4g on Sat, 17 Aug 2019 05:25:56 +0200

Most of us have worked with XML files at some point. The format is very regular, but the sheer number of tags (<>) makes it hard to read at a glance. Take the following configuration:

<configuration>
     <artifactItems>
          <artifactItem>
              <groupId>zzz</groupId>
              <artifactld>aaa</artifactld>
          </artifactItem>
          <artifactItem>
              <groupId>xxx</groupId>
              <artifactld>yyy</artifactld>
          </artifactItem>
     </artifactItems>
</configuration>

The requirement for this case is to extract groupId and artifactld from the XML text above and output them in the following format:

artifactItem:groupId:zzz
artifactItem:artifactld:aaa
artifactItem:groupId:xxx
artifactItem:artifactld:yyy


Point One: Common sense about XML

XML stands for Extensible Markup Language. Like HTML, it is a markup language. XML is used primarily to carry and transmit data rather than to present it, so it is somewhat difficult to read.


The configuration files of many services are XML documents that define the corresponding settings, like the example text in this case. XML's main job is to store data. Because the data is stored as plain text, it does not depend on any particular software or hardware, which makes it easier to create data that different applications can share. And because the format is fixed and can be read on any operating system, such as Windows, Linux, or macOS, XML is highly portable.


One thing to keep in mind is that XML is a passive markup language: unlike HTML, which a browser parses and renders as a web page, XML does nothing by itself. It exists only to structure, store, and transfer information.


Point Two: Extract the lines between two keywords in a document

The requirement is to print the lines that lie between a line containing abc and a line containing 123, assuming abc appears above 123. With sed, a single command does this:

# sed -n '/abc/,/123/p' 1.txt

The output still includes the abc and 123 lines themselves, but they are easy to remove:

# sed -n '/abc/,/123/p' 1.txt |sed '/abc/d;/123/d'
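
The two sed processes can also be folded into one. The following is only a minimal sketch, assuming GNU or BSD sed. The sample data and the variable name between are made up for illustration and are not the article's 1.txt:

```shell
#!/bin/bash
# Build a small sample file (illustrative data only)
tmp=$(mktemp)
printf '%s\n' 'head' 'xx abc xx' 'keep1' 'keep2' '123 yy' 'tail' > "$tmp"

# Inside the abc..123 range, delete the two boundary lines and print the rest
between=$(sed -n '/abc/,/123/{/abc/d;/123/d;p;}' "$tmp")
echo "$between"

rm -f "$tmp"
```

As with the two-step pipeline, any line inside the range that itself contains abc or 123 is dropped as well.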

If the text contains more than one abc/123 pair, the lines of every pair are printed together, with nothing to tell the pairs apart. Here is a clumsier, brute-force way of doing it, which is good practice for your logical thinking.

mysed.sh

#!/bin/bash
#Get the line numbers of the lines containing abc or 123 first
grep -En 'abc|123' 1.txt |awk -F ':' '{print $1}' > /tmp/line_number.txt

#Count the lines that contain abc or 123
n=$(wc -l /tmp/line_number.txt|awk '{print $1}')

#Calculate the number of abc/123 pairs
n2=$((n / 2))

for i in $(seq 1 "$n2")
do
    #Each iteration handles two line numbers: 1 and 2 first, then 3 and 4, and so on
    m1=$((i * 2 - 1))
    m2=$((i * 2))

    #Get the line numbers of the abc line and the 123 line for this pair
    nu1=$(sed -n "${m1}p" /tmp/line_number.txt)
    nu2=$(sed -n "${m2}p" /tmp/line_number.txt)

    #Line number of the line just below abc
    nu3=$((nu1 + 1))

    #Line number of the line just above 123
    nu4=$((nu2 - 1))

    #Print the lines between abc and 123 with sed
    sed -n "${nu3},${nu4}p" 1.txt

    #Print a separator so the pairs are easy to tell apart
    echo "============="
done

Use the following 1.txt as test input:

alskdfkjlasldkjfabalskdjflkajsd
asldkfjjk232k3jlk2
alskk2lklkkabclaksdj
skjjfk23kjalf09wlkjlah lkaswlekjl9
aksjdf
123asd232323
aaaaaaaaaa
222222222222222222
abcabc12121212
fa2klj
slkj32k3j
22233232123
bbbbbbb
ddddddddddd

Processing with sed results in:

# sed -n '/abc/,/123/p' 1.txt |sed '/abc/d;/123/d'
skjjfk23kjalf09wlkjlah lkaswlekjl9
aksjdf
fa2klj
slkj32k3j

With mysed.sh, the result is:

# bash mysed.sh
skjjfk23kjalf09wlkjlah lkaswlekjl9
aksjdf
=============
fa2klj
slkj32k3j
=============
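
The same separated output can probably be produced in a single pass with awk, using a flag instead of line-number arithmetic. This is a hedged sketch, not part of the original approach. The helper name print_between and the flag name in_block are invented for illustration, and the sample below simply recreates the 1.txt shown above in a temporary file:

```shell
#!/bin/bash
# print_between is a hypothetical helper name, not part of mysed.sh.
print_between() {
    awk '
        # An abc line opens a block; skip the line itself
        /abc/ { in_block = 1; next }
        # A 123 line closes the block; print the separator if one was open
        /123/ { if (in_block) print "============="; in_block = 0; next }
        # Print every line seen while a block is open
        in_block
    ' "$1"
}

# Recreate the test file from the article in a temporary location
sample=$(mktemp)
cat > "$sample" <<'EOF'
alskdfkjlasldkjfabalskdjflkajsd
asldkfjjk232k3jlk2
alskk2lklkkabclaksdj
skjjfk23kjalf09wlkjlah lkaswlekjl9
aksjdf
123asd232323
aaaaaaaaaa
222222222222222222
abcabc12121212
fa2klj
slkj32k3j
22233232123
bbbbbbb
ddddddddddd
EOF

result=$(print_between "$sample")
echo "$result"
rm -f "$sample"
```

Unlike mysed.sh, this version does not need the abc and 123 lines to alternate strictly, because a 123 line that appears outside a block is simply ignored.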


Case analysis

1) First, locate the data between <artifactItem> and </artifactItem>; this is the part that needs to be processed.

2) Find the line numbers of the lines containing <artifactItem> and </artifactItem> in the XML document, then use sed to cut out the lines between them.

3) Process the extracted lines with sed and awk to pull out each keyword and its value.
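
Before turning to line numbers, the three steps can be tried out as a plain sed pipeline. This is a rough sketch under a few assumptions: sed supports -E (GNU and BSD sed do), every value sits on one line together with its tags, and the variable names xml and formatted are illustrative only:

```shell
#!/bin/bash
# Recreate the sample configuration in a temporary file
xml=$(mktemp)
cat > "$xml" <<'EOF'
<configuration>
     <artifactItems>
          <artifactItem>
              <groupId>zzz</groupId>
              <artifactld>aaa</artifactld>
          </artifactItem>
          <artifactItem>
              <groupId>xxx</groupId>
              <artifactld>yyy</artifactld>
          </artifactItem>
     </artifactItems>
</configuration>
EOF

# Steps 1 and 2: keep the <artifactItem>..</artifactItem> ranges,
#                then drop the opening and closing tag lines themselves
# Step 3: rewrite "<key>value</key>" as "artifactItem:key:value"
formatted=$(sed -n '/<artifactItem>/,/<\/artifactItem>/p' "$xml" \
    | sed '/artifactItem>/d' \
    | sed -E 's/^[[:space:]]*<([^>]+)>([^<]*)<.*$/artifactItem:\1:\2/')
echo "$formatted"

rm -f "$xml"
```

The pattern <artifactItem> does not match the enclosing <artifactItems> line, because the tag name there is followed by s rather than >.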


Reference script for this case

#!/bin/bash
#Prints the XML content in the required format; this script is tailored to this case and is not general-purpose
#Author:
#Date:

#Assume the XML document to be processed is named test.xml
#Get the line numbers of the <artifactItem> and </artifactItem> lines
grep -n 'artifactItem>' test.xml |awk -F ':' '{print $1}' > /tmp/line_number.txt

#Count how many <artifactItem> and </artifactItem> lines there are in total
n=$(wc -l /tmp/line_number.txt|awk '{print $1}')

#Define a function that gets the keywords and their values
get_value(){
    #$1 and $2 are the function's two parameters: the line number of the line below <artifactItem> and the line number of the line above </artifactItem> (both are computed further down)
    #Extract the lines between <artifactItem> and </artifactItem>, get each keyword (such as groupId) and its value, and write them to /tmp/value.txt
    sed -n "$1,$2"p test.xml|awk -F '<' '{print $2}'|awk -F '>' '{print $1,$2}' > /tmp/value.txt

    #Loop over every line of /tmp/value.txt
    while read -r x y
    do
        #x is a keyword, such as groupId
        #y is the value of the keyword
        echo "artifactItem:$x:$y"
    done < /tmp/value.txt
}

#The entries in /tmp/line_number.txt come in pairs, so calculate the number of pairs, n2
n2=$((n / 2))

#Print the keywords and their values for each pair
for j in $(seq 1 "$n2")
do
    #Each iteration handles two line numbers: 1 and 2 first, then 3 and 4, and so on
    m1=$((j * 2 - 1))
    m2=$((j * 2))

    #Get the line numbers of <artifactItem> and </artifactItem> for this iteration
    nu1=$(sed -n "${m1}p" /tmp/line_number.txt)
    nu2=$(sed -n "${m2}p" /tmp/line_number.txt)

    #Line number of the line below <artifactItem>
    nu3=$((nu1 + 1))

    #Line number of the line above </artifactItem>
    nu4=$((nu2 - 1))

    get_value "$nu3" "$nu4"
done
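
For comparison, a single awk pass can likely produce the same output without temporary files. This is only a sketch and not the script above. The function name format_items and the flag in_item are hypothetical, and it assumes each key and value share one line, as in the sample:

```shell
#!/bin/bash
# format_items is a hypothetical helper name; it takes the XML file as $1.
format_items() {
    # Split each line on "<" or ">", so "<groupId>zzz</groupId>" gives
    # $2 = "groupId" and $3 = "zzz"
    awk -F '[<>]' '
        /<artifactItem>/   { in_item = 1; next }
        /<\/artifactItem>/ { in_item = 0; next }
        in_item && NF >= 3 { print "artifactItem:" $2 ":" $3 }
    ' "$1"
}

# Small illustrative input with a single item
xml=$(mktemp)
printf '%s\n' '<artifactItem>' '  <groupId>zzz</groupId>' \
    '  <artifactld>aaa</artifactld>' '</artifactItem>' > "$xml"

items=$(format_items "$xml")
echo "$items"
rm -f "$xml"
```

Both this sketch and the reference script rely on line-oriented text matching, so they only suit XML laid out like the sample.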

