Three common ways to parse XML in Python

Posted by westair on Fri, 11 Feb 2022 03:05:52 +0100

XML (eXtensible Markup Language) is designed to transmit and store data. It has become the foundation of many newer technologies and is applied differently across fields. It emerged naturally as the web matured. It keeps the core features of SGML and the simplicity of HTML, while adding qualities of its own, such as clarity and a well-defined structure.

There are three common ways to parse XML in Python:

The first is the xml.dom.* modules, which implement the W3C DOM API. If you need to work with the DOM API, these modules are a good fit. Note that the xml.dom package contains several modules, so be careful to distinguish between them;
The second is the xml.sax.* modules, which implement the SAX API. They trade convenience for speed and lower memory use. SAX is an event-based API, which means it can process very large documents "on the fly" without loading them completely into memory;
The third is the xml.etree.ElementTree module (ET for short), which provides a lightweight Python API. Compared with DOM, ET is much faster and has many pleasant APIs. Compared with SAX, ET.iterparse also offers an "on the fly" processing mode, so there is no need to load the whole document into memory. ET's average performance is similar to SAX's, but its API is higher-level and easier to use.
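
To make the comparison concrete, here is a minimal sketch that reads the same attribute with each of the three APIs. The inline XML string and the names XML_TEXT and NameCollector are made up for illustration and are not part of the original example.

```python
import xml.dom.minidom
import xml.sax
import xml.etree.ElementTree as ET

# Illustrative input (an assumption, shortened from country.xml)
XML_TEXT = '<data><country name="Singapore"><rank>4</rank></country></data>'

# 1) DOM: the whole document becomes a tree of node objects.
dom = xml.dom.minidom.parseString(XML_TEXT)
dom_names = [c.getAttribute("name") for c in dom.getElementsByTagName("country")]

# 2) SAX: a handler receives events while the parser reads the input.
class NameCollector(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        if name == "country":
            self.names.append(attrs["name"])

collector = NameCollector()
xml.sax.parseString(XML_TEXT.encode("utf-8"), collector)
sax_names = collector.names

# 3) ElementTree: elements behave like lightweight Python containers.
root = ET.fromstring(XML_TEXT)
et_names = [c.get("name") for c in root.findall("country")]

print(dom_names, sax_names, et_names)
```

All three approaches should find the same country name; they differ mainly in how much code is needed and in how much of the document is held in memory.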

Detailed explanation

The XML file to be parsed (country.xml):

<?xml version="1.0"?> 
<data> 
  <country name="Singapore"> 
    <rank>4</rank> 
    <year>2011</year> 
    <gdppc>59900</gdppc> 
    <neighbor name="Malaysia" direction="N"/> 
  </country> 
  <country name="Panama"> 
    <rank>68</rank> 
    <year>2011</year> 
    <gdppc>13600</gdppc> 
    <neighbor name="Costa Rica" direction="W"/> 
    <neighbor name="Colombia" direction="E"/> 
  </country> 
</data> 

1, xml.etree.ElementTree

ElementTree was designed for processing XML. Historically it had two implementations in the Python standard library: a pure-Python one, xml.etree.ElementTree, and a faster C one, xml.etree.cElementTree. Note: on older Python versions, prefer the C implementation because it is faster and uses less memory. (cElementTree was deprecated in Python 3.3 and removed in Python 3.9.)

try: 
  import xml.etree.cElementTree as ET 
except ImportError: 
  import xml.etree.ElementTree as ET 

This is a common idiom for letting different Python implementations share the same API. Starting with Python 3.3, the ElementTree module automatically uses the C accelerator when it is available, so importing xml.etree.ElementTree is all you need.

#!/usr/bin/env python3
# coding: utf-8
  
import sys
import xml.etree.ElementTree as ET
  
try:
    tree = ET.parse("country.xml")          # Open the XML document
    # root = ET.fromstring(country_string)  # Or parse XML from a string
    root = tree.getroot()                   # Get the root node
except (OSError, ET.ParseError) as e:
    print("Error: cannot parse file country.xml:", e)
    sys.exit(1)

# Tag and attributes of the root node and of each direct child
print(root.tag, "---", root.attrib)
for child in root:
    print(child.tag, "---", child.attrib)
  
print("*" * 10)
print(root[0][1].text)            # Access by index: <year> of the first country
print(root[0].tag, root[0].text)  # The text here is only the whitespace before <rank>
print("*" * 10)
  
for country in root.findall('country'):  # Find all country nodes under the root node
    rank = country.find('rank').text     # Text of the rank node under this country
    name = country.get('name')           # Value of the name attribute of this country
    print(name, rank)
  
# Modify the XML: drop every country whose rank is above 50.
# findall() returns a list, so removing from root while looping is safe.
for country in root.findall('country'):
    rank = int(country.find('rank').text)
    if rank > 50:
        root.remove(country)
  
tree.write('output.xml')

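
The overview mentions that ET.iterparse can process a document without loading all of it at once. The following is a minimal sketch of that mode, not part of the original example: the in-memory SAMPLE document and the helper name iter_ranks are assumptions made so the snippet runs on its own.

```python
import io
import xml.etree.ElementTree as ET

# Illustrative input (an assumption); a real program would pass a file name
SAMPLE = b"""<data>
  <country name="Singapore"><rank>4</rank></country>
  <country name="Panama"><rank>68</rank></country>
</data>"""

def iter_ranks(source):
    """Yield (name, rank) pairs one country at a time."""
    # The "end" event fires once an element and all its children are parsed.
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "country":
            yield elem.get("name"), int(elem.find("rank").text)
            elem.clear()  # free the children of the country just processed

ranks = list(iter_ranks(io.BytesIO(SAMPLE)))
print(ranks)
```

Calling elem.clear() after each country keeps memory use low on large files, although the emptied elements still remain attached to the root.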

2, xml.dom.*

The Document Object Model (DOM) is a standard programming interface recommended by the W3C for working with extensible markup languages. When parsing an XML document, a DOM parser reads the whole document at once and keeps all of its elements in a tree structure in memory. You can then use the functions that DOM provides to read or modify the content and structure of the document, or write the modified content back to an XML file. In Python, xml.dom.minidom is used to parse XML files. An example follows:

#!/usr/bin/env python3
# coding: utf-8
  
import xml.dom.minidom
  
# Open the XML document with the minidom parser
dom_tree = xml.dom.minidom.parse("country.xml")
data = dom_tree.documentElement
if data.hasAttribute("name"):
    print("name attribute: %s" % data.getAttribute("name"))
  
# Get all country elements in the document
countries = data.getElementsByTagName("country")
  
# Print the details of each country
for country in countries:
    print("*****Country*****")
    if country.hasAttribute("name"):
        print("name: %s" % country.getAttribute("name"))
  
    # The text of an element is stored in its first child (a text node)
    rank = country.getElementsByTagName('rank')[0]
    print("rank: %s" % rank.childNodes[0].data)
    year = country.getElementsByTagName('year')[0]
    print("year: %s" % year.childNodes[0].data)
    gdppc = country.getElementsByTagName('gdppc')[0]
    print("gdppc: %s" % gdppc.childNodes[0].data)
  
    for neighbor in country.getElementsByTagName("neighbor"):
        print(neighbor.tagName, ":", neighbor.getAttribute("name"), neighbor.getAttribute("direction"))
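
The DOM description above also says the tree can be modified and written back out. Here is a minimal sketch of that, assuming a small inline document (XML_TEXT is made up for illustration) instead of country.xml.

```python
import xml.dom.minidom

# Illustrative input (an assumption, shortened from country.xml)
XML_TEXT = '<data><country name="Panama"><rank>68</rank></country></data>'

dom = xml.dom.minidom.parseString(XML_TEXT)
country = dom.getElementsByTagName("country")[0]

# Change the text of <rank>: the text lives in the element's first child node
rank = country.getElementsByTagName("rank")[0]
rank.firstChild.data = "67"

# Add an attribute and a new child element
country.setAttribute("updated", "yes")
year = dom.createElement("year")
year.appendChild(dom.createTextNode("2011"))
country.appendChild(year)

# Serialize the modified tree back to a string
result = dom.documentElement.toxml()
print(result)
```

To save the result to a file instead, the document's writexml() method can be passed an open, writable text file.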

3, xml.sax.*

SAX is an event-driven API. Parsing XML with SAX involves two parts: the parser and the event handler. The parser reads the XML document and sends events, such as element-start and element-end events, to the event handler; the event handler responds to those events and processes the XML data passed to it. To process XML with SAX in Python, first import the parse function from xml.sax and the ContentHandler class from xml.sax.handler. SAX is commonly used in the following situations: (1) processing large files; (2) when only part of the document, or only specific information from it, is needed; (3) when you want to build your own object model.

Introduction to the ContentHandler class methods

(1) characters(content) method
When it is called:
From the start of a line, if there are characters before a tag is encountered, content holds those characters.
From one tag, if there are characters before the next tag is encountered, content holds those characters.
From a tag, if there are characters before the line terminator is encountered, content holds those characters.
The tag can be a start tag or an end tag.
(2) startDocument() method
Called when the document starts.
(3) endDocument() method
Called when the parser reaches the end of the document.
(4) startElement(name, attrs) method
Called when an XML start tag is encountered. name is the tag name and attrs is a dictionary-like object holding the tag's attributes.
(5) endElement(name) method
Called when an XML end tag is encountered.
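
To see the order in which these methods fire, the following sketch records every callback for a one-line document. TraceHandler and the inline XML are illustrative assumptions; whitespace-only text is skipped to keep the trace short.

```python
import xml.sax

class TraceHandler(xml.sax.ContentHandler):
    """Record each callback so the event order can be inspected."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append("startDocument")

    def endDocument(self):
        self.events.append("endDocument")

    def startElement(self, name, attrs):
        self.events.append(("startElement", name, dict(attrs.items())))

    def endElement(self, name):
        self.events.append(("endElement", name))

    def characters(self, content):
        if content.strip():  # ignore whitespace-only text
            self.events.append(("characters", content))

handler = TraceHandler()
xml.sax.parseString(b'<country name="Panama"><rank>68</rank></country>', handler)
for event in handler.events:
    print(event)
```

The trace starts with startDocument and ends with endDocument, and each element's start and end events bracket the events of its content.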

#!/usr/bin/env python3
# coding: utf-8
  
import xml.sax
import xml.sax.handler
  
class CountryHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.current_data = ""   # Name of the element currently being read
        self.rank = ""
        self.year = ""
        self.gdppc = ""
  
    # Element start event handling
    def startElement(self, tag, attributes):
        self.current_data = tag
        if tag == "country":
            print("*****Country*****")
            name = attributes["name"]
            print("name:", name)
        elif tag == "neighbor":
            name = attributes["name"]
            direction = attributes["direction"]
            print(name, "->", direction)
  
    # Element end event handling
    def endElement(self, tag):
        if self.current_data == "rank":
            print("rank:", self.rank)
        elif self.current_data == "year":
            print("year:", self.year)
        elif self.current_data == "gdppc":
            print("gdppc:", self.gdppc)
        self.current_data = ""
  
    # Content event handling
    # Note: the parser may deliver long text in several chunks; the short
    # values in country.xml are assumed to arrive in a single call.
    def characters(self, content):
        if self.current_data == "rank":
            self.rank = content
        elif self.current_data == "year":
            self.year = content
        elif self.current_data == "gdppc":
            self.gdppc = content
  
if __name__ == "__main__":
    # Create an XMLReader
    parser = xml.sax.make_parser()
    # Turn off namespace processing
    parser.setFeature(xml.sax.handler.feature_namespaces, 0)
  
    # Install our ContentHandler in place of the default one
    handler = CountryHandler()
    parser.setContentHandler(handler)
  
    parser.parse("country.xml")

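
One of the use cases listed for SAX is building your own object model. The sketch below shows one possible way to do that by collecting each country into a plain dictionary instead of printing it. CountryCollector, its field list, and the inline SAMPLE document are assumptions for illustration. Text chunks are buffered and joined, because characters() may be called more than once for a single element.

```python
import xml.sax

class CountryCollector(xml.sax.ContentHandler):
    """Build a list of dicts, one per <country> element."""

    FIELDS = ("rank", "year", "gdppc")  # numeric child elements to capture

    def __init__(self):
        super().__init__()
        self.countries = []
        self._current = None  # dict for the country being read, if any
        self._buffer = None   # list of text chunks for the current field

    def startElement(self, name, attrs):
        if name == "country":
            self._current = {"name": attrs["name"], "neighbors": []}
        elif name == "neighbor" and self._current is not None:
            self._current["neighbors"].append(attrs["name"])
        elif name in self.FIELDS and self._current is not None:
            self._buffer = []

    def characters(self, content):
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name in self.FIELDS and self._buffer is not None:
            self._current[name] = int("".join(self._buffer))
            self._buffer = None
        elif name == "country":
            self.countries.append(self._current)
            self._current = None

# Illustrative input (an assumption, taken from the Panama entry)
SAMPLE = b"""<data>
  <country name="Panama">
    <rank>68</rank><year>2011</year><gdppc>13600</gdppc>
    <neighbor name="Costa Rica" direction="W"/>
    <neighbor name="Colombia" direction="E"/>
  </country>
</data>"""

collector = CountryCollector()
xml.sax.parseString(SAMPLE, collector)
print(collector.countries)
```

Only the country currently being read is kept as working state, so the same handler would also work with parser.parse() on a large file.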

4, Parsing XML with libxml2 and lxml

libxml2 is an XML parser written in C. It is free, open-source software released under the MIT License, and many programming languages have bindings built on it. The libxml2 module in Python has one small drawback: the xpathEval() interface does not support template-like usage (expressions with variable placeholders), although this does not prevent its use. Because libxml2 is written in C, its API inevitably feels somewhat unnatural in Python.

#!/usr/bin/env python3
# coding: utf-8
  
import libxml2
  
doc = libxml2.parseFile("country.xml")
# Print the text content of every country element
for country in doc.xpathEval('//country'):
    if country.content != "":
        print("----------------------")
        print(country.content)
# The value to match has to be written directly into the XPath expression
for node in doc.xpathEval("//country/neighbor[@name = 'Colombia']"):
    print(node.name, (node.properties.name, node.properties.content))
doc.freeDoc()  # The document is allocated in C and must be freed explicitly

lxml is a Python library built on libxml2. From a usage perspective, lxml suits Python developers better than libxml2, and its xpath() interface supports template-like usage.

#!/usr/bin/env python3
# coding: utf-8
  
import lxml.etree
  
doc = lxml.etree.parse("country.xml")
# $name is an XPath variable; its value is passed as a keyword argument
for node in doc.xpath("//country/neighbor[@name = $name]", name="Colombia"):
    print(node.tag, node.items())
for node in doc.xpath("//country[@name = $name]", name="Singapore"):
    print(node.tag, node.items())

5, Summary

(1) Libraries and modules available for XML parsing in Python include the standard-library xml package, libxml2, lxml, xpath, and others.
(2) Each parsing method has its own advantages and disadvantages. Weigh performance and the other trade-offs before choosing one.
(3) If anything in this post is lacking, corrections are welcome.