XML (eXtensible Markup Language) is a markup language designed to transmit and store data. It has increasingly become the core of many new technologies and is applied differently in different fields. It emerged naturally once the web reached a certain stage of development. It retains the core features of SGML and the simplicity of HTML, and adds characteristics of its own, such as clarity and a well-defined structure.
There are three common ways for Python to parse XML:
The first is the xml.dom.* modules, which implement the W3C DOM API. If you need to work with the DOM API, these modules are a good fit. Note that the xml.dom package contains several modules, so be careful to distinguish between them;
The second is the xml.sax.* modules, which implement the SAX API. They sacrifice convenience in exchange for speed and a small memory footprint. SAX is an event-based API, which means it can process very large documents "on the fly" without loading them completely into memory;
The third is the xml.etree.ElementTree module (ET for short), which provides a lightweight Python API. Compared with DOM, ET is much faster and has many pleasant APIs to use. Compared with SAX, ET.iterparse also provides an "on the fly" processing mode, so there is no need to load the whole document into memory. The average performance of ET is similar to that of SAX, but its API is higher level and easier to use.
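The "on the fly" mode of ET.iterparse mentioned above can be sketched as follows. This is a minimal illustration written for Python 3, not a benchmark: the inline SAMPLE data and the helper name iter_ranks are assumptions made for the example, and a real program would pass a file name such as country.xml instead of a BytesIO object.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data; a real use case would pass a file name instead.
SAMPLE = b"""<data>
  <country name="Singapore"><rank>4</rank></country>
  <country name="Panama"><rank>68</rank></country>
</data>"""


def iter_ranks(source):
    """Yield (name, rank) pairs one country at a time."""
    # "end" events fire when an element has been read completely.
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "country":
            yield elem.get("name"), int(elem.find("rank").text)
            # Release the element's contents so memory use stays small.
            elem.clear()


ranks = list(iter_ranks(io.BytesIO(SAMPLE)))
print(ranks)
```

Because each country element is cleared as soon as it has been handled, the memory held by its contents can be released before the rest of the file is read.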
1, Detailed explanation
The XML file to be parsed (country.xml):
<?xml version="1.0"?>
<data>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>
1, xml.etree.ElementTree
ElementTree was designed for processing XML. It has two implementations in the Python standard library: a pure-Python one, xml.etree.ElementTree, and a faster one written in C, xml.etree.cElementTree. Note: prefer the C implementation where it is available, because it is faster and consumes less memory.
try:
    import xml.etree.cElementTree as ET
except ImportError:
    import xml.etree.ElementTree as ET
This is a common way to let code use the same API regardless of which implementation is installed. Starting from Python 3.3, the ElementTree module automatically uses the C accelerator when it is available, so importing xml.etree.ElementTree is enough. The xml.etree.cElementTree module was deprecated in Python 3.3 and removed in Python 3.9.
#!/usr/bin/env python
# coding: utf-8

import sys

try:
    import xml.etree.cElementTree as ET
except ImportError:
    import xml.etree.ElementTree as ET

try:
    tree = ET.parse("country.xml")          # Open the XML document
    # root = ET.fromstring(country_string)  # Or parse XML from a string
    root = tree.getroot()                   # Get the root node
except Exception as e:
    print("Error: cannot parse file: country.xml.")
    sys.exit(1)

print(root.tag, "---", root.attrib)
for child in root:
    print(child.tag, "---", child.attrib)

print("*" * 10)
print(root[0][1].text)                      # Access by subscript
print(root[0].tag, root[0].text)
print("*" * 10)

for country in root.findall('country'):     # Find all country nodes under the root node
    rank = country.find('rank').text        # Text of the rank node under the child node
    name = country.get('name')              # Value of the name attribute of the child node
    print(name, rank)

# Modify the XML: drop countries ranked below 50 and save the result
for country in root.findall('country'):
    rank = int(country.find('rank').text)
    if rank > 50:
        root.remove(country)

tree.write('output.xml')
Operation results: [screenshot of the script's console output]
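The commented-out ET.fromstring call in the script above can be illustrated with a small self-contained sketch for Python 3. The shortened country_string below is an assumption made for the example (it is not the full country.xml), and the new rank value 67 is arbitrary. The sketch also uses the limited XPath support of find() to select an element by attribute value.

```python
import xml.etree.ElementTree as ET

# Shortened copy of country.xml held in a string (illustrative data only).
country_string = """<data>
  <country name="Singapore">
    <rank>4</rank>
    <neighbor name="Malaysia" direction="N"/>
  </country>
  <country name="Panama">
    <rank>68</rank>
    <neighbor name="Costa Rica" direction="W"/>
    <neighbor name="Colombia" direction="E"/>
  </country>
</data>"""

# fromstring returns the root Element directly (no ElementTree wrapper).
root = ET.fromstring(country_string)

# Limited XPath: select a country by the value of its name attribute.
panama = root.find("country[@name='Panama']")
neighbors = [n.get("name") for n in panama.findall("neighbor")]

# Modify a value and serialize the tree back to text.
panama.find("rank").text = "67"
xml_text = ET.tostring(root, encoding="unicode")

print(neighbors)
print(panama.find("rank").text)
```

Working on a string this way is convenient for tests and small payloads; for files, ET.parse as used in the script above remains the natural choice.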
2, xml.dom.*
The Document Object Model (DOM) is a standard programming interface recommended by the W3C for working with XML. When parsing an XML document, a DOM parser reads the whole document at once and keeps all of its elements in a tree structure in memory. You can then use the functions provided by the DOM to read or modify the content and structure of the document, or to write the modified content back to an XML file. In Python, xml.dom.minidom can be used to parse XML files. An example follows:
#!/usr/bin/python
# coding=utf-8

import xml.dom.minidom

# Open the XML document with the minidom parser
dom_tree = xml.dom.minidom.parse("country.xml")
data = dom_tree.documentElement
if data.hasAttribute("name"):
    print("name element : %s" % data.getAttribute("name"))

# Get all countries in the collection
countries = data.getElementsByTagName("country")

# Print details for each country
for country in countries:
    print("*****Country*****")
    if country.hasAttribute("name"):
        print("name: %s" % country.getAttribute("name"))
    rank = country.getElementsByTagName('rank')[0]
    print("rank: %s" % rank.childNodes[0].data)
    year = country.getElementsByTagName('year')[0]
    print("year: %s" % year.childNodes[0].data)
    gdppc = country.getElementsByTagName('gdppc')[0]
    print("gdppc: %s" % gdppc.childNodes[0].data)
    for neighbor in country.getElementsByTagName("neighbor"):
        print(neighbor.tagName, ":", neighbor.getAttribute("name"), neighbor.getAttribute("direction"))

Operation results: [screenshot of the script's console output]
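The example above only reads the document. Modifying the tree and serializing it again, as the DOM description mentions, might look like the following Python 3 sketch. The one-line input document and the new values are assumptions chosen to keep the example short.

```python
import xml.dom.minidom

# Tiny illustrative document (not the full country.xml).
doc = xml.dom.minidom.parseString(
    '<data><country name="Panama"><rank>68</rank></country></data>'
)
country = doc.getElementsByTagName("country")[0]

# Change the text of the existing <rank> element.
rank = country.getElementsByTagName("rank")[0]
rank.firstChild.data = "67"

# Create a new <year> element with a text node and append it to the country.
year = doc.createElement("year")
year.appendChild(doc.createTextNode("2011"))
country.appendChild(year)

# Serialize the root element back to a string.
result = doc.documentElement.toxml()
print(result)
```

To write the result to a file instead, the string can be written out with an ordinary open(...).write(...) call.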
3, xml.sax.*
SAX is an event-driven API. Parsing XML with SAX involves two parts: the parser and the event handler. The parser reads the XML document and sends events to the event handler, such as element start and element end events. The event handler responds to those events and processes the XML data passed to it. When using SAX to process XML in Python, we first import the parse function from xml.sax and the ContentHandler class from xml.sax.handler. SAX is often used in the following situations: 1, when processing large files; 2, when only part of the document is needed, or only specific information has to be extracted from it; 3, when you want to build your own object model.
Introduction to the ContentHandler class methods
(1) characters(content) method
Called when:
From the start of a line, there are characters before a tag is encountered; content holds those characters.
After one tag, there are characters before the next tag is encountered; content holds those characters.
After a tag, there are characters before the line terminator is encountered; content holds those characters.
The tag can be a start tag or an end tag.
(2) startDocument() method
Called when the document starts.
(3) endDocument() method
Called when the parser reaches the end of the document.
(4) startElement(name, attrs) method
Called when an XML start tag is encountered. name is the name of the tag and attrs is a dictionary-like object holding the tag's attributes.
(5) endElement(name) method
Called when an XML end tag is encountered.
#!/usr/bin/python
# coding=utf-8

import xml.sax


class CountryHandler(xml.sax.ContentHandler):
    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.CurrentData = ""
        self.rank = ""
        self.year = ""
        self.gdppc = ""
        self.neighborname = ""
        self.neighbordirection = ""

    # Element start event handling
    def startElement(self, tag, attributes):
        self.CurrentData = tag
        if tag == "country":
            print("*****Country*****")
            name = attributes["name"]
            print("name:", name)
        elif tag == "neighbor":
            name = attributes["name"]
            direction = attributes["direction"]
            print(name, "->", direction)

    # Element end event handling
    def endElement(self, tag):
        if self.CurrentData == "rank":
            print("rank:", self.rank)
        elif self.CurrentData == "year":
            print("year:", self.year)
        elif self.CurrentData == "gdppc":
            print("gdppc:", self.gdppc)
        self.CurrentData = ""

    # Content event handling
    def characters(self, content):
        if self.CurrentData == "rank":
            self.rank = content
        elif self.CurrentData == "year":
            self.year = content
        elif self.CurrentData == "gdppc":
            self.gdppc = content


if __name__ == "__main__":
    # Create an XMLReader
    parser = xml.sax.make_parser()
    # Turn off namespaces
    parser.setFeature(xml.sax.handler.feature_namespaces, 0)
    # Override the default ContentHandler
    handler = CountryHandler()
    parser.setContentHandler(handler)
    parser.parse("country.xml")
Operation results: [screenshot of the script's console output]
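One of the uses of SAX listed earlier is building your own object model. A minimal Python 3 sketch of that idea follows. The class name CountryCollector, the dict layout, and the inline SAMPLE data are all assumptions made for illustration. The sketch also buffers text in characters(), since a parser may deliver the text of a single element in several chunks.

```python
import xml.sax

# Shortened illustrative data (not the full country.xml).
SAMPLE = b"""<data>
  <country name="Singapore">
    <rank>4</rank>
    <neighbor name="Malaysia" direction="N"/>
  </country>
  <country name="Panama">
    <rank>68</rank>
    <neighbor name="Costa Rica" direction="W"/>
    <neighbor name="Colombia" direction="E"/>
  </country>
</data>"""


class CountryCollector(xml.sax.ContentHandler):
    """Collect each <country> into a dict (a hypothetical object model)."""

    def __init__(self):
        super().__init__()
        self.countries = []
        self._current = None  # dict for the country being parsed
        self._buffer = []     # text chunks of the current element

    def startElement(self, name, attrs):
        self._buffer = []
        if name == "country":
            self._current = {"name": attrs["name"], "neighbors": []}
        elif name == "neighbor" and self._current is not None:
            self._current["neighbors"].append(attrs["name"])

    def characters(self, content):
        # May be called several times for one text node, so accumulate.
        self._buffer.append(content)

    def endElement(self, name):
        text = "".join(self._buffer).strip()
        if name in ("rank", "year", "gdppc") and self._current is not None:
            self._current[name] = int(text)
        elif name == "country":
            self.countries.append(self._current)
            self._current = None
        self._buffer = []


handler = CountryCollector()
xml.sax.parseString(SAMPLE, handler)
countries = handler.countries
print(countries)
```

Only the finished dicts are kept, so the handler never holds the whole document tree in memory, which is the main reason to choose SAX for large files.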
4, Parsing XML with libxml2 and lxml
Libxml2 is an XML parser written in C. It is free, open-source software released under the MIT License, and many programming languages have bindings based on it. The libxml2 module in Python has a small disadvantage: the xpathEval() interface does not support template-like usage (passing values into an XPath expression as variables), although this does not prevent it from being used. Because libxml2 is written in C, its API inevitably feels a little awkward to use from Python.
#!/usr/bin/python
# coding=utf-8

import libxml2

doc = libxml2.parseFile("country.xml")

# Print the text content of every country element
for country in doc.xpathEval('//country'):
    if country.content != "":
        print("----------------------")
        print(country.content)

# Select the neighbor whose name attribute is 'Colombia'
for node in doc.xpathEval("//country/neighbor[@name = 'Colombia']"):
    print(node.name, (node.properties.name, node.properties.content))

doc.freeDoc()
Lxml is a Python library built on top of libxml2. From a usage point of view, lxml suits Python developers better than libxml2, and its xpath() interface supports template-like usage, where values are passed into the expression as variables.
#!/usr/bin/python
# coding=utf-8

import lxml.etree

doc = lxml.etree.parse("country.xml")

# $name in the expression is filled in from the keyword argument
for node in doc.xpath("//country/neighbor[@name = $name]", name="Colombia"):
    print(node.tag, node.items())

for node in doc.xpath("//country[@name = $name]", name="Singapore"):
    print(node.tag, node.items())
5, Summary
(1) The available class libraries or modules for XML parsing in Python include xml, libxml2, lxml, xpath, etc.
(2) Each parsing method has its own advantages and disadvantages. Weigh performance, memory use, and ease of use before choosing one.
(3) If anything in this blog post falls short, please bear with it.