[Python] Parsing XML documents

Posted by RedMaster on Fri, 25 Feb 2022 13:40:30 +0100

An XML document is essentially a tree-structured data store, and working with it comes down to four basic operations: adding, deleting, modifying and querying elements.

Parsing the tree structure

  • Read from hard disk
  • Read from string

Note: the xml.etree.ElementTree module is not secure against maliciously constructed data.

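from xml.etree import ElementTree

# parse the XML file from disk into a tree object
# ("country_data.xml" is a placeholder; use the path of your own XML file)
tree = ElementTree.parse("country_data.xml")

# get the root element of the XML tree
root = tree.getroot()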

Note: when reading from a string, there is no need to call parse() and getroot(), because fromstring() returns the root node directly.

from xml.etree import ElementTree

# country_data_as_string must hold the XML text; fromstring() returns the root element
root = ElementTree.fromstring(country_data_as_string)
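
As a minimal sketch of reading from a string, the snippet below defines a small country_data_as_string by hand and parses it. The sample XML content is an assumption made up for illustration, not data from the original project.

```python
from xml.etree import ElementTree

# Hypothetical sample data, made up for illustration
country_data_as_string = """<data>
    <country name="Liechtenstein"><rank>1</rank></country>
    <country name="Singapore"><rank>4</rank></country>
</data>"""

# fromstring() parses the text and returns the root Element directly
root = ElementTree.fromstring(country_data_as_string)

print(root.tag)    # data
print(len(root))   # 2 (number of direct children)
```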

Here, tree is the ElementTree object that represents the whole XML file, and root is its root node.

root is an Element object and has the following attributes:

  1. tag: a string identifying what kind of data the element represents.
  2. attrib: a dictionary holding the element's attributes.
  3. text: a string holding the content of the element.
  4. tail: a string holding the text that follows the element's closing tag.
  5. Zero or more child elements, which can be accessed by index.
<tag attrib1="1">text</tag>tail
  1     2         3         4
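
The four attributes and the index access can be tried on a tiny document. This is only a sketch; the element names a, b and c are made up for illustration.

```python
from xml.etree import ElementTree

# Hypothetical snippet: <b> has a tag, one attribute, text, and a tail
root = ElementTree.fromstring('<a><b attrib1="1">text</b>tail<c/></a>')

b = root[0]            # child elements can be accessed by index
print(b.tag)           # b
print(b.attrib)        # {'attrib1': '1'}
print(b.text)          # text
print(b.tail)          # tail
print(root[1].tag)     # c
print(root[1].text)    # None, because <c/> has no content
```

Note that attribute values are always strings, even when they look like numbers.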

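Tip: on older Python versions, you can speed things up with xml.etree.cElementTree, the implementation of the same API written in C, by trying to import it first and falling back to the pure-Python module. Since Python 3.3, xml.etree.ElementTree uses the C accelerator automatically when it is available, and cElementTree was removed in Python 3.9, so this fallback only matters for legacy code. The import is written as follows.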

try:
    import xml.etree.cElementTree as ET
except ImportError:
    import xml.etree.ElementTree as ET

ElementTree.Element Class

class xml.etree.ElementTree.Element(tag, attrib={}, **extra)

    # Attributes
    tag: string. The type of data the element represents.
    attrib: dictionary. The element's attributes.
    text: string. The text content of the element.
    tail: string. The text that follows the element's closing tag.

    # Actions on attributes
    clear(): Remove all child elements and all attributes, and set text and tail to None.
    get(key, default=None): Get the value of the attribute named key. If the attribute does not exist, default is returned.
    items(): Return the attributes as a sequence of (key, value) pairs.
    keys(): Return a list of the element's attribute names.
    set(key, value): Set the attribute key to value.

    # Actions on descendants
    ## Add new elements
    append(subelement): Add an immediate child element at the end.
    extend(subelements): Append a sequence of element objects as child elements.
    insert(index, element): Insert a child element at the given position.

    ## Delete elements
    remove(subelement): Remove the given child element.

    ## Traverse elements to get an iterator or a list
    find(match): Find the first matching subelement. match can be a tag name or a path.
    findall(match): Find all matching subelements and return them as a list. match can be a tag name or a path.
    findtext(match, default=None): Find the first matching subelement and return its text. match can be a tag name or a path.
    iter(tag=None): Return an iterator over the current element and all its descendants; if tag is given, only elements with that tag are yielded.
    iterfind(match): Like findall(), but return an iterator over the matching subelements.
    itertext(): Iterate over the text of the element and of all its descendants.
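
A short sketch of the attribute and child-element methods listed above. The tree is built by hand, and all tag and attribute names are made up for illustration.

```python
from xml.etree.ElementTree import Element, tostring

# Build a small tree by hand; all tag and attribute names are made up
root = Element("data")
country = Element("country", name="Singapore")   # **extra sets attributes
root.append(country)                             # add one child
root.extend([Element("country", name="Panama"),
             Element("country", name="Peru")])   # add several children
root.insert(0, Element("header"))                # insert at position 0

# Attribute actions
country.set("rank", "4")
print(country.get("rank"))             # 4
print(country.get("missing", "n/a"))   # n/a
print(sorted(country.keys()))          # ['name', 'rank']

# Delete a child; remove() needs the element object itself
root.remove(root[0])
print([child.get("name") for child in root])   # ['Singapore', 'Panama', 'Peru']

# Serialize the result to a string to inspect it
print(tostring(root, encoding="unicode"))
```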

ElementTree Object

class xml.etree.ElementTree.ElementTree(element=None, file=None)
    element, if given, is the root node of the new ElementTree. file, if given, is an XML file whose contents initialize the tree.

    _setroot(element): Replace the current root node with the given element. Use with caution.
    getroot(): Get the root node.

    parse(source, parser=None): Load an XML document into the tree. source can be a file name or a file object.

    # Write the tree back to a file
    write(file, encoding="us-ascii", xml_declaration=None, default_namespace=None, method="xml")

    # The following methods work like the methods of the same name in the Element class, except that they start from the root node.
    find(match)
    findall(match)
    findtext(match, default=None)
    iter(tag=None)
    iterfind(match)
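
The sketch below wraps a root element in an ElementTree, runs a few queries, writes the tree to a temporary file and loads it back. The sample data and the file name countries.xml are assumptions made up for illustration.

```python
import os
import tempfile
from xml.etree import ElementTree

# Hypothetical sample data for illustration
root = ElementTree.fromstring(
    "<data>"
    "<country name='Panama'><rank>68</rank></country>"
    "<country name='Peru'><rank>50</rank></country>"
    "</data>"
)
tree = ElementTree.ElementTree(root)   # wrap the root element in a tree

# Queries on the tree start from the root node
print(tree.find("country").get("name"))   # Panama
print(tree.findtext("country/rank"))      # 68
print(len(tree.findall("country")))       # 2
print([e.tag for e in tree.iter()])       # ['data', 'country', 'rank', 'country', 'rank']

# Write the tree to a temporary file, then load it back with parse()
path = os.path.join(tempfile.mkdtemp(), "countries.xml")
tree.write(path, encoding="utf-8", xml_declaration=True)
reloaded = ElementTree.parse(path)
print(reloaded.getroot().tag)             # data
```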

Add, delete, modify and query

Thinking it over, an object-oriented approach makes it easier to reason about and organize the code. In practice, an XML file should be treated as an object, with the methods above gathered into a separate class.

Practical application of AI tuner in small projects:

class xmlResolver(xmlFilePath)
    xmlWri
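
A minimal sketch of how such a wrapper class could look. It is not the original xmlResolver; the class name XmlFile and its methods query, add, modify, delete and save are hypothetical names chosen for illustration.

```python
from xml.etree import ElementTree


class XmlFile:
    """Hypothetical wrapper that groups add/delete/modify/query for one XML file."""

    def __init__(self, xml_file_path):
        self.path = xml_file_path
        self.tree = ElementTree.parse(xml_file_path)
        self.root = self.tree.getroot()

    def query(self, match):
        """Return all elements matching a tag or path, relative to the root."""
        return self.root.findall(match)

    def add(self, tag, text=None, **attributes):
        """Append a new child to the root and return it."""
        child = ElementTree.SubElement(self.root, tag, attributes)
        child.text = text
        return child

    def modify(self, match, text):
        """Set the text of every matching element; return how many changed."""
        matched = self.root.findall(match)
        for element in matched:
            element.text = text
        return len(matched)

    def delete(self, tag):
        """Remove every direct child of the root that has the given tag."""
        for element in self.root.findall(tag):
            self.root.remove(element)

    def save(self):
        """Write the tree back to the file it was loaded from."""
        self.tree.write(self.path, encoding="utf-8", xml_declaration=True)
```

With this sketch, XmlFile(path).add("item", "hello", id="1") followed by save() would append one element and write the file back.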

Python object-oriented review

Methods

self represents the instance of the class. It must be the first parameter when defining a method of a class, although you do not pass an argument for it explicitly when calling the method.

The __init__() method is a special method, called the constructor or initializer of the class. It is called whenever an instance of the class is created.
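
A small sketch of both points. The Counter class is a made-up example.

```python
class Counter:
    def __init__(self, start=0):
        # __init__ runs automatically when Counter(...) is called
        self.value = start

    def increment(self, step=1):
        # self is the instance the method was called on
        self.value += step
        return self.value


counter = Counter(10)                  # self is not passed explicitly
print(counter.increment())             # 11
print(Counter.increment(counter, 5))   # 16, the same call with self spelled out
```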

Class

__dict__: the attributes of the class (a dictionary made up of the class's data attributes)

__doc__: the docstring of the class

__name__: the class name

__module__: the module in which the class is defined (the full name of the class is '__main__.className'; if the class lives in an imported module mymod, then className.__module__ equals mymod)

__bases__: all the parent classes of the class (a tuple containing every direct parent class)
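
These built-in class attributes can be inspected directly. Base and Child below are hypothetical classes used only for illustration.

```python
class Base:
    """A hypothetical base class."""


class Child(Base):
    """A hypothetical subclass."""
    kind = "demo"


print(Child.__name__)             # Child
print(Child.__doc__)              # A hypothetical subclass.
print(Child.__bases__ == (Base,)) # True
print(Child.__module__)           # __main__ when run as a script
print("kind" in Child.__dict__)   # True
```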

Subclasses and parent classes

class DerivedClassName(BaseClassName):
    ...

Note: Python allows a class to inherit from more than one parent class, which is called multiple inheritance.

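Method overriding in Python means that a subclass redefines a method of its parent class. Unlike Java, Python does not support overloading a method by parameter signature; a later definition with the same name simply replaces the earlier one.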
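
A minimal sketch of multiple inheritance and overriding. The classes Animal, Swimmer and Duck are made up for illustration.

```python
class Animal:
    def speak(self):
        return "..."

    def describe(self):
        # calls whichever speak() the actual instance provides
        return "I say " + self.speak()


class Swimmer:
    def move(self):
        return "swim"


class Duck(Animal, Swimmer):      # multiple inheritance
    def speak(self):              # overrides Animal.speak
        return "Quack"


duck = Duck()
print(duck.describe())   # I say Quack
print(duck.move())       # swim
print([cls.__name__ for cls in Duck.__mro__])   # ['Duck', 'Animal', 'Swimmer', 'object']
```

The last line prints the method resolution order, which is the order Python searches the parent classes when looking up a method.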

Basic overridable methods

No.  Method                        Description                                      Simple call
1    __init__(self [, args...])    Constructor                                      obj = className(args)
2    __del__(self)                 Destructor, called when an object is deleted     del obj
3    __repr__(self)                Converts to a form the interpreter can read      repr(obj)
4    __str__(self)                 Converts a value to a human-readable form        str(obj)
5    __cmp__(self, x)              Object comparison                                cmp(obj, x)

Note: __cmp__ and cmp() exist only in Python 2. Python 3 uses the rich comparison methods such as __eq__ and __lt__ instead.
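
A short sketch of overriding some of these special methods, written for Python 3. The Point class is a made-up example, and __eq__ stands in for comparison because Python 3 no longer has __cmp__.

```python
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __repr__(self):
        # unambiguous form, ideally valid Python
        return f"Point({self.x!r}, {self.y!r})"

    def __str__(self):
        # friendly form for people
        return f"({self.x}, {self.y})"

    def __eq__(self, other):
        # Python 3 uses rich comparison methods instead of __cmp__
        if not isinstance(other, Point):
            return NotImplemented
        return (self.x, self.y) == (other.x, other.y)


p = Point(1, 2)
print(repr(p))            # Point(1, 2)
print(str(p))             # (1, 2)
print(p == Point(1, 2))   # True
```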

Private class methods

__private_method: starts with two underscores, declaring the method private so that it cannot be called directly from outside the class. Inside the class, call it as self.__private_method().

Single underscore, double underscore, and double underscore at both ends

__foo__: defines a special method, generally a system-defined name such as __init__().

_foo: a name starting with a single underscore marks a protected variable, meaning that by convention it should only be accessed by the class itself and its subclasses; such names are also not imported by from module import *.

__foo: a name starting with a double underscore marks a private variable, which can only be accessed by the class itself.
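
The naming rules can be seen in a small sketch. The Account class and its attribute names are made up for illustration. Python implements the double-underscore rule by name mangling, so the private name is stored under a prefixed name rather than being truly inaccessible.

```python
class Account:
    def __init__(self, owner, balance):
        self.owner = owner          # public
        self._notes = []            # protected by convention only
        self.__balance = balance    # private: stored as _Account__balance

    def __audit(self):              # private method
        return f"{self.owner}: {self.__balance}"

    def report(self):
        return self.__audit()       # private names work inside the class


account = Account("Ann", 100)
print(account.report())                  # Ann: 100
print(account._notes)                    # [] (accessible, but discouraged)
print(hasattr(account, "__balance"))     # False
print(hasattr(account, "__audit"))       # False
print(account._Account__balance)         # 100 (the mangled name)
```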
