Data analysis starts from scratch, and Pandas reads and writes Excel/XML data

Posted by vineld on Mon, 29 Nov 2021 17:23:21 +0100

Click to view the first article: Data analysis starts from scratch. Pandas reads HTML pages + data processing and analysis_
Click to view the first article: Data analysis starts from zero to actual combat, and Pandas reads and writes CSV data_

Data analysis starts from zero to actual combat, and Pandas reads and writes CSV data

python introduction to advanced, crawler data analysis, a full set of information sharing and explanation

The previous two articles talked about data analysis, virtual environment creation and pandas reading and writing data in csv, tsv and json formats. Today, we continue to explore pandas reading data.

I. summary of basic knowledge

1. Use pandas to read and write Excel files
2. Use pandas to read and write XML files

Second, start thinking

1. Use Python to read and write Excel

Read, using the ExcelFile() method of the Pandas library.

Write, using to of Pandas Library_ Excel method.

code

import pandas as pd
import os

# Gets the parent directory path of the current file
father_path = os.getcwd()
# Original data file path
rpath_excel = father_path+r'\data01\realEstate_trans.xlsx'
# Data saving path
wpath_excel = father_path+r'\data01\temp_excel.xlsx'

# Open excel file
excel_file = pd.ExcelFile(rpath_excel)

# Read file contents
"""
ExcelFile Object parse()Method reads the contents of the specified worksheet
ExcelFile Object sheet_names Property can be obtained Excel All worksheets in the file
 Dictionary expressions are also used to assign values to the dictionary (which looks more elegant)
"""

excel_read = {sheetName : excel_file.parse(sheetName) for sheetName in excel_file.sheet_names}

# Output the first 10 rows of the price column of the Sacramento table
print(excel_read['Sacramento'].head(10)['price'])
print(type(excel_read['Sacramento'].head(10)['price']))
# Encountered error: ModuleNotFoundError: No module named 'xlrd'

# Write the first 10 rows of the price column of the table
excel_read['Sacramento'].head(10)['price'].to_excel(wpath_excel, "price", index=False)
# Encountered error: ModuleNotFoundError: No module named 'openpyxl'
Copy code

Read results:

Write result:

Possible errors:

During read operation:
ModuleNotFoundError: No module named 'xlrd'
When writing:
ModuleNotFoundError: No module named 'openpyxl'
Copy code

resolvent:

# Install xlrd and openpyxl modules in the environment
pip install xlrd
pip install openpyxl
 Copy code

2. Use Python to read and write XML files

Students who have studied java should not be unfamiliar with XML. The full name is extensible markup language. Although it is not common at ordinary times, XML coding is supported in the Web API.

Read and write code

import pandas as pd
# A lightweight XML parser
import xml.etree.ElementTree as ET
import os

"""
    Read in XML Data, return pa.DataFrame
"""
def read_xml(xml_FileName):
    with open(xml_FileName, "r") as xml_file:
        # Read the data and store it in a tree structure
        tree = ET.parse(xml_file)
        # Access the stem node of the tree
        root = tree.getroot()
        # Return data in DataFrame format
        return pd.DataFrame(list(iter_records(root)))

"""
    Traversing recorded generators
"""
def iter_records(records):
    for record in records   :
        # Temporary dictionary for saving values
        temp_dict = {}
        # Traverse all fields
        for var in record:
            temp_dict[
                var.attrib["var_name"]
            ] = var.text
        # Generated value
        yield temp_dict

"""
    with XML Save data in format
"""
def write_xml(xmlFileName, data):
    with open(xmlFileName, "w") as xmlFile:
        # Write header
        xmlFile.write(
            '<?xml version="1.0" encoding="UTF-8"?>'
        )
        xmlFile.write('<records>\n')
        # Write data
        xmlFile.write(
            '\n'.join(data.apply(xml_encode, axis=1))
        )
        # Write tail
        xmlFile.write("\n</records>")

"""
    Encode each line into a specific nested format XML
"""
def xml_encode(row):
    # Step 1: output record node
    xmlItem = ['  <record>']
    # Step 2 -- add XML format < field name = ·· > ·· < / field > to each field in the row
    for field in row.index:
        xmlItem.append(
            '<var var_name="{0}">{1}</var>'.format(field, row[field])
        )
    # The last step -- mark the end tag of the record node
    xmlItem.append("  </record>")
    return '\n'.join(xmlItem)


# Gets the parent directory path of the current file
father_path = os.getcwd()
# Original data file path
rpath_xml = father_path+r'\data01\realEstate_trans.xml'
# Data saving path
wpath_xml = father_path+r'\data01\temp_xml.xml'
# Read data
xml_read = read_xml(rpath_xml)
# Output header 10 Line records
print(xml_read.head(10))
# Write back to the file in XML format
write_xml(wpath_xml, xml_read.head(10))
Copy code

Operation results

Code parsing

(1)read_xml(xml_FileName) function

Function: read in XML data and return pa.DataFrame

Here we use a lightweight XML parser: xml.etree.ElementTree. The file name is read, the file content is read first, then the XML is resolved by parse() function, a tree structure is created and stored in the tree variable, the getroot() method is called on the tree object to get the root node, and finally the iter_ is called. Records() function, pass in the root node, and then convert the returned information into DataFrame.

(2)iter_records(records) function

Function: traverse the recorded generator

iter_ The records () method is a generator, which can be seen from the keyword yield. Unlike return, the generator only returns one value to the calling method at a time until the end.

(3)write_xml(xmlFile, data) function

Function: save data in XML format

It should be noted here that we have to save according to the XML file format. What we need to do is three steps: save the header format, save the data according to the format, and save the tail format. When saving data, the apply() method of DataFrame object is used to traverse each internal row, and the first parameter is xml_encode specifies the method to be applied to each row of records. axis=1 means to process by row. The default value is 0, which means to process by column.

(4)xml_encode(row) function

Function: encode each row into XML in a specific nested format

In the process of writing data, we will call this method to process each row of data into XML format.

Three for you

I had a meeting yesterday and thought about writing. Here I share with you:

1. Ideological awareness, dialectical thinking. Don't follow suit, have your own opinions. Smart people should insist on outputting their own thoughts, think from the things themselves and other people's comments, correct their own thoughts, and then output;

2. Shout less slogans and do more practical things. Originally, I highly praised personal planning, but I found that not only my surroundings and some readers, including myself, planning is becoming more and more fake and empty. There is nothing wrong with the planning itself. The wrong thing is: in real life, we turn the planning into a daily slogan, and in order to complete it, so I now highly recommend: planning, do it first.


 

Topics: Python Back-end Programmer