06-2 File processing

Posted by duane on Tue, 24 Dec 2019 11:24:00 +0100


I. Introduction

Data produced by a running application lives in memory at first. To keep it permanently, it has to be saved to the hard disk. An application cannot operate hardware directly; it must go through the operating system, and a file is the virtual concept the operating system gives applications for working with the hard disk. When a user or an application operates on a file, it is really making a call to the operating system, which then carries out the actual hard disk operation.


II. Basic process of file operations

2.1 Basic process

With the concept of a file, we no longer need to think about the details of operating the hard disk; we only need to focus on the process of operating the file:

# 1. Open the file. The application issues the call open(...) to the operating system, which opens the file (a region of hard disk space) and returns a file object that is assigned to the variable f
f = open('a.txt', 'r', encoding='utf-8')  # The default open mode is r

# 2. Calling the read/write methods of the file object is translated by the operating system into hard disk reads/writes
data = f.read()

# 3. Ask the operating system to close the file and reclaim the system resources
f.close()

Illustrations: file objects

2.2 Resource reclamation and the with context manager

An open file involves two kinds of resources: the application-level variable f and the file opened by the operating system. When we are done with a file, both must be reclaimed, as follows:

1. f.close()  # Reclaim the file resource opened by the operating system
2. del f      # Reclaim the application-level variable

del f must come after f.close(). Once the name f is gone we can no longer call f.close() ourselves, and the file opened by the operating system may stay open and keep occupying resources.
Python's automatic garbage collection means we do not need to worry about del f, but we still have to remember f.close() after working with a file. Even with this emphasis, most readers will still forget f.close(), so Python provides the with keyword to manage the context for us:

# 1. After the indented block finishes, with automatically calls f.close()
with open('a.txt','w') as f:
    pass

# 2. with can open several files at once, separated by commas
with open('a.txt','r') as read_f,open('b.txt','w') as write_f:
    data = read_f.read()
    write_f.write(data)
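
As a rough illustration of what with saves us from writing, the sketch below compares a manual try/finally version with the with version. The scratch file path is hypothetical and created with tempfile purely for the demonstration.

```python
import os
import tempfile

# Hypothetical scratch file so the example does not touch real data.
path = os.path.join(tempfile.mkdtemp(), 'a.txt')

# Manual version: close() sits in finally so it runs even if write() raises.
f = open(path, 'w', encoding='utf-8')
try:
    f.write('hello\n')
finally:
    f.close()
manual_closed = f.closed

# with version: the file is closed automatically when the block exits.
with open(path, 'r', encoding='utf-8') as g:
    data = g.read()
with_closed = g.closed

print(manual_closed, with_closed, repr(data))
```

Both file objects report closed afterwards; the with form simply removes the chance of forgetting the close() call.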


2.3 Specifying the character encoding for text files

f = open(...) has the operating system open the file. If the file is a text file, character encoding comes into play. If no encoding is passed to open, the default is decided by the platform rather than by Python: for example GBK on a Chinese-locale Windows system and UTF-8 on most Linux systems.
This draws on the character encoding knowledge from the previous lesson: to avoid garbled text, a file must be opened with the same encoding it was saved with.

f = open('a.txt','r',encoding='utf-8')
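
The effect of a mismatched encoding can be seen with a small experiment. This is only a sketch: the scratch file name is hypothetical, and errors='replace' is used so that the wrong decoding produces garbled characters instead of stopping with an exception.

```python
import locale
import os
import tempfile

text = '你好'
path = os.path.join(tempfile.mkdtemp(), 'encoding_demo.txt')  # hypothetical scratch file

# Save the text as UTF-8.
with open(path, 'w', encoding='utf-8') as f:
    f.write(text)

# Opening with the encoding the file was saved in gives the text back.
with open(path, 'r', encoding='utf-8') as f:
    same = f.read()

# Decoding the same bytes as GBK yields garbled characters instead.
with open(path, 'rb') as f:
    garbled = f.read().decode('gbk', errors='replace')

# The locale's preferred encoding, which open() normally falls back to
# when no encoding is given; it varies by platform.
default_encoding = locale.getpreferredencoding(False)

print(same == text, garbled == text, default_encoding)
```

Passing encoding explicitly, as in the line above, keeps the result independent of whichever platform the script runs on.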


III. File operation modes

3.1 Modes that control reading and writing

r (default): read only
w: write only
a: append only

3.1.1 Case 1: using r mode

# r (read-only) mode: if the file does not exist, an error is raised. If it exists, the file pointer starts at the beginning of the file
with open('a.txt',mode='r',encoding='utf-8') as f:
    res=f.read() # The file contents are read from the hard disk into memory and assigned to res

# Small exercise: implement user authentication
inp_name=input('Please enter your name: ').strip()
inp_pwd=input('Please enter your password: ').strip()
with open(r'db.txt',mode='r',encoding='utf-8') as f:
    for line in f:
        # Compare the name and password the user entered with each name:password record in the file
        u,p=line.strip('\n').split(':')
        if inp_name == u and inp_pwd == p:
            print('Login successful')
            break
    else:
        # The for loop finished without break, so no record matched
        print('Wrong account name or password')
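
The same check can be wrapped in a function so it can be tried without typing at a prompt. This is a sketch under assumptions: check_login and the sample records are hypothetical, and splitting only on the first ':' is a choice made here so that a password may itself contain a colon.

```python
import os
import tempfile

def check_login(db_path, name, pwd):
    """Return True if a name:pwd record in the file at db_path matches."""
    with open(db_path, mode='r', encoding='utf-8') as f:
        for line in f:
            line = line.strip('\n')
            if ':' not in line:
                continue  # skip blank or malformed lines
            u, p = line.split(':', 1)  # split once so passwords may contain ':'
            if name == u and pwd == p:
                return True
    return False

# Hypothetical sample records written to a scratch file.
db_path = os.path.join(tempfile.mkdtemp(), 'db.txt')
with open(db_path, mode='w', encoding='utf-8') as f:
    f.write('alice:123\nbob:a:b\n')

print(check_login(db_path, 'alice', '123'))
print(check_login(db_path, 'alice', 'wrong'))
```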

3.1.2 Case 2: using w mode

# w (write-only) mode: if the file does not exist, an empty file is created. If it exists, it is emptied. The file pointer starts at the beginning of the file
with open('b.txt',mode='w',encoding='utf-8') as f:
    f.write('Hello\n')
    f.write('I am good.\n')
    f.write('Hello everyone\n')
    f.write('111\n222\n333\n')
# Emphasis:
# 1. As long as the file has not been closed, consecutive writes follow one another: later content comes right after earlier content
# 2. Reopening the file in w mode empties its contents
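
Both points can be checked with a scratch file. The path below is hypothetical and exists only for this sketch.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'b.txt')  # hypothetical scratch file

# Writes made while the file stays open follow one another.
with open(path, mode='w', encoding='utf-8') as f:
    f.write('first\n')
    f.write('second\n')
with open(path, mode='r', encoding='utf-8') as f:
    after_first_open = f.read()

# Reopening in w mode empties the file before anything is written.
with open(path, mode='w', encoding='utf-8') as f:
    f.write('third\n')
with open(path, mode='r', encoding='utf-8') as f:
    after_reopen = f.read()

print(repr(after_first_open))
print(repr(after_reopen))
```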

3.1.3 Case 3: using a mode

# a (append-only) mode: if the file does not exist, an empty file is created. If it exists, the file pointer moves straight to the end of the file
with open('c.txt',mode='a',encoding='utf-8') as f:
    f.write('44444\n')
    f.write('55555\n')
# Emphasis: similarities and differences between w mode and a mode
# 1. Similarity: as long as the file has not been closed, consecutive writes follow one another
# 2. Difference: reopening the file in a mode does not empty the original contents; the file pointer moves straight to the end, so new content is always written at the end

# Small exercise: implement user registration
name=input('username>>>: ').strip()
pwd=input('password>>>: ').strip()
with open('db1.txt',mode='a',encoding='utf-8') as f:
    info='%s:%s\n' %(name,pwd)
    f.write(info)
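
To see that records survive across separate opens in a mode, the registration step can be put in a function and called twice. The function name register and the sample users are hypothetical.

```python
import os
import tempfile

def register(db_path, name, pwd):
    """Append one name:pwd record to the file at db_path."""
    with open(db_path, mode='a', encoding='utf-8') as f:
        f.write('%s:%s\n' % (name, pwd))

db_path = os.path.join(tempfile.mkdtemp(), 'db1.txt')  # hypothetical scratch file

# Each call opens the file again; earlier records are kept.
register(db_path, 'alice', '123')
register(db_path, 'bob', '456')

with open(db_path, mode='r', encoding='utf-8') as f:
    records = f.read().splitlines()

print(records)
```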

3.1.4 Case 4: using + modes (for awareness)

# r+ w+ a+: readable and writable
# In everyday work we normally use only r/w/a, either read only or write only, and rarely use the readable-and-writable modes
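
For reference, here is a small sketch of how the three + modes differ when opening an existing file. The scratch file is hypothetical; the behaviour shown (r+ keeps content and writes from the start, a+ writes at the end, w+ empties the file) follows from the base modes described above.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'plus.txt')  # hypothetical scratch file
with open(path, 'w', encoding='utf-8') as f:
    f.write('abcdef')

# r+: existing content is kept; writing starts at the beginning and overwrites.
with open(path, 'r+', encoding='utf-8') as f:
    f.write('XY')
    f.seek(0)
    r_plus = f.read()

# a+: writes go to the end; seek(0) is needed before reading from the start.
with open(path, 'a+', encoding='utf-8') as f:
    f.write('!')
    f.seek(0)
    a_plus = f.read()

# w+: the file is emptied on open, so only the new content remains.
with open(path, 'w+', encoding='utf-8') as f:
    f.write('new')
    f.seek(0)
    w_plus = f.read()

print(r_plus, a_plus, w_plus)
```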


3.2 Modes that control the type of content read and written

Major premise: t and b cannot be used alone; each must be combined with one of r/w/a
t (default): text mode
    1. Reads and writes are in strings (str)
    2. Only for text files
    3. The encoding parameter should be specified (otherwise the platform default is used)
b: binary mode
    1. Reads and writes are in bytes
    2. Works for all files
    3. The encoding parameter must not be specified

3.2.1 Case 1: using t mode

# t mode: if we specify r/w/a as the open mode, it defaults to rt/wt/at
with open('a.txt',mode='rt',encoding='utf-8') as f:
    res=f.read()
    print(type(res)) # Output: <class 'str'>

with open('a.txt',mode='wt',encoding='utf-8') as f:
    s='abc'
    f.write(s) # What is written must also be a string

# Emphasis: t mode can only be used for text files, and both reads and writes work with strings. Storage on the hard disk is always binary; in t mode the encoding and decoding are done for us internally

3.2.2 Case 2: using b mode

# b: read and write in bytes
with open('1.mp4',mode='rb') as f:
    data=f.read()
    print(type(data)) # Output: <class 'bytes'>

with open('a.txt',mode='wb') as f:
    msg="Hello"
    res=msg.encode('utf-8') # res is of type bytes
    f.write(res) # Only bytes can be written to a file in b mode

# Emphasis: b mode vs. t mode
# 1. For plain text files, t mode handles encoding and decoding for us, while b mode requires doing it by hand, so t mode is more convenient here
# 2. For non-text files (such as pictures, video, audio), only b mode can be used

# Small exercise: write a file copy tool
src_file=input('Source file path: ').strip()
dst_file=input('Destination file path: ').strip()
with open(src_file,mode='rb') as read_f,open(dst_file,mode='wb') as write_f:
    # Iterating in b mode yields chunks of bytes split at b'\n'
    for line in read_f:
        write_f.write(line)
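
A binary file may contain very few newline bytes, in which case a single "line" can be large. One possible variation, sketched here with a hypothetical copy_file function and an arbitrary default chunk size, reads fixed-size chunks instead. The standard library's shutil.copyfileobj works along similar lines.

```python
import os
import tempfile

def copy_file(src_path, dst_path, chunk_size=64 * 1024):
    """Copy src_path to dst_path, holding at most chunk_size bytes in memory."""
    with open(src_path, mode='rb') as read_f, open(dst_path, mode='wb') as write_f:
        while True:
            chunk = read_f.read(chunk_size)
            if not chunk:  # empty bytes means end of file
                break
            write_f.write(chunk)

# Try it on a hypothetical scratch file filled with random bytes.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, 'src.bin')
dst = os.path.join(workdir, 'dst.bin')
payload = os.urandom(200000)
with open(src, mode='wb') as f:
    f.write(payload)

copy_file(src, dst, chunk_size=4096)

with open(dst, mode='rb') as f:
    copied = f.read()
print(copied == payload)
```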


IV. File object methods

4.1 Key methods

# Read operations
f.read()  # Read all the content; afterwards the file pointer is at the end of the file
f.readline()  # Read one line; afterwards the file pointer is at the start of the next line
f.readlines()  # Read every line and return them as a list

# Emphasis:
# f.read() and f.readlines() both read the whole content into memory at once. If the content is too large, this can exhaust memory. To process all of the content anyway, read it in several passes. There are two ways to do this:
# Method one
with open('a.txt',mode='rt',encoding='utf-8') as f:
    for line in f:
        print(line) # Only one line is held in memory at a time

# Method two
with open('1.mp4',mode='rb') as f:
    while True:
        data=f.read(1024) # At most 1024 bytes are read into memory at a time
        if len(data) == 0:
            break
        print(data)
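
To make the relationship between the read methods concrete, the sketch below reads the same hypothetical three-line scratch file in two ways: readline followed by readlines, and plain iteration.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'lines.txt')  # hypothetical scratch file
with open(path, mode='wt', encoding='utf-8') as f:
    f.write('line1\nline2\nline3\n')

with open(path, mode='rt', encoding='utf-8') as f:
    first = f.readline()   # reads one line; the pointer moves to the next line
    rest = f.readlines()   # reads everything that is left, as a list

# Iterating over the file object yields the same lines one at a time.
with open(path, mode='rt', encoding='utf-8') as f:
    iterated = [line for line in f]

print(repr(first), rest, iterated)
```

Because readlines starts from the current pointer position, it returns only the lines that readline has not consumed yet.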

# Write operations
f.write('1111\n222\n')  # t mode: you must write the line breaks yourself
f.write('1111\n222\n'.encode('utf-8'))  # b mode: you must write the line breaks yourself
f.writelines(['333\n','444\n'])  # t mode
f.writelines([bytes('333\n',encoding='utf-8'),'444\n'.encode('utf-8')])  # b mode
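
Despite its name, writelines does not add line breaks either. A short sketch with a hypothetical scratch file shows the difference between passing bare strings and passing strings that end in '\n'.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'writelines.txt')  # hypothetical scratch file
lines = ['333', '444']

# No separators are added: the items are written back to back.
with open(path, mode='wt', encoding='utf-8') as f:
    f.writelines(lines)
with open(path, mode='rt', encoding='utf-8') as f:
    joined = f.read()

# Appending '\n' to each item gives one item per line.
with open(path, mode='wt', encoding='utf-8') as f:
    f.writelines(line + '\n' for line in lines)
with open(path, mode='rt', encoding='utf-8') as f:
    separated = f.read()

print(repr(joined), repr(separated))
```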


4.2 Good to know

f.readable()  # Whether the file is readable
f.writable()  # Whether the file is writable
f.closed  # Whether the file is closed
f.encoding  # The text encoding in use; not available if the file was opened in b mode
f.flush()  # Immediately flush buffered content from memory to the hard disk
f.name  # The file name or path that was passed to open()
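
These attributes are easy to inspect on a scratch file. The sketch below uses a hypothetical path; it also opens a second handle after flush() to confirm the written content has reached the file.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'attrs.txt')  # hypothetical scratch file

with open(path, mode='wt', encoding='utf-8') as f:
    info = (f.readable(), f.writable(), f.closed)
    encoding_name = f.encoding
    name_matches = (f.name == path)
    f.write('x')
    f.flush()  # push the buffered 'x' out to the file right away
    with open(path, mode='rt', encoding='utf-8') as g:
        seen_after_flush = g.read()
closed_after = f.closed

# A file opened in b mode has no encoding attribute.
with open(path, mode='rb') as b:
    has_encoding = hasattr(b, 'encoding')

print(info, encoding_name, name_matches, seen_after_flush, closed_after, has_encoding)
```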


V. Actively controlling the file pointer

# Major premise: the file pointer moves in bytes. The one exception is read(n) in t mode, where n counts characters
with open('a.txt',mode='rt',encoding='utf-8') as f:
    data=f.read(3) # Read 3 characters

with open('a.txt',mode='rb') as f:
    data=f.read(3) # Read 3 bytes

# So far, the pointer has moved only passively, as a side effect of read/write operations. To read data at a specific position in a file, use f.seek to move the pointer actively. Usage in detail:
# f.seek(number of bytes to move the pointer, mode)
# Mode:
# 0: the default; the offset is counted from the beginning of the file
# 1: the offset is counted from the current position
# 2: the offset is counted from the end of the file
# Emphasis: mode 0 can be used in both t mode and b mode, while modes 1 and 2 with a nonzero offset can only be used in b mode
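
The three modes can be tried side by side. In this sketch the scratch file is hypothetical, and the constants os.SEEK_SET, os.SEEK_CUR and os.SEEK_END are used as readable names for 0, 1 and 2. The last part shows what happens when an end-relative seek with a nonzero offset is attempted in t mode.

```python
import io
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'seek.txt')  # hypothetical scratch file
with open(path, mode='wb') as f:
    f.write(b'abcdefghi')  # 9 bytes

with open(path, mode='rb') as f:
    f.seek(3, os.SEEK_SET)   # mode 0: 3 bytes from the beginning
    pos_set = f.tell()
    f.seek(2, os.SEEK_CUR)   # mode 1: 2 more bytes from the current position
    pos_cur = f.tell()
    f.seek(-1, os.SEEK_END)  # mode 2: 1 byte before the end
    pos_end = f.tell()

# In t mode, a nonzero offset relative to the end is refused.
with open(path, mode='rt', encoding='utf-8') as f:
    try:
        f.seek(-1, os.SEEK_END)
        nonzero_allowed = True
    except io.UnsupportedOperation:
        nonzero_allowed = False

print(pos_set, pos_cur, pos_end, nonzero_allowed)
```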


5.1 case 1: 0 mode explanation

# a.txt is encoded with utf-8, and the contents are as follows (1 byte for abc, 3 bytes for Chinese "hello")
abc Hello

# Use of 0 mode
with open('a.txt',mode='rt',encoding='utf-8') as f:
    f.seek(3,0)     # 3 bytes moved at the beginning of the reference file
    print(f.tell()) # View the position of the current file pointer from the beginning of the file. The output is 3
    print(f.read()) # Read from the position of the third byte to the end of the file, and the output is: Hello
    # Note: in t mode, the read content will be decoded automatically, so it must be ensured that the read content is a complete Chinese data, otherwise the decoding fails

with open('a.txt',mode='rb') as f:
    f.seek(6,0)
    print(f.read().decode('utf-8')) #Output: good


5.2 Case 2: mode 1 in detail

# Using mode 1
with open('a.txt',mode='rb') as f:
    f.seek(3,1)     # Move 3 bytes toward the end from the current position, which is the beginning of the file
    print(f.tell()) # Output: 3
    f.seek(4,1)     # Move another 4 bytes toward the end from the current position, which is 3
    print(f.tell()) # Output: 7


5.3 case 3: 2 model details

# a.txt is encoded with utf-8, and the contents are as follows (1 byte for abc, 3 bytes for Chinese "hello")
abc Hello

# 2 use of mode
with open('a.txt',mode='rb') as f:
    f.seek(0,2)     # Move 0 bytes to the end of the reference file, that is, skip to the end of the file directly
    print(f.tell()) # Output: 9
    f.seek(-3,2)     # Three bytes forward at the end of the reference file
    print(f.read().decode('utf-8')) # Output: good

# Small exercise: follow a log file and print new lines as they are appended (like tail -f)
import time
with open('access.log',mode='rb') as f:
    f.seek(0,2)  # Start at the end so that only new content is shown
    while True:
        line=f.readline()
        if len(line) == 0:
            # No new content yet; wait before checking again
            time.sleep(0.5)
        else:
            print(line.decode('utf-8'),end='')
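
The loop above runs forever, which makes it awkward to try out. The sketch below isolates the "read whatever has been appended since last time" step in a hypothetical helper, read_new_lines, and simulates another program appending to a scratch log file.

```python
import os
import tempfile

def read_new_lines(f):
    """Return the lines appended to the binary file f since the last read."""
    lines = []
    while True:
        line = f.readline()
        if len(line) == 0:  # nothing more to read for now
            break
        lines.append(line.decode('utf-8'))
    return lines

path = os.path.join(tempfile.mkdtemp(), 'access.log')  # hypothetical scratch log
with open(path, mode='wb') as f:
    f.write(b'old entry\n')

with open(path, mode='rb') as reader:
    reader.seek(0, 2)                      # skip what is already in the log
    nothing_yet = read_new_lines(reader)   # no new content so far
    with open(path, mode='ab') as writer:  # simulate another program appending
        writer.write(b'new entry\n')
    new_lines = read_new_lines(reader)

print(nothing_yet, new_lines)
```

Note that a line still being written (without its final '\n') would be returned as a partial line by this helper.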


VI. Modifying files

# The contents of file a.txt are as follows (shown in translation; the offset below assumes the original utf-8 Chinese text, in which the name at the start of the first line takes 9 bytes)
Zhang Yidan Shandong 179 49 12344234523
Li Erdan Hebei 163 57 13913453521
Wang Quandan Shanxi 153 62 18651433422

# Perform the operation
with open('a.txt',mode='r+t',encoding='utf-8') as f:
    f.seek(9)  # Skip the 9 bytes of the name
    f.write('<women director>')

# The contents of the file afterwards: the written text has overwritten the bytes that followed the name
Zhang Yidan<women director> 179 49 12344234523
Li Erdan Hebei 163 57 13913453521
Wang Quandan Shanxi 153 62 18651433422

# Emphasis:
# 1. Data on the hard disk cannot be modified in place; it can only be updated by overwriting old content with new content
# 2. Data in memory can be modified freely
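
The overwrite behaviour is easiest to see with plain ASCII data, where one character is one byte. The scratch file below is hypothetical.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'overwrite.txt')  # hypothetical scratch file
with open(path, mode='wt', encoding='utf-8') as f:
    f.write('abcdefgh')

# Writing at byte 2 replaces 'cde'; nothing is pushed further along.
with open(path, mode='r+t', encoding='utf-8') as f:
    f.seek(2)
    f.write('XYZ')

with open(path, mode='rt', encoding='utf-8') as f:
    result = f.read()

print(result)  # abXYZfgh
```

The file is still 8 bytes long afterwards: the new text took the place of the old text rather than being inserted before it.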

A file corresponds to hard disk space, and hard disk space cannot be modified in place, so by nature a file cannot be modified either.
Yet we do see file contents being modified. How is that implemented?
The general idea is to read the file's contents from the hard disk into memory, modify them in memory, and then write the result back to the hard disk, overwriting the old contents.
There are two ways to implement this:

6.1 File modification method 1

# Idea: read the whole file into memory at once, modify it in memory, then overwrite the original file with the result
# Advantage: only one copy of the data exists on disk during the modification
# Disadvantage: uses a lot of memory when the file is large
with open('db.txt',mode='rt',encoding='utf-8') as f:
    data=f.read()

with open('db.txt',mode='wt',encoding='utf-8') as f:
    f.write(data.replace('kevin','SB'))

6.2 File modification method 2

# Idea: open the original file for reading and a temporary file for writing, read the original line by line, write each modified line to the temporary file, then delete the original file and rename the temporary file to the original name
# Advantage: does not use much memory
# Disadvantage: two copies of the data exist on disk during the modification
import os

with open('db.txt',mode='rt',encoding='utf-8') as read_f,\
        open('.db.txt.swap',mode='wt',encoding='utf-8') as write_f:
    for line in read_f:
        write_f.write(line.replace('SB','kevin'))

os.remove('db.txt')
os.rename('.db.txt.swap','db.txt')
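
The second method can be packaged as a reusable function. This is a sketch, not the only way to do it: the name replace_in_file and the '.swap' suffix are hypothetical, and os.replace is swapped in for the os.remove plus os.rename pair because it overwrites the destination in a single step.

```python
import os
import tempfile

def replace_in_file(path, old, new):
    """Rewrite the text file at path line by line, replacing old with new."""
    swap_path = path + '.swap'  # hypothetical temporary file name
    with open(path, mode='rt', encoding='utf-8') as read_f, \
            open(swap_path, mode='wt', encoding='utf-8') as write_f:
        for line in read_f:
            write_f.write(line.replace(old, new))
    # Replace the original with the finished temporary file.
    os.replace(swap_path, path)

# Try it on a hypothetical scratch file.
path = os.path.join(tempfile.mkdtemp(), 'db.txt')
with open(path, mode='wt', encoding='utf-8') as f:
    f.write('kevin:123\nalice:456\nkevin:789\n')

replace_in_file(path, 'kevin', 'bob')

with open(path, mode='rt', encoding='utf-8') as f:
    updated = f.read()
print(updated)
```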

