Python file operation - reading and writing of text file, binary file, CSV file, OS, shutil, CSV module, common character encoding

Posted by gnathan87 on Mon, 18 Oct 2021 05:59:27 +0200

Text and binary files

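Text file
A text file stores ordinary character text, encoded with some character encoding, and can be opened with a text editor such as Notepad.
Binary file
A binary file stores its content as raw bytes (images, audio, executables and so on). Notepad cannot display it meaningfully; it needs a program that understands the format.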

File operation related modules

name	description
io module	File stream input and output operations
os module	Basic operating system functions, including file operation
glob module	Find file pathnames that match specific rules
fnmatch module	Use patterns to match file pathnames
fileinput module	Process multiple input files
filecmp module	For file comparison
csv module	For CSV file processing
pickle (and cPickle in Python 2)	Used for serialization and deserialization
xml package	For XML data processing
bz2,gzip,zipfile,zlib,tarfile	Used to process compressed and decompressed files (corresponding to different algorithms)

open() creates a file object

The open() function creates a file object:
open(filename[, mode])
Windows paths contain backslashes, which would otherwise have to be escaped as "\\". To avoid this, use a raw string: r"d:\b.txt"

mode	description
r	Read mode (the default)
w	Write mode. If the file does not exist, it is created; if the file exists, its existing content is discarded and replaced by the new content
a	Append mode. If the file does not exist, it is created; if the file exists, content is appended to the end of the file
b	Binary mode (can be combined with other modes)
+	Read and write mode (can be combined with other modes)

Note: creation of text file object and binary file object:
If the mode "b" is not given, a text file object is created by default, and the basic unit of processing is the character.
If the binary mode "b" is given, a binary file object is created, and the basic unit of processing is the byte.
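
The difference between the two kinds of file object can be seen in a minimal sketch like the one below. The file name demo.txt and the temporary directory are assumptions made only for illustration.

```python
import os
import tempfile

# Hypothetical scratch file inside a temporary directory
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# Text mode: write() takes a str and read() returns a str
with open(path, "w", encoding="utf-8") as f:
    f.write("abc\u4e2d")          # "\u4e2d" is one Chinese character

with open(path, "r", encoding="utf-8") as f:
    text = f.read()               # str with 4 characters

# Binary mode: read() returns bytes
with open(path, "rb") as f:
    data = f.read()               # 6 bytes: the Chinese character takes 3 bytes in UTF-8

print(type(text), len(text))
print(type(data), len(data))
```

The same file therefore has a length of 4 when counted in characters and 6 when counted in bytes.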

Common properties and methods of file objects

File object properties:

attribute	explain
name	Returns the name of the file
mode	Returns the open mode of the file
closed	Returns True if the file is closed

File object open mode:

pattern	explain
r	Read mode
w	Write mode
a	append mode
b	Binary mode (can be combined with other modes)
+	Read / write mode (other modes can be combined)

Common methods for file objects:

Method name	explain
read([size])	Read the contents of size bytes or characters from the file and return. If [size] is omitted, it will be read to the end of the file, that is, all contents of the file will be read at one time
readline()	Read a line from a text file
readlines()	Each line in the text file is treated as an independent string object, and these objects are returned in the list
write(str)	Writes the string str contents to a file
writelines(s)	Writes the string list s to the file without adding line breaks
seek(offset[,whence])	Moves the file pointer to a new position. offset is the number of bytes to move relative to whence: a positive offset moves toward the end of the file, a negative one toward the start. whence selects the reference point: 0: from the start of the file (default); 1: from the current position; 2: from the end of the file. Files opened in text mode generally only support seeking from the start of the file
tell()	Returns the current position of the file pointer
truncate([size])	Resizes the file to size bytes wherever the pointer is: only the first size bytes are kept and the rest is deleted. If size is omitted, the file is cut at the current pointer position, so everything after the pointer is deleted
flush()	Writes the contents of the buffer to the file without closing the file
close()	Write the contents of the buffer to the file, close the file at the same time, and release the resources related to the file object
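
As an illustration of seek(), tell() and truncate(), here is a small sketch that works on a scratch binary file. The file name seek_demo.bin is made up for the example.

```python
import os
import tempfile

# Hypothetical scratch file inside a temporary directory
path = os.path.join(tempfile.mkdtemp(), "seek_demo.bin")

with open(path, "wb+") as f:      # binary read/write mode, file starts empty
    f.write(b"0123456789")
    end_pos = f.tell()            # pointer is now after the 10 written bytes

    f.seek(3)                     # whence=0 (default): 3 bytes from the start
    chunk = f.read(4)             # reads b"3456"

    f.seek(-2, 2)                 # whence=2: 2 bytes before the end
    tail = f.read()               # reads b"89"

    f.truncate(5)                 # keep only the first 5 bytes
    f.seek(0)
    remaining = f.read()          # reads b"01234"

print(end_pos, chunk, tail, remaining)
```

Binary mode is used here because negative and end-relative offsets are not generally available on text-mode files.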

pickle serialization

Serialization refers to converting objects into "serialized" data form, storing them on hard disk or transmitting them to other places through network. Deserialization refers to the reverse process of converting the read "serialized data" into objects.

The functions in pickle module are used to realize serialization and deserialization.

Serialize & deserialize:
pickle.dump(obj, file): obj is the object to be serialized, and file is the file object to store it in (opened in binary write mode, "wb")
pickle.load(file): reads data from file (opened in binary read mode, "rb") and deserializes it into an object
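
A minimal round trip with pickle.dump() and pickle.load() might look like the sketch below. The record dictionary and the file name data.pkl are invented for the example.

```python
import os
import pickle
import tempfile

# Hypothetical scratch file and sample object
path = os.path.join(tempfile.mkdtemp(), "data.pkl")
record = {"name": "Wang Ming", "age": 18, "scores": [90, 85]}

# Serialize: the file must be opened in binary write mode
with open(path, "wb") as f:
    pickle.dump(record, f)

# Deserialize: the file must be opened in binary read mode
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == record)   # same content
print(restored is record)   # but a new, separate object
```

Note that pickle.load() can execute arbitrary code while deserializing, so it should only be used on data from a trusted source.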

Text file reading and writing

Text file writing steps

There are three steps to writing:
1. Create file object
2. Write data
3. Close the file object

write()/writelines() writes data

write(a): write the string a to the file. writelines(b): write the string list to the file without adding line breaks
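
The three steps and the two write methods can be sketched as follows. The file name write_demo.txt is an assumption for the example; note that the line breaks have to be supplied by the caller.

```python
import os
import tempfile

# Hypothetical scratch file inside a temporary directory
path = os.path.join(tempfile.mkdtemp(), "write_demo.txt")

f = open(path, "w", encoding="utf-8")    # 1. create the file object
f.write("first line\n")                  # 2. write a single string
f.writelines(["second\n", "third\n"])    #    write a list of strings; no "\n" is added for us
f.close()                                # 3. close the file object

# Read the file back to check what was written
check = open(path, encoding="utf-8")
content = check.read()
check.close()
print(repr(content))
```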

close() closes the file stream

An open file object must explicitly call the close() method to close the file object. When the close() method is called, the buffer data will be written to the file first (or the flush() method can be called directly), and then the file will be closed to release the file object.
To ensure that an open file object is always closed, even when an exception occurs, close() is usually called in a finally block, or the file is opened with the with statement.

with statement (context manager)

The with statement manages context resources automatically. However the with block is left, whether normally or through an exception, the file is guaranteed to be closed correctly, and the context that existed on entry is restored once the block has finished executing.
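
The two ways of guaranteeing that a file is closed can be compared in a short sketch. The file name with_demo.txt and the simulated error are assumptions for the example.

```python
import os
import tempfile

# Hypothetical scratch file inside a temporary directory
path = os.path.join(tempfile.mkdtemp(), "with_demo.txt")

# Variant 1: try/finally, close() is called by hand
f = open(path, "w", encoding="utf-8")
try:
    f.write("written with try/finally\n")
finally:
    f.close()

# Variant 2: with, the file is closed when the block ends
with open(path, "a", encoding="utf-8") as g:
    g.write("written with a with block\n")

# The file is closed even if the block is left through an exception
try:
    with open(path, encoding="utf-8") as h:
        raise ValueError("simulated failure")
except ValueError:
    pass

print(f.closed, g.closed, h.closed)
```

The with form is usually preferred because it is shorter and the close() call cannot be forgotten.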

Reading of text files

Generally, there are three methods:
1. read([size]) reads size characters from the file and returns them as results. If there is no size parameter, the entire file is read. Reading to the end of the file returns an empty string.
2. readline() reads a line and returns it as a result. Reading to the end of the file returns an empty string.
3. readlines() reads every line of the text file, stores each line as a string in a list, and returns the list.
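
The three reading methods can be tried on a small sample file, as in this sketch. The file name read_demo.txt and its content are made up for the example.

```python
import os
import tempfile

# Hypothetical scratch file with three lines
path = os.path.join(tempfile.mkdtemp(), "read_demo.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("line one\nline two\nline three\n")

with open(path, encoding="utf-8") as f:
    first_four = f.read(4)        # the first 4 characters
    rest_of_line = f.readline()   # the rest of the first line, including "\n"
    remaining = f.readlines()     # the remaining lines as a list of strings
    at_eof = f.read()             # empty string: the end of the file was reached

print(repr(first_four), repr(rest_of_line), remaining, repr(at_eof))
```

Each call continues from where the previous one stopped, because all of them share the same file pointer.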

Binary file reading and writing

The processing flow of binary files is the same as that of text files. However, you need to specify a binary mode to create a binary file object.

f = open(r"d:\a.txt", 'wb') #Writable binary object; existing content is overwritten
f = open(r"d:\a.txt", 'ab') #Writable binary object in append mode
f = open(r"d:\a.txt", 'rb') #Readable binary object

After creating a binary file object, you can still use write() and read() to write and read the file; they then take and return bytes instead of strings.
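
A typical use of binary mode is copying a file byte for byte. The sketch below copies a generated file in chunks; the file names and the 1024-byte chunk size are arbitrary choices for the example.

```python
import os
import tempfile

# Hypothetical source and destination files inside a temporary directory
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "src.bin")
dst = os.path.join(workdir, "dst.bin")

# Create a 2560-byte source file containing every byte value
with open(src, "wb") as f:
    f.write(bytes(range(256)) * 10)

# Copy in chunks so that large files do not have to fit in memory
with open(src, "rb") as fin, open(dst, "wb") as fout:
    while True:
        chunk = fin.read(1024)
        if not chunk:             # empty bytes object: end of file
            break
        fout.write(chunk)

copied_size = os.path.getsize(dst)
print(copied_size)
```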

CSV file reading and writing

CSV (Comma-Separated Values) is a comma-separated text format commonly used for data exchange and for importing and exporting Excel files and database data. Unlike Excel files, CSV files have the following characteristics:
1. The value has no type, and all values are strings
2. Font color and other styles cannot be specified
3. The width and height of cells cannot be specified, and cells cannot be merged
4. There are no multiple worksheets
5. Image chart cannot be embedded

For example, an Excel table with a header row and three data rows, saved in CSV format and opened with Notepad, looks like this:
Name,Telephone,Address
Xiaoming,18889303000,Jinfeng Road
Xiaohong,18829920000,Wuyuan Road
Wang Ming,16668829922,Fengtian Road

csv module

The module csv of Python standard library provides objects for reading and writing csv format files

csv.reader object (csv file reading)

import csv
#newline="" lets the csv module handle line endings itself
with open(r"e:\a.csv", newline="") as a:
    a_csv = csv.reader(a) #Create a reader object; iterating over it yields each row as a list of strings
    headers = next(a_csv) #Get the first row (the header row) as a list
    print(headers)
    for row in a_csv: #Loop over the remaining rows
        print(row)

##Output
['Name', 'Telephone', 'Address']
['Xiaoming', '18889303000', 'Jinfeng Road']
['Xiaohong', '18829920000', 'Wuyuan Road']
['Wang Ming', '16668829922', 'Fengtian Road']

csv.writer object (csv file write)

import csv
headers = ["Job number","Name","Age","Address","Monthly salary"]
rows = [("1001","Wang Ming",18,"Xisanqi No. 1 hospital","50000"),("1002","Gao Ba",19,"Xisanqi No. 1 hospital","30000")]
#newline="" prevents blank lines between rows on Windows
with open(r"d:\b.csv","w",newline="") as b:
    b_csv = csv.writer(b) #Create a writer object
    b_csv.writerow(headers) #Write one row (the header)
    b_csv.writerows(rows) #Write multiple rows (the data)

os module

os module can help us operate the operating system directly.

os.system (execute system command)

import os
os.system("ping www.baidu.com")

Note: on Chinese-language Windows the command output is GBK-encoded, so Chinese characters may appear garbled; if so, set the IDE's console encoding to GBK.

os.startfile (open a file or start an executable directly; Windows only)

#Start WeChat
import os
os.startfile(r"C:\Program Files (x86)\Tencent\WeChat\WeChat.exe")

os module - file and directory related operations

Common file operations:

Method name	describe
remove(path)	Delete the specified file
rename(src,dest)	Rename a file or directory
stat(path)	Returns all properties of the file
listdir(path)	Returns the list of files and directories in the path directory

Common directory operations:

Method name	describe
mkdir(path)	Create directory
makedirs(path1/path2/path3/...)	Create a multi-level directory
rmdir(path)	Delete an empty directory
removedirs(path1/path2...)	Delete a multi-level directory (each level must be empty)
getcwd()	Returns the current working directory
chdir(path)	Set path as the current working directory
walk()	Traverse the directory tree
sep	The path separator used by the current operating system

#coding=gbk
#Test the file directory related operations in the os module
import os
#############Get information about files and folders################
print (os.name) #Windows -> 'nt'; Linux and Unix -> 'posix'
print (os.sep) #Windows -> \ ; Linux and Unix -> /
print (repr(os.linesep)) #Windows -> '\r\n'; Linux -> '\n'
print(os.stat("main.py"))

##############About working directory operations###############
#Note: relative paths are relative to the current working directory
print(os.getcwd()) #Current working directory
#os.chdir("d:") #Change the current working directory to: d: root directory
#os.mkdir("book".encode("GBK")) #Create directory
#os.rmdir("book") #Delete directory
#os.makedirs("film/Hong Kong and Taiwan/Zhou Xingchi") #Create multi-level directory
#os.removedirs("film/Hong Kong and Taiwan/Zhou Xingchi") #Only empty directories can be deleted
#os.rename("movie", "movie")
# dirs = os.listdir("movie")
# print(dirs)
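
Most of the calls above are commented out because they change the file system. The following sketch runs the same kind of operations inside a temporary directory, so it can be executed safely. All directory and file names in it are invented for the example.

```python
import os
import tempfile

# Everything happens inside a hypothetical temporary directory
base = tempfile.mkdtemp()

# A file directly in base keeps base itself from being pruned by removedirs()
with open(os.path.join(base, "keep.txt"), "w", encoding="utf-8") as f:
    f.write("keep")

nested = os.path.join(base, "film", "hk", "comedy")
os.makedirs(nested)                      # create a multi-level directory

note = os.path.join(nested, "note.txt")
with open(note, "w", encoding="utf-8") as f:
    f.write("hello")

renamed = os.path.join(nested, "readme.txt")
os.rename(note, renamed)                 # rename the file
listing = os.listdir(nested)             # names in the directory
size = os.stat(renamed).st_size          # file size in bytes

os.remove(renamed)                       # delete the file
os.removedirs(nested)                    # delete comedy, hk and film, which are now empty

print(listing, size, os.path.exists(os.path.join(base, "film")))
```

os.removedirs() stops at the first parent directory that is not empty, which is why base survives here.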

os.path module

os.path module provides directory related operations (path judgment, path segmentation, path connection, folder traversal).

method	describe
isabs(path)	Determine whether the path is an absolute path
isdir(path)	Determine whether the path is a directory
isfile(path)	Determine whether the path is a file
exists(path)	Judge whether the file in the specified path exists
getsize(filename)	Returns the size of the file in bytes
abspath(path)	Return absolute path
dirname(path)	Returns the directory part of the path
getatime(filename)	Returns the last access time of the file
getmtime(filename)	Returns the last modification time of the file
walk(top,func,arg)	Traverses directories recursively (Python 2 only; removed in Python 3, use os.walk() instead)
join(path,*paths)	Joins multiple path components
split(path)	Splits the path into a (directory, file name) tuple
splitext(path)	Splits the path into a (root, extension) tuple

##encoding: utf-8
#Common methods of testing os.path
import os.path
#################Obtain basic information of directory and file
print(os.path.isabs("d:/a.txt")) #Absolute path
print(os.path.isdir("d:/a.txt")) #Directory
print(os.path.isfile("d:/a.txt")) #File
print(os.path.exists("a.txt")) #Does the file exist
print(os.path.getsize("a.txt")) #file size
print(os.path.abspath("a.txt")) #Output absolute path
print(os.path.dirname("d:/a.txt")) #Output directory
########Obtain the creation time, access time and last modification time##########
print(os.path.getctime("a.txt")) #Return creation time on Windows (on Unix: time of the last metadata change)
print(os.path.getatime("a.txt")) #Return last access time
print(os.path.getmtime("a.txt")) #Returns the last modification time
################Divide and connect paths############
path = os.path.abspath("a.txt") #Return absolute path
print(os.path.split(path)) #Return tuple: directory, file
##print ('E:\\PythonProject', 'a.txt')
print(os.path.splitext(path)) #Return tuple: path, extension
##print ('E:\\PythonProject\\a', '.txt')
print(os.path.join("aa","bb","cc")) #Return path: aa\bb\cc on Windows, aa/bb/cc on Linux

walk() recursively traverses all files and directories

os.walk() method:
For each directory in the tree, yields a tuple of 3 elements (dirpath, dirnames, filenames)
dirpath: the path of the directory currently being listed
dirnames: the names of all subdirectories in that directory
filenames: the names of all files in that directory

#coding=utf-8
import os
all_files = []
path = os.getcwd()
list_files = os.walk(path)
for dirpath,dirnames,filenames in list_files:
    for dir in dirnames:
        all_files.append(os.path.join(dirpath,dir))
    for name in filenames:
        all_files.append(os.path.join(dirpath,name))
for file in all_files:
    print (file)

shutil module (copy and compression)

The shutil module is mainly used to copy, move and delete files and folders; it can also compress and decompress files and folders.
The os module provides general operations on directories and files. As a supplement, the shutil module provides operations such as moving, copying, compressing and decompressing, which the os module does not provide.

#encoding=gbk
import shutil
import zipfile
#copy file content
#shutil.copyfile("a.txt","a_copy.txt")

#The "music" folder must not already exist, otherwise copytree raises an error
#Copy the contents of the folder "movie/RTHK" to the folder "music". Ignore all html and htm files when copying.
#shutil.copytree("movie/RTHK", "music", ignore=shutil.ignore_patterns("*.html","*.htm"))

#Compress all contents of the "movie/RTHK" folder into movie.zip inside the "music" folder
#shutil.make_archive("music/movie","zip", "movie/RTHK")

#Compress: compress the specified multiple files into a zip file
# z = zipfile.ZipFile("a.zip","w")
# z.write("1.txt")
# z.write("2.txt")
# z.close()
#Decompression:
# z2 = zipfile.ZipFile("a.zip","r")
# z2.extractall("d:/") #Set the decompression address
# z2.close()
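
Since the calls above are commented out and depend on folders that may not exist, here is a self-contained sketch of the same operations that runs inside a temporary directory. The folder and file names are assumptions made for the example.

```python
import os
import shutil
import tempfile
import zipfile

# Hypothetical source folder with two small files
base = tempfile.mkdtemp()
src_dir = os.path.join(base, "movie")
os.makedirs(src_dir)
for name in ("a.txt", "b.html"):
    with open(os.path.join(src_dir, name), "w", encoding="utf-8") as f:
        f.write(name)

# Copy a single file
shutil.copyfile(os.path.join(src_dir, "a.txt"), os.path.join(base, "a_copy.txt"))

# Copy a whole folder, skipping html and htm files; the target must not exist yet
dst_dir = os.path.join(base, "music")
shutil.copytree(src_dir, dst_dir, ignore=shutil.ignore_patterns("*.html", "*.htm"))
copied = sorted(os.listdir(dst_dir))

# Compress the source folder; make_archive returns the path of the zip file
archive = shutil.make_archive(os.path.join(base, "movie_backup"), "zip", src_dir)

# Inspect and decompress the archive with zipfile
with zipfile.ZipFile(archive) as z:
    names = z.namelist()
    z.extractall(os.path.join(base, "unpacked"))

print(copied, archive, names)
```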

Common character coding

ASCII

ASCII code is represented by 7 bits and can only represent 128 characters. The highest bit of one byte ASCII encoding is always 0.

ISO8859-1

ISO-8859-1, also known as Latin-1, is an 8-bit single-byte character set. It uses the highest bit that ASCII leaves at 0, which adds 128 code positions (not all of them are assigned), and it remains backward compatible with ASCII. On top of ASCII it adds the letters and symbols needed for Western European languages; Greek, Thai, Arabic, Hebrew and other scripts are covered by other parts of the ISO 8859 family.

GB2312,GBK,GB18030

GB2312

GB2312, in full the "Chinese character coded character set for information interchange", was released in China in 1980 and is mainly used for Chinese character processing in computer systems. It covers the most commonly used Chinese characters but cannot represent rare characters such as those found in classical Chinese, so later encodings such as GBK and GB18030 appeared.
GB2312 is compatible with ASCII. Bytes with the highest bit set are used in pairs to encode Chinese characters, so it is not compatible with the upper half of ISO8859-1.

GBK

GBK, the Chinese character internal code extension specification, mainly extends GB2312 and is backward compatible with it. It was formulated in 1995.

GB18030

GB18030 is a later national standard, first released in 2000. It uses single-byte, double-byte and four-byte character encodings and is backward compatible with GB2312 and GBK. In practice, GBK and GB2312 remain the most widely used of the three.

Unicode

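Unicode was originally designed as a fixed two-byte encoding in which every character uses 16 bits. It has since grown beyond 16 bits, with code points up to U+10FFFF.
Unicode was designed from scratch: its first 256 code points have the same numbers as ISO8859-1, but the two-byte form is not byte-compatible with ISO8859-1 or any other earlier encoding.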

UTF-8

With a fixed two-byte encoding, even English letters need two bytes each, which makes Unicode inconvenient for transmission and storage. The UTF encodings were created for this reason.

UTF-8 is compatible with ASCII (but not with the upper half of ISO8859-1) and can represent characters in all languages. It is a variable-length encoding: each character takes 1 to 4 bytes. English letters take one byte, while most Chinese characters take three bytes.
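
The byte lengths can be checked directly with str.encode(). The sample characters below are chosen only for illustration and are written as escapes so the snippet does not depend on the source file's encoding.

```python
# Sample characters: "A", e with acute accent, the Chinese character for "middle", an emoji
samples = ["A", "\u00e9", "\u4e2d", "\U0001F600"]

# Number of bytes each character needs in UTF-8
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]

# The same Chinese character needs only 2 bytes in GBK
gbk_length = len("\u4e2d".encode("gbk"))

# UTF-8 and ASCII agree on English letters
ascii_same = "A".encode("utf-8") == "A".encode("ascii")

# UTF-8 and ISO-8859-1 disagree on characters above 127
latin1_differs = "\u00e9".encode("utf-8") != "\u00e9".encode("latin-1")

print(utf8_lengths, gbk_length, ascii_same, latin1_differs)
```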

Chinese garbled code problem

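The default encoding of Chinese-language Windows is GBK, and the default encoding of Linux is usually UTF-8. When open() is called in text mode without an encoding argument, Python uses the operating system's default encoding, which on Chinese Windows is GBK. Reading a UTF-8 file this way (or the reverse) produces garbled Chinese or a decoding error, so pass the encoding explicitly, for example encoding="utf-8".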
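
A small sketch of the problem and the usual fix follows. The file name cn.txt is made up, and errors="replace" is used only so that the wrong decoding shows garbled text instead of raising an exception.

```python
import locale
import os
import tempfile

# Hypothetical scratch file; the text is the two characters for "Chinese"
path = os.path.join(tempfile.mkdtemp(), "cn.txt")
text = "\u4e2d\u6587"

# The encoding open() falls back to when none is given
print(locale.getpreferredencoding(False))

# Write and read with an explicit encoding: the text survives on every platform
with open(path, "w", encoding="utf-8") as f:
    f.write(text)
with open(path, encoding="utf-8") as f:
    same = f.read()

# Read the UTF-8 bytes as if they were GBK: the result is garbled
with open(path, encoding="gbk", errors="replace") as f:
    garbled = f.read()

print(same == text, garbled == text)
```

Passing encoding explicitly to every text-mode open() makes the program behave the same on Windows and Linux.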
