Main contents of this article
character
byte
Structure and memory view
Conversion between characters and bytes -- codec
BOM ghost character
Continue tomorrow...
The code for this article is on GitHub: https://github.com/ampeeg/cnblogs/tree/master/python advanced
character
'''
Character encoding often troubles Python programmers; I run into this headache all the time when writing crawlers.
Python 3 draws a clear line between human language (text strings) and machine language (binary bytes).
Let's start with text strings.

Before starting, define "character":
    Character: a Unicode character; the elements you get from a Python 3 str object are Unicode characters
    String: a sequence of characters (characters in the sense just defined)
'''

if __name__ == "__main__":
    # Create characters
    s1 = str('a')
    s2 = 'b'
    s3 = u'c'    # the u prefix is optional in Python 3 and changes nothing
    print(s1, s2, s3)    # a b c
For now, just remember that in Python 3 a character is a Unicode character: str is Unicode text, the form that humans can read.
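To make "str is Unicode" concrete, here is a minimal sketch (the sample string is my own choice, not from the original code). It shows that a str is measured in characters, while its encoded form is measured in bytes.

```python
# A str is a sequence of Unicode code points, not bytes.
text = 'a你好'

# len() counts characters (code points).
print(len(text))        # 3

# ord() returns a character's code point; chr() is the inverse.
print(ord('你'))        # 20320
print(chr(20320))       # 你

# Encoding produces bytes; the byte count depends on the codec.
encoded = text.encode('utf8')
print(len(encoded))     # 7 (1 byte for 'a', 3 bytes for each Chinese character)
```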
byte
'''
Python 3 has two basic built-in binary sequence types: the immutable bytes and the mutable bytearray
    (1) Each element of a bytes or bytearray object is an integer from 0 to 255 (8 bits);
    (2) A slice of a binary sequence is always a binary sequence of the same type
'''

if __name__ == "__main__":
    # Create bytes and bytearray objects
    b1 = bytes('abc你好', encoding='utf8')    # More on encoding later. I don't know whether anyone else, like me, keeps mixing up the direction of encoding and decoding
    print(b1)    # b'abc\xe4\xbd\xa0\xe5\xa5\xbd'

    b2 = bytearray('abc你好', encoding='utf8')
    print(b2)    # bytearray(b'abc\xe4\xbd\xa0\xe5\xa5\xbd')

    # Slicing (hint: every sequence can be sliced)
    print(b1[3:5])    # b'\xe4\xbd'
    print(b2[3:5])    # bytearray(b'\xe4\xbd')

    # Indexing, as with a list
    print(b1[3])    # 228    This is a single element (an int), not a byte sequence
    for _ in b1:
        print(_, end=',')    # 97,98,99,228,189,160,229,165,189,    All are 8-bit integers

    # bytes is immutable, bytearray is mutable
    # b1[3] = 160    # TypeError: 'bytes' object does not support item assignment

    print(id(b2), b2)    # 4373768376 bytearray(b'abc\xe4\xbd\xa0\xe5\xa5\xbd')
    b2[2] = 78
    print(id(b2), b2)    # 4373768376 bytearray(b'abN\xe4\xbd\xa0\xe5\xa5\xbd')

    # Convert b2 to a string
    print(b2.decode('utf8'))    # abN你好
    # Decoding as UTF-8 still works because 78, the ASCII code of N, is also N's UTF-8 encoding

    b2.extend(bytearray('添加的内容', encoding='utf8'))    # As a mutable sequence, bytearray has the usual mutable-sequence methods
    print(id(b2), b2)    # 4373768376 bytearray(b'abN\xe4\xbd\xa0\xe5\xa5\xbd\xe6\xb7\xbb\xe5\x8a\xa0\xe7\x9a\x84\xe5\x86\x85\xe5\xae\xb9')
    print(b2.decode('utf8'))    # abN你好添加的内容

    # PS: You can think of a binary sequence as a list whose elements are integers from 0 to 255 (ASCII itself only covers 0 to 127)
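The difference between indexing and slicing, and between the immutable and mutable types, can also be seen without any text encoding involved. The following is a small sketch with values chosen for illustration; the flag name `out_of_range_rejected` is hypothetical.

```python
# Build bytes directly from integers in the range 0..255.
data = bytes([97, 98, 99])
print(data)         # b'abc'

# Indexing returns an int; slicing returns a sequence of the same type.
print(data[0])      # 97
print(data[:1])     # b'a'

# bytearray(data) makes a mutable copy; the original bytes object is untouched.
buf = bytearray(data)
buf[0] = 65
print(buf)          # bytearray(b'Abc')
print(data)         # b'abc'

# Values outside 0..255 are rejected with a ValueError.
out_of_range_rejected = False
try:
    bytes([256])
except ValueError:
    out_of_range_rejected = True
print(out_of_range_rejected)    # True
```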
Structure and memory view
'''
The struct module can extract structured information from binary sequences.
It provides functions that convert packed byte sequences into tuples of fields of different types, and functions for the reverse conversion.
struct works with bytes, bytearray, and memoryview objects.
'''

import struct

if __name__ == "__main__":
    # memoryview gives shared-memory access to other binary sequences, packed arrays, and buffers, so slices of the data can be read without copying bytes
    fmt = '<3s3sHH'    # Format: < means little-endian, 3s3s is two 3-byte sequences, HH is two unsigned 16-bit integers
    with open('L3_chart_python.jpg', 'rb') as f:
        img = memoryview(f.read())

    # Show the first 20 bytes of the file header
    print(bytes(img[:20]))    # b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x02\x00\x1c\x00\x1c\x00\x00'

    # fmt describes exactly 10 bytes, so unpack the first 10
    print(struct.unpack(fmt, img[:10]))    # (b'\xff\xd8\xff', b'\xe0\x00\x10', 17994, 17993)

    del img    # drop the reference to the memoryview
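The example above needs an image file on disk. As a file-free sketch of the same idea, the snippet below packs the values shown in the output and unpacks them again through a memoryview. The variable names are my own; only `struct.pack`, `struct.unpack`, `struct.calcsize`, and `memoryview` from the standard library are used.

```python
import struct

fmt = '<3s3sHH'    # little-endian: two 3-byte strings, two unsigned 16-bit ints

# pack() is the reverse of unpack(): fields -> bytes.
packed = struct.pack(fmt, b'\xff\xd8\xff', b'\xe0\x00\x10', 17994, 17993)
print(packed)                   # b'\xff\xd8\xff\xe0\x00\x10JFIF'

# calcsize() reports how many bytes the format describes.
print(struct.calcsize(fmt))     # 10

# unpack() accepts a memoryview, so no intermediate bytes copy is needed.
view = memoryview(packed)
fields = struct.unpack(fmt, view)
print(fields)                   # (b'\xff\xd8\xff', b'\xe0\x00\x10', 17994, 17993)

# Slicing a memoryview yields another view onto the same buffer.
tail = view[6:]
print(bytes(tail))              # b'JFIF'
```

The two integers come out as 17994 and 17993 because the bytes 'JF' and 'IF' are read in little-endian order (0x464A and 0x4649).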
Conversion between characters and bytes -- codec
'''
Python ships with more than 100 codecs for converting between strings and bytes.
Each codec has several names, such as 'utf_8', 'utf8', 'utf-8', and 'U8', any of which can be passed as the encoding argument
'''

if __name__ == "__main__":
    # Compare the output of different encodings
    for codec in ['gbk', 'utf8', 'utf16']:
        print(codec, "你好".encode(codec), sep='\t')
    '''
    gbk     b'\xc4\xe3\xba\xc3'
    utf8    b'\xe4\xbd\xa0\xe5\xa5\xbd'
    utf16   b'\xff\xfe`O}Y'
    '''

    # Now decode them
    print(b'\xc4\xe3\xba\xc3'.decode('gbk'))    # 你好
    print(b'\xe4\xbd\xa0\xe5\xa5\xbd'.decode('utf8'))    # 你好
    print(b'\xff\xfe`O}Y'.decode('utf16'))    # 你好
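If, like me, you keep mixing up the direction: str.encode() goes from text to bytes, and bytes.decode() goes from bytes back to text. The sketch below checks that round trip and uses `codecs.lookup` from the standard library to show that the alias names resolve to the same codec. The variable names are illustrative.

```python
import codecs

# All of these aliases resolve to one canonical codec name.
aliases = ['utf_8', 'utf8', 'utf-8', 'U8']
canonical = [codecs.lookup(alias).name for alias in aliases]
print(canonical)    # ['utf-8', 'utf-8', 'utf-8', 'utf-8']

text = '你好'
raw = text.encode('utf8')     # encode: str -> bytes
back = raw.decode('utf8')     # decode: bytes -> str
print(type(raw).__name__, type(back).__name__)    # bytes str
print(back == text)           # True

# 'utf-16-le' fixes the byte order, so no BOM (b'\xff\xfe') is prepended.
print(text.encode('utf-16-le'))    # b'`O}Y'
```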
'''
Encoding problems are usually a headache. Let's see how to deal with the two most common errors:
    (1) UnicodeEncodeError
    (2) UnicodeDecodeError
'''

if __name__ == "__main__":
    # (1) UnicodeEncodeError
    # Use the errors parameter
    s1 = "hello，你长胖啦".encode('latin-1', errors='ignore')
    print(s1)    # b'hello'    errors='ignore' silently drops characters that cannot be encoded

    s2 = "hello，你长胖啦".encode('latin-1', errors='replace')
    print(s2)    # b'hello?????'    errors='replace' replaces each unencodable character with '?'

    s3 = "hello，你长胖啦".encode('latin-1', errors='xmlcharrefreplace')
    print(s3)    # b'hello&#65292;&#20320;&#38271;&#32982;&#21862;'    errors='xmlcharrefreplace' replaces each unencodable character with an XML character reference

    # (2) UnicodeDecodeError
    # Garbled characters are called ghost characters (mojibake). The following example shows how they appear
    s4 = b'Montr\xe9al'
    print(s4.decode('cp1252'))    # Montréal
    print(s4.decode('iso8859_7'))    # Montrιal
    print(s4.decode('koi8_r'))    # MontrИal
    # print(s4.decode('utf8'))    # UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 5: invalid continuation byte
    print(s4.decode('utf8', errors='replace'))    # Montr�al
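One common way to cope with UnicodeDecodeError in practice is to try a short list of candidate encodings in order. The helper below, `decode_with_fallback`, is a hypothetical name of my own and only a sketch: the default candidate list is an assumption, and a successful decode does not prove the guess was right (as the ghost characters above show, many single-byte codecs accept almost any input).

```python
def decode_with_fallback(data, encodings=('utf8', 'cp1252')):
    """Try each encoding strictly; return (text, encoding_used).

    Hypothetical helper for illustration. If every candidate fails,
    fall back to UTF-8 with errors='replace' and report None.
    """
    for name in encodings:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue    # this codec cannot decode the data, try the next one
    return data.decode('utf8', errors='replace'), None


# Valid UTF-8 is decoded by the first candidate.
print(decode_with_fallback(b'\xe4\xbd\xa0\xe5\xa5\xbd'))    # ('你好', 'utf8')

# Invalid UTF-8 falls through to cp1252.
print(decode_with_fallback(b'Montr\xe9al'))                 # ('Montréal', 'cp1252')
```

Putting the strictest encoding (UTF-8) first matters here: a permissive single-byte codec placed first would accept nearly everything and hide the real encoding.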
Continue tomorrow...