Python advanced (4) - text and byte sequences (encoding problems)

Posted by misslilbit02 on Mon, 06 Apr 2020 05:04:58 +0200

Main content of this article



Structure and memory view

Conversion between characters and bytes -- codec

BOM ghost character

Continue tomorrow...


python advanced - Directory

The code for this article is on GitHub: advanced



    Character encoding often troubles Python programmers. I run into this headache regularly when writing crawlers.

    Python 3 draws a clear line between human-readable text (str) and machine-oriented binary data (bytes). Let's start with text strings.
    Before we start, define "character":
        Character: a Unicode character. Each element of a Python 3 str object is a Unicode character.
        String: a sequence of characters, that is, a sequence of the Unicode characters defined above.


if __name__ == "__main__":
    # Create characters (the u prefix is accepted in Python 3 but has no effect)
    s1 = str('a')
    s2 = 'b'
    s3 = u'c'
    print(s1, s2, s3)      # a b c


For now, just remember that in Python 3 a character is a Unicode character. In other words, str is Unicode text, the form that humans can read.
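
    To make the relationship between characters, code points, and bytes concrete, here is a minimal sketch. The sample string is chosen only for illustration; ord() and chr() are the built-ins that convert between a character and its integer code point.

```python
# A str is a sequence of Unicode characters (code points).
s = 'abc你好'                 # sample text chosen for illustration

print(len(s))                 # 5 characters
print(s[3])                   # 你
print(ord(s[3]))              # 20320, the code point of 你 as an integer
print(hex(ord(s[3])))         # 0x4f60, usually written U+4F60
print(chr(20320))             # 你
print(len(s.encode('utf8')))  # 9: three ASCII bytes plus three bytes per Chinese character
```

    The length of a str counts characters, while the length of its encoded form counts bytes, so the two differ as soon as the text contains non-ASCII characters.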



    Python 3 has two basic built-in binary sequence types: the immutable bytes and the mutable bytearray.
        (1) Each element of a bytes or bytearray object is an integer from 0 to 255 (8 bits);
        (2) A slice of a binary sequence is always a binary sequence of the same type.

if __name__ == "__main__":
    # Create bytes and bytearray objects
    # (encode is covered later; like me, you may keep mixing up the directions: str -> bytes is encoding, bytes -> str is decoding)
    b1 = bytes('abc你好', encoding='utf8')
    print(b1)          # b'abc\xe4\xbd\xa0\xe5\xa5\xbd'

    b2 = bytearray('abc你好', encoding='utf8')
    print(b2)          # bytearray(b'abc\xe4\xbd\xa0\xe5\xa5\xbd')

    # Slicing (hint: every sequence can be sliced)
    print(b1[3:5])     # b'\xe4\xbd'
    print(b2[3:5])     # bytearray(b'\xe4\xbd')

    # Index it the way you would index a list
    print(b1[3])       # 228: a single element (an int), not a byte sequence
    for byte in b1:
        print(byte, end=',')   # 97,98,99,228,189,160,229,165,189,      all of these are 8-bit integers

    # bytes is immutable, bytearray is mutable

    # b1[3] = 160           # TypeError: 'bytes' object does not support item assignment
    print(id(b2), b2)      # 4373768376 bytearray(b'abc\xe4\xbd\xa0\xe5\xa5\xbd')
    b2[2] = 78
    print(id(b2), b2)      # 4373768376 bytearray(b'abN\xe4\xbd\xa0\xe5\xa5\xbd')

    # Convert b2 back to a string
    print(b2.decode('utf8'))  # abN你好
                              # This still decodes as UTF-8 because the new byte 78 is the ASCII code of N, and UTF-8 encodes ASCII characters as the same single byte
    b2.extend(bytearray('添加的内容', encoding='utf8'))  # bytearray is a mutable sequence, so it has the usual mutable-sequence methods
    print(id(b2), b2)         # 4373768376 bytearray(b'abN\xe4\xbd\xa0\xe5\xa5\xbd\xe6\xb7\xbb\xe5\x8a\xa0\xe7\x9a\x84\xe5\x86\x85\xe5\xae\xb9')

    print(b2.decode('utf8'))  # abN你好添加的内容

    # PS: you can think of a binary sequence as a list of integers from 0 to 255; only the values 0 to 127 correspond to ASCII characters
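
    The two properties listed earlier (elements are integers from 0 to 255, and slices keep their type) can also be checked without encoding a string first. The following is a small sketch with values picked for illustration; bytes(), bytes.fromhex(), and the hex() method are standard constructors and methods of the built-in types.

```python
# Other ways to build binary sequences (values chosen for illustration).
b = bytes([97, 98, 99])            # from an iterable of integers in 0..255
print(b)                           # b'abc'
print(bytes.fromhex('e4bda0'))     # b'\xe4\xbd\xa0', the UTF-8 bytes of 你
print(b.hex())                     # 616263

# Indexing returns an int, slicing returns a sequence of the same type.
print(b[0])                        # 97
print(b[0:1])                      # b'a'

ba = bytearray(b)
print(type(ba[0:1]).__name__)      # bytearray

# Values outside 0..255 are rejected.
try:
    bytes([256])
except ValueError as exc:
    print('ValueError:', exc)
```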


Structure and memory view

    The struct module extracts structured information from binary sequences.
    It provides functions that convert packed byte sequences into tuples of fields of different types, and functions that perform the reverse conversion.
    The struct module can handle bytes, bytearray, and memoryview objects.

import struct

if __name__ == "__main__":
    # memoryview gives shared-memory access to slices of other binary sequences, packed arrays, and buffers without copying bytes
    fmt = '<3s3sHH'   # format: < is little-endian, 3s3s is two 3-byte sequences, HH is two 16-bit unsigned integers

    with open('L3_chart_python.jpg', 'rb') as f:
        img = memoryview(f.read())   # read the file contents and wrap them in a memoryview

    print(bytes(img[:20]))    # b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x02\x00\x1c\x00\x1c\x00\x00'
    print(struct.unpack(fmt, img[:10]))   # (b'\xff\xd8\xff', b'\xe0\x00\x10', 17994, 17993)  the unpacked fields

    del img
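
    The example above needs an image file on disk. As a file-free sketch of the same idea, the block below packs a 10-byte header by hand with struct.pack (the reverse conversion mentioned earlier), unpacks it again, and then shows that a memoryview slice shares memory with its buffer. The variable names header, buf, view, and part are only illustrative.

```python
import struct

fmt = '<3s3sHH'                      # same format string as in the example above
print(struct.calcsize(fmt))          # 10 bytes: 3 + 3 + 2 + 2

# Build a 10-byte header by hand instead of reading a file.
header = struct.pack(fmt, b'\xff\xd8\xff', b'\xe0\x00\x10', 17994, 17993)
print(header)                        # b'\xff\xd8\xff\xe0\x00\x10JFIF'
print(struct.unpack(fmt, header))    # (b'\xff\xd8\xff', b'\xe0\x00\x10', 17994, 17993)

# A memoryview slice shares memory with the underlying buffer.
buf = bytearray(header)
view = memoryview(buf)
part = view[6:10]                    # no bytes are copied here
buf[6] = ord('j')                    # change the buffer in place
print(bytes(part))                   # b'jFIF': the slice sees the change
```

    The two H fields come out as 17994 and 17993 because the byte pairs 'JF' and 'IF' are read as little-endian 16-bit integers (0x464A and 0x4649).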


Conversion between characters and bytes -- codec

    Python ships with more than 100 codecs for converting between strings and bytes.
    Each codec has several names, such as 'utf_8', 'utf8', 'utf-8', and 'U8', any of which can be passed as the
    encoding parameter.

if __name__ == "__main__":
    # Compare the output of different codecs
    for codec in ['gbk', 'utf8', 'utf16']:
        print(codec, "你好".encode(codec), sep='\t')
    # gbk      b'\xc4\xe3\xba\xc3'
    # utf8     b'\xe4\xbd\xa0\xe5\xa5\xbd'
    # utf16    b'\xff\xfe`O}Y'

    # Now decode them again
    print(b'\xc4\xe3\xba\xc3'.decode('gbk'))            # 你好
    print(b'\xe4\xbd\xa0\xe5\xa5\xbd'.decode('utf8'))   # 你好
    print(b'\xff\xfe`O}Y'.decode('utf16'))              # 你好
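
    The point that one codec answers to several names can be checked with the standard codecs module. This is a minimal sketch: codecs.lookup() returns a CodecInfo object whose name attribute is the canonical name, and the misspelled codec name in the second half is made up on purpose.

```python
import codecs

# All of these spellings resolve to the same codec.
for alias in ['utf_8', 'utf8', 'utf-8', 'U8']:
    print(alias, codecs.lookup(alias).name, sep='\t')   # every line ends with utf-8

# An unknown name raises LookupError.
try:
    codecs.lookup('no-such-codec')
except LookupError as exc:
    print('LookupError:', exc)
```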
    Encoding problems are usually frustrating. Let's see how to handle the following two errors:
     (1) UnicodeEncodeError
     (2) UnicodeDecodeError

if __name__ == "__main__":
    # (1) UnicodeEncodeError
    # Use the errors parameter
    s1 = "hello，你长胖啦".encode('latin-1', errors='ignore')
    print(s1)   # b'hello'    errors='ignore' silently drops characters that cannot be encoded

    s2 = "hello，你长胖啦".encode('latin-1', errors='replace')
    print(s2)   # b'hello?????'    errors='replace' substitutes '?' for each character that cannot be encoded

    s3 = "hello，你长胖啦".encode('latin-1', errors='xmlcharrefreplace')
    print(s3)   # b'hello&#65292;&#20320;&#38271;&#32982;&#21862;'    errors='xmlcharrefreplace' substitutes an XML character reference for each character that cannot be encoded

    # (2) UnicodeDecodeError
    # Garbled characters are called ghost characters. The following example shows how they appear when bytes are decoded with the wrong codec

    s4 = b'Montr\xe9al'
    print(s4.decode('cp1252'))    # Montréal
    print(s4.decode('iso8859_7')) # Montrιal
    print(s4.decode('koi8_r'))    # MontrИal
    #print(s4.decode('utf8'))      # UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 5: invalid continuation byte
    print(s4.decode('utf8', errors='replace'))   # Montr�al
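
    When the encoding of incoming bytes is unknown, one common approach is to catch UnicodeDecodeError and try another codec. The helper below, decode_with_fallback, is a hypothetical function written for this sketch, not part of the standard library, and the default list of encodings is only an assumption.

```python
def decode_with_fallback(data, encodings=('utf8', 'cp1252')):
    """Hypothetical helper: try each encoding in order, return (text, encoding)."""
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Last resort: decode with the first codec, marking bad bytes with U+FFFD.
    return data.decode(encodings[0], errors='replace'), encodings[0]


print(decode_with_fallback(b'Montr\xe9al'))          # ('Montréal', 'cp1252')
print(decode_with_fallback('你好'.encode('utf8')))    # ('你好', 'utf8')
```

    A successful decode does not prove the guess was right: as the Montréal example shows, single-byte codecs such as cp1252 accept almost any byte sequence and may quietly produce ghost characters, so the strict codec (here utf8) should come first in the list.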


Continue tomorrow...


Topics: Python codec encoding github