Day04 hex and encoding

Posted by toysrfive on Mon, 03 Jan 2022 05:57:32 +0100

Day0 four hex and encoding

Course objective: to explain some necessary common sense knowledge in computer and let students understand the meaning behind some common nouns (focusing on understanding).

Course overview:

  • How python code works
  • Base system
  • Units in a computer
  • code

1.Python code operation mode

  • Scripted

    python3 ~/PycharmProjects/day03/6.Explanation of homework questions.py
    
  • interactive

    python3
    

2. Binary

All data at the bottom of the computer exists in the form of 010101 (picture, text, video, etc.).

  • Binary

    0
    1
    10
    

  • octal number system

  • decimal system

  • hexadecimal

2.1 binary conversion

25
vl = bin(25)  # Replace decimal with binary
v2 = oct(25)  # Replace decimal With Octal
v3 = hex(25)  # Convert decimal to hexadecimal
i1 = int("0b11001", base=2)   # Binary to decimal
i1 = int("0b11001", base=8)   # Binary to decimal
i1 = int("0b11001", base=16)  # Replace binary with hexadecimal

3. Units in the computer

Since virtually everything in the computer is stored and operated as binary, in order to facilitate the representation of binary size, some units have been made

  • b (bit), bit

    1 # One
    10 # both of you
    110 # Three
    
  • B (byte), byte

    8 Bit is a byte
    
    10010110  # 1 byte
    10010110  10010110  # 2 bytes
    
  • KB, (kilobyte), kilobyte

    1024 A byte is a kilobyte
    1KB = 1024B
    10101010 10101010 .....   1KB
    1KB = 1024B= 1024 * 8 b
    
  • M (Megabyte), Megabyte

    1MB = 1024KB
    1M = 1024KB = 1024*1024B = 1024*1024*8b
    
  • G (Gigabyte), Gigabit

    1024M = 1GB
    
  • T (Terabyte), Terabyte

    1024GB = 1TB
    
  • ... Other larger units PB/EB/ZB/YB/BB/NB/DB will not be repeated.

4. Coding

Encoding, a correspondence table between text and binary.

4.1 ASCII encoding

ascii specifies that one byte is used to represent the correspondence between letters and binary.

0000000
00000001    w
00000010    B
00000011    a
...
11111111

2**8 = 256

4.2 gb-2312 coding

gb-2312 code, produced by the National Information Standards Committee (1980).

gbk code extends gb2312 to include Chinese, Japanese and Korean characters (1995).

When corresponding to binary, the following logic is used:

  • Single byte representation, with one byte representing the corresponding relationship. 2 * * 8 = 256 possibilities
  • Two bytes represent, and two bytes represent the corresponding relationship. 2 * * 16 = 65536 possibilities.

4.3 unicode

unicode, also known as universal code, assigns a code point (binary representation) to every word in the world.

  • ucs2

    Use a fixed two bytes to represent a text.
    
    00000000 00000000     Comprehend
    ...
    
    2**16 = 65535
    
  • ucs4

    Use a fixed 4 bytes to represent a text.
    00000000 00000000 00000000 00000000  nothing
    ...
    2**32 = 4294967296
    
written words     hexadecimal            Binary 
 ȧ        0227           1000100111
 ȧ        0227         00000010 00100111                       ucs2
 ȧ        0227         00000000 00000000 00000010 00100111     ucs4
 
 Joe       4E54           100111001010100
 Joe       4E54         01001110 01010100                       ucs2
 Joe       4E54         00000000 00000000 01001110 01010100     ucs4
 
 😆      1F606        11111011000000110
 😆      1F606        00000000 00000001 11110110 00000110      ucs4

Both ucs2 and ucs4 have disadvantages: waste of space?

written words     hexadecimal     Binary
A        0041      01000001
A        0041      00000000 01000001
A        0041      00000000 00000000 00000000 01000001

Application of unicode: in file storage and network transmission, unicode will not be used directly, but in memory.

4.4 utf-8 coding

Containing the correspondence between all words and binary, it is the most widely used coding in the world (standing on the shoulders of Giants).

In essence: utf-8 is a compression of Unicode, which uses as few binaries as possible to correspond to text. When storing data, Unicode is converted to utf-8 for storage and optimization

  unicode Code point range            utf-8      
   0000 ~ 007F              Expressed in 1 byte
   0080 ~ 07FF              Represented by 2 bytes
   0800 ~ FFFF              Represented by 3 bytes
  10000 ~ 10FFFF            Represented by 4 bytes

Specific compression process:

  • Step 1: select a conversion template

      Code point range (HEX)                Conversion template
       0000 ~ 007F              0XXXXXXX
       0080 ~ 07FF              110XXXXX 10XXXXXX
       0800 ~ FFFF              1110XXXX 10XXXXXX 10XXXXXX
      10000 ~ 10FFFF            11110XXX 10XXXXXX 10XXXXXX 10XXXXXX
      
      For example:
          "B"  Corresponding unicode If the code point is 0042, it should select a template.
          "ǣ"  Corresponding unicode The code point is 01 E3,The second template should be selected.
          "martial" Corresponding unicode The code point is 6 B66,The third template should be selected.
           😆  Corresponding unicode The code point is 1 F606,The fourth template should be selected.            
    
    Note: generally, the third template (3 bytes) is used in Chinese, which means that people usually speak Chinese in English utf-8 The reason why it will occupy 3 bytes in.
    
  • Step 2: fill in data in the template

    - "martial"  ->  6B66  ->  110 101101 100110
    - Insert data according to template
    	1110XXXX 10XXXXXX 10XXXXXX
    	1110XXXX 10XXXXXX 10100110
    	1110XXXX 10101101 10100110
    	11100110 10101101 10100110
     stay UTF-8 Code "Wu" 11100110 10101101 10100110
    
    - 😆  ->  1F606  ->  11111 011000 000110
    - Insert data according to template
    	11110000 10011111 10011000 10000110
    

4.5 Python related coding

String( str)     "Hammer"             unicode handle               Generally in memory
 Byte( byte)      b"alexfdsfdsdfskdfsd"      utf-8 code or gbk code       Generally used for file or network processing
v1 = "martial"

# Convert Unicode processed encoding to utf-8 encoding
v2 = "martial".encode("utf-8")

# Convert Unicode processed encoding to gbk encoding
v2 = "martial".encode("gbk")

Writes a string to a file.

name = "Hammer"

# Use utf-8 format for storage
data = name.encode("utf-8")

# Open a file
file_object = open("log.txt",mode="wb")
# Write content in file
file_object.write(data)
# Close file
file_object.close()

summary

The knowledge points in this chapter are mainly understanding. Understanding these foundations is conducive to the learning of the following knowledge points. Next, all the knowledge points in this section are summarized:

  1. Everything on the computer will eventually be converted into binary and run.

  2. ascii encoding, unicode character set and utf-8 encoding are essentially the relationship between characters and binary.

    • ascii, character and binary comparison table.
    • Comparison table of unicode, character and binary (code point).
    • utf-8 compresses the code points of unicode character set, and indirectly maintains the comparison table between characters and binary.
  3. ucs2 and ucs4 refer to the number of bytes used to represent the code point of the unicode character set.

  4. At present, the most extensive coding is utf-8, which can represent all characters, and the storage or network transmission will not waste resources (the code bits are compressed).

  5. Binary, octal, decimal and hexadecimal are actually different times of carry.

  6. Realize the conversion between binary, octal, decimal and hexadecimal based on Python.

  7. One byte 8 bits

  8. The relationship between common units b/B/KB/M/G in computer.

  9. For Chinese characters, 2 bytes are required for gbk encoding; Encoding with utf-8 requires 3 bytes.

  10. Converting strings to bytes (utf-8 encoding) based on Python implementation

    # String type
    name = "Hammer"
    
    print(name) # Hammer
    # Convert string to byte type
    data = name.encode("utf-8")
    print(data) # b'\xe9\x93\x81\xe9\x94\xa4'
    
    # Convert bytes to strings
    old = data.decode("utf-8")
    print(old)
    
  11. Converting strings to bytes (gbk encoding) based on Python implementation

    # String type
    name = "Hammer"
    print(name) # Hammer
    # Convert string to byte type
    data = name.encode("gbk")
    # print(data) # B '\ xe9 \ X93 \ x81 \ xe9 \ x94 \ Xa4' utf8, Chinese 3 bytes
    print(data) #B '\ Xcc \ xfa \ xb4 \ xb8' GBK, Chinese 2 bytes
    
    # Convert bytes to strings
    old = data.decode("gbk")
    print(old)