"Python core technology and practice" Python "black box": input and output?

Posted by laeelin on Fri, 10 Sep 2021 09:19:14 +0200

Python "black box": input and output?

1, Input / output basis

Python program is also a black box: data is delivered through the input stream and processed data is sent out through the output stream. Maybe there is a person hidden behind the Python interpreter or a Slytherin? No one cares.
The simplest direct input comes from keyboard operation, such as the following example.

name = input('your name:')
gender = input('you are a boy?(y/n)')

###### input ######
your name:Jack
you are a boy?

welcome_str = 'Welcome to the matrix {prefix} {name}.'
welcome_dic = {
    'prefix': 'Mr.' if gender == 'y' else 'Mrs',
    'name': name
}

print('authorizing...')
print(welcome_str.format(**welcome_dic))

########## output ##########
authorizing...
Welcome to the matrix Mr. Jack.

Note: adding an asterisk in front of the list is to unpack the list into two independent parameters. Pass in the function. Adding two asterisks in front of the dictionary is to unpack the dictionary into independent elements as formal parameters.
You can refer to [list in python, function of adding * before dictionary]: https://blog.csdn.net/qq_40801709/article/details/107742244

The input() function pauses the program and waits for keyboard input; Until the enter is pressed, the parameter of the function is the prompt, and the input type is always string type (str). Note that beginners are easy to make mistakes here. I will talk about the following examples. The print() function accepts the output of strings, numbers, dictionaries, lists, and even some custom classes.

a = input()
1
b = input()
2

print('a + b = {}'.format(a + b))
########## output ##############
a + b = 12
print('type of a is {}, type of b is {}'.format(type(a), type(b)))
########## output ##############
type of a is <class 'str'>, type of b is <class 'str'>
print('a + b = {}'.format(int(a) + int(b)))
########## output ##############
a + b = 3

Note here that to convert str to int, use int() and to convert to floating point number, use float(). When using cast in the production environment, please remember to add try except ion (i.e. error and exception handling, which will be discussed later in the column).
Python has no maximum limit on int type (in contrast, the maximum int in C + + is 2147483647, which will cause overflow), but it still has precision limit on float type. In addition to paying attention to these features in some algorithm competitions, we should always be on guard in the production environment to avoid bug s or even 0day (critical security vulnerabilities) caused by unclear judgment of boundary conditions.

2, File input / output

The input and output of the command line is only the most basic way of Python interaction, which is suitable for the interaction of some simple applets. For production level Python code, most I/O comes from files, networks, messages from other processes, and so on.
Next, let's analyze the reading and writing of a text file in detail. Suppose we have a text file in.txt, which is as follows:

I have a dream that my four little children will one day live in a nation where they will not be judged by the color of their skin but by the content of their character. I have a dream today.

I have a dream that one day down in Alabama, with its vicious racists, . . . one day right there in Alabama little black boys and black girls will be able to join hands with little white boys and white girls as sisters and brothers. I have a dream today.

I have a dream that one day every valley shall be exalted, every hill and mountain shall be made low, the rough places will be made plain, and the crooked places will be made straight, and the glory of the Lord shall be revealed, and all flesh shall see it together.

This is our hope. . . With this faith we will be able to hew out of the mountain of despair a stone of hope. With this faith we will be able to transform the jangling discords of our nation into a beautiful symphony of brotherhood. With this faith we will be able to work together, to pray together, to struggle together, to go to jail together, to stand up for freedom together, knowing that we will be free one day. . . .

And when this happens, and when we allow freedom ring, when we let it ring from every village and every hamlet, from every state and every city, we will be able to speed up that day when all of God's children, black men and white men, Jews and Gentiles, Protestants and Catholics, will be able to join hands and sing in the words of the old Negro spiritual: "Free at last! Free at last! Thank God Almighty, we are free at last!"

OK, let's do a simple NLP (natural language processing) task. If you don't know much about it and it doesn't matter, I'll take you step by step to complete the task.
First of all, we need to know the basic steps of NLP task, that is, the following four steps:

Read the file in.txt;
Remove all punctuation and line breaks and change all uppercase to lowercase;
Merge the same words, count the frequency of each word, and sort according to the word frequency from large to small;
Output the results as lines to the file out.txt.

You can think for yourself about how to solve this problem in Python. Here, I also give my code with detailed comments. Let's take a look at this code.

import re

# You don't have to care much about this function
def parse(text):
    # Use regular expressions to remove punctuation and line breaks
    # \W means to find word characters (match any word characters including underscores); [^ \ w] indicates characters not in \ W;
    text = re.sub(r'[^\w ]', ' ', text)  # [] means to match the characters listed in []
    print(text)

    # Convert to lowercase
    text = text.lower()
    
    # Generate a list of all words
    word_list = text.split(' ')
    
    # Remove blank words
    word_list = filter(None, word_list)
    
    # Generate a dictionary of words and word frequencies
    word_cnt = {}
    for word in word_list:
        if word not in word_cnt:
            word_cnt[word] = 0
        word_cnt[word] += 1
    
    # Sort by word frequency
    sorted_word_cnt = sorted(word_cnt.items(), key=lambda kv: kv[1], reverse=True)
    
    return sorted_word_cnt

with open('/dataNew/kaifang/in.txt', 'r') as fin:
    text = fin.read()
    print(text)

word_and_freq = parse(text)

with open('in.txt', 'w') as fout:
    for word, freq in word_and_freq:
        fout.write('{} {}\n'.format(word, freq))

You don't need to care about the specific implementation of parse() function. You just need to know that what it does is convert the input text string into the sorted word frequency statistics we need. And sorted_word_cnt is a list of tuples.
First, we need to understand the basic knowledge of file access in the computer. In fact, the processing of files by the computer kernel is relatively complex, involving a series of concepts such as kernel mode, virtual file system, lock and pointer. I will not explain these contents in depth. I will only say some basic but sufficient knowledge.
We first need to use the open() function to get the pointer to the file. The first parameter specifies the file location (relative location or absolute location); The second parameter, if 'r' indicates reading, if 'w' indicates writing, of course, 'rw' can also be used to indicate that both reading and writing are required. A is a less commonly used (but also useful) parameter, indicating append. If the open file needs to be written, it will be written from the end of the original file.
OK, back to our topic. After getting the pointer, we can read all the contents of the file through the read () function. The code text = fin.read(), which means that all contents of the file are read into memory and assigned to the variable text. Naturally, there are advantages and disadvantages:

The advantage is convenience. Next, we can easily call parse function for analysis;
The disadvantage is that if the file is too large, one-time reading may cause memory corruption.

At this time, we can specify the parameter size for read to represent the maximum length of the read. You can also read one line at a time through the readline() function. This method is often used for data cleaning in Data Mining. It is very light when writing some small programs. If there is no correlation between each row, this can also reduce the pressure on memory. The write() function can output the string in the parameter to a file, which is easy to understand.
Finally, it should be noted that all I/O should be error handled. Because I/O operations may have a variety of situations, and a robust program needs to be able to deal with all kinds of situations without crashing (except for intentional design).

Python regular expressions refer to: https://www.cnblogs.com/huxi/archive/2010/07/04/1771073.html

b_miss=[None, None, None, None, 2, 8, 1, None, None, 0]
print(b_miss)
b_filter = filter(None, b_miss)
b_miss_filter = list(b_filter)
print(b_miss_filter)
# [None, None, None, None, 2, 8, 1, None, None, 0]
# [2, 8, 1]
############################################
b_miss = ['a', 'dream', 'today', '', '']
print(b_miss)
b_filter = filter(None, b_miss)
b_miss_filter = list(b_filter)
print(b_miss_filter)
# ['a', 'dream', 'today', '', '']
# ['a', 'dream', 'today']

3, JSON serialization and Practice

JSON (JavaScript Object Notation) is a lightweight data exchange format. Its design intention is to represent everything with a designed string, which is not only convenient for transmitting information on the Internet, but also convenient for people to read (compared with some binary protocols). JSON is widely used in today's Internet, and it is also a skill point that every Python programmer should master.

Imagine a situation where you want to buy a certain amount of shares from the exchange. Then, you need to submit a series of parameters such as stock code, direction (buy / sell), order type (market price / limit price), price (if it is a limit order), quantity, etc. in these data, there are strings, integers, floating-point numbers, and even Boolean variables. All of them are mixed together, which is not convenient for the exchange to unpack.
What should I do? In fact, the JSON we want to talk about can solve this scenario. You can simply understand it as two kinds of black boxes:

First, input these miscellaneous information, such as Python dictionary, and output a string;
Second, by entering this string, you can output a Python dictionary containing the original information.

import json

params = {
    'symbol': '123456',
    'type': 'limit',
    'price': 123.4,
    'amount': 23
}

params_str = json.dumps(params) # json.dumps() function, which accepts Python's basic data type, and then serializes it into string;

print('after json serialization')
print('type of params_str = {}, params_str = {}'.format(type(params_str), params))

original_params = json.loads(params_str) # The json.loads() function accepts a legal string and then deserializes it into Python's basic data type.

print('after json deserialization')
print('type of original_params = {}, original_params = {}'.format(type(original_params), original_params))

########## output ##########

# after json serialization
# type of params_str = <class 'str'>, params_str = {'symbol': '123456', 'type': 'limit', 'price': 123.4, 'amount': 23}
# after json deserialization
# type of original_params = <class 'dict'>, original_params = {'symbol': '123456', 'type': 'limit', 'price': 123.4, 'amount': 23}

The json.dumps() function accepts the basic data type of python, and then serializes it into string; The json.loads() function accepts a legal string and then deserializes it into Python's basic data type.
Is it simple? But still that sentence, please remember to add error handling. Otherwise, even if you just send an illegal string to json.loads() and you don't catch it, the program will crash. At this point, you might think, what if I want to output strings to a file or read JSON strings from a file? Yes, you can still use the above-mentioned open() and read()/write(), first read / output the string to memory, and then encode / decode JSON. Of course, it's a little troublesome.

Differences between Python JSON. Dumps() and JSON. Dump(): https://www.cnblogs.com/wswang/p/5411826.html
First, describe the basic functions:
dumps is to convert dict to str format, and loads is to convert str to dict format.
Dump and load are similar functions, but combined with file operation (dump is less used in practice).

In this way, we simply and clearly implement the process of reading and writing JSON strings. When developing a third-party application, you can output the user's personal configuration to a file through JSON to facilitate automatic reading the next time the program starts. This is also a mature practice widely used now.
So is JSON the only option? Obviously not, it's just one of the most convenient choices in lightweight applications. As far as I know, there is a similar tool called Protocol Buffer in Google. Of course, Google has completely opened this tool. You can find out how to use it yourself.
Compared with JSON, its advantage is to generate optimized binary files, so its performance is better. But at the same time, the generated binary sequence cannot be read directly. It is widely used in TensorFlow and many other systems with performance requirements.

4, Summary

I/O operation shall be cautious, sufficient error handling shall be carried out, and coding shall be carried out carefully to prevent coding loopholes;
When encoding, the memory and disk occupation should be fully estimated, so that it is easier to find the cause in case of error;
JSON serialization is a very convenient tool, which should be practiced in combination with actual combat;
The code should be as concise and clear as possible. Even at the beginning of learning, you should have a heart to be a marshal.

Topics: Python Algorithm

Programmer Think

"Python core technology and practice" Python "black box": input and output?

1, Input / output basis

2, File input / output

3, JSON serialization and Practice

4, Summary

Hot Topics