Advanced tutorials on Python crawlers and data analysis: file manipulation, lambda expressions, recursion, yield generator

Posted by dormouse1976 on Wed, 15 May 2019 10:49:26 +0200

Python crawler and python teaching video of data analysis, Python source code sharing, Python

Python crawler and basic data analysis tutorial: Python grammar, dictionary, tuples, lists

Advanced tutorials on Python crawlers and data analysis: file manipulation, lambda expressions, recursion, yield generator

Python crawler and data analysis module: built-in module, open source module, custom module

Python crawler and data analysis crawler skills: urllib library, XPath selectors, regular expressions

Python crawler and data analysis, JD.com crawler: crawling JD.com (Jingdong) products and storing them in an sqlite3 database

Python Crawler and Data Analysis Python Open Source Crawler Project Summary

 

python commonly used built-in functions:

 

 

File operation

Working with a file generally takes three steps:

  • Open the file
  • Operate on the file
  • Close the file

I. Opening files

file_handle = open('file path', 'mode')

Note: Python 2 offers two ways to open a file, open(...) and file(...). Both return the same kind of file object, and open is the recommended one. Python 3 removed file(...), so using open also keeps the code portable.

When opening a file, you need to specify the file path and how to open the file. After opening, you can get the file handle, and then operate on the file through the file handle.

The file open modes are:

  • r, read-only mode (default).
  • w, write-only mode. [Not readable; created if it does not exist; existing content is erased]
  • a, append mode. [Not readable; created if it does not exist; writes always go to the end of the file]

"+" means the file can be both read and written:

  • r+, read and write. [Readable; writable; can append; the file must already exist]
  • w+, write and read. [Like w, existing content is erased first]
  • a+, append and read. [Like a, but the file can also be read]

"U" means that \r, \n and \r\n are all converted to \n when reading (used together with r or r+ mode):

  • rU
  • r+U

"b" means the file is handled as binary (e.g. sending a file over FTP or uploading an ISO image; in Python 2 the flag makes no difference on Linux, but on Windows it is required for binary files):

  • rb
  • wb
  • ab
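The difference between w and a is easiest to see by running them. The following is a minimal sketch in Python 3 syntax; the scratch file name demo.txt is an assumption made for the illustration and is created in a temporary directory.

```python
import os
import tempfile

# Hypothetical scratch file used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

with open(path, "w") as f:      # w: create the file, or erase existing content
    f.write("first\n")

with open(path, "w") as f:      # opening with w again discards "first"
    f.write("second\n")

with open(path, "a") as f:      # a: writes go to the end of the file
    f.write("third\n")

with open(path, "r") as f:      # r: read-only
    content = f.read()

print(content)
```

After these steps the file should hold only the lines written by the second w and by the a, because the second w erased what the first one wrote.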

II. File methods

class file(object):

    def close(self):  # real signature unknown; restored from __doc__
        # Close the file.
        """
        close() -> None or (perhaps) an integer.  Close the file.
        """

    def fileno(self):  # real signature unknown; restored from __doc__
        # Return the file descriptor.
        """
        fileno() -> integer "file descriptor".

        This is needed for lower-level file interfaces, such os.read().
        """
        return 0

    def flush(self):  # real signature unknown; restored from __doc__
        # Flush the file's internal buffer.
        """ flush() -> None.  Flush the internal I/O buffer. """
        pass

    def isatty(self):  # real signature unknown; restored from __doc__
        # Check whether the file is connected to a tty device.
        """ isatty() -> true or false.  True if the file is connected to a tty device. """
        return False

    def next(self):  # real signature unknown; restored from __doc__
        # Get the next line; raises StopIteration when there is none.
        """ x.next() -> the next value, or raise StopIteration """
        pass

    def read(self, size=None):  # real signature unknown; restored from __doc__
        # Read at most size bytes.
        """
        read([size]) -> read at most size bytes, returned as a string.
        """
        pass

    def readinto(self):  # real signature unknown; restored from __doc__
        # Read into a buffer; do not use, it may be removed.
        """ readinto() -> Undocumented.  Don't use this; it may go away. """
        pass

    def readline(self, size=None):  # real signature unknown; restored from __doc__
        # Read a single line.
        """
        readline([size]) -> next line from the file, as a string.
        """
        pass

    def readlines(self, size=None):  # real signature unknown; restored from __doc__
        # Read all the data and return it as a list split on newlines.
        """
        readlines([size]) -> list of strings, each a line from the file.
        """
        return []

    def seek(self, offset, whence=None):  # real signature unknown; restored from __doc__
        # Move the file pointer to the given position.
        """
        seek(offset[, whence]) -> None.  Move to new file position.
        """
        pass

    def tell(self):  # real signature unknown; restored from __doc__
        # Get the current position of the file pointer.
        """ tell() -> current file position, an integer (may be a long integer). """
        pass

    def truncate(self, size=None):  # real signature unknown; restored from __doc__
        # Truncate the file, keeping only the data before the given position.
        pass

    def write(self, p_str):  # real signature unknown; restored from __doc__
        # Write a string to the file.
        """
        write(str) -> None.  Write string str to file.
        """
        pass

    def writelines(self, sequence_of_strings):  # real signature unknown; restored from __doc__
        # Write a sequence of strings to the file.
        """
        writelines(sequence_of_strings) -> None.  Write the strings to the file.
        """
        pass

    def xreadlines(self):  # real signature unknown; restored from __doc__
        # Iterate over the file line by line instead of reading it all at once.
        """
        xreadlines() -> returns self.
        """
        pass
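A few of these methods can be tried together. The sketch below uses Python 3 syntax and a scratch file in a temporary directory (the file name and its three lines are assumptions made for the illustration); it only calls methods that exist on file objects in both Python 2 and Python 3.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "log")  # hypothetical scratch file
with open(path, "w") as f:
    f.writelines(["alpha\n", "beta\n", "gamma\n"])

with open(path, "r") as f:
    first = f.readline()        # one line, newline included
    position = f.tell()         # offset just after the first line
    rest = f.readlines()        # remaining lines as a list
    f.seek(0)                   # back to the start of the file
    everything = f.read()       # whole file as one string

print(repr(first), position, rest)
```

Note that readlines continues from wherever the file pointer currently is, which is why seek(0) is needed before read can return the whole file again.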

 

III. with

To avoid forgetting to close a file after opening it, let a context manager handle it:

with open('log', 'r') as f:
    ...

When the with block finishes, the file is closed and its resources are released automatically, even if an exception was raised inside the block.

Since Python 2.7, with can also manage several files in one statement:

with open('log1') as obj1, open('log2') as obj2:
    pass
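The automatic close can be observed through the closed attribute of the file object. This is a small sketch in Python 3 syntax; the scratch file is an assumption made for the illustration.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "log")  # hypothetical scratch file

with open(path, "w") as f:
    f.write("hello\n")
    inside = f.closed       # still open inside the block

outside = f.closed          # closed automatically when the block exits
print(inside, outside)
```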

 

IV. Examples of Python file operations

Custom function

Background

Before learning about functions, we wrote procedural code: the program implements its features from top to bottom following the business logic, which often takes one long stretch of code. The most common operation during development is copy and paste, that is, copying a block of code that already works to wherever the same feature is needed again, as follows:

while True:
    if cpu utilization > 90%:
        # Send email reminders
        Connect to Mailbox Server
        Send mail
        Close the connection

    if hard disk usage space > 90%:
        # Send email reminders
        Connect to Mailbox Server
        Send mail
        Close the connection

    if memory occupancy > 80%:
        # Send email reminders
        Connect to Mailbox Server
        Send mail
        Close the connection

Looking at the code above, the body of each if statement is identical and can be extracted into one shared function, as follows:

def send_mail(content):
    # Send email reminders
    Connect to Mailbox Server
    Send mail
    Close the connection

while True:

    if cpu utilization > 90%:
        send_mail('CPU Alarm')

    if hard disk usage space > 90%:
        send_mail('Hard Disk Alarm')

    if memory occupancy > 80%:
        send_mail('Memory Alarm')

Of the two implementations, the second is clearly better in reusability and readability. This is the difference between programming with functions and writing straight-line procedural code.

  • Function-based: encapsulate a piece of functionality in a function; from then on it never has to be written again, just call the function.
  • Object-oriented: classify and encapsulate functions to make development "faster, better and stronger..."

The most important benefit of programming with functions is better code reusability and readability.
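The pseudocode above can be turned into something runnable. The sketch below is only an illustration in Python 3 syntax: the thresholds dictionary, the check helper, the sample readings and the list that stands in for a mail server are all assumptions, not part of the original example.

```python
# Hypothetical thresholds and readings, invented for this illustration.
THRESHOLDS = {"CPU": 90, "Hard Disk": 90, "Memory": 80}

sent = []  # stands in for a real mail server


def send_mail(content):
    """Pretend to send a mail by recording its content."""
    sent.append(content)


def check(readings):
    """Call send_mail once for every reading above its threshold."""
    for name, limit in THRESHOLDS.items():
        if readings[name] > limit:
            send_mail(name + " Alarm")


check({"CPU": 95, "Hard Disk": 40, "Memory": 85})
print(sent)
```

Because the mail logic lives in one function, changing how alarms are delivered means editing send_mail only, not every if branch.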

2. Definition and Use of Functions

def function_name(parameters):

    ...
    function body
    ...

A function definition consists of the following parts:

  • def: the keyword that starts a function definition
  • Function name: the name of the function, used later to call it
  • Function body: the logic carried out in the function, such as sending mail or finding the largest number in [11, 22, 38, 888, 2]
  • Parameters: supply data to the function body
  • Return value: when the function has finished, it can return data to the caller

Of these, parameters and return values matter most:

1. Return value

A function is a block of functionality. The caller often needs to know whether it ran successfully, and the return value is how the function tells it.

2. Parameters

A function can take three kinds of parameters:

  • Ordinary parameters
  • Default parameters
  • Dynamic parameters
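The three kinds of parameters and a return value can be shown in one function. This is a minimal sketch; the function name greet and its arguments are made up for the illustration, and "dynamic parameters" is taken here to mean *args and **kwargs.

```python
def greet(name, greeting="Hello", *args, **kwargs):
    """name is ordinary, greeting has a default, *args/**kwargs are dynamic."""
    extras = list(args) + sorted(kwargs.items())
    return "%s, %s!" % (greeting, name), extras


plain = greet("alex")                            # default greeting is used
custom = greet("alex", "Hi", 1, 2, debug=True)   # extra arguments are collected
print(plain)
print(custom)
```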


 

lambda expressions

A simple if/else statement can be written more compactly as a ternary (conditional) expression:

# Ordinary conditional statement
if 1 == 1:
    name = 'wupeiqi'
else:
    name = 'alex'

# Ternary expression
name = 'wupeiqi' if 1 == 1 else 'alex'

Simple functions have a compact form too: the lambda expression.

# ###################### Ordinary function ######################

# Define the function (ordinary way)
def func(arg):
    return arg + 1

# Call the function
result = func(123)

# ###################### lambda ######################

# Define the function (lambda expression)
my_lambda = lambda arg: arg + 1

# Call the function
result = my_lambda(123)

lambda exists to give simple functions a concise form; its body is limited to a single expression.
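A lambda is most useful where a small function is needed only once, for example as a sort key. The sketch below is an illustration with invented sample data.

```python
# Hypothetical sample data, invented for this illustration.
pairs = [("alex", 30), ("wupeiqi", 25), ("eric", 35)]

by_age = sorted(pairs, key=lambda pair: pair[1])  # sort on the second field
add = lambda a, b: a + b                          # a lambda may take several arguments

print(by_age)
print(add(2, 3))
```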

Built-in functions, part 2

I. map

Apply a function to every element of a sequence and obtain a new sequence of the results.

 

li = [11, 22, 33]

new_list = map(lambda a: a + 100, li)         # [111, 122, 133] in Python 2

li = [11, 22, 33]
sl = [1, 2, 3]
new_list = map(lambda a, b: a + b, li, sl)    # [12, 24, 36] in Python 2
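In Python 3, map returns a lazy iterator rather than a list, so the same calls need list(...) around them to see the values. A minimal sketch (the names plus_100 and pairwise are chosen for the illustration):

```python
li = [11, 22, 33]
sl = [1, 2, 3]

plus_100 = list(map(lambda a: a + 100, li))        # one sequence, one argument
pairwise = list(map(lambda a, b: a + b, li, sl))   # two sequences, two arguments

print(plus_100)  # [111, 122, 133]
print(pairwise)  # [12, 24, 36]
```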

 

II. filter

Select the elements of a sequence that satisfy a condition and obtain the sequence of matches.

li = [11, 22, 33]

new_list = filter(lambda arg: arg > 22, li)    # [33] in Python 2

# If the first argument of filter is None, the elements that evaluate to true are kept.
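Both uses of filter, with a predicate and with None, can be checked directly. The sketch below uses Python 3 syntax, where filter returns an iterator and is wrapped in list(...); the list named mixed is invented for the illustration.

```python
li = [11, 22, 33]
mixed = [0, 1, "", "a", None]

large = list(filter(lambda arg: arg > 22, li))   # keep elements greater than 22
truthy = list(filter(None, mixed))               # None keeps the truthy elements

print(large)   # [33]
print(truthy)  # [1, 'a']
```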

 

III. reduce

Combine all elements of a sequence into a single value by applying a two-argument function cumulatively.

li = [11, 22, 33]

result = reduce(lambda arg1, arg2: arg1 + arg2, li)    # 66

# reduce's first argument: a function that takes two arguments
# reduce's second argument: the sequence to iterate over
# reduce's third argument (optional): the initial value
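The effect of the optional initial value is easy to see side by side. In Python 3, reduce is no longer a built-in and has to be imported from functools; the import below also works in Python 2.6 and later. The variable names are chosen for the illustration.

```python
from functools import reduce  # required in Python 3, also valid in Python 2.6+

li = [11, 22, 33]

total = reduce(lambda arg1, arg2: arg1 + arg2, li)                 # ((11 + 22) + 33)
total_from_100 = reduce(lambda arg1, arg2: arg1 + arg2, li, 100)   # starts from 100

print(total)           # 66
print(total_from_100)  # 166
```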

 

yield generator

1. The difference between range and xrange

>>> print range(10)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> print xrange(10)
xrange(10)

As the code above shows, range creates all the requested numbers in memory at once, while xrange creates nothing up front and produces each number only when the loop iterates to it. (In Python 3, range behaves like xrange, and xrange no longer exists.)

 

def nrange(num):
    temp = -1
    while True:
        temp = temp + 1
        if temp >= num:
            return
        else:
            yield temp
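To see how such a generator behaves, it helps to pull values out of it by hand. The sketch below restates nrange in a slightly shorter form (the simplification is mine, the behavior is the same) and uses next and list, which work in Python 2.6+ and Python 3.

```python
def nrange(num):
    """Yield 0, 1, ..., num - 1 one at a time, like xrange."""
    temp = 0
    while temp < num:
        yield temp
        temp += 1


gen = nrange(3)
first = next(gen)       # runs the body up to the first yield
remaining = list(gen)   # resumes after that yield until the loop ends

print(first, remaining)
```

Calling nrange(3) runs none of the function body; execution starts at the first next call and pauses at each yield.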

 

2. The difference between read and xreadlines in file operations

read loads the entire file into memory at once.

xreadlines fetches one line at a time, only as the loop iterates.

# Version 1: built on f.next()
def NReadlines():
    with open('log', 'r') as f:
        while True:
            try:
                line = f.next()    # raises StopIteration at end of file
            except StopIteration:
                return
            yield line

for i in NReadlines():
    print i


# Version 2: built on seek, readline and tell
def NReadlines():
    with open('log', 'r') as f:
        seek = 0
        while True:
            f.seek(seek)
            data = f.readline()    # returns '' at end of file
            if data:
                seek = f.tell()
                yield data
            else:
                return

for item in NReadlines():
    print item
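In current Python the same lazy behavior is available by iterating over the file object itself. The following sketch uses Python 3 syntax; the function name read_lines and the scratch file contents are assumptions made for the illustration.

```python
import os
import tempfile


def read_lines(path):
    """Yield the lines of a file one at a time instead of loading them all."""
    with open(path, "r") as f:
        for line in f:      # a file object is itself a lazy line iterator
            yield line


path = os.path.join(tempfile.mkdtemp(), "log")  # hypothetical scratch file
with open(path, "w") as f:
    f.write("one\ntwo\nthree\n")

lines = [line.rstrip("\n") for line in read_lines(path)]
print(lines)
```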

 

Decorator

A decorator is a function with a special role: it decorates a function or a class, and can add extra operations before and after the decorated function runs.

def wrapper(func):
    def result():
        print 'before'
        func()
        print 'after'
    return result

@wrapper
def foo():
    print 'foo'

 

import functools


def wrapper(func):
    @functools.wraps(func)    # keep foo's name and docstring on the wrapped function
    def inner():
        print 'before'
        func()
        print 'after'
    return inner

@wrapper
def foo():
    print 'foo'


#!/usr/bin/env python
#coding:utf-8

def Before(request, kargs):
    print 'before'

def After(request, kargs):
    print 'after'


def Filter(before_func, after_func):
    def outer(main_func):
        def wrapper(request, kargs):

            before_result = before_func(request, kargs)
            if before_result is not None:
                return before_result

            main_result = main_func(request, kargs)
            if main_result is not None:
                return main_result

            after_result = after_func(request, kargs)
            if after_result is not None:
                return after_result

        return wrapper
    return outer

@Filter(Before, After)
def Index(request, kargs):
    print 'index'


if __name__ == '__main__':
    Index(1, 2)
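A decorator that should work on functions with any signature usually forwards *args and **kwargs and passes the return value through. The sketch below shows that variant in Python 3 syntax; the calls list that records the order of events is an assumption added so the behavior can be inspected.

```python
import functools

calls = []  # records the order of events for inspection


def wrapper(func):
    @functools.wraps(func)
    def inner(*args, **kwargs):
        calls.append("before")
        result = func(*args, **kwargs)   # forward all arguments unchanged
        calls.append("after")
        return result                    # pass the return value through
    return inner


@wrapper
def foo(x):
    """Double x."""
    calls.append("foo")
    return x * 2


value = foo(21)
print(value, calls, foo.__name__)
```

Thanks to functools.wraps, foo.__name__ is still "foo" rather than "inner", which keeps tracebacks and documentation tools readable.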

 

Bubble sort

Requirement: sort the list [13, 22, 6, 99, 11] in ascending order.

Idea: compare each pair of adjacent values and move the larger one to the right. After one pass the largest value is at the end, so each following pass can stop one position earlier.

 

# Step 1: a single pass moves the largest value to the end
li = [13, 22, 6, 99, 11]

for m in range(4):     # equivalent to: for m in range(len(li)-1):
    if li[m] > li[m+1]:
        temp = li[m+1]
        li[m+1] = li[m]
        li[m] = temp


# Step 2: repeat the pass, one comparison shorter each time
li = [13, 22, 6, 99, 11]

for m in range(4):     # equivalent to: for m in range(len(li)-1):
    if li[m] > li[m+1]:
        temp = li[m+1]
        li[m+1] = li[m]
        li[m] = temp

for m in range(3):     # equivalent to: for m in range(len(li)-2):
    if li[m] > li[m+1]:
        temp = li[m+1]
        li[m+1] = li[m]
        li[m] = temp

for m in range(2):     # equivalent to: for m in range(len(li)-3):
    if li[m] > li[m+1]:
        temp = li[m+1]
        li[m+1] = li[m]
        li[m] = temp

for m in range(1):     # equivalent to: for m in range(len(li)-4):
    if li[m] > li[m+1]:
        temp = li[m+1]
        li[m+1] = li[m]
        li[m] = temp
print li


# Step 3: fold the repeated passes into a nested loop
li = [13, 22, 6, 99, 11]

for i in range(1, len(li)):
    for m in range(len(li)-i):
        if li[m] > li[m+1]:
            temp = li[m+1]
            li[m+1] = li[m]
            li[m] = temp
print li
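The nested loop can be wrapped in a reusable function. The sketch below is one possible version in Python 3 syntax: the function name bubble_sort, the tuple swap and the early exit when a pass makes no swaps are my additions, not part of the original listing.

```python
def bubble_sort(values):
    """Return a new list with the items of values in ascending order."""
    items = list(values)                        # leave the input untouched
    for end in range(len(items) - 1, 0, -1):    # each pass is one comparison shorter
        swapped = False
        for m in range(end):
            if items[m] > items[m + 1]:
                items[m], items[m + 1] = items[m + 1], items[m]
                swapped = True
        if not swapped:     # no swaps means the list is already sorted
            break
    return items


print(bubble_sort([13, 22, 6, 99, 11]))  # [6, 11, 13, 22, 99]
```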

 

recursion

Use a function to generate the following sequence:

The Fibonacci sequence is 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765, 10946, 17711, 28657, 46368, ... Each number is the sum of the two before it.

 

def func(arg1, arg2):
    if arg1 == 0:
        print arg1, arg2
    arg3 = arg1 + arg2
    if arg3 > 46368:    # base case: stop after the last number listed above
        return
    print arg3
    func(arg2, arg3)

func(0, 1)
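A recursive function is often easier to reuse when it returns its result instead of printing it. The following is a minimal sketch in Python 3 syntax; the name fib_upto and its limit parameter are assumptions introduced for the illustration.

```python
def fib_upto(limit, a=0, b=1):
    """Return the Fibonacci numbers not exceeding limit, built recursively."""
    if a > limit:               # base case: stop the recursion
        return []
    return [a] + fib_upto(limit, b, a + b)


print(fib_upto(100))  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]
```

Every recursive function needs a base case like the one above; without it the calls continue until Python raises a recursion depth error.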

 

 


Topics: Python Lambda Programming Big Data