A few days ago I had an embarrassing moment in the company group chat: I didn't have enough doutu (meme sticker) packs to keep the banter going, which left me rather frustrated.
To liven things up, I scraped more than 1,000 doutu stickers.
Since some readers may not have a solid Python foundation, Gnaw Shujun will cover the necessary basics first; experienced readers can jump straight to the hands-on part. The target of this crawl: doutu stickers.
If you don't need the basics, skip directly to the hands-on part at the end of the article.
Object-oriented programming
Python was designed as an object-oriented language from the start, so creating classes and objects in Python is very simple.
If you haven't worked with an object-oriented language before, you first need to understand a few basic concepts. Let's get a feel for object-oriented programming in Python.
Introduction to object oriented
- Class: describes a collection of objects that share the same properties and methods. It defines the properties and methods common to every object in the collection. An object is an instance of a class.
- Class variable: a variable shared by all instances of the class. Class variables are defined in the class body, outside of any method.
- Data member: a class variable or instance variable that holds data belonging to the class or to its instances.
- Method overriding: if a method inherited from the parent class does not meet the needs of the child class, the child class can redefine it. This process is called overriding.
- Instance variable: a variable assigned through self inside a method (usually __init__); it belongs only to the current instance.
- Inheritance: a derived class inherits the fields and methods of its base class (parent class).
- Instantiation: creating a concrete object, that is, an instance of a class.
- Method: a function defined in a class.
- Object: an instance of the data structure defined by a class. An object consists of two kinds of data members (class variables and instance variables) plus methods.
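The terms above are easier to remember with a tiny example. The following is only a minimal sketch; the class names Animal and Dog and their attributes are made up for illustration and are not part of the crawler.

```python
class Animal(object):
    kind = 'animal'              # class variable: shared by every instance

    def __init__(self, name):
        self.name = name         # instance variable: belongs to one object

    def speak(self):             # method: a function defined in a class
        return f'{self.name} makes a sound'


class Dog(Animal):               # inheritance: Dog derives from Animal
    def speak(self):             # method overriding: replaces Animal.speak
        return f'{self.name} barks'


generic = Animal('Generic')      # instantiation: create concrete objects
rex = Dog('Rex')

print(generic.speak())
print(rex.speak())
print(rex.kind)                  # the class variable is visible through the instance
```

Dog does not define kind or __init__ itself; it gets both from Animal, while its own speak replaces the inherited one.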
Create classes and objects
A class is like a template. The template can contain multiple functions, and each function implements some piece of functionality.
An object is an instance created from that template. The instance can execute the functions defined in the class.
# Create a class
class Foo(object):
    # Create a function in the class
    def bar(self):
        # todo
        pass

# Create the obj object from the Foo class
obj = Foo()
- class is the keyword that defines a class
- object is the parent class; all classes inherit from object
- To create an object, add parentheses after the class name
Three characteristics of object oriented
Encapsulation
Encapsulation, as the name suggests, means storing content somewhere and later accessing the content stored there.
So using encapsulation involves two steps:
- Encapsulate the content somewhere
- Access the encapsulated content from there
class Foo(object):
    # Constructor: executed automatically when an object is created from the class
    def __init__(self, name, age):
        self.name = name
        self.age = age

# Create objects from the Foo class
# This automatically runs the __init__ method of Foo
obj1 = Foo('Jack', 18)
obj2 = Foo('Rose', 20)
obj1 = Foo('Jack', 18) stores Jack and 18 in the name and age attributes of obj1 (self); obj2 works the same way.
self is a formal parameter. When obj1 = Foo('Jack', 18) is executed, self refers to obj1. Therefore each object has its own name and age attributes.
Accessing encapsulated content directly through the object
class Foo(object):
    def __init__(self, name, age):
        self.name = name
        self.age = age

obj1 = Foo('Jack', 18)
print(obj1.name)    # Read the name attribute of obj1
print(obj1.age)     # Read the age attribute of obj1

obj2 = Foo('Jack', 18)
print(obj2.name)    # Read the name attribute of obj2
print(obj2.age)     # Read the age attribute of obj2
Accessing encapsulated content indirectly through self
class Foo(object):
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def detail(self):
        print(self.name)
        print(self.age)

obj1 = Foo('Jack', 18)
obj1.detail()

obj2 = Foo('Rose', 20)
obj2.detail()
When obj1.detail() is called, Python passes obj1 as the self parameter by default, which is equivalent to calling Foo.detail(obj1). Inside the method self is obj1, so self.name is the same as obj1.name.
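To check this behavior yourself, compare a call through the object with a call through the class. This is a small sketch based on the Foo class above; detail returns the values here instead of printing them, purely so the two results can be compared.

```python
class Foo(object):
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def detail(self):
        # Return the values instead of printing so they are easy to compare
        return (self.name, self.age)


obj1 = Foo('Jack', 18)

via_object = obj1.detail()      # Python supplies obj1 as self
via_class = Foo.detail(obj1)    # the same call with self passed explicitly

print(via_object)
print(via_object == via_class)
```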
How object orientation simplifies code
Encapsulation in object-oriented programming boils down to using the constructor to store content in an object, and then reading that content directly or indirectly through the object.
Next, let's see how much simpler this makes the code.
# Function-based programming: every call repeats the same arguments
def kanchai(name, age, gender):
    print("%s, %s years old, %s, goes up the mountain to cut firewood" % (name, age, gender))

def qudongbei(name, age, gender):
    print("%s, %s years old, %s, drives to the northeast" % (name, age, gender))

def dabaojian(name, age, gender):
    print("%s, %s years old, %s, loves big health care" % (name, age, gender))

kanchai('Xiao Ming', 10, 'male')
qudongbei('Xiao Ming', 10, 'male')
dabaojian('Xiao Ming', 10, 'male')
kanchai('Lao Li', 90, 'male')
qudongbei('Lao Li', 90, 'male')
dabaojian('Lao Li', 90, 'male')

# Object-oriented programming: the arguments are stored once, in the object
class Foo(object):
    def __init__(self, name, age, gender):
        self.name = name
        self.age = age
        self.gender = gender

    def kanchai(self):
        print("%s, %s years old, %s, goes up the mountain to cut firewood" % (self.name, self.age, self.gender))

    def qudongbei(self):
        print("%s, %s years old, %s, drives to the northeast" % (self.name, self.age, self.gender))

    def dabaojian(self):
        print("%s, %s years old, %s, loves big health care" % (self.name, self.age, self.gender))

xiaoming = Foo('Xiao Ming', 10, 'male')
xiaoming.kanchai()
xiaoming.qudongbei()
xiaoming.dabaojian()

laoli = Foo('Lao Li', 90, 'male')
laoli.kanchai()
laoli.qudongbei()
laoli.dabaojian()
With function-based programming, you have to pass the same arguments every time you call a function. If there are many arguments, you end up copying and pasting them for every call, which is inconvenient. With the object-oriented approach, you store the required values in the object once, when it is created, and then read them through the object in every method call.
Inheritance
Inheritance establishes a parent-child relationship between classes: a subclass can directly use the attributes and methods of its parent class. In Python, a new class can inherit from one or more parent classes. The parent class is also called the base class or superclass, and the new class is called the derived class or subclass.
class ParentClass1:    # Define parent class 1
    pass

class ParentClass2:    # Define parent class 2
    pass

class SubClass1(ParentClass1):    # Single inheritance: the base class is ParentClass1, the derived class is SubClass1
    pass

class SubClass2(ParentClass1, ParentClass2):    # Python supports multiple inheritance; separate the base classes with commas
    pass

print(SubClass1.__bases__)    # View all direct parent classes
print(SubClass2.__bases__)

# Output:
# (<class '__main__.ParentClass1'>,)
# (<class '__main__.ParentClass1'>, <class '__main__.ParentClass2'>)
Inheritance rules
1. A subclass inherits the member variables and methods of the parent class.
2. If a subclass defines its own constructor (__init__), the parent's constructor is not called automatically; the subclass must call it explicitly.
3. A subclass cannot delete members of the parent class, but it can redefine them.
4. A subclass can add its own members.
The specific code is as follows:
class Person(object):
    def __init__(self, name, age, sex):
        # name and age are hard-coded here, so the arguments passed in are ignored
        self.name = 'jasn'
        self.age = 18
        self.sex = sex

    def talk(self):
        print('I want to say something to you')


class Chinese(Person):
    def __init__(self, name, age, sex, language):
        # Run the parent initializer first; it sets name, age and sex
        Person.__init__(self, name, age, sex)
        # Overwrite the age set by the parent with the value passed to the subclass
        self.age = age
        self.language = 'Chinese'

    def talk(self):
        print('I speak Mandarin')
        Person.talk(self)


obj = Chinese('nancy', 30, 'male', 'mandarin')
print(obj.name)
print(obj.age)
print(obj.language)
obj.talk()
The output is as follows:
jasn
30
Chinese
I speak Mandarin
I want to say something to you
The Chinese class defines its own __init__, but the first thing it does is call Person.__init__, which sets name to 'jasn' and age to 18 regardless of the arguments. The subclass then reassigns age, so age becomes 30. It never reassigns name, so even though 'nancy' was passed in, the value set by the parent class ('jasn') is what gets printed.
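The rule about constructors deserves a closer look, because in Python the parent's __init__ is an ordinary inherited method: it is used automatically unless the subclass defines its own. The sketch below illustrates the three cases; Student, Teacher and Visitor are hypothetical classes invented for this example.

```python
class Person(object):
    def __init__(self, name):
        self.name = name


class Student(Person):
    # No __init__ here, so Person.__init__ is inherited and used as-is
    pass


class Teacher(Person):
    def __init__(self, name, subject):
        # Defining __init__ replaces the parent's version, so the parent
        # initializer has to be called explicitly to set self.name
        Person.__init__(self, name)
        self.subject = subject


class Visitor(Person):
    def __init__(self):
        # Parent initializer deliberately not called: name is never set
        pass


s = Student('Li')
t = Teacher('Wang', 'math')
v = Visitor()

print(s.name, t.name, t.subject)
print(hasattr(v, 'name'))
```

Visitor shows what goes wrong when the explicit call is forgotten: the object exists, but the attributes the parent would have set are missing.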
The role of inheritance
1. Realize code (function) reuse and reduce code redundancy
2. Enhance software scalability
3. Improve software maintainability
Inheritance and abstraction
Two important concepts of object-oriented: abstraction and classification.
class animal():    # Define the parent class
    country = 'china'    # This is a class variable

    def __init__(self, name, age):
        self.name = name    # These are data attributes (instance variables)
        self.age = age

    def walk(self):    # A method of the class
        print('%s is walking' % self.name)

    def say(self):
        pass


class people(animal):    # The subclass inherits from the parent class
    pass


class pig(animal):    # The subclass inherits from the parent class
    pass


class dog(animal):    # The subclass inherits from the parent class
    pass


aobama = people('aobama', 60)    # Instantiate an object
print(aobama.name)
aobama.walk()
The code above can be read as follows: we abstract people, dogs and pigs into the common concept of an animal, and the people, dog and pig classes all inherit from the animal class.
The function and principle of super() in python
super() is very commonly used in class inheritance. It solves some problems of subclasses calling parent methods. Let's take a look at what it optimizes.
class Foo(object):
    def bar(self, message):
        print(message)

obj1 = Foo()
obj1.bar('hello')
When there is an inheritance relationship, a subclass sometimes needs to call a method of its parent class. The simplest way is to turn the call on the object into a call on the class. Note that self must then be passed explicitly.
The specific code is as follows:
class FooParent(object):
    """docstring for FooParent"""
    def bar(self, message):
        print(message)


class FooChild(FooParent):
    """docstring for FooChild"""
    def bar(self, message):
        # Call the parent method through the class, passing self explicitly
        FooParent.bar(self, message)


foochild = FooChild()
foochild.bar('hello')
This approach has a flaw: the parent class name is hard-coded in the child class, so if the parent class is renamed, every such call in the child class has to be changed as well.
To address this, Python provides the super() mechanism. The specific code is as follows:
class FooParent(object):
    def bar(self, message):
        print(message)


class FooChild(FooParent):
    def bar(self, message):
        # super() finds the parent class, so its name is not hard-coded here
        super(FooChild, self).bar(message)


obj = FooChild()
obj.bar('hello')
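In Python 3 the arguments to super() can be omitted inside a method, and the class's method resolution order (the __mro__ attribute) shows where super() will look next. A short sketch, reusing the FooParent and FooChild names from above but returning strings instead of printing so the result is easy to inspect:

```python
class FooParent(object):
    def bar(self, message):
        return f'parent: {message}'


class FooChild(FooParent):
    def bar(self, message):
        # Python 3 shorthand for super(FooChild, self).bar(message)
        return 'child -> ' + super().bar(message)


obj = FooChild()
print(obj.bar('hello'))

# The method resolution order that super() follows
print([cls.__name__ for cls in FooChild.__mro__])
```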
Polymorphism
Polymorphism is not used in the hands-on part, so I won't cover it here. Interested readers can look it up on their own.
What is the producer-consumer model?
Suppose two processes, A and B, share a fixed-size buffer. Process A produces data and puts it into the buffer; process B takes data out of the buffer and processes it. Process A is the producer and process B is the consumer.
Why use the producer-consumer model?
In concurrent programs, the producer is the process that generates data, and the consumer is the process that uses (processes) the data. If the consumer works faster than the producer, the consumer has to wait for the producer. Likewise, if the producer works faster than the consumer, the producer has to wait for the consumer.
The model decouples producers from consumers and balances the speed of production and consumption, because the two sides do not communicate directly but through a queue.
Producer-consumer model
The producer-consumer model uses a container to remove the tight coupling between producer and consumer.
Producers and consumers do not communicate directly; they communicate through a blocking queue. After producing data, a producer does not wait for a consumer to process it but simply puts it into the blocking queue. A consumer does not ask a producer for data but takes it from the blocking queue. The blocking queue acts as a buffer that balances the capacities of producers and consumers.
Multiprocess queue implementation
The specific code is as follows:
from multiprocessing import Process, Queue
import time
import random


# Producer
def producer(name, food, q):
    for i in range(4):
        time.sleep(random.randint(1, 3))  # Simulate the time needed to produce data
        f = '%s produced %s number %s' % (name, food, i + 1)
        print(f)
        q.put(f)


# Consumer
def consumer(name, q):
    while True:
        food = q.get()
        if food is None:
            # None is the stop signal sent by the main process
            print('%s got the stop signal' % name)
            break
        f = '%s consumed: %s' % (name, food)
        print(f)
        time.sleep(random.randint(1, 3))


if __name__ == '__main__':
    q = Queue()  # Create the queue
    # Start the producers, which generate the data
    p1 = Process(target=producer, args=('p1', 'steamed stuffed bun', q))
    p1.start()
    p2 = Process(target=producer, args=('p2', 'Clay oven rolls', q))
    p2.start()
    # Start the consumers
    c1 = Process(target=consumer, args=('c1', q))
    c1.start()
    c2 = Process(target=consumer, args=('c2', q))
    c2.start()
    # Wait for both producers, then send one None per consumer
    p1.join()
    p2.join()
    q.put(None)
    q.put(None)
Thread queue implementation
The code above uses multiple processes. Next, let's implement the same thing with multiple threads.
The specific code is as follows:
import random
import time
from threading import Thread
import queue


def producer(name, count, q):
    for i in range(count):
        food = f'{name} produced steamed stuffed bun number {i}'
        print(food)
        q.put(food)


def consumer(name, q):
    while True:
        time.sleep(random.randint(1, 3))
        try:
            # get_nowait() avoids the race in which another consumer takes
            # the last item between an empty() check and a blocking get()
            food = q.get_nowait()
        except queue.Empty:
            break
        print(f'{name} consumed: {food}')


if __name__ == '__main__':
    q = queue.Queue()
    print(q.empty())  # True: the queue starts out empty
    for i in range(1, 4):
        p = Thread(target=producer, args=(f'producer{i}', 10, q))
        p.start()
    for i in range(1, 6):
        c = Thread(target=consumer, args=(f'consumer{i}', q))
        c.start()
Characteristics of the producer-consumer model
- Producers do not put data into the buffer while it is full, and consumers do not take data from the buffer while it is empty.
- When the buffer is full, the producer goes to sleep; it is woken up once a consumer removes data from the buffer, and then continues adding data. When the buffer is empty, the consumer goes to sleep; it is woken up once a producer adds data to the buffer.
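The two characteristics above describe a bounded buffer. With Python's standard library this can be sketched with queue.Queue(maxsize=...): put() blocks while the queue is full and get() blocks while it is empty. The snippet below is an illustration only; the buffer size, the SENTINEL marker and the function names are assumptions made for this example, not part of the crawler.

```python
import queue
import threading

# A full queue rejects a non-blocking put with queue.Full
demo = queue.Queue(maxsize=1)
demo.put('a')
try:
    demo.put_nowait('b')
    overflowed = False
except queue.Full:
    overflowed = True

SENTINEL = None                      # hypothetical end-of-data marker
buffer = queue.Queue(maxsize=2)      # fixed-size buffer
consumed = []


def producer(count):
    for i in range(count):
        buffer.put(i)                # blocks (sleeps) while the buffer is full
    buffer.put(SENTINEL)             # tell the consumer there is no more data


def consumer():
    while True:
        item = buffer.get()          # blocks (sleeps) while the buffer is empty
        if item is SENTINEL:
            break
        consumed.append(item)


p = threading.Thread(target=producer, args=(10,))
c = threading.Thread(target=consumer)
p.start()
c.start()
p.join()
c.join()

print(overflowed)
print(consumed)
```

Although the buffer only ever holds two items, all ten values reach the consumer in order, because each side simply waits whenever the other falls behind.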
Summary of the basics
That covers the basic knowledge needed for the hands-on part. It is only a starting point, and I hope it prompts you to dig deeper on your own. The basics fall into two modules: object-oriented programming, and the producer-consumer model with processes and threads. Get as familiar with both as you can, so that you can write a more efficient and robust crawler demo.
Hands-on part
Libraries used
First, here are the libraries this crawler needs:
import requests
from lxml import etree
import threading
from queue import Queue
import re
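Install whatever is missing. requests and lxml are third-party packages (for example, pip install requests lxml); threading, queue and re ship with Python.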
Scraping target
The website to be scraped this time is doutub. The address is:
https://www.doutub.com/
What we want to capture are the doutu stickers on this site.
They will turn you into a doutu master in no time. Enough talk, let's get started.
Web page analysis
Take a closer look: there are 26 pages of stickers. That is plenty to work with.
First, let's analyze how the URL changes from page to page.
# Page 1
https://www.doutub.com/img_lists/new/1
# Page 2
https://www.doutub.com/img_lists/new/2
# Page 3
https://www.doutub.com/img_lists/new/3
Such a regular pattern is good news: only the page number at the end changes.
With the page URLs sorted out, the next step is to find the URL of each sticker.
The sticker links are easy to spot in the page source, and they can be extracted with XPath.
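To see how such an extraction works without hitting the network, here is a minimal sketch on a made-up HTML fragment. It uses the standard library's xml.etree.ElementTree as a stand-in for lxml, so it supports only a subset of XPath and needs well-formed markup. The class name comes from the crawler code in this article, while the sample markup and the example.com URLs are assumptions for illustration; the real page may be structured differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; not copied from the real website
SAMPLE = """
<html><body>
  <div class="expression-list clearfix">
    <div><a href="/detail/1"><img src="https://example.com/a.jpg"/><span>cat</span></a></div>
    <div><a href="/detail/2"><img src="https://example.com/b.gif"/><span>dog</span></a></div>
  </div>
</body></html>
"""

root = ET.fromstring(SAMPLE.strip())

pairs = []
# Find the container by its class attribute, then walk div -> a inside it
for block in root.findall(".//div[@class='expression-list clearfix']"):
    for link in block.findall('./div/a'):
        img = link.find('img')       # the picture element
        span = link.find('span')     # the caption element
        pairs.append((img.get('src'), span.text))

print(pairs)
```

With lxml, the attribute and text can be selected directly in the XPath expression (./div/a/img/@src and ./div/a/span/text()), which is what the crawler below does.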
Producer implementation
First, we create two queues: one stores the URL of each page and the other stores the picture links.
The specific code is as follows:
# Create the queues
page_queue = Queue()    # Page URLs
img_queue = Queue()     # Picture URLs

for page in range(1, 27):
    url = f'https://www.doutub.com/img_lists/new/{page}'
    page_queue.put(url)
The code above puts the URL of every page into page_queue.
Next, create a class that puts the picture URLs into img_queue.
The specific code is as follows:
class ImageParse(threading.Thread):
    def __init__(self, page_queue, img_queue):
        super(ImageParse, self).__init__()
        self.page_queue = page_queue
        self.img_queue = img_queue
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36'
        }

    def run(self):
        # Keep taking page URLs until the page queue is empty
        while True:
            if self.page_queue.empty():
                break
            url = self.page_queue.get()
            self.parse_img(url)

    def parse_img(self, url):
        # Download the page and parse it into an element tree
        response = requests.get(url, headers=self.headers).content.decode('utf-8')
        html = etree.HTML(response)
        img_lists = html.xpath('//div[@class="expression-list clearfix"]')
        for img_list in img_lists:
            img_urls = img_list.xpath('./div/a/img/@src')
            img_names = img_list.xpath('./div/a/span/text()')
            # Queue each picture as a (url, name) pair for the consumers
            for img_url, img_name in zip(img_urls, img_names):
                self.img_queue.put((img_url, img_name))
Consumer implementation
The consumer is very simple. It keeps taking picture URLs from img_queue and downloading them, and it exits once both queues are empty.
class DownLoad(threading.Thread):
    def __init__(self, page_queue, img_queue):
        super(DownLoad, self).__init__()
        self.page_queue = page_queue
        self.img_queue = img_queue
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36'
        }

    def run(self):
        while True:
            # Stop once there are no pages left to parse and no pictures left to download
            if self.page_queue.empty() and self.img_queue.empty():
                break
            img_url, filename = self.img_queue.get()
            # File extension: the text after the last dot in the URL
            fix = img_url.split('.')[-1]
            # Strip punctuation and path separators from the file name
            name = re.sub(r'[??.,. !!*\\/|]', '', filename)
            # print(fix)
            data = requests.get(img_url, headers=self.headers).content
            print('Downloading ' + filename)
            # The ../image/ directory must already exist
            with open('../image/' + name + '.' + fix, 'wb') as f:
                f.write(data)
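The file-naming step can be tried in isolation. The helper below is a hypothetical function written for this illustration (it is not part of the crawler); it mirrors the split and re.sub logic used in DownLoad.run, and the example.com URLs are made up.

```python
import re


def build_filename(img_url, title):
    """Hypothetical helper mirroring the naming logic in DownLoad.run."""
    ext = img_url.split('.')[-1]                 # text after the last dot
    name = re.sub(r'[?.,!*\\/| ]', '', title)    # drop punctuation, spaces, separators
    return name + '.' + ext


print(build_filename('https://example.com/img/abc.jpg', 'What? No way!'))
print(build_filename('https://example.com/x.gif', 'a/b|c*'))
```

One limitation to keep in mind: if an image URL carries a query string after the extension, everything after the last dot ends up in the file name, so real-world code may need a more careful way to extract the extension.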
Finally, let the two created threads run
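threads = []
for x in range(5):
    t1 = ImageParse(page_queue, img_queue)
    t1.start()
    t2 = DownLoad(page_queue, img_queue)
    t2.start()
    threads.extend([t1, t2])

# Join only after every thread has been started, so all five pairs run concurrently
for t in threads:
    t.join()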
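Exit checks based on empty() depend on timing: a downloader may see both queues empty while a parser is still working on the last page. One common alternative, shown here only as a sketch and not as the approach used in this article, is to send an explicit stop marker through the queues. The example runs offline: the STOP marker, the worker functions and the fake page data are all assumptions made for illustration.

```python
import threading
from queue import Queue

page_queue = Queue()
img_queue = Queue()
downloaded = []
lock = threading.Lock()
N_WORKERS = 3
STOP = None                          # hypothetical stop marker


def parse_worker():
    while True:
        page = page_queue.get()
        if page is STOP:
            break
        # Stand-in for fetching and parsing a page: two fake images per page
        for k in range(2):
            img_queue.put(f'page{page}-img{k}')


def download_worker():
    while True:
        img = img_queue.get()
        if img is STOP:
            break
        with lock:
            downloaded.append(img)   # stand-in for writing the file to disk


for page in range(1, 6):
    page_queue.put(page)
for _ in range(N_WORKERS):
    page_queue.put(STOP)             # one stop marker per parser

parsers = [threading.Thread(target=parse_worker) for _ in range(N_WORKERS)]
downloaders = [threading.Thread(target=download_worker) for _ in range(N_WORKERS)]
for t in parsers + downloaders:
    t.start()
for t in parsers:
    t.join()                         # every page has been parsed
for _ in range(N_WORKERS):
    img_queue.put(STOP)              # now it is safe to stop the downloaders
for t in downloaders:
    t.join()

print(len(downloaded))
```

Because the stop markers are queued only after all parsers have finished, no downloader can exit before every image has been handed over.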
Final results
A total of 1269 pictures were captured.
With this many stickers, who can out-meme you in the group chat now? And that is all it takes: a small multithreaded crawler.