To avoid awkward group chats, I scraped more than 1,000 doutu meme images with Python

Posted by biznickman on Mon, 20 Dec 2021 23:53:32 +0100

A few days ago, I embarrassed myself while chatting in the company group. I didn't have enough doutu memes (reaction images) to keep the chat lively, and I came away depressed and frustrated!

To liven things up, I scraped more than 1,000 doutu memes.

Since some readers may not have a solid Python foundation, I'll cover the background knowledge first; experienced readers can skip straight to the hands-on part. The target of this crawl: doutu memes.

If you don't want to read the basics, jump straight to the hands-on section at the end of the article.

Object-oriented programming

Python has been an object-oriented language from the start, so creating classes and objects is simple.

If you have not used an object-oriented language before, you first need to understand a few basic concepts. Let's get a feel for how Python handles them.

Introduction to object-oriented terms

  • Class: describes a collection of objects that share the same attributes and methods. It defines the attributes and methods common to every object in the collection. An object is an instance of a class.

  • Class variable: shared by all instances of the class. It is defined in the class body, outside any method.

  • Data member: a class variable or instance variable that holds data belonging to the class or its instances.

  • Method overriding: if a method inherited from the parent class does not meet the needs of the subclass, the subclass can redefine it. This is called method overriding (it is not the same thing as overloading).

  • Instance variable: a variable assigned through self inside a method; it belongs only to the current instance.

  • Inheritance: a derived class inherits the attributes and methods of a base class (parent class).

  • Instantiation: creating a concrete object from a class.

  • Method: a function defined in a class.

  • Object: an instance of the data structure defined by a class. An object consists of data members (class variables and instance variables) and methods.
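To make these terms concrete, here is a minimal sketch. The Animal and Dog classes are made up for illustration and have nothing to do with the crawler later on.

```python
class Animal:
    # Class variable: shared by every instance
    kingdom = 'animalia'

    def __init__(self, name):
        # Instance variable: belongs to this object only
        self.name = name

    def speak(self):
        return f'{self.name} makes a sound'


class Dog(Animal):  # Inheritance: Dog is derived from Animal
    def speak(self):  # Method overriding: replaces Animal.speak
        return f'{self.name} barks'


generic = Animal('Generic')  # Instantiation
rex = Dog('Rex')
print(generic.speak())
print(rex.speak())
print(rex.kingdom)  # class variable reached through an instance
```

Both objects carry their own name, while kingdom lives on the class and is looked up from there.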

Create classes and objects

A class is like a template. The template can contain multiple functions (methods), each of which implements some behaviour.

An object is an instance created from the template. The instance can call the functions defined in the class.

# Create a class
class Foo(object):
    # Create a method in the class
    def bar(self):
        # todo
        pass


# Create the object obj from the Foo class
obj = Foo()

  • class is the keyword that defines a class

  • object is the parent class; in Python 3 every class inherits from object, whether or not it is written out

  • To create an object, add parentheses after the class name

Three characteristics of object-oriented programming

Encapsulation

Encapsulation, as the name suggests, means storing content somewhere and later retrieving it from there.

So using encapsulation involves two steps:

  • Store the content somewhere

  • Retrieve the stored content from there

class Foo(object):
    # Constructor: runs automatically when an object is created from the class
    def __init__(self, name, age):
        self.name = name
        self.age = age


# Create objects from the Foo class
# This automatically runs Foo's __init__ method
obj1 = Foo('Jack', 18)
obj2 = Foo('Rose', 20)

obj1 = Foo('Jack', 18) stores 'Jack' and 18 in the name and age attributes of obj1 (that is, of self); obj2 works the same way.

self is a formal parameter. When obj1 = Foo('Jack', 18) runs, self refers to obj1. Each object therefore has its own name and age attributes.

Accessing encapsulated content directly through the object

class Foo(object):
    def __init__(self, name, age):
        self.name = name
        self.age = age
        
obj1 = Foo('Jack', 18)
print(obj1.name)  # Read the name attribute of obj1
print(obj1.age)   # Read the age attribute of obj1

obj2 = Foo('Rose', 20)
print(obj2.name)  # Read the name attribute of obj2
print(obj2.age)   # Read the age attribute of obj2

Accessing encapsulated content indirectly through self

class Foo(object):
    def __init__(self, name, age):
        self.name = name
        self.age = age
    def detail(self):
        print(self.name)
        print(self.age)

obj1 = Foo('Jack', 18)
obj1.detail()

obj2 = Foo('Rose', 20)
obj2.detail()

When obj1.detail() runs, Python passes obj1 as the self parameter automatically, so the call is equivalent to Foo.detail(obj1). Inside the method self is obj1, which means self.name is the same as obj1.name.
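The two call forms can be compared directly. This is a small sketch in which detail returns a string instead of printing, purely so the results can be compared; that change is an adaptation for the example.

```python
class Foo:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def detail(self):
        # Returns the text instead of printing it (adaptation for this sketch)
        return f'{self.name}, {self.age}'


obj1 = Foo('Jack', 18)
via_object = obj1.detail()    # Python supplies self automatically
via_class = Foo.detail(obj1)  # self passed explicitly
print(via_object == via_class)
```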

How object orientation simplifies code

Encapsulation in practice means using the constructor to store content in the object, and then reading that content back directly or indirectly through the object.

Next, let's see how much simpler this makes the code.

Function-based version:

def kanchai(name, age, gender):
    print("%s, %s years old, %s, goes up the mountain to chop firewood" % (name, age, gender))


def qudongbei(name, age, gender):
    print("%s, %s years old, %s, drives to the northeast" % (name, age, gender))


def dabaojian(name, age, gender):
    print("%s, %s years old, %s, loves big health care" % (name, age, gender))


kanchai('Xiao Ming', 10, 'male')
qudongbei('Xiao Ming', 10, 'male')
dabaojian('Xiao Ming', 10, 'male')


kanchai('Lao Li', 90, 'male')
qudongbei('Lao Li', 90, 'male')
dabaojian('Lao Li', 90, 'male')
 
Object-oriented version:

class Foo(object):

    def __init__(self, name, age, gender):
        self.name = name
        self.age = age
        self.gender = gender

    def kanchai(self):
        print("%s, %s years old, %s, goes up the mountain to chop firewood" % (self.name, self.age, self.gender))

    def qudongbei(self):
        print("%s, %s years old, %s, drives to the northeast" % (self.name, self.age, self.gender))

    def dabaojian(self):
        print("%s, %s years old, %s, loves big health care" % (self.name, self.age, self.gender))


xiaoming = Foo('Xiao Ming', 10, 'male')
xiaoming.kanchai()
xiaoming.qudongbei()
xiaoming.dabaojian()

laoli = Foo('Lao Li', 90, 'male')
laoli.kanchai()
laoli.qudongbei()
laoli.dabaojian()

With the function-based style, you have to pass the same arguments every time you call a function. If there are many arguments, you end up copying and pasting them on every call, which is inconvenient. With the object-oriented style, you store the arguments in the object once, when it is created, and every method then reads them from the object.

Inheritance

Inheritance is a parent-child relationship between classes. A subclass can directly use the attributes and methods of its parent class. In Python, a new class can inherit from one or more parent classes. The parent class is also called the base class or superclass, and the new class is called the derived class or subclass.

class ParentClass1:  # Define parent class 1
    pass


class ParentClass2:  # Define parent class 2
    pass


class SubClass1(ParentClass1):
    # Single inheritance: the base class is ParentClass1, the derived class is SubClass1
    pass


class SubClass2(ParentClass1, ParentClass2):
    # Python supports multiple inheritance; separate the parent classes with commas
    pass


print(SubClass1.__bases__)  # View all direct parent classes
print(SubClass2.__bases__)
# ===============
# (<class '__main__.ParentClass1'>,)
# (<class '__main__.ParentClass1'>, <class '__main__.ParentClass2'>)
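With multiple inheritance, it helps to know which parent wins when both define the same method. The sketch below adds a hypothetical who() method to both parents (it is not part of the original example) and inspects the method resolution order through the built-in __mro__ attribute.

```python
class ParentClass1:
    def who(self):  # hypothetical method, added for illustration
        return 'ParentClass1'


class ParentClass2:
    def who(self):  # same name as in ParentClass1
        return 'ParentClass2'


class SubClass2(ParentClass1, ParentClass2):
    pass


# Python searches the classes in this order when looking up an attribute
mro_names = [cls.__name__ for cls in SubClass2.__mro__]
print(mro_names)
print(SubClass2().who())  # the parent listed first is found first
```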

Inheritance rules

1. The subclass inherits the member variables and methods of the parent class

2. If the subclass defines its own __init__, the parent's __init__ is not run automatically; it has to be called explicitly

3. Subclasses cannot delete parent class members, but they can redefine them

4. Subclasses can add their own members.
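A short sketch of these rules, using made-up Person, Student and Teacher classes. Note that in Python a subclass without its own __init__ simply reuses the parent's; only when it defines one does the parent's constructor have to be called explicitly.

```python
class Person:
    def __init__(self, name):
        self.name = name

    def talk(self):
        return 'hello'


class Student(Person):
    # No __init__ here, so Person.__init__ is reused as-is
    def study(self):  # a new member added by the subclass
        return f'{self.name} is studying'


class Teacher(Person):
    def __init__(self, name, subject):
        # Defining __init__ hides the parent's, so call it explicitly
        super().__init__(name)
        self.subject = subject

    def talk(self):  # a parent member redefined by the subclass
        return 'open your books'


s = Student('Li')
t = Teacher('Wang', 'math')
print(s.talk(), '|', s.study())
print(t.talk(), '|', t.name, t.subject)
```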

The specific code is as follows:

class Person(object):
    def __init__(self, name, age, sex):
        # name and age are hard-coded on purpose: the arguments are ignored
        self.name = 'jasn'
        self.age = 18
        self.sex = sex

    def talk(self):
        print('I want to say something to you')


class Chinese(Person):
    def __init__(self, name, age, sex, language):
        # Run the parent constructor first; it sets name, age and sex
        Person.__init__(self, name, age, sex)
        # Overwrite the age set by the parent with the value passed in
        self.age = age
        self.language = 'Chinese'

    def talk(self):
        print('I speak Mandarin')
        Person.talk(self)


obj = Chinese('nancy', 30, 'male', 'mandarin')
print(obj.name)
print(obj.age)
print(obj.language)
obj.talk()

The output is as follows:

jasn
30
Chinese
I speak Mandarin
I want to say something to you

The Chinese constructor first calls the Person constructor, which hard-codes name and age and ignores the values passed in. The subclass then reassigns age, so obj.age is 30. It never reassigns name, so even though 'nancy' was passed in, obj.name is still the value set by the parent class.

The role of inheritance

1. Realize code (function) reuse and reduce code redundancy

2. Enhance software scalability

3. Improve software maintainability

Inheritance and abstraction

Two important object-oriented concepts: abstraction and classification.

class animal():   #Define parent class
    country = 'china'     #This is called a class variable
    def __init__(self,name,age):
        self.name = name   #These are also called data attributes
        self.age = age

    def walk(self):         #Class functions, methods, dynamic properties
        print('%s is walking'%self.name)

    def say(self):
        pass

class people(animal): #The subclass inherits the parent class
    pass
    
class pig(animal): #The subclass inherits the parent class
    pass

class dog(animal): #The subclass inherits the parent class
    pass

aobama=people('aobama',60)   #Instantiate an object
print(aobama.name)
aobama.walk()

The code above can be read as follows: people, dogs and pigs are abstracted into animals, so the people, dog and pig classes each inherit from the animal class.

What super() does in Python and how it works

super() is used very often with inheritance. It solves some problems that arise when a subclass calls a method of its parent class. Let's see what it improves.

class Foo(object):
    def bar(self, message):
        print(message)


obj1 = Foo()
obj1.bar('hello')

When there is an inheritance relationship, a subclass sometimes needs to call a method of the parent class. The simplest way is to call the method on the class instead of on the object. Note that self must then be passed explicitly.

The specific code is as follows:

class FooParent(object):
    """docstring for FooParent"""
    def bar(self, message):
        print(message)


class FooChild(FooParent):
    """docstring for FooChild"""
    def bar(self, message):
        FooParent.bar(self, message)


foochild = FooChild()
foochild.bar('hello')

This approach has a flaw. For example, if the parent class is renamed, every such call in the subclass has to be changed too.

Python therefore provides super(). The specific code is as follows:

class FooParent(object):
    def bar(self, message):
        print(message)


class FooChild(FooParent):
    def bar(self, message):
        super(FooChild, self).bar(message)


obj = FooChild()
obj.bar('hello')
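In Python 3 the arguments to super() can be left out inside a method. The sketch below shows that shorter form; bar returns a string here (instead of printing) only so the result is easy to inspect.

```python
class FooParent:
    def bar(self, message):
        return f'parent: {message}'


class FooChild(FooParent):
    def bar(self, message):
        # Python 3 shorthand for super(FooChild, self)
        return 'child -> ' + super().bar(message)


obj = FooChild()
result = obj.bar('hello')
print(result)
```

Because the parent class is never named inside FooChild.bar, renaming FooParent only requires changing the class statement.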

Polymorphism

Polymorphism is not used in the hands-on part, so I won't cover it here. You can look it up yourself if you are interested.

What is the producer-consumer model

Suppose two processes, A and B, share a fixed-size buffer. Process A produces data and puts it into the buffer; process B takes data out of the buffer and processes it. Process A is the producer and process B is the consumer.

Why use the producer-consumer model

In the process world, the producer is the process that produces data, and the consumer is the process that uses (processes) it. If the consumer works faster than the producer, the consumer must wait for the producer. Likewise, if the producer works faster than the consumer, the producer must wait for the consumer.

The model decouples producers from consumers and balances the speed of production and consumption, because the two sides do not communicate directly but through a queue.

Producer-consumer model

The producer-consumer model uses a container to remove the tight coupling between producer and consumer.

Producers and consumers do not communicate directly; they communicate through a blocking queue. After producing data, a producer does not wait for a consumer to process it but simply puts it on the queue. Consumers do not ask producers for data but take it from the queue. The blocking queue acts as a buffer that balances the throughput of producers and consumers.

Multiprocess queue implementation

The specific code is as follows:

from multiprocessing import Process, Queue
import random
import time


# Producer
def producer(name, food, q):
    for i in range(4):
        time.sleep(random.randint(1, 3))  # Simulate the time needed to produce data
        f = '%s produced %s no. %s' % (name, food, i + 1)
        print(f)
        q.put(f)


# Consumer
def consumer(name, q):
    while True:
        food = q.get()
        if food is None:  # None is the signal that production has finished
            print('%s got the stop signal' % name)
            break
        f = '%s consumed %s' % (name, food)
        print(f)
        time.sleep(random.randint(1, 3))


if __name__ == '__main__':
    q = Queue()  #Create queue
    #Simulate the producer and generate data
    p1 = Process(target=producer, args=('p1', 'steamed stuffed bun', q))
    p1.start()
    p2 = Process(target=producer, args=('p2', 'Clay oven rolls', q))
    p2.start()

    c1 = Process(target=consumer, args=('c1', q))
    c1.start()
    c2 = Process(target=consumer, args=('c2', q))
    c2.start()

    p1.join()
    p2.join()
    q.put(None)
    q.put(None)

Thread queue implementation

The code above uses multiple processes. Next, let's implement the same thing with multiple threads.

The specific code is as follows:

import random
import time
from threading import Thread
import queue


def producer(name, count, q):
    for i in range(count):
        food = f'{name} produced steamed bun no. {i}'
        print(food)
        q.put(food)


def consumer(name, q):
    while True:
        time.sleep(random.randint(1, 3))
        try:
            # get_nowait() avoids blocking forever if another consumer
            # takes the last item right after an empty() check
            food = q.get_nowait()
        except queue.Empty:
            break
        print(f'{name} consumed {food}')


if __name__ == '__main__':
    q = queue.Queue()
    print(q.empty())
    for i in range(1, 4):
        p = Thread(target=producer, args=(f'producer{i}', 10, q))
        p.start()

    for i in range(1, 6):
        c = Thread(target=consumer, args=(f'consumer{i}', q))
        c.start()

Characteristics of producer consumer model

  • Producers do not add data to the buffer while it is full, and consumers do not take data from it while it is empty.

  • When the buffer is full, the producer goes to sleep. It is woken up and starts adding data again once a consumer has taken data out of the buffer. When the buffer is empty, the consumer goes to sleep and is not woken up until a producer adds data to the buffer.
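This blocking behaviour can be sketched with the standard library's queue.Queue and a small maxsize. The SENTINEL marker, the buffer size of 2 and the doubling step are all assumptions chosen for the illustration.

```python
import queue
import threading

SENTINEL = None  # hypothetical end-of-work marker


def producer(q, items):
    for item in items:
        q.put(item)  # blocks while the queue is full
    q.put(SENTINEL)  # tell the consumer that nothing more is coming


def consumer(q, results):
    while True:
        item = q.get()  # blocks while the queue is empty
        if item is SENTINEL:
            break
        results.append(item * 2)  # stand-in for real processing


q = queue.Queue(maxsize=2)  # a tiny buffer forces the producer to wait
results = []
p = threading.Thread(target=producer, args=(q, range(5)))
c = threading.Thread(target=consumer, args=(q, results))
p.start()
c.start()
p.join()
c.join()
print(results)
```

With one consumer, a single sentinel is enough; with several consumers, one sentinel per consumer is needed, as in the multiprocess example.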

Summary of the basics

That covers the background needed for the hands-on part. Think of it as a starting point meant to prompt better ideas from you. It falls into two parts: object-oriented programming, and threads and queues. Get as familiar with both as you can so that you can write a more efficient and robust crawler demo.

Hands-on section

Libraries used

First, here are the libraries this crawler needs:

import requests
from lxml import etree
import threading
from queue import Queue
import re

Install whichever ones are missing. requests and lxml are third-party packages; the others ship with Python.

Crawl target

The site to crawl this time is doutub. Its address is:

https://www.doutub.com/

The content we want is the doutu memes on this site.

This will instantly make you a meme-battle master. Enough talk, let's do it.

Web page analysis

Take a closer look: there are 26 pages of memes. Quite a haul!

First, let's see how the URL changes from page to page.

#First page
https://www.doutub.com/img_lists/new/1

#Page 2
https://www.doutub.com/img_lists/new/2

#Page 3
https://www.doutub.com/img_lists/new/3

The pattern is easy to spot: only the page number at the end of the URL changes.

That settles the page URLs. The next step is to find the URL of each meme image.


You can probably spot them quickly. These links can be extracted with XPath.

Producer implementation

First, we create two queues: one stores the URL of each page and the other stores the image links.

The specific code is as follows:

    # Create the queues
    page_queue = Queue()    # Page URLs
    img_queue = Queue()     # Image URLs
    for page in range(1, 27):
        url = f'https://www.doutub.com/img_lists/new/{page}'
        page_queue.put(url)

The code above puts the URL of each page into page_queue.

Next, create a class that puts the image URLs into img_queue.

The specific code is as follows:

from queue import Empty  # raised by get_nowait() / get(timeout=...) on an empty queue


class ImageParse(threading.Thread):
    def __init__(self, page_queue, img_queue):
        super(ImageParse, self).__init__()
        self.page_queue = page_queue
        self.img_queue = img_queue
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36'
        }

    def run(self):
        while True:
            try:
                # get_nowait() avoids the race where another thread takes the
                # last URL between an empty() check and a blocking get()
                url = self.page_queue.get_nowait()
            except Empty:
                break
            self.parse_img(url)

    def parse_img(self, url):
        response = requests.get(url, headers=self.headers).content.decode('utf-8')
        html = etree.HTML(response)
        img_lists = html.xpath('//div[@class="expression-list clearfix"]')
        for img_list in img_lists:
            img_urls = img_list.xpath('./div/a/img/@src')
            img_names = img_list.xpath('./div/a/span/text()')
            for img_url, img_name in zip(img_urls, img_names):
                self.img_queue.put((img_url, img_name))

Consumer implementation

The consumer is simple. It keeps taking image URLs from img_queue and downloading them, and it exits once both queues are empty.

class DownLoad(threading.Thread):
    def __init__(self, page_queue, img_queue):
        super(DownLoad, self).__init__()
        self.page_queue = page_queue
        self.img_queue = img_queue
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36'
        }

    def run(self):
        while True:
            try:
                # Wait briefly for an image in case the parsers are still
                # working; the 3-second timeout is an arbitrary choice
                img_url, filename = self.img_queue.get(timeout=3)
            except Empty:
                if self.page_queue.empty():
                    break  # no pages left and no images arrived: done
                continue
            fix = img_url.split('.')[-1]  # file extension
            # Remove characters that are awkward or invalid in file names
            name = re.sub(r'[?., !*\\/|]', '', filename)
            data = requests.get(img_url, headers=self.headers).content
            print('Downloading ' + filename)
            # The ../image/ directory must already exist
            with open('../image/' + name + '.' + fix, 'wb') as f:
                f.write(data)

Finally, start the producer and consumer threads and wait for them to finish:

    threads = []
    for x in range(5):
        t1 = ImageParse(page_queue, img_queue)
        t1.start()
        t2 = DownLoad(page_queue, img_queue)
        t2.start()
        threads.extend([t1, t2])
    # Join only after every thread has started, so they really run concurrently
    for t in threads:
        t.join()
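The file-naming step of the downloader can be tried on its own, without any network access. build_filename is a hypothetical helper written for this sketch; its character class is equivalent to the one used in the download step, and the example URL and titles are made up.

```python
import re


def build_filename(img_url, title):
    """Hypothetical helper mirroring the naming logic of the download step."""
    ext = img_url.split('.')[-1]                # text after the last dot
    safe = re.sub(r'[?., !*\\/|]', '', title)   # drop punctuation, spaces, slashes
    return f'{safe}.{ext}'


first = build_filename('https://example.com/a/b/funny.gif', 'What? No way!')
second = build_filename('https://example.com/pic.jpg', 'a/b\\c|d')
print(first)
print(second)
```

Note that two different titles can collapse to the same name after cleaning, in which case the later download overwrites the earlier file.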

Final results

A total of 1269 pictures were captured.

Who can beat you in a meme battle now? And that's all there is to this crawler!

Topics: Python Back-end