Introduction to Python Crawler [5]: 27270 Picture Crawling

Posted by Jove on Thu, 25 Jul 2019 12:09:31 +0200

Today we continue with another site, http://www.27270.com/ent/meinvtupian/. This site has some anti-crawling measures, so parts of the code are not handled perfectly; focus on learning the ideas, and if you have any suggestions, let me know in the comments.

To prepare for future network request work, this time we encapsulate some of the request code into a simple class.

First, install a module called retrying.

pip install retrying

For the details of how this module works, the documentation is easy to find online; a minimal sketch of the basic usage is shown below.
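
The following is a minimal sketch of the retrying decorator in action, using a made-up flaky_task function purely for illustration: the decorated function is retried up to three times before the exception is allowed to propagate.

import random
from retrying import retry

@retry(stop_max_attempt_number=3)
def flaky_task():
    # flaky_task is a made-up example standing in for a network call that sometimes fails
    if random.random() < 0.7:
        raise ValueError("simulated transient failure")
    return "ok"

print(flaky_task())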

Here I generate a random USER_AGENT for each request.


import requests
from retrying import retry
import random
import datetime

class R:

    def __init__(self,method="get",params=None,headers=None,cookies=None):
        # do something
        pass

    def get_headers(self):
        user_agent_list = [ \
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1" \
            "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11", \
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6", \
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6", \
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1", \
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5", \
            "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5", \
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3", \
            "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3", \
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3", \
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3", \
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3", \
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3", \
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3", \
            "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3", \
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3", \
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24", \
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
        ]
        UserAgent = random.choice(user_agent_list)
        headers = {'User-Agent': UserAgent}
        return headers
    #other code

The easiest way to use retrying is to add the @retry decorator to the method you want to retry.

Here, I want the network request method to try three times before finally raising an error.

At the same time we add some necessary parameters to the R class's initialization method; you can look directly at the code below.

The __retrying_requests method is a private method that branches on whether the request mode is get or post.


import requests
from retrying import retry
import random
import datetime

class R:

    def __init__(self,method="get",params=None,headers=None,cookies=None):
        # do something
        pass

    def get_headers(self):
        # do something
        pass

    @retry(stop_max_attempt_number=3)
    def __retrying_requests(self,url):
        if self.__method == "get":
            response = requests.get(url,headers=self.__headers,cookies=self.__cookies,timeout=3)
        else:
            response = requests.post(url,params=self.__params,headers=self.__headers,cookies=self.__cookies,timeout=3)
        return response.content

    # other code

The network request method is now declared; it returns the response.content byte stream.
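
A quick reminder about requests: response.content is the raw byte stream, while response.text is a string decoded with whatever encoding requests infers. We return the bytes and decode them ourselves with the correct charset; a minimal illustrative sketch:

import requests

r = requests.get("http://www.27270.com/ent/meinvtupian/")   # illustrative request
raw_bytes = r.content                                        # bytes, suitable for images or manual decoding
text = raw_bytes.decode("gb2312", errors="ignore")           # decode with the page's own encoding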

Based on this private method, we add a method for fetching page text and a method for fetching binary files, and we flesh out the class initialization method at the same time. During development I found that the pages we want to crawl are encoded in gb2312, so some methods also need a charset parameter.

import requests
from retrying import retry
import random
import datetime

class R:
    # Class initialization method
    def __init__(self,method="get",params=None,headers=None,cookies=None):
        self.__method = method
        myheaders = self.get_headers()
        if headers is not None:
            myheaders.update(headers)
        self.__headers = myheaders
        self.__cookies = cookies
        self.__params = params

    def get_headers(self):
        # do something
        pass

    @retry(stop_max_attempt_number=3)
    def __retrying_requests(self,url):
        # do something
        pass

    # get request
    def get_content(self,url,charset="utf-8"):
        try:
            html_str = self.__retrying_requests(url).decode(charset)
        except:
            html_str = None
        return html_str

    def get_file(self,file_url):
        try:
            file = self.__retrying_requests(file_url)
        except:
            file = None
        return file

At this point the R class is complete. You can piece the full code together from the snippets above, or skip to the end of the article and consult it directly on GitHub.
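
As a quick sanity check, here is a minimal usage sketch of the R class; the target URL and charset are the ones used later in this article, and the Referer header is illustrative.

res = R(headers={"Referer": "http://www.27270.com/ent/meinvtupian/"})
html = res.get_content("http://www.27270.com/ent/meinvtupian/list_11_1.html", charset="gb2312")
if html is not None:
    print(html[:200])   # print the first 200 characters of the page source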

Next comes the more important part, the crawler code itself. This time we simply use classes and objects and add some basic multithreading.

First we create a class called ImageList. Its first job is to get the total number of pages to crawl.

This step is relatively simple.

  1. Getting Web Source Code
  2. Regular matching of last page elements
  3. Extraction of Numbers
import http_help as hh   # This http_help is the R class I wrote about above.
import re
import threading
import time
import os
import requests

# Get a list of all URLs to crawl
class ImageList():
    def __init__(self):
        self.__start = "http://www.27270.com/ent/meinvtupian/list_11_{}.html"   # URL template
        # request headers
        self.__headers = {"Referer":"http://www.27270.com/ent/meinvtupian/",
                          "Host":"www.27270.com"
                          }
        self.__res = hh.R(headers=self.__headers)  # Initialize access request
    def run(self):
        page_count =  int(self.get_page_count())

        if page_count==0:
            return
        urls = [self.__start.format(i) for i in range(1,page_count+1)]   # include the last page
        return urls

    # Regular expressions match the last page and analyze the page number
    def get_page_count(self):
        # Note: the page encoding must be passed in here
        content = self.__res.get_content(self.__start.format("1"),"gb2312")
        pattern = re.compile("<li><a href='list_11_(\d+?).html' target='_self'>Last</a></li>")
        search_text = pattern.search(content)
        if search_text is not None:
            count = search_text.group(1)
            return count
        else:
            return 0
if __name__ == '__main__':
    img = ImageList()
    urls = img.run()

Notice the get_page_count method in the code above; it extracts the number of the last page.

Inside our run method, a list comprehension

urls = [self.__start.format(i) for i in range(1,page_count+1)]

generates all of the links to be crawled in one batch.
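
For instance, with a page count of 3 (an illustrative value; the real count comes from get_page_count), the comprehension produces:

start = "http://www.27270.com/ent/meinvtupian/list_11_{}.html"
page_count = 3   # illustrative value
urls = [start.format(i) for i in range(1, page_count + 1)]
print(urls)
# ['http://www.27270.com/ent/meinvtupian/list_11_1.html',
#  'http://www.27270.com/ent/meinvtupian/list_11_2.html',
#  'http://www.27270.com/ent/meinvtupian/list_11_3.html']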

27270 Pictures - parse the list of URLs crawled above and grab the detail pages

We use the producer-consumer model: one thread grabs the image page links, the other downloads the pictures. Since we operate with multiple threads, we first need to import:

import threading
import time

The complete code is as follows

import http_help as hh
import re
import threading
import time
import os
import requests

urls_lock = threading.Lock()  # lock for the url list
imgs_lock = threading.Lock()  # lock for the image list

imgs_start_urls = []

class Product(threading.Thread):
    # Class initialization method
    def __init__(self,urls):
        threading.Thread.__init__(self)
        self.__urls = urls
        self.__headers = {"Referer":"http://www.27270.com/ent/meinvtupian/",
                          "Host":"www.27270.com"
                          }

        self.__res = hh.R(headers=self.__headers)

    # Put the url back into the urls list after a failed fetch
    def add_fail_url(self,url):
        print("{} failed to fetch, putting it back in the queue".format(url))
        global urls_lock
        if urls_lock.acquire():
            self.__urls.insert(0, url)
            urls_lock.release()  # Unlock

    # Main thread method
    def run(self):
        print("*"*100)
        while True:
            global urls_lock,imgs_start_urls
            if len(self.__urls)>0:
                if urls_lock.acquire():   # locking
                    last_url = self.__urls.pop()   # Get the last url in the urls and delete it
                    urls_lock.release()  # Unlock

                print("Operating{}".format(last_url))

                content = self.__res.get_content(last_url,"gb2312")   # Note: the page is encoded in gb2312; other encodings will produce errors
                if content is not None:
                    html = self.get_page_list(content)

                    if len(html) == 0:
                        self.add_fail_url(last_url)
                    else:
                        if imgs_lock.acquire():
                            imgs_start_urls.extend(html)    # put the image page links we just crawled into the download list
                            imgs_lock.release()

                    time.sleep(5)
                else:
                    self.add_fail_url(last_url)

            else:
                print("All links have been run.")
                break

    def get_page_list(self,content):
        # regular expression
        pattern = re.compile('<li> <a href="(.*?)" title="(.*?)" class="MMPic" target="_blank">.*?</li>')
        list_page = re.findall(pattern, content)

        return list_page

The most important point in the code above is the use of threading.Lock() to protect the global variables shared between threads: acquire the lock before touching them and release it promptly. The rest is explained in the comments; as long as you follow the steps, write the code out yourself and add your own small refinements, you will be able to work it out.
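
As a side note, a Python Lock can also be used as a context manager, which is a slightly safer way to write the acquire/release pairs above because the lock is released even if an exception occurs; a minimal sketch using the same names as in this article:

import threading

urls_lock = threading.Lock()
urls = ["http://www.27270.com/ent/meinvtupian/list_11_1.html"]

# equivalent to acquire()/release(), but the lock is always released
with urls_lock:
    if urls:
        last_url = urls.pop()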

So far we have grabbed all the image page addresses (together with their titles, since the regex captures both) and stored them in a global list called imgs_start_urls. Let's keep going.

This list contains addresses like http://www.27270.com/ent/meinvtupian/2018/298392.html. When you open such a page, you will find only one picture, with pagination below it.

After clicking through the pagination, the pattern becomes clear:

http://www.27270.com/ent/meinvtupian/2018/298392.html 
http://www.27270.com/ent/meinvtupian/2018/298392_2.html 
http://www.27270.com/ent/meinvtupian/2018/298392_3.html 
http://www.27270.com/ent/meinvtupian/2018/298392_4.html 
....

After a few tries you will find that the subsequent links can be built by simple string splicing; when a page number does not exist, the image regex below will simply fail to match, which is how we know to stop.
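
In other words, starting from the base address we can generate each paginated URL by splicing; a minimal sketch of the rule (the page range is illustrative, the real loop below runs until the regex stops matching):

detail_url = "http://www.27270.com/ent/meinvtupian/2018/298392.html"
base_url = detail_url[0:detail_url.rindex(".")]   # strip the ".html" suffix

for index in range(2, 5):   # illustrative range
    print("{}_{}.html".format(base_url, index))
# http://www.27270.com/ent/meinvtupian/2018/298392_2.html
# http://www.27270.com/ent/meinvtupian/2018/298392_3.html
# http://www.27270.com/ent/meinvtupian/2018/298392_4.html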

Once you have worked that out, you should know what to do next.

I will paste all the code below, with comments at the most important places.

class Consumer(threading.Thread):
    # Initialization
    def __init__(self):
        threading.Thread.__init__(self)
        self.__headers = {"Referer": "http://www.27270.com/ent/meinvtupian/",
                          "Host": "www.27270.com"}
        self.__res = hh.R(headers=self.__headers)

    # Picture downloading method
    def download_img(self,filder,img_down_url,filename):
        file_path = "./downs/{}".format(filder)

        # Create the directory if it does not exist yet
        if not os.path.exists(file_path):
            os.makedirs(file_path)  # makedirs also creates the parent ./downs directory if necessary

        if os.path.exists("./downs/{}/{}".format(filder,filename)):
            return
        else:
            try:
                # The Host header here is a pitfall: the pictures are stored on another server to prevent hotlinking.
                img = requests.get(img_down_url,headers={"Host":"t2.hddhhn.com"},timeout=3)
            except Exception as e:
                print(e)
                return   # without a response there is nothing to write

            print("Writing {}".format(img_down_url))
            try:
                # Write the image bytes to disk
                with open("./downs/{}/{}".format(filder,filename),"wb+") as f:
                    f.write(img.content)
            except Exception as e:
                print(e)
                return

    def run(self):

        while True:
            global imgs_start_urls,imgs_lock

            if len(imgs_start_urls)>0:
                if imgs_lock.acquire():  # locking
                    img_url = imgs_start_urls[0]   # take the first link
                    del imgs_start_urls[0]         # and remove it from the list
                    imgs_lock.release()  # Unlock
            else:
                continue

            # http://www.27270.com/ent/meinvtupian/2018/295631_1.html

            #print("Pictures begin to download")
            img_url = img_url[0]   # each item is a (url, title) tuple from the regex; take the url
            start_index = 1
            base_url = img_url[0:img_url.rindex(".")]    # Strings can be sliced as lists

            while True:

                img_url ="{}_{}.html".format(base_url,start_index)   # url splicing
                content = self.__res.get_content(img_url,charset="gbk")   # This place gets content, using gbk coding
                if content is not None:
                    pattern = re.compile('<div class="articleV4Body" id="picBody">[\s\S.]*?img alt="(.*?)".*? src="(.*?)" />')
                    # Matching pictures, if not matched, means that the operation has been completed.
                    img_down_url = pattern.search(content)  # Get the picture address

                    if img_down_url is not None:
                        filder = img_down_url.group(1)
                        img_down_url = img_down_url.group(2)
                        filename = img_down_url[img_down_url.rindex("/")+1:]
                        self.download_img(filder,img_down_url,filename)  #Download pictures

                    else:
                        print("-"*100)
                        print(content)
                        break   # no more pages matched, exit the loop

                else:
                    print("{}Link Loading Failed".format(img_url))

                    if imgs_lock.acquire():  # locking
                        imgs_start_urls.append(img_url)
                        imgs_lock.release()  # Unlock

                start_index+=1   # move on to the next page index, as described above

That's all the code. I have tried to add comments at the key points; read through them carefully, and if something is unclear, type the code out a few more times. There is nothing especially complex here; most of it is plain logic.

Finally, here is the main block that ties everything together and starts the crawl.


if __name__ == '__main__':

    img = ImageList()
    urls = img.run()
    for i in range(1,2):
        p = Product(urls)
        p.start()

    for i in range(1,2):
        c = Consumer()
        c.start()
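
As written, range(1,2) starts exactly one producer and one consumer; if you want more workers, just widen the ranges. A minimal, illustrative tweak (note that Consumer.run loops forever waiting for new links, so the program keeps running until you stop it):

    for i in range(1, 3):      # two producer threads, illustrative count
        p = Product(urls)
        p.start()

    for i in range(1, 5):      # four consumer threads, illustrative count
        c = Consumer()
        c.start()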

Run it and wait a while; the pictures will come down gradually. Take your time.

Topics: Programming Windows Python network Linux