This is probably the most long-winded introductory Python crawler tutorial (2/100)

Posted by muthuraj_mr on Tue, 14 May 2019 19:03:06 +0200

Preface

Starting today, I'm going to roll up my sleeves and write Python crawlers directly. The best way to learn a language is to use it for a purpose, so I plan to do this across 10+ blog posts. I hope you stick with it.

To write the crawler well, we need a Firefox browser and a packet-capture tool. For packet capture I use the tcpdump that ships with CentOS, plus Wireshark. I suggest you learn how to install and use both of them; we will need them later.

Network request module requests

Python's large number of open-source modules makes coding very simple, and the first module our crawler needs is requests.

Install requests

Open a terminal and run:

pip3 install requests

Wait for the installation to complete.

Next, type the following commands in the terminal:

# mkdir demo  
# cd demo
# touch down.py

The Linux commands above create a folder named demo and then a down.py file inside it. You can also create the files from the GUI by right-clicking, just as you would on Windows.

To make development on Linux more efficient, we will install the Visual Studio Code editor.

For how to install VS Code, the official page https://code.visualstudio.com/docs/setup/linux provides detailed instructions.

For CentOS, run the following:

sudo rpm --import https://packages.microsoft.com/keys/microsoft.asc
sudo sh -c 'echo -e "[code]\nname=Visual Studio Code\nbaseurl=https://packages.microsoft.com/yumrepos/vscode\nenabled=1\ngpgcheck=1\ngpgkey=https://packages.microsoft.com/keys/microsoft.asc" > /etc/yum.repos.d/vscode.repo'

Then install it with yum:

yum check-update
sudo yum install code

After the installation succeeds, Visual Studio Code will appear in your CentOS application list.

Next, a word about what we just did. I have a GNOME graphical interface here, so for the operations that follow I will describe them in the same way you would do them on Windows.

Open VS Code > File > Open File > find the down.py file we just created.

Then enter the following in VS Code:

import requests   #Import Module

def run():        #Declare a run method
    print("Running code file")    #print contents

if __name__ == "__main__":   #Main Program Entry
    run()    #Call the run method above

Tip: this tutorial is not an introductory Python 3 course, so I assume some coding basics by default, for example that Python statements do not end with semicolons and that indentation must be kept consistent. I will try to keep the comments as complete as possible.
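
For example (purely an illustration of that tip, not part of the crawler), indentation is what defines a block, and statements need no semicolons:

def greet():
    print("hello")   # indented, so it belongs to greet()

greet()
print("world")       # back at the left margin, so it runs at module level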

Press Ctrl+S to save the file. If you are told you have insufficient permissions, enter your password when prompted.

Go to the demo directory in the terminal and run:

python3 down.py

If you see the following output, the script runs fine:

[root@bogon demo]# python3 down.py
 Running code file

Next, let's test whether the requests module works.

Modify the code to:

import requests

def run():
    response = requests.get("http://www.baidu.com")
    print(response.text)

if __name__ == "__main__":
    run()

Run it; if the HTML of the Baidu homepage is printed out, it ran successfully.

Next, let's actually try downloading a picture.

Before modifying the code, let's fix one more thing.

Since you are prompted for administrator privileges every time you modify a file, change the permissions with a Linux command:

[root@bogon linuxboy]# chmod -R 777 demo/

import requests

def run():
    response = requests.get("http://www.newsimg.cn/big201710leaderreports/xibdj20171030.jpg")
    with open("xijinping.jpg","wb") as f :
        f.write(response.content)   # the with block closes the file automatically

if __name__ == "__main__":
    run()

After running the code, a file appears in the folder.

But when I opened the file, I found it could not be displayed, which means it was not downloaded correctly.

We need to modify the code further: the server places some restrictions on the image, so the browser can open it but our Python code cannot download it properly.

Modify Code

import requests

def run():
    # request headers; headers is a dictionary
    headers = {
        "Host":"www.newsimg.cn",
        "User-Agent":"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.26 Safari/537.36 Core/1.63.5383.400 QQBrowser/10.0.1313.400"
    }
    response = requests.get("http://www.newsimg.cn/big201710leaderreports/xibdj20171030.jpg",headers=headers)
    with open("xijinping.jpg","wb") as f :
        f.write(response.content)   # the with block closes the file automatically

if __name__ == "__main__":
    run()

Okay, run the Python file from the terminal again:

python3 down.py

Picture downloaded

Focus on the requests.get call in the code above: we added a headers argument, and with it our program downloads the complete picture.
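
If you want to check for yourself whether the headers made the difference, a quick way, shown here only as a sketch and not part of the final script, is to inspect the status code and content type before writing the file:

import requests

url = "http://www.newsimg.cn/big201710leaderreports/xibdj20171030.jpg"
headers = {
    "Host": "www.newsimg.cn",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.26 Safari/537.36"
}

response = requests.get(url, headers=headers)
print(response.status_code)                   # 200 means the server returned the resource
print(response.headers.get("Content-Type"))   # should be an image type for a real picture
print(len(response.content))                  # a tiny size usually means you got an error page instead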

Python Crawler Page Analysis

With the simple case above done, the next steps will be much easier. How does a crawl work?

Enter domain name -> download page source -> analyze image paths -> download images

Those are the steps.
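
If it helps, here is a rough sketch of that pipeline in code. The helper names below are placeholders of my own, not functions from the final script; only the flow matters:

import requests

def download_source(url):
    # step 2: download the page source
    return requests.get(url).text

def extract_image_urls(html):
    # step 3: analyze the source and pull out the image paths (regular expressions come later)
    return []

def download_images(img_urls):
    # step 4: fetch each image and write it to disk
    for img in img_urls:
        pass

if __name__ == "__main__":
    html = download_source("http://www.meizitu.com/a/pure_1.html")  # step 1: the entry URL
    download_images(extract_image_urls(html))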

Enter the domain name

The website we're going to crawl today is http://www.meizitu.com/a/pure.html

Why crawl this site? Because it is easy to crawl.

Okay, let's go ahead and analyze this page

One important thing for a crawler is finding where the pagination is, because pagination follows a pattern, and a pattern is what lets us crawl. (You can be smarter about it: feed in just the home page and let the crawler discover every address on the site by itself.)

At the bottom of the page we find the pagination, so let's work out the pattern.

Use the Firefox developer tools to discover the paging pattern:

http://www.meizitu.com/a/pure_1.html
http://www.meizitu.com/a/pure_2.html
http://www.meizitu.com/a/pure_3.html
http://www.meizitu.com/a/pure_4.html

Okay, now let's implement this part in Python. (There is some object-oriented code here; if you lack the basics, please look up a short tutorial first. For what we need, it is all very simple.)

import requests
all_urls = []  # the list-page URLs we build up
class Spider():
    # constructor, used to initialize data
    def __init__(self,target_url,headers):
        self.target_url = target_url
        self.headers = headers

    # build all the URLs we want to grab
    def getUrls(self,start_page,page_num):

        global all_urls
        # loop to build the URLs
        for i in range(start_page,page_num+1):
            url = self.target_url % i
            all_urls.append(url)

if __name__ == "__main__":
    headers = {
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
            'HOST':'www.meizitu.com'
    }
    target_url = 'http://www.meizitu.com/a/pure_%d.html' # pattern for the picture-set list pages

    spider = Spider(target_url,headers)
    spider.getUrls(1,16)
    print(all_urls)

The code above may require some Python basics to understand, but if you look closely, there are a few key points

The first is class Spider(): we declare a class, then use def __init__ to declare the constructor. I am sure you can learn this in 30 minutes from any tutorial.

There are many ways to build the URLs; here I use the most direct one, %-style string formatting.
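
For reference, here are a few equivalent ways to build the same URL; the tutorial sticks with %-formatting, the rest are just alternatives you may run into:

page = 3

url_a = 'http://www.meizitu.com/a/pure_%d.html' % page            # %-formatting, as in the class above
url_b = 'http://www.meizitu.com/a/pure_{}.html'.format(page)      # str.format
url_c = f'http://www.meizitu.com/a/pure_{page}.html'              # f-string (Python 3.6+)
url_d = 'http://www.meizitu.com/a/pure_' + str(page) + '.html'    # plain concatenation

assert url_a == url_b == url_c == url_d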

Notice the global variable all_urls in the code above; I use it to store all of our paginated URLs.

Next comes the core part of the crawler code.

We need to analyze the logic of the page. First open http://www.meizitu.com/a/pure_1.html, then right-click and choose Inspect Element.

In the page source you will find the links to the individual picture sets.

Clicking one of the pictures takes me to a detail page, which actually contains a whole set of pictures. Now the problem is:

as a first step, we need to crawl, from list pages like http://www.meizitu.com/a/pure_1.html, all of the detail-page addresses like http://www.meizitu.com/a/5585.html.

Here we crawl with multiple threads (there is also a design pattern involved, called the observer pattern).

import threading   #Multithreaded module
import re #regular expression module
import time #Time Module

First, three modules are introduced: threading, regular expressions, and time.

Add new global variables; since this is a multithreaded operation, we also need to introduce a thread lock.

all_img_urls = []       # detail-page URLs of the picture sets

g_lock = threading.Lock()  # initialize a lock

Declare a producer class, which keeps fetching picture detail-page addresses and adding them to the global variable all_img_urls.

# producer: responsible for extracting the detail-page links from each list page
class Producer(threading.Thread):

    def run(self):
        headers = {
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
            'HOST':'www.meizitu.com'
        }
        global all_urls
        while len(all_urls) > 0 :
            g_lock.acquire()  # a lock is required when accessing all_urls
            page_url = all_urls.pop()   # pop removes the last element and returns it

            g_lock.release() # release the lock immediately after use so other threads can use it
            try:
                print("Analyze " + page_url)
                response = requests.get(page_url , headers = headers,timeout=3)
                all_pic_link = re.findall('<a target=\'_blank\' href="(.*?)">',response.text,re.S)
                global all_img_urls
                g_lock.acquire()   # another lock here
                all_img_urls += all_pic_link   # note the list concatenation: += extends the list without calling append
                print(all_img_urls)
                g_lock.release()   # release the lock
                time.sleep(0.5)
            except:
                pass

The code above uses inheritance: I derive a subclass from threading.Thread. If you need a refresher on classes, the tutorial at http://www.runoob.com/python3/python3-class.html is fine.

Thread lock: in the code above, when we call all_urls.pop() we don't want other threads to operate on the list at the same time, otherwise there would be trouble, so we use g_lock.acquire() to lock the resource. After using it, remember to release it immediately with g_lock.release(), otherwise the resource stays occupied and the program cannot proceed.
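
As a side note, a Lock can also be used as a context manager, which releases it automatically even if an exception is raised in between. A minimal sketch, not what the tutorial code uses:

import threading

g_lock = threading.Lock()
all_urls = ['http://www.meizitu.com/a/pure_1.html']

with g_lock:              # equivalent to acquire() here and release() at the end of the block
    if all_urls:
        page_url = all_urls.pop()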

To match URLs in the web page I use a regular expression; later we will use other methods for matching.

The re.findall() method returns all matches of the pattern. If regular expressions are new to you, a 30-minute introductory tutorial will be enough here.
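
If re.findall is new to you, here is a tiny self-contained illustration; the HTML fragment is made up, but the pattern is the same one used above:

import re

html = ('<a target=\'_blank\' href="http://www.meizitu.com/a/5585.html">one</a>'
        '<a target=\'_blank\' href="http://www.meizitu.com/a/5577.html">two</a>')

links = re.findall('<a target=\'_blank\' href="(.*?)">', html, re.S)
print(links)
# ['http://www.meizitu.com/a/5585.html', 'http://www.meizitu.com/a/5577.html']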

Wherever the code might fail, I wrapped it in try/except; of course, you can also handle specific errors yourself.
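
If you want something more informative than a bare except, you can catch the exception class that requests raises. This is only a hedged variation, not what the script above does:

import requests

page_url = "http://www.meizitu.com/a/pure_1.html"
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'}

try:
    response = requests.get(page_url, headers=headers, timeout=3)
    response.raise_for_status()                       # turn HTTP 4xx/5xx responses into exceptions
except requests.exceptions.RequestException as e:
    print("request failed:", e)                       # log the reason instead of silently passing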

If the code above is OK, we can add this at the program entry:

for x in range(2):
    t = Producer()
    t.start()

Run the program. Because our Producer inherits from the threading.Thread class, the method you must implement is run, which you have already seen in the code above. Then off we go ~~

Run result:

With that, the list of picture detail pages is stored.

Next, I want to wait until all the detail-page URLs have been collected before moving on to the next stage of analysis.

Add this code:

#threads= []   
#Open two threads to access
for x in range(2):
    t = Producer()
    t.start()
    #threads.append(t)

# for tt in threads:
#     tt.join()

print("Come to me")

With the key lines commented out, it runs like this:

[linuxboy@bogon demo]$ python3 down.py
 Analyze http://www.meizitu.com/a/pure_2.html
 Analyze http://www.meizitu.com/a/pure_1.html
 Come to me
['http://www.meizitu.com/a/5585.html', 

Now uncomment the tt.join() lines and the related code above:

[linuxboy@bogon demo]$ python3 down.py
 Analyze http://www.meizitu.com/a/pure_2.html
 Analyze http://www.meizitu.com/a/pure_1.html
['http://www.meizitu.com/a/5429.html', ......
Come to me

The essential difference: since we are multithreaded, print("Come to me") does not wait for the other threads to finish before it runs. But when we switch to the code above and add the key call tt.join(), the main thread waits until the child threads have finished before continuing. That is exactly what I said we need: first collect the complete set of detail-page URLs.

What join does is thread synchronization: when the main thread hits a join it blocks, and it only continues after the other child threads have finished executing. You will run into this often later on.
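
A minimal standalone illustration of join, unrelated to the crawler itself:

import threading
import time

def worker():
    time.sleep(1)
    print("worker finished")

t = threading.Thread(target=worker)
t.start()
t.join()                          # the main thread blocks here until worker() returns
print("main thread continues")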

Next comes the consumer/observer, which keeps watching the array of detail-page URLs we have just captured.

Add a global variable to store the obtained picture links

pic_links = []            # list of picture addresses
# consumer
class Consumer(threading.Thread) :
    def run(self):
        headers = {
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
            'HOST':'www.meizitu.com'
        }
        global all_img_urls   # use the global array of detail-page URLs
        print("%s is running " % threading.current_thread)   # note: without parentheses this prints the function object; threading.current_thread() returns the actual thread
        while len(all_img_urls) >0 :
            g_lock.acquire()
            img_url = all_img_urls.pop()
            g_lock.release()
            try:
                response = requests.get(img_url , headers = headers )
                response.encoding='gb2312'   # the pages are encoded in GB2312, so set the encoding explicitly
                title = re.search('<title>(.*?) | Beauty Girl</title>',response.text).group(1)   # the part of the title before the site name
                all_pic_src = re.findall('<img alt=.*?src="(.*?)" /><br />',response.text,re.S)

                pic_dict = {title:all_pic_src}   # a Python dictionary: title -> list of image URLs
                global pic_links
                g_lock.acquire()
                pic_links.append(pic_dict)    # list of dictionaries
                print(title+" Success")
                g_lock.release()

            except:
                pass
            time.sleep(0.5)

See, the code above is very similar to what we just wrote. Later I will put a more concise version of this part on GitHub, but this is only the second lesson; we still have a long way to go.

I have commented the more important parts of the code; you can refer to them directly. It is important to note that I used two regular expressions above, one to match the title and one to match the image URLs. The title is used later to create the per-set folders, so pay attention to it.
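
One subtle point: in the title pattern above the | is not escaped, so the regex engine treats it as alternation; it still extracts the title in practice because the site's Chinese titles typically contain no spaces before the " | " separator. A safer variant escapes it. A tiny illustration with a hypothetical title string:

import re

html = "<title>Some Album Name | Beauty Girl</title>"    # a made-up title, just for the example

title = re.search(r'<title>(.*?) \| ', html).group(1)     # escape the | so it is a literal character
print(title)   # Some Album Name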

#Open 10 threads to get links
for x in range(10):
    ta = Consumer()
    ta.start()

Run result:

[linuxboy@bogon demo]$ python3 down.py
 Analyze http://www.meizitu.com/a/pure_2.html
 Analyze http://www.meizitu.com/a/pure_1.html
['http://www.meizitu.com/a/5585.html', ......
<function current_thread at 0x7f7caef851e0> is running 
<function current_thread at 0x7f7caef851e0> is running 
<function current_thread at 0x7f7caef851e0> is running 
<function current_thread at 0x7f7caef851e0> is running 
<function current_thread at 0x7f7caef851e0> is running 
<function current_thread at 0x7f7caef851e0> is running 
<function current_thread at 0x7f7caef851e0> is running 
<function current_thread at 0x7f7caef851e0> is running 
Come to me
<function current_thread at 0x7f7caef851e0> is running 
<function current_thread at 0x7f7caef851e0> is running 
Pure and picturesque, photographer's imperial beans succeed
 Ye Zixuan, the God of man and woman, succeeded in filming a group of popular portraits recently
 The Queen of the United States (bao) Chest (ru) brings the temptation of uniform to succeed
 Happy and successful are those who open their eyes to see you every day
 Lovely girl, may the warm wind take care of purity and persistence for success
 Pure sisters like a ray of sunshine warm this winter for success...

Do you feel you're a big step closer to success?

Next, let's deal with storing the pictures, as mentioned above, by writing another custom class.

class DownPic(threading.Thread) :

    def run(self):
        headers = {
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
            'HOST':'mm.chinasareview.com'

        }
        while True:   # an infinite loop, so we keep watching for new entries in the picture-link array
            global pic_links
            # acquire the lock
            g_lock.acquire()
            if len(pic_links) == 0:   # nothing to download yet
                # release the lock in any case
                g_lock.release()
                continue
            else:
                pic = pic_links.pop()
                g_lock.release()
                # walk through the dictionary
                for key,values in  pic.items():
                    path=key.rstrip("\\")
                    is_exists=os.path.exists(path)
                    # check the result
                    if not is_exists:
                        # create the directory if it does not exist
                        os.makedirs(path)

                        print(path+' directory created successfully')

                    else:
                        # the directory already exists, so just say so
                        print(path+' directory already exists')
                    for pic in values :
                        filename = path+"/"+pic.split('/')[-1]
                        if os.path.exists(filename):
                            continue
                        else:
                            response = requests.get(pic,headers=headers)
                            with open(filename,'wb') as f :
                                f.write(response.content)   # the with block closes the file automatically

After we get the picture links, we need to download them. The code above creates a directory named after the title we captured earlier, then writes the image files into that directory.

This relies on one new module for file and directory operations:

import os  #Directory Operation Module

This is because our picture-link array holds dictionaries in the following format:

[{"Sister 1":["http://mm.chinasareview.com/wp-content/uploads/2016a/08/24/01.jpg","http://mm.chinasareview.com/wp-content/uploads/2016a/08/24/02.jpg"."http://mm.chinasareview.com/wp-content/uploads/2016a/08/24/03.jpg"]},{"Sister 2":["http://mm.chinasareview.com/wp-content/uploads/2016a/08/24/01.jpg","http://mm.chinasareview.com/wp-content/uploads/2016a/08/24/02.jpg"."http://mm.chinasareview.com/wp-content/uploads/2016a/08/24/03.jpg"]},{"Sister 3":["http://mm.chinasareview.com/wp-content/uploads/2016a/08/24/01.jpg","http://mm.chinasareview.com/wp-content/uploads/2016a/08/24/02.jpg"."http://mm.chinasareview.com/wp-content/uploads/2016a/08/24/03.jpg"]}]

You loop through the first layer to get the title and create the directory, then download the pictures in the second layer. In the code, we also modify the download to add exception handling:

try:
    response = requests.get(pic,headers=headers)
    with open(filename,'wb') as f :
        f.write(response.content)   # the with block closes the file automatically
except Exception as e:
    print(e)
    pass

Then add this to the main program:

#Open 10 threads to save pictures
for x in range(10):
    down = DownPic()
    down.start()

Run result:

[linuxboy@bogon demo]$ python3 down.py
 Analyze http://www.meizitu.com/a/pure_2.html
 Analyze http://www.meizitu.com/a/pure_1.html
['http://www.meizitu.com/a/5585.html', 'http://www.meizitu.com/a/5577.html', 'http://www.meizitu.com/a/5576.html', 'http://www.meizitu.com/a/5574.html', 'http://www.meizitu.com/a/5569.html', .......
<function current_thread at 0x7fa5121f2268> is running 
<function current_thread at 0x7fa5121f2268> is running 
<function current_thread at 0x7fa5121f2268> is running 
Come to me
 Pure sisters like a ray of sunshine warm this winter for success
 Pure sisters like a ray of sunshine warm this winter catalog was created successfully
 Lovely girl, may the warm wind take care of purity and persistence for success
 Lovely girl, may warm wind care for purity and persistence create catalogue successfully
 Super beauty, pure you and blue sky complement each other for success
 Ultra-beautiful, pure catalogue created with you and blue sky
 Beautiful frozen person, successful Taekwondo girls in the snow
 The delicate eyebrows of five senses are like the success of a fairy tale princess
 Have a confident and charming smile, every day is a brilliant success
 The exquisite eyebrows of five senses are like the successful creation of the Princess Catalogue in fairy tales
 With a confident and charming smile, every day is a brilliant catalog created successfully
 Pure and picturesque, photographer's imperial beans succeed

At the same time, the corresponding directories appear in the file system.

Open one of the directories to see the downloaded pictures.

Okay, that's it for today's simple crawler.

Finally, add this at the top of the code file:

# -*- coding: UTF-8 -*-   

This prevents "Non-ASCII character '\xe5' in file" errors.

Finally, here is the GitHub address for the code:

https://github.com/wangdezhen/pythonspider2018.git

Topics: Python Linux Firefox yum