Introduction to Python Crawlers [9]: Multithreaded crawling of the Tuchong photo site

Posted by plasmahba on Mon, 22 Jul 2019 16:21:17 +0200

Tuchong - a few words up front

After a burst of keyboard clatter, this blog has finally reached its 10th post. More crawler modules will come up gradually from here. Someone asked when scrapy will make an appearance; I expect around post 30, so the pace stays deliberately slow for now, no rush. The full 100 posts should take 4-5 months. After that, I'll also cover the common anti-crawling measures, as well as the whole business of dealing with logins.

Tuchong - crawling tuchong.com

Why crawl this site? No deep reason: I stumbled onto it and the image quality is genuinely good, a cut above the flashy, cheap stuff elsewhere, so I started crawling. Searching around, others have crawled it too, but mostly in Python 2; I haven't seen a Python 3 version yet, so I decided to write one by hand.

Start Page

https://tuchong.com/explore/
This page lists a lot of tags, and each tag has many images underneath. To keep things tame, I picked a perfectly wholesome tag: flowers. You can pick any other tag you like, or even crawl them all.

https://tuchong.com/tags/%E8%8A%B1%E5%8D%89/#
The tag 花卉 (flowers) is simply URL-encoded as %E8%8A%B1%E5%8D%89; don't let the encoding throw you.
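That percent-encoded blob is just the tag name in UTF-8. As a quick standalone check (my own snippet, not from the original post), the standard library can produce and reverse it:

from urllib.parse import quote, unquote

print(quote('花卉'))                   # %E8%8A%B1%E5%8D%89
print(unquote('%E8%8A%B1%E5%8D%89'))   # 花卉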

This time we'll also make use of Python's queue module, which we haven't covered before.

Below is a quick summary of the queue API, adapted from explanations I've found elsewhere; it shows up in a lot of basic crawlers. A runnable demo follows the list.

1. Initialization: queue.Queue(maxsize), a FIFO (first-in, first-out) queue

2. Common methods:

    - queue.qsize() returns the number of items in the queue
    - queue.empty() returns True if the queue is empty, otherwise False
    - queue.full() returns True if the queue is full, otherwise False
    - "full" corresponds to the maxsize given at construction
    - queue.get(block=True, timeout=None) takes an item off the queue; timeout is how long to wait

3. Create a Queue object
    import queue
    myqueue = queue.Queue(maxsize=10)

4. Put a value into the queue
    myqueue.put(10)

5. Take a value off the queue
    myqueue.get()
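Putting those pieces together, here is a minimal runnable demo of the methods above (my own example, not from the original post):

from queue import Queue

q = Queue(maxsize=3)        # FIFO queue holding at most 3 items
print(q.empty())            # True: nothing queued yet

for page in range(3):
    q.put(page)             # enqueue 0, 1, 2

print(q.full(), q.qsize())  # True 3: "full" corresponds to maxsize
print(q.get())              # 0: first in, first out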

Start coding

First, let's build the skeleton of the main method; the key points are in the code comments.

# Imports for the full script (some are used in later sections)
import threading
import time
import os

import requests
from queue import Queue, Empty

def main():
    # Declare a queue and load page numbers 1-100 into it
    page_queue = Queue(100)
    for i in range(1, 101):
        page_queue.put(i)

    # Results queue (image URLs waiting to be downloaded)
    data_queue = Queue()

    # Keep a list of the crawl threads
    thread_crawl = []
    # Start 4 crawl threads
    craw_list = ['Acquisition Thread 1', 'Acquisition Thread 2', 'Acquisition Thread 3', 'Acquisition Thread 4']
    for thread_name in craw_list:
        c_thread = ThreadCrawl(thread_name, page_queue, data_queue)
        c_thread.start()
        thread_crawl.append(c_thread)

    # Busy-wait until page_queue is empty, i.e. every page has been claimed
    while not page_queue.empty():
        pass

if __name__ == '__main__':
    main()
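One aside that is mine, not the original author's: the wait loop `while not page_queue.empty(): pass` spins a CPU core at 100%. A drop-in variation that sleeps between polls is gentler:

import time

# Drop-in replacement for the wait loop in main():
# same condition, but yield the CPU between polls instead of busy-spinning
while not page_queue.empty():
    time.sleep(0.5)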

Once this runs, four threads start up and main() sits waiting for the page queue to drain. For that to do anything useful, the ThreadCrawl class still needs to be written.

class ThreadCrawl(threading.Thread):

    def __init__(self, thread_name, page_queue, data_queue):
        # Invoke the parent class initializer
        super(ThreadCrawl, self).__init__()
        self.threadName = thread_name
        self.page_queue = page_queue
        self.data_queue = data_queue

    def run(self):
        print(self.threadName + ' started ************')

Run results

The four threads start up. In the run method we just need to add the actual crawling code. To signal when the crawl threads should stop, introduce a module-level flag:

CRAWL_EXIT = False

With the flag declared, flesh out ThreadCrawl as follows; the matching changes to main() come right after.

class ThreadCrawl(threading.Thread):

    def __init__(self, thread_name, page_queue, data_queue):
        # Invoke the parent class initializer
        super(ThreadCrawl, self).__init__()
        self.threadName = thread_name
        self.page_queue = page_queue
        self.data_queue = data_queue

    def run(self):
        print(self.threadName + ' started ************')
        while not CRAWL_EXIT:
            try:
                # Bring in the module-level configuration
                global tag, url_format, headers, img_format
                # get(block=False) raises Empty when the queue has nothing left
                page = self.page_queue.get(block=False)
                spider_url = url_format.format(tag, page, 100)  # build the URL to crawl
                print(spider_url)
            except Empty:
                break

            timeout = 4  # retry budget: up to four attempts, then give up on this page
            while timeout > 0:
                timeout -= 1
                try:
                    with requests.Session() as s:
                        response = s.get(spider_url, headers=headers, timeout=3)
                        json_data = response.json()
                        if json_data is not None:
                            for post in json_data["postList"]:
                                for img in post["images"]:
                                    # Build the image URL and queue it for the download threads
                                    img_url = img_format.format(img["user_id"], img["img_id"])
                                    self.data_queue.put(img_url)
                    break

                except Exception as e:
                    print(e)

            if timeout <= 0:
                print('time out!')

def main():
    # ... code above ...

    # Busy-wait until page_queue is empty, i.e. every page has been claimed
    while not page_queue.empty():
        pass

    # page_queue is empty, so tell the crawl threads to exit their loops
    global CRAWL_EXIT
    CRAWL_EXIT = True

    # Sanity check: how many image links did we collect?
    print(data_queue.qsize())

After a test run, data_queue does contain data!! The rest is the same pattern all over again, just for downloading the images.
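One thing to note: run() reads four module-level names (tag, url_format, headers and img_format) that this post never actually defines. Here is a minimal sketch of plausible values; the exact endpoint and image URL patterns are my assumptions, reconstructed from how the names are used above, not something confirmed by the original post:

import urllib.parse

tag = urllib.parse.quote('花卉')  # the tag to crawl, URL-encoded
# Paging API, filled with (tag, page number, posts per page) - assumed format
url_format = 'https://tuchong.com/rest/tags/{}/posts?page={}&count={}'
# Image URL built from user_id and img_id - assumed pattern
img_format = 'https://photo.tuchong.com/{}/f/{}.jpg'
# Minimal request headers so the API responds
headers = {'User-Agent': 'Mozilla/5.0'}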

Improving the main method

DOWN_EXIT = False  # module-level flag for the download threads, mirroring CRAWL_EXIT

def main():
    # ... code above ...

    # Wait for all crawl threads to finish
    for thread in thread_crawl:
        thread.join()
        print("Crawl thread finished")

    # Keep a list of the download threads, started the same way
    thread_image = []
    image_list = ['Download Thread 1', 'Download Thread 2', 'Download Thread 3', 'Download Thread 4']
    for thread_name in image_list:
        Ithread = ThreadDown(thread_name, data_queue)
        Ithread.start()
        thread_image.append(Ithread)

    # Wait until every queued image link has been consumed
    while not data_queue.empty():
        pass

    # Signal the download threads to exit their loops
    global DOWN_EXIT
    DOWN_EXIT = True

    for thread in thread_image:
        thread.join()
        print("Download thread finished")

Next, add the ThreadDown class, which does the actual image downloading.

class ThreadDown(threading.Thread):
    def __init__(self, thread_name, data_queue):
        super(ThreadDown, self).__init__()
        self.thread_name = thread_name
        self.data_queue = data_queue

    def run(self):
        print(self.thread_name + ' started ************')
        while not DOWN_EXIT:
            try:
                img_link = self.data_queue.get(block=False)
                self.write_image(img_link)
            except Exception:
                pass  # empty queue or a failed download; keep polling until DOWN_EXIT

    def write_image(self, url):
        # Fetch the raw image bytes
        with requests.Session() as s:
            response = s.get(url, timeout=3)
            img = response.content

        try:
            os.makedirs('image', exist_ok=True)  # make sure the output directory exists
            # Build the filename once so the log line matches the file actually written
            file_name = 'image/' + str(time.time()) + '.jpg'
            with open(file_name, 'wb') as file:
                file.write(img)
            print(file_name + ' downloaded.')

        except Exception as e:
            print(e)
            return
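As a closing aside that is my own sketch rather than the author's code: the standard-library way to coordinate this kind of producer/consumer handoff is Queue.task_done() together with Queue.join(), which removes the need for both the exit flags and the busy-wait loops:

import threading
from queue import Queue

def worker(q):
    while True:
        item = q.get()
        if item is None:          # sentinel: time to exit
            q.task_done()
            break
        print('processed', item)  # stand-in for fetching a page / downloading an image
        q.task_done()

q = Queue()
threads = [threading.Thread(target=worker, args=(q,)) for _ in range(4)]
for t in threads:
    t.start()

for page in range(1, 11):
    q.put(page)

q.join()                          # blocks until every item has been task_done()
for _ in threads:
    q.put(None)                   # one sentinel per worker
for t in threads:
    t.join()

Each worker exits when it sees its None sentinel, so no global flags are needed.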

Run it, and then it's just a matter of waiting for the images to come down.

The key points are annotated in the code. Since it's fairly simple, the code will also be uploaded to GitHub as before.

Swap the flowers tag above for something like "beauties" and the results are out of this world.
