Breakpoint crawler implementation II

Posted by TonyB on Fri, 12 Nov 2021 14:59:45 +0100


I believe most readers here have already read the first article in this series. In this one, we will build on the code from that article and look at how to optimize it.

1. Algorithm

Before getting to the crawler, let's talk about two algorithms.
Anyone who has studied data structures will know the two classic search algorithms for traversing a tree: depth first search and breadth first search. If you have already learned this material, you can skip this part; if you haven't, don't worry, we will work through it slowly here.

  • Depth first search (DFS)

    As the name suggests, depth first search keeps going down to the next layer first, until it reaches a node with no children; only then does the program back up to the previous layer and search the next node there. Its behaviour is very similar to a preorder traversal of a tree.

  • Breadth first search (BFS)

    Breadth first search traverses all nodes of the current layer before moving on to the next layer. Its behaviour is very similar to a level order traversal of a tree.

  • Code implementation

// C++ code implementation template
#include <bits/stdc++.h>
using namespace std;

map<int, vector<int>> L;

// Implementation of depth first search
void dfs(int node)
{
    for (int item : L[node])
    {
        cout << item << endl;
        if (!L[item].empty())
        {
            dfs(item);
        }
    }
}

// Breadth first search implementation
void bfs()
{
    queue<int> ll;
    ll.push(1);
    while (!ll.empty())
    {
        int temp = ll.front();
        ll.pop();
        for (int item : L[temp])
        {
            ll.push(item);
            cout << item << endl;
        }
    }
}

int main()
{
    L[1] = {2, 3, 4};
    L[2] = {5, 6, 7};
    L[3] = {8, 9, 10};
    L[4] = {11, 12, 13};
    dfs(1);
    cout << "---------------" << endl;
    bfs();
}
# Python code implementation template
L = [
    [],
    [2, 3, 4],
    [5, 6, 7],
    [8, 9, 10],
    [11, 12, 13]
]

# Depth first search
def dfs(node):
    for item in L[node]:
        print(item)
        if item < len(L):
            dfs(item)

# Breadth first search
def bfs():
    ll = []
    ll.append(1)
    while ll:
        temp = ll[0]
        ll.pop(0)
        for item in L[temp]:
            if item < len(L):
                ll.append(item)
            print(item)


if __name__ == '__main__':
    dfs(1)
    print('--------------------')
    bfs()

The C++ and Python templates above behave the same way: the DFS version prints 2 5 6 7 3 8 9 10 4 11 12 13, while the BFS version prints the nodes level by level as 2 3 4 5 6 7 8 9 10 11 12 13.
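As a side note on the Python template: popping from the front of a plain list with ll.pop(0) is O(n) per pop. A minimal variant using collections.deque from the standard library keeps each pop O(1). This is just an alternative sketch that reuses the L defined above, not part of the original template:

# Alternative BFS using collections.deque (O(1) pops from the left)
from collections import deque

def bfs_deque():
    ll = deque([1])
    while ll:
        temp = ll.popleft()
        for item in L[temp]:
            if item < len(L):
                ll.append(item)
            print(item)

For the small tree in the template the difference is invisible, but it matters once the queue grows to thousands of pending nodes, which is exactly what happens in the crawler below.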

2. Code implementation

Now that we understand depth first search and breadth first search, let's move on to the implementation of the crawler itself.

First, here is the code I wrote initially (crawling by depth first search):

def get_data_by_dfs(url, deep, father_id=0):
    global book
    message = f'''
        ---------------------Start crawling--------------------
        Current crawled node:\t{father_id}
        Current crawl depth:\t{deep}
        -------------------------------------------------
        '''
    print(message)

    temp_url = url.split('/')[:-1]      # drop the last path segment so the next layer's URL can be spliced onto this base

    if deep == 1:
        temp_data = fetch_province_list(url=url)
        for item in temp_data:
            # print(item)
            book["province"].append(((item['id'], item['name'], item['url'])))

            cur_url = temp_url + [item['url']]
            cur_url = "/".join(cur_url)     # url address splicing
            get_data_by_dfs(cur_url, deep+1, father_id=item['id'])
    elif 1 < deep <= 4:
        temp_data = fetch_district_list(url=url, deep=deep)
        for item in temp_data:
            if deep == 2:
                print(item)
                book["municipality"].append(
                    ((item['id'], item['name'], item['url'], father_id)))
                # time.sleep(60)
            elif deep == 3:
                book["district"].append(
                    ((item['id'], item['name'], item['url'], father_id)))
                # time.sleep(30)
            elif deep == 4:
                book["township"].append(
                    ((item['id'], item['name'], item['url'], father_id)))
                # time.sleep(10)

            cur_url = temp_url + [item['url']]
            cur_url = "/".join(cur_url)     # url address splicing
            get_data_by_dfs(cur_url, deep+1, father_id=item['id'])

    elif deep == 5:
        temp_data = fetch_village_list(url=url)
        for item in temp_data:
            # print(item)
            book["village"].append(
                ((item['id'], item['city_id'], item['name'], father_id)))
    message = f'''
        ---------------------End of crawling---------------------
        Current crawled node:\t{father_id}
        Current crawl depth:\t{deep}
        -------------------------------------------------
        '''
    print(message)

After running it, I found that the program hits all kinds of errors when crawling a large amount of data. If it aborts with an exception after running for a few hours, the next crawl has to start from scratch, which wastes a lot of time. To solve this, consider storing the state generated while the program runs, so that the next run can pick up where the last one ended. Storing that state is not easy in a recursive program, though, which is why the later version runs the crawl with breadth first search instead.
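Before the full crawler, here is a minimal, self-contained sketch of that idea: keep the pending work in a queue, and after each step write the remaining queue back to a JSON file, so an interrupted run can resume from the file. The file name tasks.json and the process_task helper are placeholders for illustration only, not part of the real crawler:

# Minimal sketch of the "persist the queue" pattern (illustration only)
import json
import os

def process_task(task):
    # stand-in for the real fetch/parse step; returns any newly discovered tasks
    print('processing', task)
    return []

def run(queue_file='tasks.json'):
    # resume from the saved queue if a previous run left one behind
    if os.path.exists(queue_file):
        with open(queue_file, 'r') as f:
            queue = json.load(f)['data']
    else:
        queue = [{'deep': 1, 'url': 'start'}]   # placeholder initial task
    while queue:
        task = queue.pop(0)
        queue.extend(process_task(task))
        # checkpoint: whatever is still queued survives a crash
        with open(queue_file, 'w') as f:
            json.dump({'data': queue}, f)

With that pattern in mind, here is the actual breadth first version of the crawler: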

def get_data_by_bfs():
    global book
    queue = []
    file = open('data.json', 'r')
    text = file.read()
    file.close()
    data = json.loads(text)
    for item in data['data']:
        queue.append(item)
    while queue:
        temp_data = queue[0]
        queue.pop(0)
        print(temp_data)
        father_id = temp_data['father_id']
        deep = temp_data['deep']
        url = temp_data['url']
        temp_url = url.split('/')[:-1]
        message = f'''
            ---------------------Start crawling--------------------
            Current crawled node:\t{father_id}
            Current crawl address:\t{url}
            Current crawl depth:\t{deep}
            -------------------------------------------------
            '''
        print(message)
        if deep == 1:
            address = fetch_province_list(url=url)
            for item in address:
                print(item)
                book["province"].append(
                    ((item['id'], item['name'], item['url'])))
                cur_url = temp_url + [item['url']]
                cur_url = "/".join(cur_url)
                queue.append(
                    {"father_id": item['id'], "deep": deep+1, "url": cur_url})
        elif 2 <= deep <= 4:
            address = fetch_district_list(url=url, deep=deep)
            for item in address:
                if deep == 2:
                    book["municipality"].append(
                        ((item['id'], item['name'], item['url'], father_id)))
                elif deep == 3:
                    book["district"].append(
                        ((item['id'], item['name'], item['url'], father_id)))
                elif deep == 4:
                    book["township"].append(
                        ((item['id'], item['name'], item['url'], father_id)))
                cur_url = temp_url + [item['url']]
                cur_url = "/".join(cur_url)
                queue.append({"father_id": item['id'],
                              "deep": deep+1, "url": cur_url})
                # print(item)
                message = f'''
                    ---------------------Crawling--------------------
                    Current crawled node:\t{item['id']}
                    The name of the current crawl:\t{item['name']}
                    Current crawl depth:\t{deep}
                    -------------------------------------------------
                    '''
                print(message)
        elif deep == 5:
            address = fetch_village_list(url=url)
            for item in address:
                print(item)
                book["village"].append(
                    ((item['id'], item['city_id'], item['name'], father_id)))
        message = f'''
            ---------------------End of crawling--------------------
            Current crawled node:\t{father_id}
            Current crawl address:\t{url}
            Current crawl depth:\t{deep}
            -------------------------------------------------
            '''
        print(message)
        book.save("address.xlsx")
        content = json.dumps({"data": queue})
        file = open('data.json', 'w')
        file.write(content)
        file.close()
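One thing get_data_by_bfs() assumes is that data.json already exists with at least one entry in its "data" list, so the very first run needs a seed file. A minimal way to create it might look like the snippet below; the start URL is a placeholder for whichever entry page the previous article crawled, not the real address:

# Seed data.json with the root task before the first run (illustrative only)
import json

START_URL = 'https://example.com/index.html'   # placeholder, replace with the real entry page

with open('data.json', 'w') as f:
    json.dump({"data": [{"father_id": 0, "deep": 1, "url": START_URL}]}, f)

After seeding, get_data_by_bfs() can simply be called again after any crash and it will continue from whatever queue it last wrote out (assuming the book workbook from the previous article is reloaded or re-created as well).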

Anyone who has written a crawler knows that if you crawl too heavily you will get blocked and be unable to access the site for a while. In the next article, let's talk about how to optimize the crawler code to deal with that.

Topics: Python Algorithm crawler