Crawling exercise - crawling users' dynamic information on Jianshu (dealing with AJAX)

Posted by effigy on Tue, 21 Jan 2020 18:48:08 +0100

Preface:

To handle AJAX dynamic loading, we crawl the dynamic (timeline) information of Jianshu users and store the crawled data in a MongoDB database.

Written to organize the code, sort out the ideas, and verify that the code works -- January 21, 2020

Environment:
Python3(Anaconda3)
PyCharm
Chrome browser

Main modules (the command to install each in a cmd window is given in parentheses):
requests(pip install requests)
lxml(pip install lxml)
pymongo(pip install pymongo )

1

First, a brief introduction to asynchronous loading (AJAX): it is a technique that can update parts of a web page without reloading the whole page. 1

How this shows up in the page can be seen in the screenshot: after clicking "article" and "dynamic", the URL does not change. This is asynchronous loading (AJAX) at work.

2

How do we deal with this when crawling? Open the developer tools (F12), switch to the Network tab and select the XHR filter, then click "dynamic". The developer tools should now look like the screenshot, and a request named timeline?_pjax=%23list-container appears. This is the first step: find the dynamic request and get the URL of the real request (boxed in the screenshot).
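
To confirm we have found the right request, we can replay it directly with requests. A minimal sketch, assuming the example user id c5a2ce84f60b used throughout this article:

# Replay the captured XHR request to verify it returns the timeline content
import requests

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'}
url = 'https://www.jianshu.com/users/c5a2ce84f60b/timeline?_pjax=%23list-container'
resp = requests.get(url, headers=headers)
print(resp.status_code)   # 200 if the request was accepted
print(resp.text[:200])    # the start of the returned HTML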

3

At this point, we try to simplify this URL by deleting the unnecessary parameters.

# Original URL
https://www.jianshu.com/users/c5a2ce84f60b/timeline?_pjax=%23list-container

# After streamlining
https://www.jianshu.com/users/c5a2ce84f60b/timeline

In this way, we can construct other URLs from the simplified one.
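
The same simplification can be done programmatically. A small sketch using the standard library's urllib.parse:

# Strip the query string from the captured URL to get the simplified form
from urllib.parse import urlsplit, urlunsplit

original = 'https://www.jianshu.com/users/c5a2ce84f60b/timeline?_pjax=%23list-container'
parts = urlsplit(original)
simplified = urlunsplit((parts.scheme, parts.netloc, parts.path, '', ''))
print(simplified)  # https://www.jianshu.com/users/c5a2ce84f60b/timeline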

4

When we go looking for the other pages, we find there is no pagination bar: Jianshu's paging is also implemented with asynchronous loading. Scroll the page down and watch which requests get loaded.

At first I naively hoped that a single page parameter would be enough, e.g. https://www.jianshu.com/users/c5a2ce84f60b/timeline?page=2, but it is not that simple.

Yes, that endpoint responds. BUT it returns the same content as the first page. This means the other parameter, max_id, is also essential. Next, we consider how to obtain max_id.
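
So a working next-page URL needs both parameters. A sketch of the URL we are aiming for; the max_id value here is a made-up placeholder, obtained for real in the next step:

# A next-page request needs both max_id and page
user_id = 'c5a2ce84f60b'
max_id = 578127154   # placeholder; extracted from the previous page in step 5
page = 2
next_url = ('https://www.jianshu.com/users/%s/timeline?max_id=%s&page=%s'
            % (user_id, max_id, page))
print(next_url)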

5

We need a sharp pair of eyes here. After much searching, we find that the id attribute of the last li element in each of these XHR responses is the next page's max_id + 1. With that, the problem is solved.
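
In code, this means taking the id of the last li, stripping the feed- prefix, and subtracting 1. A minimal sketch with lxml; the HTML snippet is a made-up stand-in for one real XHR response:

# Extract the next page's max_id from the last li element
from lxml import etree

html = '''<ul class="note-list">
  <li id="feed-578127160">...</li>
  <li id="feed-578127156">...</li>
</ul>'''

selector = etree.HTML(html)
ids = selector.xpath('//ul[@class="note-list"]/li/@id')
max_id = int(ids[-1].split('-')[1]) - 1   # last id minus 1
print(max_id)  # 578127155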

6

Steps 2-6 above dealt with Jianshu's AJAX loading; this approach is known as reverse engineering. With this kind of "anti-crawler" measure handled, the remaining work is to extract the information and insert it into the MongoDB database.
The information we want to crawl is the dynamic's type (below, "like" means "liked an article") and the dynamic's publication time.

7

I won't go through the detailed page analysis here; for that, see the complete code. The code is fairly simple, and I have annotated each block. If anything is unclear, please leave a comment or send me a private message.

Complete code

# url = "https://www.jianshu.com/users/c5a2ce84f60b/timeline?_pjax=%23list-container"
# Import library
import requests
from lxml import etree
import pymongo

# Connect to MongoDB database
client = pymongo.MongoClient('localhost', 27017)

# Create the database and the collection
mydb = client['mydb']
timeline = mydb['timeline']

# Add a request header so the request looks like it comes from a normal browser
headers = {'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                        'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'}


def get_time_info(url, page):
    # Split the url to get the user id, e.g. 'c5a2ce84f60b' from
    # https://www.jianshu.com/users/c5a2ce84f60b/timeline
    user_id = url.split('/')[4]

    # The first request carries no page parameter; every request we build
    # below targets the next page, so advance the counter here
    page = page + 1

    html = requests.get(url=url, headers=headers)
    selector = etree.HTML(html.text)
    print(url, html.status_code)

    # First split the page into li blocks to simplify the parsing below
    infos = selector.xpath('//ul[@class="note-list"]/li')
    for info in infos:
        # Publication time of the dynamic
        dd = info.xpath('div/div/div/span/@data-datetime')[0]
        # Type of the dynamic; named dtype to avoid shadowing the built-in type()
        dtype = info.xpath('div/div/div/span/@data-type')[0]

        # Insert the record as a dictionary (stored as a JSON-style document)
        timeline.insert_one({'date': dd, 'type': dtype})
        print({'date': dd, 'type': dtype})

    # Get the id of the last li to construct the url of the next page
    id_infos = selector.xpath('//ul[@class="note-list"]/li/@id')
    if len(infos) > 1:
        feed_id = id_infos[-1]
        # The raw id looks like 'feed-578127155'; the next page's max_id
        # is that number minus 1 (see step 5)
        max_id = int(feed_id.split('-')[1]) - 1
        # Construct the url of the next page
        next_url = ('https://www.jianshu.com/users/%s/timeline?max_id=%s&page=%s'
                    % (user_id, max_id, page))
        # Recursive call: crawl the next page
        get_time_info(next_url, page)


if __name__ == '__main__':
    get_time_info('https://www.jianshu.com/users/c5a2ce84f60b/timeline', 1)
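
After a run, we can check what landed in the database. A quick sketch with pymongo, counting how many records of each type were stored:

# Count the stored records per dynamic type
import pymongo

client = pymongo.MongoClient('localhost', 27017)
timeline = client['mydb']['timeline']
for row in timeline.aggregate([{'$group': {'_id': '$type', 'n': {'$sum': 1}}}]):
    print(row['_id'], row['n'])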

1. Ajax (Asynchronous JavaScript and XML) ↩︎


Topics: MongoDB, Database, pip, network