Crawler | Use Python to grab all the answers under a Zhihu question (complete code included)

Posted by jv2222 on Wed, 29 Sep 2021 23:50:28 +0200


Hello, I'm 👉 [year of fighting]

This post walks you step by step through crawling all the answer data under a Q&A question. I hope you find it helpful.

Target website:

https://www.zhihu.com/question/368550554

Fields to crawl: the release time, author, upvote count, content, and so on of every answer under the question (other fields can be added as needed).

1. Web page analysis

1.1 Web page debugging

Press F12 to open the browser's developer tools and look for the URL that loads the answer data. Since the Q&A page only shows a limited number of answers at a time, scroll down to load a few more pages so the pattern becomes clear:

Search for any snippet of the page content. Take the sentence "recommend two movies that have been underrated by me" as an example. As the figure above shows, the answer content turns up in the response of the URL whose path contains answers, and it matches the page exactly.

1.2 Request parameter analysis

Requested URL:

Query string parameters:

Focus on the limit and offset parameters here. What do these two parameters do? Let's look through a few of the answers request links and compare:

Pattern: limit is always 5, while offset increases in steps of 5 (5 / 10 / 15 / 20 ...).
In other words, limit is the number of answers returned per request, and offset is the position of the answer to start from.
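
As a quick sketch of the pattern (the four pages below are arbitrary, just for illustration), this is how offset advances from one request to the next:

limit = 5
for page in range(4):           # 4 pages, purely as an example
    offset = page * limit       # 0, 5, 10, 15
    print(f'request {page + 1}: limit={limit}, offset={offset}')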

2. Data crawling

2.1 Crawling test

Based on the URL analyzed above, let's construct a request and run a quick test:

import json
import time

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36',
}

# Fetch the first two pages: limit=5, so offset advances by 5 per request
for i in range(2):
    url = f'https://www.zhihu.com/api/v4/questions/368550554/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Cis_labeled%2Cpaid_info%2Cpaid_info_content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_recognized%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cvip_info%2Cbadge%5B%2A%5D.topics%3Bdata%5B%2A%5D.settings.table_of_content.enabled&limit=5&offset={i * 5}&platform=desktop&sort_by=default'
    r = requests.get(url, headers=headers)
    r.raise_for_status()
    r.encoding = 'utf-8'
    datas = json.loads(r.text)
    for info in datas['data']:
        # Pull out the fields we care about: time, author, upvotes, excerpt
        author = info['author']['name']
        created_time = time.strftime("%Y/%m/%d %H:%M:%S", time.localtime(info['created_time']))
        voteup_count = info['voteup_count']
        text = info['excerpt']
        oneinfo = [created_time, author, voteup_count, text]
        print(oneinfo)
    print('+++++++++++++++++++++++++++')

Result:

So we can loop through pages like this to fetch all the data.
After testing, limit can be raised to at most 20, which cuts the number of requests to a quarter of the original.
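
To put numbers on that (the 600-answer total below is made up purely for illustration):

import math

total_answers = 600                      # hypothetical total number of answers
print(math.ceil(total_answers / 5))      # 120 requests with limit=5
print(math.ceil(total_answers / 20))     # 30 requests with limit=20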

2.2 Optimization

With the crawling approach above, we would need to know in advance how many pages of data there are before we can set the number of loop iterations:

Number of iterations = ceil(total number of answers / limit)
In fact, careful readers will have noticed that the responses we fetch already carry some paging information:


The key fields, in short:

Link to the previous page (previous)

Link to the next page (next)

Whether this is the first page (is_start)

Whether this is the last page (is_end)

The total number of answers
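
To illustrate, here is a minimal sketch of how these paging fields drive the crawl, using a made-up, heavily trimmed response:

# Hypothetical trimmed-down response body, for illustration only
datas = {
    'data': ['...answer objects...'],
    'paging': {
        'previous': 'https://www.zhihu.com/api/v4/questions/368550554/answers?limit=20&offset=0',
        'next': 'https://www.zhihu.com/api/v4/questions/368550554/answers?limit=20&offset=40',
        'is_start': False,
        'is_end': False,
    },
}

if not datas['paging']['is_end']:
    next_url = datas['paging']['next']   # request this URL to fetch the next batch of answers
    print(next_url)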

Recursive crawling:

def getinfo(url, headers):
    allinfo = []
    try:
        r = requests.get(url, headers=headers)
        r.raise_for_status()
        r.encoding = 'utf-8'
        datas = json.loads(r.text)
        for info in datas['data']:
            # Extract publication time, author, upvote count and excerpt
            author = info['author']['name']
            created_time = time.strftime("%Y/%m/%d %H:%M:%S", time.localtime(info['created_time']))
            voteup_count = info['voteup_count']
            text = info['excerpt']
            oneinfo = [created_time, author, voteup_count, text]
            print(oneinfo)
            allinfo.append(oneinfo)
        # Stop once the API reports that this is the last page
        if datas['paging']['is_end']:
            print('----')
            return
        next_url = datas['paging']['next']
        # Random pause between pages to avoid hammering the server
        time.sleep(random.uniform(5.1, 20.1))
        return getinfo(next_url, headers)
    except Exception:
        # If the request fails, wait and retry the same page
        # (retry url, not next_url, which may not have been set yet)
        time.sleep(random.uniform(5.1, 20.1))
        return getinfo(url, headers)
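
Recursion works fine here, but every page adds a call-stack frame, so a question with thousands of answers could bump into Python's default recursion limit (1000). If you prefer, here is a minimal iterative sketch of the same logic (same field names and paging behaviour as above) that you could swap in:

import json
import random
import time

import requests

def getinfo_iter(url, headers):
    # Iterative variant: follow paging['next'] in a while loop instead of recursing
    allinfo = []
    while True:
        r = requests.get(url, headers=headers)
        r.raise_for_status()
        r.encoding = 'utf-8'
        datas = json.loads(r.text)
        for info in datas['data']:
            created_time = time.strftime("%Y/%m/%d %H:%M:%S", time.localtime(info['created_time']))
            allinfo.append([created_time, info['author']['name'], info['voteup_count'], info['excerpt']])
        if datas['paging']['is_end']:
            break
        url = datas['paging']['next']
        time.sleep(random.uniform(5.1, 20.1))   # pause between pages
    return allinfo

Unlike the recursive version, this sketch does not retry failed requests; wrap the request in a try/except if you need that.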

3. Data saving

3.1 openpyxl

Here we use openpyxl to save the data to Excel. You can also try saving to other file formats or to a database:

def insert2excel(filepath, allinfo):
    try:
        # Create the workbook with a header row if it doesn't exist yet
        if not os.path.exists(filepath):
            tableTitle = ['Release time', 'Username', 'Upvotes', 'Content']
            wb = Workbook()
            ws = wb.active
            ws.title = 'sheet1'
            ws.append(tableTitle)
            wb.save(filepath)
            time.sleep(3)
        # Append the new rows to the existing workbook
        wb = load_workbook(filepath)
        ws = wb.active
        for info in allinfo:
            ws.append(info)
        wb.save(filepath)
        print('File updated')
    except Exception:
        print('File update failed')
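
A quick sanity check of the function, using a made-up row and a hypothetical file name:

# Hypothetical sample data, for testing the save logic only
sample_rows = [
    ['2021/09/29 23:50:28', 'some_user', 42, 'an example answer excerpt'],
]
insert2excel('test_368550554.xlsx', sample_rows)
# The first call creates the workbook with the header row, then appends the sample row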

Result:

3.2 Complete code

Below is the complete code, which you can run locally as is. There is still plenty of room for optimization, which you are welcome to tinker with yourself:

import os
import json
import time
import math
import random
import requests
from openpyxl import load_workbook, Workbook

# Data acquisition
def getinfo(url, headers):
    allinfo = []
    try:
        r = requests.get(url, headers=headers)
        r.raise_for_status()
        r.encoding = 'utf-8'
        datas = json.loads(r.text)
        for info in datas['data']:
            # Extract publication time, author, upvote count and excerpt
            author = info['author']['name']
            created_time = time.strftime("%Y/%m/%d %H:%M:%S", time.localtime(info['created_time']))
            voteup_count = info['voteup_count']
            text = info['excerpt']
            oneinfo = [created_time, author, voteup_count, text]
            print(oneinfo)
            allinfo.append(oneinfo)
        # Save this batch of answers before moving on
        insert2excel(filepath, allinfo)
        # Stop once the API reports that this is the last page
        if datas['paging']['is_end']:
            print('----')
            return
        next_url = datas['paging']['next']
        # Random pause between pages to avoid hammering the server
        time.sleep(random.uniform(5.1, 20.1))
        return getinfo(next_url, headers)
    except Exception:
        # If the request fails, wait and retry the same page
        time.sleep(random.uniform(5.1, 20.1))
        return getinfo(url, headers)

# Data saving
def insert2excel(filepath, allinfo):
    try:
        # Create the workbook with a header row if it doesn't exist yet
        if not os.path.exists(filepath):
            tableTitle = ['Release time', 'Username', 'Upvotes', 'Content']
            wb = Workbook()
            ws = wb.active
            ws.title = 'sheet1'
            ws.append(tableTitle)
            wb.save(filepath)
            time.sleep(3)
        # Append the new rows to the existing workbook
        wb = load_workbook(filepath)
        ws = wb.active
        for info in allinfo:
            ws.append(info)
        wb.save(filepath)
        print('File updated')
    except Exception:
        print('File update failed')
        
url = 'https://www.zhihu.com/api/v4/questions/368550554/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Cis_labeled%2Cpaid_info%2Cpaid_info_content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_recognized%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cvip_info%2Cbadge%5B%2A%5D.topics%3Bdata%5B%2A%5D.settings.table_of_content.enabled&limit=20&offset=0&platform=desktop&sort_by=default'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36',
}
filepath = '368550554.xlsx'
getinfo(url, headers)

That's all for this post. Go and practice it for yourself. Original content isn't easy to produce, so if you like this article, please give it a like, bookmark it, or share it (with attribution) so more people can see it.

Recommended reading

Crawler | 20,000+ film reviews of White Snake 2: Green Snake, analyzed and visualized
Visualization | Python analyzes Mid-Autumn mooncakes, and these flavors are yyds!!!
123 common basic Pandas commands, really good stuff!
Crawler + Visualization | Dynamic display of the worldwide medal distribution at the Tokyo 2020 Olympic Games
Pandas + Pyecharts | Analysis and visualization of second-hand housing data from a Beijing platform
Pandas + Pyecharts | Analysis and visualization of the 2021 comprehensive ranking of Chinese universities
Visualization | Drawing a good-looking geographic typhoon track map in Python
Visualization | Python analyzes nearly 5,000 tourist attractions and tells you where to go for the holiday
Visualization | An exquisite dynamic map in Python of the GDP of China's provinces and cities over the past 20 years
Visualization | Python keeps you company on 520: by your side, right beside you
Crawler | Python grabs the full set of skins from the official Honor of Kings website
Crawler | Build your own IP proxy pool in Python and never worry about running out of IPs again!
Tips | The 20 most practical and efficient PyCharm shortcuts (dynamic display)
Tips | A 5,000-word deep dive into Python's three output formatting methods [% / format / f-string]
Tips | Send scheduled emails with Python (attachments added automatically)

This article was first published on the WeChat official account "Python when the year of the fight", which pushes Python programming tips every day. I hope you'll enjoy it.

Topics: Python, crawler