1. Font anti-crawling
An introduction to font anti-crawling, using the Qidian Chinese website as a case study.
Requirement: from https://www.qidian.com/rank/yuepiao/, get the title and monthly ticket count of every book on the Qidian monthly ticket ranking list.
By capturing packets, we can see that the book titles and monthly ticket counts we need are embedded in the HTML, so we will use the etree module from lxml and extract the data with XPath.
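For readers who have not used lxml before, here is a minimal, self-contained sketch of the etree + XPath workflow used throughout this article (the HTML string below is purely illustrative; the real page is fetched with requests next):

from lxml import etree

# A tiny illustrative HTML fragment, not the real Qidian page
sample_html = '<html><div><h4><a>Example Book</a></h4></div></html>'
html_obj = etree.HTML(sample_html)
# xpath() returns a list of matching text nodes
print(html_obj.xpath('//h4/a/text()'))   # ['Example Book']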
import requests
from lxml import etree
from fake_useragent import FakeUserAgent

if __name__ == '__main__':
    # 1. Confirm the target url
    url_ = 'https://www.qidian.com/rank/yuepiao/'
    # 2. Construct the request headers
    headers_ = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36',
        'Cookie': 'e1=%7B%22pid%22%3A%22qd_P_rank_19%22%2C%22eid%22%3A%22%22%7D; e2=%7B%22pid%22%3A%22qd_P_rank_01%22%2C%22eid%22%3A%22qd_C45%22%2C%22l1%22%3A5%7D; e1=%7B%22pid%22%3A%22qd_P_rank_01%22%2C%22eid%22%3A%22qd_C45%22%2C%22l1%22%3A5%7D; e2=%7B%22pid%22%3A%22qd_p_qidian%22%2C%22eid%22%3A%22qd_A16%22%2C%22l1%22%3A3%7D; _csrfToken=FJAYOKmb5GpRuB6mdxwLXF1sDkKqgTL0z5gG7Ana; newstatisticUUID=1613732256_1917024121; _yep_uuid=adb684fd-87c1-4108-391c-f50ab9ac0d5c; _gid=GA1.2.180413774.1628410724; e1=%7B%22pid%22%3A%22qd_p_qidian%22%2C%22eid%22%3A%22qd_A16%22%2C%22l1%22%3A3%7D; e2=%7B%22pid%22%3A%22qd_p_qidian%22%2C%22eid%22%3A%22%22%7D; _ga_FZMMH98S83=GS1.1.1628410723.1.1.1628410744.0; _ga_PFYW0QLV3P=GS1.1.1628410723.1.1.1628410744.0; _ga=GA1.2.707336986.1628410723',
        'Referer': 'https://www.qidian.com/rank/'
    }
    # 3. Send the request and get the response
    response_ = requests.get(url_, headers=headers_)
    data_ = response_.text
    # Save the response so we can check whether it contains the data we need
    with open('qidian.html', 'w', encoding='utf-8') as f:
        f.write(data_)
Note that Qidian is a large website, so the request headers should be as complete as possible. Next, check whether the saved response contains the data we need.
After inspecting the saved HTML, the data we need is indeed in the response. The next step is to extract it. Since the data is HTML, the key is to work out and debug the XPath expressions before extracting. There are 20 books per page, so each extraction should return 20 results.
Book title XPath: //h4/a/text()
Monthly ticket count XPath: //span/span/text() or //span[@class="IuAmFihj"]/text()
Note: the second XPath works while debugging in the browser, but when we run the program (e.g. in PyCharm) it extracts nothing, because the class attribute of that span changes on every visit to the website.
import requests
from lxml import etree
from fake_useragent import FakeUserAgent

if __name__ == '__main__':
    # 1. Confirm the target url
    url_ = 'https://www.qidian.com/rank/yuepiao/'
    # 2. Construct the request headers
    headers_ = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36',
        'Cookie': 'e1=%7B%22pid%22%3A%22qd_P_rank_19%22%2C%22eid%22%3A%22%22%7D; e2=%7B%22pid%22%3A%22qd_P_rank_01%22%2C%22eid%22%3A%22qd_C45%22%2C%22l1%22%3A5%7D; e1=%7B%22pid%22%3A%22qd_P_rank_01%22%2C%22eid%22%3A%22qd_C45%22%2C%22l1%22%3A5%7D; e2=%7B%22pid%22%3A%22qd_p_qidian%22%2C%22eid%22%3A%22qd_A16%22%2C%22l1%22%3A3%7D; _csrfToken=FJAYOKmb5GpRuB6mdxwLXF1sDkKqgTL0z5gG7Ana; newstatisticUUID=1613732256_1917024121; _yep_uuid=adb684fd-87c1-4108-391c-f50ab9ac0d5c; _gid=GA1.2.180413774.1628410724; e1=%7B%22pid%22%3A%22qd_p_qidian%22%2C%22eid%22%3A%22qd_A16%22%2C%22l1%22%3A3%7D; e2=%7B%22pid%22%3A%22qd_p_qidian%22%2C%22eid%22%3A%22%22%7D; _ga_FZMMH98S83=GS1.1.1628410723.1.1.1628410744.0; _ga_PFYW0QLV3P=GS1.1.1628410723.1.1.1628410744.0; _ga=GA1.2.707336986.1628410723',
        'Referer': 'https://www.qidian.com/rank/'
    }
    # 3. Send the request and get the response
    response_ = requests.get(url_, headers=headers_)
    data_ = response_.text
    # # Save the response to check whether it contains the data we need
    # with open('qidian.html', 'w', encoding='utf-8') as f:
    #     f.write(data_)
    # 4. Parse the data to get the book titles and the monthly ticket counts
    html_obj = etree.HTML(data_)
    book_list = html_obj.xpath('//h4/a/text()')
    num_list = html_obj.xpath('//span/span/text()')
    print(book_list)
    print(num_list)
Following the normal workflow, we should now have the book titles and the monthly ticket counts, but printing the extracted data shows the following:
['Nomenclature of night', 'Unscientific Royal beast', 'I have a myth tree', "I just don't play cards according to the routine", 'From the red moon', 'My cloud girlfriend', 'Great dreamer', 'The other side of deep space', 'This man is too serious', 'Douluo continent V Rebirth of Tang San', 'Fairy fox', 'Dafeng Dageng man', 'Stargate', 'Terran garrison envoy', 'One Qiu eight in the northern mansion of the Eastern Jin Dynasty', 'I can only talk to S First class goddess in love', "I really don't want to see it bug", "Steady, don't wave", 'Sincere Sky Survey', 'Full time artist']
['𘛽𘛽𘛿𘜀𘜄', ...]  (20 strings of unreadable private-use glyphs, one per book)
The book titles display normally, but the monthly ticket counts are all garbled. This is what we call font anti-crawling.
What characterizes font anti-crawling here:
1. The class attribute value of the span changes on every visit to the website.
2. Without special handling, the real data cannot be obtained.
Analysis:
When we copy the monthly ticket count directly from the page, as a normal user would, we get something like:
(a string of unreadable glyphs)Monthly Ticket
So even a normal user copying the data by hand gets garbled glyphs instead of digits, which means an ordinary crawler has even less chance of getting the real numbers.
The markup corresponding to the monthly ticket count, found in the yuepiao/ packet:
𘛽𘛽𘛿𘜀𘜄</span></span>Monthly Ticket</p>
You can see that each digit has been replaced by a character like 𘛽; in the raw HTML source these characters appear as numeric character references of the form &#...;.
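As a quick sanity check (the code point 100544 is taken from the cmap table shown later and is used here purely as an illustration), Python's standard html module can turn such a numeric character reference back into the character it stands for:

import html

# '&#100544;' is an HTML numeric character reference; unescape() yields chr(100544)
ref = '&#100544;'
glyph = html.unescape(ref)
print(glyph)         # the private-use glyph the browser renders instead of a digit
print(ord(glyph))    # 100544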
In the browser's Network panel, under Font, there are three woff files. These files are what implement the font encryption.
Which one should we use?
Using the element-picker arrow in the top-left corner of DevTools, click the monthly ticket number on the page to jump to its tag. The class attribute on that element has the same name as one of the three woff files, so we can guess that this is the woff file being used.
Downloading the woff file:
1. Double-click it in the Network panel to download it manually.
2. Send a request with Python code and download it (a minimal sketch follows below).
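A minimal sketch of option 2, assuming the woff URL is already known (the file name FryVjKMa.woff and the URL below are only examples; the real name changes on every visit and is extracted from the page response later in this article):

import requests

# Example URL only; the real crawler extracts this URL from the page response
font_url = 'https://qidian.gtimg.com/qd_anti_spider/FryVjKMa.woff'
font_response = requests.get(font_url)
# woff is a binary format, so write the raw bytes
with open('FryVjKMa.woff', 'wb') as f:
    f.write(font_response.content)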
The woff font encryption file cannot be read directly. To inspect its contents we need a third-party library, fontTools, which has to be installed separately:
pip install fonttools
Usage:
from fontTools.ttLib import TTFont

# Create the object; the parameter is the font encryption file
font_obj = TTFont('FryVjKMa.woff')
# Convert it to XML format
font_obj.saveXML('font.xml')
Note: the package name is "fonttools" when installing, but "fontTools" when importing.
After the conversion, searching the XML for "cmap" we find:
<cmap>
  <tableVersion version="0"/>
  <cmap_format_4 platformID="0" platEncID="3" language="0">
  </cmap_format_4>
  <cmap_format_0 platformID="1" platEncID="0" language="0">
  </cmap_format_0>
  <cmap_format_4 platformID="3" platEncID="1" language="0">
  </cmap_format_4>
  <cmap_format_12 platformID="3" platEncID="10" format="12" reserved="0" length="148" language="0" nGroups="11">
    <map code="0x188c0" name="eight"/><!-- TANGUT COMPONENT-193 -->
    <map code="0x188c2" name="one"/><!-- TANGUT COMPONENT-195 -->
    <map code="0x188c3" name="zero"/><!-- TANGUT COMPONENT-196 -->
    <map code="0x188c4" name="three"/><!-- TANGUT COMPONENT-197 -->
    <map code="0x188c5" name="period"/><!-- TANGUT COMPONENT-198 -->
    <map code="0x188c6" name="four"/><!-- TANGUT COMPONENT-199 -->
    <map code="0x188c7" name="two"/><!-- TANGUT COMPONENT-200 -->
    <map code="0x188c8" name="nine"/><!-- TANGUT COMPONENT-201 -->
    <map code="0x188c9" name="six"/><!-- TANGUT COMPONENT-202 -->
    <map code="0x188ca" name="five"/><!-- TANGUT COMPONENT-203 -->
    <map code="0x188cb" name="seven"/><!-- TANGUT COMPONENT-204 -->
  </cmap_format_12>
</cmap>
This is the font encryption's conversion rule; each map tag is one entry in the mapping table.
Reading these entries, we can guess: 0x188c0 corresponds to 8 and 0x188c2 corresponds to 1 (the 0x prefix indicates a hexadecimal number).
Converting them to decimal:
print(int(0x188c0))  # 100544
print(int(0x188c2))  # 100546
Compare this with the markup for a monthly ticket count found in the yuepiao/ packet:
&#100093;&#100095;&#100096;&#100100;</span></span>Monthly Ticket</p>
The numbers in these character references are very similar to the decimal values above. We can therefore conclude that 0x188c0 corresponds to 8; in other words, its decimal form 100544 corresponds to the digit 8.
Now that we understand the correspondence, how do we get the mapping table quickly?
from fontTools.ttLib import TTFont

# Create the object; the parameter is the font encryption file
font_obj = TTFont('FryVjKMa.woff')
# Convert it to XML format
font_obj.saveXML('font.xml')
# Get the mapping table from the map nodes
res_ = font_obj.getBestCmap()
print(res_)
'''
{100544: 'eight', 100546: 'one', 100547: 'zero', 100548: 'three', 100549: 'period',
 100550: 'four', 100551: 'two', 100552: 'nine', 100553: 'six', 100554: 'five', 100555: 'seven'}
'''
Looking at the result, the getBestCmap method automatically converts the hexadecimal code points to decimal and returns the mapping as a dictionary of {code point: digit name}.
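Since the dictionary values are English digit names rather than the digits themselves, one more conversion step is needed before decoding. A minimal sketch (the sample dictionary below is a subset of the getBestCmap() output shown above):

# Turn the digit names returned by getBestCmap() into a code point -> digit table
word_to_digit = {'zero': '0', 'one': '1', 'two': '2', 'three': '3', 'four': '4',
                 'five': '5', 'six': '6', 'seven': '7', 'eight': '8', 'nine': '9'}
cmap_ = {100544: 'eight', 100546: 'one', 100547: 'zero'}   # subset of getBestCmap() output
decode_table = {code: word_to_digit[name] for code, name in cmap_.items() if name in word_to_digit}
print(decode_table)   # {100544: '8', 100546: '1', 100547: '0'}

The full crawler below performs the same conversion with simple loops.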
When writing a crawler, downloading the font encryption file manually for every request is unrealistic, so the crawler has to download it with code.
Searching for "woff" in the response of the yuepiao/ packet, we find that the URL of the font encryption file is embedded in the response itself.
XPath: //p/span/style/text()
This returns 20 results because there are 20 books on the page; every book references the same woff file, so we only need to take one of them.
import json
import re
from fontTools.ttLib import TTFont
import requests
from lxml import etree

if __name__ == '__main__':
    # 1. Confirm the target url
    url_ = 'https://www.qidian.com/rank/yuepiao/'
    # 2. Construct the request headers
    headers_ = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36',
        'Cookie': 'e1=%7B%22pid%22%3A%22qd_P_rank_19%22%2C%22eid%22%3A%22%22%7D; e2=%7B%22pid%22%3A%22qd_P_rank_01%22%2C%22eid%22%3A%22qd_C45%22%2C%22l1%22%3A5%7D; e1=%7B%22pid%22%3A%22qd_P_rank_01%22%2C%22eid%22%3A%22qd_C45%22%2C%22l1%22%3A5%7D; e2=%7B%22pid%22%3A%22qd_p_qidian%22%2C%22eid%22%3A%22qd_A16%22%2C%22l1%22%3A3%7D; _csrfToken=FJAYOKmb5GpRuB6mdxwLXF1sDkKqgTL0z5gG7Ana; newstatisticUUID=1613732256_1917024121; _yep_uuid=adb684fd-87c1-4108-391c-f50ab9ac0d5c; _gid=GA1.2.180413774.1628410724; e1=%7B%22pid%22%3A%22qd_p_qidian%22%2C%22eid%22%3A%22qd_A16%22%2C%22l1%22%3A3%7D; e2=%7B%22pid%22%3A%22qd_p_qidian%22%2C%22eid%22%3A%22%22%7D; _ga_FZMMH98S83=GS1.1.1628410723.1.1.1628410744.0; _ga_PFYW0QLV3P=GS1.1.1628410723.1.1.1628410744.0; _ga=GA1.2.707336986.1628410723',
        'Referer': 'https://www.qidian.com/rank/'
    }
    # 3. Send the request and get the response
    response_ = requests.get(url_, headers=headers_)
    data_ = response_.text
    # # Save the response to check whether it contains the data we need
    # with open('qidian.html', 'w', encoding='utf-8') as f:
    #     f.write(data_)
    # 4. Parse the data: font encryption file, book titles and monthly ticket counts
    html_obj = etree.HTML(data_)
    # Get the book titles
    book_list = html_obj.xpath('//h4/a/text()')
    # Get the style text that references the font encryption file
    str_ = html_obj.xpath('//p/span/style/text()')[0]
    '''
    @font-face { font-family: khQtDpBC;
        src: url('https://qidian.gtimg.com/qd_anti_spider/khQtDpBC.eot?') format('eot');
        src: url('https://qidian.gtimg.com/qd_anti_spider/khQtDpBC.woff') format('woff'),
             url('https://qidian.gtimg.com/qd_anti_spider/khQtDpBC.ttf') format('truetype');
    }
    .khQtDpBC { font-family: 'khQtDpBC' !important; display: initial !important;
        color: inherit !important; vertical-align: initial !important; }
    '''
    # Extract the url of the font encryption file from the style text
    font_url = re.findall(r" format\('eot'\); src: url\('(.*?)'\) format\('woff'\)", str_)[0]
    # Request the font encryption file
    font_response = requests.get(font_url, headers=headers_)
    # Save the font encryption file
    with open('font.woff', 'wb') as f:
        f.write(font_response.content)
    # Parse the font encryption file
    font_obj = TTFont('font.woff')
    # Convert it to an xml file in plain text
    font_obj.saveXML('font.xml')
    # Get the mapping table from the map nodes (hexadecimal -> decimal)
    res_ = font_obj.getBestCmap()
    # Convert the English digit names in the mapping table to Arabic numerals
    dict_ = {'one': '1', 'two': '2', 'three': '3', 'four': '4', 'five': '5',
             'six': '6', 'seven': '7', 'eight': '8', 'nine': '9', 'zero': '0'}
    for i in res_:
        for j in dict_:
            if res_[i] == j:
                res_[i] = dict_[j]
    # Get the monthly ticket counts (rendered as 𘛽𘛽𘛿𘜀𘜄, stored as &#...; references)
    num_ = re.findall(r'</style><span class=".*?">(.*?)</span></span>Monthly Ticket</p>', data_)
    # Keep only the decimal code points, stripping the '&#' and ';'
    list_ = []
    for i in num_:
        list_.append(re.findall(r'\d+', i))
    # Replace each code point with its Arabic numeral
    for i in list_:
        for j in enumerate(i):
            for k in res_:
                if j[1] == str(k):
                    i[j[0]] = res_[k]
    # Join the digits, e.g. ['7', '6', '2', '1', '2'] -> '76212'
    for i, j in enumerate(list_):
        new = ''.join(j)
        list_[i] = new
    # 5. Save the book titles and their monthly ticket counts
    with open('Starting point Chinese website monthly list.json', 'a', encoding='utf-8') as f:
        for i in range(len(book_list)):
            book_dict = {}
            book_dict[book_list[i]] = list_[i]
            json_data = json.dumps(book_dict, ensure_ascii=False) + ',\n'
            f.write(json_data)
2. Page turning for this case
import json
import re
import time
from fontTools.ttLib import TTFont
import requests
from lxml import etree

if __name__ == '__main__':
    for page_ in range(1, 6):
        # 1. Confirm the target url for the current page
        url_ = f'https://www.qidian.com/rank/yuepiao/page{page_}'
        # 2. Construct the request headers
        headers_ = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36',
            'Cookie': 'e1=%7B%22pid%22%3A%22qd_P_rank_19%22%2C%22eid%22%3A%22%22%7D; e2=%7B%22pid%22%3A%22qd_P_rank_01%22%2C%22eid%22%3A%22qd_C45%22%2C%22l1%22%3A5%7D; e1=%7B%22pid%22%3A%22qd_P_rank_01%22%2C%22eid%22%3A%22qd_C45%22%2C%22l1%22%3A5%7D; e2=%7B%22pid%22%3A%22qd_p_qidian%22%2C%22eid%22%3A%22qd_A16%22%2C%22l1%22%3A3%7D; _csrfToken=FJAYOKmb5GpRuB6mdxwLXF1sDkKqgTL0z5gG7Ana; newstatisticUUID=1613732256_1917024121; _yep_uuid=adb684fd-87c1-4108-391c-f50ab9ac0d5c; _gid=GA1.2.180413774.1628410724; e1=%7B%22pid%22%3A%22qd_p_qidian%22%2C%22eid%22%3A%22qd_A16%22%2C%22l1%22%3A3%7D; e2=%7B%22pid%22%3A%22qd_p_qidian%22%2C%22eid%22%3A%22%22%7D; _ga_FZMMH98S83=GS1.1.1628410723.1.1.1628410744.0; _ga_PFYW0QLV3P=GS1.1.1628410723.1.1.1628410744.0; _ga=GA1.2.707336986.1628410723',
            'Referer': 'https://www.qidian.com/rank/'
        }
        # 3. Send the request and get the response
        response_ = requests.get(url_, headers=headers_)
        data_ = response_.text
        # # Save the response to check whether it contains the data we need
        # with open('qidian.html', 'w', encoding='utf-8') as f:
        #     f.write(data_)
        # 4. Parse the data: font encryption file, book titles and monthly ticket counts
        html_obj = etree.HTML(data_)
        # Get the book titles
        book_list = html_obj.xpath('//h4/a/text()')
        # Get the style text that references the font encryption file
        str_ = html_obj.xpath('//p/span/style/text()')[0]
        '''
        @font-face { font-family: khQtDpBC;
            src: url('https://qidian.gtimg.com/qd_anti_spider/khQtDpBC.eot?') format('eot');
            src: url('https://qidian.gtimg.com/qd_anti_spider/khQtDpBC.woff') format('woff'),
                 url('https://qidian.gtimg.com/qd_anti_spider/khQtDpBC.ttf') format('truetype');
        }
        .khQtDpBC { font-family: 'khQtDpBC' !important; display: initial !important;
            color: inherit !important; vertical-align: initial !important; }
        '''
        # Extract the url of the font encryption file from the style text
        font_url = re.findall(r" format\('eot'\); src: url\('(.*?)'\) format\('woff'\)", str_)[0]
        # Request the font encryption file
        font_response = requests.get(font_url, headers=headers_)
        # Save the font encryption file
        with open('font.woff', 'wb') as f:
            f.write(font_response.content)
        # Parse the font encryption file
        font_obj = TTFont('font.woff')
        # Convert it to an xml file in plain text
        font_obj.saveXML('font.xml')
        # Get the mapping table from the map nodes (hexadecimal -> decimal)
        res_ = font_obj.getBestCmap()
        # Convert the English digit names in the mapping table to Arabic numerals
        dict_ = {'one': '1', 'two': '2', 'three': '3', 'four': '4', 'five': '5',
                 'six': '6', 'seven': '7', 'eight': '8', 'nine': '9', 'zero': '0'}
        for i in res_:
            for j in dict_:
                if res_[i] == j:
                    res_[i] = dict_[j]
        # Get the monthly ticket counts (rendered as 𘛽𘛽𘛿𘜀𘜄, stored as &#...; references)
        num_ = re.findall(r'</style><span class=".*?">(.*?)</span></span>Monthly Ticket</p>', data_)
        # Keep only the decimal code points, stripping the '&#' and ';'
        list_ = []
        for i in num_:
            list_.append(re.findall(r'\d+', i))
        # Replace each code point with its Arabic numeral
        for i in list_:
            for j in enumerate(i):
                for k in res_:
                    if j[1] == str(k):
                        i[j[0]] = res_[k]
        # Join the digits, e.g. ['7', '6', '2', '1', '2'] -> '76212'
        for i, j in enumerate(list_):
            new = ''.join(j)
            list_[i] = new
        # 5. Save the book titles and their monthly ticket counts
        with open('Starting point Chinese website monthly list.json', 'a', encoding='utf-8') as f:
            for i in range(len(book_list)):
                book_dict = {}
                book_dict[book_list[i]] = list_[i]
                json_data = json.dumps(book_dict, ensure_ascii=False) + ',\n'
                f.write(json_data)
        # 6. Reduce the request frequency
        time.sleep(1)