1. Font anti-crawling
An introduction to font anti-crawling, using the Qidian Chinese website as a case study.
Requirement: from https://www.qidian.com/rank/yuepiao/, get the title and monthly ticket count of every book on the Qidian monthly ticket ranking list.
By capturing packets, we can see that the book titles and monthly ticket counts we need are embedded in the HTML, so we will use the etree module from lxml and extract the data with XPath.
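For readers who have not used lxml before, here is a minimal, self-contained sketch of the etree + XPath workflow used throughout this article (the HTML string below is purely illustrative; the real page is fetched with requests next):

from lxml import etree

# A tiny illustrative HTML fragment, not the real Qidian page
sample_html = '<html><div><h4><a>Example Book</a></h4></div></html>'
html_obj = etree.HTML(sample_html)
# xpath() returns a list of matching text nodes
print(html_obj.xpath('//h4/a/text()'))   # ['Example Book']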
import requests
from lxml import etree
from fake_useragent import FakeUserAgent

if __name__ == '__main__':
    # 1. Confirm the target url
    url_ = 'https://www.qidian.com/rank/yuepiao/'
    # 2. Construct the request headers
    headers_ = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36',
        'Cookie': 'e1=%7B%22pid%22%3A%22qd_P_rank_19%22%2C%22eid%22%3A%22%22%7D; e2=%7B%22pid%22%3A%22qd_P_rank_01%22%2C%22eid%22%3A%22qd_C45%22%2C%22l1%22%3A5%7D; e1=%7B%22pid%22%3A%22qd_P_rank_01%22%2C%22eid%22%3A%22qd_C45%22%2C%22l1%22%3A5%7D; e2=%7B%22pid%22%3A%22qd_p_qidian%22%2C%22eid%22%3A%22qd_A16%22%2C%22l1%22%3A3%7D; _csrfToken=FJAYOKmb5GpRuB6mdxwLXF1sDkKqgTL0z5gG7Ana; newstatisticUUID=1613732256_1917024121; _yep_uuid=adb684fd-87c1-4108-391c-f50ab9ac0d5c; _gid=GA1.2.180413774.1628410724; e1=%7B%22pid%22%3A%22qd_p_qidian%22%2C%22eid%22%3A%22qd_A16%22%2C%22l1%22%3A3%7D; e2=%7B%22pid%22%3A%22qd_p_qidian%22%2C%22eid%22%3A%22%22%7D; _ga_FZMMH98S83=GS1.1.1628410723.1.1.1628410744.0; _ga_PFYW0QLV3P=GS1.1.1628410723.1.1.1628410744.0; _ga=GA1.2.707336986.1628410723',
        'Referer': 'https://www.qidian.com/rank/'
    }
    # 3. Send the request and get the response
    response_ = requests.get(url_, headers=headers_)
    data_ = response_.text
    # Save the response so we can check whether it contains the data we need
    with open('qidian.html', 'w', encoding='utf-8') as f:
        f.write(data_)
Note that Qidian is a large website, so the request headers should be as complete as possible. Next, check whether the saved response contains the data we need.
After inspecting the saved HTML, the data we need is indeed in the response. The next step is to extract it. Since the data is HTML, the key is to work out and debug the XPath expressions before extracting. There are 20 books per page, so each extraction should return 20 results.
Book title XPath: //h4/a/text()
Monthly ticket count XPath: //span/span/text() or //span[@class="IuAmFihj"]/text()
Note: the second XPath works while debugging in the browser, but when we run the program (e.g. in PyCharm) it extracts nothing, because the class attribute of that span changes on every visit to the website.
import requests
from lxml import etree
from fake_useragent import FakeUserAgent

if __name__ == '__main__':
    # 1. Confirm the target url
    url_ = 'https://www.qidian.com/rank/yuepiao/'
    # 2. Construct the request headers
    headers_ = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36',
        'Cookie': 'e1=%7B%22pid%22%3A%22qd_P_rank_19%22%2C%22eid%22%3A%22%22%7D; e2=%7B%22pid%22%3A%22qd_P_rank_01%22%2C%22eid%22%3A%22qd_C45%22%2C%22l1%22%3A5%7D; e1=%7B%22pid%22%3A%22qd_P_rank_01%22%2C%22eid%22%3A%22qd_C45%22%2C%22l1%22%3A5%7D; e2=%7B%22pid%22%3A%22qd_p_qidian%22%2C%22eid%22%3A%22qd_A16%22%2C%22l1%22%3A3%7D; _csrfToken=FJAYOKmb5GpRuB6mdxwLXF1sDkKqgTL0z5gG7Ana; newstatisticUUID=1613732256_1917024121; _yep_uuid=adb684fd-87c1-4108-391c-f50ab9ac0d5c; _gid=GA1.2.180413774.1628410724; e1=%7B%22pid%22%3A%22qd_p_qidian%22%2C%22eid%22%3A%22qd_A16%22%2C%22l1%22%3A3%7D; e2=%7B%22pid%22%3A%22qd_p_qidian%22%2C%22eid%22%3A%22%22%7D; _ga_FZMMH98S83=GS1.1.1628410723.1.1.1628410744.0; _ga_PFYW0QLV3P=GS1.1.1628410723.1.1.1628410744.0; _ga=GA1.2.707336986.1628410723',
        'Referer': 'https://www.qidian.com/rank/'
    }
    # 3. Send the request and get the response
    response_ = requests.get(url_, headers=headers_)
    data_ = response_.text
    # # Save the response to check whether it contains the data we need
    # with open('qidian.html', 'w', encoding='utf-8') as f:
    #     f.write(data_)
    # 4. Parse the data to get the book titles and the monthly ticket counts
    html_obj = etree.HTML(data_)
    book_list = html_obj.xpath('//h4/a/text()')
    num_list = html_obj.xpath('//span/span/text()')
    print(book_list)
    print(num_list)
Following the normal workflow, we should now have the book titles and the monthly ticket counts, but printing the extracted data shows the following:
['Nomenclature of night', 'Unscientific Royal beast', 'I have a myth tree', "I just don't play cards according to the routine", 'From the red moon', 'My cloud girlfriend', 'Great dreamer', 'The other side of deep space', 'This man is too serious', 'Douluo continent V Rebirth of Tang San', 'Fairy fox', 'Dafeng Dageng man', 'Stargate', 'Terran garrison envoy', 'One Qiu eight in the northern mansion of the Eastern Jin Dynasty', 'I can only talk to S First class goddess in love', "I really don't want to see it bug", "Steady, don't wave", 'Sincere Sky Survey', 'Full time artist']
['𘛽𘛽𘛿𘜀𘜄', ...]  (20 strings of unreadable private-use glyphs, one per book)
The book titles display normally, but the monthly ticket counts are all garbled. This is what we call font anti-crawling.
What characterizes font anti-crawling here:
1. The class attribute value of the span changes on every visit to the website.
2. Without special handling, the real data cannot be obtained.
Analysis:
When we copy the monthly ticket count directly from the page, as a normal user would, we get something like:
(a string of unreadable glyphs)Monthly Ticket
So even a normal user copying the data by hand gets garbled glyphs instead of digits, which means an ordinary crawler has even less chance of getting the real numbers.
The markup corresponding to the monthly ticket count, found in the yuepiao/ packet:
𘛽𘛽𘛿𘜀𘜄</span></span>Monthly Ticket</p>
You can see that each digit has been replaced by a character like 𘛽; in the raw HTML source these characters appear as numeric character references of the form &#...;.
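As a quick sanity check (the code point 100544 is taken from the cmap table shown later and is used here purely as an illustration), Python's standard html module can turn such a numeric character reference back into the character it stands for:

import html

# '&#100544;' is an HTML numeric character reference; unescape() yields chr(100544)
ref = '&#100544;'
glyph = html.unescape(ref)
print(glyph)         # the private-use glyph the browser renders instead of a digit
print(ord(glyph))    # 100544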
In the browser's Network panel, under Font, there are three woff files. These files are what implement the font encryption.
Which one should we use?
Using the element-picker arrow in the top-left corner of DevTools, click the monthly ticket number on the page to jump to its tag. The class attribute on that element has the same name as one of the three woff files, so we can guess that this is the woff file being used.
Downloading the woff file:
1. Double-click it in the Network panel to download it manually.
2. Send a request with Python code and download it (a minimal sketch follows below).
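A minimal sketch of option 2, assuming the woff URL is already known (the file name FryVjKMa.woff and the URL below are only examples; the real name changes on every visit and is extracted from the page response later in this article):

import requests

# Example URL only; the real crawler extracts this URL from the page response
font_url = 'https://qidian.gtimg.com/qd_anti_spider/FryVjKMa.woff'
font_response = requests.get(font_url)
# woff is a binary format, so write the raw bytes
with open('FryVjKMa.woff', 'wb') as f:
    f.write(font_response.content)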
The woff font encryption file cannot be read directly. To inspect its contents we need a third-party library, fontTools, which has to be installed separately:
pip install fonttools
Usage:
from fontTools.ttLib import TTFont

# Create the object; the parameter is the font encryption file
font_obj = TTFont('FryVjKMa.woff')
# Convert it to XML format
font_obj.saveXML('font.xml')
Note: the package name is "fonttools" when installing, but "fontTools" when importing.
After the conversion, searching the XML for "cmap" we find:
<cmap>
  <tableVersion version="0"/>
  <cmap_format_4 platformID="0" platEncID="3" language="0">
  </cmap_format_4>
  <cmap_format_0 platformID="1" platEncID="0" language="0">
  </cmap_format_0>
  <cmap_format_4 platformID="3" platEncID="1" language="0">
  </cmap_format_4>
  <cmap_format_12 platformID="3" platEncID="10" format="12" reserved="0" length="148" language="0" nGroups="11">
    <map code="0x188c0" name="eight"/><!-- TANGUT COMPONENT-193 -->
    <map code="0x188c2" name="one"/><!-- TANGUT COMPONENT-195 -->
    <map code="0x188c3" name="zero"/><!-- TANGUT COMPONENT-196 -->
    <map code="0x188c4" name="three"/><!-- TANGUT COMPONENT-197 -->
    <map code="0x188c5" name="period"/><!-- TANGUT COMPONENT-198 -->
    <map code="0x188c6" name="four"/><!-- TANGUT COMPONENT-199 -->
    <map code="0x188c7" name="two"/><!-- TANGUT COMPONENT-200 -->
    <map code="0x188c8" name="nine"/><!-- TANGUT COMPONENT-201 -->
    <map code="0x188c9" name="six"/><!-- TANGUT COMPONENT-202 -->
    <map code="0x188ca" name="five"/><!-- TANGUT COMPONENT-203 -->
    <map code="0x188cb" name="seven"/><!-- TANGUT COMPONENT-204 -->
  </cmap_format_12>
</cmap>
This is the font encryption's conversion rule; each map tag is one entry in the mapping table.
Reading these entries, we can guess: 0x188c0 corresponds to 8 and 0x188c2 corresponds to 1 (the 0x prefix indicates a hexadecimal number).
Converting them to decimal:
print(int(0x188c0))  # 100544
print(int(0x188c2))  # 100546
Compare this with the markup for a monthly ticket count found in the yuepiao/ packet:
&#100093;&#100095;&#100096;&#100100;</span></span>Monthly Ticket</p>
The numbers in these character references are very similar to the decimal values above. We can therefore conclude that 0x188c0 corresponds to 8; in other words, its decimal form 100544 corresponds to the digit 8.
Now that we understand the correspondence, how do we get the mapping table quickly?
from fontTools.ttLib import TTFont

# Create the object; the parameter is the font encryption file
font_obj = TTFont('FryVjKMa.woff')
# Convert it to XML format
font_obj.saveXML('font.xml')
# Get the mapping table from the map nodes
res_ = font_obj.getBestCmap()
print(res_)
'''
{100544: 'eight', 100546: 'one', 100547: 'zero', 100548: 'three', 100549: 'period',
 100550: 'four', 100551: 'two', 100552: 'nine', 100553: 'six', 100554: 'five', 100555: 'seven'}
'''
Looking at the result, the getBestCmap method automatically converts the hexadecimal code points to decimal and returns the mapping as a dictionary of {code point: digit name}.
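Since the dictionary values are English digit names rather than the digits themselves, one more conversion step is needed before decoding. A minimal sketch (the sample dictionary below is a subset of the getBestCmap() output shown above):

# Turn the digit names returned by getBestCmap() into a code point -> digit table
word_to_digit = {'zero': '0', 'one': '1', 'two': '2', 'three': '3', 'four': '4',
                 'five': '5', 'six': '6', 'seven': '7', 'eight': '8', 'nine': '9'}
cmap_ = {100544: 'eight', 100546: 'one', 100547: 'zero'}   # subset of getBestCmap() output
decode_table = {code: word_to_digit[name] for code, name in cmap_.items() if name in word_to_digit}
print(decode_table)   # {100544: '8', 100546: '1', 100547: '0'}

The full crawler below performs the same conversion with simple loops.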
When writing a crawler, downloading the font encryption file manually for every request is unrealistic, so the crawler has to download it with code.
Searching for "woff" in the response of the yuepiao/ packet, we find that the URL of the font encryption file is embedded in the response itself.
XPath: //p/span/style/text()
This returns 20 results because there are 20 books on the page; every book references the same woff file, so we only need to take one of them.
import json
import re
from fontTools.ttLib import TTFont
import requests
from lxml import etree

if __name__ == '__main__':
    # 1. Confirm the target url
    url_ = 'https://www.qidian.com/rank/yuepiao/'
    # 2. Construct the request headers
    headers_ = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36',
        'Cookie': 'e1=%7B%22pid%22%3A%22qd_P_rank_19%22%2C%22eid%22%3A%22%22%7D; e2=%7B%22pid%22%3A%22qd_P_rank_01%22%2C%22eid%22%3A%22qd_C45%22%2C%22l1%22%3A5%7D; e1=%7B%22pid%22%3A%22qd_P_rank_01%22%2C%22eid%22%3A%22qd_C45%22%2C%22l1%22%3A5%7D; e2=%7B%22pid%22%3A%22qd_p_qidian%22%2C%22eid%22%3A%22qd_A16%22%2C%22l1%22%3A3%7D; _csrfToken=FJAYOKmb5GpRuB6mdxwLXF1sDkKqgTL0z5gG7Ana; newstatisticUUID=1613732256_1917024121; _yep_uuid=adb684fd-87c1-4108-391c-f50ab9ac0d5c; _gid=GA1.2.180413774.1628410724; e1=%7B%22pid%22%3A%22qd_p_qidian%22%2C%22eid%22%3A%22qd_A16%22%2C%22l1%22%3A3%7D; e2=%7B%22pid%22%3A%22qd_p_qidian%22%2C%22eid%22%3A%22%22%7D; _ga_FZMMH98S83=GS1.1.1628410723.1.1.1628410744.0; _ga_PFYW0QLV3P=GS1.1.1628410723.1.1.1628410744.0; _ga=GA1.2.707336986.1628410723',
        'Referer': 'https://www.qidian.com/rank/'
    }
    # 3. Send the request and get the response
    response_ = requests.get(url_, headers=headers_)
    data_ = response_.text
    # # Save the response to check whether it contains the data we need
    # with open('qidian.html', 'w', encoding='utf-8') as f:
    #     f.write(data_)
    # 4. Parse the data: font encryption file, book titles and monthly ticket counts
    html_obj = etree.HTML(data_)
    # Get the book titles
    book_list = html_obj.xpath('//h4/a/text()')
    # Get the style text that references the font encryption file
    str_ = html_obj.xpath('//p/span/style/text()')[0]
    '''
    @font-face { font-family: khQtDpBC;
        src: url('https://qidian.gtimg.com/qd_anti_spider/khQtDpBC.eot?') format('eot');
        src: url('https://qidian.gtimg.com/qd_anti_spider/khQtDpBC.woff') format('woff'),
             url('https://qidian.gtimg.com/qd_anti_spider/khQtDpBC.ttf') format('truetype');
    }
    .khQtDpBC { font-family: 'khQtDpBC' !important; display: initial !important;
        color: inherit !important; vertical-align: initial !important; }
    '''
    # Extract the url of the font encryption file from the style text
    font_url = re.findall(r" format\('eot'\); src: url\('(.*?)'\) format\('woff'\)", str_)[0]
    # Request the font encryption file
    font_response = requests.get(font_url, headers=headers_)
    # Save the font encryption file
    with open('font.woff', 'wb') as f:
        f.write(font_response.content)
    # Parse the font encryption file
    font_obj = TTFont('font.woff')
    # Convert it to an xml file in plain text
    font_obj.saveXML('font.xml')
    # Get the mapping table from the map nodes (hexadecimal -> decimal)
    res_ = font_obj.getBestCmap()
    # Convert the English digit names in the mapping table to Arabic numerals
    dict_ = {'one': '1', 'two': '2', 'three': '3', 'four': '4', 'five': '5',
             'six': '6', 'seven': '7', 'eight': '8', 'nine': '9', 'zero': '0'}
    for i in res_:
        for j in dict_:
            if res_[i] == j:
                res_[i] = dict_[j]
    # Get the monthly ticket counts (rendered as 𘛽𘛽𘛿𘜀𘜄, stored as &#...; references)
    num_ = re.findall(r'</style><span class=".*?">(.*?)</span></span>Monthly Ticket</p>', data_)
    # Keep only the decimal code points, stripping the '&#' and ';'
    list_ = []
    for i in num_:
        list_.append(re.findall(r'\d+', i))
    # Replace each code point with its Arabic numeral
    for i in list_:
        for j in enumerate(i):
            for k in res_:
                if j[1] == str(k):
                    i[j[0]] = res_[k]
    # Join the digits, e.g. ['7', '6', '2', '1', '2'] -> '76212'
    for i, j in enumerate(list_):
        new = ''.join(j)
        list_[i] = new
    # 5. Save the book titles and their monthly ticket counts
    with open('Starting point Chinese website monthly list.json', 'a', encoding='utf-8') as f:
        for i in range(len(book_list)):
            book_dict = {}
            book_dict[book_list[i]] = list_[i]
            json_data = json.dumps(book_dict, ensure_ascii=False) + ',\n'
            f.write(json_data)
2. Page turning for this case
import json
import re
import time
from fontTools.ttLib import TTFont
import requests
from lxml import etree

if __name__ == '__main__':
    for page_ in range(1, 6):
        # 1. Confirm the target url for the current page
        url_ = f'https://www.qidian.com/rank/yuepiao/page{page_}'
        # 2. Construct the request headers
        headers_ = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36',
            'Cookie': 'e1=%7B%22pid%22%3A%22qd_P_rank_19%22%2C%22eid%22%3A%22%22%7D; e2=%7B%22pid%22%3A%22qd_P_rank_01%22%2C%22eid%22%3A%22qd_C45%22%2C%22l1%22%3A5%7D; e1=%7B%22pid%22%3A%22qd_P_rank_01%22%2C%22eid%22%3A%22qd_C45%22%2C%22l1%22%3A5%7D; e2=%7B%22pid%22%3A%22qd_p_qidian%22%2C%22eid%22%3A%22qd_A16%22%2C%22l1%22%3A3%7D; _csrfToken=FJAYOKmb5GpRuB6mdxwLXF1sDkKqgTL0z5gG7Ana; newstatisticUUID=1613732256_1917024121; _yep_uuid=adb684fd-87c1-4108-391c-f50ab9ac0d5c; _gid=GA1.2.180413774.1628410724; e1=%7B%22pid%22%3A%22qd_p_qidian%22%2C%22eid%22%3A%22qd_A16%22%2C%22l1%22%3A3%7D; e2=%7B%22pid%22%3A%22qd_p_qidian%22%2C%22eid%22%3A%22%22%7D; _ga_FZMMH98S83=GS1.1.1628410723.1.1.1628410744.0; _ga_PFYW0QLV3P=GS1.1.1628410723.1.1.1628410744.0; _ga=GA1.2.707336986.1628410723',
            'Referer': 'https://www.qidian.com/rank/'
        }
        # 3. Send the request and get the response
        response_ = requests.get(url_, headers=headers_)
        data_ = response_.text
        # # Save the response to check whether it contains the data we need
        # with open('qidian.html', 'w', encoding='utf-8') as f:
        #     f.write(data_)
        # 4. Parse the data: font encryption file, book titles and monthly ticket counts
        html_obj = etree.HTML(data_)
        # Get the book titles
        book_list = html_obj.xpath('//h4/a/text()')
        # Get the style text that references the font encryption file
        str_ = html_obj.xpath('//p/span/style/text()')[0]
        '''
        @font-face { font-family: khQtDpBC;
            src: url('https://qidian.gtimg.com/qd_anti_spider/khQtDpBC.eot?') format('eot');
            src: url('https://qidian.gtimg.com/qd_anti_spider/khQtDpBC.woff') format('woff'),
                 url('https://qidian.gtimg.com/qd_anti_spider/khQtDpBC.ttf') format('truetype');
        }
        .khQtDpBC { font-family: 'khQtDpBC' !important; display: initial !important;
            color: inherit !important; vertical-align: initial !important; }
        '''
        # Extract the url of the font encryption file from the style text
        font_url = re.findall(r" format\('eot'\); src: url\('(.*?)'\) format\('woff'\)", str_)[0]
        # Request the font encryption file
        font_response = requests.get(font_url, headers=headers_)
        # Save the font encryption file
        with open('font.woff', 'wb') as f:
            f.write(font_response.content)
        # Parse the font encryption file
        font_obj = TTFont('font.woff')
        # Convert it to an xml file in plain text
        font_obj.saveXML('font.xml')
        # Get the mapping table from the map nodes (hexadecimal -> decimal)
        res_ = font_obj.getBestCmap()
        # Convert the English digit names in the mapping table to Arabic numerals
        dict_ = {'one': '1', 'two': '2', 'three': '3', 'four': '4', 'five': '5',
                 'six': '6', 'seven': '7', 'eight': '8', 'nine': '9', 'zero': '0'}
        for i in res_:
            for j in dict_:
                if res_[i] == j:
                    res_[i] = dict_[j]
        # Get the monthly ticket counts (rendered as 𘛽𘛽𘛿𘜀𘜄, stored as &#...; references)
        num_ = re.findall(r'</style><span class=".*?">(.*?)</span></span>Monthly Ticket</p>', data_)
        # Keep only the decimal code points, stripping the '&#' and ';'
        list_ = []
        for i in num_:
            list_.append(re.findall(r'\d+', i))
        # Replace each code point with its Arabic numeral
        for i in list_:
            for j in enumerate(i):
                for k in res_:
                    if j[1] == str(k):
                        i[j[0]] = res_[k]
        # Join the digits, e.g. ['7', '6', '2', '1', '2'] -> '76212'
        for i, j in enumerate(list_):
            new = ''.join(j)
            list_[i] = new
        # 5. Save the book titles and their monthly ticket counts
        with open('Starting point Chinese website monthly list.json', 'a', encoding='utf-8') as f:
            for i in range(len(book_list)):
                book_dict = {}
                book_dict[book_list[i]] = list_[i]
                json_data = json.dumps(book_dict, ensure_ascii=False) + ',\n'
                f.write(json_data)
        # 6. Reduce the request frequency
        time.sleep(1)