ECommerceCrawlers/TouTiao details
1. Code overview
Crawler function
Search Toutiao for a specified keyword and store all the articles in the search results in CSV format.
Code location
Location in the project: ECommerceCrawlers/TouTiao
Location in gitee: https://gitee.com/AJay13/ECommerceCrawlers/tree/master/TouTiao
Folder structure
├─TouTiao
│  ├─pictures
│  │  ├─JosephNest.jpg
│  │  └─mortaltiger.jpg
│  ├─README.md
│  └─toutiao.py
TouTiao: the top-level directory of the Toutiao crawler
pictures: image directory referenced by README.md; the two images in it are WeChat QR codes
README.md: description document for the Toutiao crawler
toutiao.py: the crawler code (the key file)
2. Code explanation
Importing libraries
import requests
requests: a third-party Python library; a very easy-to-use HTTP request library
import time
time: Python's built-in time library
from selenium import webdriver
selenium: a library for driving a browser, commonly used for web automated testing
import csv
csv: as the name suggests, used to read and write CSV files
import pandas as pd
pandas: a very powerful data analysis library built on NumPy; in crawlers it is often used to hold and save data
from urllib.parse import quote
urllib.parse.quote: the quote function URL-encodes a string so it can be spliced into a URL
from fake_useragent import UserAgent
fake_useragent: the fake_useragent library is used to forge (disguise) the UA
Evaluation and improvement
The author's imports are confusing. Why? First, he imports both pandas and csv at the same time. He obviously wants to store the crawled data as CSV, but just as obviously he cannot use either library fluently. In fact, to write data to CSV you can build a DataFrame with pandas.DataFrame() and then call the DataFrame's to_csv() method. He also imports both selenium and requests at the same time, which I really do not understand. Finally, from urllib.parse import quote could be replaced by the more comprehensive from urllib.parse import urlencode.
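To make that concrete, here is a minimal sketch of writing crawled data to CSV with pandas alone; the record fields and the output file name are made up for illustration:

# A minimal sketch of saving crawled records with pandas only (no csv module);
# the fields and file name below are assumptions, not taken from toutiao.py.
import pandas as pd

rows = [
    {"title": "article 1", "url": "https://www.toutiao.com/a1"},
    {"title": "article 2", "url": "https://www.toutiao.com/a2"},
]
df = pd.DataFrame(rows)
# to_csv() writes the whole table in one call; utf-8-sig keeps Excel happy with Chinese text
df.to_csv("toutiao_articles.csv", index=False, encoding="utf-8-sig")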
Crawler code analysis
base_url stores the base URL of the website to be crawled; it is generally used for splicing query parameters onto.
base_url = 'https://www.toutiao.com/api/search/content/'
timestamp is generally spliced into the URL as one of its parameters. It is multiplied by 1000 because the page's requests are normally sent from JavaScript, where Date.now() returns a 13-digit millisecond timestamp, while time.time() in Python returns a 10-digit second timestamp.
timestamp = int(time.time()*1000)
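To show how the base URL, the timestamp, and the quote/urlencode functions discussed above fit together, here is a sketch of splicing a search URL; the query parameter names and the keyword are assumptions for illustration, not taken from toutiao.py:

# A sketch of URL splicing; parameter names are assumed, not quoted from the project.
from urllib.parse import quote, urlencode
import time

base_url = 'https://www.toutiao.com/api/search/content/'
timestamp = int(time.time() * 1000)
keyword = '台风'  # hypothetical search keyword (typhoon)

# with quote, the keyword is encoded by hand and the string is spliced manually
url_with_quote = f'{base_url}?keyword={quote(keyword)}&timestamp={timestamp}'

# with urlencode, the whole parameter dict is encoded in one call
params = {'keyword': keyword, 'offset': 0, 'count': 20, 'timestamp': timestamp}
url_with_urlencode = f'{base_url}?{urlencode(params)}'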
ua is an instance of fake_useragent.UserAgent() and is generally used to forge the UA in the request headers. The common way to use it is fake_useragent.UserAgent().random; verify_ssl=False means SSL certificate verification is skipped.
ua = UserAgent(verify_ssl=False)
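As an illustration of how the forged UA is normally used, here is a sketch of a request that carries ua.random in its headers; this usage is an assumption, not code quoted from toutiao.py:

# A sketch of using the forged UA in a request; assumed usage, not from toutiao.py.
from fake_useragent import UserAgent
import requests

ua = UserAgent(verify_ssl=False)
headers = {'User-Agent': ua.random}  # a random, forged User-Agent string
response = requests.get('https://www.toutiao.com/', headers=headers, timeout=10)
print(response.status_code)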
Initialize an empty list article_url_list, which is used to store article links
article_url_list = []
Honestly, I do not understand what this line is supposed to do
csv_name = pd.read_csv("typhoon_toutiao.csv")
The following lines of code set up a proxy IP. How exactly to write this kind of code depends on the documentation of each proxy IP provider; you will figure it out naturally when you actually need to use a proxy.
page_urls = ["http://dev.kdlapi.com/testproxy", "https://dev.kdlapi.com/testproxy",] tunnel_host = "tps189.kdlapi.com" tunnel_port = "15818" tid = "xxx" password = "xxx" proxies = { "http": "http://%s:%s@%s:%s/" % (tid, password, tunnel_host, tunnel_port), "https": "https://%s:%s@%s:%s/" % (tid, password, tunnel_host, tunnel_port) }
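As a sketch of how such a proxies dict is normally used, the test URLs above can be requested through the tunnel proxy like this (tid and password still have to come from your own proxy provider account); this is assumed usage, not code quoted from toutiao.py:

# A sketch of sending requests through the tunnel proxy defined above.
import requests

for url in page_urls:
    response = requests.get(url, proxies=proxies, timeout=10)
    print(url, response.status_code)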
Initialize an empty list constract_list, which is used for de-duplication
constract_list = []
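Here is a minimal sketch of how such a list is typically used for de-duplication; apart from constract_list and article_url_list, the URLs below are made up and the pattern is an assumption, not code quoted from toutiao.py:

# An assumed de-duplication pattern using constract_list.
constract_list = []
article_url_list = []

for article_url in ['https://www.toutiao.com/a1',
                    'https://www.toutiao.com/a1',
                    'https://www.toutiao.com/a2']:
    if article_url not in constract_list:  # skip URLs we have already seen
        constract_list.append(article_url)
        article_url_list.append(article_url)

print(article_url_list)  # each URL appears only once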
This week I also took on two other projects and I am pressed for time, so the analysis stops here for now.