ECommerceCrawlers/TouTiao (code analysis part I)

Posted by sheriff on Sun, 07 Nov 2021 03:42:37 +0100

ECommerceCrawlers/TouTiao details

1. Code overview

Crawler function

Search Toutiao for a specified keyword and store every article in the search results in CSV format.

Code location

Location in the project: ECommerceCrawlers/TouTiao

Location in gitee:

Folder structure

│  ├─pictures
│  │  ├─JosephNest.jpg
│  │  └─mortaltiger.jpg
│  ├─
│  └─

TouTiao: the top-level directory of the Toutiao crawler.

pictures: the image directory, used for reference. The two pictures in this directory are WeChat QR codes. The remaining entries are the description document and the crawler code of the Toutiao crawler (the key part).

2. Code explanation

Imported libraries

import requests

requests library: a third-party Python library and a very easy-to-use HTTP request library.

import time

time library: part of Python's standard library.

from selenium import webdriver

selenium library: used for automated web testing and browser automation.

import csv

csv library: as the name suggests, it is used to read and write CSV files.

import pandas as pd

pandas library: a very powerful data analysis library built on NumPy. Crawlers often use it to load and save data.

from urllib.parse import quote

from urllib.parse import quote: the quote function percent-encodes a string so it can be safely spliced into a URL.
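As a quick illustration (not from the original code; the keyword and URL below are hypothetical), quote encodes a single value, while urlencode builds a whole query string from a dict:

```python
from urllib.parse import quote, urlencode

# quote percent-encodes one value so it can be spliced into a URL
keyword = "台风"  # hypothetical search keyword ("typhoon")
url = "https://www.toutiao.com/search/?keyword=" + quote(keyword)

# urlencode builds the entire query string from a dict in one call
params = urlencode({"keyword": keyword, "page": 1})
```

This is why urlencode is often the more convenient choice when a request has several parameters.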

from fake_useragent import UserAgent

from fake_useragent import UserAgent: the fake_useragent library is used to forge the User-Agent (UA).

Evaluation and improvement

The author's imports are quite confusing. Why? First, he imports both pandas and the csv library at the same time. He obviously wants to store the crawled data as CSV, but he evidently cannot use either library skillfully. In fact, to write data to CSV, we can build a pandas.DataFrame() and then call the DataFrame's to_csv() method. He also imports both selenium and requests at the same time; I really don't understand that choice. Finally, from urllib.parse import quote could be replaced by the more comprehensive from urllib.parse import urlencode.
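To make the pandas-only approach concrete, here is a minimal sketch (the records and file name are made up for illustration):

```python
import pandas as pd

# Hypothetical scraped records, not the author's real data
rows = [
    {"title": "article one", "url": "https://example.com/1"},
    {"title": "article two", "url": "https://example.com/2"},
]

# One DataFrame plus one to_csv call replaces mixing pandas with the csv module
df = pd.DataFrame(rows)
df.to_csv("articles.csv", index=False, encoding="utf-8")
```

index=False drops the row-number column that to_csv writes by default.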

Crawler code analysis

base_url stores the base URL of the website to be crawled; it is generally used as the prefix when splicing the full request URL.

base_url = ''

A timestamp, generally spliced into the URL as a query parameter. It is multiplied by 1000 because web requests are usually sent from JS, where timestamps are 13-digit milliseconds, while Python's time.time() returns a 10-digit value in seconds.

timestamp = int(time.time()*1000)
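The 10-digit versus 13-digit difference can be seen directly:

```python
import time

seconds = int(time.time())          # 10-digit Unix timestamp (seconds)
millis = int(time.time() * 1000)    # 13-digit timestamp, like JS Date.now()

print(len(str(seconds)), len(str(millis)))
```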

ua is an instance of fake_useragent.UserAgent(), generally used to forge the UA in the request headers; the common usage is fake_useragent.UserAgent().random. verify_ssl=False means SSL verification is skipped.

ua = UserAgent(verify_ssl=False)
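If fake_useragent is unavailable, a stdlib sketch that picks a random UA from a small hand-maintained pool achieves the same disguise. The UA strings and the random_headers helper below are illustrative, not part of the original code:

```python
import random

# A small hand-maintained pool of User-Agent strings (illustrative values)
UA_POOL = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]

def random_headers():
    # Mimics fake_useragent.UserAgent().random: pick one UA at random
    return {"User-Agent": random.choice(UA_POOL)}

headers = random_headers()
```

The resulting dict can then be passed as the headers argument of a requests call.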

Initialize empty list article_url_list, which is used to store article links

article_url_list = []

Honestly, I don't understand what this line is for.

csv_name = pd.read_csv("typhoon_toutiao.csv")

The following lines of code set up a proxy IP. The exact way to write this depends on each proxy provider's documentation; you will figure it out naturally when you need to use a proxy IP.

page_urls = [""]
tunnel_host = ""
tunnel_port = "15818"
tid = "xxx"
password = "xxx"
proxies = {
    "http": "http://%s:%s@%s:%s/" % (tid, password, tunnel_host, tunnel_port),
    "https": "https://%s:%s@%s:%s/" % (tid, password, tunnel_host, tunnel_port),
}
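With placeholder credentials filled in, the user:password@host:port format expands like this (the host below is a hypothetical example; real values come from the proxy provider):

```python
tid, password = "xxx", "xxx"
tunnel_host, tunnel_port = "tps.example.com", "15818"  # hypothetical host

# Standard authenticated-proxy URL format: scheme://user:pass@host:port/
proxies = {
    "http": "http://%s:%s@%s:%s/" % (tid, password, tunnel_host, tunnel_port),
    "https": "https://%s:%s@%s:%s/" % (tid, password, tunnel_host, tunnel_port),
}
# The dict is then passed to requests, e.g. requests.get(url, proxies=proxies)
```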

Initialize constract_list, which is used for deduplication.

constract_list = []
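Presumably the list is used like this: check membership before appending, so each article URL is processed only once. This is a sketch, not the author's exact code, and the URLs are made up:

```python
constract_list = []

# Hypothetical stream of scraped article URLs, containing a duplicate
scraped = [
    "https://toutiao.com/a1",
    "https://toutiao.com/a2",
    "https://toutiao.com/a1",  # duplicate, should be skipped
]

for url in scraped:
    if url not in constract_list:  # deduplication check
        constract_list.append(url)
```

For large crawls, a set would give O(1) membership tests instead of the list's O(n) scan.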

I also took on two projects this week and am pressed for time, so the analysis pauses here for now.

Topics: Python crawler Data Mining