Python crawler: a reusable Requests toolkit

Posted by vchris on Sat, 19 Feb 2022 07:45:48 +0100

SuperSpider

==A 10,000-word article; it is recommended to navigate via the table of contents, which makes development more efficient. Likes and bookmarks are appreciated==

Request capture steps

[1]First determine whether the website loads its data dynamically
[2]Look for the URL pattern
[3]Write the regular expression | xpath expression
[4]Define the program framework, then complete and test the code
  • Key details: check the page charset, and whether verify=False is needed on the request

Idea of multi-level page data capture

[1]Overall idea
    1.1> Crawl the first-level page, extract the required data + links, and follow them
    1.2> Crawl the second-level pages, extract the required data + links, and follow them
    1.3> ... ... 

[2]Code implementation ideas
    2.1> Avoid duplicate code - define reusable functions for requesting and parsing (see the sketch below)

UserAgent anti crawling processing

[1]Anti-crawling based on User-Agent
	1.1) Send requests carrying a request header: headers={'User-Agent' : 'Mozilla/5.0 xxxxxx'}
	1.2) Switch the User-Agent randomly when making many requests
        a) Keep a large number of User-Agents in a .py file and pick one with random.choice() on every request (see the sketch below)
        b) Use the fake_useragent module to generate a random User-Agent on every visit
           from fake_useragent import UserAgent
           agent = UserAgent().random
   Note: install fake-useragent first (e.g. via pip or PyCharm)
        
[2]The response content contains special characters
	Use the ignore parameter when decoding
    html = requests.get(url=url, headers=headers).content.decode('utf-8', 'ignore')

Cookie anti crawl

Cookie parameter usage

  1. Form of the cookies parameter: dictionary

    cookies = {"cookie name": "cookie value"}

    • The dictionary corresponds to the Cookie string in the request header; in that string, each key-value pair is separated from the next by a semicolon and a space
    • The left side of each equals sign is the cookie name, which corresponds to the key of the cookies dictionary
    • The right side of each equals sign corresponds to the value of the cookies dictionary
  2. How to use the cookies parameter

    response = requests.get(url, cookies=cookies)

  3. Dictionary required to convert cookie strings to cookie parameters:

    cookies_dict = {cookie.split('=')[0]:cookie.split('=')[-1] for cookie in cookies_str.split('; ')}

  4. Note: cookies usually have an expiration time; once they expire, they need to be obtained again (a combined sketch of points 1-3 follows)

Convert CookieJar object to cookie dictionary

The response object returned by requests has a cookies attribute. Its value is a CookieJar object, which contains the cookies that the remote server set locally. How can we convert it into a cookie dictionary?

  1. Conversion method

    cookies_dict = requests.utils.dict_from_cookiejar(response.cookies)

  2. Here response.cookies returns a CookieJar object

  3. The requests.utils.dict_from_cookiejar function returns the cookie dictionary

Summary of requests module parameters

[1]Method 1: requests.get()
[2]Parameters
   2.1) url
   2.2) headers
   2.3) timeout
   2.4) proxies

[3]Method 2: requests.post()
[4]Parameters
    data

requests.get()

  • Key parameters (a combined example follows the list)

    [1]url
    [2]proxies -> {}
         proxies = {
            'http':'http://1.1.1.1:8888',
    	    'https':'https://1.1.1.1:8888'
         }
    [3]timeout
    [4]headers
    [5]cookies
    

requests.post()

  • Applicable scenario

    [1]Applicable scenario: websites that are requested via POST
    
    [2]Parameter: data={}
       2.1) Form data: dictionary
       2.2) res = requests.post(url=url,data=data,headers=headers)
      
    [3]POST request characteristic: data is submitted through a form
        
        data: dictionary of form data (see the sketch below)
    
  • Converting copied headers and form data to dictionaries with a PyCharm regex

    [1]In PyCharm press Ctrl + r and select Regex
    [2]Process the copied headers and form data with
        (.*): (.*)
        "$1": "$2",
    [3]Click Replace All
    
  • Classic demo: Youdao Translate

requests.session()

  • The Session class in the requests module automatically handles the cookies generated while sending requests and receiving responses, so it can maintain state across requests. Let's learn how to use it.

Role and application scenario

  • Role of requests.session
    • Automatically handles cookies, i.e. the next request carries the cookies from the previous one
  • Application scenario of requests.session
    • Automatically handling the cookies generated across a sequence of requests

usage method

After a session instance requests a website, the cookies set locally by the remote server are saved in the session. The next request made through the same session to that server will carry those cookies (see the sketch below the snippet).

session = requests.session() # Instantiate session object
response = session.get(url, headers, ...)
response = session.post(url, data, ...)
  • The parameters of the get or post request sent by the session object are exactly the same as those sent by the requests module

response

The difference between response.text and response.content:

  • response.text
    • Type: str
    • Decoding: the requests module infers the response encoding from the HTTP headers and decodes the text with it
  • response.content
    • Type: bytes
    • Decoding: none (raw bytes); decode it yourself with .decode() (see the example below)

Grabbing dynamically loaded (Ajax) data

  • Characteristics

    [1]Right click -> View source: the data is not in the page source code
    [2]Data loads when the mouse wheel scrolls or on some other action, or only part of the page refreshes
    
  • How to grab it

    [1]Press F12 to open the console, perform the page action, and capture the network packets
    [2]Find the URL address of the json data
       2.1) In the console, XHR: asynchronously loaded packets
       2.2) XHR -> Query String Parameters (query parameters)
    
  • Classic demo: Douban movies

json parsing module

json.loads(json)

[1]Purpose: convert a JSON-format string into a Python data type

[2]Example: html = json.loads(res.text)

json.dump(python,f,ensure_ascii=False)

[1]Purpose
   Convert a Python data type into a JSON-format string; generally used when saving the captured data to a json file

[2]Parameter description
   2.1) 1st parameter: the Python data (dictionary, list, etc.)
   2.2) 2nd parameter: the file object
   2.3) 3rd parameter: ensure_ascii=False controls the encoding during serialization
  
[3]Sample code
    # Example 1
    import json

    item = {'name':'QQ','app_id':1}
    with open('millet.json','a') as f:
      json.dump(item,f,ensure_ascii=False)
  
    # Example 2
    import json

    item_list = []
    for i in range(3):
      item = {'name':'QQ','id':i}
      item_list.append(item)

    with open('xiaomi.json','a') as f:
        json.dump(item_list,f,ensure_ascii=False)

jsonpath

jsonpath usage example

book_dict = { 
  "store": {
    "book": [ 
      { "category": "reference",
        "author": "Nigel Rees",
        "title": "Sayings of the Century",
        "price": 8.95
      },
      { "category": "fiction",
        "author": "Evelyn Waugh",
        "title": "Sword of Honour",
        "price": 12.99
      },
      { "category": "fiction",
        "author": "Herman Melville",
        "title": "Moby Dick",
        "isbn": "0-553-21311-3",
        "price": 8.99
      },
      { "category": "fiction",
        "author": "J. R. R. Tolkien",
        "title": "The Lord of the Rings",
        "isbn": "0-395-19395-8",
        "price": 22.99
      }
    ],
    "bicycle": {
      "color": "red",
      "price": 19.95
    }
  }
}

from jsonpath import jsonpath

print(jsonpath(book_dict, '$..author'))  # Returns a list of all author values, or False if nothing matches

json module summary

# Most commonly used in crawlers
[1]Data capture - json.loads(html)
    Convert the response content from json into Python data
[2]Data saving - json.dump(item_list,f,ensure_ascii=False)
    Save the captured data to a local json file

# Common ways of storing captured data
[1]txt file
[2]csv file
[3]json file
[4]MySQL database
[5]MongoDB database
[6]Redis database

Console packet capture

  • How to open it and common options

    [1]Open the browser, press F12 to open the console, and find the Network tab
    
    [2]Common console options
       2.1) Network: capture network packets
         a> ALL: capture all network packets
         b> XHR: capture asynchronously loaded network packets
         c> JS : capture all JS files
       2.2) Sources: formatted output and breakpoint debugging of JavaScript code, useful for analyzing certain parameters in a crawler
       2.3) Console: interactive mode where JavaScript code can be tested
        
    [3]After capturing a specific network packet
       3.1) Click the packet address on the left to open the packet details, then look at the right side
       3.2) Right side:
         a> Headers: the entire request information
            General, Response Headers, Request Headers, Query String, Form Data
         b> Preview: preview of the response content
         c> Response: the response content
    

Proxy settings

Definition and classification

By degree of anonymity, proxy IPs can be divided into the following three categories:
    1.Transparent proxy (Transparent Proxy): although a transparent proxy can directly "hide" your IP address, the other side can still find out who you are.
    2.Anonymous proxy (Anonymous Proxy): with an anonymous proxy, others can only know that you used a proxy, but cannot know who you are.
    3.Elite proxy (Elite Proxy or High Anonymity Proxy): an elite proxy makes it impossible for others to discover that you are using a proxy at all, so it is the best choice.
    
By the protocol of the proxied request, proxies can be divided into:
	1.http proxy: the target URL uses the http protocol
	2.https proxy: the target URL uses the https protocol
	3.socks tunnel proxy (e.g. socks5 proxy):
      A socks proxy simply forwards data packets and does not care about the application protocol (FTP, HTTP, HTTPS, etc.).
      A socks proxy takes less time than an http/https proxy.
      A socks proxy can forward both http and https requests.

General proxy approach

[1]Websites that provide proxy IPs
   Xici Proxy, Kuaidaili, Quanwang Proxy, Proxy Wizard, Abuyun, Zhima Proxy... ...

[2]Parameter form
   proxies = { 'protocol':'protocol://IP:port' }
   proxies = {
    	'http':'http://IP:port',
    	'https':'https://IP:port',
   }

General proxies

# Use free general proxy IP to access the test website: http://httpbin.org/get
import requests

url = 'http://httpbin.org/get'
headers = {'User-Agent':'Mozilla/5.0'}
# Define proxy and find free proxy IP in proxy IP website
proxies = {
    'http':'http://112.85.164.220:9999',
    'https':'https://112.85.164.220:9999'
}
html = requests.get(url,proxies=proxies,headers=headers,timeout=5).text
print(html)

Private proxy + exclusive proxy

[1]Syntax
   proxies = { 'protocol':'protocol://username:password@IP:port' }

[2]Example
   proxies = {
	  'http':'http://username:password@IP:port',
      'https':'https://username:password@IP:port',
   }

Private proxy + exclusive proxy - example code

import requests
url = 'http://httpbin.org/get'
proxies = {
    'http': 'http://309435365:szayclhp@106.75.71.140:16816',
    'https':'https://309435365:szayclhp@106.75.71.140:16816',
}
headers = {
    'User-Agent' : 'Mozilla/5.0',
}

html = requests.get(url,proxies=proxies,headers=headers,timeout=5).text
print(html)

Establish your own proxy IP pool - open proxy | private proxy

"""
Charging agent:
    Establishment of agent IP pool
 Idea:
    1,Get open proxy
    2,For each agent in turn IP Test,Save what you can use to a file
"""
import requests

class ProxyPool:
    def __init__(self):
        self.url = 'Proxy website API link'
        self.headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36'}
        # Open the file to store the available proxy IP addresses
        self.f = open('proxy.txt', 'w')

    def get_html(self):
        html = requests.get(url=self.url, headers=self.headers).text
        proxy_list = html.split('\r\n')
        for proxy in proxy_list:
            # Test whether each proxy IP is available in turn
            if self.check_proxy(proxy):
                self.f.write(proxy + '\n')

    def check_proxy(self, proxy):
        """Test 1 agent IP Available,Available return True,Otherwise return False"""
        test_url = 'http://httpbin.org/get'
        proxies = {
            'http' : 'http://{}'.format(proxy),
            'https': 'https://{}'.format(proxy)
        }
        try:
            res = requests.get(url=test_url, proxies=proxies, headers=self.headers, timeout=2)
            if res.status_code == 200:
                print(proxy,'\033[31m available\033[0m')
                return True
            else:
                print(proxy,'invalid')
                return False
        except:
            print(proxy,'invalid')
            return False

    def run(self):
        self.get_html()
        # Close file
        self.f.close()

if __name__ == '__main__':
    spider = ProxyPool()
    spider.run()

Lagou spider with Abuyun dynamic proxy

import json
import re
import time
import requests
import multiprocessing
from job_data_analysis.lagou_spider.handle_insert_data import lagou_mysql


class HandleLaGou(object):
    def __init__(self):
        #Using session to save cookie information
        self.lagou_session = requests.session()
        self.header = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
        }
        self.city_list = ""

    #How to get the list of all cities in the country
    def handle_city(self):
        city_search = re.compile(r'www\.lagou\.com\/.*\/">(.*?)</a>')
        city_url = "https://www.lagou.com/jobs/allCity.html"
        city_result = self.handle_request(method="GET",url=city_url)
        #Get a list of cities using regular expressions
        self.city_list = set(city_search.findall(city_result))
        self.lagou_session.cookies.clear()

    def handle_city_job(self,city):
        first_request_url = "https://www.lagou.com/jobs/list_python?city=%s&cl=false&fromSearch=true&labelWords=&suginput="%city
        first_response = self.handle_request(method="GET",url=first_request_url)
        total_page_search = re.compile(r'class="span\stotalNum">(\d+)</span>')
        try:
            total_page = total_page_search.search(first_response).group(1)
            print(city,total_page)
        #exception due to lack of position information
        except:
            return
        else:
            for i in range(1,int(total_page)+1):
                data = {
                    "pn":i,
                    "kd":"web"
                }
                page_url = "https://www.lagou.com/jobs/positionAjax.json?city=%s&needAddtionalResult=false"%city
                referer_url = "https://www.lagou.com/jobs/list_python?city=%s&cl=false&fromSearch=true&labelWords=&suginput="%city
                #The referer URL needs to be encoded
                self.header['Referer'] = referer_url.encode()
                response = self.handle_request(method="POST",url=page_url,data=data,info=city)
                lagou_data = json.loads(response)
                job_list = lagou_data['content']['positionResult']['result']
                for job in job_list:
                    lagou_mysql.insert_item(job)



    def handle_request(self,method,url,data=None,info=None):
        while True:
            #Use Abuyun's dynamic proxy
            proxyinfo = "http://%s:%s@%s:%s" % ('account', 'password', 'http-dyn.abuyun.com', '9020')
            proxy = {
                "http":proxyinfo,
                "https":proxyinfo
            }
            try:
                if method == "GET":
                    response = self.lagou_session.get(url=url,headers=self.header,proxies=proxy,timeout=6)
                elif method == "POST":
                    response = self.lagou_session.post(url=url,headers=self.header,data=data,proxies=proxy,timeout=6)
            except:
                # You need to clear the cookie information first
                self.lagou_session.cookies.clear()
                # Retrieve cookie information
                first_request_url = "https://www.lagou.com/jobs/list_python?city=%s&cl=false&fromSearch=true&labelWords=&suginput=" % info
                self.handle_request(method="GET", url=first_request_url)
                time.sleep(10)
                continue
            response.encoding = 'utf-8'
            if 'frequently' in response.text:
                print(response.text)
                #You need to clear the cookie information first
                self.lagou_session.cookies.clear()
                # Retrieve cookie information
                first_request_url = "https://www.lagou.com/jobs/list_python?city=%s&cl=false&fromSearch=true&labelWords=&suginput="%info
                self.handle_request(method="GET",url=first_request_url)
                time.sleep(10)
                continue
            return response.text

if __name__ == '__main__':
    lagou = HandleLaGou()
    #Methods for all cities
    lagou.handle_city()
    print(lagou.city_list)
    #Introduce multi process accelerated crawl
    pool = multiprocessing.Pool(2)
    for city in lagou.city_list:
        pool.apply_async(lagou.handle_city_job,args=(city,))
    pool.close()
    pool.join()

unicode and encode

  • unicode: a universal character set that unifies all languages under one encoding [if the content is English, unicode storage takes twice the space of ASCII, and transmission takes twice as long]

  • encode: it is used to convert unicode encoding into other encoded strings

  • decode: it is used to convert other encoded strings into unicode encoding, and the original encoding format needs to be specified

    • Changing encodings (Python 2 style, where s is a byte string):

      # Decode from gb2312 to unicode, then re-encode as utf-8
      s = "I love China, i love china"
      s.decode("gb2312").encode("utf-8")
      
    • View the current default encoding

      import sys
      sys.getdefaultencoding()
      
  • Python 3 strings are unicode by default, so they can be encoded directly: s.encode("utf-8") (see the round trip below)

  • Windows usually defaults to GB2312 encoding

  • Linux usually defaults to UTF-8 encoding

Analysis module summary

re regex parsing

import re 
pattern = re.compile('regular expression ',re.S)
r_list = pattern.findall(html)

lxml+xpath parsing

from lxml import etree
p = etree.HTML(res.text)
r_list = p.xpath('xpath expression')

[Remember] the result of calling xpath() is always a list

xpath expression

  • Matching rules

    [1]Result: a list of node objects
       1.1) xpath examples: //div, //div[@class="student"], //div/a[@title="stu"]/span
    
    [2]Result: a list of strings
       2.1) the xpath expression ends with: @src, @href, /text()
    
  • Most common pattern

    [1]A base xpath expression: get the list of node objects
    [2]for r in [node object list]:
           username = r.xpath('./xxxxxx')
    
    [Note] the xpath used inside the loop must start with '.', which represents the current node
    
  • Douban Books scraping example

    import requests
    from lxml import etree
    from fake_useragent import UserAgent
    import time
    import random
    
    class DoubanBookSpider:
        def __init__(self):
            self.url = 'https://book.douban.com/top250?start={}'
    
        def get_html(self, url):
            """Use random User-Agent"""
            headers = {'User-Agent':UserAgent().random}
            html = requests.get(url=url, headers=headers).text
            self.parse_html(html)
    
        def parse_html(self, html):
            """lxml+xpath Data analysis"""
            parse_obj = etree.HTML(html)
            # 1. Benchmark xpath: extract the node object list of each book
            table_list = parse_obj.xpath('//div[@class="indent"]/table')
            for table in table_list:
                item = {}
                # title
                name_list = table.xpath('.//div[@class="pl2"]/a/@title')
                item['name'] = name_list[0].strip() if name_list else None
                # describe
                content_list = table.xpath('.//p[@class="pl"]/text()')
                item['content'] = content_list[0].strip() if content_list else None
                # score
                score_list = table.xpath('.//span[@class="rating_nums"]/text()')
                item['score'] = score_list[0].strip() if score_list else None
                # Number of evaluators
                nums_list = table.xpath('.//span[@class="pl"]/text()')
                item['nums'] = nums_list[0][1:-1].strip() if nums_list else None
                # category
                type_list = table.xpath('.//span[@class="inq"]/text()')
                item['type'] = type_list[0].strip() if type_list else None
    
                print(item)
    
        def run(self):
            for i in range(5):
                start = i * 25  # pages start at 0, 25, 50, ...
                page_url = self.url.format(start)
                self.get_html(page_url)
                time.sleep(random.randint(1,2))
    
    if __name__ == '__main__':
        spider = DoubanBookSpider()
        spider.run()
    

Data persistence

mysql

"""
sql Table creation

mysql -uroot -p
create database maoyandb charset utf8;
use maoyandb;
create table maoyantab(
name varchar(100),
star varchar(300),
time varchar(100)
)charset=utf8;
"""
"""
    pymysql Module use
"""
import pymysql

# 1. Connect to the database
db = pymysql.connect(
    host='localhost', user='root', password='123456', database='maoyandb', charset='utf8'
)
cur = db.cursor()
# 2. Execute sql command
ins = 'insert into maoyantab values(%s,%s,%s)'
cur.execute(ins, ['The Shawshank Redemption','Starring: Zhang Guorong,Zhang Manyu,Lau Andy','2018-06-25'])
# 3. Submit to database for execution
db.commit()

cur.close()
db.close()

Mysql template

import pymysql

# __init__(self): 
	self.db = pymysql.connect('IP',... ...)
	self.cursor = self.db.cursor()
	
# save_html(self,r_list):
	self.cursor.execute('sql',[data1])
    self.cursor.executemany('sql', [(),(),()])
	self.db.commit()
	
# run(self):
	self.cursor.close()
	self.db.close()

mongodb

  • https://www.runoob.com/python3/python-mongodb.html [rookie tutorial]
"""
[1]Non-relational database; data is stored as key-value pairs; default port 27017
[2]MongoDB is disk-based storage
[3]MongoDB has a single data type: the value is a JSON document (Redis, by contrast, is memory-based)
   3.1> MySQL data types: numeric, character, date/time and enum types
   3.2> Redis data types: string, list, hash, set, sorted set
   3.3> MongoDB data type: the value is a JSON document
[4]MongoDB: database -> collection -> document
     MySQL : database -> table      -> row (record)

"""
On Linux, enter: mongo
>show dbs                  - list all databases
>use db_name               - switch to a database
>show collections          - list all collections in the current database
>db.collection_name.find().pretty()  - view the documents in a collection
>db.collection_name.count()          - count the documents
>db.collection_name.drop()           - drop the collection
>db.dropDatabase()         - drop the current database
"""
import pymongo

# Create connection object
connect = pymongo.MongoClient(host='127.0.0.1',port=27017)
# Connect library objects
db = connect['maoyandb']
# Create the collection object
myset = db['maoyanset']
# Insert a single piece of data
# myset.insert_one({'name': 'money day'})
# Insert multiple pieces of data
myset.insert_many([{'name':'Zhang San'},{'name':'Li Si'},{'name':'Wang Wu'}])

mongodb template

import requests
import re
import time
import random
import pymongo

class MaoyanSpider:
    def __init__(self):
        self.url = 'https://maoyan.com/board/4?offset={}'
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.113 Safari/537.36'}
        # Three objects: connection object, library object and library collection
        self.conn = pymongo.MongoClient('127.0.0.1',27017)
        self.db = self.conn['maoyandb']
        self.myset = self.db['maoyanset']

    ##Extract web page data
    def get_html(self,url):
        html = requests.get(url, headers=self.headers).text
        self.parse_html(html)

    ##Parsing web data
    def parse_html(self,html):
        regex =  '<div class="movie-item-info">.*?title="(.*?)".*?<p class="star">(.*?)</p>.*?<p class="releasetime">(.*?)</p>'
        pattern = re.compile(regex,re.S)
        r_list = pattern.findall(html)
        self.save_html(r_list)

    ## Data processing and extraction
    def save_html(self,r_list):
        for r in r_list:
            item ={}
            item['name'] = r[0].strip()
            item['star'] = r[1].strip()
            item['time'] = r[2].strip()
            print(item)
            # Save to mongodb database
            self.myset.insert_one(item)

    def run(self):
        """Program entry function"""
        for offset in range(0, 91, 10):
            url = self.url.format(offset)
            self.get_html(url=url)
            # Control the frequency of data fetching: uniform() generates floating-point numbers within the specified range
            time.sleep(random.uniform(0, 1))

if __name__ == '__main__':
    spider = MaoyanSpider()
    spider.run()

Json

import json
# Demo1
item = {'name':'QQ','e-mail':99362}
with open('qq.json','a')as f:
    json.dump(item,f,ensure_ascii=False)


# Demo2
# import json
# item_list =[]
# for i in range(3):
#     item = {'name':'QQ','e-mail':'99362'}
#     item_list.append(item)
#
# with open('demo2.json','a')as f:
#     json.dump(item_list,f,ensure_ascii=False)

JSON template

import requests
from fake_useragent import UserAgent
import time
import random
import re
import json

# Douban movie full stack capture
class DoubanSpider:
    def __init__(self):
        self.url = 'https://movie.douban.com/j/chart/top_list?'
        self.i = 0
        # Save json file
        self.f = open('douban.json', 'w', encoding='utf-8')
        self.all_film_list = []

    def get_agent(self):
        """Get random User-Agent"""
        return UserAgent().random

    def get_html(self, params):
        headers = {'User-Agent':self.get_agent()}
        html = requests.get(url=self.url, params=params, headers=headers).text
        # Convert the string in json format to python data type
        html = json.loads(html)

        self.parse_html(html)

    def parse_html(self, html):
        """analysis"""
        # html: [{},{},{},{}]
        for one_film in html:
            # Create a new dict per film; reusing one dict would fill the list with references to the same object
            item = {}
            item['rank'] = one_film['rank']
            item['title'] = one_film['title']
            item['score'] = one_film['score']
            print(item)
            self.all_film_list.append(item)
            self.i += 1

    def run(self):
        # d: {'plot': '11', 'love': '13', 'comedy': '5', ...}
        d = self.get_d()
        # 1. Prompt the user to choose
        menu = ''
        for key in d:
            menu += key + '|'
        print(menu)
        choice = input('Please enter movie category:')
        if choice in d:
            code = d[choice]
            # 2. Total: total number of movies
            total = self.get_total(code)
            for start in range(0,total,20):
                params = {
                    'type': code,
                    'interval_id': '100:90',
                    'action': '',
                    'start': str(start),
                    'limit': '20'
                }
                self.get_html(params=params)
                time.sleep(random.randint(1,2))

            # Store data in json file
            json.dump(self.all_film_list, self.f, ensure_ascii=False)
            self.f.close()
            print('quantity:',self.i)
        else:
            print('Please make the right choice')

    def get_d(self):
        """{'plot':'11','love':'13','comedy':'5',...,...}"""
        url = 'https://movie.douban.com/chart'
        html = requests.get(url=url,headers={'User-Agent':self.get_agent()}).text
        regex = '<span><a href=".*?type_name=(.*?)&type=(.*?)&interval_id=100:90&action=">'
        pattern = re.compile(regex, re.S)
        # r_list: [('plot', '11'), ('comedy', '5'), ('love', '13'), ...]
        r_list = pattern.findall(html)
        # d: {'plot': '11', 'love': '13', 'comedy': '5', ...}
        d = {}
        for r in r_list:
            d[r[0]] = r[1]

        return d

    def get_total(self, code):
        """Gets the total number of movies in a category"""
        url = 'https://movie.douban.com/j/chart/top_list_count?type={}&interval_id=100%3A90'.format(code)
        html = requests.get(url=url,headers={'User-Agent':self.get_agent()}).text
        html = json.loads(html)

        return html['total']

if __name__ == '__main__':
    spider = DoubanSpider()
    spider.run()

json common operations

import json
 
data = {
    'name': 'pengjunlee',
    'age': 32,
    'vip': True,
    'address': {'province': 'GuangDong', 'city': 'ShenZhen'}
}
# Convert Python dictionary type to JSON object
json_str = json.dumps(data)
print(json_str) # Results {"name": "pengjunlee", "age": 32, "vip": true, "address": {"province": "GuangDong", "city": "ShenZhen"}}
 
# Converting JSON object types to Python Dictionaries
user_dic = json.loads(json_str)
print(user_dic['address']) # Result: {'province': 'GuangDong', 'city': 'ShenZhen'}
 
# Output Python dictionary directly to file
with open('pengjunlee.json', 'w', encoding='utf-8') as f:
    json.dump(user_dic, f, ensure_ascii=False, indent=4)
 
# Converts the JSON string in the class file object directly to a Python dictionary
with open('pengjunlee.json', 'r', encoding='utf-8') as f:
    ret_dic = json.load(f)
    print(type(ret_dic)) # Result: <class 'dict'>
    print(ret_dic['name']) # Result pengjunlee

json storage list

import json

all_app_list = [
    {'name':'QQ'},
    {'name':'Glory of Kings'},
    {'name':'Game for Peace'}
]

with open('xiaomi.json', 'w') as f:
    json.dump(all_app_list, f, ensure_ascii=False)

CSV template

import csv
from urllib import request, parse
import re
import time
import random


class MaoyanSpider(object):
  def __init__(self):
    self.url = 'https://maoyan.com/board/4?offset={}'
    # count
    self.num = 0


  def get_html(self, url):
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko'
    }
    req = request.Request(url=url, headers=headers)
    res = request.urlopen(req)
    html = res.read().decode('utf-8')
    # Call parsing function directly
    self.parse_html(html)

  def parse_html(self, html):
    # Create regular compilation objects
    re_ = '<div class="movie-item-info">.*?title="(.*?)".*?<p class="star">(.*?)</p>.*?<p class="releasetime">(.*?)</p>'
    pattern = re.compile(re_, re.S)
    # film_list: [('Farewell My Concubine', 'Zhang Guorong', '1993')]
    film_list = pattern.findall(html)
    self.write_html(film_list)

  # Save to csv file - writerows
  def write_html(self, film_list):
    L = []
    with open('maoyanfilm.csv', 'a',newline='') as f:
      # Initialize the write object. Note that the parameter f cannot be forgotten
      writer = csv.writer(f)
      for film in film_list:
        t = (
          film[0].strip(),
          film[1].strip(),
          film[2].strip()[5:15]
        )
        self.num += 1
        L.append(t)
      # writerows() takes a list of rows (each row is a tuple or list)
      writer.writerows(L)
      print(L)


  def main(self):
    for offset in range(0, 91, 10):
      url = self.url.format(offset)
      self.get_html(url)
      time.sleep(random.randint(1, 2))
    print('Scraped', self.num, 'movies in total')


if __name__ == '__main__':
  start = time.time()
  spider = MaoyanSpider()
  spider.main()
  end = time.time()
  print('execution time:%.2f' % (end - start))

txt template

Note: if you need a line break, add \n to the written content yourself

  • Single line write
# Write in binary mode
f  = open("write.txt","wb")
# Append in binary mode
# f  = open("write.txt","ab")
# write in
res = f.write(b"Big Bang Bang\n")
print(res)
res = f.write(b"Big Bang Bang\n")
print(res)
#close
f.close()
  • Multiline write
f  = open("write.txt","w")
# Write information
msg =  f.writelines(["Hello","I'm fine","That's good"])
# close
f.close()

Implementing an incremental URL crawler with Redis

[1]Principle
    Use the Redis set type: add the fingerprint of each captured URL to a redis set, and decide whether to crawl based on the return value
	Return value 1: the URL has not been captured before, so it needs to be crawled
	Return value 0: the URL has already been crawled, so there is no need to crawl it again
    
    Hash fingerprints: md5 gives 32 hex characters, sha1 gives 40
    
    Create an md5 object
    Use update() to feed it the href as bytes
    Finally obtain the digest via hexdigest()
    
[2]Code implementation template
import redis
from hashlib import md5
import sys

class XxxIncrSpider:
  def __init__(self):
    self.r = redis.Redis(host='localhost',port=6379,db=0)
    
  def url_md5(self,url):
    """yes URL conduct md5 Encryption function"""
    s = md5()
    s.update(url.encode())
    return s.hexdigest()
  
  def run_spider(self):
    href_list = ['url1','url2','url3','url4']
    for href in href_list:
      href_md5 = self.url_md5(href)
      if self.r.sadd('spider:urls',href_md5) == 1:
        # A return value of 1 means the add succeeded, i.e. this URL has not been crawled before, so start crawling
        pass  # crawl the page here
      else:
        sys.exit()

Sleep settings

# Control data capture frequency: uniform() generates floating-point numbers within the specified range
time.sleep(random.uniform(2,5))
# Generate a random integer in the given range
time.sleep(random.randint(1,2))

Time parameter

import datetime
timestamp = str(datetime.datetime.now()).split(".")[0]

Captcha-solving platforms

  • Atlas
  • Super Eagle

Word cloud

import os,time,json,random,jieba,requests
import numpy as np
from PIL import Image
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Word cloud shape picture
WC_MASK_IMG = 'wawa.jpg'
# Comment data save file
COMMENT_FILE_PATH = 'jd_comment.txt'
# Word cloud font
WC_FONT_PATH = '/Library/Fonts/Songti.ttc'

def spider_comment(page=0):
    """
    Crawl the evaluation data of JD specified page
    :param page: The default value is 0
    """
    url = 'https://sclub.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98vv4646&productId=1263013576&score=0&sortType=5&page=%s&pageSize=10&isShadowSku=0&fold=1' % page
    kv = {'user-agent': 'Mozilla/5.0', 'Referer': 'https://item.jd.com/1263013576.html'}
    # proxies = {
    #     '1.85.5.66':'8060',
    #     '171.11.178.223':'9999',
    #     '120.194.42.157':'38185',
    #     '161.35.4.201':'80',
    #     '175.42.123.196':'9999',
    # }
    try:
        r = requests.get(url, headers=kv)#,proxies=proxies
        r.raise_for_status()
    except:
        print('Crawl failed')
        return  # r is undefined below if the request failed, so stop here
    # Strip the JSONP wrapper to get the raw json string
    r_json_str = r.text[26:-2]
    # String to json object
    r_json_obj = json.loads(r_json_str)
    # Get evaluation list data
    r_json_comments = r_json_obj['comments']
    # Traverse the list of comment objects
    for r_json_comment in r_json_comments:
        # Wrap each comment in append mode
        with open(COMMENT_FILE_PATH, 'a+') as file:
            file.write(r_json_comment['content'] + '\n')
        # Print comment content in comment object
        print(r_json_comment['content'])

def batch_spider_comment():
    """
    Batch crawling evaluation
    """
    # Clear the previous data before writing data
    if os.path.exists(COMMENT_FILE_PATH):
        os.remove(COMMENT_FILE_PATH)
    for i in range(3):
        spider_comment(i)
        # Simulate user browsing and set a crawler interval to prevent ip from being blocked
        time.sleep(random.random() * 5)

def cut_word():
    """
    Data segmentation
    :return: Data after word segmentation
    """
    with open(COMMENT_FILE_PATH) as file:
        comment_txt = file.read()
        wordlist = jieba.cut(comment_txt, cut_all=True)
        wl = " ".join(wordlist)
        print(wl)
        return wl

def create_word_cloud():
    """
    Generate word cloud
    :return:
    """
    # Set word cloud shape picture
    wc_mask = np.array(Image.open(WC_MASK_IMG))
    # Set some configurations of word cloud, such as font, background color, word cloud shape and size
    wc = WordCloud(background_color="white", max_words=2000, mask=wc_mask, scale=4,
                   max_font_size=50, random_state=42, font_path=WC_FONT_PATH)
    # Generate word cloud
    wc.generate(cut_word())

    # If you only set the mask, you will get a word cloud with picture shape
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.figure()
    plt.show()

if __name__ == '__main__':
    # Crawling data
    batch_spider_comment()

    # Generate word cloud
    create_word_cloud()

Multithreaded crawler

Application scenario

[1]Multiprocessing: CPU-intensive programs
[2]Multithreading: crawlers (network I/O), local disk I/O

Multithreaded crawler example [Douban]

# Capture movie information under Douban movie plot category
"""
Douban film - plot - Grab
"""
import requests
from fake_useragent import UserAgent
import time
import random
from threading import Thread,Lock
from queue import Queue

class DoubanSpider:
    def __init__(self):
        self.url = 'https://movie.douban.com/j/chart/top_list?type=13&interval_id=100%3A90&action=&start={}&limit=20'
        self.i = 0
        # Queue + lock
        self.q = Queue()
        self.lock = Lock()

    def get_agent(self):
        """Get random User-Agent"""
        return UserAgent().random

    def url_in(self):
        """Take everything you want to grab URL Address in queue"""
        for start in range(0,684,20):
            url = self.url.format(start)
            # url in queue
            self.q.put(url)

    # Thread event function: request + parsing + data processing
    def get_html(self):
        while True:
            # Get a URL address from the queue
            # Lock around empty() + get(), otherwise threads could race when only one URL is left in the queue
            self.lock.acquire()
            if not self.q.empty():
                headers = {'User-Agent': self.get_agent()}
                url = self.q.get()
                self.lock.release()

                html = requests.get(url=url, headers=headers).json()
                self.parse_html(html)
            else:
                # If the queue is empty, the lock must eventually be released
                self.lock.release()
                break

    def parse_html(self, html):
        """analysis"""
        # html: [{},{},{},{}]
        item = {}
        for one_film in html:
            item['rank'] = one_film['rank']
            item['title'] = one_film['title']
            item['score'] = one_film['score']
            print(item)
            # Lock + release lock
            self.lock.acquire()
            self.i += 1
            self.lock.release()

    def run(self):
        # Put the URL address into the queue first
        self.url_in()
        # Create multiple threads and get started
        t_list = []
        for i in range(1):
            t = Thread(target=self.get_html)
            t_list.append(t)
            t.start()

        for t in t_list:
            t.join()

        print('quantity:',self.i)

if __name__ == '__main__':
    start_time = time.time()
    spider = DoubanSpider()
    spider.run()
    end_time = time.time()
    print('execution time:%.2f' % (end_time-start_time))

selenium+PhantomJS/Chrome/Firefox

selenium

[1]Definition
    1.1) A web automation testing tool
    
[2]Purpose
    2.1) Functional testing of web systems, avoiding repetitive work across version iterations
    2.2) Compatibility testing (check whether the web application runs correctly on different operating systems and browsers)
    2.3) Large-scale testing of web systems
    
[3]Characteristics
    3.1) It can drive the browser according to instructions
    3.2) It is only a tool and must be used together with a third-party browser
    
[4]install
    4.1) Linux: sudo pip3 install selenium
    4.2) Windows: python -m pip install selenium

Bypass selenium detection

from selenium import webdriver
options = webdriver.ChromeOptions()
# This step is very important. Set it to the developer mode to prevent being recognized by major websites for using Selenium
options.add_experimental_option('excludeSwitches', ['enable-automation'])
#Stop loading pictures
options.add_experimental_option("prefs", {"profile.managed_default_content_settings.images": 2})
browser = webdriver.Chrome(options=options)
browser.get('https://www.taobao.com/')

selenium's handling of cookies

selenium can help us handle the cookies in a page, for example getting and deleting them. Let's look at how.

Get cookie

driver.get_cookies() returns a list with the complete cookie information: not only name and value but also domain and other fields. So if you want to use the obtained cookies with the requests module, you need to convert them into a dictionary with name/value as the key-value pairs (see the sketch after the snippet below)

# Get all cookie information of the current tab
print(driver.get_cookies())
# Convert the cookies into a dictionary
cookies_dict = {cookie['name']: cookie['value'] for cookie in driver.get_cookies()}

delete cookie

#Delete a cookie
driver.delete_cookie("CookieName")

# Delete all cookies
driver.delete_all_cookies()

selenium uses proxy ip

The browser controlled by selenium can also use a proxy ip!

  • Method of using proxy ip

    • Instantiate configuration object
      • options = webdriver.ChromeOptions()
    • The configuration object adds commands that use proxy ip
      • options.add_argument('--proxy-server=http://202.20.16.82:9527')
    • Instantiate the driver object with the configuration object
      • driver = webdriver.Chrome('./chromedriver', chrome_options=options)
  • Reference codes are as follows:

    from selenium import webdriver
    
    options = webdriver.ChromeOptions() # Create a configuration object
    options.add_argument('--proxy-server=http://202.20.16.82:9527')  # use a proxy ip
    
    driver = webdriver.Chrome(chrome_options=options) # Instantiate the driver object with configuration
    
    driver.get('http://www.itcast.cn')
    print(driver.title)
    driver.quit()
    

selenium replaces user agent

When selenium controls Chrome, the user agent defaults to Chrome's own. In this section we learn how to use a different user agent.

  • Method of replacing user agent

    • Instantiate configuration object
      • options = webdriver.ChromeOptions()
    • Add and replace UA command for configuration object
      • options.add_argument('--user-agent=Mozilla/5.0 HAHA')
    • Instantiate the driver object with the configuration object
      • driver = webdriver.Chrome('./chromedriver', chrome_options=options)
  • Reference codes are as follows:

    from selenium import webdriver
    
    options = webdriver.ChromeOptions() # Create a configuration object
    options.add_argument('--user-agent=Mozilla/5.0 HAHA') # Replace user agent
    
    driver = webdriver.Chrome('./chromedriver', chrome_options=options)
    
    driver.get('http://www.itcast.cn')
    print(driver.title)
    driver.quit()
    

Specify port remote control

from selenium import webdriver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_experimental_option('debuggerAddress','127.0.0.1:9222')
browser=webdriver.Chrome(executable_path=r'C:\Users\TR\AppData\Local\Google\Chrome\Application\chromedriver.exe',chrome_options=chrome_options)
browser.get('http://www.zhihu.com')

PhantomJS browser

[1]Definition
    phantomjs is a headless (no-interface) browser; pages are loaded in memory, which is efficient
 
[2]Download address
    2.1) chromedriver : Download the corresponding version
       http://npm.taobao.org/mirrors/chromedriver/
    
    2.2) geckodriver
       https://github.com/mozilla/geckodriver/releases
            
    2.3) phantomjs
       https://phantomjs.org/download.html

[3]Ubuntu install
    3.1) Unzip after downloading: tar -zxvf geckodriver.tar.gz
        
    3.2) Copy the extracted file to /usr/bin/ (adds it to the PATH)
         sudo cp geckodriver /usr/bin/
        
    3.3) Add executable permissions
         sudo chmod 777 /usr/bin/geckodriver

[4]Windows install
    4.1) Download the matching versions of phantomjs, chromedriver and geckodriver
    4.2) Copy chromedriver.exe into the Scripts directory of the Python installation (which is on the system PATH)
         # View the python installation path: where python
    4.3) Verify
         On the cmd command line, run: chromedriver
 
***************************summary**************************************
[1]Unzip       - put it in the user's home directory (chromedriver, geckodriver, phantomjs)
[2]Copy        - sudo cp /home/tarena/chromedriver /usr/bin/
[3]Permissions - sudo chmod 777 /usr/bin/chromedriver

# verification
[Ubuntu | Windows]
ipython3
from selenium import webdriver
webdriver.Chrome()
or
webdriver.Firefox()

[mac]
ipython3
from selenium import webdriver
webdriver.Chrome(executable_path='/Users/xxx/chromedriver')
or
webdriver.Firefox(executable_path='/User/xxx/geckodriver')

Baidu sample code

"""Example code 1: use selenium+Open Baidu browser"""

# Import the webdriver interface of selenium
from selenium import webdriver
import time

# Create browser object
browser = webdriver.Chrome()
browser.get('http://www.baidu.com/')
# Close the browser after 5 seconds
time.sleep(5)
browser.quit()
"""Example code 2: open Baidu, search Zhao Liying, click search and view"""

from selenium import webdriver
import time

# 1. Create browser object - the browser has been opened
browser = webdriver.Chrome()
# 2. Input: http://www.baidu.com/
browser.get('http://www.baidu.com/')
# 3. Find the search box and send the text to this node: Zhao Liying
browser.find_element_by_xpath('//*[@id="kw"]').send_keys('Zhao Liying')
# 4. Find the baidu button and click it
browser.find_element_by_xpath('//*[@id="su"]').click()

Browser object method

[1]browser.get(url=url)   - Enter the URL in the address bar and load the page
[2]browser.quit()         - Close the browser
[3]browser.close()        - Close the current page
[4]browser.page_source    - HTML source code of the page
[5]browser.page_source.find('string')
    Search for the specified string in the HTML source; returns -1 if not found; often used to decide whether this is the last page
[6]browser.maximize_window() - Maximize the browser window

Eight methods of Selenium locating nodes

[1]Single-element lookup ('the result is one node object')
    1.1) [most commonly used] browser.find_element_by_id('id attribute value')
    1.2) [most commonly used] browser.find_element_by_name('name attribute value')
    1.3) [most commonly used] browser.find_element_by_class_name('class attribute value')
    1.4) [most versatile]     browser.find_element_by_xpath('xpath expression')
    1.5) [matches an a node, common] browser.find_element_by_link_text('link text')
    1.6) [matches an a node, common] browser.find_element_by_partial_link_text('partial link text')
    1.7) [least useful]       browser.find_element_by_tag_name('tag name')
    1.8) [fairly common]      browser.find_element_by_css_selector('css expression')

[2]Multi-element lookup ('the result is a [list of node objects]')
    2.1) browser.find_elements_by_id('id attribute value')
    2.2) browser.find_elements_by_name('name attribute value')
    2.3) browser.find_elements_by_class_name('class attribute value')
    2.4) browser.find_elements_by_xpath('xpath expression')
    2.5) browser.find_elements_by_link_text('link text')
    2.6) browser.find_elements_by_partial_link_text('partial link text')
    2.7) browser.find_elements_by_tag_name('tag name')
    2.8) browser.find_elements_by_css_selector('css expression')

Cat's eye movie example

from selenium import webdriver
import time

url = 'https://maoyan.com/board/4'
browser = webdriver.Chrome()
browser.get(url)

def get_data():
    # Base XPath: li_list is a list of selenium node objects, one per dd element
    li_list = browser.find_elements_by_xpath('//*[@id="app"]/div/div/div[1]/dl/dd')
    for li in li_list:
        item = {}
        # info_list: ['1', 'Farewell My Concubine', 'Starring: Zhang Guorong', 'release time: January 1, 1993', '9.5']
        info_list = li.text.split('\n')
        item['number'] = info_list[0]
        item['name'] = info_list[1]
        item['star'] = info_list[2]
        item['time'] = info_list[3]
        item['score'] = info_list[4]

        print(item)

while True:
    get_data()
    try:
        browser.find_element_by_link_text('next page').click()
        time.sleep(2)
    except Exception as e:
        print('Congratulations! Scraping finished')
        browser.quit()
        break
  • Node object operations

    [1]Text box operations
        1.1) node.send_keys('')  - Send content to the text box
        1.2) node.clear()        - Clear the text
        1.3) node.get_attribute('value') - Get the text content
        
    [2]Button operations
        2.1) node.click()      - Click
        2.2) node.is_enabled() - Check whether the button is enabled
        2.3) node.get_attribute('value') - Get the button text
    

chromedriver set no interface mode

from selenium import webdriver

options = webdriver.ChromeOptions()
# Add no interface parameters
options.add_argument('--headless')
browser = webdriver.Chrome(options=options)

selenium - mouse operation

from selenium import webdriver
# Import mouse event class
from selenium.webdriver import ActionChains

driver = webdriver.Chrome()
driver.get('http://www.baidu.com/')

# Move the mouse to the 'Settings' element; perform() actually executes the action and must be called
element = driver.find_element_by_xpath('//*[@id="u1"]/a[8]')
ActionChains(driver).move_to_element(element).perform()

# Click the element in the Ajax popup, located by the link text of the node
driver.find_element_by_link_text('Advanced search').click()

selenium handles iframe anti-crawling

  • Classic cases: Civil Affairs Bureau capture, QQ email login, 163 email login
[1]Mouse operation
    from selenium.webdriver import ActionChains
    ActionChains(browser).move_to_element(node).perform()

[2]Switch handle
    all_handles = browser.window_handles
    time.sleep(1)
    browser.switch_to.window(all_handles[1])

[3]iframe subframe
    browser.switch_to.frame(iframe_element)
    # Approach 1 - works in any case:
    iframe_node = browser.find_element_by_xpath('')
    browser.switch_to.frame(iframe_node)
    
    # Approach 2 - passing the id or name attribute value is supported directly:
    browser.switch_to.frame('id attribute value | name attribute value')

Topics: Python crawler