Scrapy framework: integrating Selenium into Scrapy

Posted by dloeppky on Wed, 19 Jan 2022 12:47:26 +0100

1, Overview

In day-to-day practice it turns out that not every web page can be scraped with Scrapy alone: pages rendered dynamically with JavaScript can instead be crawled by using Selenium to simulate a browser. With Selenium there is no need to inspect background requests or analyze the rendering process; whatever is visible on the final rendered page can be captured. The site used in this experiment is lianjia.com. Using Scrapy + Selenium, the crawler collects the second-hand housing listings for Kunming, including the name of the residential community, district, orientation, floor area, layout, total price and unit price of each listing. The data is stored in a MySQL database, grouped by district to count the number of listings, filtered to the districts with more than 100 listings, and drawn as a bar chart with pyecharts for data visualization. Finally, on this basis, the post briefly analyzes the level of economic development and population flow in the different districts.

2, Specific implementation steps

1. Install the chrome driver and configure the environment variables

(1) Check the Chrome version (chrome://version/)

(2) Download the matching version of chromedriver, choosing the build that corresponds to your Chrome version and operating system (a 64-bit OS can run the 32-bit driver)

(3) Configure environment variables

(4) Install Selenium and test the chromedriver, as in the sketch below
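
A minimal test sketch, assuming chromedriver has been placed on the PATH configured in step (3):

from selenium import webdriver

chrome = webdriver.Chrome()  # fails here if chromedriver is missing or its version does not match Chrome
chrome.get("https://km.lianjia.com/")  # open any page to confirm the browser responds
print(chrome.title)  # print the page title as a quick sanity check
chrome.quit()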

2. Building the basic framework

Create a virtual environment, activate it and install the required libraries, create a project, enter the project and generate a crawler (all of this is done on the command line; a rough sketch of the commands follows below), then create a new main.py as the entry file.
Scrapy framework (2): entry file
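
A rough command-line sketch of the steps above. The project and spider names match the code later in this post; the virtual environment name is arbitrary, and on Linux/macOS the environment is activated with source venv/bin/activate instead:

python -m venv venv
venv\Scripts\activate
pip install scrapy selenium pymysql pyecharts
scrapy startproject FinalProject
cd FinalProject
scrapy genspider LianjiaSpider lianjia.com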

3. Integrate Selenium into Scrapy (download middleware, middlewares.py)

Every request that is sent passes through the downloader middleware. To integrate Selenium into Scrapy, the request is intercepted here: Selenium opens the page for the requested URL and the rendered result is handed back to the framework as the response (to be parsed in the crawler logic), so Scrapy's own downloader is never used. The middleware also has to be enabled in settings.py, as sketched below.
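
A minimal sketch of that registration in settings.py; the priority value 543 is the usual example value, not something fixed by this project:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    "FinalProject.middlewares.SeleniumMiddleware": 543,
}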

4. Open and close the browser (crawler logic)

Opening the browser is easy, but when should it be closed? This is where Scrapy's signals come in: the framework emits a signal whenever a particular event happens, and by connecting a handler to the corresponding signal the matching clean-up can be performed, in this case quitting the browser when the spider closes.
Scrapy official document Signals page

5. Parse data (crawler logic)

parse() is the parsing function. CSS selectors are used to filter out the target data, i.e. the name of the community, district, orientation, floor area, layout, total price and unit price of each listing. zip() packs the fields of each listing into a tuple, which is passed through the engine into pipelines.py for processing and storage. Besides extracting the target data, the parsing function also has to produce the URL of the next page. By Scrapy's design, every parsed response carries the URL it came from; splicing the next page number onto that URL gives a new URL, which is fed back into the crawler logic as a new request.
Scrapy framework (3): parsing data with CSS selectors

6. Save to database (pipelines.py)

Use pymysql to connect PyCharm to MySQL. The target table can be created once up front, as sketched below.
Use of pymysql (connection between PyCharm and MySQL)
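
A one-off sketch for creating the target table with pymysql. Only the table and column names are taken from the insert statement used later; the column types are assumptions:

import pymysql

conn = pymysql.connect(host='127.0.0.1', database='lianjia', user='root', password='123456')
cursor = conn.cursor()
cursor.execute("""
    CREATE TABLE IF NOT EXISTS info (
        name VARCHAR(100),
        price VARCHAR(20),
        room_num VARCHAR(50),
        area VARCHAR(50),
        ori VARCHAR(50),
        position VARCHAR(100),
        uniprice VARCHAR(50)
    )
""")
conn.commit()
cursor.close()
conn.close()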

7. Query the data used to plot the bar chart

Group the listings by district and keep only the districts where the total number of second-hand houses is greater than 100.

8. Drawing with pyecharts

Scrapy framework (5): page turning operation, database storage and simple data visualization

3, Specific code

1. Entry file

import os  # Locate the current directory with the os module
import sys
from scrapy.cmdline import execute
sys.path.append(os.path.dirname(os.path.abspath(__file__)))  # __file__ is the path of the current file
execute(["scrapy", "crawl", "LianjiaSpider"])  # equivalent to running "scrapy crawl LianjiaSpider" in the terminal

2. Crawler logic

from urllib.parse import urljoin
import scrapy
from scrapy import signals
from selenium import webdriver

from FinalProject.items import FinalprojectItem

class LianjiaspiderSpider(scrapy.Spider):
    name = 'LianjiaSpider'
    allowed_domains = ['lianjia.com']
    start_urls = ['https://km.lianjia.com/ershoufang/pg1/']


    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(LianjiaspiderSpider, cls).from_crawler(crawler, *args, **kwargs)
        spider.chrome = webdriver.Chrome()  # one shared Chrome instance for the whole crawl
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)  # call spider_closed when the spider finishes
        return spider

    def spider_closed(self, spider):
        spider.chrome.quit()
        print("The reptile is over!!!!!!")

    def parse(self, response):
        name = response.css("div.positionInfo a:nth-child(2)::text").extract()  # Community (residential compound) name
        price = response.css("div.totalPrice span::text").extract()  # Total price of each listing
        message = response.css("div.houseInfo::text").extract()  # Info string for each listing (layout | area | orientation | ...)
        position = response.css("div.positionInfo a:nth-child(3)::text").extract()  # District name
        uniprice = response.css("div.unitPrice span::text").extract()  # Unit price per square metre

        room_nums = []  # Layout (rooms and halls)
        area = []  # Floor area
        ori = []  # Orientation
        for num in message:
            parts = num.split("|")
            room_nums.append(parts[0])
            area.append(parts[1])
            ori.append(parts[2])

        Infos = zip(name, price, room_nums, area, ori, position, uniprice)
        finalprojectItem = FinalprojectItem()
        for Info in Infos:
            finalprojectItem["name"] = Info[0]
            finalprojectItem["price"] = Info[1]
            finalprojectItem["room_nums"] = Info[2]
            finalprojectItem["area"] = Info[3]
            finalprojectItem["ori"] = Info[4]
            finalprojectItem["position"]=Info[5]
            finalprojectItem["uniprice"] = Info[6]
            yield finalprojectItem

        current_url = response.url
        next_page = int(current_url.strip('/').split('pg')[1]) + 1  # current page number + 1
        url = urljoin(response.url, "/ershoufang/pg" + str(next_page))
        if next_page == 101:  # stop after page 100
            return
        yield scrapy.Request(url, callback=self.parse)  # pass the method itself (self.parse), do not call it

3. items.py

import scrapy


class FinalprojectItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    position = scrapy.Field()
    uniprice = scrapy.Field()
    room_nums =scrapy.Field()
    area = scrapy.Field()
    ori = scrapy.Field()
    pass

4. Download middleware (middlewares.py)

from scrapy.http import HtmlResponse

class SeleniumMiddleware:

    def process_request(self, request, spider):
        # Fetch the page with the spider's shared browser instead of Scrapy's downloader
        url = request.url
        spider.chrome.get(url)
        html = spider.chrome.page_source
        # Returning an HtmlResponse short-circuits the downloader: this response goes straight back to the spider
        return HtmlResponse(url=url, body=html, request=request, encoding="utf-8")

5. pipelines.py

import pymysql

class FinalprojectPipeline:
    def __init__(self):  # Connect to database
        self.conn = pymysql.connect(host='127.0.0.1', database='lianjia', user='root', password='123456')
        self.cursor=self.conn.cursor()

    def process_item(self, item, spider):
        with open("ershoufang.json","a+")as f:  # Exchange format of front-end and back-end data
            f.write(str(item._values))




        insert_sql = "insert into info (name,price,room_num,area,ori,position,uniprice) values (%s,%s,%s,%s,%s,%s,%s)"

        # Perform the insert data into the database operation
        self.cursor.execute(insert_sql, (item['name'], item['price'], item['room_nums'], item['area'],item['ori'],item['position'],item['uniprice']))
        # Cannot save to database without submitting
        self.conn.commit()

        return item
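
One thing the pipeline never does is close the connection it opens in __init__. A small sketch of a close_spider hook that could be added to FinalprojectPipeline (not part of the original code; Scrapy calls this method once when the spider finishes):

class FinalprojectPipeline:
    ...  # __init__ and process_item as above

    def close_spider(self, spider):
        # Release the MySQL connection opened in __init__
        self.cursor.close()
        self.conn.close()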

6. visualization.py

import pymysql

def select():
    nums = []
    positions=[]
    connect = pymysql.connect(host='127.0.0.1', database='lianjia', user='root', password='123456')
    sql = "select position,count(*) from info group by position having count(*)>100"
    cursor = connect.cursor()
    cursor.execute(sql)
    for row in cursor.fetchall():
        position = row[0]
        num = row[1]
        nums.append(num)
        positions.append(position)
    cursor.close()
    connect.close()

    # print(positions)
    # print(nums)
    return positions,nums

7. Drawing with pyecharts

#coding=gbk
from pyecharts.charts import Bar    # Bar chart
from pyecharts import options   # Global options such as the title
from FinalProject.visualization import select

position, num = select()  # Get database data

bar = Bar()

#Add title
bar.set_global_opts(
    title_opts=options.TitleOpts(title="Histogram",subtitle="Distribution of second-hand houses in various regions"),
)

# x-axis: district names
bar.add_xaxis(position)
# y-axis: number of listings
bar.add_yaxis("quantity", num)
# Render the chart to an HTML file
bar.render("positionInfo.html")

Topics: Python Pycharm Selenium crawler