Python crawler practice: scraping a weather forecast with Scrapy

Posted by jolly on Fri, 20 Mar 2020 11:00:24 +0100

Objective

Write a real crawler and save the scraped data to txt, json, and an existing MySQL database.

Objective analysis:

This time we are going to crawl China Weather Net: http://www.weather.com.cn/
Click through to a city's weather page, for example Hefei: http://www.weather.com.cn/weather/101220101.shtml
What we want to scrape is shown in the screenshot below: Hefei's seven-day forecast:

 

Data filtering:

Use the Chrome developer tools to locate the corresponding element on the page:

 

You can see that all the data we need is wrapped in:

<ul class="t clearfix">

We could use bs4, XPath, CSS, or any other selector to locate this node and filter out the data.
In the spirit of learning something new, the code in this article uses XPath.
Here we can write:

response.xpath('//ul[@class="t clearfix"]')
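
Before writing the spider, it is worth verifying the selector interactively in the Scrapy shell (a quick sanity check; one <li> per forecast day is expected under this node):

$ scrapy shell http://www.weather.com.cn/weather/101220101.shtml
>>> # the date headings, one per forecast day
>>> response.xpath('//ul[@class="t clearfix"]/li/h1//text()').extract()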

Implementation of the Scrapy framework:

  1. Create the Scrapy project and spider:

    $ scrapy startproject weather
    $ cd weather
    $ scrapy genspider HFtianqi www.weather.com.cn/weather/101220101.shtml
    

    That completes the preparatory work.
    Take a look at the current directory:

    .
    ├── scrapy.cfg
    └── weather
        ├── __init__.py
        ├── __pycache__
        │   ├── __init__.cpython-36.pyc
        │   └── settings.cpython-36.pyc
        ├── items.py
        ├── middlewares.py
        ├── pipelines.py
        ├── settings.py
        └── spiders
            ├── HFtianqi.py
            ├── __init__.py
            └── __pycache__
                └── __init__.cpython-36.pyc
    
    4 directories, 11 files
    
  2. Write items.py:

    Let's write the items first. It's very simple: just declare the fields you want to extract:

    import scrapy
        
    class WeatherItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        date = scrapy.Field()
        temperature = scrapy.Field()
        weather = scrapy.Field()
        wind = scrapy.Field()
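
    An Item behaves like a dict with a fixed set of keys, so you can sanity-check it in a plain Python shell before writing the spider (a quick illustration, not part of the project code):

    from weather.items import WeatherItem

    # Only the fields declared in WeatherItem can be assigned
    item = WeatherItem()
    item['date'] = '20 March (Friday)'   # sample value, for illustration only
    print(dict(item))                    # -> {'date': '20 March (Friday)'}
    # item['city'] = 'Hefei'             # would raise KeyError: undeclared field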
    
  3. Write Spider:

    This part is the core of the whole crawler!

    The main job is:

    Filter the Response that the Downloader sends back, and return the extracted data to the PIPELINE for processing.

    Let's look at the code:

    # -*- coding: utf-8 -*-
    import scrapy
        
    from weather.items import WeatherItem
        
    class HftianqiSpider(scrapy.Spider):
        name = 'HFtianqi'
        allowed_domains = ['www.weather.com.cn']
        start_urls = ['http://www.weather.com.cn/weather/101220101.shtml']
        
        def parse(self, response):
            '''
            Filter the information we need out of the response:
            date        = the date
            temperature = the day's temperature
            weather     = the day's weather
            wind        = the day's wind direction and force
            '''
        
            # First, create a list to save daily information
            items = []
        
            # Find the <ul> node that holds the weather information
            day = response.xpath('//ul[@class="t clearfix"]')
        
            # Loop through the daily information:
            for i in range(7):
                # Create a WeatherItem to hold this day's results
                item = WeatherItem()

                # Pick the fields we need out of the (i+1)-th <li>
                item['date'] = day.xpath(f'./li[{i + 1}]/h1//text()').extract()[0]

                item['temperature'] = day.xpath(f'./li[{i + 1}]/p[@class="tem"]/i/text()').extract()[0]

                item['weather'] = day.xpath(f'./li[{i + 1}]/p[@class="wea"]/text()').extract()[0]

                item['wind'] = day.xpath(f'./li[{i + 1}]/p[@class="win"]/em/span/@title').extract()[0] + day.xpath(f'./li[{i + 1}]/p[@class="win"]/i/text()').extract()[0]
                
                items.append(item)
                
            return items
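
    Returning a list of items works, but the more idiomatic Scrapy style is to yield each item as soon as it is built; the framework handles both the same way. A sketch of the equivalent generator-style parse (same XPaths as above):

    def parse(self, response):
        day = response.xpath('//ul[@class="t clearfix"]')
        # Yield one item per day instead of collecting them in a list
        for i in range(7):
            item = WeatherItem()
            li = day.xpath(f'./li[{i + 1}]')
            item['date'] = li.xpath('./h1//text()').extract_first()
            item['temperature'] = li.xpath('./p[@class="tem"]/i/text()').extract_first()
            item['weather'] = li.xpath('./p[@class="wea"]/text()').extract_first()
            item['wind'] = (li.xpath('./p[@class="win"]/em/span/@title').extract_first('') +
                            li.xpath('./p[@class="win"]/i/text()').extract_first(''))
            yield item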
    
  4. Write PIPELINE:

    We know that pipelines.py is where the data scraped by the spider gets processed.
    In general, we store the data locally in one of these forms:

    • Text form: the most basic storage method
    • json format: easy to call
    • Database: the storage method selected when the data volume is large

    TXT (text) format:

    import os
    import json
    import codecs
    import pymysql
        
    class WeatherPipeline(object):
        def process_item(self, item, spider):
        
            print(item)
            # Get current working directory
            base_dir = os.getcwd()
            # Write to weather.txt under the data directory (the directory needs to exist in advance)
            filename = base_dir + '/data/weather.txt'
        
            # Open the file in append mode and write the fields
            with open(filename, 'a') as f:
                f.write(item['date'] + '\n')
                f.write(item['temperature'] + '\n')
                f.write(item['weather'] + '\n')
                f.write(item['wind'] + '\n\n')
        
            return item
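
    The pipeline above assumes the data directory already exists. If you would rather create it on demand, a small guard at the top of process_item (or in an open_spider hook) will do; a minimal sketch:

    import os

    # Create the data directory under the current working directory if it is missing
    data_dir = os.path.join(os.getcwd(), 'data')
    os.makedirs(data_dir, exist_ok=True)
    filename = os.path.join(data_dir, 'weather.txt')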
    

    json format data:

    To output the data in json format, the most convenient approach is to add another pipeline class:

    class W2json(object):
        def process_item(self, item, spider):
            '''
            Save the crawled information as json
            so that other programs can easily consume it
            '''
            base_dir = os.getcwd()
            filename = base_dir + '/data/weather.json'
        
            # Open the json file and write the data with json.dumps
            # Note the ensure_ascii=False argument; otherwise non-ASCII text would be stored as escape sequences like "\xe15"
            with codecs.open(filename, 'a') as f:
                line = json.dumps(dict(item), ensure_ascii=False) + '\n'
                f.write(line)
        
            return item
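
    Since each item is written as one JSON object per line (JSON Lines style), the file is easy to read back; a quick check, assuming data/weather.json has already been produced by a crawl:

    import json

    # Each line of the output file is an independent JSON object
    with open('data/weather.json', encoding='utf-8') as f:
        records = [json.loads(line) for line in f if line.strip()]
    print(len(records))   # number of forecast entries written so far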
    

    Database format (mysql):

    Python has good support for all the mainstream databases,
    but the free MySQL database is the most commonly used, so that is what we store to here.

    • To install mysql locally:

      Both Linux and macOS have powerful package managers, such as apt and brew.

      Windows can download the installation package directly from the official website.

      Since I'm on a Mac, I'll show the macOS installation.

      $ brew install mysql
      

      During the installation it will ask you to set a password for the root user.

      The root here is not the system-level superuser but the superuser of the MySQL database.
      After installation the MySQL service is started by default;
      if you restart your computer, you need to start it again like this (macOS):

      $ mysql.server start
      
    • Log in to mysql and create a database for the crawled data:

      # Log in to mysql
      $ mysql -uroot -p
      
      # Create the database ScrapyDB with utf8 encoding; each statement ends with ';'
      CREATE DATABASE ScrapyDB CHARACTER SET 'utf8';
      
      # Switch to the database you just created:
      use ScrapyDB;
      
      # Create the table with the fields we need; the field names should match those in our code so the sql statements are easy to write
      CREATE TABLE weather(
      id INT AUTO_INCREMENT,
      date char(24),
      temperature char(24),
      weather char(24),
      wind char(24),
      PRIMARY KEY(id)) ENGINE=InnoDB DEFAULT CHARSET='utf8';
      

      Let's see what the weather table looks like:

      show columns from weather;
      -- or: desc weather;
      
    • Install mysql module of Python:

      pip install pymysql
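
      Before wiring pymysql into the pipeline, you can confirm it reaches the database with a short standalone script (a minimal check; replace the password with your own):

      import pymysql

      # Connect to the local ScrapyDB database and run a trivial query
      connection = pymysql.connect(host='127.0.0.1', user='root',
                                   passwd='********', db='ScrapyDB',
                                   charset='utf8mb4')
      try:
          with connection.cursor() as cursor:
              cursor.execute('SELECT VERSION()')
              print(cursor.fetchone())
      finally:
          connection.close()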
      

      Finally, we write the following pipeline class:

      class W2mysql(object):
          def process_item(self, item, spider):
              '''
              Save the crawled information to mysql
              '''
      
              # Take out the data in the item
              date = item['date']
              temperature = item['temperature']
              weather = item['weather']
              wind = item['wind']
      
              # Establish a connection with the local scrapyDB database
              connection = pymysql.connect(
                  host='127.0.0.1',  # Connected to local database
                  user='root',        # Own mysql user name
                  passwd='********',  # Own password
                  db='ScrapyDB',      # Database name
                  charset='utf8mb4',     # Default encoding method:
                  cursorclass=pymysql.cursors.DictCursor)
      
              try:
                  with connection.cursor() as cursor:
                      # Build the sql insert statement
                      sql = """INSERT INTO weather(date,temperature,weather,wind)
                              VALUES (%s, %s, %s, %s)"""
                      # Execute the sql statement
                      # The second argument of execute() fills in the %s placeholders, usually as a tuple
                      cursor.execute(
                          sql, (date, temperature, weather, wind))
      
                  # Submit the record inserted this time
                  connection.commit()
              finally:
                  # Close connection
                  connection.close()
      
              return item
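
      Opening a new connection for every item is fine for a small crawl like this, but Scrapy pipelines also provide open_spider/close_spider hooks, so the connection can be created once and reused. A sketch of that variant (same credentials as above; the class name W2mysqlPooled is just an example):

      class W2mysqlPooled(object):
          '''Variant of W2mysql that keeps one connection open for the whole crawl.'''

          def open_spider(self, spider):
              # Called once when the spider starts
              self.connection = pymysql.connect(
                  host='127.0.0.1', user='root', passwd='********',
                  db='ScrapyDB', charset='utf8mb4',
                  cursorclass=pymysql.cursors.DictCursor)

          def close_spider(self, spider):
              # Called once when the spider finishes
              self.connection.close()

          def process_item(self, item, spider):
              sql = """INSERT INTO weather(date,temperature,weather,wind)
                       VALUES (%s, %s, %s, %s)"""
              with self.connection.cursor() as cursor:
                  cursor.execute(sql, (item['date'], item['temperature'],
                                       item['weather'], item['wind']))
              self.connection.commit()
              return item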
      
  5. Write Settings.py

    We need to register our pipelines in Settings.py;
    only then will they actually run.

    All that is required is the ITEM_PIPELINES dict below;
    the number value can be customized, and the smaller the number, the higher the priority.

    BOT_NAME = 'weather'
    
    SPIDER_MODULES = ['weather.spiders']
    NEWSPIDER_MODULE = 'weather.spiders'
    
    ROBOTSTXT_OBEY = True
    
    ITEM_PIPELINES = {
       'weather.pipelines.WeatherPipeline': 300,
       'weather.pipelines.W2json': 400,
       'weather.pipelines.W2mysql': 300,
    }
    
  6. Get the project running:

    $ scrapy crawl HFtianqi
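
    Scrapy's built-in feed exports can also dump the items without any custom pipeline, which is handy for a quick check (the output path here is just an example):

    $ scrapy crawl HFtianqi -o data/weather_feed.json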
    
  7. Results:

    Text format:

     

    json format:

     

    Database format:

     

That's it for this article. It mainly introduced how to save the crawled data in different ways by customizing PIPELINE classes.

The text and pictures in this article come from the Internet and my own work. They are for learning and exchange only and have no commercial use. The copyright belongs to the original author; if you have any concerns, please contact us promptly.

Topics: Python MySQL Database JSON