26 data analysis cases - the fifth stop: data collection based on the Scrapy framework

Posted by rar_ind on Sat, 25 Dec 2021 00:55:21 +0100

Case environment

  • Python: Python 3.x;

Data description

  • title: course title.
  • image_url: title picture address (image URL).
  • properties: course nature.
  • stage: course stage.
  • enrollment: number of course applicants.

Data package

Link: https://pan.baidu.com/s/1-DUUUAOfpC4Gs5DAaHcgUg
Extraction code: 5u6s

Experimental steps

Step 1: Page analysis

Before crawling data from a web page, we must first learn how to analyze the structure of the page and find out which tags contain the data we want. Only with a full understanding of the overall page structure can we crawl the data effectively and quickly. This hands-on case will crawl course page data from imooc (imooc.com).
1. First, open a browser and visit http://www.imooc.com/course/list to open the imooc course list page.

2. Right-click on any blank area of the page to bring up the context menu. The effect is:

3. Click "View page source" to view the HTML source of the page and analyze its structure.

As can be clearly seen from the figure,

  • The details of each course are contained in an a tag whose class is "item free";
  • The course title is contained in the p tag with class "title ellipsis2";
  • The current stage of the course and the total number of applicants are contained in the p tag with class "one";
  • The course nature is contained in the span tag with class "l";
  • The title picture address is in the style attribute value of the div tag with class "img".
    So far, we have a basic understanding of the data structure in the page; a quick way to verify these selectors is shown in the sketch below. After that, we can start data collection.
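
As a quick, optional sanity check, Scrapy's interactive shell can be used to try these XPath expressions against the live page before writing any spider code. This is only a minimal sketch; the class values come from the page analysis above and may change if imooc updates its layout.

# Run in a terminal: open an interactive shell against the course list page
scrapy shell "http://www.imooc.com/course/list"

# Inside the shell, try the selectors identified above
response.xpath('//a[@class="item free "]')                        # one node per course card
response.xpath('//p[@class="title ellipsis2"]/text()').extract()  # course titles
response.xpath('//div[@class="img"]/@style').extract()            # style attributes holding image URLs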

Step 2: Write code to collect the data

1. Open the command window and create a Scrapy crawler project named "InternetWorm". The command is as follows.

scrapy startproject InternetWorm

The result is:
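
For reference, the startproject command typically generates a project skeleton like the following (exact contents may vary slightly between Scrapy versions):

InternetWorm/
    scrapy.cfg            # deployment configuration
    InternetWorm/
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # spider modules go here
            __init__.py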

2. After the project is created, enter the root directory of the InternetWorm project and generate a spider named MySpider. The commands are as follows.

cd InternetWorm
scrapy genspider MySpider www.imooc.com/course/list

The results are shown in the figure.

Open the MySpider.py file generated in the InternetWorm/spiders/ directory; it contains the following skeleton code.

# -*- coding: utf-8 -*-
import scrapy
class MyspiderSpider(scrapy.Spider):
    name = 'MySpider'
    allowed_domains = ['imooc.com']
    start_urls = ['http://www.imooc.com/course/list/']
    def parse(self, response):
        pass

3. Open items.py, create a class named "CourseItem" in this file, and define the fields for the data to be crawled.

# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy
class ScrapyprojectItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass
class CourseItem(scrapy.Item):
    # Course title
    title = scrapy.Field()
    # Title picture address
    image_url = scrapy.Field()
    # Course nature
    properties = scrapy.Field()
    # Course stage
    stage = scrapy.Field()
    # Course enrollment
    enrollment = scrapy.Field()
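
A scrapy.Item behaves much like a dictionary: fields declared with scrapy.Field() can be assigned and read by key, and only declared fields are accepted. A minimal sketch of how CourseItem will be used later (not part of the project files):

from InternetWorm.items import CourseItem

item = CourseItem()
item['title'] = 'Example course'   # assign a declared field by key
print(item['title'])               # read it back like a dict entry
# item['undeclared'] = 1           # would raise KeyError: only declared fields are allowed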

4. Open the MySpider.py file, make sure allowed_domains is set to ['imooc.com'], then import Selector, parse the Response object with it, and use XPath to select all the course list entries. The code is as follows.

# -*- coding: utf-8 -*-
import scrapy
# Import selector
from scrapy.selector import Selector
class MyspiderSpider(scrapy.Spider):
    name = 'MySpider'
    allowed_domains = ['imooc.com']
    start_urls = ['http://www.imooc.com/course/list/']
    def parse(self, response):
        sel = Selector(response)
        # Use xpath to select all the list contents
        sels = sel.xpath('//a[@class="item free "]')
        print(sels)
        pass
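
A side note, not required for this case: in current Scrapy versions the Response object exposes .xpath() directly, so explicitly wrapping it in a Selector is optional. The following parse() sketch is equivalent to the one above:

    def parse(self, response):
        # response.xpath() returns the same SelectorList as Selector(response).xpath()
        sels = response.xpath('//a[@class="item free "]')
        print(sels)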

5. Run the crawler.

scrapy crawl MySpider


6. If the program runs normally, it means the above steps are configured correctly. Open the MySpider.py file again, import the CourseItem class defined in items.py, instantiate it as a container for the scraped information, and finally traverse the selected list entries, extracting each field and assigning it to this container.

# -*- coding: utf-8 -*-
import scrapy
# Import selector
from scrapy.selector import Selector
# Import the CourseItem class defined in items.py
from InternetWorm.items import CourseItem
class MyspiderSpider(scrapy.Spider):
    name = 'MySpider'
    allowed_domains = ['imooc.com']
    start_urls = ['http://www.imooc.com/course/list/']
    def parse(self, response):
        sel = Selector(response)
        # Use xpath to select all the list contents
        sels = sel.xpath('//a[@class="item free "]')
        # Instance a container to store crawling information
        item = CourseItem()
        # Traverse all lists
        for box in sels:
            # Get Course Title
            item['title'] = box.xpath('.//p[@class="title ellipsis2"]/text()').extract()[0].strip()
            print("Course title:",item['title'])
            # Get Title picture address
            item['image_url'] = box.xpath('.//div[@class="img"]/@style').extract()[0].strip()[23:-2]
            print("Title picture address:","http:"+item['image_url'])
            # Get the nature of the course
            item['properties'] = box.xpath('.//span[@class="l"]/text()').extract()[0].strip()
            print("Course nature:",item['properties'])
            # Get course phase
            item['stage'] = box.xpath('.//p[@class="one"]/text()').extract()[0].strip().split(" · ")[0]
            print("Course stage:",item['stage'])
            # Get course enrollment
            item['enrollment'] = box.xpath('.//p[@class="one"]/text()').extract()[0].strip().split(" · ")[1][:-3]
            print("Number of applicants:",item['enrollment'])
        pass
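
One detail worth spelling out is the [23:-2] slice used for image_url. The style attribute of the image div holds a value of the form background-image: url('...'), so dropping the first 23 characters and the last 2 leaves only the URL. A small, self-contained illustration (the example style string is an assumed value of the same shape, not copied from the page):

# Hypothetical style attribute value with the assumed background-image form
style = "background-image: url('//img1.sycdn.imooc.com/example.jpg')"

# "background-image: url('" is 23 characters; "')" is the trailing 2
image_url = style[23:-2]
print(image_url)             # //img1.sycdn.imooc.com/example.jpg
print("http:" + image_url)   # http://img1.sycdn.imooc.com/example.jpg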


Rerun the project.

scrapy crawl MySpider

7. If you get the same result as above, congratulations: you have successfully crawled the data in the web page. However, only one page of data has been read so far. Next, we continue to modify the code to turn pages automatically. The idea is to use XPath to check whether the pagination area contains a "next page" link. If it does, its href is extracted, combined into the URL of the next page, and passed to scrapy.Request() with parse() as the callback, so the spider keeps crawling course information page by page until no "next page" link exists in the pagination area. Modify the MySpider.py file as follows.

# -*- coding: utf-8 -*-
import scrapy
# Import selector
from scrapy.selector import Selector
# Import the CourseItem class defined in items.py
from InternetWorm.items import CourseItem
pageIndex = 0
class MyspiderSpider(scrapy.Spider):
    name = 'MySpider'
    allowed_domains = ['imooc.com']
    start_urls = ['http://www.imooc.com/course/list']
    def parse(self, response):
        sel = Selector(response)
        # Use xpath to select all the list contents
        sels = sel.xpath('//a[@class="item free "]')
        # Instance a container to store crawling information
        item = CourseItem()
        index = 0
        global pageIndex
        pageIndex += 1
        print('Page ' + str(pageIndex))
        print('----------------------------------------------')
        # Traverse all lists
        for box in sels:
            # Get Course Title
            item['title'] = box.xpath('.//p[@class="title ellipsis2"]/text()').extract()[0].strip()
            # Get Title picture address
            item['image_url'] = "http:"+box.xpath('.//div[@class="img"]/@style').extract()[0].strip()[23:-2]
            # Get the nature of the course
            item['properties'] = box.xpath('.//span[@class="l"]/text()').extract()[0].strip()
            # Get course phase
            item['stage'] = box.xpath('.//p[@class="one"]/text()').extract()[0].strip().split(" · ")[0]
            # Get course enrollment
            item['enrollment'] = box.xpath('.//p[@class="one"]/text()').extract()[0].strip().split(" · ")[1][:-3]
            index += 1
            # Iteratively process the item and return a generator
            yield item
        # Link text of the "next page" button (on the Chinese site this text may appear as '下一页')
        next = u'next page'
        url = response.xpath("//a[contains(text(),'" + next + "')]/@href").extract()
        if url:
            # Combine the information into the url of the next page
            page = 'http://www.imooc.com' + url[0]
            # Return url
            yield scrapy.Request(page, callback=self.parse)
        pass
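
A small robustness note: instead of concatenating 'http://www.imooc.com' with the extracted href by hand, Scrapy's response.urljoin() can build the absolute URL from a relative one. A sketch of the same pagination step using it, equivalent in behavior assuming the href is relative:

        # Inside parse(), replacing the manual string concatenation above
        url = response.xpath("//a[contains(text(),'next page')]/@href").extract()
        if url:
            # urljoin resolves the relative href against the current response URL
            page = response.urljoin(url[0])
            yield scrapy.Request(page, callback=self.parse)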

Re-execute the project.

scrapy crawl MySpider


Finally, the crawled data can be saved to a file.

scrapy crawl MySpider -o data.csv
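
The -o option uses Scrapy's feed exports, so the output format follows the file extension (for example data.json or data.csv). If the exported Chinese text comes out escaped or garbled, one commonly used fix, not configured in this case, is to set the export encoding in settings.py:

# settings.py (optional): write exported feeds as UTF-8
FEED_EXPORT_ENCODING = 'utf-8'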


Follow-up cases are continuously updated

01 HBase crown size query system based on Python
02 civil aviation customer value analysis based on Hive
03 analysis of pharmacy sales data based on python
04 web server log data collection based on Flume and Kafka
05 Muke network data acquisition and processing
06 Linux operating system real-time log collection and processing
07 medical industry case - Analysis of dialectical association rules of TCM diseases
08 education industry case - Analysis of College Students' life data
10 entertainment industry case - advertising revenue regression prediction model
11 network industry case - website access behavior analysis
12 retail industry case - real time statistics of popular goods in stores
13 visualization of turnover data
14 financial industry case - financial data analysis based on stock information of listed companies and its derivative variables
15 visualization of bank credit card risk data
16 Didi city operation analysis
17 happiness index visualization
18 employee active resignation warning model
19 singer recommendation model
20 2020 novel coronavirus pneumonia data analysis
21 Taobao shopping carnival data analysis
22 shared bike data analysis
23 face detection system
24 garment sorting system
25 mask wearing identification system
26 IMDb movie data analysis

Topics: Python, Python crawler