Scrapy: Crawling the Douban Movie Top 250

Posted by JCBarry on Thu, 12 Dec 2019 20:53:55 +0100

Contents of this article

  • Crawl the Douban Movie Top 250 pages. The fields include:
    ranking, title, director, one-sentence description (blank for some movies), rating, number of ratings, release date, country of release, and category
  • Store the scraped data

Scrapy introduction

Scrapy crawler framework tutorial (1) -- getting started with scrapy

Create project

scrapy startproject dbmovie

Create crawler

cd dbmovie
scrapy genspider dbmovie_spider movie.douban.com/top250

Note that the spider name cannot be the same as the project name

Configure the anti-crawling strategy

  • Open the settings.py file and set ROBOTSTXT_OBEY to False.

    ROBOTSTXT_OBEY = False
  • Modify the default request headers (including the User-Agent)

    DEFAULT_REQUEST_HEADERS = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en',
        'Accept-Encoding': 'gzip, deflate, br',
        'Cache-Control': 'max-age=0',
        'Connection': 'keep-alive',
        'Host': 'movie.douban.com',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
    }
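Beyond a fixed User-Agent, a common next step is rotating it per request. Below is a minimal downloader-middleware sketch (the class name and UA pool are hypothetical, not from the original post; it would be enabled via DOWNLOADER_MIDDLEWARES in settings.py). It needs no scrapy imports, because scrapy passes in its own Request objects:

```python
import random

# Pool of desktop User-Agent strings to rotate through (extend as needed).
USER_AGENTS = [
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
]


class RandomUserAgentMiddleware:
    """Hypothetical downloader middleware: sets a random User-Agent
    header on every outgoing request."""

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(USER_AGENTS)
```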

Run crawler

scrapy crawl dbmovie_spider

Define item

Based on the earlier analysis, we need to extract nine fields in total. Define the item in the items.py file:

import scrapy

class DoubanItem(scrapy.Item):
    # Ranking
    ranking = scrapy.Field()
    # Title
    title = scrapy.Field()
    # Director
    director = scrapy.Field()
    # One-sentence description (may be empty)
    movie_desc = scrapy.Field()
    # Rating
    rating_num = scrapy.Field()
    # Number of ratings
    people_count = scrapy.Field()
    # Release date
    online_date = scrapy.Field()
    # Country of release
    country = scrapy.Field()
    # Category
    category = scrapy.Field()

Field extraction

You need some XPath knowledge here; to save effort, you can get the expressions straight from Chrome's developer tools
How to get an XPath in the Chrome browser -- through the developer tools
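Selectors copied from Chrome can be sanity-checked offline before they go into the spider. The stdlib ElementTree module supports only a small XPath subset, but attribute predicates like `[@class="item"]` work; the HTML snippet below is a simplified, hypothetical stand-in for one Top 250 entry:

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical stand-in for one entry of the Top 250 list page.
snippet = """
<div>
  <div class="item">
    <div class="pic"><em>1</em></div>
    <div class="info">
      <div class="hd"><a><span>The Shawshank Redemption</span></a></div>
    </div>
  </div>
</div>
"""

root = ET.fromstring(snippet)
# ElementTree supports a limited XPath subset -- enough for a quick check.
for movie in root.findall('.//div[@class="item"]'):
    ranking = movie.find('div[@class="pic"]/em').text
    title = movie.find('div[@class="info"]/div/a/span').text
    print(ranking, title)
```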

def parse(self, response):
    item = DoubanItem()
    movies = response.xpath('//div[@class="item"]')
    for movie in movies:
        # Ranking
        item['ranking'] = movie.xpath('div[@class="pic"]/em/text()').extract()[0]
        # Title (the page lists several titles; take the first one)
        titles = movie.xpath('div[@class="info"]/div[1]/a/span/text()').extract()[0]
        item['title'] = titles
        # Get director information
        info_director = movie.xpath('div[2]/div[2]/p[1]/text()[1]').extract()[0].replace("\n", "").replace(" ", "").split('\xa0')[0]
        item['director'] = info_director
        # Release date
        online_date = movie.xpath('div[2]/div[2]/p[1]/text()[2]').extract()[0].replace("\n", "").replace('\xa0', '').split("/")[0].replace(" ", "")
        # Country of release
        country = movie.xpath('div[2]/div[2]/p[1]/text()[2]').extract()[0].replace("\n", "").split("/")[1].replace('\xa0', '')
        # Genre / category
        category = movie.xpath('div[2]/div[2]/p[1]/text()[2]').extract()[0].replace("\n", "").split("/")[2].replace('\xa0', '').replace(" ", "")
        item['online_date'] = online_date
        item['country'] = country
        item['category'] = category
        movie_desc = movie.xpath('div[@class="info"]/div[@class="bd"]/p[@class="quote"]/span/text()').extract()
        if movie_desc:  # Some movies have no one-sentence description; without this check the spider crashes on them
            item['movie_desc'] = movie_desc[0]
        else:
            item['movie_desc'] = ' '

        item['rating_num'] = movie.xpath('div[@class="info"]/div[@class="bd"]/div[@class="star"]/span[@class="rating_num"]/text()').extract()[0]
        item['people_count'] = movie.xpath('div[@class="info"]/div[@class="bd"]/div[@class="star"]/span[4]/text()').extract()[0]
        yield item
    # Get next page
    next_url = response.xpath('//span[@class="next"]/a/@href').extract()
    
    if next_url:
        next_url = 'https://movie.douban.com/top250' + next_url[0]
        yield scrapy.Request(next_url, callback=self.parse, dont_filter=True)
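The chained replace/split cleanup in parse() can be checked on its own. The sample string below is a hypothetical stand-in for Douban's "year / country / genre" text node, which is padded with newlines, spaces, and non-breaking spaces (\xa0):

```python
# Hypothetical raw text node, padded the way the page pads it.
raw = "\n                1994\xa0/\xa0USA\xa0/\xa0Crime\n            "

# Same cleanup chain as in parse():
online_date = raw.replace("\n", "").replace('\xa0', '').split("/")[0].replace(" ", "")
country = raw.replace("\n", "").split("/")[1].replace('\xa0', '')
category = raw.replace("\n", "").replace('\xa0', '').split("/")[2].replace(" ", "")

print(online_date, country, category)  # 1994 USA Crime
```

Note that stripping all spaces from the category also merges multi-word genres ("Crime Drama" becomes "CrimeDrama"); that is harmless for the Chinese page, where genre names contain no spaces.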

Store the data in MySQL

Watch out for MySQL error 1064, which appears when a column name in the table collides with a MySQL keyword
Writing to a database in the Scrapy tutorial

import pymysql

def dbHandle():
    conn = pymysql.connect(
        host='localhost',
        user='root',
        passwd='pwd',
        db="dbmovie",
        charset='utf8',
        use_unicode=True  # return str instead of bytes
    )
    return conn
    return conn

class DoubanPipeline(object):
    def process_item(self, item, spider):
        dbObject = dbHandle()
        cursor = dbObject.cursor()
        sql = "insert into db_info(ranking,title,director,movie_desc,rating_num,people_count,online_date,country,category) values(%s,%s,%s,%s,%s,%s,%s,%s,%s)"

        try:
            cursor.execute(sql, (item['ranking'], item['title'], item['director'], item['movie_desc'], item['rating_num'], item['people_count'], item['online_date'], item['country'], item['category']))
            dbObject.commit()
        except Exception as e:
            print(e)
            dbObject.rollback()
        finally:
            # Close the cursor and connection so each item does not leak a connection
            cursor.close()
            dbObject.close()

        return item
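The same parameterized-insert pattern can be tried without a MySQL server, using the stdlib sqlite3 module (sqlite3 uses ? placeholders where pymysql uses %s; the in-memory table and the item values here are hypothetical):

```python
import sqlite3

# In-memory database stands in for the MySQL "dbmovie" schema.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE db_info (
        ranking TEXT, title TEXT, director TEXT, movie_desc TEXT,
        rating_num TEXT, people_count TEXT, online_date TEXT,
        country TEXT, category TEXT
    )
""")

item = {  # hypothetical scraped item
    'ranking': '1', 'title': 'The Shawshank Redemption',
    'director': 'Frank Darabont', 'movie_desc': 'Hope is a good thing.',
    'rating_num': '9.7', 'people_count': '2000000',
    'online_date': '1994', 'country': 'USA', 'category': 'Crime',
}

sql = ("INSERT INTO db_info(ranking,title,director,movie_desc,rating_num,"
       "people_count,online_date,country,category) VALUES(?,?,?,?,?,?,?,?,?)")
cur.execute(sql, (item['ranking'], item['title'], item['director'],
                  item['movie_desc'], item['rating_num'], item['people_count'],
                  item['online_date'], item['country'], item['category']))
conn.commit()
print(cur.execute("SELECT title FROM db_info").fetchone()[0])
```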

Simple strategies for dealing with anti-crawler measures

Scrapy crawler: a full analysis of strategies for getting past anti-crawler defenses

Topics: Python xml MySQL SQL encoding