The Third Major Assignment of Data Acquisition

Posted by rsammy on Sun, 07 Nov 2021 07:24:04 +0100

Assignment 1

  • Requirement
    Specify a website and crawl all the images on it, for example China Meteorological Network (http://www.weather.com.cn). Use single-threaded and multi-threaded crawling respectively. (The number of images to crawl is limited to the last three digits of the student ID.)

  • Output content
    Output the URL of each downloaded image on the console, store the downloaded images in the images subfolder, and provide a screenshot.

    Result display

  • Single-threaded output information

  • Single-threaded image saving results

  • Multi-threaded output information

  • Multi-threaded image saving results

  • Code to get the sub-links of the web page

      import re
      import urllib.request

      def spider1(url):
          req = urllib.request.Request(url=url)
          res = urllib.request.urlopen(req)
          data = res.read()
          data = data.decode("utf-8")
          ## Regular-expression match for the sub-links on the page
          links = re.findall(r'target="_blank" href="(.*?)"', data)
          ## Return the list of sub-links
          return links
    
    
  • Single-threaded: crawl the images in the sub-links and save them

     def spider2(url):
         ## Counter for the number of saved pictures
         global sum
         req = urllib.request.Request(url=url)
         res = urllib.request.urlopen(req)
         data = res.read()
         data = data.decode("utf-8")
         ## Match the image urls on the sub-page
         imgs = re.findall(r'class="lunboimgages" src="(.*?)"', data)
         for j in imgs:
             print(str(sum) + j)
             j = j.replace("\\", "")
             ## Path the picture is saved to
             filename = "D:/weather/picture" + str(sum) + ".jpg"
             sum += 1
             ## Download and save the picture
             urllib.request.urlretrieve(j, filename)
             if sum == 119:
                 return
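
    For completeness, a minimal single-threaded driver tying spider1 and spider2 together might look like the sketch below; the entry URL and the module-level sum counter are assumptions inferred from the snippets above, not the author's original main program.

     ## Hypothetical driver (sketch): crawl every sub-link until the picture quota is reached
     sum = 0  # global picture counter used by spider2

     if __name__ == "__main__":
         start_url = "http://p.weather.com.cn/tqxc/index.shtml"  # assumed entry page
         for sub_url in spider1(start_url):
             spider2(sub_url)      # download the pictures of one sub-page
             if sum >= 119:        # stop once 119 pictures (last 3 digits of the student ID) are saved
                 break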
    
  • Multi-threaded: crawl the images in the sub-pages and save them

     import threading

     def spider2(url):
         global sum
         threads = []
         req = urllib.request.Request(url=url)
         res = urllib.request.urlopen(req)
         data = res.read()
         data = data.decode("utf-8")
         imgs = re.findall(r'class="lunboimgages" src="(.*?)"', data)
         for j in imgs:
             print(str(sum) + j)
             j = j.replace("\\", "")
             filename = "D:/weather/picture" + str(sum) + ".jpg"
             sum += 1
             ## Create a thread that downloads one picture
             T = threading.Thread(target=download, args=(j, filename))
             T.daemon = False
             ## Start the thread
             T.start()
             threads.append(T)
             if sum == 119:
                 ## break instead of return so the started threads are still joined below
                 break
         ## Wait for every download thread to finish
         for thread in threads:
             thread.join()
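
    The download helper used as the thread target above is not shown in the post; presumably it just wraps urlretrieve. A minimal sketch under that assumption:

     ## Sketch of the assumed download helper: save one picture to disk
     def download(url, filename):
         try:
             urllib.request.urlretrieve(url, filename)
         except Exception as err:
             print("download failed:", url, err)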
    
    
  • Gitee link
    ----Assignment 1-----

    experience

  • Because one page does not contain enough pictures, the pictures on the sub-pages also need to be fetched.
    Compared with the previous homework, the only extra step is obtaining the sub-pages, which is not very difficult.
    Note that when using multithreading, you should iterate over the threads and join them (see the sketch below).
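
    A minimal sketch of the same start-then-join idea using concurrent.futures, which waits for all workers automatically (the download helper and the (url, filename) pairs are assumptions, not part of the original code):

     from concurrent.futures import ThreadPoolExecutor

     ## Equivalent of "start every thread, then join them all":
     ## leaving the with-block waits for every submitted download to finish.
     def download_all(pairs):  # pairs: iterable of (img_url, filename) tuples (assumed)
         with ThreadPoolExecutor(max_workers=8) as pool:
             for img_url, filename in pairs:
                 pool.submit(download, img_url, filename)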

Assignment 2

  • Requirement
    Use the Scrapy framework to reproduce Assignment ①.

  • Output content
    Same as Assignment ①.

    Result display

  • Output information

  • Save picture information

  • Code to get the sub-pages

      def start_requests(self):
          url = 'http://p.weather.com.cn/tqxc/index.shtml'
          yield Request(url=url, callback=self.parse1)

      ## Find the sub-page urls and call back parse2
      def parse1(self, response):
          global sum
          data = response.body.decode()
          selector = scrapy.Selector(text=data)
          urls = selector.xpath("//div[@class='oi']/div[@class='tu']/a/@href")
          for i in range(len(urls)):
              url = urls[i].extract()
              if sum != 119:
                  yield scrapy.Request(url=url, callback=self.parse2)
    
  • Get the picture links from the sub-pages and save them to the item

     def parse2(self, response):
         global sum
         data = response.body.decode()
         selector = scrapy.Selector(text=data)
         # Get the picture urls on the sub-page
         pics_url = selector.xpath("//li[@class='child']/a[@class='img_back']/img/@src")
         for i in pics_url:
             sum += 1
             url = i.extract()
             item = FirstspiderItem()
             if sum < 119:
                 item['url'] = url
                 # Only yield the item when it actually carries a url,
                 # otherwise the pipeline would fail on a missing field
                 yield item
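
    The two callbacks above live inside a Scrapy spider class whose header is not shown in the post. A minimal skeleton, assuming a standard project layout (the class name and spider name are hypothetical):

      import scrapy
      from scrapy.http import Request
      from firstspider.items import FirstspiderItem

      sum = 0  # module-level picture counter shared by parse1 and parse2

      class WeatherSpider(scrapy.Spider):  # hypothetical class/spider name
          name = "weather"

          # start_requests, parse1 and parse2 from the snippets above go here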
    
  • pipelines code

      import requests

      class FirstspiderPipeline:
          def open_spider(self, spider):
              self.count = 1

          def process_item(self, item, spider):
              url = item['url']
              resp = requests.get(url)
              img = resp.content
              # Raw string so the backslashes in the Windows path are not treated as escapes
              with open(r'D:\image\%d.jpg' % self.count, 'wb') as f:
                  f.write(img)
                  print('Picture %d downloaded successfully' % self.count)
                  self.count += 1
              return item
    
  • item code

       class FirstspiderItem(scrapy.Item):
           url = scrapy.Field()
    
    
  • settings code

       BOT_NAME = 'firstspider'
       SPIDER_MODULES = ['firstspider.spiders']
       NEWSPIDER_MODULE = 'firstspider.spiders'
       FEED_EXPORT_ENCODING = 'gb18030'
       ITEM_PIPELINES = {
           'firstspider.pipelines.FirstspiderPipeline': 300,
       }
       IMAGES_STORE = r'D:\image'
       IMAGES_URLS_FIELD = 'url'
       DEFAULT_REQUEST_HEADERS = {
           'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36',
           'Cookie': 'GRIDSUMID=c34417679352e71a457dfdfb9b42d24a',
           'Accept': 'image/avif,image/webp,image/apng,image/svg+xml,image/*,*/*;q=0.8',
           'Accept-Language': 'zh-CN,zh;q=0.9'
       }
       ROBOTSTXT_OBEY = False
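
    Note that IMAGES_STORE and IMAGES_URLS_FIELD are settings of Scrapy's built-in ImagesPipeline; with only the custom FirstspiderPipeline enabled they have no effect here. If one wanted Scrapy to save the pictures itself, a sketch of the change would be (the item's url field would then have to hold a list of URLs, and Pillow must be installed):

       # Alternative (sketch): let Scrapy's ImagesPipeline download the pictures
       ITEM_PIPELINES = {
           'scrapy.pipelines.images.ImagesPipeline': 1,
       }
       IMAGES_STORE = r'D:\image'
       IMAGES_URLS_FIELD = 'url'  # this field must then contain a list of image URLs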
    
  • Gitee link
    ------Assignment 2-----

experience

  • Scrapy crawls very fast.
    Saving a picture while crawling works much like the multithreaded version.
    Having the callbacks pass work along the crawl-pipeline chain is not difficult; the key is to configure settings correctly.
    The framework structure of Scrapy is much clearer.

Assignment 3

  • Requirement
    Crawl Douban movie data using Scrapy and XPath, store the content in a database, and store the pictures under the imgs path.

  • Candidate web address
    https://movie.douban.com/top250

  • Output content

    No. | Name | Director | Performer | Brief introduction | Score | Cover picture
    1 | The Shawshank Redemption | Frank Darabont | Tim Robbins | Hope sets people free | 9.7 | ./imgs/xsk.jpg
    2 | ~ | ~ | ~ | ~ | ~ | ~

    Result display

  • database

  • Saved pictures

  • Corresponding functions in moive.py

      ## Page turning: generate a request for each of the 10 list pages
      def start_requests(self):
          for i in range(10):
              time.sleep(random.uniform(0.7, 2.0))
              url = 'https://movie.douban.com/top250?start=' + str(i * 25) + '&filter='
              yield Request(url=url, callback=self.parse1)

      ## Use XPath to select the fields of each movie and pass them into the item
      def parse1(self, response):
          global count
          data = response.body.decode()
          selector = scrapy.Selector(text=data)
          # Get the <li> node of each movie
          movies = selector.xpath("//ol[@class='grid_view']/li")
          ## Select the corresponding content under each movie node
          for i in movies:
              image = i.xpath("./div[@class='item']/div[@class='pic']/a/img/@src").extract_first()
              name = i.xpath("./div[@class='item']/div[@class='info']/div[@class='hd']//span[@class='title']/text()").extract_first()
              directorandactor = i.xpath("./div[@class='item']/div[@class='info']/div[@class='bd']/p[@class='']/text()").extract_first()
              desp = i.xpath("./div[@class='item']/div[@class='info']/div[@class='bd']/p[@class='quote']/span/text()").extract_first()
              grade = i.xpath("./div[@class='item']/div[@class='info']/div[@class='bd']/div/span[@class='rating_num']/text()").extract_first()
              print(image)
              ## Normalize the director/actor string so the regex below can pick out the two fields
              ## (the Douban page is Chinese, so the labels are 导演 and 主演)
              directorandactor = directorandactor.replace(' ', '')
              directorandactor = directorandactor.replace('\n', '')
              directorandactor = directorandactor + '\n'
              director = re.findall(r'导演:(.*?)\s', directorandactor)
              actor = re.findall(r'主演:(.*?)\n', directorandactor)
              count += 1
              item = MovieSpiderItem()
              # Save to the corresponding item fields
              item['count'] = str(count)
              item['mname'] = str(name)
              item['director'] = str(director[0])
              ## The actor list may be empty: animations have no actors, or the
              ## director's name is so long that the actors are not displayed
              if len(actor) != 0:
                  item['actor'] = str(actor[0])
              else:
                  item['actor'] = 'null'
              item['story'] = str(desp)
              item['grade'] = str(grade)
              item['image'] = str(image)
              yield item
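
    The item class itself is not shown in this section; judging from the fields used above and in the pipeline, it presumably looks like the following sketch.

      class MovieSpiderItem(scrapy.Item):
          count = scrapy.Field()
          mname = scrapy.Field()
          director = scrapy.Field()
          actor = scrapy.Field()
          story = scrapy.Field()
          grade = scrapy.Field()
          image = scrapy.Field()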
    
  • settings code

     import os

     BOT_NAME = 'movie_spider'
     SPIDER_MODULES = ['movie_spider.spiders']
     NEWSPIDER_MODULE = 'movie_spider.spiders'
     FEED_EXPORT_ENCODING = 'gb18030'
     ITEM_PIPELINES = {
         'movie_spider.pipelines.MovieSpiderPipeline': 300,
     }
     IMAGES_STORE = r'D:\PY\movie_spider\image'
     IMAGES_URLS_FIELD = 'image'
     HTTPERROR_ALLOWED_CODES = [301]
     DEFAULT_REQUEST_HEADERS = {
         'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36'
     }
     ROBOTSTXT_OBEY = False
    
    
  • pipelines class

         import sqlite3
         import requests

         class MovieDB:
             # Open the database connection; create the table if it does not exist yet
             def openDB(self):
                 self.con = sqlite3.connect("movies.db")
                 self.cursor = self.con.cursor()
                 try:
                     ## Create the movies table
                     self.cursor.execute("create table movies "
                                         "(Name varchar(20),"
                                         "Director varchar(30),Actors varchar(60),"
                                         "Desp varchar(100),Grade varchar(8),ImgPath varchar(64))")
                 except:
                     # The table already exists, so just clear it
                     self.cursor.execute("delete from movies")

             ## Close the database
             def closeDB(self):
                 self.con.commit()
                 self.con.close()

             ## Insert one record
             def insert(self, Name, Director, Actors, Desp, Grade, ImgPath):
                 try:
                     self.cursor.execute("insert into movies (Name,Director,Actors,Desp,Grade,ImgPath) values (?,?,?,?,?,?)",
                                         (Name, Director, Actors, Desp, Grade, ImgPath))
                 except Exception as err:
                     print(err)

         ## Pipeline class
         class MovieSpiderPipeline:
             def open_spider(self, spider):
                 self.count = 1
                 self.db = MovieDB()
                 self.db.openDB()

             ## Save the picture and write the record to the database
             def process_item(self, item, spider):
                 # Download the cover picture
                 path = r"D:\PY\movie_spider\image"
                 url = item['image']
                 resp = requests.get(url)  # Get the binary content of the picture
                 img = resp.content
                 fm_path = r'image\image%d.jpg' % self.count
                 with open(path + r'\image%d.jpg' % self.count, 'wb') as f:
                     f.write(img)
                     print("Picture %d downloaded successfully" % self.count)
                 self.count += 1
                 # Save the record to the database
                 self.db.insert(item['mname'], item['director'], item['actor'], item['story'], item['grade'],
                                fm_path)
                 return item

             def close_spider(self, spider):
                 self.db.closeDB()
                 print("End crawling")
    
  • Gitee link
    -----Assignment 3-----

experience

  • Originally the plan was to follow each picture link to the movie's own introduction page and crawl the content there, but Douban applies anti-crawling measures on those pages: the page content obtained by Python differs from what the browser shows, and the IP was blocked after a few requests, so that approach was abandoned.
    As a result some actor information is missing, and this small part of the content has to be given up for now.
    The database part of the code was written quickly by referring to earlier work.
    The rest of the code is similar to Assignment 2. In the future I should find a way to write crawlers that cope with anti-crawling measures.