The Third Data Acquisition Experiment

Posted by mlschutz on Sat, 30 Oct 2021 04:52:16 +0200

Operation ①:


    My Gitee (single thread)
    My Gitee (multi thread)

  • Requirements: specify a website and crawl all of its pictures, for example China Meteorological Network (http://www.weather.com.cn). Use single-threaded and multi-threaded crawling respectively. (The number of pictures to crawl is limited by the last 4 digits of the student number.)

  • Output information:

    Print the URL of each downloaded picture on the console, store the downloaded images in the weather subfolder, and give a screenshot.

    (1) Crawling the pictures of the China Meteorological Network

    Single-thread experiment process:
    1. First, request the page at the main URL and collect all a[href] links on it, saving them for later use. These sub-page links are needed to reach the required number of pictures.

    (code below)

    import re
    import requests
    from bs4 import BeautifulSoup

    start_url = "http://www.weather.com.cn"
    header = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36 Edg/94.0.992.38"}
    r = requests.get(start_url, headers=header)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    data = r.text
    soup = BeautifulSoup(data, "html.parser")
    # Parse the html
    a = '<a href="(.*?)" '
    linklist = re.findall(re.compile(a), str(soup))
    # Extract the links with a regular expression
    # Regular expression get link
    

    2. Define a function that collects the picture links on a given URL:

    def imagelist(link):
        # Collect the picture links (img src) found on a single page
        header = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36 Edg/94.0.992.38"}
        r = requests.get(link, headers=header)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        data = r.text
        soup = BeautifulSoup(data, "html.parser")
        images = soup.select("img")
        p = r'img.*?src="(.*?)"'
        str1 = re.findall(p, str(images), re.S)
        # print(str1)
        imagelist = []
        # Keep only non-empty, de-duplicated links
        for src in str1:
            if src and src not in imagelist:
                imagelist.append(src)
        # print(imagelist)
        # print(len(imagelist))
        return imagelist
    

    3. Define a function to download pictures (as follows)

    import urllib.request

    def download(link):
        # Download one picture; count numbers the files consecutively
        global count
        file = "E:/weather/" + str(count + 1) + ".jpg"  # the weather folder must be created in advance for the crawl to succeed
        print("Picture " + str(count + 1) + " crawled successfully")
        count += 1
        urllib.request.urlretrieve(url=link, filename=file)
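
    Steps 1 to 3 can then be tied together with a simple driver loop. The original post does not show this loop, so the following is only a minimal sketch; the 112-picture limit is an assumption borrowed from the later parts of the post.

    count = 0
    limit = 112  # assumed limit (last digits of the student number, as used later in the post)
    for link in linklist:
        if count >= limit:
            break
        for img_url in imagelist(link):
            if count >= limit:
                break
            try:
                print(img_url)      # print the downloaded URL on the console
                download(img_url)   # saves the picture under E:/weather/
            except Exception as err:
                print(err)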
    

    4. Operation results:

    Multi-thread experiment process:
    1. Similar to the single-thread experiment, but multithreading is used to speed up downloading. First parse the html and collect the sub-page links in the same way.

    start_url = "http://www.weather.com.cn"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36 Edg/94.0.992.38"}
    r = requests.get(start_url, headers=headers)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    data = r.text
    soup = BeautifulSoup(data, "html.parser")
    a = '<a href="(.*?)" '
    linklist1 = re.findall(re.compile(a), str(soup))
    # Keep the first 20 links and skip the javascript placeholders
    linklist = []
    for i in range(len(linklist1)):
        if i < 20 and linklist1[i] != "javascript:void(0);":
            linklist.append(linklist1[i])
    # print(len(linklist))
    # All the sub-page links have been collected here
    threads = []
    for link in linklist:
        imageSpider(link)
    # Each link is handed to the imageSpider function
    

    2. Get all img links in a single page

    soup = BeautifulSoup(data, "html.parser")
    images = soup.select("img")
    for image in images:
        try:
            src = image["src"]
            url = urllib.request.urljoin(start_url, src)
    

    The picture links obtained here are the src attributes of the img tags, but some of them are just the site address http://www.weather.com.cn itself, so the following check filters them out (the assembled imageSpider function is sketched after step 4):

            if url not in urls:
                if not str(url).endswith('.cn'):
    

    3. Finally, encapsulate the download function for multi-threaded crawling: take the last four characters of each picture URL to get its extension (the picture format), then write the downloaded data to a file:

    def download(url, count):
        # Download one picture; count is the running file number
        try:
            # The 4th character from the end marks the extension, e.g. ".jpg"
            if url[len(url) - 4] == ".":
                type = url[len(url) - 4:]
            else:
                type = ""
            req = urllib.request.Request(url, headers=headers)
            data = urllib.request.urlopen(req, timeout=100)
            data = data.read()
            fobj = open("E:/images/" + str(count) + type, "wb")
            fobj.write(data)
            fobj.close()
            print("downloaded " + str(count) + type + "\n")
        except Exception as err:
            print(err)
    

    4. Multithreading setting:

    T = threading.Thread(target=download, args=(url, count))
    T.setDaemon(False)
    T.start()
    threads.append(T)

    for t in threads:
        t.join()
    print("The End")
    

    5. Operation results:

    Experience:

    Through this experiment I first reviewed the basic procedure for downloading pictures from a website and how to use multithreading to speed up the downloads. The difference this time is that the pictures are not all on a single page: the sub-pages have to be found through the main URL first, and then a limited number of pictures downloaded from them. Overall it went well, and I now have a good command of the technique.

    Operation ②:


      My Gitee

    • Requirements: use the Scrapy framework to reproduce operation ①.
    • Output information: same as operation ①.

      (2) Downloading the website pictures using the Scrapy framework

      Experimental process:
      1. First create the Scrapy project (with scrapy startproject).

      2. Write myspiders.py: create the spider class and define the crawler name and start_urls

      import scrapy

      class myspides(scrapy.Spider):
          # Set the basic parameters
          name = 'myspiders'
          start_urls = ["http://www.weather.com.cn"]
      

      3. As in operation ①, define a function to get the picture links on a page:

          def image(self, link):
              # Collect the picture links (img src) found on a single page
              header = {
                  "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36 Edg/94.0.992.38"}
              r = requests.get(link, headers=header)
              r.raise_for_status()
              r.encoding = r.apparent_encoding
              data = r.text
              soup = bs4.BeautifulSoup(data, "html.parser")
              images = soup.select("img")
              p = r'.+?src="(\S+)"'
              str1 = re.findall(p, str(images), re.S)
              # print(str1)
              imagelist = []
              # Keep only non-empty, de-duplicated links
              for src in str1:
                  if src and src not in imagelist:
                      imagelist.append(src)
              # print(imagelist)
              # print(len(imagelist))
              return imagelist
      

      4. Obtain the a[href] links in the parse function:

      a = soup.select("a[href]")
      links = []
      for link in a:
          if link["href"] != 'javascript:void(0)':
              links.append(link["href"])
      

      5. Then download the pictures:

      for i in images:
          item["images"] = i
          item["name"] = "E:/weather/" + str(count + 1) + ".jpg"
          urllib.request.urlretrieve(item["images"], filename=item["name"])
          count += 1
          if count > 112:   # stop once the required number of pictures has been downloaded
              break
          yield item
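
      The fragments in steps 3 to 5 sit inside a single parse method. The following is only a reconstruction under assumptions: the exact control flow, the WeatherItem class (sketched under step 6 below) and the module-level count variable are not shown in the original.

          def parse(self, response):
              # Assumed assembly: collect the sub-page links, then download the
              # pictures found on each sub-page, yielding one item per picture.
              global count
              soup = bs4.BeautifulSoup(response.text, "html.parser")
              a = soup.select("a[href]")
              links = []
              for link in a:
                  if link["href"] != 'javascript:void(0)':
                      links.append(link["href"])
              for link in links:
                  images = self.image(link)        # picture links on one sub-page
                  for i in images:
                      item = WeatherItem()         # hypothetical item class, see step 6
                      item["images"] = i
                      item["name"] = "E:/weather/" + str(count + 1) + ".jpg"
                      urllib.request.urlretrieve(item["images"], filename=item["name"])
                      count += 1
                      if count > 112:
                          return
                      yield item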
      

      6. In items.py, define the item fields used by myspiders; item["images"] and item["name"] are defined here (a minimal sketch follows).
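
      A minimal items.py consistent with this step might look like the following; the class name WeatherItem is hypothetical, only the two field names appear in the original.

      import scrapy

      class WeatherItem(scrapy.Item):
          images = scrapy.Field()   # picture URL
          name = scrapy.Field()     # local file path, e.g. E:/weather/1.jpg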

      7. Configure the settings.py file
      (note that the robots.txt protocol has to be ignored here):

      ROBOTSTXT_OBEY = False
      

      Other settings:

      Operation results

      Experience:

      The main idea of this assignment is the same as assignment ①; the difference is that the Scrapy framework has to be used, presumably so that we become familiar with it. After this experiment I am indeed much more proficient with Scrapy. See operation ③ below for more detailed reflections.

      Operation ③:


        My Gitee

      • Requirements: crawl the Douban movie data using Scrapy and XPath, store the content in the database, and store the pictures in the imgs path.

        Candidate sites: https://movie.douban.com/top250

      • Output information:

        Serial number | Movie title | Director | Performer | Brief introduction | Film rating | Film cover
        1 | The Shawshank Redemption | Frank Darabont | Tim Robbins | Hope sets people free | 9.7 | ./imgs/xsk.jpg
        2 | ...

        (3) Crawling the Douban movie information

        Experimental process:
        1. First observe the Douban website; the information to be crawled is spread across the list page:

        However, there is a problem: on the list page the cast list is truncated and the director's name is mixed in with other text. To get complete data, the crawler follows each movie's detail-page link and crawls the director and actor information from there.

        2. After analyzing the web page, use the Scrapy framework to crawl the website. First create the Scrapy project.

        3. Write the main spider file myspiders.py and create the spider class (scrapy.Spider), defining the crawler name, allowed domains and start_urls:

         name = 'myspiders'
         allowed_domains = ['movie.douban.com']
         start_urls = ['https://movie.douban.com/top250?start=0']
        
        

        4. Encapsulate the parse(self, response) function: parse the html and extract the data with XPath (li below is one movie entry; see the sketch after the code):

         title = li.xpath("./div[@class='pic']/a/img/@alt").extract_first()
         img = li.xpath("./div[@class='pic']/a/img/@src").extract_first()
         score = li.xpath("./div[@class='info']/div[@class='bd']/div[@class='star']/span[position()=2]/text()").extract_first()
         rank = li.xpath("./div[@class='pic']/em/text()").extract_first()
         comment = li.xpath("./div[@class='info']/div[@class='bd']/p[position()=2]/span/text()").extract_first()
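
        The enclosing loop that produces li is not shown in the original; a sketch of how it might be obtained (the list XPath is an assumption based on the Douban page structure) is:

         selector = scrapy.Selector(text=response.text)
         # each movie entry is assumed to be the div.item node inside the ranking list
         lis = selector.xpath("//ol[@class='grid_view']/li/div[@class='item']")
         for li in lis:
             ...  # the xpath queries above run on each entry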
        

        5. Image downloading:

        file = "E:/movie/" + str(item["title"]) +  ".jpg"  # file means to create a related folder in the specified folder before crawling successfully
        urllib.request.urlretrieve(item["img"], filename=file)
        item["file"] = file.strip()
        

        6. Continue to use XPath to get each movie's detail-page link, then request that link and crawl the director and actor names:

        link = li.xpath("./div[@class='pic']/a/@href").extract_first()
        r = requests.get(link, headers=header)  # (assumed) fetch the detail page; this request is not shown in the original
        soup = bs4.BeautifulSoup(r.text, "html.parser")
        title = soup.select("div[id='info']")
        reg = r'<a href=".*?" rel=".*?">(.*?)</a>'
        actor = re.findall(reg, str(title))
        # print(actor[0])
        act = soup.select("span[class='actor'] span[class='attrs']")[0].text
        # print(act)
        item["actor"] = actor[0] if actor else ""
        item["star"] = act if act else ""
        yield item
        

        7. Finally, handle page turning in myspiders.py:

        page = selector.xpath("//div[@class='paginator']/span[@class='thispage']/following-sibling::a[1]/@href").extract_first()
        print(page)
        if page:
            link_nextpage = "https://movie.douban.com/top250" + str(page)
            url = response.urljoin(link_nextpage)
            yield scrapy.Request(url=url, callback=self.parse, dont_filter=True)
        

        8. Then edit the items.py file and define the required item class:

        class DemoItem(scrapy.Item):
            # define the fields for your item here like:
            # name = scrapy.Field()
            title = scrapy.Field()
            img = scrapy.Field()
            link = scrapy.Field()
            score = scrapy.Field()
            comment = scrapy.Field()
            rank = scrapy.Field()
            actor = scrapy.Field()
            star = scrapy.Field()
            file = scrapy.Field()
            pass
        

        9. Configure the settings. At first I only enabled 'demo.pipelines.DemoPipeline': 300, and no matter how I crawled there was no content. After struggling to find the reason, I finally realised that the robots.txt protocol had not been ignored, so I made the following changes:

        ROBOTSTXT_OBEY = False
        USER_AGENT = 'demo (+http://www.yourdomain.com)'
        

        10. Write the pipeline, which handles the output and stores the data in the database; the database table is created here (a fuller pipeline sketch follows the insert statement below):

                if flag:
                    self.con = sqlite3.connect("movie.db")
                    self.cursor = self.con.cursor()
                    try:
                        self.cursor.execute(
                            "create table movie (Mrank int, Mtitle varchar(256), Mactor varchar(256), Mstar text, Mcomment varchar(256), Mscore varchar(256), Mfile varchar(256), constraint pk_movie primary key (Mrank, Mtitle))")
                        flag = False
                    except:
                        # The table already exists, so just clear it
                        self.cursor.execute("delete from movie")
        

        Insert the data:

        self.cursor.execute("insert into movie (Mrank,Mtitle,Mactor,Mstar,Mcomment,Mscore,Mfile) values (?,?,?,?,?,?,?)",
                                    (item["rank"], item["title"],item["actor"],item["star"], item["comment"], item["score"], item["file"]))
        

        When inserting data into the database for the first time, I was surprised to find that several rows were missing. It turned out that Mstar had originally been defined as varchar(256), and some of the crawled cast lists were too long to fit. Changing the column to Mstar text solved the problem.

        Operation results

        You can see that the complete actor lists were all crawled.

        Experience:

        While completing this assignment I repeatedly ran into the problem of the website becoming abnormal and inaccessible, probably because I crawled it too many times; in the future I should cache the page html first before running repeated tests. This was also the first time the crawler involved more complex logic, following a link into a second page, and I suspect that after learning Selenium such problems will be much less troublesome. I am becoming more and more proficient with Scrapy, and I feel that crawling is a skill that needs frequent practice. This time I also used the MySQL and SQLite databases to store the data, and they feel quite similar. Finally, I thank teacher Wu Ling for her guidance in the practice class; otherwise I would not have got through some of the debugging.