Data Acquisition and Fusion Technology: Experiment 3

Posted by BlueKai on Wed, 10 Nov 2021 02:11:19 +0100

Gitee link: Experiment 3

1. Assignment ①

1.1 Assignment content

  • Requirement: specify a website and crawl all the images on it, e.g. China Meteorological Network (http://www.weather.com.cn). Crawl with a single thread and with multiple threads respectively. (The number of images to crawl is limited to the last 4 digits of the student ID.)
  • Output: print the URL of each downloaded image on the console, store the downloaded images in the images subfolder, and give a screenshot.

1.2 Problem-solving approach

1.2.1 Observe and analyze the main page: the images are inside img tags, and the links to sub-pages are in the href attribute of a tags.

1.2.2 While traversing the site, write a spider function that crawls a few images from each page

def spider(url, deep):
    # stop once enough images have been collected (121 = last digits of the student ID)
    # or the maximum recursion depth has been reached
    if len(imgs) >= 121 or deep <= 0:
        return
    data = request.urlopen(url).read()
    soup = BeautifulSoup(data, "lxml")
    # take at most 5 new images from each page
    for image in soup.select("img")[:5]:
        src = image.get("src", "")
        if src.startswith("http") and src not in imgs:
            imgs.append(src)
    # recurse into the sub-pages linked from a tags
    for link in soup.select("a"):
        href = link.get("href", "")
        if href.startswith("http"):
            spider(href, deep - 1)

1.2.3 single thread downloading pictures

def download():
    # single thread: download the collected images one by one, in order
    for i in range(121):
        request.urlretrieve(imgs[i], r"D:\data acquisition\demo\practice\3\3.1\images\picture" + str(i + 1) + ".jpg")
        print("downloaded ", imgs[i])

1.2.4 multi thread downloading pictures

def download_one(img_url, n):
    # worker: download a single image
    request.urlretrieve(img_url, r"D:\data acquisition\demo\practice\3\3.1\images\picture" + str(n) + ".jpg")
    print("downloaded ", img_url)

def download_threads():
    # spawn one non-daemon thread per image; the threads run concurrently,
    # so the images do not necessarily finish in order
    for i in range(121):
        T = threading.Thread(target=download_one, args=(imgs[i], i + 1))
        T.start()
        threads.append(T)
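
For context, a minimal driver sketch tying the pieces together (the globals imgs and threads, the entry URL, and the depth value are assumptions pieced together from the snippets above):

import threading
from urllib import request
from bs4 import BeautifulSoup

imgs = []     # image URLs collected by spider, shared with the downloaders
threads = []  # thread handles created by download_threads

spider("http://www.weather.com.cn/", 3)   # crawl to depth 3
download_threads()                        # or download() for the single-threaded run
for t in threads:
    t.join()  # wait for every download to finish
print("collected", len(imgs), "image URLs")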

1.2.5 Results

1.3 Experience

  • A single thread downloads the images strictly in order, which is simple and intuitive but slow; multiple threads do not necessarily execute in order, but the download speed is much higher (see the timing sketch below)
  • Became familiar with traversing a website and consolidated my understanding of multi-threaded crawlers
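
As a rough check of that speed difference, the two variants can be timed (a sketch, assuming the functions above; the actual numbers depend on the network):

import time

start = time.time()
download()  # single-threaded, strictly in order
print("single-threaded:", time.time() - start, "s")

start = time.time()
download_threads()
for t in threads:
    t.join()  # wait for all workers before stopping the clock
print("multi-threaded:", time.time() - start, "s")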

2. Assignment ②

2.1 Assignment content

Requirement: use the Scrapy framework to reproduce assignment ①. Output information: the same as assignment ①.

2.2 Problem-solving approach

2.2.1 Define a single field src_url in the item to store the image address, and write mySpider
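
The item class itself is not shown above; a minimal sketch of what items.py would contain (the name DemoItem and the field src_url both come from the code below):

import scrapy

class DemoItem(scrapy.Item):
    src_url = scrapy.Field()  # the image URL, downloaded later by the pipeline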

import scrapy
from demo.items import DemoItem  # the item defined in items.py


class mySpider(scrapy.Spider):
    name = "mySpider"
    count = 0   # total number of images collected across all pages
    imgs = []   # already-seen image URLs, used for deduplication

    def start_requests(self):
        url = "http://www.weather.com.cn/"
        # pass the remaining crawl depth to the callback via meta
        yield scrapy.Request(url=url, callback=self.parse, meta={"deep": 3})

    def parse(self, response):
        try:
            deep = response.meta["deep"]
            if deep <= 0 or self.count >= 121:
                return
            data = response.body.decode()
            selector = scrapy.Selector(text=data)
            images = selector.xpath("//img/@src").extract()
            n = 0  # take at most 5 new images from this page
            for i in images:
                if i.startswith("http") and i not in self.imgs:
                    item = DemoItem()
                    item["src_url"] = i
                    self.imgs.append(i)
                    self.count += 1
                    n += 1
                    yield item
                    if n >= 5:
                        break
            # follow the links on this page, one level deeper
            links = selector.xpath("//a/@href").extract()
            for link in links:
                if link.startswith("http"):
                    url = response.urljoin(link)
                    yield scrapy.Request(url=url, callback=self.parse, meta={"deep": deep - 1})

        except Exception as err:
            print(err)

2.2.2 Download the images in the pipeline

import requests


class DemoPipeline:
    count = 1

    def process_item(self, item, spider):
        try:
            if self.count <= 121:
                # save as images/picture<count>.jpg
                with open("./images/picture" + str(self.count) + ".jpg", "wb") as f:
                    print(str(self.count) + ":" + item["src_url"])
                    img = requests.get(item["src_url"]).content
                    f.write(img)
                    self.count += 1
        except Exception as err:
            print(err)
        return item

2.2.3 Modify settings.py

BOT_NAME = 'demo'

SPIDER_MODULES = ['demo.spiders']
NEWSPIDER_MODULE = 'demo.spiders'

ROBOTSTXT_OBEY = False

ITEM_PIPELINES = {
   'demo.pipelines.DemoPipeline': 300,
}
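
With the pipeline registered, the spider can be launched from the project root with the standard Scrapy command:

scrapy crawl mySpider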

2.2.4 Output results

2.3 Experience

  • Reviewed the Scrapy crawler framework and learned how to use XPath; I feel that, once mastered, XPath is more convenient than CSS selectors
  • Understood the meta parameter-passing mechanism between parse callbacks in Scrapy

3. Assignment ③

3.1 Assignment content

  • Requirement: crawl the Douban movie data using Scrapy and XPath, store the textual content in the database, and store the cover images under the imgs path.
    All movie information is stored in the database, and a GIF recording of the browser F12 debugging and analysis process is added to the blog.

  • Output information:

    Serial number  Movie title               Director        Performer    Brief introduction       Film rating  Film cover
    1              The Shawshank Redemption  Frank Darabont  Tim Robbins  Want to set people free  9.7          ./imgs/xsk.jpg
    2              ...

3.2 Problem-solving approach

3.2.1 Analyze the web page and locate the required content

3.2.2 Write MovieItem

import scrapy


class MovieItem(scrapy.Item):
    rank = scrapy.Field()      # serial number in the Top 250 list
    name = scrapy.Field()      # movie title
    director = scrapy.Field()  # director
    actor = scrapy.Field()     # performer(s)
    score = scrapy.Field()     # film rating
    img = scrapy.Field()       # cover image URL
    desp = scrapy.Field()      # brief introduction (the one-line quote)

3.2.3 Write the spider

Crawl the first 10 pages (each page lists 25 movies, selected via the start offset):

    def start_requests(self):
        for i in range(10):
            # start = 0, 25, 50, ... selects each page of 25 movies
            url = "https://movie.douban.com/top250?start=" + str(i * 25)
            yield scrapy.Request(url=url, callback=self.parse)

The parse function:

    def parse(self, response):
        try:
            data = response.body.decode()
            selector = scrapy.Selector(text=data)
            # each movie sits in an <li><div class="item">
            movies = selector.xpath("//li/div[@class='item']")
            for m in movies:
                rank = m.xpath("./div[@class='pic']/em/text()").extract_first()
                image = m.xpath("./div[@class='pic']/a/img/@src").extract_first()
                name = m.xpath("./div[@class='info']//span[@class='title']/text()").extract_first()
                members = m.xpath("./div[@class='info']//p[@class='']/text()").extract_first()
                desp = m.xpath("./div[@class='info']//p[@class='quote']/span/text()").extract_first()
                score = m.xpath("./div[@class='info']//span[@class='rating_num']/text()").extract_first()
                item = MovieItem()
                item['rank'] = rank
                item['name'] = name
                # the page text is Chinese: "导演" = director, "主演" = starring
                # (re and MovieItem are imported at module level)
                director = re.search(r'导演:\s*(.*?)\s*主', members).group(1)
                actor = re.search(r'主演:\s*(.*)', members)
                item['director'] = director
                # actor may be missing because animated films list no performers
                if actor is None:
                    item['actor'] = "null"
                else:
                    item["actor"] = actor.group(1)
                item['desp'] = desp
                item['score'] = score
                item['img'] = image
                yield item

        except Exception as err:
            print(err)
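
For reference, a quick check of those regular expressions against text in the page's format (the sample string is illustrative, not scraped):

import re

members = "导演: 弗兰克·德拉邦特 Frank Darabont   主演: 蒂姆·罗宾斯 Tim Robbins /..."
print(re.search(r'导演:\s*(.*?)\s*主', members).group(1))  # the director part
print(re.search(r'主演:\s*(.*)', members).group(1))        # the performer part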

3.2.4 Write the database class

import sqlite3


class MovieDB:
    def openDB(self):
        self.con = sqlite3.connect("movies.db")
        self.cursor = self.con.cursor()
        try:
            self.cursor.execute("create table movies (Rank int,Name varchar(32),Director varchar(32),"
                                "Actors varchar(64),Description varchar(64),Score varchar(8),ImgPath varchar(64))")
        except Exception:
            # the table already exists: clear the old rows instead
            self.cursor.execute("delete from movies")

    def closeDB(self):
        self.con.commit()
        self.con.close()

    def insert(self, Rank, Name, Director, Actors, Description, Score, ImgPath):
        try:
            self.cursor.execute("insert into movies (Rank,Name,Director,Actors,Description,Score,ImgPath) "
                                "values (?,?,?,?,?,?,?)", (Rank, Name, Director, Actors, Description, Score, ImgPath))
        except Exception as err:
            print(err)

3.2.5 Write the pipeline

import requests

from moviespider.db import MovieDB  # assumption: MovieDB is saved in a db module of the project


class MoviespiderPipeline:
    def open_spider(self, spider):
        self.db = MovieDB()
        self.db.openDB()

    def process_item(self, item, spider):
        # save the cover under ./images/<movie name>.jpg
        path = r"./images/" + item['name'] + ".jpg"
        url = item['img']
        img = requests.get(url).content
        with open(path, "wb") as f:
            f.write(img)
            print("Cover of movie " + item['rank'] + " downloaded successfully")
        self.db.insert(int(item['rank']), item['name'], item['director'], item['actor'],
                       item['desp'], item["score"], path)
        print("Data of movie " + item['rank'] + " inserted successfully")
        return item

    def close_spider(self, spider):
        self.db.closeDB()
        print("Crawl finished")

3.2.6 Modify settings.py
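
The changes mirror assignment ②: disable robots.txt checking and register the pipeline. A sketch, assuming the project is named moviespider (to match the pipeline class above); the user agent is set because Douban tends to reject the default one:

BOT_NAME = 'moviespider'

ROBOTSTXT_OBEY = False

# pretend to be a browser; Douban usually blocks the default Scrapy user agent
USER_AGENT = 'Mozilla/5.0'

ITEM_PIPELINES = {
   'moviespider.pipelines.MoviespiderPipeline': 300,
}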
3.2.7 Output results

3.3 Experience

  • Became more proficient in the use of Scrapy and XPath