Data Acquisition and Fusion Technology: Experiment 3

Posted by BlueKai on Wed, 10 Nov 2021 02:11:19 +0100

Gitee link: Experiment 3

1. Assignment ①

1.1 Assignment content

  • Requirement: specify a website and crawl all the images on it, e.g. China Meteorological Network (http://www.weather.com.cn). Crawl with a single thread and with multiple threads respectively. (The number of images to crawl is limited to the last 4 digits of the student ID.)
  • Output: print the URL of each downloaded image on the console, store the downloaded images in the images subfolder, and give a screenshot.

1.2 Problem-solving approach

1.2.1 Observe and analyze the main page: the images are inside img tags, and the links to sub-pages are in the href attribute of a tags.

1.2.2 While traversing the site, write a spider function that crawls a few images from each page

def spider(url, deep):
    # stop once enough images have been collected (121 = last digits of the student ID)
    # or the maximum recursion depth has been reached
    if len(imgs) >= 121 or deep <= 0:
        return
    data = request.urlopen(url).read()
    soup = BeautifulSoup(data, "lxml")
    # take at most 5 new images from each page
    for image in soup.select("img")[:5]:
        src = image.get("src", "")
        if src.startswith("http") and src not in imgs:
            imgs.append(src)
    # recurse into the sub-pages linked from a tags
    for link in soup.select("a"):
        href = link.get("href", "")
        if href.startswith("http"):
            spider(href, deep - 1)

1.2.3 single thread downloading pictures

def download():
    # single thread: download the collected images one by one, in order
    for i in range(121):
        request.urlretrieve(imgs[i], r"D:\data acquisition\demo\practice\3\3.1\images\picture" + str(i + 1) + ".jpg")
        print("downloaded ", imgs[i])

1.2.4 multi thread downloading pictures

def download_one(img_url, n):
    # worker: download a single image
    request.urlretrieve(img_url, r"D:\data acquisition\demo\practice\3\3.1\images\picture" + str(n) + ".jpg")
    print("downloaded ", img_url)

def download_threads():
    # spawn one non-daemon thread per image; the threads run concurrently,
    # so the images do not necessarily finish in order
    for i in range(121):
        T = threading.Thread(target=download_one, args=(imgs[i], i + 1))
        T.start()
        threads.append(T)
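
For context, a minimal driver sketch tying the pieces together (the globals imgs and threads, the entry URL, and the depth value are assumptions pieced together from the snippets above):

import threading
from urllib import request
from bs4 import BeautifulSoup

imgs = []     # image URLs collected by spider, shared with the downloaders
threads = []  # thread handles created by download_threads

spider("http://www.weather.com.cn/", 3)   # crawl to depth 3
download_threads()                        # or download() for the single-threaded run
for t in threads:
    t.join()  # wait for every download to finish
print("collected", len(imgs), "image URLs")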

1.2.5 Results

1.3 Experience

  • A single thread downloads the images strictly in order, which is simple and intuitive but slow; multiple threads do not necessarily execute in order, but the download speed is much higher (see the timing sketch below)
  • Became familiar with traversing a website and consolidated my understanding of multi-threaded crawlers
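
As a rough check of that speed difference, the two variants can be timed (a sketch, assuming the functions above; the actual numbers depend on the network):

import time

start = time.time()
download()  # single-threaded, strictly in order
print("single-threaded:", time.time() - start, "s")

start = time.time()
download_threads()
for t in threads:
    t.join()  # wait for all workers before stopping the clock
print("multi-threaded:", time.time() - start, "s")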

2. Assignment ②

2.1 Assignment content

Requirement: use the Scrapy framework to reproduce assignment ①. Output information: the same as assignment ①.

2.2 Problem-solving approach

2.2.1 Define a single field src_url in the item to store the image address, and write mySpider
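
The item class itself is not shown above; a minimal sketch of what items.py would contain (the name DemoItem and the field src_url both come from the code below):

import scrapy

class DemoItem(scrapy.Item):
    src_url = scrapy.Field()  # the image URL, downloaded later by the pipeline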

import scrapy
from demo.items import DemoItem  # the item defined in items.py


class mySpider(scrapy.Spider):
    name = "mySpider"
    count = 0   # total number of images collected across all pages
    imgs = []   # already-seen image URLs, used for deduplication

    def start_requests(self):
        url = "http://www.weather.com.cn/"
        # pass the remaining crawl depth to the callback via meta
        yield scrapy.Request(url=url, callback=self.parse, meta={"deep": 3})

    def parse(self, response):
        try:
            deep = response.meta["deep"]
            if deep <= 0 or self.count >= 121:
                return
            data = response.body.decode()
            selector = scrapy.Selector(text=data)
            images = selector.xpath("//img/@src").extract()
            n = 0  # take at most 5 new images from this page
            for i in images:
                if i.startswith("http") and i not in self.imgs:
                    item = DemoItem()
                    item["src_url"] = i
                    self.imgs.append(i)
                    self.count += 1
                    n += 1
                    yield item
                    if n >= 5:
                        break
            # follow the links on this page, one level deeper
            links = selector.xpath("//a/@href").extract()
            for link in links:
                if link.startswith("http"):
                    url = response.urljoin(link)
                    yield scrapy.Request(url=url, callback=self.parse, meta={"deep": deep - 1})

        except Exception as err:
            print(err)

2.2.2 Download the images in the pipeline

import requests


class DemoPipeline:
    count = 1

    def process_item(self, item, spider):
        try:
            if self.count <= 121:
                # save as images/picture<count>.jpg
                with open("./images/picture" + str(self.count) + ".jpg", "wb") as f:
                    print(str(self.count) + ":" + item["src_url"])
                    img = requests.get(item["src_url"]).content
                    f.write(img)
                    self.count += 1
        except Exception as err:
            print(err)
        return item

2.2.3 Modify settings.py

BOT_NAME = 'demo'

SPIDER_MODULES = ['demo.spiders']
NEWSPIDER_MODULE = 'demo.spiders'

ROBOTSTXT_OBEY = False

ITEM_PIPELINES = {
   'demo.pipelines.DemoPipeline': 300,
}
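
With the pipeline registered, the spider can be launched from the project root with the standard Scrapy command:

scrapy crawl mySpider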

2.2.4 Output results

2.3 Experience

  • Reviewed the Scrapy crawler framework and learned how to use XPath; I feel that, once mastered, XPath is more convenient than CSS selectors
  • Understood the meta parameter-passing mechanism between parse callbacks in Scrapy

3. Assignment ③

3.1 Assignment content

  • Requirement: crawl the Douban movie data using Scrapy and XPath, store the textual content in the database, and store the cover images under the imgs path.
    All movie information is stored in the database, and a GIF recording of the browser F12 debugging and analysis process is added to the blog.

  • Output information:

    Serial number  Movie title               Director        Performer    Brief introduction       Film rating  Film cover
    1              The Shawshank Redemption  Frank Darabont  Tim Robbins  Want to set people free  9.7          ./imgs/xsk.jpg
    2              ...

3.2 Problem-solving approach

3.2.1 Analyze the web page and locate the required content

3.2.2 Write MovieItem

import scrapy


class MovieItem(scrapy.Item):
    rank = scrapy.Field()      # serial number in the Top 250 list
    name = scrapy.Field()      # movie title
    director = scrapy.Field()  # director
    actor = scrapy.Field()     # performer(s)
    score = scrapy.Field()     # film rating
    img = scrapy.Field()       # cover image URL
    desp = scrapy.Field()      # brief introduction (the one-line quote)

3.2.3 Write the spider

Crawl the first 10 pages (each page lists 25 movies, selected via the start offset):

    def start_requests(self):
        for i in range(10):
            # start = 0, 25, 50, ... selects each page of 25 movies
            url = "https://movie.douban.com/top250?start=" + str(i * 25)
            yield scrapy.Request(url=url, callback=self.parse)

The parse function:

    def parse(self, response):
        try:
            data = response.body.decode()
            selector = scrapy.Selector(text=data)
            # each movie sits in an <li><div class="item">
            movies = selector.xpath("//li/div[@class='item']")
            for m in movies:
                rank = m.xpath("./div[@class='pic']/em/text()").extract_first()
                image = m.xpath("./div[@class='pic']/a/img/@src").extract_first()
                name = m.xpath("./div[@class='info']//span[@class='title']/text()").extract_first()
                members = m.xpath("./div[@class='info']//p[@class='']/text()").extract_first()
                desp = m.xpath("./div[@class='info']//p[@class='quote']/span/text()").extract_first()
                score = m.xpath("./div[@class='info']//span[@class='rating_num']/text()").extract_first()
                item = MovieItem()
                item['rank'] = rank
                item['name'] = name
                # the page text is Chinese: "导演" = director, "主演" = starring
                # (re and MovieItem are imported at module level)
                director = re.search(r'导演:\s*(.*?)\s*主', members).group(1)
                actor = re.search(r'主演:\s*(.*)', members)
                item['director'] = director
                # actor may be missing because animated films list no performers
                if actor is None:
                    item['actor'] = "null"
                else:
                    item["actor"] = actor.group(1)
                item['desp'] = desp
                item['score'] = score
                item['img'] = image
                yield item

        except Exception as err:
            print(err)
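
For reference, a quick check of those regular expressions against text in the page's format (the sample string is illustrative, not scraped):

import re

members = "导演: 弗兰克·德拉邦特 Frank Darabont   主演: 蒂姆·罗宾斯 Tim Robbins /..."
print(re.search(r'导演:\s*(.*?)\s*主', members).group(1))  # the director part
print(re.search(r'主演:\s*(.*)', members).group(1))        # the performer part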

3.2.4 Write the database class

import sqlite3


class MovieDB:
    def openDB(self):
        self.con = sqlite3.connect("movies.db")
        self.cursor = self.con.cursor()
        try:
            self.cursor.execute("create table movies (Rank int,Name varchar(32),Director varchar(32),"
                                "Actors varchar(64),Description varchar(64),Score varchar(8),ImgPath varchar(64))")
        except Exception:
            # the table already exists: clear the old rows instead
            self.cursor.execute("delete from movies")

    def closeDB(self):
        self.con.commit()
        self.con.close()

    def insert(self, Rank, Name, Director, Actors, Description, Score, ImgPath):
        try:
            self.cursor.execute("insert into movies (Rank,Name,Director,Actors,Description,Score,ImgPath) "
                                "values (?,?,?,?,?,?,?)", (Rank, Name, Director, Actors, Description, Score, ImgPath))
        except Exception as err:
            print(err)

3.2.5 Write the pipeline

import requests

from moviespider.db import MovieDB  # assumption: MovieDB is saved in a db module of the project


class MoviespiderPipeline:
    def open_spider(self, spider):
        self.db = MovieDB()
        self.db.openDB()

    def process_item(self, item, spider):
        # save the cover under ./images/<movie name>.jpg
        path = r"./images/" + item['name'] + ".jpg"
        url = item['img']
        img = requests.get(url).content
        with open(path, "wb") as f:
            f.write(img)
            print("Cover of movie " + item['rank'] + " downloaded successfully")
        self.db.insert(int(item['rank']), item['name'], item['director'], item['actor'],
                       item['desp'], item["score"], path)
        print("Data of movie " + item['rank'] + " inserted successfully")
        return item

    def close_spider(self, spider):
        self.db.closeDB()
        print("Crawl finished")

3.2.6 Modify settings.py
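
The changes mirror assignment ②: disable robots.txt checking and register the pipeline. A sketch, assuming the project is named moviespider (to match the pipeline class above); the user agent is set because Douban tends to reject the default one:

BOT_NAME = 'moviespider'

ROBOTSTXT_OBEY = False

# pretend to be a browser; Douban usually blocks the default Scrapy user agent
USER_AGENT = 'Mozilla/5.0'

ITEM_PIPELINES = {
   'moviespider.pipelines.MoviespiderPipeline': 300,
}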
3.2.7 Output results

3.3 Experience

  • Became more proficient in the use of Scrapy and XPath