The Third Data Acquisition Experiment

Posted by mlschutz on Sat, 30 Oct 2021 04:52:16 +0200

Operation ①:


    My Gitee (single thread)
    My Gitee (multi thread)

  • Requirements: specify a website and crawl all of its pictures, for example China Meteorological Network (http://www.weather.com.cn). Use single-threaded and multi-threaded crawling respectively. (The number of pictures to crawl is limited by the last 4 digits of the student number.)

  • Output information:

    Print the URL of each downloaded picture on the console, store the downloaded images in the weather subfolder, and give a screenshot.

    (1) Crawling the pictures of the China Meteorological Network

    Single-thread experiment process:
    1. First, request the page at the main URL and collect all a[href] links on it, saving them for later use. These sub-page links are needed to reach the required number of pictures.

    (code below)

    import re
    import requests
    from bs4 import BeautifulSoup

    start_url = "http://www.weather.com.cn"
    header = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36 Edg/94.0.992.38"}
    r = requests.get(start_url, headers=header)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    data = r.text
    soup = BeautifulSoup(data, "html.parser")
    # Parse the html
    a = '<a href="(.*?)" '
    linklist = re.findall(re.compile(a), str(soup))
    # Extract the links with a regular expression
    # Regular expression get link
    

    2. Define a function that collects the picture links on a given URL:

    def imagelist(link):
        # Collect the picture links (img src) found on a single page
        header = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36 Edg/94.0.992.38"}
        r = requests.get(link, headers=header)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        data = r.text
        soup = BeautifulSoup(data, "html.parser")
        images = soup.select("img")
        p = r'img.*?src="(.*?)"'
        str1 = re.findall(p, str(images), re.S)
        # print(str1)
        imagelist = []
        # Keep only non-empty, de-duplicated links
        for src in str1:
            if src and src not in imagelist:
                imagelist.append(src)
        # print(imagelist)
        # print(len(imagelist))
        return imagelist
    

    3. Define a function to download pictures (as follows)

    import urllib.request

    def download(link):
        # Download one picture; count numbers the files consecutively
        global count
        file = "E:/weather/" + str(count + 1) + ".jpg"  # the weather folder must be created in advance for the crawl to succeed
        print("Picture " + str(count + 1) + " crawled successfully")
        count += 1
        urllib.request.urlretrieve(url=link, filename=file)
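
    Steps 1 to 3 can then be tied together with a simple driver loop. The original post does not show this loop, so the following is only a minimal sketch; the 112-picture limit is an assumption borrowed from the later parts of the post.

    count = 0
    limit = 112  # assumed limit (last digits of the student number, as used later in the post)
    for link in linklist:
        if count >= limit:
            break
        for img_url in imagelist(link):
            if count >= limit:
                break
            try:
                print(img_url)      # print the downloaded URL on the console
                download(img_url)   # saves the picture under E:/weather/
            except Exception as err:
                print(err)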
    

    4. Operation results:

    Multi-thread experiment process:
    1. Similar to the single-thread experiment, but multithreading is used to speed up downloading. First parse the html and collect the sub-page links in the same way.

    start_url = "http://www.weather.com.cn"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36 Edg/94.0.992.38"}
    r = requests.get(start_url, headers=headers)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    data = r.text
    soup = BeautifulSoup(data, "html.parser")
    a = '<a href="(.*?)" '
    linklist1 = re.findall(re.compile(a), str(soup))
    # Keep the first 20 links and skip the javascript placeholders
    linklist = []
    for i in range(len(linklist1)):
        if i < 20 and linklist1[i] != "javascript:void(0);":
            linklist.append(linklist1[i])
    # print(len(linklist))
    # All the sub-page links have been collected here
    threads = []
    for link in linklist:
        imageSpider(link)
    # Each link is handed to the imageSpider function
    

    2. Get all img links in a single page

    soup = BeautifulSoup(data, "html.parser")
    images = soup.select("img")
    for image in images:
        try:
            src = image["src"]
            url = urllib.request.urljoin(start_url, src)
    

    The picture links obtained here are the src attributes of the img tags, but some of them are just the site address http://www.weather.com.cn itself, so the following check filters them out (the assembled imageSpider function is sketched after step 4):

            if url not in urls:
                if not str(url).endswith('.cn'):
    

    3. Finally, encapsulate the download function for multi-threaded crawling: take the last four characters of each picture URL to get its extension (the picture format), then write the downloaded data to a file:

    def download(url, count):
        # Download one picture; count is the running file number
        try:
            # The 4th character from the end marks the extension, e.g. ".jpg"
            if url[len(url) - 4] == ".":
                type = url[len(url) - 4:]
            else:
                type = ""
            req = urllib.request.Request(url, headers=headers)
            data = urllib.request.urlopen(req, timeout=100)
            data = data.read()
            fobj = open("E:/images/" + str(count) + type, "wb")
            fobj.write(data)
            fobj.close()
            print("downloaded " + str(count) + type + "\n")
        except Exception as err:
            print(err)
    

    4. Multithreading setting:

    T = threading.Thread(target=download, args=(url, count))
    T.setDaemon(False)
    T.start()
    threads.append(T)

    for t in threads:
        t.join()
    print("The End")
    

    5. Operation results:

    Experience:

    Through this experiment I first reviewed the basic procedure for downloading pictures from a website and how to use multithreading to speed up the downloads. The difference this time is that the pictures are not all on a single page: the sub-pages have to be found through the main URL first, and then a limited number of pictures downloaded from them. Overall it went well, and I now have a good command of the technique.

    Operation ②:


      My Gitee

    • Requirements: use the Scrapy framework to reproduce operation ①.
    • Output information: same as operation ①.

      (2) Downloading the website pictures using the Scrapy framework

      Experimental process:
      1. First create the Scrapy project (with scrapy startproject).

      2. Write myspiders.py: create the spider class and define the crawler name and start_urls

      import scrapy

      class myspides(scrapy.Spider):
          # Set the basic parameters
          name = 'myspiders'
          start_urls = ["http://www.weather.com.cn"]
      

      3. As in operation ①, define a function to get the picture links on a page:

          def image(self, link):
              # Collect the picture links (img src) found on a single page
              header = {
                  "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36 Edg/94.0.992.38"}
              r = requests.get(link, headers=header)
              r.raise_for_status()
              r.encoding = r.apparent_encoding
              data = r.text
              soup = bs4.BeautifulSoup(data, "html.parser")
              images = soup.select("img")
              p = r'.+?src="(\S+)"'
              str1 = re.findall(p, str(images), re.S)
              # print(str1)
              imagelist = []
              # Keep only non-empty, de-duplicated links
              for src in str1:
                  if src and src not in imagelist:
                      imagelist.append(src)
              # print(imagelist)
              # print(len(imagelist))
              return imagelist
      

      4. Obtain the a[href] links in the parse function:

      a = soup.select("a[href]")
      links = []
      for link in a:
          if link["href"] != 'javascript:void(0)':
              links.append(link["href"])
      

      5. Then download the pictures:

      for i in images:
          item["images"] = i
          item["name"] = "E:/weather/" + str(count + 1) + ".jpg"
          urllib.request.urlretrieve(item["images"], filename=item["name"])
          count += 1
          if count > 112:   # stop once the required number of pictures has been downloaded
              break
          yield item
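
      The fragments in steps 3 to 5 sit inside a single parse method. The following is only a reconstruction under assumptions: the exact control flow, the WeatherItem class (sketched under step 6 below) and the module-level count variable are not shown in the original.

          def parse(self, response):
              # Assumed assembly: collect the sub-page links, then download the
              # pictures found on each sub-page, yielding one item per picture.
              global count
              soup = bs4.BeautifulSoup(response.text, "html.parser")
              a = soup.select("a[href]")
              links = []
              for link in a:
                  if link["href"] != 'javascript:void(0)':
                      links.append(link["href"])
              for link in links:
                  images = self.image(link)        # picture links on one sub-page
                  for i in images:
                      item = WeatherItem()         # hypothetical item class, see step 6
                      item["images"] = i
                      item["name"] = "E:/weather/" + str(count + 1) + ".jpg"
                      urllib.request.urlretrieve(item["images"], filename=item["name"])
                      count += 1
                      if count > 112:
                          return
                      yield item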
      

      6. In items.py, define the item fields used by myspiders; item["images"] and item["name"] are defined here (a minimal sketch follows).
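
      A minimal items.py consistent with this step might look like the following; the class name WeatherItem is hypothetical, only the two field names appear in the original.

      import scrapy

      class WeatherItem(scrapy.Item):
          images = scrapy.Field()   # picture URL
          name = scrapy.Field()     # local file path, e.g. E:/weather/1.jpg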

      7. Configure the settings.py file
      (note that the robots.txt protocol has to be ignored here):

      ROBOTSTXT_OBEY = False
      

      Other settings:

      Operation results

      Experience:

      The main idea of this assignment is the same as assignment ①; the difference is that the Scrapy framework has to be used, presumably so that we become familiar with it. After this experiment I am indeed much more proficient with Scrapy. See operation ③ below for more detailed reflections.

      Operation ③:


        My Gitee

      • Requirements: crawl the Douban movie data using Scrapy and XPath, store the content in the database, and store the pictures in the imgs path.

        Candidate sites: https://movie.douban.com/top250

      • Output information:

        Serial number | Movie title | Director | Performer | Brief introduction | Film rating | Film cover
        1 | The Shawshank Redemption | Frank Darabont | Tim Robbins | Hope sets people free | 9.7 | ./imgs/xsk.jpg
        2 | ...

        (3) Crawling the Douban movie information

        Experimental process:
        1. First observe the Douban website; the information to be crawled is spread across the list page:

        However, there is a problem: on the list page the cast list is truncated and the director's name is mixed in with other text. To get complete data, the crawler follows each movie's detail-page link and crawls the director and actor information from there.

        2. After analyzing the web page, use the Scrapy framework to crawl the website. First create the Scrapy project.

        3. Write the main spider file myspiders.py and create the spider class (scrapy.Spider), defining the crawler name, allowed domains and start_urls:

         name = 'myspiders'
         allowed_domains = ['movie.douban.com']
         start_urls = ['https://movie.douban.com/top250?start=0']
        
        

        4. Encapsulate the parse(self, response) function: parse the html and extract the data with XPath (li below is one movie entry; see the sketch after the code):

         title = li.xpath("./div[@class='pic']/a/img/@alt").extract_first()
         img = li.xpath("./div[@class='pic']/a/img/@src").extract_first()
         score = li.xpath("./div[@class='info']/div[@class='bd']/div[@class='star']/span[position()=2]/text()").extract_first()
         rank = li.xpath("./div[@class='pic']/em/text()").extract_first()
         comment = li.xpath("./div[@class='info']/div[@class='bd']/p[position()=2]/span/text()").extract_first()
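
        The enclosing loop that produces li is not shown in the original; a sketch of how it might be obtained (the list XPath is an assumption based on the Douban page structure) is:

         selector = scrapy.Selector(text=response.text)
         # each movie entry is assumed to be the div.item node inside the ranking list
         lis = selector.xpath("//ol[@class='grid_view']/li/div[@class='item']")
         for li in lis:
             ...  # the xpath queries above run on each entry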
        

        5. Image downloading:

        file = "E:/movie/" + str(item["title"]) +  ".jpg"  # file means to create a related folder in the specified folder before crawling successfully
        urllib.request.urlretrieve(item["img"], filename=file)
        item["file"] = file.strip()
        

        6. Continue to use XPath to get each movie's detail-page link, then request that link and crawl the director and actor names:

        link = li.xpath("./div[@class='pic']/a/@href").extract_first()
        r = requests.get(link, headers=header)  # (assumed) fetch the detail page; this request is not shown in the original
        soup = bs4.BeautifulSoup(r.text, "html.parser")
        title = soup.select("div[id='info']")
        reg = r'<a href=".*?" rel=".*?">(.*?)</a>'
        actor = re.findall(reg, str(title))
        # print(actor[0])
        act = soup.select("span[class='actor'] span[class='attrs']")[0].text
        # print(act)
        item["actor"] = actor[0] if actor else ""
        item["star"] = act if act else ""
        yield item
        

        7. Finally, handle page turning in myspiders.py:

        page = selector.xpath("//div[@class='paginator']/span[@class='thispage']/following-sibling::a[1]/@href").extract_first()
        print(page)
        if page:
            link_nextpage = "https://movie.douban.com/top250" + str(page)
            url = response.urljoin(link_nextpage)
            yield scrapy.Request(url=url, callback=self.parse, dont_filter=True)
        

        8. Then edit the items.py file and define the required item class:

        class DemoItem(scrapy.Item):
            # define the fields for your item here like:
            # name = scrapy.Field()
            title = scrapy.Field()
            img = scrapy.Field()
            link = scrapy.Field()
            score = scrapy.Field()
            comment = scrapy.Field()
            rank = scrapy.Field()
            actor = scrapy.Field()
            star = scrapy.Field()
            file = scrapy.Field()
            pass
        

        9. Configure the settings. At first I only enabled 'demo.pipelines.DemoPipeline': 300, and no matter how I crawled there was no content. After struggling to find the reason, I finally realised that the robots.txt protocol had not been ignored, so I made the following changes:

        ROBOTSTXT_OBEY = False
        USER_AGENT = 'demo (+http://www.yourdomain.com)'
        

        10. Write the pipeline, which handles the output and stores the data in the database; the database table is created here (a fuller pipeline sketch follows the insert statement below):

                if flag:
                    self.con = sqlite3.connect("movie.db")
                    self.cursor = self.con.cursor()
                    try:
                        self.cursor.execute(
                            "create table movie (Mrank int, Mtitle varchar(256), Mactor varchar(256), Mstar text, Mcomment varchar(256), Mscore varchar(256), Mfile varchar(256), constraint pk_movie primary key (Mrank, Mtitle))")
                        flag = False
                    except:
                        # The table already exists, so just clear it
                        self.cursor.execute("delete from movie")
        

        Insert the data:

        self.cursor.execute("insert into movie (Mrank,Mtitle,Mactor,Mstar,Mcomment,Mscore,Mfile) values (?,?,?,?,?,?,?)",
                                    (item["rank"], item["title"],item["actor"],item["star"], item["comment"], item["score"], item["file"]))
        

        When inserting data into the database for the first time, I was surprised to find that several rows were missing. It turned out that Mstar had originally been defined as varchar(256), and some of the crawled cast lists were too long to fit. Changing the column to Mstar text solved the problem.

        Operation results

        You can see that the complete actor lists were all crawled.

        Experience:

        While completing this assignment I repeatedly ran into the problem of the website becoming abnormal and inaccessible, probably because I crawled it too many times; in the future I should cache the page html first before running repeated tests. This was also the first time the crawler involved more complex logic, following a link into a second page, and I suspect that after learning Selenium such problems will be much less troublesome. I am becoming more and more proficient with Scrapy, and I feel that crawling is a skill that needs frequent practice. This time I also used the MySQL and SQLite databases to store the data, and they feel quite similar. Finally, I thank teacher Wu Ling for her guidance in the practice class; otherwise I would not have got through some of the debugging.