Python 3.8: crawling images from multiple static websites with urllib's request module and BeautifulSoup's parsing and search methods

Posted by gamerzfuse on Wed, 02 Feb 2022 17:44:38 +0100

preface

This post mainly introduces a method for crawling images from static web pages with Python. It can crawl multiple pages and is fairly general; crawling images from dynamic pages is not implemented yet. The difference between static and dynamic pages:
Static page images: the image URL appears directly in the page's source code.
Dynamic page images: on a page like Baidu Images, the image URLs are not exposed directly in the source code; the images are loaded dynamically, and only the site's own static resources (e.g. files under a static or image folder) show up in the source.
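A quick way to tell the two apart (a minimal sketch of my own, using the ivsky page that appears later in this post): fetch the page source and see whether direct image links appear in it.

import re
import urllib.request

# Page used later in this post; swap in any page you want to check
req = urllib.request.Request("https://www.ivsky.com/tupian/zhiwuhuahui/",
                             headers={"User-Agent": "Mozilla/5.0"})
html = urllib.request.urlopen(req).read().decode("utf-8")

# If src attributes ending in an image suffix appear in the raw source,
# the page is static (for our purposes) and the method in this post applies
print(re.findall(r'src="([^"]+?\.(?:jpg|jpeg|png|gif|webp))"', html))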

1, Method introduction

1. Method reference link

https://jingyan.baidu.com/article/46650658d73272f548e5f87c.html
https://www.jb51.net/article/141513.htm

2. Implementation principle

The implementation principle of the whole code is actually quite simple. It is based on the urlopen(url) method of urllib.request, which fetches the source code of the target page. In general you also need to add a User-Agent header to make the request look like an ordinary browser visit, so the site's anti-crawler measures don't block you. Concretely:
req = urllib.request.Request(url)
req.add_header('User-Agent', 'user_agent')
urllib.request.urlopen(req)
Then use BeautifulSoup to parse the source code, find the relevant tags in the parsed tree, and read the corresponding attribute values (using BeautifulSoup's search methods); those values are the download URLs of the images. Finally, download each image. Link to the official Chinese documentation of Beautiful Soup:
https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/
The key point is determining the image download URL. How to obtain it may differ from site to site; the way it is written here should work for many ordinary websites, and if it doesn't, you can adapt the URL-extraction logic to the source code of the site in question. Putting those steps together, a minimal sketch of the whole pipeline follows.
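(Fetch, parse, extract. A sketch of my own, assuming the ivsky page used later in this post; the User-Agent string is abbreviated here.)

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

url = "https://www.ivsky.com/tupian/zhiwuhuahui/"
req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
html = urlopen(req).read().decode("utf-8")

soup = BeautifulSoup(html, "html.parser")
# Each <img> tag's src attribute holds the image's download URL (possibly relative)
for img in soup.find_all("img"):
    print(img.get("src"))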

3. Introduction to other methods

Beautiful Soup workflow:
HTML page -> create a BeautifulSoup object -> search for nodes with find_all()/find() -> access the node's name, attributes, text, etc.
Besides locating the image download links by tag, regular expressions can also be used (get_img() in the full code below does exactly that); a tiny example of the tag workflow follows.
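A minimal self-contained example of that workflow (the HTML snippet here is made up for illustration):

from bs4 import BeautifulSoup

html = '<div><img src="//img.ivsky.com/img/demo.jpg" alt="demo"></div>'
soup = BeautifulSoup(html, "html.parser")

node = soup.find("img")   # first matching node
print(node.name)          # tag name: img
print(node.attrs)         # all attributes as a dict
print(node.get("src"))    # one attribute: //img.ivsky.com/img/demo.jpg
print(soup.get_text())    # all text content (empty for this snippet)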

2, Usage steps

0. Reference blog

Blog post I started from.
Searching for blogs with a complete key sentence works great: "python get all the picture links of an html page"

https://blog.csdn.net/Drifter_Galaxy/article/details/104886684

1. Imports

Import the libraries that will be used:

import urllib.request  # Import the extension library module used to open the URL
import urllib.parse
from bs4 import BeautifulSoup  # Not used by the original regex-based method; the improved code relies on it
import re  # Import regular expression module

2. Problems encountered

#The following lines are related problems and reference links. I won't post the specific solutions here; check the corresponding pages. (A short sketch of the two fixes I needed most often follows this list.)

#urllib.error.HTTPError: HTTP Error 418: a User-Agent header needs to be added

#IndexError: list assignment index out of range, solution:
https://blog.csdn.net/qq_44768814/article/details/88614393
#python: downloading pictures or archives from links:
https://blog.csdn.net/qq_34783484/article/details/95314582
#Positional argument after keyword argument: 
https://blog.csdn.net/swy_swy_swy/article/details/106290505
#python: how to get substrings / parts of a string: 
https://www.cnblogs.com/chuanzhang053/p/10006768.html
#Tutorial on checking the start of a string with Python's startswith() function: 
https://www.jb51.net/article/63589.htm
#TypeError: 'int' object is not callable: never solved this one; I don't think that link has the real cause. I suspect I passed extra parameters to startswith()
https://blog.csdn.net/gaifuxi9518/article/details/81193296 
#AttributeError: 'str' object has no attribute 'startwith': make sure the function name is spelled correctly! Change startwith to startswith!
https://blog.csdn.net/u010244992/article/details/104554484/ 
#In python, the length of a string is obtained with len(str)
https://www.cnblogs.com/chuanzhang053/p/10006856.html 
#UnicodeEncodeError: 'ascii' codec can't encode characters in position 38-43: didn't solve this one; I just ignored it (the error occurs because the crawl address I entered contains Chinese characters)
https://blog.csdn.net/zhangxbj/article/details/44975129 
#python crawler crawling Baidu images: Baidu images cannot be crawled because the Baidu image pages are dynamic, while the method here only crawls static pages
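A hedged sketch of fixes for the two issues above that come up most often (the User-Agent string is arbitrary; percent-encoding the non-ASCII part of the URL with urllib.parse.quote is one standard way around the UnicodeEncodeError):

import urllib.request
import urllib.parse

url = "https://www.ivsky.com/tupian/zhiwuhuahui/"  # may contain Chinese characters in practice

# UnicodeEncodeError: percent-encode non-ASCII characters in the URL,
# keeping the structural characters (:/?&=) intact
url = urllib.parse.quote(url, safe=':/?&=')

# HTTP Error 418: add a User-Agent header so the request looks like a browser
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
page = urllib.request.urlopen(req)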

3. The crawl URLs used

# Define url
# url = "https://movie.douban.com/top250"
# url = "https://movie.douban.com/chart"
#//img.ivsky.com/img/tupian/li/202011/30/laju-005.jpg: the page below uses protocol-relative paths (no http: prefix), which must be added for the crawl to succeed
#url = "https://www.ivsky.com/tupian/zhiwuhuahui/"
#/static/res/pichp/imgs/picLogo.gif: the page below contains <img src= links that do not start with http
# url = "http://pic.yxdown.com/list/0_0_1.html"
# url = " https://m.baidu.com/sf/vsearch?pd=image_content&word= Download forest pictures & TN = vsearch & ATN = page“
#No image links can be crawled from the following URL: it is a dynamic Baidu page, the image links are not exposed directly, and the method here only handles static pages
# url = "https://image.baidu.com/search/index?tn=baiduimage&ct=201326592&lm=-1&cl=2&ie=gb18030&word=%CD%BC%C6%AC%CF%C2%D4%D8&fr=ala&ala=1&alatpl=adress&pos=0&hs=2&xthttps=000000"

4. Full code

The url variable here is the link of the image page to crawl. The comments inside are verbose and could simply be deleted; I keep them to make it easy to see how the code evolved.

#! author: fenggbinn
#@date: 2021-06-09 PM - 2021-06-10 00:21 am
# Working this out took a lot of searching online
# I think this website is good for crawling pictures: https://www.ivsky.com/tupian/dongwutupian/
# While crawling I ran into the need for a User-Agent; I searched for a long time, and even finding the User-Agent string in my own browser turned out to be a problem

#The following lines are related problems and reference links
#IndexError: list assignment index out of range, solution: https://blog.csdn.net/qq_44768814/article/details/88614393
#python: downloading pictures or archives from links: https://blog.csdn.net/qq_34783484/article/details/95314582
#Positional argument after keyword argument: https://blog.csdn.net/swy_swy_swy/article/details/106290505
#python: how to get substrings / parts of a string: https://www.cnblogs.com/chuanzhang053/p/10006768.html
#Tutorial on checking the start of a string with Python's startswith() function: https://www.jb51.net/article/63589.htm
#TypeError: 'int' object is not callable: https://blog.csdn.net/gaifuxi9518/article/details/81193296 Never solved this one; I suspect the real cause was passing extra parameters to startswith()
#AttributeError: 'str' object has no attribute 'startwith': https://blog.csdn.net/u010244992/article/details/104554484/ Make sure the function name is spelled correctly! Change startwith to startswith!
#In python, getting the length of a string: https://www.cnblogs.com/chuanzhang053/p/10006856.html Use len(str)
#UnicodeEncodeError: 'ascii' codec can't encode characters in position 38-43: https://blog.csdn.net/zhangxbj/article/details/44975129 Didn't solve this one; I just ignored it (the crawl address I entered contains Chinese characters)
#python crawler crawling Baidu images: Baidu images cannot be crawled because the Baidu image pages are dynamic, while the method here only crawls static pages


import urllib.request  # Import the extension library module used to open the URL
import urllib.parse
from bs4 import BeautifulSoup  # Not used by the original regex-based method; the improved code relies on it
import re  # Import regular expression module

#copied from the web
def open_url(url):
    req = urllib.request.Request(url)  # Instantiate the Request class and pass in the url as the initial value, and then assign it to req
    # Add a header and pretend to be a browser
    # req.add_header('User-Agent',
    #                'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 '
    #                'Safari/537.36 SE 2.X MetaSr 1.0')

    # Second User-Agent tried; without a User-Agent: urllib.error.HTTPError: HTTP Error 418: I'm a teapot
    req.add_header('User-Agent',
                   'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36')


    # Open the url; the response object is assigned to page
    page = urllib.request.urlopen(req)
    # Read the response and decode it as utf-8
    html = page.read().decode('utf-8')

    return html

#My own method for collecting the image links
def get_allImgUrl(html):
    soup = BeautifulSoup(html, "html.parser")
    imall = []

    # Collect all image links; modify this (or add more checks) for a specific website
    for imgu in soup.find_all('img'):
        im = imgu.get('src')
        if im is None:  # some <img> tags have no src attribute at all
            continue
        print(im + '000000000000')  # debug marker: raw src values
        # Keep only usable links
        if im.startswith('http'):
            imall.append(im)
        elif im.startswith('//'):
            imall.append('http:' + im)  # protocol-relative link: prepend the scheme
    # Debug marker: which image links were kept
    for j in imall:
        print(j + '9999999999')
    return imall

#Image download method, rewritten from the copied-from-web method
#imall is the list of all image links obtained
def downloadPictures(imall):
    # Loop over every link in the list
    for img in imall:
        # Split on /; index -1 gives the last segment, used as the filename
        filename = img.split("/")[-1]
        #Judging links such as //img.ivsky.com/img/tupian/li/202011/30/laju-005.jpg:
        #some image links may start with a // double slash
        #That case itself is not a problem. What really needs judging is links that do not start with http: and point only to static files (e.g. logo images), otherwise an error is raised; that judgment belongs where links are added to the list
        #On retesting, the // double-slash case does need handling too, otherwise the value is not recognized as a link when read; it is handled above (in get_allImgUrl)
        # if img.startswith('//'):
        #     img = img[2:]
        #     print(img)
        # else:
        #     print('false')

        '''
        Function: startswith()

        Purpose: check whether a string starts with a specified character or substring.

        Syntax: string.startswith(str, beg=0, end=len(string))
                or string[beg:end].startswith(str)

        Parameters:
        string: the string being checked
        str:    the specified character or substring (a tuple may be passed; its elements are tried one by one)
        beg:    start position of the check (optional)
        end:    end position of the check (optional)
        If beg and end are given, the check runs within that range; otherwise the whole string is checked.

        Return value:
        Returns True if the string starts with str, otherwise False. An empty prefix returns True.
        '''
        request_download(img, savepath="d:/test/t2",filename=filename)
        # # Visit each and assign the binary data of the page to photo
        # photo = urllib.request.urlopen(img)
        # w = photo.read()
        # # Opens the specified file and allows binary data to be written
        # f = open('D:/test/' + filename, 'wb')
        # # Write acquired data
        # f.write(w)
        # # Close file
        # f.close()
        # print(filename + " have been download...")

#Download method based on code found elsewhere online (rewritten)
def request_download(IMAGE_URL, savepath, filename):
    import requests  # third-party library; the rest of this post uses urllib
    r = requests.get(IMAGE_URL)
    with open(savepath + "/" + filename, 'wb') as f:
        f.write(r.content)
    print(filename + ": downloaded")
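# If a download fails silently (e.g. a dead link saves an HTML error page as
# a .jpg), a slightly more defensive variant (my own suggestion, not part of
# the original code) checks the HTTP status and sets a timeout:
def request_download_safe(IMAGE_URL, savepath, filename):
    import requests
    r = requests.get(IMAGE_URL, timeout=10)  # don't hang forever on a dead link
    r.raise_for_status()  # raise on 4xx/5xx instead of saving an error page
    with open(savepath + "/" + filename, 'wb') as f:
        f.write(r.content)
    print(filename + ": downloaded")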

#copied from the web (this is the earlier method; after improving the code I no longer use it. It is rigid and can only crawl a fixed page, while the improved code can crawl many different static pages)
def get_img(html):
    # [^ "] + \. jpg matches all characters except" multiple times, followed by escaped characters And png
    p = r'(http.:[\S]*?.(jpg|jpeg|png|gif|bmp|webp))'
    # Returns a list of all matching results of a regular expression in a string
    imglist = re.findall(p, html)
    print("List of Img: " + str(imglist))
    # Loop through each value of the list
    for img in imglist:
        # Split on /; index -1 gives the last segment, used as the filename
        filename = img[0].split("/")[-1]
        # Visit each and assign the binary data of the page to photo
        photo = urllib.request.urlopen(img[0])
        w = photo.read()
        # Opens the specified file and allows binary data to be written
        f = open('D:/test/' + filename, 'wb')
        # Write acquired data
        f.write(w)
        # Close file
        f.close()
        print(filename + " have been download...")


# This module can be imported by other modules; it can also be run directly
if __name__ == '__main__':
    # Define url
    # url = "https://movie.douban.com/top250"
    # url = "https://movie.douban.com/chart"#urllib.error.HTTPError: HTTP Error 418: I'm a teapot
    url = "https://www.ivsky.com/tupian/zhiwuhuahui/"#//img.ivsky.com/img/tupian/li/202011/30/laju-005.jpg this page has a relative path (excluding http: request header, which needs to be added to successfully crawl)
    # url = "http://pic.yxdown.com/list/0_0_1.html"#/static/res/pichp/imgs/picLogo.gif000000000000000000 this page contains a < img SRC link that does not start with http
    # url = " https://m.baidu.com/sf/vsearch?pd=image_content&word= Download forest pictures & TN = vsearch & ATN = page“
    #The following URL can't crawl the picture link, because it's Baidu's dynamic page. The picture link is not directly exposed. The method I'm writing now can only crawl the static page
    # url = "https://image.baidu.com/search/index?tn=baiduimage&ct=201326592&lm=-1&cl=2&ie=gb18030&word=%CD%BC%C6%AC%CF%C2%D4%D8&fr=ala&ala=1&alatpl=adress&pos=0&hs=2&xthttps=000000"
    # Use url as open_url(), and then open_ The return value of url () is assigned to get as a parameter_ img()
    # get_img(open_url(url))
    # get_allImgUrl(open_url(url))
    downloadPictures(get_allImgUrl(open_url(url)))
    print("all over...")

summary

This article batch-crawls images from web pages with Python and downloads them straight to the local disk. Of course, the approach should be applied flexibly: the image download addresses don't have to be located by tag alone, and if you want to download other kinds of content you can swap in the corresponding extraction logic.
This is my first contact with Python crawlers, so there are surely places that aren't well written and methods that aren't optimal. If you have suggestions, just leave a comment.

other

1.CSDN article font color

While writing this article I discovered that <font color='#999AAA'> can be used to adjust the font color on CSDN.
The hexadecimal color code after color= can be changed to your liking.
Post a hexadecimal color code reference link:

https://blog.csdn.net/shakespeare001/article/details/7816022

2. Blog link storage

I put all the links in this article into code blocks, which makes them easy to collect and review later. There are a lot of links, and posts with many links often run into problems during review.

Topics: Python crawler beautifulsoup