A simple crawler example in Python

Posted by fighter1430 on Sun, 05 Jan 2020 04:26:06 +0100

This article implements a simple crawler whose goal is to download the pictures from a Baidu Tieba (Baidu Post Bar) page.

1. Overview

The crawler downloads the pictures in three steps:

Get the HTML text of the web page;
Examine the HTML tags that wrap the images, and use a regular expression to extract the URLs of all the images;
Download the pictures to a local folder according to the list of image URLs.
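
As a rough illustration of step 2, the sketch below runs a regular expression over a small hand-written HTML fragment. The fragment, its example.com URLs, and the names SAMPLE_HTML, JPG_PATTERN and extract_jpg_urls are made up for this illustration, and the pattern is a stricter variant of the one used in the scripts in this article; the markup of a real page may differ.

```python
import re

# Hypothetical HTML fragment, hand-written for illustration only
SAMPLE_HTML = (
    '<div><img class="photo" src="http://example.com/a.jpg" width="560">'
    '<img src="http://example.com/logo.png" width="20">'
    '<img class="photo" src="http://example.com/b.jpg" width="560"></div>'
)

# Capture the src value of every <img> tag whose src ends in .jpg.
# [^>] keeps the match inside one tag, [^"] inside one attribute value.
JPG_PATTERN = re.compile(r'<img[^>]*?src="([^"]+?\.jpg)"')

def extract_jpg_urls(html):
    """Return the jpg URLs found in html, in document order."""
    return JPG_PATTERN.findall(html)

print(extract_jpg_urls(SAMPLE_HTML))
```

Using [^>] and [^"] instead of .+? means that a non-jpg image sitting between two jpg images cannot be swallowed into a single capture.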

2. Implementation with urllib + re

#!/usr/bin/python
# coding:utf-8
# A simple crawler that downloads the pictures from a Baidu Post Bar page
import urllib  # Python 2 module; in Python 3, urlopen and urlretrieve live in urllib.request
import re

# Get the html content of the web page at url
def getHtmlContent(url):
    page = urllib.urlopen(url)
    return page.read()

# Parse the URLs of all jpg images from the html
# In Baidu Post Bar html, a jpg image tag looks like: <img ... src="XXX.jpg" width=...>
def getJPGs(html):
    # Regular expression that matches the URL of a jpg image
    jpgReg = re.compile(r'<img.+?src="(.+?\.jpg)" width')  # Note: the trailing 'width' makes the match more precise
    # Extract the list of jpg URLs
    jpgs = re.findall(jpgReg,html)
    
    return jpgs

# Download the image at imgUrl and save it as fileName
def downloadJPG(imgUrl,fileName):
    urllib.urlretrieve(imgUrl,fileName)
    
# Download pictures in batch and save them to the current directory by default
def batchDownloadJPGs(imgUrls,path = './'):
    # Used to name pictures
    count = 1
    for url in imgUrls:
        downloadJPG(url,''.join([path,'{0}.jpg'.format(count)]))
        count = count + 1

# Wrapper: download the pictures from a Baidu Post Bar page
def download(url):
    html = getHtmlContent(url)
    jpgs = getJPGs(html)
    batchDownloadJPGs(jpgs)
    
def main():
    url = 'http://tieba.baidu.com/p/2256306796'
    download(url)
    
if __name__ == '__main__':
    main()

Run the script above; the download finishes within a few seconds, and the pictures are saved in the current directory.
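
The script above targets Python 2. On Python 3, urlopen and urlretrieve live in urllib.request, and read() returns bytes that need decoding before a str regular expression can be applied. A minimal sketch of the fetch step under those assumptions follows; the function name get_html_content, the utf-8 default and the errors='replace' choice are illustrative and not taken from the original script.

```python
import urllib.request

def get_html_content(url, encoding='utf-8'):
    """Fetch url and return the response body decoded to str."""
    # The context manager closes the connection when the block exits
    with urllib.request.urlopen(url) as page:
        raw = page.read()  # bytes in Python 3
    # Assumption: the page is utf-8; undecodable bytes become U+FFFD
    return raw.decode(encoding, errors='replace')

# Usage (needs network access):
# html = get_html_content('http://tieba.baidu.com/p/2256306796')
```

The decoded string can then be passed to getJPGs unchanged, since re.findall with a str pattern expects str input.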

3. Implementation with requests + re

Next, download with the requests library by reimplementing the getHtmlContent and downloadJPG functions with requests.

#!/usr/bin/python
# coding:utf-8
# A simple crawler that downloads the pictures from a Baidu Post Bar page
import requests
import re

# Get the html content of the web page at url
def getHtmlContent(url):
    page = requests.get(url)
    return page.text

# Parse the URLs of all jpg images from the html
# In Baidu Post Bar html, a jpg image tag looks like: <img ... src="XXX.jpg" width=...>
def getJPGs(html):
    # Regular expression that matches the URL of a jpg image
    jpgReg = re.compile(r'<img.+?src="(.+?\.jpg)" width')  # Note: the trailing 'width' makes the match more precise
    # Extract the list of jpg URLs
    jpgs = re.findall(jpgReg,html)
    
    return jpgs

# Download the image at imgUrl and save it as fileName
def downloadJPG(imgUrl,fileName):
    # closing() makes sure the response is closed when the with block exits
    from contextlib import closing
    with closing(requests.get(imgUrl,stream = True)) as resp:
        with open(fileName,'wb') as f:
            for chunk in resp.iter_content(128):
                f.write(chunk)
    
# Download pictures in batch and save them to the current directory by default
def batchDownloadJPGs(imgUrls,path = './'):
    # Used to name pictures
    count = 1
    for url in imgUrls:
        downloadJPG(url,''.join([path,'{0}.jpg'.format(count)]))
        print('Downloaded picture {0}'.format(count))
        count = count + 1

# Wrapper: download the pictures from a Baidu Post Bar page
def download(url):
    html = getHtmlContent(url)
    jpgs = getJPGs(html)
    batchDownloadJPGs(jpgs)
    
def main():
    url = 'http://tieba.baidu.com/p/2256306796'
    download(url)
    
if __name__ == '__main__':
    main()

Output: same as before.
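
Both versions build the file name by joining path and the counter as plain strings, which only works when path ends with a separator. A possible variation, sketched below, uses os.path.join and enumerate instead. The names batch_download and fetch are hypothetical; fetch stands in for either downloadJPG implementation, and the example.com URLs are placeholders.

```python
import os
import tempfile

def batch_download(img_urls, fetch, path='.'):
    """Call fetch(url, file_name) for each URL; return the file names used."""
    os.makedirs(path, exist_ok=True)  # create the target folder if it is missing
    saved = []
    # enumerate(..., start=1) replaces the manual counter
    for count, url in enumerate(img_urls, start=1):
        file_name = os.path.join(path, '{0}.jpg'.format(count))
        fetch(url, file_name)
        saved.append(file_name)
    return saved

# Dry run with a stand-in fetch that only records its arguments
target = tempfile.mkdtemp()
calls = []
names = batch_download(
    ['http://example.com/a.jpg', 'http://example.com/b.jpg'],
    lambda url, name: calls.append((url, name)),
    path=target,
)
print([os.path.basename(n) for n in names])
```

Passing the download function as a parameter keeps the naming logic independent of whether urllib or requests performs the transfer.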

Topics: Python
