This article implements a simple crawler whose goal is to download the pictures from a Baidu Tieba (Baidu Post Bar) page.
1. Overview
The steps to download the pictures are as follows:
- Get the HTML text of the web page;
- Analyze the HTML tags of the images on the page, and use a regular expression to extract the URLs of all the images;
- Download the pictures to a local folder according to the list of image URLs.
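The second step can be tried offline before touching the network. The following is a minimal sketch that applies a regular expression to a hand-written HTML fragment. The fragment, the example.com URLs, and the name extract_jpg_urls are made up for illustration, and the pattern assumes each image tag has the form <img ... src="XXX.jpg" width=...>.

```python
import re

# Pattern assumed for illustration: an <img> tag whose src ends in .jpg
# and is immediately followed by a width attribute.
JPG_PATTERN = re.compile(r'<img.+?src="(.+?\.jpg)" width')


def extract_jpg_urls(html):
    """Return every .jpg URL captured by JPG_PATTERN, in document order."""
    return JPG_PATTERN.findall(html)


# Hypothetical fragment: one tag per line, two .jpg images and one .png.
SAMPLE_HTML = "\n".join([
    '<img class="BDE_Image" src="http://example.com/a.jpg" width="560">',
    '<img src="http://example.com/b.png" width="100">',
    '<img class="BDE_Image" src="http://example.com/c.jpg" width="560">',
])

print(extract_jpg_urls(SAMPLE_HTML))
# prints ['http://example.com/a.jpg', 'http://example.com/c.jpg']
```

The sample keeps one tag per line on purpose: "." does not match a newline by default, so a match cannot run from one line into the next. If several tags shared a line, the lazy ".+?" could span from a non-jpg tag into a later one, so the pattern is best treated as a page-specific heuristic rather than a general HTML parser.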
2. Implementation of urllib + re
    #!/usr/bin/python
    # coding:utf-8
    # A simple crawler that downloads the pictures from a Baidu Post Bar page
    # (Python 2: uses urllib.urlopen and urllib.urlretrieve)
    import urllib
    import re

    # Get the html content of the web page at the given url
    def getHtmlContent(url):
        page = urllib.urlopen(url)
        return page.read()

    # Parse the urls of all jpg images from the html
    # The jpg image tags in the Baidu Post Bar html look like: <img ... src="XXX.jpg" width=...>
    def getJPGs(html):
        # Regular expression that matches the jpg image urls
        # Note: the trailing 'width' is there to improve the matching accuracy
        jpgReg = re.compile(r'<img.+?src="(.+?\.jpg)" width')
        # Extract the list of jpg urls
        jpgs = re.findall(jpgReg, html)
        return jpgs

    # Download the image at imgUrl and save it as fileName
    def downloadJPG(imgUrl, fileName):
        urllib.urlretrieve(imgUrl, fileName)

    # Download pictures in batch; they are saved to the current directory by default
    def batchDownloadJPGs(imgUrls, path='./'):
        # Counter used to name the pictures
        count = 1
        for url in imgUrls:
            downloadJPG(url, ''.join([path, '{0}.jpg'.format(count)]))
            count = count + 1

    # Wrapper: download the pictures from a Baidu Post Bar page
    def download(url):
        html = getHtmlContent(url)
        jpgs = getJPGs(html)
        batchDownloadJPGs(jpgs)

    def main():
        url = 'http://tieba.baidu.com/p/2256306796'
        download(url)

    if __name__ == '__main__':
        main()
Run the script above. The download finishes in a few seconds, and the pictures appear in the current directory.
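The script above targets Python 2. On Python 3, urlopen and urlretrieve live in urllib.request, and read() returns bytes that have to be decoded before a text pattern can be applied with re.findall. A minimal sketch of the two network helpers under that assumption follows. The snake_case names and the UTF-8 default are choices made here for illustration, not part of the original script.

```python
import urllib.request


def get_html_content(url, encoding="utf-8"):
    """Fetch a page and return its body as text.

    The encoding is an assumption; undecodable bytes are replaced rather
    than raising, which is enough for regex-based link extraction.
    """
    with urllib.request.urlopen(url) as page:
        return page.read().decode(encoding, errors="replace")


def download_jpg(img_url, file_name):
    """Download one image to file_name and return the path that was written."""
    saved_path, _headers = urllib.request.urlretrieve(img_url, file_name)
    return saved_path
```

The remaining functions (getJPGs, batchDownloadJPGs, download) contain no Python 2-only calls, so they should work unchanged once they call these helpers.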
3. Implementation of requests + re
Next, download with the requests library: the getHtmlContent and downloadJPG functions are reimplemented with requests.
    #!/usr/bin/python
    # coding:utf-8
    # A simple crawler that downloads the pictures from a Baidu Post Bar page
    import re
    # closing() makes sure the response is closed when the with block exits
    from contextlib import closing

    import requests

    # Get the html content of the web page at the given url
    def getHtmlContent(url):
        page = requests.get(url)
        return page.text

    # Parse the urls of all jpg images from the html
    # The jpg image tags in the Baidu Post Bar html look like: <img ... src="XXX.jpg" width=...>
    def getJPGs(html):
        # Regular expression that matches the jpg image urls
        # Note: the trailing 'width' is there to improve the matching accuracy
        jpgReg = re.compile(r'<img.+?src="(.+?\.jpg)" width')
        # Extract the list of jpg urls
        jpgs = re.findall(jpgReg, html)
        return jpgs

    # Download the image at imgUrl and save it as fileName
    def downloadJPG(imgUrl, fileName):
        # Stream the response and write it to disk in 128-byte chunks
        with closing(requests.get(imgUrl, stream=True)) as resp:
            with open(fileName, 'wb') as f:
                for chunk in resp.iter_content(128):
                    f.write(chunk)

    # Download pictures in batch; they are saved to the current directory by default
    def batchDownloadJPGs(imgUrls, path='./'):
        # Counter used to name the pictures
        count = 1
        for url in imgUrls:
            downloadJPG(url, ''.join([path, '{0}.jpg'.format(count)]))
            print('Finished downloading picture {0}'.format(count))
            count = count + 1

    # Wrapper: download the pictures from a Baidu Post Bar page
    def download(url):
        html = getHtmlContent(url)
        jpgs = getJPGs(html)
        batchDownloadJPGs(jpgs)

    def main():
        url = 'http://tieba.baidu.com/p/2256306796'
        download(url)

    if __name__ == '__main__':
        main()
Output: same as before.
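The chunked write in downloadJPG can also be separated from the network call, which makes it easy to check the HTTP status first and to try the logic without a connection. The sketch below is one possible arrangement, not the original author's code: save_response is a hypothetical helper that expects an object behaving like a requests.Response opened with stream=True, and the 8192-byte chunk size is an arbitrary choice.

```python
def save_response(resp, file_name, chunk_size=8192):
    """Write a streamed response body to file_name and return the bytes written.

    resp is assumed to behave like requests.Response: raise_for_status()
    raises on 4xx/5xx answers, and iter_content() yields the body as
    byte chunks of at most chunk_size bytes.
    """
    resp.raise_for_status()
    written = 0
    with open(file_name, "wb") as f:
        for chunk in resp.iter_content(chunk_size=chunk_size):
            f.write(chunk)
            written += len(chunk)
    return written
```

With requests installed, it could be called in the same style as the script above, for example `with closing(requests.get(imgUrl, stream=True, timeout=10)) as resp: save_response(resp, fileName)`. The timeout value is likewise an assumption. Checking the status before writing avoids saving an error page under a .jpg name.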