Use a Python crawler to grab a purely static website and its resources! This project earned 10k!

Posted by hass1980 on Thu, 12 Sep 2019 11:03:45 +0200

Requirements encountered

Some time ago, I needed to quickly put together a static display page, and it had to be responsive and reasonably good-looking. Since time was short and writing one from scratch would have been a hassle, I decided to look online for something ready-made.

Along the way I found several pages that looked good, and then I started thinking about how to download them.


Since I hadn't learned about crawlers before, it naturally didn't occur to me that a crawler could be used to grab web content. So my approach was:

  1. Open the Chrome DevTools console and go to the Application panel
  2. Find the Frames section, locate the HTML file, and right-click Save As...
  3. Manually create local js/css/images directories
  4. Open Images/Scripts/Stylesheets under the Frames section in turn and right-click Save As on each file

This was the best way I could think of at the time. However, this manual method has the following shortcomings:

  1. Manual operation is troublesome and time-consuming
  2. If you are not careful, you forget which files you have already saved
  3. It's hard to deal with path relationships. For example, a picture a.jpg may be referenced in the HTML as images/banner/a.jpg, so you have to fix the path dependencies by hand afterwards

Then, just a few days ago, I picked up a little Python and realized I could write a Python crawler to grab the static website for me automatically. So I got started right away, consulting relevant material as I went.

Next, I'll walk you through the whole process of crawling a static website in detail.

Prerequisites

In the code below we use basic Python plus regular expressions; the core technique is regular expressions.

Let's have a look.

Basic knowledge of Python

If you've studied another language before, I'm sure you can pick up Python quickly. For details, check the official Python documentation or other tutorials.

The concept of crawlers

A crawler, as I understand it, is really just an automated computer program. In the web domain, the premise of its existence is that it simulates the behaviour of a user in a browser.

Its principle is to simulate a user visiting a web page, fetch the page content, then analyse that content, pick out the parts we are interested in, and finally process the data.

The flow, in chart form: simulate a request → fetch the page content → parse the content → extract the data of interest → process the data.

There are a couple of mainstream ways to implement a crawler:

  1. Fetch the page content yourself and implement the parsing yourself
  2. Use a crawler framework written by others, such as Scrapy

Regular expressions

Concept

A regular expression is a string composed of metacharacters and ordinary characters. Its job is to match text according to certain rules, so that we can then process the matched text.

Metacharacters are reserved characters in regular expressions with special matching rules; for example, * means "match the preceding item 0 or more times". Ordinary characters are things like a, b, c, d and so on.

For example, a common front-end operation is to check whether the user's input is empty. We can do this with a regular expression that first strips the whitespace on both sides of the input. The implementation looks like this:

function trim(value) {
  return value.replace(/^\s+|\s+$/g, '')
}
// Output => "Python Crawler"
trim(' Python Crawler ');
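
Since the rest of this article is in Python, here is a rough equivalent of the same trim operation using Python's built-in re module (a minimal sketch of my own, not part of the original project):

import re

def trim(value):
    # Strip leading and trailing whitespace, same pattern as the JS version
    return re.sub(r'^\s+|\s+$', '', value)

# Output => 'Python Crawler'
print(trim('  Python Crawler  '))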

 

Let's take a look at the metacharacters in regular expressions.

Metacharacters in Regular Expressions

As mentioned above, metacharacters are reserved characters in regular expressions with special matching rules, so we should first get to know the ones that come up most often.

Metacharacters matching a single character

  • . matches any single character except \n (the newline character), such as any letter or digit
  • [...] denotes a character group, which can contain any characters; it matches exactly one of them. For example, [abc] matches a or b or c. Note that metacharacters inside a character group are sometimes treated as ordinary characters: in [-*?] the -, * and ? stand only for themselves, not for a range or a repetition
  • [^...] is the opposite of [...]: it matches one character that does not belong to the group, which is not the same as "does not match the group at all". The distinction is subtle but important: the former still requires that some character be consumed

Example: [^123] can match 4, 5, 6 and so on, but not 1, 2 or 3
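
A quick Python sketch of these single-character metacharacters (illustrative examples of my own, not from the project):

import re

# . matches any single character except a newline
print(re.findall(r'a.c', 'abc a-c a\nc'))   # ['abc', 'a-c']
# [...] matches any one character from the group
print(re.findall(r'[abc]', 'cab 123'))      # ['c', 'a', 'b']
# [^...] matches one character that is NOT in the group
print(re.findall(r'[^123]', '1234'))        # ['4']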

Metacharacters that provide counting

  • * matches the preceding item 0 or more times, i.e. it may match no characters at all
  • + matches the preceding item 1 or more times, i.e. at least once
  • ? matches the preceding item 0 or 1 times
  • {min,max} matches the preceding item between min and max times; for example, a{3,5} means a must appear 3 to 5 times
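
A short demo of the counting metacharacters (again, toy examples of my own):

import re

print(re.findall(r'ab*', 'a ab abbb'))          # ['a', 'ab', 'abbb']   * : zero or more b's
print(re.findall(r'ab+', 'a ab abbb'))          # ['ab', 'abbb']        + : one or more b's
print(re.findall(r'ab?', 'a ab abbb'))          # ['a', 'ab', 'ab']     ? : zero or one b
print(re.findall(r'a{3,5}', 'aa aaaa aaaaaa'))  # ['aaaa', 'aaaaa']     {3,5} : three to five a's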

Metacharacters that provide location

  • ^ matches the beginning of the string; for example, ^a requires a to appear at the start, so bcd does not match
  • $ matches the end of the string; for example, a$ requires a to appear at the end, so abab does not match
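
And the position metacharacters in action (illustrative examples of my own):

import re

print(re.search(r'^a', 'abc'))   # matches: a is at the beginning
print(re.search(r'^a', 'bcd'))   # None
print(re.search(r'a$', 'bcba'))  # matches: a is at the end
print(re.search(r'a$', 'abab'))  # None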

Other metacharacters

  • | denotes alternation between subexpressions; for example, abc|def matches abc or def, but not abd
  • (...) denotes grouping. It delimits a subexpression so it can be combined with the metacharacters above; for example, (abc|def)+ matches one or more repetitions of abc or def, such as abcabc or def
  • \i denotes a backreference, where i is an integer such as 1, 2 or 3, and it matches the same text captured by the i-th (...) group. For example, in (abc)+(12)*\1\2, if the match succeeds, \1 refers to abc and \2 refers to 12 or the empty string. Backreferences are commonly used to match paired quotes such as "..." or '...'
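
A small sketch of alternation, grouping and backreferences (illustrative only; the quote-matching pattern is a common idiom, not code from the project):

import re

# | alternation: abc or def, but not abd
print(re.findall(r'abc|def', 'abc abd def'))               # ['abc', 'def']
# (...) grouping combined with +: one or more repetitions of abc or def
print(re.search(r'(?:abc|def)+', 'abcabcdef'))             # matches 'abcabcdef'
# \1 backreference: the closing quote must equal the opening quote
print(re.findall(r'(["\'])(.*?)\1', '"hello" \'world\''))  # [('"', 'hello'), ("'", 'world')]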

Lookaround

As I understand it, lookaround places conditions on the text to the left or right of the current subexpression. Lookaround itself does not consume any characters: it is part of the matching rule, but it is not counted in the matched text, whereas the metacharacters above all consume characters. Lookaround comes in four flavours: positive lookahead, negative lookahead, positive lookbehind and negative lookbehind. They work as follows:

  • Positive lookahead: the match succeeds only if the lookaround text appears immediately to the right of the current position
  • Negative lookahead: the match succeeds only if the lookaround text does not appear immediately to the right of the current position
  • Positive lookbehind: the match succeeds only if the lookaround text appears immediately to the left of the current position
  • Negative lookbehind: the match succeeds only if the lookaround text does not appear immediately to the left of the current position

Positive lookahead

A positive lookahead succeeds when the current subexpression can match the text to its right. It is written (?=...), where ... is the lookahead content. For example, the regular expression (?=hello)he means: match he, but only at a position that is followed by hello. The lookahead only matches a position, not concrete characters; once the position matches, the characters actually consumed are he, so the llo that follows can still be matched by something else.

For (?=hello)he, hello world matches successfully, while hell world does not. The code is as follows:

import re
reg1 = r'(?=hello)he'
print(re.search(reg1, 'hello world'))
print(re.search(reg1, 'hell world hello'))
print(re.search(reg1, 'hell world'))
# Output result
<_sre.SRE_Match object; span=(0, 2), match='he'>
<_sre.SRE_Match object; span=(11, 13), match='he'>
None

 

Negative lookahead

A negative lookahead succeeds only when the current subexpression cannot match the text to its right. It is written (?!...), where ... is the lookahead content. Continuing the example above, the regular expression (?!hello)he means: match he, but only at a position that is not followed by hello.

Examples are as follows:

import re
reg2 = r'(?!hello)he'
print(re.search(reg2, 'hello world'))
print(re.search(reg2, 'hell world hello'))
print(re.search(reg2, 'hell world'))
# Output result
None
<_sre.SRE_Match object; span=(0, 2), match='he'>
<_sre.SRE_Match object; span=(0, 2), match='he'>

 

Positive lookbehind

A positive lookbehind succeeds when the current subexpression can match the text to its left. It is written (?<=...), where ... is the lookbehind content. For example, the regular expression (?<=hello)-python means: match -python, but only where hello appears immediately to its left. The lookbehind only matches a position, not concrete characters; the characters actually consumed are the -python that follows.

Examples are as follows:

import re
reg3 = r'(?<=hello)-python'
print(re.search(reg3, 'hello-python'))
print(re.search(reg3, 'hell-python hello-python'))
print(re.search(reg3, 'hell-python'))
# Output result
<_sre.SRE_Match object; span=(5, 12), match='-python'>
<_sre.SRE_Match object; span=(17, 24), match='-python'>
None

 

Negative lookbehind

A negative lookbehind succeeds when the current subexpression cannot match the text to its left. It is written (?<!...), where ... is the lookbehind content. For example, the regular expression (?<!hello)-python means: match -python, but only where hello does not appear immediately to its left.

Examples are as follows:

import re
reg4 = r'(?<!hello)-python'
print(re.search(reg4, 'hello-python'))
print(re.search(reg4, 'hell-python hello-python'))
print(re.search(reg4, 'hell-python'))
# Output result
None
<_sre.SRE_Match object; span=(4, 11), match='-python'>
<_sre.SRE_Match object; span=(4, 11), match='-python'>

 

Lookaround is very useful for inserting characters into a string: you can use it to match positions and then insert the corresponding characters without replacing the original text.
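
For example, a common trick (my own example, not from the project) is to insert thousands separators into a number by matching only positions:

import re

# Match the zero-width positions that are preceded by a digit and followed by
# groups of exactly three digits up to the end, and insert a comma there.
# No original character is replaced, because lookaround consumes nothing.
print(re.sub(r'(?<=\d)(?=(?:\d{3})+$)', ',', '1234567'))  # 1,234,567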

Capturing groups

In regular expressions, grouping helps us extract exactly the information we want.

Specifying a group is easy: just wrap the part of the expression you want to capture in (). In Python, we can use re.search(reg, xx).groups() to get all the captured groups.

By default, () defines a numbered group; numbering starts from 1, and group i is retrieved with re.search(reg, xx).group(i).

If you don't want a group to be captured, you can write it as (?:...).

Specific examples are as follows:

import re
reg7 = r'hello,([a-zA-Z0-9]+)'
print(re.search(reg7, 'hello,world').groups())
print(re.search(reg7, 'hello,world').group(1))
print(re.search(reg7, 'hello,python').groups())
print(re.search(reg7, 'hello,python').group(1))
# Output result
('world',)
world
('python',)
python
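
For contrast, here is a small sketch of the non-capturing form (?:...) mentioned above (reg8 is just a hypothetical name for this demo):

import re

# (?:...) groups for alternation/repetition but does not create a capture
reg8 = r'hello,(?:[a-zA-Z0-9]+)'
m = re.search(reg8, 'hello,world')
print(m.group(0))   # hello,world
print(m.groups())   # () - nothing was captured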

 

Greedy matching

Greedy matching means that regular expressions match as many characters as possible, that is, they tend to match the maximum length.

Regular expressions default to greedy mode.

Examples are as follows:

import re
reg5 = r'hello.*world'
print(re.search(reg5, 'hello world,hello python,hello world,hello javascript'))
# Output result
<_sre.SRE_Match object; span=(0, 36), match='hello world,hello python,hello world'>

 

As you can see above, it matched hello world,hello python,hello world rather than just the first hello world. So if we only want to match the first hello world, we can use the regular expression's non-greedy mode.

Non-greedy matching is the opposite of greedy matching: it matches as few characters as possible and stops as soon as the match is complete. To use non-greedy mode, just add a question mark (?) after the quantifier.

Or just that example:

import re
reg5 = r'hello.*world'
reg6 = r'hello.*?world'
print(re.search(reg5, 'hello world,hello python,hello world,hello javascript'))
print(re.search(reg6, 'hello world,hello python,hello world,hello javascript'))
# Output result
<_sre.SRE_Match object; span=(0, 36), match='hello world,hello python,hello world'>
<_sre.SRE_Match object; span=(0, 11), match='hello world'>

 

As you can see above, this gives exactly the match we wanted.

Getting into development

With the above basic knowledge, we can enter the development process.

The ultimate effect we want to achieve

Our ultimate goal is to write a simple Python crawler that downloads a static web page together with its static resources (such as js/css/images), while preserving the relative paths the page uses to reference those resources. The test website is http://www.peersafe.cn/index.html. The results are as follows:

 

Development process

Our general approach is: first get the page content, then use regular expressions to extract the resource links we want, and finally download those resources.

Getting Web Content

We use urllib.request, which comes with Python 3, to make HTTP requests; you could also use the third-party requests library.

Part of the code to get the content is as follows:

import urllib.request

url = 'http://www.peersafe.cn/index.html'
# Read the page content
webPage = urllib.request.urlopen(url)
data = webPage.read()
content = data.decode('UTF-8')
print('> Page content fetched, content length:', len(content))
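
If you prefer the third-party requests library mentioned above, a minimal equivalent sketch (assuming requests is installed) might look like this:

import requests

url = 'http://www.peersafe.cn/index.html'
response = requests.get(url)
response.encoding = 'UTF-8'  # decode the same way as the urllib version
content = response.text
print('> Page content fetched, content length:', len(content))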

 

After getting the content, we need to save it, that is, write it to local disk. We define a SAVE_PATH, a directory dedicated to the crawler's downloads.

import os

# python-spider-downloads is the directory we want to save into
# Using the os module to get the current directory and join paths is recommended;
# concatenating strings like 'F://xxx' + '//python-spider-downloads' directly is not
SAVE_PATH = os.path.join(os.path.abspath('.'), 'python-spider-downloads')

 

The next step is to create a separate folder for the site. The folder is named in the form xxxx-xx-xx-domain, such as 2018-08-03-www.peersafe.cn. Before that, we need to write a function that extracts the domain name, relative path, file name and request parameters from a URL; it will be used later when creating directories that mirror how the resource files are referenced.

For example, if you pass in http://www.peersafe.cn/index.html, it will output:

{'baseUrl': 'http://www.peersafe.cn', 'fullPath': 'http://www.peersafe.cn/', 'protocol': 'http://',
 'domain': 'www.peersafe.cn', 'path': '/', 'fileName': 'index.html', 'ext': 'html', 'params': ''}

 

Part of the code is as follows:

import os
import re
from functools import reduce

REG_URL = r'^(https?://|//)?((?:[a-zA-Z0-9-_]+\.)+(?:[a-zA-Z0-9-_:]+))((?:/[-_.a-zA-Z0-9]*?)*)((?<=/)[-a-zA-Z0-9]+(?:\.([a-zA-Z0-9]+))+)?((?:\?[a-zA-Z0-9%&=]*)*)$'
regUrl = re.compile(REG_URL)
# ...
'''
Parse a URL address
'''
def parseUrl(url):
    if not url:
        return
    res = regUrl.search(url)
    # Addresses like 192.168.1.109:8080 are also treated as the "domain" here;
    # strictly speaking www.baidu.com is a domain name while 192.168.1.109 is just an IP address
    # ('http://', '192.168.1.109:8080', '/abc/images/111/', 'index.html', 'html', '?a=1&b=2')
    if res is not None:
        path = res.group(3)
        fullPath = res.group(1) + res.group(2) + res.group(3)
        if not path.endswith('/'):
            path = path + '/'
            fullPath = fullPath + '/'
        return dict(
            baseUrl=res.group(1) + res.group(2),
            fullPath=fullPath,
            protocol=res.group(1),
            domain=res.group(2),
            path=path,
            fileName=res.group(4),
            ext=res.group(5),
            params=res.group(6)
        )

'''
Resolve a save path
eg:
    basePath => F:\Programs\python\python-spider-downloads
    resourcePath => /a/b/c/ or a/b/c
    return => F:\Programs\python\python-spider-downloads\a\b\c
'''
def resolvePath(basePath, resourcePath):
    # Split the resource path
    res = resourcePath.split('/')
    # Remove empty segments: /a/b/c/ => [a, b, c]
    dirList = list(filter(lambda x: x, res))
    # The directory list is not empty
    if dirList:
        # Join the segments into an absolute path
        resourcePath = reduce(lambda x, y: os.path.join(x, y), dirList)
        dirStr = os.path.join(basePath, resourcePath)
    else:
        dirStr = basePath
    return dirStr
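
A quick check of resolvePath with the example paths from its docstring (my own usage sketch; output shown for a POSIX-style base, on Windows the separators would be backslashes):

base = os.path.join(os.path.abspath('.'), 'python-spider-downloads')
print(resolvePath(base, '/a/b/c/'))  # .../python-spider-downloads/a/b/c
print(resolvePath(base, 'a/b/c'))    # same result
print(resolvePath(base, '/'))        # just the base path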

 

The regular expression REG_URL above is a bit long. It can parse all the URL forms I have encountered so far; if it fails on something, you can extend it yourself. The list of URLs I tested can be viewed on my github.

First, for the most complex URL links (such as 'http://192.168.1.109:8080/abc/images/111/index.html?a=1&b=2'), we want to extract http://, 192.168.1.109:8080, /abc/images/111/, index.html and ?a=1&b=2 respectively. Extracting /abc/images/111/ prepares us for creating the directories later, and index.html is the file name the page content will be written to.

If you are interested, you can study REG_URL in depth; if you have something better, or something is unclear, we can discuss it together.
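
For reference, feeding the complex URL above into parseUrl should produce something like the following (the values are inferred from the group layout noted in the code comment, so treat them as expected rather than verified output):

print(parseUrl('http://192.168.1.109:8080/abc/images/111/index.html?a=1&b=2'))
# Expected:
# {'baseUrl': 'http://192.168.1.109:8080',
#  'fullPath': 'http://192.168.1.109:8080/abc/images/111/',
#  'protocol': 'http://',
#  'domain': '192.168.1.109:8080',
#  'path': '/abc/images/111/',
#  'fileName': 'index.html',
#  'ext': 'html',
#  'params': '?a=1&b=2'}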

With the parseUrl function in place, we can connect what we just fetched from the page to what we write to disk. The code is as follows:

import time

# First create the folder for this site
urlDict = parseUrl(url)
print('Parsed URL:', urlDict)
domain = urlDict['domain']
filePath = time.strftime('%Y-%m-%d', time.localtime()) + '-' + domain
# If the domain has the form 192.168.1.1:8000, turn it into 192.168.1.1-8000,
# since ':' cannot appear in a file name
filePath = re.sub(r':', '-', filePath)
SAVE_PATH = os.path.join(SAVE_PATH, filePath)
# Read the page content
webPage = urllib.request.urlopen(url)
data = webPage.read()
content = data.decode('UTF-8')
print('> Page content fetched, content length:', len(content))
# Write the page content to disk
pageName = ''
if urlDict['fileName'] is None:
    pageName = 'index.html'
else:
    pageName = urlDict['fileName']
pageIndexDir = resolvePath(SAVE_PATH, urlDict['path'])
if not os.path.exists(pageIndexDir):
    os.makedirs(pageIndexDir)
pageIndexPath = os.path.join(pageIndexDir, pageName)
print('Path of the home page:', pageIndexPath)
f = open(pageIndexPath, 'wb')
f.write(data)
f.close()

 

Extracting useful resource links

The resources we want are images, JS files, CSS files and font files. We parse the content line by line and use a capturing group to pull out the links we want, such as images/1.png and scripts/lib/jquery.min.js.

The code is as follows:

REG_RESOURCE_TYPE = r'(?:href|src|data\-original|data\-src)=["\'](.+?\.(?:js|css|jpg|jpeg|png|gif|svg|ico|ttf|woff2))[a-zA-Z0-9\?\=\.]*["\']'
# re.S makes '.' also match newlines
regResouce = re.compile(REG_RESOURCE_TYPE, re.S)
# ...
# Parse the page content to get the useful links
# content is the page content read in the previous step
contentList = re.split(r'\s+', content)
resourceList = []
for line in contentList:
    resList = regResouce.findall(line)
    if resList is not None:
        resourceList = resourceList + resList
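
A quick sanity check of the resource regex against a made-up line of HTML (my own example):

sample = '<link rel="stylesheet" href="css/style.css"><img data-original="images/1.png?v=2">'
print(regResouce.findall(sample))
# Expected: ['css/style.css', 'images/1.png']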

 

Download resources

After extracting the resource links, we need to check each one and turn it into a URL that can be requested over HTTP. For example, images/1.png needs the protocol and the domain we just parsed prepended, giving http://domain/images/1.png.

The following is the code for processing resource links:

# Possible forms of a resource link:
# ./static/js/index.js
# /static/js/index.js
# static/js/index.js
# //abc.cc/static/js
# http://www.baidu.com/static/index.js
if resourceUrl.startswith('./'):
    resourceUrl = urlDict['fullPath'] + resourceUrl[1:]
elif resourceUrl.startswith('//'):
    resourceUrl = 'https:' + resourceUrl
elif resourceUrl.startswith('/'):
    resourceUrl = urlDict['baseUrl'] + resourceUrl
elif resourceUrl.startswith('http') or resourceUrl.startswith('https'):
    # Already the url format we want
    pass
elif not (resourceUrl.startswith('http') or resourceUrl.startswith('https')):
    # The static/js/index.js case
    resourceUrl = urlDict['fullPath'] + resourceUrl
else:
    print('> Unknown resource url: %s' % resourceUrl)

 

Next, parseUrl is used again to extract the directory and file name of each resource, and then the corresponding directories are created.

Here I also handle resources referenced from other sites.

# Parse the resource URL to work out its file path
resourceUrlDict = parseUrl(resourceUrl)
if resourceUrlDict is None:
    print('> Error parsing file: %s' % resourceUrl)
    continue
resourceDomain = resourceUrlDict['domain']
resourcePath = resourceUrlDict['path']
resourceName = resourceUrlDict['fileName']
if resourceDomain != domain:
    print('> This resource is not hosted on this site; download it as well:', resourceDomain)
    # Downloading it changes the root directory:
    # create a separate directory to hold resources from other sites
    resourceDomain = re.sub(r':', '-', resourceDomain)
    savePath = os.path.join(SAVE_PATH, resourceDomain)
    if not os.path.exists(savePath):
        print('> The target directory does not exist; creating it:', savePath)
        os.makedirs(savePath)
    # continue
else:
    savePath = SAVE_PATH
# Resolve the resource path
dirStr = resolvePath(savePath, resourcePath)
if not os.path.exists(dirStr):
    print('> The target directory does not exist; creating it:', dirStr)
    os.makedirs(dirStr)
# Write the file
downloadFile(resourceUrl, os.path.join(dirStr, resourceName))

 

The code of the downloadFile function is as follows:

'''
Download a file
'''
def downloadFile(srcPath, distPath):
    global downloadedList
    if distPath in downloadedList:
        return
    try:
        response = urllib.request.urlopen(srcPath)
        if response is None or response.status != 200:
            return print('> Request exception:', srcPath)
        data = response.read()
        f = open(distPath, 'wb')
        f.write(data)
        f.close()
        downloadedList.append(distPath)
        # print('>>>: ' + srcPath + ': downloaded successfully')
    except Exception as e:
        print('Error:', e)
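
Note that downloadFile relies on a module-level downloadedList. A minimal usage sketch (with a hypothetical favicon as the target file), assuming the list is initialised as empty before crawling starts:

import os
import urllib.request

# The function above expects this global to exist
downloadedList = []

# Hypothetical example: save the site's favicon into the save directory
downloadFile('http://www.peersafe.cn/favicon.ico',
             os.path.join(SAVE_PATH, 'favicon.ico'))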

 

That's the whole development process.

Knowledge summary

Techniques used in this project

  1. Using urllib.request to send network requests
  2. Using regular expressions to extract resource links
  3. Using the os module to handle file paths

Experience

This article is a practical summary of my Python study over this period, and a chance to record what I've learned about regular expressions along the way. I hope it also helps others who want to learn regular expressions and crawlers.

Topics: Python Javascript Lambda github