Crawler - English novel analysis

Posted by zzlong on Wed, 05 Jan 2022 07:07:00 +0100

Research background (some nonsense)

A web crawler (also known as a web spider or web robot, and more often called a web chaser in the FOAF community) is a program or script that automatically fetches information from the World Wide Web according to certain rules.

Generally speaking, if you need information from the Internet, such as commodity prices, pictures, or video resources, but you don't want to or can't open the pages and collect it one by one, you write a program that collects it for you according to rules you specify. That program is a crawler. The search engines we know, such as Baidu and Google, are in effect huge crawlers.

Are crawlers legal?

Generally speaking, as long as a crawler does not affect the normal operation of other people's websites and is not used for commercial purposes, site owners usually do no more than block the IP address or account, so the legal risk is low.

Background of the article: (this article is only for learning and communication)

Crawl an English text and analyze it. This article uses the novel Jane Eyre. The analysis covers word frequency and word length statistics, plus a word cloud of the Chinese text.

Tip: the following is the main content of this article. The following cases can be used for reference

1, Related principles

A web crawler is a program that automatically fetches web pages. It downloads pages from the World Wide Web for a search engine and is an important part of one. A traditional crawler starts from the URLs of one or several initial pages and collects the URLs found on them. While fetching pages, it keeps extracting new URLs from the current page and putting them into a queue until one of the system's stop conditions is met.

The workflow of a focused crawler is more complex: it uses a web page analysis algorithm to filter out links that are irrelevant to the topic, keeps the useful ones, and puts them into the queue of URLs waiting to be fetched. It then selects the next URL from the queue according to some search strategy and repeats the process until a stop condition is reached. In addition, every page the crawler fetches is stored, analyzed, filtered, and indexed for later query and retrieval. For a focused crawler, the results of this analysis can also feed back into and guide later fetching.
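The queue-driven loop of a traditional crawler can be sketched as follows. This is only an illustration, not the code used in this article: the "web" is an in-memory dictionary standing in for real pages, and the names crawl, get_links, max_pages and toy_web are hypothetical.

```python
from collections import deque


def crawl(seed_urls, get_links, max_pages=100):
    """Visit pages breadth-first, starting from seed_urls.

    get_links(url) must return the URLs found on that page; here it is a
    stand-in for downloading and parsing a real page.
    """
    queue = deque(seed_urls)      # URLs waiting to be fetched
    seen = set(seed_urls)         # URLs already queued, to avoid repeats
    visited = []                  # order in which pages were fetched

    # Stop condition: the queue is empty or the page budget is used up
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# A toy link graph in place of the real web (hypothetical page names)
toy_web = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a"],
    "d": [],
}
order = crawl(["a"], lambda url: toy_web.get(url, []))
print(order)  # ['a', 'b', 'c', 'd']
```

A focused crawler would add one more step: a relevance check on each link before it is appended to the queue.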

Compared with a general web crawler, a focused crawler also has to solve three main problems:

(1) Describing or defining the crawl target;

(2) Analyzing and filtering web pages or data;

(3) Choosing a search strategy for URLs.

The environment used in the article:

        Python 3.9 + bs4 + jieba + matplotlib + wordcloud + PIL + numpy + urllib + xlwt

You also need some front-end knowledge to parse the HTML text.

See below for the specific principle.

2, Design idea

Find target - crawl - save - Analyze

2, Implementation process

1. Analyze the page

1. Crawler target

2. Page analysis

3. Chapter URL

Chapter I:

Chapter II:

Chapter III:


The URL of chapter i ends in: re/{{1975+i-1}}.html

Thus the page number in the URL of chapter 28 is:

1975 + 28 - 1 = 2002

https :// verification:
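The chapter-to-URL rule above can be written as a small function. This is a sketch: the function name chapter_url and the constant FIRST_PAGE_NUMBER are hypothetical, and the base URL below is a placeholder, not the real site.

```python
FIRST_PAGE_NUMBER = 1975  # page number of chapter 1, as stated above


def chapter_url(base_url, chapter):
    """Return the URL of the given chapter (chapters are numbered from 1)."""
    if chapter < 1:
        raise ValueError("chapter numbers start at 1")
    return f"{base_url}{FIRST_PAGE_NUMBER + chapter - 1}.html"


# "https://example.invalid/re/" is a placeholder base URL
print(chapter_url("https://example.invalid/re/", 28))
# https://example.invalid/re/2002.html
```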

2. Import modules

from bs4 import BeautifulSoup  # HTML parsing
import urllib.request, urllib.error  # Fetch pages by URL
import jieba    # Chinese word segmentation
from matplotlib import pyplot as plt  # Plotting and visualization
from wordcloud import WordCloud     # Word cloud
from PIL import Image       # Image processing
import numpy as np      # Array operations
import xlwt  # Excel output

3. Obtain data

Use the library functions to build each chapter URL, visit the pages in a loop, parse the crawled HTML, and append the extracted text to a txt file on each pass.

See the comments in the source code for details.
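To make the parsing step concrete, here is a sketch that pulls the text out of a div with a given class. The article's own code uses BeautifulSoup; this example deliberately swaps in the standard library's html.parser so that it runs without third-party packages. The class names come from the source code in section 5, while DivTextExtractor, extract_text and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class DivTextExtractor(HTMLParser):
    """Collect the text inside every <div> whose class list contains target_class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0      # greater than 0 while inside a matching div
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        # Count nested divs too, so the matching close tag is found correctly
        if self.depth > 0 or self.target_class in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            # Drop non-breaking spaces, as the article's code does
            self.chunks.append(data.replace("\xa0", ""))


def extract_text(html, target_class):
    parser = DivTextExtractor(target_class)
    parser.feed(html)
    return "".join(parser.chunks).strip()


# Made-up page fragment using the class names from the source code
sample_html = (
    '<div class="chapter-lan-en toleft">'
    "<p>\xa0\xa0There was no possibility of taking a walk that day.</p></div>"
    '<div class="chapter-lan-zh toright"><p>那天不可能去散步了。</p></div>'
)
print(extract_text(sample_html, "chapter-lan-en"))
```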

4. Analyze data

See the comments in the source code for details.              

5. Word cloud

See the comments in the source code for details.
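Before the word cloud is drawn, the segmented Chinese text is filtered against a stop-word list. A minimal sketch of that filter, assuming the segmented text is a string of tokens separated by spaces; the function name remove_stop_words and the sample data are hypothetical, and the stop words are given inline instead of being read from a file.

```python
def remove_stop_words(segmented_text, stop_words):
    """Drop stop words from a space-separated string of tokens."""
    stop_set = set(stop_words)    # a set makes each membership test fast
    kept = [tok for tok in segmented_text.split() if tok not in stop_set]
    return " ".join(kept)


# Sample tokens joined by spaces (made-up data, not real jieba output)
sample = "我 的 心 是 自由 的"
print(remove_stop_words(sample, ["的", "是"]))  # 我 心 自由
```

Filtering token by token matters here: looping over the string itself would compare single characters, so multi-character stop words would never be removed.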

3, Result display

 1. English text

2. Chinese text

3. Analysis of word frequency and word length

4. Word cloud

4, Reflections and project structure

Each piece of functionality in the project is implemented and encapsulated in its own function. For example, the function below counts the words that have at least a given number of letters (the threshold is a parameter, so any word length can be used):

# Count the words that have `length` or more letters
def get_count_len(items, length: int):
    count_len_more = 0
    count_text = 0  # Total number of words in the text
    for word, count in items:
        count_text += count
        if len(word) >= length:
            count_len_more += count
    return count_len_more, count_text
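A small worked example shows what the two return values mean. The function is repeated so the snippet runs on its own; the (word, frequency) pairs are made up for illustration and are not taken from the novel.

```python
def get_count_len(items, length: int):
    count_len_more = 0   # occurrences of words with `length` or more letters
    count_text = 0       # occurrences of all words
    for word, count in items:
        count_text += count
        if len(word) >= length:
            count_len_more += count
    return count_len_more, count_text


# Hypothetical (word, frequency) pairs
sample_items = [("the", 3), ("window", 2), ("rain", 1), ("breakfast", 1)]
long_count, total = get_count_len(sample_items, 6)
print(long_count, total)              # 3 7
print(f"{long_count / total:.1%}")    # 42.9%
```

Only "window" (2 occurrences) and "breakfast" (1 occurrence) have six or more letters, so 3 of the 7 word occurrences are counted.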

Project structure:


Code worth noting:

1. In the sort call, key=lambda x: x[0] would sort the tuples in the list by key (the word), while key=lambda x: x[1] sorts them by dictionary value (the count). reverse=True means descending order.

def sort_dir(dictionary):
    items = list(dictionary.items())
    items.sort(key=lambda x: x[1], reverse=True)
    return items
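The difference between the two sort keys is easy to see on a tiny dictionary. The sample data below is hypothetical.

```python
word_counts = {"rain": 1, "the": 3, "window": 2}

# Sort by value (the count), highest first
by_value_desc = sorted(word_counts.items(), key=lambda x: x[1], reverse=True)
print(by_value_desc)  # [('the', 3), ('window', 2), ('rain', 1)]

# Sort by key (the word), alphabetical order
by_key = sorted(word_counts.items(), key=lambda x: x[0])
print(by_key)         # [('rain', 1), ('the', 3), ('window', 2)]
```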

2. dictionary.get(word, 0) returns the current count when word is already in the dictionary and the default value 0 when it is not. Adding 1 then accumulates the number of occurrences.

def make_dictionary(text_en):
    words = text_en.split()
    dictionary = {}
    for word in words:
        dictionary[word] = dictionary.get(word, 0) + 1
    return dictionary
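For comparison, the standard library's collections.Counter produces the same counts as the get-based loop. This is an alternative sketch, not what the project uses; the sample sentence is made up.

```python
from collections import Counter

text = "the rain beat the window and the wind"

# Manual counting with dict.get, as in make_dictionary
manual = {}
for word in text.split():
    manual[word] = manual.get(word, 0) + 1

# The same counts from Counter
counts = Counter(text.split())

print(counts == manual)        # True
print(counts.most_common(1))   # [('the', 3)]
```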

5, Source code

# -*- coding: utf-8 -*-
# @Time : 2022/1/3 11:35
# @Author : butupi
# @File :
# @Software : PyCharm

from bs4 import BeautifulSoup  # HTML parsing
import urllib.request, urllib.error  # Fetch pages by URL


def main():
    # Base part of the URL to crawl
    baseurl = ""  # page numbers 1975-2002
    # Crawl and save
    getData(baseurl)


# Crawl the web pages
def getData(baseurl):
    # 28 chapter pages are crawled
    for i in range(0, 28):
        # Full URL: page numbers start at 1975 and increase by 1 per chapter
        url = baseurl + str(1975 + i) + ".html"
        # Fetch the page source
        html = askURL(url)
        # Parse and save
        soup = BeautifulSoup(html, "html.parser")
        with open("wbfx.txt", "a+", encoding="utf-8") as f:
            f.write("\n Chapter " + str(i + 1) + " \n")
            for item in soup.find_all("div", class_="chapter-lan-en toleft"):  # English text blocks
                if item.p is None:
                    continue
                for p in item.p.children:
                    if p.string:  # skip nodes that have no single text string
                        f.write(p.string.replace(u'\xa0', ''))
        with open("Jane Eyre.txt", "a+", encoding="utf-8") as f:
            f.write("\n Chapter " + str(i + 1) + " \n")
            for item in soup.find_all("div", class_="chapter-lan-zh toright"):  # Chinese text blocks
                if item.p is None:
                    continue
                for p in item.p.children:
                    if p.string:
                        f.write(p.string.replace(u'\xa0', ''))


# Fetch the page at the given URL
def askURL(url):
    # Build the request headers
    head = {
        "User-Agent": "Mozilla / 5.0 AppleWebKit "  # ...
    }
    request = urllib.request.Request(url, headers=head)
    # Source of the fetched page
    html = ""
    try:
        response = urllib.request.urlopen(request)
        html = response.read().decode("utf-8")
        # print(html)
    # Exception handling
    except urllib.error.URLError as e:
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
    return html


if __name__ == "__main__":
    main()
    print("Crawling completed")



# -*- coding: utf-8 -*-
# @Time : 2022/01/03 13:39
# @Author : butupi
# @File :
# @Software : PyCharm

import jieba    # Chinese word segmentation
from matplotlib import pyplot as plt  # Plotting and visualization
from wordcloud import WordCloud     # Word cloud
from PIL import Image       # Image processing
import numpy as np      # Array operations


#---------------------------------
# Read the English and Chinese texts
def get_text(path_en, path_zh):
    text_en = ""
    text_zh = ""
    with open(path_en, "r", encoding="utf-8") as f:
        for line in f.readlines():
            text_en += line.rstrip("\n")
    # print(text_en)
    with open(path_zh, "r", encoding="utf-8") as f:
        for line in f.readlines():
            text_zh += line.rstrip("\n")
    return text_en, text_zh

#---------------------------------

# Segment the text with jieba; tokens are joined by spaces
def cut_text(text):
    cut = jieba.cut(text)
    cut_words = ' '.join(cut)
    # print(type(cut_words))
    # print(len(cut_words))
    return cut_words

#---------------------------------
# Read the list of stop words, one per line
def get_stop_words(filepath):
    with open(filepath, 'r', encoding='utf-8') as f:
        stop_words = [line.strip() for line in f]
    return stop_words

# Remove stop words from the space-separated tokens
def move_stop_words(text, path):
    stop_words = set(get_stop_words(path))
    # split() iterates over tokens (not characters) and drops tabs and newlines
    kept = [word for word in text.split() if word not in stop_words]
    return ' '.join(kept)
#-----------------------------------

# Generate the word cloud
def get_word_cloud(source_path, dest_path, out_str):
    # Load the background picture used as the mask
    img = Image.open(source_path)
    # Convert the picture to an array
    img_array = np.array(img)
    # Word cloud settings
    wc = WordCloud(
        background_color='white',
        mask=img_array,
        font_path="msyh.ttc"       # font that contains Chinese glyphs
    ).generate_from_text(out_str)

    # Draw the picture
    fig = plt.figure(1)
    plt.imshow(wc)      # Display as word cloud
    plt.axis('off')     # Do not display axes
    # Export to file
    plt.savefig(dest_path, dpi=500)
    # plt.show()  # Display the word cloud; call it after savefig so the saved file is not blank
    plt.close(fig)      # Start the next word cloud from an empty figure

#-------------------------------
def main():
    # get_text
    text_en, text_zh = get_text("wbfx.txt", "Jane Eyre.txt")

    # jieba
    cut_words = cut_text(text_zh)

    # move_stop_words
    out_str = move_stop_words(cut_words, 'stop_word.txt')

    # wordcloud
    get_word_cloud('zh-tree.jpg', 'zh-tree-out.jpg', out_str)
    get_word_cloud('en-bj.png', 'en-bj-out.png', text_en)


#---------------------------------
if __name__ == '__main__':
    main()


# -*- coding: utf-8 -*-
# @Time : 2022/01/03 20:45
# @Author : butupi
# @File :
# @Software : PyCharm

import xlwt  # Excel output

# Read the English text (lowercased, special characters replaced)
def get_text(path_en):
    text_en = ""
    with open(path_en, "r", encoding="utf-8") as f:
        for line in f.readlines():
            text_en += line.rstrip("\n")
    # Convert to lowercase
    text_en = text_en.lower()
    # Replace special characters with spaces
    for ch in '!"#$&()*+,-./:;<=>?@[\\]^_{|}·~\'':
        text_en = text_en.replace(ch, " ")
    return text_en


# Turn the text into a {word: count} dictionary
def make_dictionary(text_en):
    words = text_en.split()
    dictionary = {}
    for word in words:
        # get(word, 0) returns the current count, or 0 for a new word; + 1 accumulates
        dictionary[word] = dictionary.get(word, 0) + 1
    return dictionary

# Sort the (word, count) pairs by count
def sort_dir(dictionary):
    items = list(dictionary.items())
    # x[0] would sort by key (the word); x[1] sorts by value (the count); reverse=True means descending
    items.sort(key=lambda x: x[1], reverse=True)
    return items

# Count the words that have `length` or more letters
def get_count_len(items, length: int):
    count_len_more = 0
    count_text = 0  # Total number of words in the text
    for word, count in items:
        count_text += count
        if len(word) >= length:
            count_len_more += count
    return count_len_more, count_text

# Save to an Excel sheet
def save_sheet(path, items):
    workbook = xlwt.Workbook(encoding="utf-8")
    worksheet = workbook.add_sheet("word_frequency")
    worksheet.write(0, 0, "word")
    worksheet.write(0, 1, "Number of occurrences")
    for i in range(0, len(items)):
        worksheet.write(i+1, 0, items[i][0])
        worksheet.write(i+1, 1, items[i][1])
    workbook.save(path)  # Without this call nothing is written to disk


# Count words and word frequency
def main():
    text_en = get_text("wbfx.txt")

    # Count the word frequency of the processed text and store it in a dictionary
    dictionary = make_dictionary(text_en)

    # Sort in descending order to get a list of (word, frequency) tuples
    items = sort_dir(dictionary)

    # The list is sorted by frequency from high to low; print the first 15 entries
    print("Top 15 words by frequency in the English text of Jane Eyre")
    for i, (word, count) in enumerate(items[:15]):
        print("{0}\t\t{1:<8}{2:>5}".format(i+1, word, count))

    # Save to an Excel sheet
    save_sheet("word_frequency.xls", items)

    # Number of words with 6 or more letters
    length_word = 6
    count_six_more, count_text = get_count_len(items, length_word)
    print()
    print("Number of words with " + str(length_word) + " or more letters in the English text of Jane Eyre:\t" + str(count_six_more))
    print("Total number of words in the English text of Jane Eyre:\t" + str(count_text))


if __name__ == '__main__':
    main()


Topics: Python crawler