Open source algorithm management and recommendation for specific problems | 2021SC@SDUSC

Posted by sfnitro230 on Thu, 14 Oct 2021 01:12:05 +0200

2021SC@SDUSC

 

Catalogue of series articles

(1) Division of labor within the group

(2) Task 1: code analysis of the crawler part (Part 1)

(3) Task 1: code analysis of the crawler part (Part 2)

Catalogue

Catalogue of series articles

Preface

1. Core code analysis

2. Data set status

Summary


Preface

Following on from the previous post, we continue to analyze the crawler code.

1. Core code analysis

After entering the search results page, we need to grab the titles of the articles and also turn pages so that several result pages can be crawled. The following function is used for this.

  (2) def page_url_list(soup, page=0):

The main purpose of this function is to collect all the URLs that need to be crawled and save them in the list urls_list, which is returned as the result.

The URL template is taken from the pager link (the anchor with class "n") on the first results page, and an empty list is prepared for the results:

fir_page = "http://xueshu.baidu.com" + soup.find_all("a", class_="n")[0]["href"]
urls_list = []

How do we turn pages? Let's open several result pages and compare their URLs:
Page 1:
https://xueshu.baidu.com/s?wd=%E4%B8%93%E5%88%A9&pn=0&tn=SE_baiduxueshu_c1gjeupa&ie=utf-8&f=3&sc_f_para=sc_tasktype%3D%7BfirstSimpleSearch%7D&sc_hit=1
Page 2:
https://xueshu.baidu.com/s?wd=%E4%B8%93%E5%88%A9&pn=10&tn=SE_baiduxueshu_c1gjeupa&ie=utf-8&f=3&sc_f_para=sc_tasktype%3D%7BfirstSimpleSearch%7D&sc_hit=1
Page 3:
https://xueshu.baidu.com/s?wd=%E4%B8%93%E5%88%A9&pn=20&tn=SE_baiduxueshu_c1gjeupa&ie=utf-8&f=3&sc_f_para=sc_tasktype%3D%7BfirstSimpleSearch%7D&sc_hit=1
Comparing the three URLs, only one parameter changes: the value of "pn", which starts at 0 and increases by 10 for each page. We can therefore turn pages simply by rewriting this parameter.

next_page = fir_page.replace("pn=10", "pn={:d}".format(i * 10))
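Putting the two snippets above together, the page loop presumably looks something like the sketch below (the loop bound page comes from the function signature; treating i as the page index is an assumption):

# Sketch (assumption): generate the URL of each results page by rewriting pn
for i in range(page):
    next_page = fir_page.replace("pn=10", "pn={:d}".format(i * 10))
    # each next_page is then fetched and its titles extracted (see the next snippet)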

Then the find_all method is used to pick out the title element of each result:

c_fonts = soup_new.find_all("h3", class_="t c_font")

Then, looping over c_fonts, c_font.find("a").attrs["href"] yields all the required URLs, which are collected into urls_list.
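Based on the snippets above, the body of that loop presumably resembles the following sketch; using requests here is an assumption (it matches what get_item_info does below), and the original function may fetch the pages differently:

# Sketch (assumption): fetch one results page and collect the detail-page links
content = requests.get(next_page)
soup_new = BeautifulSoup(content.text, "lxml")
c_fonts = soup_new.find_all("h3", class_="t c_font")
for c_font in c_fonts:
    urls_list.append(c_font.find("a").attrs["href"])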

So far, the analysis of the second function is completed
 

The next step is to grab the fields of interest. On each detail page, three things need to be captured: the title, the abstract, and the keywords. As before, we inspect the page source of each of these elements one by one and extract them with the CSS select method.

(3) def get_item_info(url):

First, requests is used to download the HTML of the page, and BeautifulSoup is used to parse it so that the required content can be located.

content_details = requests.get(url)
soup = BeautifulSoup(content_details.text, "lxml")

Then, you can get the required content by using the soup.select() method.

The prototype of soup.select() in the source code is: select(self, selector, namespaces=None, limit=None, **kwargs)
Purpose: find the elements we need in the HTML.
The main parameter we use is selector, which is documented as "a string containing a CSS selector". Using it also requires some CSS background; see the references on CSS syntax and on CSS id and class selectors.

When writing CSS, a tag name is used as-is, a class name is prefixed with a dot, and an id is prefixed with #; the same notation can be used here to filter elements.
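As a quick illustration (this toy snippet is not from the crawler itself), the three selector forms look like this:

# Toy example of the three CSS selector forms used with soup.select()
from bs4 import BeautifulSoup

html = '<div id="dtl_l"><h3 class="t c_font"><a href="#">Example title</a></h3></div>'
demo = BeautifulSoup(html, "lxml")
print(demo.select("h3"))               # by tag name
print(demo.select(".c_font"))          # by class name, prefixed with a dot
print(demo.select("#dtl_l"))           # by id, prefixed with #
print(demo.select("#dtl_l > h3 > a"))  # combined, as in the selectors below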

    # Extract article title
    try:
        title = ''.join(list(soup.select('#dtl_l > div > h3 > a')[0].stripped_strings))
    except IndexError:
        title = ''

    # Extract summary
    try:
        abstract = list(soup.select('div.abstract_wr p.abstract')[0].stripped_strings)[0].replace("\u3000", ' ')
    except IndexError:
        abstract = ''

    # Extract keywords
    try:
        key_words = ';'.join(key_word for key_word in list(soup.select('div.dtl_search_word > div')[0].stripped_strings)[1:-1:2])
    except IndexError:
        key_words = ''

    # data = {
    #     "title": title,
    #     "abstract": abstract,
    #     "key_words": key_words
    # }
    return title, abstract, key_words
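As a quick sanity check (not part of the original script, and urls_list here is just the list built earlier), the function can be tried on a single URL before crawling everything:

# Hypothetical usage: inspect one record first
title, abstract, key_words = get_item_info(urls_list[0])
print(title, abstract, key_words, sep="\n")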

So far, the third function analysis is completed

The remaining three functions are relatively simple, so they are only described briefly here.

(4) def get_all_data(urls_list):

This function filters the crawled data: only records in which all three fields are non-empty are kept, and they are stored in a dictionary of lists.

    # Requires: from collections import defaultdict
    dit = defaultdict(list)
    for url in urls_list:
        title, abstract, key_words = get_item_info(url)
        if len(str(title)) > 0 and len(str(abstract)) > 0 and len(str(key_words)) > 0:
            dit["title"].append(title)
            dit["abstract"].append(abstract)
            dit["key_words"].append(key_words)
    return dit

So far, the fourth function analysis is completed

(5) def save_csv(dit,num):

This function saves the filtered data from the dictionary to a local CSV file. The if statement ensures that the header row is written only once.

    data = pd.DataFrame(dit)
    columns = ["title", "abstract", "key_words"]
    if num == 1:
        # First keyword: write the header row and fix the column order
        data.to_csv("data.csv", mode='a', index=False, columns=columns)
    else:
        # Subsequent keywords: append rows without repeating the header
        data.to_csv("data.csv", mode='a', index=False, header=False)
    print("That's OK!")

So far, the fifth function analysis is completed

(6) main function

The main function loops over each keyword in the given keyword list and calls the functions above. Each keyword yields roughly 300 records; two keywords are shown here as an example.

    key_words = ["patent", "Beef quality"]
    num = 1
    for key_word in key_words:
        soup = driver_open(key_word)               # open the search results page (analyzed in the previous post)
        urls_list = page_url_list(soup, page=100)  # collect detail-page URLs from the result pages
        dit = get_all_data(urls_list)              # crawl each URL and keep complete records
        save_csv(dit, num)                         # append the records to data.csv
        num += 1
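For reference, the snippets in this post rely on the imports below (a sketch; the original file may import more, and driver_open is the function analyzed in the previous post, so its own dependencies are not repeated here):

# Imports implied by the snippets above (sketch)
import requests
import pandas as pd
from bs4 import BeautifulSoup
from collections import defaultdict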

So far, the analysis of the crawler code is completed

2. Data set status

The data set contains more than 5000 records in total, which meets the requirements.


Summary

The above completes the analysis of the crawler code for task 1. The next post will carry out a statistical analysis of the data set.

Topics: Python crawler