Which weapon is the hottest in the world of Monster Hunter? Data mining exercise 2: crawling MHW Tieba topics with Python and counting weapon mentions

Posted by retoto on Wed, 31 Jul 2019 20:24:25 +0200

Brief introduction

As we all know, Monster Hunter World is a simulated massage game.

In a new world full of powerful creatures, where natural resources are scarce and monsters constantly clash with one another, players take the role of masseurs. Armed with excellent massage skills, good equipment, and a capable companion, they apply techniques such as the foot rub, the back lift, and the Gatling special massage to bring a monster to a state of complete relaxation. Once the massage succeeds, the monster lies down to rest and, out of heartfelt gratitude, leaves the player gifts such as dragon gems.

The ultimate goal is the harmonious coexistence of people, monsters, and animals. A very wholesome, positive-energy game.

During the massage, choosing the right tool is also important.

Some monsters get so comfortable that they perform stunts such as the tail flick, the "dragon charge", and the "Palm of the Tathagata" to express their gratitude to the masseur.

(Image: the "dragon charge")
(Image: the "Palm of the Tathagata")
Er... it should be this one:
(Image: the true "Palm of the Tathagata")

So the question is: how do the technicians aiming to become master monster masseurs choose their weapons? In this article, I will find out which weapon is most popular among them.

0x00 Crawl preparation

As in my previous blog post, the crawling method is fairly simple: there is no need to install Selenium, only the BeautifulSoup and requests packages.

For how to install BeautifulSoup, refer to this blog post: BeautifulSoup installation method
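
For reference, the snippets that follow assume the following imports at the top of the script (the original post does not list them explicitly, so this setup is my assumption):

# Assumed setup; install the dependencies first:
#   pip install requests beautifulsoup4 lxml
import re

import requests
from bs4 import BeautifulSoup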

0x01 Crawling idea

First of all, to be clear, the target we are crawling is that famous gathering place of casual "water posts": the Baidu Tieba (post bar) for Monster Hunter World.

Our goal is to grab the thread titles and count how often each weapon is mentioned, using the frequency as a rough measure of popularity.

The crawling of thread titles, links, posters, posting times, reply counts, and so on follows this Zhihu article by Echo:
https://zhuanlan.zhihu.com/p/26722495
See the original article for the detailed crawling approach; here I will only go through it briefly.

0x02 Crawler code implementation

1 First, build the URLs for the pages to crawl (each Tieba listing page holds 50 threads, selected via the pn parameter)

base_url = 'https://tieba.baidu.com/f?kw=%E6%80%AA%E7%89%A9%E7%8C%8E%E4%BA%BA%E4%B8%96%E7%95%8C&ie=utf-8'
deep = 3  # number of listing pages to crawl
url_list = []
for i in range(0, deep):
    url_list.append(base_url + '&pn=' + str(i * 50))  # pn advances 50 threads per page

2 Then use this function to fetch the HTML of a URL for BS4 to parse

def get_html(url):
    # Fetch a page and return its text, or "error" if the request fails
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = "utf-8"
        return r.text
    except requests.RequestException:
        return "error"

3 Use BS4 to parse the page and extract each thread's title, link, poster, time, and number of replies

def get_content(url):
    comments = []
    html = get_html(url)
    soup = BeautifulSoup(html, 'lxml')
    # Find every thread item (the leading space in the class name is intentional)
    liTags = soup.find_all('li', attrs={'class': ' j_thread_list clearfix'})
    for li in liTags:
        # Collect the fields of one thread into a dictionary
        comment = {}
        try:
            comment['title'] = li.find(
                'a', attrs={'class': 'j_th_tit'}).text.strip()
            comment['link'] = "http://tieba.baidu.com" + \
                li.find('a', attrs={'class': 'j_th_tit'})['href']
            comment['name'] = li.find(
                'span', attrs={'class': 'tb_icon_author'}).text.strip()
            comment['time'] = li.find(
                'span', attrs={'class': 'pull-right is_show_create_time'}).text.strip()
            comment['replyNum'] = li.find(
                'span', attrs={'class': 'threadlist_rep_num center_text'}).text.strip()
            comments.append(comment)
        except:
            print('Some problem happened!')
    return comments

4 Write the crawl results to a file

def out2file(comments):
    # Append one line per thread to TB.txt
    with open('TB.txt', 'a+', encoding='utf-8') as f:
        for comment in comments:
            f.write('Title: {} \t Link: {} \t Poster: {} \t Post time: {} \t Replies: {} \n'.format(
                comment['title'], comment['link'], comment['name'], comment['time'], comment['replyNum']))

So far we have crawled the title, link, poster, time, and reply count for the specified number of pages. The next goal is to extract the weapon keywords from the thread titles and tally them.

5 Initialize the global variables; weapons holds the official and colloquial names of every weapon type (possibly incomplete)

def data_init():
    global weapons
    global weapon_count
    # Official and colloquial names of each weapon type, joined with '|' so every
    # entry doubles as a regex alternation. In practice the list should hold the
    # Chinese names and nicknames used on the Tieba (English names are shown here
    # for readability), and it may well be incomplete.
    weapons = ['Sword and Shield|one-handed sword', 'Dual Blades|dual swords', 'Great Sword|greatsword',
               'Long Sword|Tai Dao', 'Hammer', 'Hunting Horn|flute', 'Lance', 'Gunlance',
               'Switch Axe|sword axe', 'Charge Blade|shield axe', 'Insect Glaive|bug stick',
               'Bow', 'Light Bowgun', 'Heavy Bowgun']
    # Start every weapon's counter at zero
    weapon_count = {}
    for weapon in weapons:
        weapon_count.update({weapon: 0})
    print(weapon_count)

6 Use re to scan the thread titles, count the keyword hits, and print the running totals

def keyword_count(comments, page):
    global weapons
    global weapon_count
    for weapon in weapons:
        count = 0
        for comment in comments:
            # Every '|'-separated alias in the weapon entry is matched by the regex
            count = count + len(re.findall(weapon, comment['title']))
        weapon_count.update({weapon: (weapon_count[weapon] + count)})
    print("Summary after page %d:" % (page + 1))
    print(weapon_count)
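
The original post does not show the main block that wires these functions together, so here is a minimal sketch of how the pieces fit (the driver below is my assumption, not the author's code):

# Minimal driver tying the snippets together (sketch, not from the original post)
if __name__ == '__main__':
    base_url = 'https://tieba.baidu.com/f?kw=%E6%80%AA%E7%89%A9%E7%8C%8E%E4%BA%BA%E4%B8%96%E7%95%8C&ie=utf-8'
    deep = 3  # number of listing pages to crawl
    url_list = [base_url + '&pn=' + str(i * 50) for i in range(deep)]

    data_init()  # build the weapon keyword list and zero the counters
    for page, url in enumerate(url_list):
        comments = get_content(url)    # parse one listing page
        out2file(comments)             # append the raw rows to TB.txt
        keyword_count(comments, page)  # update and print the running totals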

At this point our goal is achieved. I tested it, and the results are as follows.

It turns out that the Long Sword and the Bow are the technicians' favorite tools, while the Hammer, Heavy Bowgun, Gunlance, and Sword and Shield are mentioned very rarely; the Switch Axe even comes out at 0. This may be because the keyword set is incomplete, or because the players of these weapons are hardcore types who rarely post.
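
As a quick illustration of how the counting behaves: re.findall with a '|'-separated entry counts every occurrence of any of its aliases, so a missing nickname directly lowers a weapon's total (the title string below is invented for the example, not real crawled data):

# Illustration only; the title is made up
import re
title = 'Hammer slope build vs. hammer bow comparison'
print(len(re.findall('hammer|Hammer', title)))  # 2 -> both casings are counted
print(len(re.findall('Gunlance', title)))       # 0 -> alias absent from the title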

PS: Are there really so few Hammer players? Has "slope science" already fallen out of fashion?

This small practice project ends here. Thanks to Echo, the author of the referenced article, for the inspiration. You are welcome to test it and point out any mistakes in this article.

Finally, I wish all the technicians a pleasant journey.

The next practice project probably won't be a crawler; it's time to move on to the next stage (data analysis, language processing, or feature/object detection).

Source github link: https://github.com/gangyu0716/spider_project

Author blog address: https://blog.csdn.net/nurke
