Achieve a wave of "paper freedom" with Python~

Posted by Archbob on Fri, 25 Feb 2022 04:50:41 +0100

Relevant documents

Friends who want to learn Python can follow the official account [Python journal].
There are plenty of free resources there, ha. I post bits of Python knowledge from time to time!!
Python learning exchange group: 773162165

Development environment

Python version: 3.7.8
Related modules:
requests module;
BeautifulSoup4 module;
click module;
prettytable module;
paperdl module;
And some Python built-in modules.

Environment construction

Install Python and add it to your PATH environment variable; then use pip to install the required modules.
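For example, all the third-party modules listed above can be installed with a single pip command:

pip install requests beautifulsoup4 click prettytable paperdl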

Principle introduction

First, to be clear about the goal: we want to search for papers on Google Scholar, and then download the papers we find through Sci-Hub. So the first step is to implement the Google Scholar search. A simple packet capture shows that you can get the search results by requesting the following link:

https://scholar.google.com/scholar?

The parameters it carries are also very simple: q is the keyword of the paper we are searching for, and the values of the other parameters can be treated as fixed.
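Before the full implementation, here is a minimal sketch of that request in isolation (a standalone illustration, not the project's code; the parameter values mirror the implementation below, and 'deep learning' is just a placeholder query):

import requests

# minimal Google Scholar search request; 'q' carries the keyword,
# the other parameters can be treated as fixed
params = {'hl': 'en', 'as_sdt': '0,5', 'q': 'deep learning', 'btnG': ''}
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36'}
response = requests.get('https://scholar.google.com/scholar', params=params, headers=headers)
print(response.status_code)

We can then extract the paper information we need with the BeautifulSoup4 package. Specifically, the core code is implemented as follows: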

# requires: import random; from bs4 import BeautifulSoup
'''search paper'''
def search(self, keyword):
    # search: choose the endpoint and query parameters by configured area
    if self.config.get('area') == 'CN':
        search_url = random.choice([
            'https://xs2.dailyheadlines.cc/scholar',
            'https://scholar.lanfanshu.cn/scholar',
        ])
        params = {
            'hl': 'zh-CN',
            'as_sdt': '0,33',
            'q': keyword,
            'btnG': ''
        }
    else:
        search_url = 'https://scholar.google.com/scholar'
        params = {
            'hl': 'en',
            'as_sdt': '0,5',
            'q': keyword,
            'btnG': ''
        }
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36'
    }
    response = self.session.get(search_url, params=params, headers=headers)
    # parse: each search result lives in a div with class 'gs_r'
    soup = BeautifulSoup(response.text, features='lxml')
    papers = soup.find_all('div', class_='gs_r')
    paperinfos = []
    for paper in papers:
        try:
            pdf = paper.find('div', class_='gs_ggs gs_fl')
            link = paper.find('h3', class_='gs_rt')
            if pdf: input_content = pdf.find('a')['href']
            elif link.find('a'): input_content = link.find('a')['href']
            else: continue
            title = link.text
            authors = paper.find('div', class_='gs_a').text.split('\xa0')[0]
            paperinfo = {
                'input': input_content,
                'source': self.source,
                'savedir': self.config['savedir'],
                'ext': 'pdf',
                'savename': title,
                'title': title,
                'authors': authors,
                'download_url': None,
            }
        except Exception:
            # skip results that do not have the expected structure
            continue
        paperinfos.append(paperinfo)
        if len(paperinfos) == self.config['search_size_per_source']: break
    # return
    return paperinfos

The final effect is shown in the search-results screenshot (omitted here).

Perhaps some readers will ask: how do you use Google Scholar from mainland China? Everyone knows the usual workaround, but let me share a method that can be discussed openly: use a domestic Google Scholar mirror. If you look carefully at the code above, you will have noticed that you only need to set the area in the config file to CN, and the code will then use a domestic Google Scholar mirror by default, so it will not stop working because of the wall. Moreover, I have added support for setting proxies in the code.
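Concretely, that just means one extra entry in the config dict (the same config dict used in the usage examples below):

config = {'area': 'CN', 'logfilepath': 'paperdl.log', 'savedir': 'papers', 'search_size_per_source': 5, 'proxies': {}}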

Next, we use Sci-Hub to download the papers we found. The packet-capture process is also very simple, so I won't repeat it here. The final code is as follows:

# requires: from lxml import etree; from urllib.parse import urlparse
'''parse paper infos before download paper'''
def parseinfosbeforedownload(self, paperinfos):
    sci_sources = [
        'https://sci-hub.st/', 
        'https://sci-hub.ru/',
        'https://sci-hub.se/',
    ]
    # fetch pdf url
    for paperinfo in paperinfos:
        input_content = paperinfo['input']
        input_type = self.guessinputtype(input_content)
        if input_type == 'pdf': 
            paperinfo['download_url'] = input_content
        else:
            data = {'request': input_content}
            for sci_source in sci_sources:
                try:
                    response = self.session.post(sci_source, data=data, verify=False)
                    html = etree.HTML(response.content)
                    article = html.xpath('//div[@id="article"]/embed[1]') or html.xpath('//div[@id="article"]/iframe[1]') if html is not None else None
                    pdf_url = urlparse(article[0].attrib['src'], scheme='http').geturl()
                    paperinfo['download_url'] = pdf_url
                    break
                except Exception:
                    # this mirror failed, try the next one
                    continue
        if 'download_url' not in paperinfo: paperinfo['download_url'] = None
        paperinfo['source'] = self.source
    # return
    return paperinfos

In the code, I set up three Sci-Hub mirror links that are still available, so that one link suddenly going down cannot break the whole program.

OK, that's it. See the relevant documents for the complete source code~

Effect display

All the code involved in this article is already open source here:

https://github.com/CharlesPikachu/paperdl

And for your convenience, it has been packaged and uploaded to PyPI, so you only need to install it with pip:

pip install paperdl

Then simply write a few lines of code:

from paperdl import paperdl

config = {'logfilepath': 'paperdl.log', 'savedir': 'papers', 'search_size_per_source': 5, 'proxies': {}}
target_srcs = ['arxiv', 'googlescholar']
client = paperdl.Paperdl(config=config)
client.run(target_srcs)

For users in mainland China who cannot get over the wall, you can set the config to:

config = {'area': 'CN', 'logfilepath': 'paperdl.log', 'savedir': 'papers', 'search_size_per_source': 5, 'proxies': {}}

In addition, you can also use just the Sci-Hub download function on its own, for example:

from paperdl import paperdl

config = {'logfilepath': 'paperdl.log', 'savedir': 'papers', 'search_size_per_source': 5, 'proxies': {}}
client = paperdl.SciHub(config=config, logger_handle=paperdl.Logger('paper.log'))

paperinfo = {
    'savename': 'mypaper',
    'ext': 'pdf',
    'savedir': 'outputs',
    'input': 'https://ieeexplore.ieee.org/document/9193963/',
    'source': 'scihub',
}
client.download([paperinfo])

The effect is shown in the download screenshot (omitted here).

Topics: Python Back-end