Tool | Using the arXiv API + GitHub Actions to automatically fetch arXiv paper abstracts every day

Posted by neclord81 on Tue, 26 Oct 2021 01:29:36 +0200

Students who follow the latest developments in academia are probably familiar with arXiv. It is the largest open-access academic sharing platform in the world, currently hosting nearly 2 million articles across 8 subject areas [1]. Scholars often post their upcoming articles on arXiv ahead of formal peer review, which greatly promotes openness and collaboration in the academic community.

The sheer volume of new submissions can be dazzling, making it hard to immediately find articles in one's areas of interest. The author recently used the arXiv API [2] + GitHub Actions [3] to automatically fetch articles on relevant topics from arXiv every day and publish them on GitHub and GitHub Pages. Click here for a preview.

First, a preview: the repository's README.md lists the latest articles on SLAM as a table, which is clear at a glance.

Introduction to the arXiv API

Basic syntax

The arXiv API [2] allows users to programmatically access the millions of e-prints hosted on arXiv.org. The arXiv API user manual [2] describes the basic syntax for retrieving papers. Following that syntax, a paper's metadata can be retrieved, including its title, authors, abstract, comments, and other information. The format of an API call is as follows:

http://export.arxiv.org/api/{method_name}?{parameters}

Taking method_name=query as an example, suppose we want to retrieve articles whose author is Adrian DelMaestro and whose title contains the word checkerboard. The query can be written as follows:

http://export.arxiv.org/api/query?search_query=au:del_maestro+AND+ti:checkerboard

Here the prefix au stands for author, ti stands for title, and + is the URL encoding of a space (spaces cannot appear in a URL). The full list of field prefixes is:

| Prefix | Explanation |
|---|---|
| ti | Title |
| au | Author |
| abs | Abstract |
| co | Comment |
| jr | Journal Reference |
| cat | Subject Category |
| rn | Report Number |
| id | Id (use id_list instead) |
| all | All of the above |

In addition, AND here denotes the logical AND operation; the API's query method supports three Boolean operators: AND, OR, and ANDNOT.
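
For example, ANDNOT excludes terms; the following query (an example adapted from the user manual [2]) finds articles by Adrian DelMaestro whose titles do not contain checkerboard:

http://export.arxiv.org/api/query?search_query=au:del_maestro+ANDNOT+ti:checkerboard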

The search results are returned as an Atom feed, so any language that can make HTTP requests and parse Atom feeds can call the API. Take Python as an example:

import urllib.request as libreq

# query the arXiv API and read back the raw Atom response
with libreq.urlopen('http://export.arxiv.org/api/query?search_query=au:del_maestro+AND+ti:checkerboard') as url:
    r = url.read()
print(r)

The printed result contains the paper's metadata, so the next task is to parse that data and write out the fields we care about in a suitable format.
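
As a minimal sketch of that parsing step, one option is the third-party feedparser package (an assumption of this example, installed with pip install feedparser), which reads Atom feeds directly:

import urllib.request as libreq
import feedparser  # third-party Atom/RSS parser, assumed installed

# fetch the raw Atom feed from the arXiv API
url = 'http://export.arxiv.org/api/query?search_query=au:del_maestro+AND+ti:checkerboard'
with libreq.urlopen(url) as response:
    data = response.read()

# feedparser turns the Atom feed into a list of entries
feed = feedparser.parse(data)
for entry in feed.entries:
    print(entry.title, '->', entry.link)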

Trying out arxiv.py

Someone has already done the work of parsing these results, so there is no need to reinvent the wheel; the query interface is more elegant, too. Here we recommend arxiv.py [5].

Install arxiv.py first:

pip install arxiv

Then import arxiv in the Python script.

Taking "SLAM" as the search keyword, the following script requests 10 results sorted by submission date:

import arxiv

search = arxiv.Search(
  query = "SLAM",
  max_results = 10,
  sort_by = arxiv.SortCriterion.SubmittedDate
)
for result in search.results():
  print(result.entry_id, '->', result.title)

In the script above, (Search).results() returns the paper metadata, which arxiv.py has already parsed for us, so fields such as result.title can be accessed directly. The main fields are:

| Element | Explanation |
|---|---|
| entry_id | A URL of the form http://arxiv.org/abs/{id}. |
| updated | When the result was last updated. |
| published | When the result was originally published. |
| title | The title of the result. |
| authors | The result's authors, as arxiv.Authors. |
| summary | The result abstract. |
| comment | The authors' comment, if present. |
| journal_ref | A journal reference, if present. |
| doi | A URL for the resolved DOI to an external resource, if present. |
| primary_category | The result's primary arXiv category. See arXiv: Category Taxonomy [4]. |
| categories | All of the result's categories. See arXiv: Category Taxonomy. |
| links | Up to three URLs associated with this result, as arxiv.Links. |
| pdf_url | A URL for the result's PDF, if present. Note: this URL also appears among result.links. |
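
As a quick illustration (a minimal sketch, assuming the same arxiv.py setup as above), these fields can be read directly from each result:

import arxiv

# fetch one result and print a few of the fields listed above
search = arxiv.Search(query="SLAM", max_results=1)
for result in search.results():
    print("title:    ", result.title)
    print("authors:  ", ", ".join(str(a) for a in result.authors))
    print("published:", result.published)
    print("pdf_url:  ", result.pdf_url)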

The earlier 10-result search script prints the following to the terminal:

http://arxiv.org/abs/2110.11040v1 -> InterpolationSLAM: A Novel Robust Visual SLAM System in Rotational Motion
http://arxiv.org/abs/2110.10329v1 -> SLAM: A Unified Encoder for Speech and Language Modeling via Speech-Text Joint Pre-Training
http://arxiv.org/abs/2110.09156v1 -> Enhancing exploration algorithms for navigation with visual SLAM
http://arxiv.org/abs/2110.08977v1 -> Accurate and Robust Object-oriented SLAM with 3D Quadric Landmark Construction in Outdoor Environment
http://arxiv.org/abs/2110.08639v1 -> Partial Hierarchical Pose Graph Optimization for SLAM
http://arxiv.org/abs/2110.07546v1 -> Active SLAM over Continuous Trajectory and Control: A Covariance-Feedback Approach
http://arxiv.org/abs/2110.06541v2 -> Collaborative Radio SLAM for Multiple Robots based on WiFi Fingerprint Similarity
http://arxiv.org/abs/2110.05734v1 -> Learning Efficient Multi-Agent Cooperative Visual Exploration
http://arxiv.org/abs/2110.03234v1 -> Self-Supervised Depth Completion for Active Stereo
http://arxiv.org/abs/2110.02593v1 -> InterpolationSLAM: A Novel Robust Visual SLAM System in Rotating Scenes

The next script, daily_arxiv.py, fetches papers on SLAM from arXiv and writes each paper's publication date, title, authors, and PDF link into a Markdown table saved as README.md.

import datetime
import requests
import json
import arxiv
import os
def get_authors(authors, first_author = False):
    # join all author names, or return only the first author
    if first_author:
        return str(authors[0])
    return ", ".join(str(author) for author in authors)
def sort_papers(papers):
    output = dict()
    keys = list(papers.keys())
    keys.sort(reverse=True)
    for key in keys:
        output[key] = papers[key]
    return output    

def get_daily_papers(topic,query="slam", max_results=2):
    """
    @param topic: str
    @param query: str
    @return paper_with_code: dict
    """

    # output 
    content = dict() 
    
    search_engine = arxiv.Search(
        query = query,
        max_results = max_results,
        sort_by = arxiv.SortCriterion.SubmittedDate
    )

    for result in search_engine.results():

        paper_id       = result.get_short_id()
        paper_title    = result.title
        paper_url      = result.entry_id

        paper_abstract = result.summary.replace("\n"," ")
        paper_authors  = get_authors(result.authors)
        paper_first_author = get_authors(result.authors,first_author = True)
        primary_category = result.primary_category

        publish_time = result.published.date()

        print("Time = ", publish_time ,
              " title = ", paper_title,
              " author = ", paper_first_author)

        # eg: 2108.09112v1 -> 2108.09112
        ver_pos = paper_id.find('v')
        if ver_pos == -1:
            paper_key = paper_id
        else:
            paper_key = paper_id[0:ver_pos] 

        content[paper_key] = f"|**{publish_time}**|**{paper_title}**|{paper_first_author} et.al.|[{paper_id}]({paper_url})|\n"
    data = {topic:content}
    
    return data 

def update_json_file(filename,data_all):
    with open(filename,"r") as f:
        content = f.read()
        if not content:
            m = {}
        else:
            m = json.loads(content)
            
    json_data = m.copy() 
    
    # update papers in each keywords         
    for data in data_all:
        for keyword in data.keys():
            papers = data[keyword]

            if keyword in json_data.keys():
                json_data[keyword].update(papers)
            else:
                json_data[keyword] = papers

    with open(filename,"w") as f:
        json.dump(json_data,f)
    
def json_to_md(filename):
    """
    @param filename: str
    @return None
    """
    
    DateNow = datetime.date.today()
    DateNow = str(DateNow)
    DateNow = DateNow.replace('-','.')
    
    with open(filename,"r") as f:
        content = f.read()
        if not content:
            data = {}
        else:
            data = json.loads(content)

    md_filename = "README.md"  
      
    # clean README.md if daily already exist else create it
    with open(md_filename,"w+") as f:
        pass

    # write data into README.md
    with open(md_filename,"a+") as f:
  
        f.write("## Updated on " + DateNow + "\n\n")
        
        for keyword in data.keys():
            day_content = data[keyword]
            if not day_content:
                continue
            # the head of each part
            f.write(f"## {keyword}\n\n")
            f.write("|Publish Date|Title|Authors|PDF|\n" + "|---|---|---|---|\n")
            # sort papers by date
            day_content = sort_papers(day_content)
        
            for _,v in day_content.items():
                if v is not None:
                    f.write(v)

            f.write(f"\n")
    print("finished")     

if __name__ == "__main__":

    data_collector = []
    keywords = dict()
    keywords["SLAM"] = "SLAM"
 
    for topic,keyword in keywords.items():
 
        print("Keyword: " + topic)
        data = get_daily_papers(topic, query = keyword, max_results = 10)
        data_collector.append(data)
        print("\n")

    # update README.md file
    json_file = "cv-arxiv-daily.json"
    if not os.path.exists(json_file):
        # create an empty JSON file on the first run
        with open(json_file, 'w'):
            print("create " + json_file)
    # update json data
    update_json_file(json_file,data_collector)
    # json data to markdown
    json_to_md(json_file)

The key points of the above script are:

  1. The retrieval topic and keyword are both SLAM, and the latest 10 articles are returned;
  2. Note that the topic is used as the second-level heading above each table, while the keyword is what is actually searched. Pay special attention to the format of multi-word keywords containing spaces, such as "camera Localization": the double quotes must be escaped so that arXiv searches for the exact phrase (see the sketch after this list). You can add keywords of interest following the same rules;
  3. The paper list is sorted by the submission date on arXiv, with the newest at the top;
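
A minimal sketch of point 2 (the Camera Localization topic and its escaped query string are illustrative assumptions, not part of the original script):

keywords = dict()
keywords["SLAM"] = "SLAM"
# multi-word phrase: wrap it in escaped double quotes so arXiv
# searches for the exact phrase rather than the separate words
keywords["Camera Localization"] = "\"camera Localization\""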

This might seem to complete the task, but two problems remain: 1. the script must be run manually every time; 2. the results can only be viewed locally. To run the script automatically every day and sync the results to a GitHub repository, GitHub Actions comes in handy.

Introduction to Github Actions

To restate the goal: use GitHub Actions to automatically fetch SLAM papers from arXiv every day, turn each paper's publication date, title, authors, and link into a Markdown table, and publish it on GitHub.

What is Github Actions?

GitHub Actions is GitHub's continuous-integration service, launched in October 2018.

The official description is as follows [3]:

GitHub Actions help you automate tasks within your software development life cycle. GitHub Actions are event-driven, meaning that you can run a series of commands after a specified event has occurred. For example, every time someone creates a pull request for a repository, you can automatically run a command that executes a software testing script.

In short, GitHub Actions automates tasks by reacting to events.

Basic concepts

GitHub Actions has its own terminology [10], [9]:

  1. workflow: one complete run of the continuous-integration process;
  2. job: a workflow consists of one or more jobs, meaning a single CI run can complete multiple tasks;
  3. step: each job consists of multiple steps, executed one after another;
  4. action: each step can run one or more commands (actions) in turn;

Deployment

Log in to your GitHub account, create a new repository, e.g. cv-arxiv-daily, click Actions, and then click Set up this workflow, as shown in the figure below:

After the above steps, a new file named blank.yml is created (as shown in the figure below) in the directory .github/workflows/. Note that this directory cannot be renamed: the workflows to be executed are stored in this folder, and GitHub Actions automatically discovers the YML workflow files there and executes them according to their rules.

This blank.yml implements the simplest possible workflow: printing Hello, world!.
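
For reference, a minimal workflow of this kind might look like the following (a sketch: the name and trigger are illustrative, not necessarily the exact template contents):

name: CI
on: [push]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Say hello
        run: echo "Hello, world!"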

Note that GitHub Actions workflows have their own syntax; due to space constraints it is not covered in detail here, so see [9] for the full reference.

To run the Python script daily_arxiv.py from the previous section automatically, it is not hard to arrive at the following workflow configuration, cv-arxiv-daily.yml. Note that the two environment variables GITHUB_USER_NAME and GITHUB_USER_EMAIL should be replaced with your own GitHub ID and email address.

# name of workflow
name: Run Arxiv Papers Daily

# Controls when the workflow will run
on:
  # Allows you to run this workflow manually from the Actions tab
  workflow_dispatch:
  schedule:
    - cron: "0 12 * * *"  # runs at 12:00 UTC every day
env:

  GITHUB_USER_NAME: xxx # your github id
  GITHUB_USER_EMAIL: xxx # your email address
  
  
# A workflow run is made up of one or more jobs that can run sequentially or in parallel
jobs:
  # This workflow contains a single job called "build"
  build:
    name: update
    # The type of runner that the job will run on
    runs-on: ubuntu-latest
    
    # Steps represent a sequence of tasks that will be executed as part of the job
    steps:
      - name: Checkout
        uses: actions/checkout@v2
        
      - name: Set up Python Env
        uses: actions/setup-python@v1
        with:
          python-version: 3.6        

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install arxiv
          pip install requests
          
      - name: Run daily arxiv 
        run: |
          python daily_arxiv.py
          
      - name: Push new cv-arxiv-daily.md
        uses: github-actions-x/commit@v2.8
        with:
          github-token: ${{ secrets.GITHUB_TOKEN }}
          commit-message: "Github Action Automatic Update CV Arxiv Papers"
          files: README.md cv-arxiv-daily.json
          rebase: 'true'
          name: ${{ env.GITHUB_USER_NAME }}
          email: ${{ env.GITHUB_USER_EMAIL }}

Here, workflow_dispatch means the workflow can be run manually by clicking in the Actions tab, and schedule [7] means it runs on a timer; for the specific rules, see Events that trigger workflows [8].

The cron expression used here has five fields separated by spaces, as follows:

┌───────────── minute (0 - 59)
│ ┌───────────── hour (0 - 23)
│ │ ┌───────────── day of the month (1 - 31)
│ │ │ ┌───────────── month (1 - 12 or JAN-DEC)
│ │ │ │ ┌───────────── day of the week (0 - 6 or SUN-SAT)
│ │ │ │ │
│ │ │ │ │
│ │ │ │ │
* * * * *

Supplementary syntax:

| Operator | Description | Example |
|---|---|---|
| * | Any value | * * * * * runs every minute of every day. |
| , | Value list separator | 2,10 4,5 * * * runs at minute 2 and 10 of the 4th and 5th hour of every day. |
| - | Range of values | 0 4-6 * * * runs at minute 0 of the 4th, 5th, and 6th hour. |
| / | Step values | 20/15 * * * * runs every 15 minutes starting from minute 20 through 59 (minutes 20, 35, and 50). |

The key points of the above workflow are summarized as follows:

  1. The workflow is triggered by the schedule event at 12:00 UTC every day (it is not triggered by pushes);
  2. There is only one job, named build, which runs on the ubuntu-latest virtual machine;
  3. The first step checks out the source code, using the action actions/checkout@v2;
  4. The second step sets up the Python environment, using actions/setup-python@v1 with Python version 3.6;
  5. The third step installs the dependencies: it upgrades pip, then installs the arxiv.py and requests libraries;
  6. The fourth step runs the daily_arxiv.py script, which generates the temporary JSON file and the corresponding README.md;
  7. The fifth step pushes the results to the repository, using github-actions-x/commit@v2.8 [11]; the parameters to configure include the commit message, the files to commit, and the GitHub user name and email;

After the workflow is deployed successfully, the JSON file and README.md are generated in the GitHub repo, and you will see the article list shown at the beginning of this article. The run log can be inspected in the repository's Actions tab.

Summary

This article introduced a way to automatically fetch arXiv papers every day using GitHub Actions, making it easy to obtain and preview the latest articles of interest. The example given here is easy to modify: readers can track topics of interest simply by adding keywords. All the code in this article is open source; see the link at the end. The latest version of the code also fetches each paper's code links, adds several more keywords, and deploys automatically to a GitHub Page.

That said, the approach described here still has a few problems: 1. the generated JSON file is a temporary file and could ideally be deleted; 2. the README.md file will gradually grow over time, so an archiving feature could be added later; 3. not everyone browses GitHub every day, so a feature to send new articles by email may be added in the future.

Welcome to fork & star and build your own paper-tracking tool :)

code: github.com/Vincentqyw/cv-arxiv-daily


References

[1] About arXiv, https://arxiv.org/about

[2] arXiv API User's Manual, https://arxiv.org/help/api/user-manual

[3] GitHub Actions, https://docs.github.com/en/actions/learn-github-actions

[4] arXiv Category Taxonomy, https://arxiv.org/category_taxonomy

[5] arxiv.py, a Python wrapper for the arXiv API, https://github.com/lukasschwab/arxiv.py

[6] Full package documentation: arxiv.arxiv, http://lukasschwab.me/arxiv.py/index.html

[7] GitHub Actions on.schedule, https://docs.github.com/en/actions/learn-github-actions/workflow-syntax-for-github-actions#onschedule

[8] GitHub Actions: Events that trigger workflows, https://docs.github.com/en/actions/learn-github-actions/events-that-trigger-workflows#scheduled-events

[9] Workflow syntax for GitHub Actions, https://docs.github.com/en/actions/learn-github-actions/workflow-syntax-for-github-actions

[10] GitHub Actions getting-started tutorial, http://www.ruanyifeng.com/blog/2019/09/getting-started-with-github-actions.html

[11] Git commit and push action, https://github.com/github-actions-x/commit

[12] daily_arxiv: generate a daily list of arXiv papers, https://github.com/zhuwenxing/daily_arxiv

Topics: Python github AR