Python Crawler in Practice: Capturing PayPal Job Postings

Posted by davard on Fri, 14 Jan 2022 14:49:28 +0100

1 Preface

The "Golden March, Silver April" hiring season has just ended, and autumn recruitment is around the corner. In the middle of this busy, hyper-competitive season, the author once dreamed of grabbing every open position at a favorite company with one click, then tackling them one by one according to his own strengths and preferences to harvest a basket of offers. The first step of that goal is actually easy to complete with Python. Taking the official job site of PayPal, the well-known fintech company, as an example, this article shows how to automatically capture job postings in batches with Python, so you can get a head start on your job search!

Note: This article is intended only for learning and practicing Python programming skills. If it infringes on any rights, it will be removed immediately.

2 Preparation

PayPal recruitment official website

First, examine the structure of PayPal's job search site. Open positions are displayed as a list, and clicking an item in the list jumps to the corresponding details page. Some countries and regions have many positions spread across multiple pages, with the page number encoded in the URL, for example https://jobsearch.paypal-corp.com/en-US/search?facetcitystate=san%20jose,ca&pagenumber=2. Capturing the details of every position therefore takes three steps:

  • Locate the position list and extract the URL of each position

  • Traverse all pages, repeating the step above, and store all position URLs

  • Visit each position's URL, locate the description on the details page, and save it

PayPal job description

3 Building the Crawler with Code

For setting up the Python environment, please refer to the previous article: Build a Taobao order grabbing robot in two minutes

3.1 Importing Dependencies

# Fetching and parsing web pages
import requests
from bs4 import BeautifulSoup

# Table handling
import pandas as pd

# General utilities
import unicodedata

3.2 Accessing the Position List

# Request a URL and return the response body
def url_request(url):
    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36'}
    # The timeout guards against a connection that hangs forever
    r = requests.get(url, headers=header, timeout=10)
    print("Connection status:", r.status_code, '\n')
    return r.text
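
A quick way to sanity-check the helper is to fetch the first page of the Shanghai listings (the same URL template used in section 3.4 below) and peek at the returned HTML:

# Smoke test: fetch page 1 of the Shanghai listings and inspect the start of the HTML
html = url_request('https://jobsearch.paypal-corp.com/en-US/search?facetcountry=cn&facetcity=shanghai&pagenumber=1')
print(html[:200])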

3.3 Parsing the Position List

# Find the required elements in the page and collect the position list
def job_parser(html):
    soup = BeautifulSoup(html, "html.parser")
    # Right-click the page and open the browser inspector: in the Elements tab you can
    # see that each position title is an <a> tag with the class
    # "primary-text-color job-result-title", which also carries the href we need
    job_links = soup.find_all('a', attrs={'class': 'primary-text-color job-result-title'}, href=True)

    header = [i.contents[0] for i in job_links]
    link = ["https://jobsearch.paypal-corp.com/" + i['href'] for i in job_links]

    # Return the results as a table
    return pd.DataFrame({'Title': header, 'Link': link})
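
As a side note, the same lookup can be written with a CSS selector, which matches the two classes regardless of their order inside the class attribute:

# Equivalent lookup with a CSS selector instead of find_all
job_links = soup.select('a.primary-text-color.job-result-title')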

3.4 Traversing All Pages

# Create a DataFrame to store the results
df = pd.DataFrame(columns=['Title', 'Link'])
# URL template; appending different page numbers yields the different pages
job_url_header = 'https://jobsearch.paypal-corp.com/en-US/search?facetcountry=cn&facetcity=shanghai&pagenumber='

# Traverse all pages and accumulate the results
for i in range(2):
    job_url = job_url_header + str(i + 1)
    print('URL: {}'.format(job_url))
    job_html = url_request(job_url)
    # Append each page's results (DataFrame.append was removed in pandas 2.0)
    df = pd.concat([df, job_parser(job_html)], ignore_index=True)
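
The loop above hard-codes two pages. If you would rather discover the page count at run time, a sketch like the following can work; note that the ul.pagination selector is only an assumption about the pager's markup, which you should verify in the inspector:

# Assumed pager markup: a <ul class="pagination"> holding numbered page links
def page_count(html):
    soup = BeautifulSoup(html, "html.parser")
    pages = [a.text.strip() for a in soup.select('ul.pagination a')]
    numbers = [int(p) for p in pages if p.isdigit()]
    return max(numbers) if numbers else 1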

3.5 Capturing Position Details

def get_jd(url):
    jd_html = url_request(url)
    soup = BeautifulSoup(jd_html, "html.parser")
    jd_desc = soup.find('div', attrs={'class': 'jdp-job-description-card content-card'})
    # JD layouts differ from posting to posting; the handling below is only a demonstration
    if jd_desc:
        if jd_desc.find_all('ul'):
            # Pair each paragraph with the bullet list that follows it
            desc = [i.text + '\n{}'.format(j.text) for i, j in zip(jd_desc.find_all('p'), jd_desc.find_all('ul'))]
        else:
            desc = [i.text for i in jd_desc.find_all('p')]

        return unicodedata.normalize("NFKD", '\n'.join(desc))
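
If the paragraph/list pairing above proves too brittle for a particular posting, a simpler (if lossier) fallback is to flatten the whole description card with BeautifulSoup's get_text. A minimal sketch:

# Fallback sketch: flatten the entire description card to plain text
def get_jd_plain(url):
    soup = BeautifulSoup(url_request(url), "html.parser")
    card = soup.find('div', attrs={'class': 'jdp-job-description-card content-card'})
    return unicodedata.normalize("NFKD", card.get_text('\n', strip=True)) if card else None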

# Apply the detail-capture function to the stored links and save the descriptions
df['JD'] = df['Link'].apply(get_jd)

# Inspect the last two rows
df.tail(2)
Title: Manager, APAC Portfolio Management
Link: https://jobsearch.paypal-corp.com//en-US/job/manager-apac-portfolio-management/J3N1SM76FQPVMX4VFZG
JD: As the Shanghai Team Manager of PayPal APAC Portfolio Management team in GSR Enterprise Seller Risk Ops, you will manage a team of underwriters, and drive a risk management strategy and framework leveraging your strong business and financial acumen, logical reasoning and communication skills. This role will be covering the markets such as Hong Kong, Taiwan, Korea and Japan, based out of Shanghai...

Title: FBO Accountant
Link: https://jobsearch.paypal-corp.com//en-US/job/fbo-accoutant/J3W8C0677G8FLJQQZDL
JD: Responsibilities Timely and effective reconciliation of all assigned General Ledger accounts, including timely and accurate clearing of reconciling items in accordance with Company Policy. Ensure accurate posting of general ledger...
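
To keep the results around for the "break them one by one" step, writing the table to a file takes one line (the file name here is arbitrary):

# Persist the scraped positions for later use
df.to_csv('paypal_positions.csv', index=False)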

As you can see, the position information has been captured successfully (the descriptions above are truncated). The pages contain other information as well, which you can add to the crawl according to your own needs. If you have any questions, follow BulletTech and contact our WeChat customer service for a detailed discussion!

Topics: Python, crawler, Data Analysis