1 Preface
The "golden March, silver April" hiring season has just passed, and autumn recruitment is approaching. In this busy season, the author once dreamed of grabbing every opening at a favorite company with one click, then tackling them one by one according to personal strengths and preferences to harvest a basket of offers. In fact, Python makes the first step of this goal easy. Taking the careers site of PayPal, the well-known fintech company, as an example, this article shows how Python can automatically capture job postings in batches, helping you get a head start on your job search!
Note: This article is only for learning and practicing Python programming skills. Any infringing content will be removed immediately.
2 Preparation
PayPal recruitment official website
First, examine the structure of PayPal's job search site. Published positions are shown as a list, and clicking an entry jumps to the corresponding details page. In some countries and regions there are many positions spread across multiple pages, and the page number is encoded in the URL, for example https://jobsearch.paypal-corp.com/en-US/search?facetcitystate=san%20jose,ca&pagenumber=2 (a short sketch of assembling such a URL follows the list below). Capturing the details of each position therefore takes the following steps:
- Locate the position list and find the URL corresponding to each position
- Traverse all pages, repeating the step above, and store all position URLs
- Visit each position URL, locate the details element, and save the job description
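As a quick aside, the paginated URL above can be reproduced with the standard library instead of writing the query string by hand; a minimal sketch (the comma ends up percent-encoded here, which servers treat the same way):

```python
from urllib.parse import urlencode, quote

# Assemble a paginated search URL; section 3.4 does the same job
# with plain string concatenation
base = 'https://jobsearch.paypal-corp.com/en-US/search'
params = {'facetcitystate': 'san jose,ca', 'pagenumber': 2}
print(base + '?' + urlencode(params, quote_via=quote))
# -> https://jobsearch.paypal-corp.com/en-US/search?facetcitystate=san%20jose%2Cca&pagenumber=2
```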
PayPal job description
3 Build the crawler with code
For setting up the Python environment, see the previous article: Build a Taobao order-grabbing robot in two minutes.
3.1 Import dependencies
```python
# Web page parsing
import requests
from bs4 import BeautifulSoup

# Table operations
import numpy as np
import pandas as pd

# General utilities
import re
import os
import unicodedata
```
3.2 Access the position list
```python
# Request a URL and return the response body
def url_request(url):
    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 '
                            '(KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36'}
    r = requests.get(url, headers=header)
    print("Connection status:", r.status_code, '\n')
    return r.text
```
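The function above prints the status code but never acts on it. If you would rather fail loudly on a bad response, a minimal variation could look like the sketch below (`raise_for_status` and `timeout` are standard `requests` features; this variant is not part of the original code):

```python
# Sketch: a stricter variant of url_request that raises on HTTP errors
# and bounds the wait time
def url_request_strict(url, timeout=10):
    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 '
                            '(KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36'}
    r = requests.get(url, headers=header, timeout=timeout)
    r.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return r.text
```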
3.3 Parse the position list
```python
# Find the required elements in the page and store the position list information
def job_parser(html):
    soup = BeautifulSoup(html, "html.parser")
    # Right-click to open the browser inspector and view the page source in the
    # Elements tab: each position title is an <a> tag whose class is
    # "primary-text-color job-result-title"
    job_header = soup.find_all('a', attrs={'class': 'primary-text-color job-result-title'})
    # Same lookup as above, keeping only tags that carry an href
    job_link = soup.find_all('a', attrs={'class': 'primary-text-color job-result-title'}, href=True)
    header = [i.contents[0] for i in job_header]
    link = ["https://jobsearch.paypal-corp.com/" + i['href'] for i in job_link]
    # Save the results
    return pd.DataFrame({'Title': header, 'Link': link})
```
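To sanity-check the parser on a single page, you can feed it the Shanghai listing URL used in section 3.4:

```python
# Quick check: parse one page and inspect the first rows
html = url_request('https://jobsearch.paypal-corp.com/en-US/search?'
                   'facetcountry=cn&facetcity=shanghai&pagenumber=1')
print(job_parser(html).head())
```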
3.4 Traverse all pages
```python
# Create a DataFrame to store the results
df = pd.DataFrame(columns=['Title', 'Link'])

# URL template; appending a page number yields each page of results
job_url_header = 'https://jobsearch.paypal-corp.com/en-US/search?facetcountry=cn&facetcity=shanghai&pagenumber='

# Traverse all pages and store the results
for i in range(2):
    job_url = job_url_header + str(i + 1)
    print('URL: {}'.format(job_url))
    job_html = url_request(job_url)
    # Append each page's results (DataFrame.append was removed in pandas 2.0,
    # so pd.concat is used here)
    df = pd.concat([df, job_parser(job_html)], ignore_index=True)
```
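The loop above hardcodes two pages. If the page count is not known in advance, one alternative is to keep fetching until a page comes back empty; this sketch assumes an out-of-range page number simply returns a page with no job links:

```python
# Sketch: fetch pages until one returns no jobs, instead of hardcoding the count
frames, page = [], 1
while True:
    page_df = job_parser(url_request(job_url_header + str(page)))
    if page_df.empty:
        break
    frames.append(page_df)
    page += 1
df = pd.concat(frames, ignore_index=True)
```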
3.5 Capture position details
```python
# Fetch a details page and extract the job description
def get_jd(url):
    jd_html = url_request(url)
    soup = BeautifulSoup(jd_html, "html.parser")
    jd_desc = soup.find('div', attrs={'class': 'jdp-job-description-card content-card'})
    # JD formats differ from page to page; only the common cases are demonstrated here
    if jd_desc:
        if jd_desc.find_all('ul'):
            desc = [i.text + '\n{}'.format(j.text)
                    for i, j in zip(jd_desc.find_all('p'), jd_desc.find_all('ul'))]
        else:
            desc = [i.text for i in jd_desc.find_all('p')]
        # Normalize unicode (e.g. non-breaking spaces) into plain text
        return unicodedata.normalize("NFKD", '\n'.join(desc))

# Apply the detail-grabbing function to the stored links and save the details
df['JD'] = df['Link'].apply(get_jd)

# Print the last two rows
df.tail(2)
```
Title | Link | JD |
---|---|---|
Manager, APAC Portfolio Management | https://jobsearch.paypal-corp.com//en-US/job/manager-apac-portfolio-management/J3N1SM76FQPVMX4VFZG | As the Shanghai Team Manager of PayPal APAC Portfolio Management team in GSR Enterprise Seller Risk Ops, you will manage a team of underwriters, and drive a risk management strategy and framework leveraging your strong business and financial acumen, logical reasoning and communication skills. This role will be covering the markets such as Hong Kong, Taiwan, Korea and Japan, based out of Shanghai... |
FBO Accountant | https://jobsearch.paypal-corp.com//en-US/job/fbo-accoutant/J3W8C0677G8FLJQQZDL | Responsibilities Timely and effective reconciliation of all assigned General Ledger accounts, including timely and accurate clearing of reconciling items in accordance with Company Policy. Ensure accurate posting of general ledger... |
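To keep the scraped table for later filtering, it can be written out with pandas' built-in writers (the filename here is just an example):

```python
# Persist the results to disk
df.to_csv('paypal_jobs.csv', index=False)
```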
As you can see, the position information has been captured successfully (the content above is truncated). The pages contain other information as well, and you can extend the crawler to capture whatever fields you need. If you have any questions, follow BulletTech and add our WeChat customer service for a detailed discussion!