Using Python to crawl Python job postings on 51job.com

Posted by panic! on Mon, 25 Nov 2019 18:30:14 +0100

Preface

The text and images in this article are taken from the Internet and are for learning and communication only; they are not for any commercial use. The copyright belongs to the original author. If you have any questions, please contact us promptly.

Author: my surname is Liu, but I can't keep your heart

PS: if you need Python learning materials, you can get them from the link below:

http://note.youdao.com/noteshare?id=3054cce4add8a909e784ad934f956cef

The fields scraped in this article are: job title, company name, company location, salary, and release time.

Create the crawler project

scrapy startproject qianchengwuyou

cd qianchengwuyou

scrapy genspider -t crawl qcwy www.xxx.com
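
For reference, the generated project layout should look roughly like this (Scrapy's default scaffold; spiders/qcwy.py is created by the genspider command):

qianchengwuyou/
├── scrapy.cfg
└── qianchengwuyou/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── qcwy.py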

 

Define the fields to be crawled in items.py

import scrapy


class QianchengwuyouItem(scrapy.Item):
    # define the fields for your item here like:
    job_title = scrapy.Field()
    company_name = scrapy.Field()
    company_address = scrapy.Field()
    salary = scrapy.Field()
    release_time = scrapy.Field()
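
Once declared, the item can be filled and read like a dictionary. A quick sanity check with hypothetical values, run from the project root (not part of the crawl itself):

from qianchengwuyou.items import QianchengwuyouItem

item = QianchengwuyouItem()
item['job_title'] = 'Python developer'  # hypothetical value
item['salary'] = '10-15K/month'         # hypothetical value
print(dict(item))  # {'job_title': 'Python developer', 'salary': '10-15K/month'}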

Write the main program in the qcwy.py file

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from qianchengwuyou.items import QianchengwuyouItem


class QcwySpider(CrawlSpider):
    name = 'qcwy'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://search.51job.com/list/000000,000000,0000,00,9,99,python,2,1.html?']
    # https://search.51job.com/list/000000,000000,0000,00,9,99,python,2,7.html?lang=c&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&ord_field=0&dibiaoid=0&line=&welfare=
    # Follow every pagination link on the search results
    rules = (
        Rule(LinkExtractor(allow=r'https://search.51job.com/list/000000,000000,0000,00,9,99,python,2,(\d+).html?'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Each job posting is a div.el inside #resultList; position()>1 skips the header row
        list_job = response.xpath('//div[@id="resultList"]/div[@class="el"][position()>1]')
        for job in list_job:
            item = QianchengwuyouItem()
            item['job_title'] = job.xpath('./p/span/a/@title').extract_first()
            item['company_name'] = job.xpath('./span[1]/a/@title').extract_first()
            item['company_address'] = job.xpath('./span[2]/text()').extract_first()
            item['salary'] = job.xpath('./span[3]/text()').extract_first()
            item['release_time'] = job.xpath('./span[4]/text()').extract_first()
            yield item
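
The (\d+) group in the Rule's regular expression captures the page number in the listing URL, which is how the crawler follows every pagination link. A minimal sketch checking the pattern against the example URL from the comment above:

import re

pattern = r'https://search.51job.com/list/000000,000000,0000,00,9,99,python,2,(\d+).html?'
url = 'https://search.51job.com/list/000000,000000,0000,00,9,99,python,2,7.html?lang=c&postchannel=0000'
print(re.search(pattern, url).group(1))  # prints '7', the captured page number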

 

Write the storage pipeline in the pipelines.py file

import pymysql


class QianchengwuyouPipeline(object):
    conn = None
    mycursor = None

    def open_spider(self, spider):
        print('Connecting to database...')
        self.conn = pymysql.connect(host='172.16.25.4', user='root', password='root', db='scrapy')
        self.mycursor = self.conn.cursor()

    def process_item(self, item, spider):
        print('Writing to database...')
        # Parameterized query: the driver handles quoting, avoiding SQL injection
        sql = 'INSERT INTO qcwy VALUES (NULL, %s, %s, %s, %s, %s)'
        self.mycursor.execute(sql, (item['job_title'], item['company_name'],
                                    item['company_address'], item['salary'],
                                    item['release_time']))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        print('Writing to database complete...')
        self.mycursor.close()
        self.conn.close()
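
The pipeline assumes the scrapy database already contains a qcwy table with six columns (an auto-increment id plus the five item fields). The schema is not shown in the post, so this creation script is a sketch under that assumption, with guessed column names and types:

import pymysql

# Hypothetical schema: only the column count (id + five fields) is implied by the INSERT above
conn = pymysql.connect(host='172.16.25.4', user='root', password='root', db='scrapy')
with conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS qcwy (
            id INT PRIMARY KEY AUTO_INCREMENT,
            job_title VARCHAR(255),
            company_name VARCHAR(255),
            company_address VARCHAR(255),
            salary VARCHAR(64),
            release_time VARCHAR(32)
        )
    """)
conn.commit()
conn.close()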

 

Enable the item pipeline and set the request header in the settings.py file

ITEM_PIPELINES = {
   'qianchengwuyou.pipelines.QianchengwuyouPipeline': 300,
}
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2'
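
Depending on your Scrapy version, the crawl may also need robots.txt compliance turned off in settings.py to reach the listing pages; this setting is an assumption, not something the original post shows:

# Assumption: newer Scrapy projects obey robots.txt by default
ROBOTSTXT_OBEY = False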

 

Run the crawler and write a .json file at the same time

scrapy crawl qcwy -o qcwy.json --nolog
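
The -o flag exports every yielded item to a JSON feed while the MySQL pipeline runs in parallel, and --nolog suppresses the log output. A record in qcwy.json should look roughly like this (field values are illustrative, not real scraped data):

{"job_title": "Python developer", "company_name": "XX Technology Co., Ltd.", "company_address": "Shanghai", "salary": "10-15K/month", "release_time": "11-25"}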

 

Finally, check whether the data was written to the database successfully.

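A quick way to check from Python, reusing the same connection settings as the pipeline (a sketch):

import pymysql

# Count the rows that landed in the qcwy table after the crawl
conn = pymysql.connect(host='172.16.25.4', user='root', password='root', db='scrapy')
with conn.cursor() as cur:
    cur.execute('SELECT COUNT(*) FROM qcwy')
    print(cur.fetchone()[0], 'rows written')
conn.close()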

Topics: Python Database SQL JSON