Details of a Python crawler Scrapy project (note: continuously updated)

Posted by whatever on Mon, 02 Dec 2019 00:48:51 +0100

Python crawler Scrapy project (1)

Crawling target: Tencent recruitment website (starting URL: https://hr.tencent.com/position.php?keywords=&tid=0&start=)

Crawling content: position name; position type; number of openings; work location; release date; link to the position details; job responsibilities; job requirements

Measures against anti-crawling: set a random User-Agent and a delay between requests

1. Create the project

scrapy startproject tencent

2. Enter the tencent folder, run the command below to generate the spider file, and then write the spider.

scrapy genspider hr "tencent.com"

After the commands have run, open the project in an IDE such as PyCharm; the project's file structure will have been created in your current directory.

3. Edit the items.py file in the project directory and define the fields you need to crawl.

import scrapy


class TencentItem(scrapy.Item):
    # Position name
    position = scrapy.Field()
    # Position type
    position_type = scrapy.Field()
    # Number of openings
    persons = scrapy.Field()
    # Work location
    place = scrapy.Field()
    # Release date of the posting
    time = scrapy.Field()
    # Link to the position details
    detail_link = scrapy.Field()
    # Job responsibilities
    work_duty = scrapy.Field()
    # Job requirements
    work_request = scrapy.Field()

4. Enter the spiders folder, open the hr.py file, and write the spider.

# -*- coding: utf-8 -*-
import re

import scrapy

# Import through the project package (assumes the standard layout created by startproject)
from tencent.items import TencentItem


class HrSpider(scrapy.Spider):
    name = 'hr'
    allowed_domains = ['tencent.com']
    offset = 0
    original_url = 'https://hr.tencent.com/position.php?keywords=&tid=0&start='
    # Build the start URL from the initial offset
    start_urls = [original_url + str(offset)]

    def parse(self, response):
        # Select the table rows with XPath; the first and last rows are skipped
        # because they do not hold position data
        trs = response.xpath("//table[@class='tablelist']//tr")[1:-1]
        for tr in trs:
            item = TencentItem()
            item["position"] = tr.xpath("./td[1]/a/text()").extract()
            item["position_type"] = tr.xpath("./td[2]/text()").extract()
            item["persons"] = tr.xpath("./td[3]/text()").extract()
            item["place"] = tr.xpath("./td[4]/text()").extract()
            item["time"] = tr.xpath("./td[5]/text()").extract()
            link_part = tr.xpath("./td[1]/a/@href").extract_first()
            # The href is relative, so prepend the site root to build the detail-page URL
            url_detail = item["detail_link"] = 'https://hr.tencent.com/' + link_part
            # Yield the detail request to the Scrapy scheduler, which queues it
            # and sends the requests in turn
            yield scrapy.Request(
                url=url_detail,
                callback=self.parse_next_url,  # callback that parses the detail page
                meta={"item": item},  # pass the partly filled item along
            )
        # Pagination: request the next list page until the last offset is reached
        if self.offset < 2870:
            self.offset += 10
            url_send = self.original_url + str(self.offset)
            yield scrapy.Request(
                url=url_send,
                callback=self.parse,
            )

    # Callback for the detail page
    def parse_next_url(self, response):
        item = response.meta["item"]
        item["work_duty"] = response.xpath("//table[@class='tablelist textl']//tr[3]//ul//text()").extract()
        item["work_request"] = response.xpath("//table[@class='tablelist textl']//tr[4]//ul//text()").extract()
        # str() turns each extracted list into its string form before the
        # non-breaking spaces (\xa0) are removed
        item["work_duty"] = re.sub(r'(\xa0)', '', str(item["work_duty"]))
        item["work_request"] = re.sub(r'(\xa0)', '', str(item["work_request"]))
        yield item
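
The output in step 9 stores work_duty and work_request as stringified lists, because str() is applied to the extracted list before re.sub runs. As a minimal sketch of one alternative (not the author's method), the fragments could be cleaned one by one and joined into plain text. The helper name clean_fragments and the sample fragments are hypothetical.

```python
def clean_fragments(fragments):
    """Remove non-breaking spaces and surrounding whitespace from each
    extracted text fragment, drop empty fragments, and join the rest."""
    cleaned = (frag.replace("\xa0", "").strip() for frag in fragments)
    return "\n".join(frag for frag in cleaned if frag)


# Hypothetical fragments, shaped like the list returned by .extract()
sample = [
    "Responsible for game voice algorithm optimization;\xa0",
    "  ",
    "Bachelor degree or above;",
]
print(clean_fragments(sample))
```

With this approach the item field would hold readable text rather than the repr of a Python list, at the cost of changing the output format shown in step 9.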

5. Write the pipelines.py file to process the received data

import json


class TencentPipeline(object):
    # Open the output file once, when the pipeline is created
    def __init__(self):
        self.f = open("tencent.json", "wb")

    def process_item(self, item, spider):
        # json.dumps escapes non-ASCII characters by default;
        # ensure_ascii=False keeps Chinese text readable in the file
        content = json.dumps(dict(item), ensure_ascii=False) + ", \n"
        self.f.write(content.encode("utf-8"))
        return item

    # Scrapy calls close_spider when the spider finishes, so the file is closed here
    def close_spider(self, spider):
        self.f.close()
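
The effect of ensure_ascii can be checked without Scrapy. The sketch below uses only the standard library; the record contents are hypothetical and merely stand in for an item that contains Chinese text.

```python
import json

# Hypothetical item contents containing non-ASCII text
record = {"place": "深圳", "persons": "3"}

escaped = json.dumps(record)                       # default: ensure_ascii=True
readable = json.dumps(record, ensure_ascii=False)  # keeps the characters as-is

print(escaped)
print(readable)

# Both strings decode to the same data; only the on-disk form differs
print(json.loads(escaped) == json.loads(readable))
```

Because the readable form contains non-ASCII characters, it has to be encoded (here as UTF-8) before being written to a file opened in binary mode, which is what the pipeline does.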

6. Edit the settings.py file and configure the settings the crawl needs

DOWNLOADER_MIDDLEWARES = {
   'tencent.middlewares.RandomUA': 543,
}
ITEM_PIPELINES = {
   'tencent.pipelines.TencentPipeline': 300,
   #  'scrapy_redis.pipelines.RedisPipeline': 400,
}
USER_AGENT = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
]
# Delay between requests, in seconds
DOWNLOAD_DELAY = 1
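
To get a feel for what a one-second delay means for this crawl, here is a rough back-of-the-envelope sketch. The offset bound, step, and URL come from the spider in step 4; ROWS_PER_PAGE is an assumption (one position per offset step), and the estimate ignores response time. Scrapy also randomizes the delay by default (the RANDOMIZE_DOWNLOAD_DELAY setting), so the result is only an average.

```python
ORIGINAL_URL = "https://hr.tencent.com/position.php?keywords=&tid=0&start="
LAST_OFFSET = 2870   # upper bound used in the spider
PAGE_STEP = 10       # offset increment used in the spider
ROWS_PER_PAGE = 10   # assumption: one position per offset step
DOWNLOAD_DELAY = 1   # seconds, from settings.py

# List pages requested by the spider: start=0, 10, ..., 2870
list_urls = [ORIGINAL_URL + str(offset)
             for offset in range(0, LAST_OFFSET + 1, PAGE_STEP)]

# One detail request per row on every list page
detail_requests = len(list_urls) * ROWS_PER_PAGE
total_requests = len(list_urls) + detail_requests
minutes = total_requests * DOWNLOAD_DELAY / 60

print(len(list_urls), total_requests, round(minutes, 1))
```

Under these assumptions the crawl issues about 3,168 requests and takes a little under an hour.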

7. Edit the middlewares.py file to process each request.

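import random

# Load the User-Agent list defined in settings.py
from tencent.settings import USER_AGENT


class RandomUA(object):
    # Set a random User-Agent header on every outgoing request
    def process_request(self, request, spider):
        UA = random.choice(USER_AGENT)
        request.headers["user-agent"] = UA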
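
The middleware logic can be tried outside Scrapy. In the sketch below, FakeRequest is a hypothetical stand-in for scrapy.Request that carries only a headers dict, and the USER_AGENT list is shortened to two of the entries from settings.py.

```python
import random

# Shortened stand-in for the USER_AGENT list in settings.py
USER_AGENT = [
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
]


class FakeRequest:
    """Hypothetical stand-in for scrapy.Request; it only holds headers."""

    def __init__(self):
        self.headers = {}


class RandomUA(object):
    def process_request(self, request, spider):
        # Returning None (implicitly) lets Scrapy continue processing the request
        request.headers["user-agent"] = random.choice(USER_AGENT)


middleware = RandomUA()
seen = set()
for _ in range(50):
    req = FakeRequest()
    middleware.process_request(req, spider=None)
    seen.add(req.headers["user-agent"])

# Number of distinct User-Agent strings chosen across 50 requests
print(len(seen))
```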

8. Create the crawler's startup file, start.py

from scrapy.cmdline import execute
execute("scrapy crawl hr".split())

9. The result is as follows (each item is saved to a local file as a JSON string).

 1 {"position": ["25928-Senior Graphic Development Engineer (Shenzhen headquarter)"], "position_type": ["Technology category"], "persons": ["3"], "place": ["Shenzhen"], "time": ["2018-12-19"], "detail_link": "https://hr.tencent.com/position_detail.php?id=46479&keywords=&tid=0&lid=0", "work_duty": "['Responsible for the development of game engine graphics related features;', 'Responsible for the optimization of rendering process and algorithm, as well as the development of related tools;', 'Responsible for graphic compatibility analysis and analysis and positioning of difficult problems.']", "work_request": "['Bachelor degree or above, proficient C/C++,With solid data structure and algorithm foundation, familiar with common design patterns;', 'Knowledge of computer graphics, proficient in 3 D Graphics rendering technology, familiar with OpenGL as well as Shader Development;', 'Proficient in 3 D Game engine architecture, familiar with 3 D Engine interface and game production process;', '3 More than 3 years D Engine ( Unreal,Unity Development experience, more than one year experience in rendering related development and optimization;', 'Deep understanding of the implementation of client framework and other core modules, experience in leading the development of core modules is preferred;', 'Familiar with mobile terminal GPU/CPU Architecture, mobile rendering development experience is preferred;', 'Strong sense of responsibility, good at communication, and enthusiasm for the application of game cutting-edge technology.']"}, 
 2 {"position": ["25667-Channel Sales Manager (Shenzhen)"], "position_type": ["Market class"], "persons": ["2"], "place": ["Shenzhen"], "time": ["2018-12-19"], "detail_link": "https://hr.tencent.com/position_detail.php?id=46485&keywords=&tid=0&lid=0", "work_duty": "['As Tencent cloud channel manager, responsible for regional channel system construction and product sales;', 'Visit channel partners regularly, fully understand customer needs and actively follow up, formulate reasonable plans, be responsible for scheme prompt, negotiation, track the work of relevant departments of the company, and ensure the effective implementation of the plans;', 'Maintain a good business relationship with existing partners, update the company's product information in time, and convey the corporate and brand culture.']", "work_request": "['Bachelor degree or above, major in computer, telecommunication, marketing or other related fields;', 'More than five years working experience in software or Internet industry;', 'Rich experience in channel sales, regional management and long tail SME customer coverage;', 'Sales experience in enterprise application software, sales experience in cloud computing and Internet industry is preferred;', 'Be able to effectively cover middle and long tail customers through channels and undertake regional sales performance;', 'Be able to establish regional channel system, effectively handle channel conflict and risk prevention;', 'Be able to lead the development of various service and incentive methods, and continuously improve the satisfaction of channel partners;', 'With excellent coordination ability, good team spirit, integrity, dedication and sense of responsibility.']"}, 
 3 {"position": ["28481-Medical health UI Development Engineer(Shenzhen)"], "position_type": ["Design class"], "persons": ["1"], "place": ["Shenzhen"], "time": ["2018-12-19"], "detail_link": "https://hr.tencent.com/position_detail.php?id=46476&keywords=&tid=0&lid=0", "work_duty": "['Responsible for the preparation of front-end components of Tencent search, smart hospital and other related medical products, web Development work;', 'According to the product and design requirements, constantly optimize the front-end architecture and improve the user experience. Participation related UI The establishment and maintenance of component system.']", "work_request": "['Web page reconstruction or web More than 2 years of front-end development work; ', 'Master HTML5,CSS3,JavaScript Build high performance web Application; Mastery React or Vue And have relevant practical experience, master the mainstream front-end construction tools grunt,gulp,webpack;', 'Master UI Experience in component development, dynamic effect development, responsive, multi terminal adaptation and barrier free development;', 'Yes node.js/vue/react Development experience is preferred, and practical experience in front-end performance and tool development is preferred.', 'Yes Web Have a certain understanding of performance and safety; ', 'Have the spirit of innovation and be able to actively learn new technology in the industry, smooth communication and cooperation ability.']"}, 
 4 {"position": ["SA-Senior System Test Engineer of Tencent social advertising (R & D center, Beijing)"], "position_type": ["Technology category"], "persons": ["1"], "place": ["Beijing"], "time": ["2018-12-19"], "detail_link": "https://hr.tencent.com/position_detail.php?id=46486&keywords=&tid=0&lid=0", "work_duty": "['Participate in the whole process of Internet software product testing, including requirement analysis, design review, test plan development, test case design and implementation, defect tracking and software quality analysis, etc;', 'Develop test plan, build test environment, implement integration test, regression test, etc; ', 'Ensure the quality of the tested system, and strive to improve the quality and efficiency of R & D through the innovation of testing process and method.']", "work_request": "['Bachelor degree or above in engineering, computer or other related major;', 'be familiar with C/C++/Java Wait for at least one programming language,Yes Shell or Ruby/PHP/Perl/Python Experience is preferred;', 'At least 1 year working experience in software development and automation testing;', 'Experience in performance, safety, white box testing, etc. is preferred;', 'Experience in Internet advertising, search, big data processing, distributed system, database and network is preferred; ', 'be familiar with Linux or Unix Operating system;', 'Proficient in test process and test case design method,Be able to actively carry out technical research;', 'Ability to solve complex problems and write automatic test tools and systems;', 'Strong logical thinking ability,Negotiation ability and conflict management ability;', 'Good at teamwork,Understanding and adapting to change,Guided by results and actions,Strive for success.']"}, 
 5 {"position": ["25664-Government industry Delivery Project Manager"], "position_type": ["product/Project category"], "persons": ["2"], "place": ["Shenzhen"], "time": ["2018-12-19"], "detail_link": "https://hr.tencent.com/position_detail.php?id=46484&keywords=&tid=0&lid=0", "work_duty": "['1,Responsible for the project delivery management of Tencent cloud government industry;', '2,Be responsible for the organization and coordination of project resources, and ensure the collaborative work of all stakeholders and internal and external cooperation teams of the project team; ', '3,Be responsible for the formulation, tracking and maintenance of the project plan, ensure the completion of the project according to the plan, and solve various problems in the delivery;', '4,Assist to collect customer needs and user feedback, drive the R & D team to improve the product, and ensure the project passes the acceptance smoothly.']", "work_request": "['1,Full time bachelor degree or above, more than 5 years of experience in government industry, at least in-depth participation in 5 large and medium-sized projects in government industry;', '2,Have working experience in large-scale enterprises, have managed more than 20 project teams, have rich experience in cross department, cross organization communication and coordination, and be able to deal with complex project environment;', '3,Familiar with R & D process, including product design, requirement analysis, architecture design, development, testing, operation and maintenance, and agile development process;', '4,Excellent communication ability and skills, able to find ways to promote the smooth progress of the project, with a strong sense of results oriented;', '5,Have good project management, customer relationship maintenance ability, excellent communication skills, and be able to properly coordinate the relationship among customers, partners and internal teams;', '6,Have a strong sense of career, responsibility and responsibility, 
have a strong pressure resistance ability, can handle multiple project work in parallel, can withstand a certain degree of travel or field work;', '7,Yes PMP,ITIL Certificate is preferred, system integration project manager certificate of MII is preferred.']"}, 
 6 {"position": ["PCG14-Applied treasure Data Mining Algorithm Engineer (Shenzhen)"], "position_type": ["Technology category"], "persons": ["1"], "place": ["Shenzhen"], "time": ["2018-12-19"], "detail_link": "https://hr.tencent.com/position_detail.php?id=46483&keywords=&tid=0&lid=0", "work_duty": "['Responsible for providing suitable recommendation algorithm model;', 'Responsible for researching the leading technologies in the industry, combining the data of Tencent's various business platforms, giving specific experimental data according to the two scenarios of application center and social channel, and evaluating the results;', 'Responsible for reporting statistical data required by business according to different algorithm models and assisting in the implementation of various algorithms;', 'The bottleneck of the existing algorithm is studied, and reasonable improvement measures and solutions are proposed.']", "work_request": "['Master's or doctor's degree in computer, applied mathematics, artificial intelligence, pattern recognition, statistics, automatic control, etc. is preferred;', '2 At least years relevant working experience;', 'Have a deep understanding of machine learning, data mining algorithm and its application on the Internet;', 'Familiar with C/C++Language;', 'Experience in large-scale distributed computing platform and parallel algorithm development;', 'Rigorous mathematical thinking, outstanding analysis and induction ability, excellent communication and expression ability;', 'Experience in Internet advertising, e-commerce and search is preferred.']"}, 
 7 {"position": ["19867-Game background Development Engineer (Shenzhen)"], "position_type": ["Technology category"], "persons": ["1"], "place": ["Shenzhen"], "time": ["2018-12-19"], "detail_link": "https://hr.tencent.com/position_detail.php?id=46482&keywords=&tid=0&lid=0", "work_duty": "['Responsible for game background architecture design;', 'Responsible for the development of game background system modules and new features;', 'Responsible for server performance optimization and experience optimization.']", "work_request": "['2 At least years working experience in game server background, with complete project experience;', 'Solid programming foundation, have a certain understanding of high online large concurrent game background architecture;', 'be familiar with Unix/Linux Operating system C/C++Development;', 'be familiar with TCP/IP Protocol related knowledge, familiar with network programming and database;', 'Understand game server architecture and performance optimization methods;', 'Good ability to analyze and solve problems, able to independently undertake background logic system development;', 'High sense of responsibility, good communication ability and team spirit.']"}, 
 8 {"position": ["TME-Whole nation K Song Project Manager (Shenzhen)"], "position_type": ["product/Project category"], "persons": ["1"], "place": ["Shenzhen"], "time": ["2018-12-19"], "detail_link": "https://hr.tencent.com/position_detail.php?id=46478&keywords=&tid=0&lid=0", "work_duty": "['Responsible for the whole nation K Song version planning, risk monitoring, process tracking, to ensure the realization of version objectives;', 'Be responsible for FT Internal target confirmation, target disassembly, resource allocation and target achievement follow-up to promote the collaborative work of all roles;', 'Find, summarize and track process problems, promote continuous improvement of all aspects of the team and improve efficiency.']", "work_request": "['Bachelor degree or above, major in computer or related;', '3 More than years of software project management experience, with Internet, software technology development experience is preferred;', 'Proficient in software project process management, practical application experience and deep understanding of agile project management;', 'Good executive ability and sense of responsibility, can promote the project team to move towards the goal;', 'With rich ability of communication, communication and organization, excellent team spirit;']"}, 
 9 {"position": ["25928-Senior speech Algorithm Engineer (Shanghai)"], "position_type": ["Technology category"], "persons": ["1"], "place": ["Shanghai"], "time": ["2018-12-19"], "detail_link": "https://hr.tencent.com/position_detail.php?id=46477&keywords=&tid=0&lid=0", "work_duty": "['Responsible for game voice algorithm optimization;', 'Responsible for the research of voice cutting-edge technology;', 'Responsible for the maintenance and upgrading of the existing version algorithm of game voice.']", "work_request": "['Bachelor degree or above;', 'be familiar with PC/Android/iOS SDK Any platform C/C++Development, performance optimization; ', 'Familiar with digital signal processing, solid mathematical skills, familiar with MATLAB Simulation; ', 'Familiar with speech preprocessing algorithm AEC,AGC,VAD,NS,CNG,JitterBuffer,Mix Algorithm;', 'Familiar with common Codec,Opus/AAC/Speex etc.;', 'Yes AI Voice pre-processing experience is preferred;', 'be familiar with WebRTC,Speex,Opus Etc.']"}, 
10 {"position": ["HY-Game distribution/Operation Trainee (Shenzhen)"], "position_type": ["product/Project category"], "persons": ["1"], "place": ["Shenzhen"], "time": ["2018-12-19"], "detail_link": "https://hr.tencent.com/position_detail.php?id=46481&keywords=&tid=0&lid=0", "work_duty": "['Game distribution/The operation trainee program is dedicated to training high potential game operation talents to meet the needs of fast-growing game business;', 'The project adopts customized training mode, through special class learning, tutoring by famous teachers and project practice, to improve students' products sense,Operation ability and general quality, make students become excellent game distribution/Operational talent.', 'Assist the project producer and the R & D provider to jointly formulate the operation objectives and work plans, and agree on the game optimization, operation development and operation support in each stage;', 'Promote daily communication between game developers, pay close attention to the progress of R & D and operation preparation and provide necessary assistance;', 'Guide and support the daily work of different functional employees in the project team, and promote the development of objectives and work plans of the cooperation department;', 'According to the needs of the project, formulate and promote the project process specification to ensure the orderly progress of the project;', 'Timely find and track project problems, effectively manage project risks.']", "work_request": "['Love and enjoy the experience of games, have a certain understanding of R & D and operation, and maintain a strong curiosity and curiosity;', 'Excellent Chinese and English reading and writing skills;', 'Be proactive and able to work under high pressure.']"}, 
11 {"position": ["25928-Senior Graphic Development Engineer (Shanghai)"], "position_type": ["Technology category"], "persons": ["3"], "place": ["Shanghai"], "time": ["2018-12-19"], "detail_link": "https://hr.tencent.com/position_detail.php?id=46480&keywords=&tid=0&lid=0", "work_duty": "['Responsible for the development of game engine graphics related features;', 'Responsible for the optimization of rendering process and algorithm, as well as the development of related tools;', 'Responsible for graphic compatibility analysis and analysis and positioning of difficult problems.']", "work_request": "['Bachelor degree or above, proficient C/C++,With solid data structure and algorithm foundation, familiar with common design patterns;', 'Knowledge of computer graphics, proficient in 3 D Graphics rendering technology, familiar with OpenGL as well as Shader Development;', 'Proficient in 3 D Game engine architecture, familiar with 3 D Engine interface and game production process;', '3 More than 3 years D Engine ( Unreal,Unity Development experience, more than one year experience in rendering related development and optimization;', 'Deep understanding of the implementation of client framework and other core modules, experience in leading the development of core modules is preferred;', 'Familiar with mobile terminal GPU/CPU Architecture, mobile rendering development experience is preferred;', 'Strong sense of responsibility, good at communication, and enthusiasm for the application of game cutting-edge technology.']"}, 
12 {"position": ["28481-Senior manager of health insurance industry cooperation (Shenzhen)"], "position_type": ["Market class"], "persons": ["1"], "place": ["Shenzhen"], "time": ["2018-12-19"], "detail_link": "https://hr.tencent.com/position_detail.php?id=46474&keywords=&tid=0&lid=0", "work_duty": "['1,Be responsible for the development of customer resources in commercial insurance industry (including but not limited to insurance companies, innovative insurance platforms, industry associations and other professional fields);', '2,Expand relevant industry partners and cooperation institutions, integrate existing products and resources of the company, and form scenario based innovative solutions;', '3,Collect and sort out the market dynamics, policy changes and other industry information of the health insurance industry, interpret the feedback to boost the business strategy formulation;', '4,Integrate resources, design, formulate and promote the implementation of business cooperation programs, and effectively leverage industry resource cooperation.']", "work_request": "['1,Bachelor degree or above, at least 3 years working experience in health insurance field;', '2,Familiar with insurance industry, have health insurance innovative product operation experience or innovative platform operation experience; ', '3,Good communication and expression ability, clear thinking logic, keen insight, strong self drive and execution ability; ', '4,Have a high sense of responsibility and passion for work, pay attention to teamwork, adapt to frequent travel needs.']"}, 
13 {"position": ["25928-Front end test development engineer (Shenzhen)"], "position_type": ["Technology category"], "persons": ["1"], "place": ["Shenzhen"], "time": ["2018-12-19"], "detail_link": "https://hr.tencent.com/position_detail.php?id=46473&keywords=&tid=0&lid=0", "work_duty": "['Responsible for the test and development of platform software;', 'Responsible for interface test and unit test of platform components;', 'Be able to provide technical guidance and support to the team in key technologies;', 'Complete the mobile development task on time;', 'Be responsible for the coordination with the project team, promote the work, and help the project team to promote the quality of the whole project.']", "work_request": "['Bachelor's degree or above, major in computer related, development or test development background, basic in software development;', '2-3 At least years experience in software industry or Internet industry, familiar with Windows Programming or Android/iOS Programming;', 'Familiar with software development process Android/iOS Automatic test technology in environment;', 'Solid test case design ability, familiar with mainstream automation methods;', 'Solid C++,C or Object-C Basis of programming.', 'Experience in any of the following is preferred:', 'Stronger DEBUG Ability;', 'Yes Android,iOS Product automation test experience.']"}, 
......
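
Because the pipeline appends ", " and a newline after every record, the file as a whole is not a single valid JSON document. A minimal sketch of reading it back one line at a time follows; the helper name parse_record is hypothetical, and the sample line is a shortened version of one of the records above.

```python
import json


def parse_record(line):
    """Parse one line written by the pipeline, dropping the trailing ', '."""
    return json.loads(line.strip().rstrip(","))


# Shortened sample in the same shape as the pipeline's output lines
sample_line = (
    '{"position": ["19867-Game background Development Engineer (Shenzhen)"], '
    '"persons": ["1"], "place": ["Shenzhen"]}, \n'
)

record = parse_record(sample_line)
print(record["position"][0], record["place"][0])
```

For the real file, the same helper could be applied to each non-empty line read from tencent.json opened with encoding="utf-8".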
