1. College Entrance Examination College Data - Foreword
We have finally reached the Scrapy crawler framework, which is arguably the most popular crawler framework in Python. Next we will focus on how to use it.
For the installation process, a quick search on Baidu will turn up more than three installation methods, any of which will get it installed.
You can also refer to the official installation instructions at https://scrapy-chs.readthedocs.io/zh_CN/0.24/intro/install.html.
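For example, one common way to install it (assuming pip is available; installing via conda or from source works as well) is to run:
pip install scrapy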
2. College Entrance Examination College Data - Creating the Scrapy Project
In general, use the following command to create it:
scrapy startproject mySpider
After it completes, your project's directory structure is as follows.
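A freshly created project typically looks like this (a sketch; the exact set of files can vary slightly between Scrapy versions):

```
mySpider/
    scrapy.cfg
    mySpider/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
```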
The meaning of each file is:
- scrapy.cfg: the project's configuration file
- mySpider/: the project's root package
- mySpider/items.py: the project's item definitions; it specifies the data format, i.e. the attributes or fields of the objects you parse out
- mySpider/pipelines.py: the project's pipeline file, responsible for processing the items extracted by the spider; typical tasks are cleanup, validation, and persistence (such as saving to a database)
- mySpider/settings.py: the project's settings file
- mySpider/spiders/: the home directory for your crawlers
- mySpider/middlewares.py: spider middleware, a specific hook between the engine and the spider that handles the spider's input (responses) and output (items and requests); it provides a simple mechanism to extend Scrapy by inserting custom code, and this article does not cover it
3. College Entrance Examination College Data - Creating the Scrapy Crawler
From the command line, go to the mySpider/spiders/ directory and execute the following command:
scrapy genspider GaoKao "www.gaokaopai.com"
Open GaoKao.py in the mySpider/spiders/ directory; the following code has already been generated by default:
```python
import scrapy


class GaoKaoSpider(scrapy.Spider):
    name = "GaoKao"
    allowed_domains = ["www.gaokaopai.com"]
    start_urls = ['http://www.gaokaopai.com/']

    def parse(self, response):
        pass
```
The generated code contains a GaoKaoSpider class that inherits from scrapy.Spider, with three attributes and one method implemented by default:
- name = "": the name of the crawler. It must be unique; different crawlers need to be given different names.
- allowed_domains: the allowed domain scope, restricting the crawler to pages under these domains.
- start_urls: a tuple/list of URLs to crawl. The crawler starts fetching data here; the first pages crawled are these, and further URLs are generated from the results of these initial requests.
- parse(self, response): the method that parses web pages. It is called once each initial URL has been downloaded, with the Response object returned for that URL passed in as the only parameter. Its main job is to parse the returned page data (response.body) and to generate the requests for the next pages (see the sketch below).
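As a generic illustration of that flow (a sketch only, not this project's code; the quotes.toscrape.com URL and the selectors are placeholder assumptions, and response.follow/.get() assume a reasonably recent Scrapy version), a parse() method usually extracts data and then yields the request for the next page:

```python
import scrapy


class ExampleSpider(scrapy.Spider):
    # Minimal sketch of the parse() flow, not part of this project
    name = "example"
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        # Parse the downloaded page data (response.body)
        for quote in response.xpath("//div[@class='quote']"):
            yield {"text": quote.xpath("span[@class='text']/text()").get()}

        # Generate the URL request for the next page; it is parsed by this same method
        next_page = response.xpath("//li[@class='next']/a/@href").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```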
4. College Entrance Examination College Data - The First Case
The data we want to crawl is at http://www.gaokaopai.com/rank-index.html.
There is a "load more" button at the bottom of the page; click it and capture the link of the request it sends.
Then the awkward part: it turns out to be a POST request. I had intended to implement this as a GET, so the amount of code grows a bit.~
Scrapy issues GET requests by default. If we need to change this to POST, we have to override the Spider class's start_requests(self) method, and the URLs in start_urls are no longer used. So let's modify the code; after the rewrite, pay attention to the following line of code:
request = FormRequest(self.start_url,headers=self.headers,formdata=form_data,callback=self.parse)
- FormRequest has to be imported: from scrapy import FormRequest
- self.start_url holds the address the POST request is sent to
- formdata is used to submit the form data
- callback specifies the function that parses the returned page
- the final yield request makes this function a generator
```python
import scrapy
from scrapy import FormRequest
import json

from mySpider.items import MyspiderItem


class GaokaoSpider(scrapy.Spider):
    name = 'GaoKao'
    allowed_domains = ['gaokaopai.com']
    start_url = 'http://www.gaokaopai.com/rank-index.html'

    def __init__(self):
        self.headers = {
            # Fill in a real browser User-Agent string for yourself
            "User-Agent": "your browser User-Agent",
            "X-Requested-With": "XMLHttpRequest"
        }

    # You need to override the start_requests() method
    def start_requests(self):
        for page in range(0, 7):
            form_data = {
                "otype": "4",
                "city": "",
                "start": str(25 * page),
                "amount": "25"
            }
            request = FormRequest(self.start_url, headers=self.headers,
                                  formdata=form_data, callback=self.parse)
            yield request

    def parse(self, response):
        print(response.body)
        print(response.url)
        print(response.body_as_unicode())
```
In the parse(self, response) function we print the content of the web page. This step relies on knowing how to get the page content:
- response.url gets the URL that was crawled
- response.body gets the page content as bytes
- response.body_as_unicode() gets the page content as a str (see the note below for newer Scrapy versions)
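Note that in newer Scrapy versions body_as_unicode() is deprecated in favor of response.text, which returns the same decoded string. If your Scrapy version complains, the equivalent calls look like this (a small sketch of the same parse method):

```python
def parse(self, response):
    print(response.url)   # the URL that was crawled
    print(response.body)  # raw page content, as bytes
    print(response.text)  # decoded page content, as str (replaces body_as_unicode())
```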
Now we can run the crawler program.
Create a begin.py file in the project root directory and write the following code in it:
```python
from scrapy import cmdline

cmdline.execute(("scrapy crawl GaoKao").split())
```
Run this file. Remember that running the other .py files in a Scrapy project directly will not show the corresponding results; every time you test, you need to run begin.py (you can of course give it a different name).
If you don't, you can only use the following, more troublesome, procedure:
cd into the crawler project directory and execute scrapy crawl GaoKao --nolog. Explanation: scrapy crawl GaoKao (GaoKao is the name of the crawler) --nolog (--nolog means no log output is shown).
When it runs, the data is printed to the console, which makes testing easy. To speed up a test run, you can change the number 7 in the code above to 2.
While it runs, PyCharm prints a lot of red text to the console. That's fine; it is not a bug, just Scrapy's log output.
Be sure to look for the black text in the middle of the red text: the black text is the data you printed. If you can see that content, you are more than halfway there.
Next, define the item fields in mySpider/items.py:

```python
import scrapy


class MyspiderItem(scrapy.Item):
    # School name
    uni_name = scrapy.Field()
    uni_id = scrapy.Field()
    city_code = scrapy.Field()
    uni_type = scrapy.Field()
    slogo = scrapy.Field()
    # Admission difficulty
    safehard = scrapy.Field()
    # Location of the institution
    rank = scrapy.Field()
```
Then, back in the GaokaoSpider class, we improve the parse function so that it checks response.headers["Content-Type"] to determine whether the page is HTML or JSON:
```python
    def parse(self, response):
        # The Content-Type header comes back as bytes; decode it before comparing
        content_type = response.headers.get("Content-Type", b"").decode("utf-8")

        if "text/html" in content_type:
            # print(response.body_as_unicode())
            trs = response.xpath("//table[@id='results']//tr")[1:]
            for item in trs:
                school = MyspiderItem()
                rank = item.xpath("td[1]/span/text()").extract()[0]
                uni_name = item.xpath("td[2]/a/text()").extract()[0]
                safehard = item.xpath("td[3]/text()").extract()[0]
                city_code = item.xpath("td[4]/text()").extract()[0]
                uni_type = item.xpath("td[6]/text()").extract()[0]

                school["uni_name"] = uni_name
                school["uni_id"] = ""
                school["city_code"] = city_code
                school["uni_type"] = uni_type
                school["slogo"] = ""
                school["rank"] = rank
                school["safehard"] = safehard

                yield school
        else:
            data = json.loads(response.body_as_unicode())
            data = data["data"]["ranks"]  # get the data
            for item in data:
                school = MyspiderItem()
                school["uni_name"] = item["uni_name"]
                school["uni_id"] = item["uni_id"]
                school["city_code"] = item["city_code"]
                school["uni_type"] = item["uni_type"]
                school["slogo"] = item["slogo"]
                school["rank"] = item["rank"]
                school["safehard"] = item["safehard"]
                # Hand the acquired data to the pipelines defined in settings.py
                yield school
```
The execution mechanism of the parse() method:
- Use yield to return data rather than return. This makes parse a generator, and Scrapy consumes the data that parse generates one item at a time.
- If the yielded value is a Request, it joins the crawl queue; if it is an Item, it is handed to the pipelines; any other type raises an error.
At this point, if you want the data to actually go to a pipeline, you need to enable it in settings.py:
```python
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'mySpider.pipelines.MyspiderPipeline': 300,
}
```
At the same time, write the pipelines.py file:
```python
import os
import csv


class MyspiderPipeline(object):
    def __init__(self):
        # CSV file used to store the results
        store_file = os.path.dirname(__file__) + "/spiders/school1.csv"
        self.file = open(store_file, "a+", newline='', encoding="utf-8")
        self.writer = csv.writer(self.file)

    def process_item(self, item, spider):
        try:
            self.writer.writerow((
                item["uni_name"],
                item["uni_id"],
                item["city_code"],
                item["uni_type"],
                item["slogo"],
                item["rank"],
                item["safehard"]
            ))
        except Exception as e:
            print(e.args)
        # Returning the item lets any later pipelines keep processing it
        return item

    def close_spider(self, spider):
        self.file.close()
```
Well, that's all the code, and it is still fairly simple. Change the number above back to 7. Why 7? Because only the first 150 records can be obtained.
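If you want to sanity-check the result, you can count the rows written to the CSV file (a small sketch; the school1.csv path is the one used in the pipeline above, relative to the project root where begin.py runs):

```python
import csv

# Count the records written by MyspiderPipeline
with open("mySpider/spiders/school1.csv", encoding="utf-8") as f:
    rows = list(csv.reader(f))

print("Scraped {} schools".format(len(rows)))
```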