1. Introduction to Data Acquisition Tools
Most dynamic websites today load their data in the browser through JavaScript/Ajax requests and then render the page. In that situation, simply sending an HTTP GET request from a script and parsing the returned DOM is not enough to collect the data. One idea is to open the browser console with F12, analyze the server API, and then simulate the corresponding API calls to get the data we want. This works in some cases, but many large websites adopt anti-crawling strategies: for security reasons the interfaces often require the right headers and cookies before a request is accepted, and the request origin may be restricted as well, which makes this kind of collection much harder. Are there other effective methods? Of course: writing a crawler in Python is very simple. Let's first get to know Selenium and Selectors, and then summarize some data-collection techniques through an example of crawling merchant information from the Meituan website.
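Before moving on to those tools, here is roughly what the direct-API approach mentioned above looks like in code. This is only a minimal sketch with the requests library; the URL, header names, and cookie values are hypothetical placeholders, not a real interface:

import requests

headers = {
    "User-Agent": "Mozilla/5.0",            # pretend to be a normal browser
    "Referer": "https://www.example.com/",  # some sites check the request source
}
cookies = {"session_token": "placeholder"}  # a real site issues its own cookies

resp = requests.get("https://www.example.com/api/shops/123", headers=headers, cookies=cookies)
data = resp.json()                          # such interfaces usually return JSON

When the site ties these values to a logged-in browser session or rotates them frequently, keeping this kind of script working becomes the hard part, which is exactly why we turn to Selenium below.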
- Selenium is an open-source testing framework for automated testing of web applications (for example, websites). Because it can drive browsers such as Chrome, Firefox, and IE, it can realistically simulate a user clicking buttons, turning pages, filling in forms, and so on. We use Python to drive Selenium's WebDriver, which controls the browser and lets us obtain the rendered DOM document directly, saving a lot of time.
- Selectors are the data-extraction mechanism of Scrapy (a Python crawler framework). They are called selectors because they "select" parts of an HTML document through specific XPath or CSS expressions, which makes it very convenient to analyze and extract valid data from DOM documents. Since XPath is a W3C standard, the Selectors way of extracting data is universal; a small sketch follows this list.
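As a quick, self-contained illustration of Selectors (the HTML fragment here is made up for the example, not taken from Meituan):

from scrapy import Selector

html = "<div id='app'><div class='name'>Some Restaurant</div><div class='address'><p>Address: 1 Example Road</p></div></div>"
sel = Selector(text=html)
# "Select" nodes with XPath expressions and pull out their text
print(sel.xpath("//div[@class='name']/text()").extract_first())        # Some Restaurant
print(sel.xpath("//div[@class='address']//p/text()").extract_first())  # Address: 1 Example Road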
2. Page Data Analysis and Data Table Creation
I take a restaurant in the Chaoyang Joy City mall as an example for data collection. Its page is:
https://www.meituan.com/meishi/40453459/
2.1 Data to Capture
The first part of the data we want to capture is the merchant's basic information, including the name, address, telephone, and business hours. Having analyzed many food merchants, we know that the layout of their web pages is basically the same, so our crawler can be written in a fairly generic way. To avoid crawling a merchant's data twice, we also store the merchant's page URL in the data table.
The second part is the data about each restaurant's recommended dishes. Every shop has its own specialties; we save this data in a separate table.
The last part of the data we want to capture is the user reviews. This data is very valuable to us; later we can extract more information about the merchant by analyzing it. The fields we capture are: reviewer nickname, star rating, review content, review time, and, if there are pictures, the picture URLs stored as a list.
2.2 Creating the Data Tables
The database we use to store the data is MySQL; Python has several ORMs, and in this project we use peewee. However, it is recommended to use native SQL to create the data tables, so that we can flexibly control field attributes and set the engine and character encoding. A Python ORM can achieve the same result, but an ORM is an encapsulation of the database layer: databases such as SQLite and SQL Server differ slightly from MySQL, so the ORM can only expose the parts those databases have in common. The following SQL creates the tables used to store the data:
CREATE TABLE `merchant` ( # Business List
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `name` varchar(255) NOT NULL COMMENT 'Business name',
  `address` varchar(255) NOT NULL COMMENT 'Address',
  `website_address` varchar(255) NOT NULL COMMENT 'Website',
  `website_address_hash` varchar(32) NOT NULL COMMENT 'Website hash',
  `mobile` varchar(32) NOT NULL COMMENT 'Telephone',
  `business_hours` varchar(255) NOT NULL COMMENT 'Business Hours',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=10 DEFAULT CHARSET=utf8mb4;

CREATE TABLE `recommended_dish` ( # Recommended dishes
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `merchant_id` int(11) NOT NULL COMMENT 'Business id',
  `name` varchar(255) NOT NULL COMMENT 'Name of recommended dishes',
  PRIMARY KEY (`id`),
  KEY `recommended_dish_merchant_id` (`merchant_id`),
  CONSTRAINT `recommended_dish_ibfk_1` FOREIGN KEY (`merchant_id`) REFERENCES `merchant` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=309 DEFAULT CHARSET=utf8mb4;

CREATE TABLE `evaluate` ( # Reviews
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `merchant_id` int(11) NOT NULL COMMENT 'Business id',
  `user_name` varchar(255) DEFAULT '' COMMENT 'Reviewer nickname',
  `evaluate_time` datetime NOT NULL COMMENT 'Review time',
  `content` varchar(10000) DEFAULT '' COMMENT 'Review content',
  `star` tinyint(4) DEFAULT '0' COMMENT 'Star rating',
  `image_list` varchar(1000) DEFAULT '' COMMENT 'Picture list',
  PRIMARY KEY (`id`),
  KEY `evaluate_merchant_id` (`merchant_id`),
  CONSTRAINT `evaluate_ibfk_1` FOREIGN KEY (`merchant_id`) REFERENCES `merchant` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=8427 DEFAULT CHARSET=utf8mb4;
Correspondingly, we can also use the Python ORM to create and manage the data tables. When we walk through the code, we will touch on some common peewee operations against a MySQL database, such as querying data, inserting a row and getting its id back, and batch inserts. Readers can look up the relevant documentation for a systematic study.
meituan_spider/models.py code:
from peewee import *

# Connect to the database
db = MySQLDatabase("meituan_spider", host="127.0.0.1", port=3306, user="root", password="root", charset="utf8")


class BaseModel(Model):
    class Meta:
        database = db


# Merchant table, used to store merchant information
class Merchant(BaseModel):
    id = AutoField(primary_key=True, verbose_name="business id")
    name = CharField(max_length=255, verbose_name="Business name")
    address = CharField(max_length=255, verbose_name="Business Address")
    website_address = CharField(max_length=255, verbose_name="network address")
    website_address_hash = CharField(max_length=32, verbose_name="Network Address md5 Value, for fast indexing")
    mobile = CharField(max_length=32, verbose_name="Business telephone")
    business_hours = CharField(max_length=255, verbose_name="Business Hours")


# Recommended-dish table, stores the merchant's recommended dishes
class Recommended_dish(BaseModel):
    merchant_id = ForeignKeyField(Merchant, verbose_name="Business Foreign Key")
    name = CharField(max_length=255, verbose_name="Name of recommended dishes")


# User evaluation table, which stores users' review information
class Evaluate(BaseModel):
    id = CharField(primary_key=True)
    merchant_id = ForeignKeyField(Merchant, verbose_name="Business Foreign Key")
    user_name = CharField(verbose_name="User name")
    evaluate_time = DateTimeField(verbose_name="Evaluation time")
    content = TextField(default="", verbose_name="Comments")
    star = IntegerField(default=0, verbose_name="score")
    image_list = TextField(default="", verbose_name="picture")


if __name__ == "__main__":
    db.create_tables([Merchant, Recommended_dish, Evaluate])
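As a quick preview of the peewee operations mentioned above, here is a minimal sketch with placeholder values (the real calls appear later in the crawler code; it assumes the db connection and tables from models.py already exist):

# Insert a row and get its primary key back
merchant_id = Merchant.insert(name="Some shop", address="Somewhere",
                              website_address="https://www.example.com/",
                              website_address_hash="0" * 32,
                              mobile="000000", business_hours="10:00-22:00").execute()

# Query with a where clause
existed = Merchant.select().where(Merchant.website_address_hash == "0" * 32)

# Batch insert
Recommended_dish.insert_many([
    {"merchant_id": merchant_id, "name": "dish A"},
    {"merchant_id": merchant_id, "name": "dish B"},
]).execute()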
3. Code Implementation and Explanation
The code is relatively simple, but to run it you need to install the toolkits mentioned earlier: selenium, scrapy, and peewee, all of which can be installed with pip. In addition, Selenium needs a browser driver to control the browser; since I use Chrome locally, I downloaded the matching version of ChromeDriver, which is used later. Readers should consult the setup guides for using Selenium from Python and build the environment themselves. Next, the code is analyzed in detail; the source is as follows:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import NoSuchElementException
from scrapy import Selector
from models import *
import hashlib
import os
import re
import time
import json

chrome_options = Options()
# Headless mode: no browser window is shown, which speeds the program up
# chrome_options.add_argument("--headless")
# Disable the GPU to avoid rendering pictures
chrome_options.add_argument('disable-gpu')
# Do not load images
chrome_options.add_argument('blink-settings=imagesEnabled=false')


# Star ratings are derived from the pixel width shown on the page
def star_num(num):
    numbers = {
        "16.8": 1,
        "33.6": 2,
        "50.4": 3,
        "67.2": 4,
        "84": 5
    }
    return numbers.get(num, 0)


# Parse a merchant's page
def parse(merchant_id):
    weblink = "https://www.meituan.com/meishi/{}/".format(merchant_id)
    # Start selenium
    browser = webdriver.Chrome(executable_path="/Users/guozhaoran/python/tools/chromedriver", options=chrome_options)
    browser.get(weblink)

    # Do not crawl the same merchant twice
    hash_weblink = hashlib.md5(weblink.encode(encoding='utf-8')).hexdigest()
    existed = Merchant.select().where(Merchant.website_address_hash == hash_weblink)
    if (existed):
        print("Data already crawled")
        os._exit(0)

    time.sleep(2)
    # print(browser.page_source)  # the rendered content of the web page
    sel = Selector(text=browser.page_source)

    # Extract the merchant's basic information
    # Merchant name
    name = "".join(sel.xpath("//div[@id='app']//div[@class='d-left']//div[@class='name']/text()").extract()).strip()
    detail = sel.xpath("//div[@id='app']//div[@class='d-left']//div[@class='address']//p/text()").extract()
    address = "".join(detail[1].strip())
    mobile = "".join(detail[3].strip())
    business_hours = "".join(detail[5].strip())

    # Save the merchant information
    merchant_id = Merchant.insert(name=name, address=address, website_address=weblink,
                                  website_address_hash=hash_weblink, mobile=mobile,
                                  business_hours=business_hours).execute()

    # Get the recommended dishes
    recommended_dish_list = sel.xpath(
        "//div[@id='app']//div[@class='recommend']//div[@class='list clear']//span/text()").extract()

    # Traverse the extracted data and insert it into the database in one batch
    dish_data = [{
        'merchant_id': merchant_id,
        'name': i
    } for i in recommended_dish_list]
    Recommended_dish.insert_many(dish_data).execute()

    # Alternatively, traverse the list and insert the rows one by one
    # for dish in recommended_dish_list:
    #     Recommended_dish.create(merchant_id=merchant_id, name=dish)

    # Find out how many pages of reviews this merchant has
    page_num = 0
    try:
        page_num = sel.xpath(
            "//div[@id='app']//div[@class='mt-pagination']//ul[@class='pagination clear']//li[last()-1]//span/text()").extract_first()
        page_num = int("".join(page_num).strip())
        # page_num = int(page_num)
    except (NoSuchElementException, TypeError):
        # extract_first() returns None when there is no pagination, which makes the join raise TypeError
        print("This merchant has no user reviews")
        os._exit(0)

    # When there is review data, read the reviews page by page
    if (page_num):
        i = 1
        number_pattern = re.compile(r"\d+\.?\d*")
        chinese_pattern = re.compile(u"[\u4e00-\u9fa5]+")
        illegal_str = re.compile(u'[^0-9a-zA-Z\u4e00-\u9fa5.,,. ?""]+', re.UNICODE)
        while (i <= page_num):
            # Get the review elements on the current page
            all_evalutes = sel.xpath(
                "//div[@id='app']//div[@class='comment']//div[@class='com-cont']//div[2]//div[@class='list clear']")
            for item in all_evalutes:
                # Reviewer nickname
                user_name = item.xpath(".//div[@class='info']//div[@class='name']/text()").extract()[0]
                # Star rating
                star = item.xpath(
                    ".//div[@class='info']//div[@class='source']//div[@class='star-cont']//ul[@class='stars-ul stars-light']/@style").extract_first()
                starContent = "".join(star).strip()
                starPx = number_pattern.search(starContent).group()
                starNum = star_num(starPx)
                # Review time
                comment_time = "".join(
                    item.xpath(".//div[@class='info']//div[@class='date']//span/text()").extract_first()).strip()
                evaluate_time = chinese_pattern.sub('-', comment_time, 3)[:-1] + ' 00:00:00'
                # Review content
                comment_content = "".join(
                    item.xpath(".//div[@class='info']//div[@class='desc']/text()").extract_first()).strip()
                comment_filter_content = illegal_str.sub("", comment_content)
                # If there are pictures, collect their URLs
                image_container = item.xpath(
                    ".//div[@class='noShowBigImg']//div[@class='imgs-content']//div[contains(@class, 'thumbnail')]//img/@src").extract()
                image_list = json.dumps(image_container)
                Evaluate.insert(merchant_id=merchant_id, user_name=user_name, evaluate_time=evaluate_time,
                                content=comment_filter_content, star=starNum, image_list=image_list).execute()
            i = i + 1
            if (i <= page_num):  # more pages remain, so turn to the next one
                next_page_ele = browser.find_element_by_xpath(
                    "//div[@id='app']//div[@class='mt-pagination']//span[@class='iconfont icon-btn_right']")
                next_page_ele.click()
                time.sleep(10)
                sel = Selector(text=browser.page_source)


if __name__ == "__main__":
    parse("5451106")
3.1 Starting the WebDriver and Setting Optimization Parameters
To make the crawler more general, our parse function receives the merchant id as a parameter and uses it to fetch that merchant's page. Selenium drives the web browser through a WebDriver:
weblink = "https://www.meituan.com/meishi/{}/".format(merchant_id)
# Start selenium
browser = webdriver.Chrome(executable_path="/Users/guozhaoran/python/tools/chromedriver", options=chrome_options)
browser.get(weblink)
executable_path is the path to the ChromeDriver binary we downloaded earlier. In addition, Selenium lets us set some options before launching the browser:
chrome_options = Options()
# Headless mode: no browser window is shown, which speeds the program up
# chrome_options.add_argument("--headless")
# Disable the GPU to avoid rendering pictures
chrome_options.add_argument('disable-gpu')
# Do not load images
chrome_options.add_argument('blink-settings=imagesEnabled=false')
Setting --headless makes Chrome run without a front-end interface, a bit like a daemon process; we can leave this parameter unset while debugging so that we can watch in the browser what the program does to the pages. We can also speed up page rendering with disable-gpu and blink-settings=imagesEnabled=false, which keeps the browser from loading images while it parses pages; after all, the only image data we store is the path. One drawback of Selenium as a crawler is its low efficiency and slow crawling speed, but these optimization parameters improve the crawling speed considerably.
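A side note in case you are on a newer Selenium 4.x release: there the executable_path argument has been replaced by a Service object. A minimal sketch, reusing the same local driver path as above:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

chrome_options = Options()
chrome_options.add_argument('disable-gpu')
chrome_options.add_argument('blink-settings=imagesEnabled=false')

service = Service("/Users/guozhaoran/python/tools/chromedriver")  # same driver path as in this article
browser = webdriver.Chrome(service=service, options=chrome_options)

On those releases, the find_element_by_xpath helper used later is likewise replaced by find_element(By.XPATH, ...).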
3.2 Extracting the Merchant's Basic Information
As mentioned earlier, to avoid crawling the same merchant twice, we check a hash of the merchant's URL before crawling:
# Do not crawl the same merchant twice
hash_weblink = hashlib.md5(weblink.encode(encoding='utf-8')).hexdigest()
existed = Merchant.select().where(Merchant.website_address_hash == hash_weblink)
if (existed):
    print("Data already crawled")
    os._exit(0)
If the merchant's data has not been crawled yet, we grab the rendered page and parse it:
time.sleep(2)
# print(browser.page_source)  # the rendered content of the web page
sel = Selector(text=browser.page_source)
We sleep for two seconds because the browser needs a moment to finish rendering the page; this is usually very fast, and the pause simply makes the program more stable. Then a Selector is constructed to parse the page data:
# Extract the merchant's basic information
# Merchant name
name = "".join(sel.xpath("//div[@id='app']//div[@class='d-left']//div[@class='name']/text()").extract()).strip()
detail = sel.xpath("//div[@id='app']//div[@class='d-left']//div[@class='address']//p/text()").extract()
address = "".join(detail[1].strip())
mobile = "".join(detail[3].strip())
business_hours = "".join(detail[5].strip())

# Save the merchant information
merchant_id = Merchant.insert(name=name, address=address, website_address=weblink,
                              website_address_hash=hash_weblink, mobile=mobile,
                              business_hours=business_hours).execute()
To parse the basic merchant information, we locate the relevant elements with XPath expressions and extract their text; joining with an empty string guarantees that we end up with a plain string rather than a list. Finally, the parsed data is inserted into the merchant table. peewee's insert method returns the primary key id of the new row, which will be used later when the related data is written to the database.
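One more note on the time.sleep(2) used above: a more robust alternative is Selenium's explicit wait, which blocks until a given element actually appears. A minimal sketch, assuming we wait for the merchant-name element (class 'name') targeted by the XPath above:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the merchant name element to be present, then parse the page
WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, "name"))
)
sel = Selector(text=browser.page_source)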
3.3 Extracting the Merchant's Recommended Dishes
The logic for extracting the recommended dishes is relatively simple, and the extraction returns a list, a data type Python handles very conveniently. There are, however, different options for storing the data: it can be inserted in one batch or by looping over the list. Here we use a batch insert, which is more efficient.
# Get the recommended dishes
recommended_dish_list = sel.xpath(
    "//div[@id='app']//div[@class='recommend']//div[@class='list clear']//span/text()").extract()

# Traverse the extracted data and insert it into the database in one batch
dish_data = [{
    'merchant_id': merchant_id,
    'name': i
} for i in recommended_dish_list]
Recommended_dish.insert_many(dish_data).execute()

# Alternatively, traverse the list and insert the rows one by one
# for dish in recommended_dish_list:
#     Recommended_dish.create(merchant_id=merchant_id, name=dish)
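If the list were much larger, the batch insert could also be chunked and wrapped in a transaction; a minimal sketch, assuming peewee's chunked helper and the db connection from models.py:

from peewee import chunked

# Insert at most 100 rows per statement, all inside one transaction
with db.atomic():
    for batch in chunked(dish_data, 100):
        Recommended_dish.insert_many(batch).execute()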
3.4 Extracting User Reviews Page by Page
Extracting the user reviews is the hardest part of the capture. The basic idea is to first find out how many pages of reviews there are, and then parse them page by page, turning pages by having Selenium simulate the browser's click events. Before inserting the rows we also have to clean the text, because many emoji characters in the reviews do not fit the character encoding chosen for the table fields. In addition, after clicking 'next page' the program must sleep for a while, so that the site has finished updating before we grab the new page's data. Let's first look at how to get the total number of review pages. The pagination section of the page looks like this:
Here we focus on two elements: the 'next page' button and the number of the last page, which is the information we want. Some merchants may have no reviews at all, in which case these elements are missing from the page, so the program still needs to handle that case:
# Find out how many pages of reviews this merchant has
page_num = 0
try:
    page_num = sel.xpath(
        "//div[@id='app']//div[@class='mt-pagination']//ul[@class='pagination clear']//li[last()-1]//span/text()").extract_first()
    page_num = int("".join(page_num).strip())
    # page_num = int(page_num)
except (NoSuchElementException, TypeError):
    # extract_first() returns None when there is no pagination, which makes the join raise TypeError
    print("This merchant has no user reviews")
    os._exit(0)
The next step is to extract the review data, just as we did for the recommended dishes, but the process is more involved. The basic idea is as follows:
if (page_num):
    i = 1
    ...
    while (i <= page_num):
        ...
        i = i + 1
        if (i <= page_num):  # more pages remain, so turn to the next one
            next_page_ele = browser.find_element_by_xpath(
                "//div[@id='app']//div[@class='mt-pagination']//span[@class='iconfont icon-btn_right']")
            next_page_ele.click()
            time.sleep(10)
            sel = Selector(text=browser.page_source)
After each page, we check whether there are still pages left to parse; if so, we simulate a click on 'next page' to load the next one, and the sleep gives the browser time to render the new page before we parse it. A few key points of the detailed parsing deserve mention:
- The user's star rating is not available directly; instead we read the CSS width of the star element and convert it with a function:
# Star ratings are derived from the pixel width shown on the page
def star_num(num):
    numbers = {
        "16.8": 1,
        "33.6": 2,
        "50.4": 3,
        "67.2": 4,
        "84": 5
    }
    return numbers.get(num, 0)

...

# Star rating
star = item.xpath(
    ".//div[@class='info']//div[@class='source']//div[@class='star-cont']//ul[@class='stars-ul stars-light']/@style").extract_first()
starContent = "".join(star).strip()
starPx = number_pattern.search(starContent).group()
starNum = star_num(starPx)
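Since each star appears to correspond to roughly 16.8 pixels, the lookup table could also be replaced by a small calculation. This is only a hypothetical alternative sketch, not part of the article's code:

def star_num_calc(px):
    # Convert the star bar's pixel width into a 0-5 rating, assuming ~16.8 px per star
    try:
        return min(5, round(float(px) / 16.8))
    except (TypeError, ValueError):
        return 0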
- User reviews may contain illegal characters, which the program filters out with regular expressions. Regular expressions in Python generally use the re module, and pre-compiling the patterns improves performance, so the compilation is placed outside the while loop:
number_pattern = re.compile(r"\d+\.?\d*")
chinese_pattern = re.compile(u"[\u4e00-\u9fa5]+")
illegal_str = re.compile(u'[^0-9a-zA-Z\u4e00-\u9fa5.,,. ?""]+', re.UNICODE)

while (i <= page_num):
    ...
    comment_content = "".join(
        item.xpath(".//div[@class='info']//div[@class='desc']/text()").extract_first()).strip()
    comment_filter_content = illegal_str.sub("", comment_content)
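To make the intent of these patterns concrete, here is a small illustration on made-up inputs (the sample style string and date format are assumptions for the example, not copied from the page):

# The pixel width is pulled out of an inline style string
print(number_pattern.search("width: 67.2px;").group())                # 67.2

# A Chinese-style date is turned into a MySQL-friendly datetime string
comment_time = "2019年10月01日"
print(chinese_pattern.sub('-', comment_time, 3)[:-1] + ' 00:00:00')   # 2019-10-01 00:00:00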
- A review may contain more than one picture. We only keep the picture URLs and store them, JSON-encoded, in a single table field:
image_container = item.xpath(
    ".//div[@class='noShowBigImg']//div[@class='imgs-content']//div[contains(@class, 'thumbnail')]//img/@src").extract()
image_list = json.dumps(image_container)
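When the reviews are analyzed later, the stored field can be decoded back into a Python list; a minimal usage sketch with peewee, assuming at least one review row has been stored:

# Read one stored review back and decode its picture list
evaluate = Evaluate.select().first()
image_urls = json.loads(evaluate.image_list or "[]")  # guard against empty fields
for url in image_urls:
    print(url)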
4. Summary and Reflections
Below is a screenshot of the data being captured while the program runs.
The idea of the program is quite concise. In real enterprise applications, more weight would be given to the crawler's efficiency and stability: programs on Linux servers generally log their errors, Selenium is only run headless, and this program does not use many advanced features. In fact, a crawler architecture covers many more topics, such as multi-threaded crawling and dealing with CAPTCHA-style anti-crawling (in this example, the first visit to the Meituan page also required a verification step, which I handled manually once), and so on; the purpose here is simply to spark some ideas. Still, the text-processing and data-extraction techniques used in this program come up in crawlers all the time, and I am glad to share them with you here.