What is AJAX:
AJAX (Asynchronous JavaScript and XML) is a technique for exchanging data with a server in the background and updating parts of a web page without reloading the whole page.
How to get ajax data:
Analyze the interface called by ajax directly, then request that interface in code.
Use Selenium+chromedriver to simulate browser behavior and obtain data.
Advantages and disadvantages of the two methods:
Analyzing the interface: you can request the data directly, with no parsing needed, so there is less code and higher performance. However, analyzing the interface can be complex, especially for interfaces obfuscated through js, so you need some js knowledge, and this approach is more easily detected as a crawler.
Selenium: it directly simulates the behavior of the browser, so whatever the browser can request, selenium can request too, which makes the crawler more stable. The downsides are more code and lower performance.
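As a minimal sketch of the first method, the snippet below requests a JSON interface directly with the standard library. The endpoint (httpbin.org, which simply echoes information about the request) and the header set are illustrative assumptions; on a real site you would copy the interface URL from the XHR tab of the browser's developer tools.

```python
import json
import urllib.request

# Headers a browser commonly sends when issuing ajax requests; the
# X-Requested-With header in particular marks a request as XMLHttpRequest.
AJAX_HEADERS = {
    "User-Agent": "Mozilla/5.0",
    "X-Requested-With": "XMLHttpRequest",
}

def fetch_ajax_json(url, headers=AJAX_HEADERS):
    """Request an ajax interface directly and parse its JSON response."""
    req = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

if __name__ == "__main__":
    # httpbin.org/ip is a placeholder endpoint that returns JSON.
    print(fetch_ajax_json("http://httpbin.org/ip"))
```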
Selenium+chromedriver to obtain dynamic data:
Selenium is like a robot: it can simulate human actions in the browser and automate browser behaviors such as clicking, filling in data, and deleting cookies. chromedriver is the driver for the Chrome browser; only with it can you drive the browser. Of course, different browsers have different drivers. The following lists the different browsers and their corresponding drivers:
Chrome: chromedriver
Firefox: geckodriver
Edge: msedgedriver
Safari: safaridriver
To install Selenium and chromedriver:
Install Selenium: Selenium is available in many languages, including java, ruby, python, etc. We can install the python version:
pip install selenium
Install chromedriver: after downloading, put it in a directory whose path contains no non-English characters and does not require special permissions.
Now let's take a simple example of getting Baidu's home page to show how to get started quickly with Selenium and chromedriver:
from selenium import webdriver
# Absolute path of chromedriver
driver_path = r'D:\ProgramApp\chromedriver\chromedriver.exe'
# Initialize a driver and specify the path of the chromedriver
driver = webdriver.Chrome(executable_path=driver_path)
# Request the web page
driver.get('https://www.baidu.com/')
# Get the web page source code through page_source
print(driver.page_source)
Common operations of selenium:
For more tutorials, please refer to: http://selenium-python.readthedocs.io/installation.html#introduction
driver.close(): close the current page.
driver.quit(): exit the entire browser
find_element_by_id: find an element by id. Equivalent to:
submitTag = driver.find_element_by_id('su')
submitTag1 = driver.find_element(By.ID, 'su')
find_element_by_class_name: find an element by class name. Equivalent to:
submitTag = driver.find_element_by_class_name('su')
submitTag1 = driver.find_element(By.CLASS_NAME, 'su')
find_element_by_name: find an element by the value of its name attribute. Equivalent to:
submitTag = driver.find_element_by_name('email')
submitTag1 = driver.find_element(By.NAME, 'email')
find_element_by_tag_name: find an element by tag name. Equivalent to:
submitTag = driver.find_element_by_tag_name('div')
submitTag1 = driver.find_element(By.TAG_NAME, 'div')
find_element_by_xpath: find an element with xpath syntax. Equivalent to:
submitTag = driver.find_element_by_xpath('//div')
submitTag1 = driver.find_element(By.XPATH, '//div')
find_element_by_css_selector: find an element with a CSS selector. Equivalent to:
submitTag = driver.find_element_by_css_selector('div')
submitTag1 = driver.find_element(By.CSS_SELECTOR, 'div')
Note that the By forms require: from selenium.webdriver.common.by import By
Note that find_element_* returns the first element that meets the condition, while find_elements_* returns a list of all elements that meet the condition.
Operating form elements:
Operating an input box takes two steps. Step 1: find the element. Step 2: use send_keys(value) to fill in the data. The example code is as follows:
inputTag = driver.find_element_by_id('kw')
inputTag.send_keys('python')
Use the clear method to clear the contents of the input box. The example code is as follows:
inputTag.clear()
Operating a checkbox: a checkbox is selected by clicking it in the web page. Therefore, to select a checkbox tag, first find the tag and then execute a click event. The example code is as follows:
rememberTag = driver.find_element_by_name("rememberMe")
rememberTag.click()
Operating a select: a select element cannot simply be clicked, because an option needs to be chosen after clicking. For the select tag, selenium provides the class selenium.webdriver.support.ui.Select. Pass the obtained element as a parameter to this class to create an object, and then use that object to make a selection. The example code is as follows:
from selenium.webdriver.support.ui import Select
# Find the select tag and use it to create a Select object
selectTag = Select(driver.find_element_by_name("jumpMenu"))
# Select by index
selectTag.select_by_index(1)
# Select by value
selectTag.select_by_value("value")
# Select by visible text
selectTag.select_by_visible_text("95 show client")
# Deselect all options
selectTag.deselect_all()
Operating buttons: there are many ways to operate a button, such as click, right-click, double-click, etc. Here is the most commonly used one: a simple click, which just calls the click function directly. The example code is as follows:
inputTag = driver.find_element_by_id('su')
inputTag.click()
Sometimes an operation in the page may take many steps. In that case you can use the mouse behavior chain class ActionChains to complete it. For example, suppose you want to move the mouse over an element and execute a click event. The example code is as follows:
from selenium.webdriver.common.action_chains import ActionChains
inputTag = driver.find_element_by_id('kw')
submitTag = driver.find_element_by_id('su')
actions = ActionChains(driver)
actions.move_to_element(inputTag)
actions.send_keys_to_element(inputTag, 'python')
actions.move_to_element(submitTag)
actions.click(submitTag)
actions.perform()
There are more mouse related operations.
click_and_hold(element): click but do not release the mouse.
context_click(element): right click.
double_click(element): double click.
For more methods, please refer to: http://selenium-python.readthedocs.io/api.html
Get all cookies:
for cookie in driver.get_cookies():
    print(cookie)
Get a cookie according to its name:
value = driver.get_cookie(key)
(note that get_cookie returns the whole cookie as a dict, not just its value)
Delete all cookies:
driver.delete_all_cookies()
Delete a cookie:
driver.delete_cookie(key)
Nowadays more and more web pages adopt Ajax, so the program cannot determine when an element has fully loaded. If the actual page takes longer than expected and a DOM element has not appeared yet, but your code uses that WebElement directly, an exception (such as NoSuchElementException) is thrown. To solve this problem, Selenium provides two waiting methods: implicit wait and explicit wait.
Implicit wait: call driver.implicitly_wait(seconds). The driver will then wait up to the given time (here 10 seconds) before concluding that an element is unavailable. The example code is as follows:
driver = webdriver.Chrome(executable_path=driver_path)
driver.implicitly_wait(10)
# Request the web page
driver.get('https://www.baidu.com/')
Explicit wait: an explicit wait performs the element-fetching operation only after a condition is established. You can also specify a maximum waiting time; if it is exceeded, an exception is thrown. Explicit waits use selenium.webdriver.support.expected_conditions (conventionally imported as EC) and selenium.webdriver.support.ui.WebDriverWait. The example code is as follows:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Firefox()
driver.get("http://somedomain/url_that_delays_loading")
try:
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "myDynamicElement"))
    )
finally:
    driver.quit()
Some other waiting conditions:
presence_of_element_located: an element has been loaded.
presence_of_all_elements_located: all qualifying elements in the web page have been loaded.
element_to_be_clickable: an element can be clicked.
For more conditions, please refer to: http://selenium-python.readthedocs.io/waits.html
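To make these waiting conditions concrete, here is a plain-Python sketch of the polling loop that WebDriverWait(driver, timeout).until(condition) performs internally. This is a simplification under stated assumptions: the real class also accepts a poll_frequency argument and a list of exceptions to ignore, and wait_until is a hypothetical name, not a selenium API.

```python
import time

def wait_until(condition, timeout=10, poll_frequency=0.5):
    """Poll condition() until it returns a truthy value or timeout expires.

    Returns the truthy value, mirroring WebDriverWait.until; raises
    TimeoutError if the condition never becomes truthy in time.
    """
    end = time.time() + timeout
    while True:
        value = condition()
        if value:
            return value
        if time.time() >= end:
            raise TimeoutError("condition not met within %s seconds" % timeout)
        time.sleep(poll_frequency)

if __name__ == "__main__":
    calls = []
    # A condition that only becomes truthy on its third evaluation,
    # standing in for "the element has finally appeared in the DOM".
    result = wait_until(
        lambda: "found" if len(calls) >= 2 else calls.append(None),
        timeout=5, poll_frequency=0.01,
    )
    print(result)
```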
Sometimes there are many child tab pages in the window, and you need to switch between them. selenium provides switch_to.window for switching; the handle of the page to switch to can be found in driver.window_handles. The example code is as follows:
# Open a new page
driver.execute_script("window.open('https://www.baidu.com/')")
# Switch to this new page
driver.switch_to.window(driver.window_handles[1])
Set proxy ip:
Sometimes when you frequently crawl certain web pages, the server will block your ip address once it detects that you are a crawler. In that case we can switch to a proxy ip. Different browsers implement proxy switching differently; here we take the Chrome browser as an example:
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument("--proxy-server=http://126.96.36.199:8123")
driver_path = r"D:\ProgramApp\chromedriver\chromedriver.exe"
driver = webdriver.Chrome(executable_path=driver_path, chrome_options=options)
driver.get('http://httpbin.org/ip')
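For comparison with the Chrome option above, the same proxy idea expressed with only the standard library looks roughly like this. The proxy address is the placeholder from the snippet above and is not expected to actually work; substitute a live proxy before use.

```python
import urllib.request

# Hypothetical proxy address; replace with a working proxy.
PROXY = "http://126.96.36.199:8123"

def make_proxied_opener(proxy=PROXY):
    """Build an opener that routes http/https traffic through the proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return urllib.request.build_opener(handler)

# Usage (requires a working proxy and network access):
# opener = make_proxied_opener()
# print(opener.open('http://httpbin.org/ip').read())
```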
Every obtained element is an instance of the class selenium.webdriver.remote.webelement.WebElement.
There are some common attributes:
get_attribute: get the value of an attribute of this tag, e.g. inputTag.get_attribute('value').
screenshot: get a screenshot of the current page, e.g. driver.save_screenshot('page.png'). This method can only be used on the driver.
The object class of driver also inherits from WebElement.
Author: Zhao0_bili. Source: BiliBili, https://www.bilibili.com/read/cv15524875