Dynamic web page data capture

Posted by pseudonym on Sat, 05 Mar 2022 06:13:17 +0100

What is AJAX:

Ajax stands for Asynchronous JavaScript and XML. It lets a web page update asynchronously by exchanging small amounts of data with the server in the background, so part of a page can be refreshed without reloading the whole page. A traditional page (without Ajax) must be reloaded in full whenever its content changes. The name comes from the fact that the data was originally transferred as XML; today the data exchanged is almost always JSON. Data loaded through Ajax is rendered into the page by JavaScript, so it does not appear under right click -> View page source, which shows only the HTML returned for the page's URL.

How to get ajax data:

Directly analyze the interface called by ajax. Then request this interface through code.

Use Selenium+chromedriver to simulate browser behavior and obtain data.
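The first approach can be sketched with the standard library alone. This is a minimal illustration, not a recipe for any particular site: the function names, the User-Agent value, and the "items" key are assumptions made for the example, and the real interface URL and response shape have to be read from the browser's developer tools.

```python
import json
import urllib.request


def build_request(url, user_agent="Mozilla/5.0"):
    """Build a GET request that resembles a browser's Ajax call."""
    headers = {
        "User-Agent": user_agent,
        # Many sites send this header with Ajax requests; it is optional.
        "X-Requested-With": "XMLHttpRequest",
    }
    return urllib.request.Request(url, headers=headers)


def parse_items(body, key="items"):
    """Decode a JSON response body and return the list stored under `key`."""
    data = json.loads(body)
    return data.get(key, [])


def fetch_items(url, key="items", timeout=10):
    """Request the interface and return the decoded records."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return parse_items(resp.read().decode("utf-8"), key)
```

Calling fetch_items with the interface URL would then return the records directly, provided the interface answers with a JSON object of that shape.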

Advantages and disadvantages of the methods:

Analyzing the interface: you request the data directly, so no HTML parsing is needed. This means less code and high performance. On the other hand, analyzing the interface can be complex, especially when it is obfuscated with JavaScript, so you need some JavaScript background, and the requests are easily detected as coming from a crawler.

Selenium: it directly simulates the behavior of the browser, so whatever the browser can request, Selenium can request too, and the crawler is more stable. The drawbacks are more code and lower performance.

Selenium+chromedriver to obtain dynamic data:

Selenium acts like a robot: it simulates human actions in the browser and automates operations such as clicking, filling in data, and deleting cookies. chromedriver is the driver for the Chrome browser; Selenium can only control the browser through it. Each browser has its own driver. The browsers and their corresponding drivers are listed below:

Chrome: https://sites.google.com/a/chromium.org/chromedriver/downloads

Firefox: https://github.com/mozilla/geckodriver/releases

Edge: https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/

Safari: https://webkit.org/blog/6900/webdriver-support-in-safari-10/


To install Selenium and chromedriver:

Install Selenium: Selenium has bindings for many languages, including Java, Ruby, and Python. Here we install the Python version:

pip install selenium

Install chromedriver: after downloading, put it in a directory whose path contains only English (ASCII) characters and that does not require special permissions.

Quick start:

Here is a simple example that fetches Baidu's home page, to show how to get started quickly with Selenium and chromedriver:

from selenium import webdriver

# Absolute path of chromedriver
driver_path = r'D:\ProgramApp\chromedriver\chromedriver.exe'

# Initialize a driver and specify the path of the chromedriver
driver = webdriver.Chrome(executable_path=driver_path)

# Request web page
driver.get("https://www.baidu.com/")

# Get the web page source code through page_source
print(driver.page_source)

Common operations of selenium:

For more tutorials, please refer to: http://selenium-python.readthedocs.io/installation.html#introduction

Close page:

driver.close(): close the current page.

driver.quit(): exit the entire browser
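Because a script that fails halfway would otherwise leave the browser running, it can help to pair every session with quit(). The helper below is a minimal sketch of that idea; the name fetch_page_source is made up for this example, and it only relies on the get, page_source, and quit members shown above.

```python
def fetch_page_source(driver, url):
    """Load `url`, return the rendered source, and always shut the browser down."""
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # quit() closes every window and ends the driver session,
        # even if get() raised an exception.
        driver.quit()
```

It would be called with a freshly created driver, for example fetch_page_source(webdriver.Chrome(executable_path=driver_path), "https://www.baidu.com/").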

Locate element:

find_element_by_id: find an element by id. Equivalent to:

from selenium.webdriver.common.by import By

submitTag = driver.find_element_by_id('su')
submitTag1 = driver.find_element(By.ID, 'su')

find_element_by_class_name: find elements by class name. Equivalent to:

submitTag = driver.find_element_by_class_name('su')
submitTag1 = driver.find_element(By.CLASS_NAME,'su')

find_element_by_name: find the element according to the value of the name attribute. Equivalent to:

submitTag = driver.find_element_by_name('email')
submitTag1 = driver.find_element(By.NAME,'email')

find_element_by_tag_name: find elements by tag name. Equivalent to:

submitTag = driver.find_element_by_tag_name('div')
submitTag1 = driver.find_element(By.TAG_NAME,'div')

find_element_by_xpath: get elements according to xpath syntax. Equivalent to:

submitTag = driver.find_element_by_xpath('//div')
submitTag1 = driver.find_element(By.XPATH,'//div')

find_element_by_css_selector: select elements according to CSS selector. Equivalent to:

submitTag = driver.find_element_by_css_selector('div')
submitTag1 = driver.find_element(By.CSS_SELECTOR, 'div')

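Note that find_element returns the first element that matches the condition, while find_elements returns all matching elements. The find_element_by_* helpers were deprecated in Selenium 4 and removed in later 4.x releases, so the find_element(By.XXX, ...) form is the safer choice in new code.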
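One practical difference between the two is how they behave when nothing matches: find_element raises NoSuchElementException, while find_elements returns an empty list. The sketch below uses that to look up an optional element without a try/except. The helper name first_or_none is hypothetical and not part of Selenium.

```python
def first_or_none(driver, by, value):
    """Return the first matching element, or None when nothing matches."""
    # find_elements never raises for "no match"; it returns an empty list.
    matches = driver.find_elements(by, value)
    return matches[0] if matches else None
```

With a real driver this would be called as first_or_none(driver, By.ID, 'su').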

Operating form elements:

Operating an input box takes two steps. Step 1: find the element. Step 2: fill in the data with send_keys(value). The example code is as follows:

inputTag = driver.find_element_by_id('kw')
inputTag.send_keys('python')

Use the clear method to clear the contents of the input box. The example code is as follows:

inputTag.clear()
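The two calls are often combined, because send_keys appends to whatever the box already contains. A minimal sketch, with replace_text as a made-up helper name:

```python
def replace_text(input_element, text):
    """Empty an input box, then type `text` into it."""
    input_element.clear()          # remove any existing content first
    input_element.send_keys(text)  # send_keys appends, so clear() comes first
```

For example, replace_text(inputTag, 'python') leaves exactly 'python' in the box regardless of what was typed before.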

Operating a checkbox: a checkbox is selected by clicking it in the page, so to select a checkbox, locate the element first and then call click on it. The example code is as follows:

rememberTag = driver.find_element_by_name("rememberMe")
rememberTag.click()

Operating a select: a select element cannot simply be clicked, because an option still has to be chosen after the click. For this, Selenium provides the class selenium.webdriver.support.ui.Select. Pass the located element to this class to create a Select object, then use that object to choose options. The example code is as follows:

from selenium.webdriver.support.ui import Select

# Locate the tag and wrap it in a Select object
selectTag = Select(driver.find_element_by_name("jumpMenu"))

# Select by index
selectTag.select_by_index(1)

# Select by value
selectTag.select_by_value("http://www.95yueba.com")

# Select by visible text
selectTag.select_by_visible_text("95 Show client")

# Deselect all options (only valid for multi-select elements)
selectTag.deselect_all()

Operating buttons: a button can be operated in many ways, such as click, right-click, and double-click. The most common is a plain click, done by calling click directly. The example code is as follows:

inputTag = driver.find_element_by_id('su')
inputTag.click()

Behavior chain:

Sometimes an operation on the page takes several steps. In that case you can use the mouse behavior chain class ActionChains. For example, to move the mouse over an element and perform a click, the example code is as follows:

from selenium.webdriver.common.action_chains import ActionChains

inputTag = driver.find_element_by_id('kw')
submitTag = driver.find_element_by_id('su')
actions = ActionChains(driver)
actions.move_to_element(inputTag)
actions.send_keys_to_element(inputTag,'python')
actions.move_to_element(submitTag)
actions.click(submitTag)
actions.perform()

There are more mouse related operations.

click_and_hold(element): click but do not release the mouse.

context_click(element): right click.

double_click(element): double click.

For more methods, please refer to: http://selenium-python.readthedocs.io/api.html

Cookie operation:

Get all cookies:

for cookie in driver.get_cookies():
    print(cookie)

Get a cookie by its name (the result is the cookie as a dict, or None if no such cookie exists):

value = driver.get_cookie(key)

Delete all cookies:

driver.delete_all_cookies()

Delete a cookie:

driver.delete_cookie(key)
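Each cookie comes back as a dict with keys such as name and value, so a plain name-to-value mapping is often more convenient to work with. The sketch below builds one; cookies_as_dict and cookie_value are hypothetical helper names, and they rely only on get_cookies and get_cookie.

```python
def cookies_as_dict(driver):
    """Map each cookie name to its value, using driver.get_cookies()."""
    return {cookie["name"]: cookie["value"] for cookie in driver.get_cookies()}


def cookie_value(driver, name, default=None):
    """Return the value of one cookie, or `default` if it does not exist."""
    cookie = driver.get_cookie(name)  # a dict, or None when missing
    return cookie["value"] if cookie else default
```

Such a mapping can, for instance, be handed to an HTTP client so that later requests reuse the browser's session.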

Page waiting:

Nowadays more and more web pages use Ajax, so a program cannot know in advance when an element has finished loading. If the page takes too long to load and a DOM element has not appeared yet when your code tries to use it, Selenium raises an exception (NoSuchElementException in the Python bindings). To solve this problem, Selenium provides two ways of waiting: implicit wait and explicit wait.

Implicit wait: call driver.implicitly_wait. When an element is not yet available, lookups then wait up to the given time (10 seconds here) before failing. The example code is as follows:

driver = webdriver.Chrome(executable_path=driver_path)
driver.implicitly_wait(10)

# Request web page

driver.get("https://www.douban.com/")

Explicit wait: an explicit wait performs the element lookup only once a given condition holds. You can also specify a maximum waiting time; if it is exceeded, an exception (TimeoutException) is raised. Explicit waits use the expected conditions in selenium.webdriver.support.expected_conditions together with selenium.webdriver.support.ui.WebDriverWait. The example code is as follows:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("http://somedomain/url_that_delays_loading")
try:
    # Wait up to 10 seconds for the element to be present in the DOM
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "myDynamicElement"))
    )
finally:
    driver.quit()
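Conceptually, an explicit wait is a polling loop: the condition is checked again and again until it returns something truthy or the time limit runs out. The sketch below shows that idea with the standard library only. It is a simplified illustration, not Selenium's actual implementation, and the names wait_until and WaitTimeout are invented for this example.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition never became truthy in time."""


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition()` repeatedly until it returns a truthy value.

    Returns that value, or raises WaitTimeout after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)
```

In this picture, an expected condition is simply the callable passed as condition, and the element it returns is what the wait hands back.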

Some other waiting conditions:

presence_of_element_located: an element has been loaded.

presence_of_all_elements_located: all matching elements in the web page have been loaded.

element_to_be_clickable: an element can be clicked.

For more conditions, please refer to: http://selenium-python.readthedocs.io/waits.html

Switch pages:

Sometimes a window has several tab pages, and you need to switch between them. Selenium provides driver.switch_to.window for this (older releases also exposed it as switch_to_window). The handle of the page to switch to can be found in driver.window_handles. The example code is as follows:

# Open a new page
self.driver.execute_script("window.open('" + url + "')")

# Switch to this new page
self.driver.switch_to.window(self.driver.window_handles[1])
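Indexing window_handles with [1] assumes the new page is the second handle, which only holds when exactly one page was open before. A slightly more defensive sketch remembers the handles that existed beforehand and switches to whichever one is new. The helper name switch_to_new_window is an assumption of this example.

```python
def switch_to_new_window(driver, known_handles):
    """Switch to the first window handle not in `known_handles`.

    Returns the handle switched to, or None if no new window exists.
    """
    for handle in driver.window_handles:
        if handle not in known_handles:
            driver.switch_to.window(handle)
            return handle
    return None
```

Typical use would be to store before = set(driver.window_handles), open the page with execute_script, and then call switch_to_new_window(driver, before).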

Set proxy ip:

When you crawl a site frequently, the server may detect that you are a crawler and block your IP address. In that case you can switch to a proxy IP. How the proxy is set differs between browsers; here Chrome is used as an example:

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--proxy-server=http://110.73.2.248:8123")
driver_path = r"D:\ProgramApp\chromedriver\chromedriver.exe"
driver = webdriver.Chrome(executable_path=driver_path, options=options)
driver.get('http://httpbin.org/ip')
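The proxy flag must start with two ASCII hyphens, and http://httpbin.org/ip answers with a small JSON object whose origin field is the address the server saw. The sketch below builds the flag and reads that field, so you can check that the proxy is really in use. Both helper names are made up for this example, and how the JSON text is extracted from the rendered page is left open.

```python
import json


def proxy_argument(host, port, scheme="http"):
    """Build the Chrome command-line flag for a proxy server."""
    return f"--proxy-server={scheme}://{host}:{port}"


def origin_ip(body):
    """Extract the caller's address from the JSON returned by httpbin's /ip."""
    return json.loads(body)["origin"]
```

The result of proxy_argument would be passed to options.add_argument. If origin_ip reports the proxy's address rather than your own, the proxy is working.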

WebElement element:

The selenium.webdriver.remote.webelement.WebElement class is the class of every element you obtain.

Some commonly used members:

get_attribute: returns the value of an attribute of this tag.

save_screenshot: saves a screenshot of the current page to a file. This method is available only on the driver.
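As a small illustration of get_attribute, the sketch below collects one attribute from a list of elements, skipping elements that do not have it (get_attribute returns None in that case). The helper name collect_attribute is hypothetical.

```python
def collect_attribute(elements, name):
    """Return the non-empty values of attribute `name` across the elements."""
    values = []
    for element in elements:
        value = element.get_attribute(name)  # None when the attribute is absent
        if value:
            values.append(value)
    return values
```

For example, collect_attribute(driver.find_elements(By.TAG_NAME, 'a'), 'href') would list the link targets on the current page.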

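Note that the driver's class does not inherit from WebElement (it derives from selenium.webdriver.remote.webdriver.WebDriver), but it offers the same find_element and find_elements lookup methods.

Author: Zhao 0_ bili https://www.bilibili.com/read/cv15524875 Source: BiliBili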