The use of selenium in reptiles

Posted by Eskimo887 on Thu, 03 Mar 2022 09:21:58 +0100

The use of selenium in reptiles

I selenium overview

1.1 definitions

Selenium is a Web automated testing tool, which was originally developed for website automated testing. Selenium can directly call the browser. It supports all mainstream browsers (including PhantomJS, which has no interface). It can receive instructions and let the browser automatically load pages, obtain required data, and even screen shots. We can easily write crawlers using selenium.

1.2 function and working principle

Using the browser's native API, it is encapsulated into a set of more object-oriented Selenium WebDriver API to directly operate the elements in the browser page and even the browser itself (screen capture, window size, start, close, install plug-ins, configure certificates, etc.)

II selenium installation

2.1 installing selenium module in python

pip install selenium

2.2 download webdriver (take edge browser as an example)

Check the version of edge browser, and you can see that my version here is 98.0.1108.62

webdriver has different browser download addresses. edge driver is used here

chrome driver: http://chromedriver.storage.googleapis.com/index.html

Firefox driver: https://github.com/mozilla/geckodriver/releases/

Edge drive: https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/

Find the corresponding version and click x64 to download it

2.3 configuration environment

Download the msedgedriver The EXE file is placed in the directory of the edge browser

Set msedgedriver The path of the directory where the EXE file is located is added to the environment variable

2.4 simple application

Use python to open the browser and search for python in Baidu

import time
from selenium import webdriver
#Create drive object
#To register environment variables
# driver=webdriver.Edge()
#There is no registered executable_ The path where the driver is located in path
driver=webdriver.Edge(executable_path='C:\Program Files (x86)\Microsoft\Edge\Application/msedgedriver.exe')
driver.get('http://www.baidu.com')
time.sleep(5)
#Search for python in Baidu search box
driver.find_element_by_id('kw').send_keys('python')
driver.find_element_by_id('su').click()
time.sleep(6)
driver.quit()

III Object driver

The above example probably has the basic use of webdriver. Now create a driver object

from selenium import webdriver
#Create drive object
#To register environment variables
# driver=webdriver.Edge()
#There is no registered executable_ The path where the driver is located in path
driver=webdriver.Edge(executable_path='C:\Program Files (x86)\Microsoft\Edge\Application/msedgedriver.exe')

3.1 common attributes and methods of driver object

In the process of using selenium, after instantiating the driver object, the driver object has some common properties and methods

  1. driver.page_source the web page source code rendered by the current tab browser (does not involve the source code of capturing packets, just copy it)
  2. driver.current_url the url of the current tab (of the response)
  3. driver.close() closes the current tab. If there is only one tab, close the entire browser
  4. driver.quit() closes the browser
  5. driver. Forward page
  6. driver. Back page
  7. driver. save_ Screenshot of screenshot (img_name) page

3.2 positioning label elements

The element can be positioned in the tag and returned in a variety of ways

find_element_by_id (Returns an element)

find_element(s)_by_class_name (Get the element list according to the class name)

find_element(s)_by_name (According to the label name Property value returns a list of elements that contain label objects)

find_element(s)_by_xpath (Returns a list containing elements)

find_element(s)_by_link_text (Get the element list according to the link text)

find_element(s)_by_partial_link_text (Get the list of elements based on the text contained in the link)

find_element(s)_by_tag_name (Get the element list according to the tag name)

find_element(s)_by_css_selector (according to css Selector to get the list of elements)

be careful:

  • find_element and find_ Differences between elements:
    • If there is more than one s, the list will be returned, and if there is no s, the first tag object matched will be returned
    • find_ Throw an exception if the element does not match, find_ If the elements do not match, an empty list will be returned
  • by_link_text and by_ partial_ link_ The difference between text: all text and containing a text

3.3 extract text content

find_element can only get the element, not the data directly. If you need to get the data, you need to use the following methods

  • Click on the element click()
    • Click the located label object
  • Enter data element into the input box send_ keys(data)
    • Enter data for the anchored label object
  • Get the text element text
    • Get the text content by locating the text attribute of the obtained label object
  • Get the attribute value element get_ Attribute ("attribute name")
    • Get of tag object obtained by locating_ The attribute function passes in the attribute name to get the value of the attribute

Chestnuts:

Take 58 city – > co renting as an example, grab all the house names and links on the current page

import time
from selenium import webdriver
#Create drive object
#To register environment variables
# driver=webdriver.Edge()
#There is no registered executable_ The path where the driver is located in path
driver=webdriver.Edge(executable_path='C:\Program Files (x86)\Microsoft\Edge\Application/msedgedriver.exe')
driver.get('https://hz.58.com/hezu/?utm_source=market&spm=u-2d2yxv86y3v43nkddh1.BDPCPZ_BT&PGTID=0d100000-0004-f0eb-30f2-60c6680bc9c2&ClickID=2')
a_list=driver.find_elements_by_xpath('/html/body/div[6]/div[2]/ul/li/div[2]/h2/a')
for a in a_list:
    print(a.text,a.get_attribute('href'))
driver.quit()

Operation results:

IV selenium tab switching

4.1 page tab switching

When selenium controls the browser to open multiple tabs, how to control the browser to switch between different tabs? need
We do the following two steps:

  • Gets the window handle of all tabs

  • Use the window handle word to switch to the tab pointed to by the handle

  • The window handle here refers to the identification pointing to the tab object

  • Specific method

    # 1. Get the list of handles of all current tabs
    current_windows = driver.window_handles
    
    # 2. Switch according to the index subscript of the tab handle list
    driver.switch_to.window(current_windows[0])
    

example:

Open two pages and switch to the second

import time
from selenium import webdriver
driver=webdriver.Edge(executable_path='C:\Program Files (x86)\Microsoft\Edge\Application/msedgedriver.exe')
driver.get('http://www.baidu.com')
time.sleep(3)
driver.find_element_by_id('kw').send_keys('python')
driver.find_element_by_id('su').click()
#Open a tag by executing js
js='window.open("https://www.sogou.com")'
# Execute js code
driver.execute_script(js)
#Get all windows
windows=driver.window_handles
time.sleep(3)
#Switch according to window index
driver.switch_to.window(windows[0])
time.sleep(3)
driver.switch_to.window(windows[1])
time.sleep(3)
driver.quit()

Secondary 58 climbing

import time
from selenium import webdriver
#Create drive object
#To register environment variables
# driver=webdriver.Edge()
#There is no registered executable_ The path where the driver is located in path
driver=webdriver.Edge(executable_path='C:\Program Files (x86)\Microsoft\Edge\Application/msedgedriver.exe')
driver.get('http://www.58.com')
time.sleep(1)
# Click the sharing link
driver.find_element_by_xpath('/html/body/div[3]/div[1]/div[1]/div/div[1]/div[1]/span[3]/a').click()
time.sleep(1)
#switch windows
driver.switch_to.window(driver.window_handles[-1])
time.sleep(1)
#Find the object list through xpath
a_list=driver.find_elements_by_xpath('/html/body/div[6]/div[2]/ul/li/div[2]/h2/a')
for a in a_list:
    print(a.text,a.get_attribute('href'))
driver.quit()

4.2 iframe label switching

frame is a common technology in html, that is, a page is nested with another web page, and selenium cannot access it by default
The corresponding solution to the content in iframe is

Find the iframe object frame through xpath_ Element (xpath can't see the content)

Via driver switch_ to. Frame (frame_element) can enter the iframe page

Find the connection of "account password login" through id and click it to enter the "account password login page"

Find the input box of account and password and the login button through id

Example: log in to QQ space

import time
from selenium import webdriver
#Create drive object
#To register environment variables
# driver=webdriver.Edge()
#There is no registered executable_ The path where the driver is located in path
driver=webdriver.Edge(executable_path='C:\Program Files (x86)\Microsoft\Edge\Application/msedgedriver.exe')
driver.get('http://i.qq.com')
time.sleep(1)
l_iframe=driver.find_element_by_xpath('//*[@id="login_frame"]')
print(l_iframe)

#Enter ifram tab
driver.switch_to.frame(l_iframe)
driver.find_element_by_id('switcher_plogin').click()
time.sleep(1)
#Enter account number
driver.find_element_by_id('u').send_keys('Yours QQ account number')
time.sleep(1)
#Input password
driver.find_element_by_id('p').send_keys('*Your password')
time.sleep(1)
driver.find_element_by_id('login_button').click()
time.sleep(10)
#Slider label, drag it manually and try again
driver.find_element_by_id('login_button').click()
driver.quit()

V Get cookies

5.1 cookie overview

Cookie s, sometimes in the plural form, refer to Cookies stored in some websites in order to identify users and track session s
Data on the user's local terminal (usually encrypted). The specifications defined in RFC2109 and 2965 have been abandoned, and the latest replaced specification is RFC6265.

The Cookie is generated by the server and sent to the user agent (usually the browser), and the browser will protect the key/value of the Cookie
Stored in a text file in a directory, the Cookie will be sent to the server the next time the same website is requested (provided that the browser is set to enable)
Use cookies). The cookie name and value can be defined by the server-side development. For JSP, jsessionid can also be written directly,
In this way, the server can know whether the user is a legal user and whether it needs to log in again.

purpose
The server can use the arbitrariness of the information contained in Cookies to filter and regularly maintain these information to judge the error in HTTP transmission
Status. The most typical application of Cookies is to determine whether the registered user has logged in to the website. The user may be prompted whether to log in next time
When you enter this website, keep user information to simplify login procedures. These are the functions of Cookies. Another important application is shopping
"Car" and so on. Users may choose different products from different pages of the same website over a period of time, and these information will be written in
Cookies to extract information at the time of final payment.

Get cookie s

driver.get_cookies() returns a list containing complete cookie information! Not only name and value, but also
There are cookies such as domain and other dimensions of information. So if you want to use the obtained cookie information with the requests module
If yes, you need to convert the cookie dictionary into name and value as key value pairs

# Get all cookie information of the current tab
print(driver.get_cookies())
# Convert cookie s into Dictionaries
cookies_dict = {cookie['name']: cookie['value']} for cookie in
driver.get_cookies()}

5.3 deleting cookie s

#Delete a cookie
driver.delete_cookie("CookieName")
# Delete all cookie s
driver.delete_all_cookies()

Vi Page waiting

6.1 forced waiting

  • It's actually time sleep()
  • Disadvantages: it is not intelligent, the setting time is too short, and the elements have not been loaded; If the setting time is too long, it will waste time

6.2 implicit waiting

  • Implicit waiting is for element positioning. Implicit waiting sets a time to judge whether the element is successfully positioned within a period of time, such as
    If you are finished, go to the next step
  • If the location is not successful within the set time, the timeout loading will be reported
  • Sample code
from selenium import webdriver
driver = webdriver.Chrome()
driver.implicitly_wait(10) # Implicit wait, max. 10 seconds
driver.get('https://www.baidu.com')
driver.find_element_by_xpath()

6.3 display waiting

Check whether the waiting conditions are met every few seconds. If they are met, stop waiting and continue to execute subsequent codes

If there is no achievement, continue to wait until the specified time is exceeded, and a timeout exception will be reported

import time
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as ec
from selenium.webdriver.common.by import By
#Create drive object
#To register environment variables
# driver=webdriver.Edge()
#There is no registered executable_ The path where the driver is located in path
driver=webdriver.Edge(executable_path='C:\Program Files (x86)\Microsoft\Edge\Application/msedgedriver.exe')
driver.get('http://baidu.com')
WebDriverWait(driver,20,0.5).until(
    ec.presence_of_element_located((By.LINK_TEXT,'hao123'))
)
print(driver.find_element_by_link_text('hao123').get_attribute('href'))
driver.quit()

VII selenium anti climbing

7.1 using proxy ip

Method of using proxy ip

#Instantiate configuration object
options = webdriver.ChromeOptions()
#The configuration object adds commands that use proxy ip
options.add_argument('--proxy-server=http://202.20.16.82:9527')
Instantiate with configuration object driver object
#driver = webdriver.Edge(chrome_options=options)

7.2 replacing user agent

#Instantiate configuration object
options = webdriver.ChromeOptions()
#Add and replace UA command for configuration object
options.add_argument('--user-agent=Mozilla/5.0 HAHA')
#Instantiate the driver object with the configuration object
driver = webdriver.Edge(chrome_options=options)

7.3 fishing net climbing

import time
from selenium.webdriver.common.by import By
from selenium import webdriver
class DouYuSpider(object):
    def __init__(self):
        self.url='https://www.douyu.com/directory/all'
        # self.opt=webdriver.EdgeOptions().add_argument('--headless')
        self.driver=webdriver.Edge(executable_path='C:\Program Files (x86)\Microsoft\Edge\Application/msedgedriver.exe')

    def parse_data(self):
        time.sleep(10)
        #Get the list of live rooms
        room_list=self.driver.find_elements_by_xpath('//*[@id="listAll"]/section[2]/div[2]/ul/li/div')
        print(len(room_list))
        data_list=[]#List of all data

        #Traverse room_list grabs the data of each live broadcast room
        for r in room_list:
            temp={}
            temp['title']=r.find_element_by_xpath('./a/div[2]/div[1]/h3').text
            temp['type']=r.find_element_by_xpath('./a/div[2]/div[1]/span').text
            temp['owner']=r.find_element_by_xpath('./a/div[2]/div[2]/h2/div').text
            temp['num']=r.find_element_by_xpath('./a/div[2]/div[2]/span').text
            #There is a problem here. You can't grab it with the comments below
            temp['image']=self.driver.find_element_by_xpath('//*[@id="listAll"]/section[2]/div[2]/ul/li/div/a/div[1]/div[1]/picture/img').get_attribute('src')
            # temp['image'] = r.find_element_by_xpath('./a/div[1]/div[1]/picture/img').get_attribute('src')
            data_list.append(temp)
        return  data_list,len(room_list)

    def save_data(self,data_list):
        for data in data_list:
            print(data)

    def main(self):
        # the number of pages
        number=0
        self.driver.get(self.url)
        while True:
            data_list,num=self.parse_data()
            self.save_data(data_list)
            if num<120:
                break
            try:
                self.driver.execute_script('scrollTo(0,1000000)')
                # el_text=self.driver.find_element_by_xpath('//*[@id="listAll"]/section[2]/div[2]/div/ul/li[9]/span')
                #It can also be written like this
                el_text=self.driver.find_element(by=By.XPATH,value='//*[contains(text(), "next")] ')
                el_text.click()
                number+=1
                print(f'The first{number}page')
            except Exception as e:
                print(e)
                break
        self.driver.quit()
if __name__ == '__main__':
    spider=DouYuSpider()
    spider.main()

VIII No interface mode

Using selenium will call the browser to run. If you don't want the browser to hang out, you can add the following configuration

Instantiate configuration object
options = webdriver.EdgeOptions()
The configuration object adds a command to turn on the no interface mode
options.add_argument("--headless")
Usually one is enough
 Configuration object addition disabled gpu Command of
options.add_argument("--disable-gpu")
Instantiate with configuration object driver object
driver = webdriver.Edge(options=options)

Note: the 59 + version of chrome browser in macos and the 57 + version in Linux can only use the interface free mode!

Topics: Python crawler