[Playwright tutorial] Summary of basic operations

Posted by mariolopes on Sat, 18 Sep 2021 00:35:38 +0200

(to be supplemented later)

Basic page operations

According to the official documentation, a page goes through the following stages after page.goto(url) is called:

The URL is set
The page is fetched over the network and parsed
The page.on("domcontentloaded") event fires
The page's JS scripts run and static resources are loaded
The page.on("load") event fires
The page runs dynamically loaded scripts
The networkidle state is reached once there have been no new network requests for 500 ms

page.goto(url) navigates to a new URL. By default, Playwright waits for the load state. If we don't care about CSS, images, and other resources, we can wait for the domcontentloaded state instead. If the page loads its content via AJAX, we need to wait for the networkidle state. If networkidle is not suitable, page.wait_for_selector can be used to wait for a specific element to appear. Actions such as click wait for their target automatically, however.

page.goto(url, referer="", timeout=30000, wait_until="domcontentloaded|load|networkidle")  # timeout is in milliseconds
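As a minimal sketch of the waiting strategy described above, the helper below navigates with wait_until="domcontentloaded" and then optionally waits for one element. The names open_page, ready_selector, and demo are illustrative and not part of Playwright, and the timeout values are assumptions.

```python
def open_page(page, url, ready_selector=None):
    """Navigate to url and return once the page is usable for scraping."""
    # Return as soon as the DOM is parsed; images and stylesheets may still be loading.
    page.goto(url, wait_until="domcontentloaded", timeout=30_000)  # milliseconds
    if ready_selector is not None:
        # For AJAX-rendered content, wait for a concrete element rather than networkidle.
        page.wait_for_selector(ready_selector, timeout=10_000)
    return page.url


def demo():
    # Imported here so open_page can be reused without launching a browser.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as pw:
        browser = pw.chromium.launch()
        page = browser.new_page()
        print(open_page(page, "https://example.com", ready_selector="h1"))
        browser.close()
```

Calling demo() requires Playwright and its browsers to be installed; open_page itself only needs an object with goto and wait_for_selector methods.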

Playwright automatically waits for an element to be stable and actionable before operating on it. You can also wait manually with the page.wait_for_* functions:

page.wait_for_event("event", predicate, timeout)
page.wait_for_function(js_function)
page.wait_for_load_state(state="domcontentloaded|load|networkidle", timeout)
page.wait_for_selector(selector, timeout)
page.wait_for_timeout(timeout)  # Not recommended

The operation methods of the page mainly include:

# selector refers to expressions such as CSS
page.click(selector)
page.fill(selector, value)  # Fill in value in input

# example
page.click("#search")
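To illustrate how fill and click combine, here is a small sketch that submits a search form. The selectors #query and #submit and the function name submit_search are hypothetical; substitute whatever the target page actually uses.

```python
def submit_search(page, query, input_selector="#query", button_selector="#submit"):
    """Type a query into a search box and click the submit button."""
    # fill() waits until the input is visible and editable, then replaces its value.
    page.fill(input_selector, query)
    # click() waits until the button is visible, stable, and enabled before clicking.
    page.click(button_selector)
```

Because both calls wait for their target automatically, no explicit wait_for_selector is needed before them in the common case.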

The main methods to obtain the data in the page are:

page.url  # url
page.title()  # title
page.content()  # Get page full text
page.inner_text(selector)  # element.inner_text()
page.inner_html(selector)
page.text_content(selector)
page.get_attribute(selector, attr)

# eval_on_selector is used to read a value from the DOM
page.eval_on_selector(selector, js_expression)
# For example:
search_value = page.eval_on_selector("#search", "el => el.value")

# evaluate runs JS inside the page and returns the result; for example, it can read values on window
result = page.evaluate("([x, y]) => Promise.resolve(x * y)", [7, 8])
print(result) # prints "56"
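The getters above can be combined into one helper that gathers several values about an element. This is only a sketch: describe_link, demo, and the HTML passed to set_content are made up for illustration.

```python
def describe_link(page, selector):
    """Collect a few facts about the page and one link element on it."""
    return {
        "page_url": page.url,                          # property, not a method
        "page_title": page.title(),
        "text": page.inner_text(selector),             # rendered text of the element
        "href": page.get_attribute(selector, "href"),  # None if the attribute is absent
    }


def demo():
    # Imported here so describe_link can be reused without launching a browser.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as pw:
        browser = pw.chromium.launch()
        page = browser.new_page()
        # set_content loads an HTML string directly, so no network access is needed.
        page.set_content('<title>Demo</title><a id="docs" href="/docs">Read the docs</a>')
        print(describe_link(page, "#docs"))
        browser.close()
```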

Selector expressions

In the code above, we used CSS expressions (such as #search) to select elements. Playwright also supports XPath and two simple expressions of its own, and it detects automatically which kind is in use.

# Select elements through text, which is an expression customized by Playwright
page.click("text=login")

# Select directly by id
page.click("id=login")

# Select elements through CSS
page.click("#search")
# In addition to the usual CSS expressions, Playwright supports several extra pseudo-classes
# :has matches an element that contains another element
page.click("article:has(div.promo)")
# :is matches an element that satisfies any of the listed selectors
page.click("button:is(:text('sign in'), :text('log in'))")
# :text matches an element by its text
page.click("button:text('Sign in')")  # contains
page.click("button:text-is('Sign in')")  # strict match
page.click(r"button:text-matches('\w+')")  # regular expression (raw string keeps the backslash)
# You can also match by layout position
page.click("button:right-of(#search)")  # right
page.click("button:left-of(#search)")  # left
page.click("button:above(#search)")  # Above
page.click("button:below(#search)")  # Below
page.click("button:near(#search)")  # Elements within 50px

# Select through XPath
page.click("//button[@id='search']")
# Any expression beginning with // or .. is treated as XPath

For CSS expressions, you can also add the css= prefix to name the engine explicitly; for example, css=.login is equivalent to .login.

In addition to the four kinds of expression described above, Playwright supports chaining them with >>, so the four kinds can be mixed in one selector.

page.click('css=nav >> text=Login')

Reusing authentication state such as cookies

In Puppeteer, reusing cookies has been a long-standing problem. Playwright makes this particularly convenient: it can export cookies and localStorage directly and then load them into a new context.

# Save state
import json
storage = context.storage_state()
with open("state.json", "w") as f:
    json.dump(storage, f)

# Load state
with open("state.json") as f:
    storage_state = json.load(f)
context = browser.new_context(storage_state=storage_state)
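As a shorter variant, storage_state also accepts a path argument, and new_context accepts a file path for storage_state, so the manual JSON handling can be skipped. The sketch below assumes a file named state.json; the helper names are illustrative.

```python
from pathlib import Path


def save_login_state(context, path="state.json"):
    """Persist cookies and localStorage of a logged-in context."""
    # With path=..., Playwright writes the snapshot to the file itself.
    context.storage_state(path=path)


def context_with_saved_state(browser, path="state.json"):
    """Create a context that reuses the saved state when the file exists."""
    if Path(path).exists():
        return browser.new_context(storage_state=path)
    # First run: nothing saved yet, so start from a clean context.
    return browser.new_context()
```

A typical flow would be to log in once, call save_login_state(context), and on later runs open pages from context_with_saved_state(browser) so the login step can be skipped while the saved session remains valid.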

Listening for events

You can register the handler function of the corresponding event through page.on(event, fn):

def log_request(intercepted_request):
    print("a request was made:", intercepted_request.url)
page.on("request", log_request)
# sometime later...
page.remove_listener("request", log_request)

The most important events are request and response.
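One way to put page.on and page.remove_listener together is to attach a handler only for the duration of a navigation. The following sketch records responses with an error status; collect_failed_responses is a hypothetical helper, and treating status 400 and above as a failure is an assumption.

```python
def collect_failed_responses(page, url):
    """Navigate to url and return (status, url) for every response with status >= 400."""
    failures = []

    def on_response(response):
        if response.status >= 400:
            failures.append((response.status, response.url))

    page.on("response", on_response)
    try:
        page.goto(url)
    finally:
        # Detach the handler so later navigations are not recorded by accident.
        page.remove_listener("response", on_response)
    return failures
```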

Intercepting and modifying network requests

You can listen for request and response events with page.on("request") and page.on("response").

from playwright.sync_api import sync_playwright as playwright

def run(pw):
    browser = pw.webkit.launch()
    page = browser.new_page()
    # Subscribe to "request" and "response" events.
    page.on("request", lambda request: print(">>", request.method, request.url))
    page.on("response", lambda response: print("<<", response.status, response.url))
    page.goto("https://example.com")
    browser.close()

with playwright() as pw:
    run(pw)

For the properties and methods of request and response, see the documentation: https://playwright.dev/python/docs/api/class-request

With context.route, you can also intercept requests and then mock or modify them. For example, block all image requests to reduce bandwidth usage:

import re

context = browser.new_context()
page = context.new_page()
# The route pattern is a glob by default; a compiled regular expression also works
context.route("**/*.{png,jpg,jpeg}", lambda route: route.abort())
context.route(re.compile(r"(\.png$)|(\.jpg$)"), lambda route: route.abort())
page.goto("https://example.com")
browser.close()

For the attributes and methods of the route object, see the documentation: https://playwright.dev/python/docs/api/class-route
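Besides abort, a route handler can answer a request itself with fulfill or let it through with continue_. The sketch below decides by resource type rather than file extension. The set of blocked types, the /api/ping endpoint, and the mocked JSON body are assumptions chosen for illustration.

```python
BLOCKED_RESOURCE_TYPES = {"image", "media", "font"}
MOCKED_SUFFIX = "/api/ping"  # hypothetical endpoint, for illustration only


def handle_route(route):
    """Abort heavy resources, mock one endpoint, and pass everything else through."""
    request = route.request
    if request.resource_type in BLOCKED_RESOURCE_TYPES:
        route.abort()
    elif request.url.endswith(MOCKED_SUFFIX):
        # Answer locally without touching the network.
        route.fulfill(status=200, content_type="application/json", body='{"ok": true}')
    else:
        route.continue_()


def install(context):
    # "**/*" matches every URL, so the handler sees all requests made in this context.
    context.route("**/*", handle_route)
```

Every intercepted request should end in exactly one of abort, fulfill, or continue_; otherwise the request stays pending.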

Flexible proxy settings

Playwright also makes it easy to set up proxies. Puppeteer cannot change the proxy once the browser has been launched, which is very unfriendly to crawlers. Playwright can set the proxy per context, which is lightweight and does not require restarting the browser to switch proxies.

context = browser.new_context(
    proxy={"server": "http://example.com:3128", "bypass": ".example.com", "username": "", "password": ""}
)
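Since the proxy is a per-context setting, one browser process can serve several proxies at once. A minimal sketch, assuming a list of proxy server URLs supplied by the caller (the helper name and the example addresses are placeholders):

```python
def contexts_per_proxy(browser, proxy_servers):
    """Create one isolated context per proxy server, all in the same browser process."""
    return {
        server: browser.new_context(proxy={"server": server})
        for server in proxy_servers
    }
```

For example, contexts_per_proxy(browser, ["http://proxy-a.example:3128", "http://proxy-b.example:3128"]) would return two contexts, and pages opened from each context would send their traffic through the corresponding proxy.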

Killer feature: recording actions to generate code

Playwright's command line also has an interesting built-in feature: it can generate Python code directly from a recording of your clicks and other actions.

python -m playwright codegen http://example.com/

Playwright has many other command-line functions, such as taking screenshots; run python -m playwright -h to see them.

Other

In addition, Playwright supports handling pop-up windows, simulating the keyboard, simulating mouse drags (for slider CAPTCHAs), downloading files, and more. Please check the official documentation for these; they are not covered here. For writing crawlers, several Playwright features put it well ahead of Puppeteer / pyppeteer:

An official synchronous API
Easy import and export of cookies
Lightweight proxy setup and switching
Rich selector expressions

