(to be supplemented later)
Basic page operations
According to the official documentation, a page goes through the following stages after page.goto(url) is called:
- Set url
- Fetch the page over the network and parse it
- The page.on("domcontentloaded") event is triggered
- Execute the page's JS scripts and load static resources
- The page.on("load") event is triggered
- The page executes dynamically loaded scripts
- Once there have been no network requests for 500 ms, the networkidle state is reached
page.goto(url) navigates to a new URL. By default, Playwright waits until the load state is reached. If we don't need CSS, images, and other resources to finish loading, we can wait for the domcontentloaded state instead. If the page loads its content via Ajax, we need to wait for the networkidle state. If networkidle is not suitable, page.wait_for_selector can be used to wait for a specific element to appear. Actions such as click wait automatically, however.
page.goto(url, referer="", timeout=30000, wait_until="domcontentloaded|load|networkidle")  # timeout is in milliseconds
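As a rough illustration of the stages listed above, the following sketch records the order in which they are observed during a navigation. The helper name goto_with_trace is hypothetical, and page is assumed to be a sync-API Page; the sketch only relies on page.on, page.goto, and page.wait_for_load_state.

```python
def goto_with_trace(page, url, wait_until="domcontentloaded"):
    """Navigate to url and return the lifecycle stages in the order observed.

    goto_with_trace is a hypothetical helper; page is assumed to be a
    Playwright sync-API Page.
    """
    stages = []
    # Both event handlers receive the page itself as their only argument.
    page.on("domcontentloaded", lambda _page: stages.append("domcontentloaded"))
    page.on("load", lambda _page: stages.append("load"))

    # Timeouts are given in milliseconds.
    page.goto(url, wait_until=wait_until, timeout=30_000)
    # networkidle is a load state rather than an event, so wait for it explicitly.
    page.wait_for_load_state("networkidle")
    stages.append("networkidle")
    return stages
```

Calling goto_with_trace(page, "https://example.com") on a real page should normally return the three stages in the order given in the list.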
Playwright automatically waits for an element to be in a stable, actionable state before operating on it. You can also wait manually with the page.wait_for_* functions:
page.wait_for_event(event, predicate, timeout=timeout)
page.wait_for_function(js_function)
page.wait_for_load_state(state="domcontentloaded|load|networkidle", timeout=timeout)
page.wait_for_selector(selector, timeout=timeout)
page.wait_for_timeout(timeout)  # Not recommended
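A manual wait is mostly useful when the element may never show up. The sketch below is one possible pattern, not the only one: text_when_visible is a hypothetical helper that waits for an element and falls back to None on failure. It catches a broad Exception only so that the snippet needs no playwright import; in real code you would more likely catch Playwright's own TimeoutError.

```python
def text_when_visible(page, selector, timeout_ms=5_000):
    """Return the element's text once it appears, or None if it never does.

    text_when_visible is a hypothetical helper; page is assumed to be a
    Playwright sync-API Page.
    """
    try:
        # By default wait_for_selector waits until the element is visible.
        element = page.wait_for_selector(selector, timeout=timeout_ms)
    except Exception:
        # Playwright raises its own TimeoutError here; the broad except is an
        # assumption made to keep this sketch import-free.
        return None
    return element.inner_text()
```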
The main methods for interacting with the page are:
# selector refers to expressions such as CSS
page.click(selector)
page.fill(selector, value)  # Fill in value in input

# example
page.click("#search")
The main methods to obtain the data in the page are:
page.url  # URL (a property, not a method)
page.title()  # title
page.content()  # full HTML of the page
page.inner_text(selector)  # same as element.inner_text()
page.inner_html(selector)
page.text_content(selector)
page.get_attribute(selector, attr)

# eval_on_selector is used to get a value from a DOM element
page.eval_on_selector(selector, js_expression)
# For example:
search_value = page.eval_on_selector("#search", "el => el.value")

# evaluate runs JS in the page, for example to read a value from window
result = page.evaluate("([x, y]) => Promise.resolve(x * y)", [7, 8])
print(result)  # prints "56"
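Put together, the accessors above can be combined into one small extraction step. The following is a minimal sketch under assumptions: extract_page_data and the field names in the returned dict are hypothetical, and page is assumed to be a sync-API Page that has already been navigated.

```python
def extract_page_data(page, field_selector):
    """Collect a few values from an already loaded page.

    extract_page_data and the dict keys are hypothetical names chosen for
    this sketch.
    """
    return {
        "url": page.url,  # property, no call
        "title": page.title(),
        # Read the current value of a form field from the DOM.
        "field_value": page.eval_on_selector(field_selector, "el => el.value"),
        # eval_on_selector_all passes every matching element to the function.
        "links": page.eval_on_selector_all("a", "els => els.map(a => a.href)"),
    }
```

For example, extract_page_data(page, "#search") would return the page URL, its title, the text typed into #search, and the href of every link.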
Selector expressions
In the code above, we used CSS expressions (such as #search) to select elements. Playwright also supports XPath and two simple expression types of its own, and it detects which kind is being used automatically.
# Select elements by text, an expression type defined by Playwright
page.click("text=login")
# Select directly by id
page.click("id=login")
# Select elements through CSS
page.click("#search")

# In addition to the commonly used CSS expressions, Playwright also supports several new pseudo-classes
# :has matches an element that contains another element
page.click("article:has(div.promo)")
# :is matches an element that satisfies any of the listed selectors
page.click("button:is(:text('sign in'), :text('log in'))")
# :text matches an element containing the given text
page.click("button:text('Sign in')")  # contains
page.click("button:text-is('Sign in')")  # strict match
page.click(r"button:text-matches('\w+')")  # regular expression

# You can also match by relative position
page.click("button:right-of(#search)")  # to the right
page.click("button:left-of(#search)")  # to the left
page.click("button:above(#search)")  # above
page.click("button:below(#search)")  # below
page.click("button:near(#search)")  # within 50px

# Select through XPath
page.click("//button[@id='search']")
# All expressions beginning with // or .. are treated as XPath expressions by default
For CSS expressions, you can also add the css= prefix to be explicit; for example, css=.login is equivalent to .login.
In addition to the four kinds of expression described above, Playwright supports combining them with >>, so the four kinds can be mixed.
page.click('css=nav >> text=Login')
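Because selectors are plain strings, chained expressions can be assembled with ordinary string helpers. The sketch below is only an illustration; by_text and chain are hypothetical helpers, not part of Playwright. It relies on two points of selector syntax: each part after >> is resolved relative to the match of the previous part, and quoting the text in a text= selector requests an exact match. The helper does not escape quotes inside the text.

```python
def by_text(text, exact=False):
    """Build a text selector; quoting the text asks for an exact match.

    by_text is a hypothetical helper and does not escape quotes in text.
    """
    return f'text="{text}"' if exact else f"text={text}"


def chain(*parts):
    """Join selectors with >> so each part is resolved inside the previous one."""
    return " >> ".join(parts)


# Selector for a "Login" entry that must sit inside a <nav> element.
login_in_nav = chain("css=nav", by_text("Login"))
```

The resulting string can be passed to page.click like any other selector.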
Reusing authentication state such as cookies
In Puppeteer, reusing cookies has been a long-standing pain point. Playwright makes this particularly convenient: it can export cookies and localStorage directly and then load them into a new context.
# Save state
import json

storage = context.storage_state()
with open("state.json", "w") as f:
    f.write(json.dumps(storage))

# Load state
with open("state.json") as f:
    storage_state = json.loads(f.read())
context = browser.new_context(storage_state=storage_state)
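The JSON round trip can also be left to Playwright, since storage_state accepts a path argument and new_context accepts a file path for storage_state. The following sketch builds on that; the helper names and the default file name are hypothetical, and browser and context are assumed to be sync-API objects.

```python
from pathlib import Path


def context_with_saved_state(browser, state_path="state.json"):
    """Create a context, restoring saved cookies/localStorage if the file exists.

    The helper name and the default state_path are hypothetical.
    """
    if Path(state_path).exists():
        # new_context accepts a path to a previously saved state file.
        return browser.new_context(storage_state=str(state_path))
    return browser.new_context()


def save_state(context, state_path="state.json"):
    """Write the context's cookies and localStorage to state_path."""
    context.storage_state(path=str(state_path))
```

A typical flow would be to call context_with_saved_state, log in only if the session is missing, and then call save_state before closing the browser.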
Listening to events
You can register the handler function of the corresponding event through page.on(event, fn):
def log_request(intercepted_request):
    print("a request was made:", intercepted_request.url)

page.on("request", log_request)
# sometime later...
page.remove_listener("request", log_request)
The most important events are request and response.
Intercepting and modifying network requests
You can listen for request and response events through page.on("request") and page.on("response").
from playwright.sync_api import sync_playwright as playwright

def run(pw):
    browser = pw.webkit.launch()
    page = browser.new_page()
    # Subscribe to "request" and "response" events.
    page.on("request", lambda request: print(">>", request.method, request.url))
    page.on("response", lambda response: print("<<", response.status, response.url))
    page.goto("https://example.com")
    browser.close()

with playwright() as pw:
    run(pw)
For the properties and methods of request and response, you can refer to the document: https://playwright.dev/python/docs/api/class-request
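Beyond printing, a response listener can collect data for later inspection. The sketch below is one way to do it, with the hypothetical helper watch_failed_responses recording every response whose status is 400 or higher. It uses only response.status and response.url, and it returns the handler so that it can later be passed to page.remove_listener.

```python
def watch_failed_responses(page):
    """Record (status, url) for every response with an HTTP error status.

    watch_failed_responses is a hypothetical helper. It returns the list that
    will be filled in and the registered handler.
    """
    failures = []

    def on_response(response):
        if response.status >= 400:
            failures.append((response.status, response.url))

    page.on("response", on_response)
    return failures, on_response
```

After page.goto(...) has finished, the failures list holds the failed requests seen so far; page.remove_listener("response", handler) stops the collection.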
Through context.route, you can also intercept requests and forge or modify them. For example, block all image requests to reduce bandwidth usage:
import re

context = browser.new_context()
page = context.new_page()
# route patterns are glob patterns by default; a compiled regular expression object also works
context.route("**/*.{png,jpg,jpeg}", lambda route: route.abort())
context.route(re.compile(r"(\.png$)|(\.jpg$)"), lambda route: route.abort())
page.goto("https://example.com")
browser.close()
For the related attributes and methods of the route object, you can refer to the document: https://playwright.dev/python/docs/api/class-route
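A route handler can also decide per request whether to block, forge, or pass a request through. The sketch below is an illustration under assumptions: the set of blocked resource types and the /api/ping endpoint are hypothetical, while route.abort, route.fulfill, route.continue_, and request.resource_type are the calls it relies on.

```python
# Resource types to drop; this particular choice is an assumption of the sketch.
BLOCKED_TYPES = {"image", "media", "font"}


def handle_route(route):
    """Abort heavy resources, fake one endpoint, and let everything else through."""
    request = route.request
    if request.resource_type in BLOCKED_TYPES:
        route.abort()
    elif request.url.endswith("/api/ping"):  # hypothetical endpoint
        # Answer from the handler without contacting the server.
        route.fulfill(status=200, content_type="application/json", body='{"ok": true}')
    else:
        # Note the trailing underscore: continue is a reserved word in Python.
        route.continue_()
```

It would be registered for all URLs with context.route("**/*", handle_route). Filtering on resource_type rather than on file extensions also catches images served from URLs without an extension.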
Flexible proxy settings
Playwright also makes it easy to set up proxies. Puppeteer cannot change the proxy after the browser has been launched, which is very unfriendly to crawlers. Playwright can set the proxy per context, which is lightweight and does not require restarting the browser to switch proxies.
context = browser.new_context(
    proxy={
        "server": "http://example.com:3128",
        "bypass": ".example.com",
        "username": "",
        "password": "",
    }
)
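Since the proxy is just an argument to new_context, switching proxies amounts to opening another context on the same browser. The following is a minimal sketch of that idea; new_context_for_proxy is a hypothetical helper, and the proxy addresses in the usage note are placeholders.

```python
def new_context_for_proxy(browser, server, username=None, password=None):
    """Open a fresh context on the running browser that uses the given proxy.

    new_context_for_proxy is a hypothetical helper; browser is assumed to be
    a Playwright sync-API Browser.
    """
    proxy = {"server": server}
    if username:
        # Credentials are only sent when a username is supplied.
        proxy["username"] = username
        proxy["password"] = password or ""
    return browser.new_context(proxy=proxy)
```

A crawler could loop over a list such as ["http://proxy-a.example.com:3128", "http://proxy-b.example.com:3128"], create one context per entry, and close each context when it is done, all without relaunching the browser.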
Killer feature: recording actions to generate code
Playwright's command line also has an interesting built-in feature: it can generate Python code directly by recording your clicks and other actions.
python -m playwright codegen http://example.com/
Playwright has many other command-line functions, such as taking screenshots. Run python -m playwright -h to see them.
Other
In addition, Playwright supports handling pop-up windows, simulating the keyboard, simulating mouse drags (for slider CAPTCHAs), downloading files, and more. Please check the official documentation for these; they are not repeated here. For writing crawlers, several features put Playwright well ahead of Puppeteer / pyppeteer:
- An official synchronous API
- Easy import and export of cookies
- Lightweight proxy setup and switching
- Rich selector expressions
References:
Playwright: a better browser automation tool than Puppeteer - Zhihu