Using puppeteer, a third-party Node.js library, to simulate logins on JS-rendered pages

Posted by hughesa on Mon, 03 Jan 2022 11:44:22 +0100

	I came across a library called puppeteer. In 2017 Google's Chrome introduced the Chrome Headless feature and released puppeteer at the same time. It can be understood as "a version of the Chrome we use every day without an interface, plus a JS API package for controlling it". In short, it is a third-party library that simulates browser behavior. Any page a browser can visit, it can operate on, which makes it extremely powerful.
	It reminded me of a friend who was writing a crawler for a school's educational administration system. The school's login page is rendered with JS, and the login request carries a JS-encrypted string that can only be generated dynamically inside a browser. The JS is also protected by Ruishu (瑞数) encryption, which is costly and difficult to crack, so I decided to try puppeteer to simulate the login and fetch the user data.
	Now, open the login page and start analyzing.


Using puppeteer really comes down to writing a person's manual steps as code:
1: Open the URL
2: Enter the student number
3: Enter the password
4: Click login
The implementation code is:

  // The snippets below use await, so run them inside an async function
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch({
    ignoreDefaultArgs: ["--enable-automation"], // Remove the "controlled by automated test software" notice
    headless: true,
    args: ['--no-sandbox', '--disable-setuid-sandbox']
  });

  const page = await browser.newPage();
  // Fake more properties so the page looks like a regular browser
  await page.evaluateOnNewDocument(() => { // Run the following script before each new document loads
    const newProto = navigator.__proto__;
    delete newProto.webdriver; // Delete the navigator.webdriver field
    navigator.__proto__ = newProto;
    window.chrome = {}; // Add the window.chrome field; some values need to be filled in to look more authentic
    window.chrome.app = {
      "InstallState": "hehe",
      "RunningState": "haha",
      "getDetails": "xixi",
      "getIsInstalled": "ohno"
    };
    window.chrome.csi = function () {};
    window.chrome.loadTimes = function () {};
    window.chrome.runtime = function () {};
    Object.defineProperty(navigator, 'userAgent', { // In headless mode the userAgent contains the word "Headless", so it needs to be overwritten
      get: () => "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.113 Safari/537.36",
    });
    Object.defineProperty(navigator, 'plugins', { // Disguise as real plugin information
      get: () => [{
        "description": "Portable Document Format",
        "filename": "internal-pdf-viewer",
        "length": 1,
        "name": "Chrome PDF Plugin"
      }]
    });
    Object.defineProperty(navigator, 'languages', { // Add languages
      get: () => ["zh-CN", "zh", "en"],
    });
    const originalQuery = window.navigator.permissions.query; // Disguise the notification permission
    window.navigator.permissions.query = (parameters) => (
      parameters.name === 'notifications' ?
      Promise.resolve({
        state: Notification.permission
      }) :
      originalQuery(parameters)
    );
  })
  await page.goto('http://xxx/Login.html');
  await page.waitForSelector('#txtUser'); // Wait until the login form has been rendered
  await page.type('#txtUser', "20190xxxxx");
  await page.type('#txtPWD', '42052120xxxxx');
  // Start waiting before clicking so the navigation triggered by the click is not missed
  await Promise.all([
    page.waitForNavigation(),
    page.click('#ibtnLogin')
  ]);
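The hard-coded user agent in the script above goes stale as Chrome is upgraded, and overriding `navigator.userAgent` in the page does not change the User-Agent request header. A minimal sketch of one alternative, assuming the only headless marker in the string is the `HeadlessChrome` token: read the browser's real user agent and strip that token. `normalizeUserAgent` and `applyRealisticUserAgent` are hypothetical helper names, not part of puppeteer; `browser.userAgent()` and `page.setUserAgent()` are puppeteer methods.

```javascript
// Hypothetical helper: turn a headless user agent into a regular-looking one.
// Assumes the only giveaway is the "HeadlessChrome" product token.
function normalizeUserAgent(userAgent) {
  return userAgent.replace('HeadlessChrome', 'Chrome');
}

// Hypothetical helper: apply the cleaned-up user agent to a puppeteer page.
// Call it right after browser.newPage() and before page.goto().
async function applyRealisticUserAgent(browser, page) {
  const original = await browser.userAgent(); // contains "HeadlessChrome/<version>" in headless mode
  await page.setUserAgent(normalizeUserAgent(original));
}
```

This keeps the reported Chrome version in step with the browser that is actually installed.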
  


After logging in successfully, click [Transcript]:

  const [response] = await Promise.all([
    page.waitForNavigation(),
    page.click('a[href="SearchInfo/Score/ScoreList.aspx"]')
  ]);
  const res = await page.$eval('ul.listUl', el => el.outerHTML); // Get the HTML of the score list
  await browser.close();

Printing res outputs the complete HTML of the transcript table.
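Raw HTML is awkward to work with downstream. As a hedged sketch, assuming each grade sits in its own `<li>` under `ul.listUl` (the real markup is not shown in this post), the text of each item could be collected inside the page instead. `cleanItemTexts` and `readScoreItems` are hypothetical names; `page.$$eval` is puppeteer's method for running a function over all elements that match a selector.

```javascript
// Hypothetical helper: runs inside the page, receives the matched <li> elements
// and returns their text with whitespace collapsed. It must stay self-contained
// because puppeteer serializes it and evaluates it in the browser.
function cleanItemTexts(items) {
  return items
    .map(el => el.textContent.replace(/\s+/g, ' ').trim())
    .filter(text => text.length > 0);
}

// Assumed selector: one <li> per row under ul.listUl.
async function readScoreItems(page) {
  return page.$$eval('ul.listUl li', cleanItemTexts);
}
```

`readScoreItems(page)` has to be called before `browser.close()`, while the transcript page is still open.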

puppeteer can also take screenshots. Even though no interface is visible in headless mode, the page can still be captured:

    await page.screenshot({
        path: 'c:/temp/temp.png'
    })

Adding this snippet captures the page and saves it in the temp folder on drive C. If the folder does not exist it is not created automatically, so create it in advance.
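The folder can also be created from code just before the screenshot is taken. A small sketch using Node's built-in `fs` and `path` modules; `ensureParentDir` is a hypothetical helper name, and the `recursive` option of `fs.mkdirSync` requires Node.js 10.12 or later.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper: make sure the folder for a screenshot exists,
// then return the path unchanged so it can be passed straight to puppeteer.
function ensureParentDir(filePath) {
  // With recursive: true, missing parent folders are created and an existing folder is not an error
  fs.mkdirSync(path.dirname(filePath), { recursive: true });
  return filePath;
}

// Usage with the page object from the login code:
// await page.screenshot({ path: ensureParentDir('c:/temp/temp.png') });
```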

Pleasant to use, isn't it? But puppeteer is effectively a browser running in the background. When several requests arrive, several browser tasks are started and run separately, so under high concurrency the server may be overwhelmed. Can it run in a cloud function instead? The answer is yes.
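If the crawler does stay on an ordinary server, one way to keep the number of background browsers under control is to queue the work. The following is only a sketch of that idea, not something the original setup used: `createLimiter` is a hypothetical helper that lets at most `max` async tasks run at once and makes the rest wait.

```javascript
// Hypothetical helper: run at most `max` async tasks at once; extra tasks wait in a queue.
function createLimiter(max) {
  let active = 0;
  const waiting = [];

  const next = () => {
    if (active >= max || waiting.length === 0) return;
    active += 1;
    const { task, resolve, reject } = waiting.shift();
    Promise.resolve()
      .then(task)
      .then(resolve, reject)
      .finally(() => {
        active -= 1;
        next(); // start the next queued task, if any
      });
  };

  // Returns a promise that settles with the task's own result.
  return task =>
    new Promise((resolve, reject) => {
      waiting.push({ task, resolve, reject });
      next();
    });
}

// Usage sketch: allow two headless sessions at a time (the number 2 is an assumption).
// const limit = createLimiter(2);
// const html = await limit(() => scrapeTranscript(user, password)); // scrapeTranscript is hypothetical
```

Requests beyond the limit are delayed rather than rejected, so response time grows under load while memory use stays bounded.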

It can run directly in a WeChat Mini Program cloud function without installing the library. The official Node.js 10 runtime ships with puppeteer installed by default, so it can be required directly.

Let's look at the running time and memory usage:

The run takes 2789 ms and uses 135.74 MB of memory, which is perfectly acceptable.
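To collect comparable numbers outside the cloud console, a run can be wrapped in a small timing helper. This is a sketch with a hypothetical `measure` function; note that `process.memoryUsage().rss` covers only the Node.js process, while Chrome runs in separate processes, so the memory figure will not match what a cloud platform reports.

```javascript
// Hypothetical helper: time an async task and report the Node.js process memory afterwards.
async function measure(task) {
  const startedAt = Date.now();
  const result = await task();
  return {
    result,
    elapsedMs: Date.now() - startedAt,
    // Resident memory of the Node.js process only; Chrome's own processes are not included.
    rssMB: Number((process.memoryUsage().rss / 1024 / 1024).toFixed(2))
  };
}

// Usage sketch (loginAndFetch is hypothetical):
// const { elapsedMs, rssMB } = await measure(() => loginAndFetch());
// console.log(`${elapsedMs}ms, ${rssMB}MB`);
```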

Conclusion: simulated login with puppeteer works for any kind of website, but it is somewhat slow.

Advantages: the biggest advantage is that it can handle websites with anti-crawler measures and websites rendered with JS.
Disadvantages: memory use is a black hole, and on a server it cannot cope with high concurrency (this can be solved with cloud functions). It is also slow: because it simulates manual operation, input is typed character by character and every page load has to be waited for, so it is much slower than the conventional approach of requesting an interface directly.
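Part of the typing cost can sometimes be avoided by setting a field's value in one step. The sketch below assumes the page only needs the final value plus an `input` event; a login page that watches individual key presses (as an anti-crawler script might) would still need `page.type`. `fillField` is a hypothetical helper; `page.$eval` is the puppeteer method that runs a function against the first element matching a selector.

```javascript
// Hypothetical helper: set an input's value in one step instead of typing it key by key.
async function fillField(page, selector, value) {
  await page.$eval(
    selector,
    (el, v) => {
      // This callback runs in the browser, not in Node.js
      el.value = v;
      el.dispatchEvent(new Event('input', { bubbles: true }));
    },
    value
  );
}

// Usage sketch with the selectors from the login code:
// await fillField(page, '#txtUser', '20190xxxxx');
```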

Topics: Javascript node.js crawler