I recently came across a tool called puppeteer. In 2017 Google Chrome introduced its Headless Chrome feature and released puppeteer at the same time. It can be understood as "the Chrome we use every day, minus the interface, plus a JS API package for controlling it". In short, it is a library that drives a real browser, so any page a browser can open, it can operate on, which makes it extremely powerful. It reminded me of a friend who was writing a crawler for a school educational administration system. The school's login page is rendered with JS, and the login request carries a JS-encrypted string that can only be generated dynamically inside a browser. The JS is also protected by Ruishu (瑞数) encryption, which is costly and difficult to crack, so I decided to try puppeteer to simulate the login and fetch the user data. Now, let's open the login page and start the analysis.
Using puppeteer really just means doing what a person would do, written as code:
1: Open URL
2: Enter student number
3: Enter password
4: Click login
The implementation code is:
const puppeteer = require('puppeteer');

// Note: the await calls below must run inside an async function.
const browser = await puppeteer.launch({
  ignoreDefaultArgs: ["--enable-automation"], // Remove the automated-testing reminder
  headless: true,
  args: ['--no-sandbox', '--disable-setuid-sandbox']
});
const page = await browser.newPage();

// Fake more data so the page looks like a normal browser.
// The following script is executed before each new document loads.
await page.evaluateOnNewDocument(() => {
  // Delete the navigator.webdriver field
  const newProto = navigator.__proto__;
  delete newProto.webdriver;
  navigator.__proto__ = newProto;

  // Add the window.chrome field; some values need to be filled in to increase authenticity
  window.chrome = {};
  window.chrome.app = {
    "InstallState": "hehe",
    "RunningState": "haha",
    "getDetails": "xixi",
    "getIsInstalled": "ohno"
  };
  window.chrome.csi = function () {};
  window.chrome.loadTimes = function () {};
  window.chrome.runtime = function () {};

  // userAgent contains the word "headless" in headless mode, so it needs to be overwritten
  Object.defineProperty(navigator, 'userAgent', {
    get: () => "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.113 Safari/537.36",
  });

  // Disguise real plug-in information
  Object.defineProperty(navigator, 'plugins', {
    get: () => [{
      "description": "Portable Document Format",
      "filename": "internal-pdf-viewer",
      "length": 1,
      "name": "Chrome PDF Plugin"
    }]
  });

  // Add languages
  Object.defineProperty(navigator, 'languages', {
    get: () => ["zh-CN", "zh", "en"],
  });

  // Camouflage the notification permission query
  const originalQuery = window.navigator.permissions.query;
  window.navigator.permissions.query = (parameters) => (
    parameters.name === 'notifications' ?
      Promise.resolve({ state: Notification.permission }) :
      originalQuery(parameters)
  );
});

await page.goto('http://xxx/Login.html');
// Wait until the login form is rendered. Calling waitForNavigation() right after
// goto() can time out, because goto() already waits for the page to load.
await page.waitForSelector('#txtUser');
await page.type('#txtUser', "20190xxxxx");
await page.type('#txtPWD', '42052120xxxxx');
// Start waiting for the navigation before clicking, so a fast redirect is not missed
await Promise.all([
  page.waitForNavigation(),
  page.click('#ibtnLogin')
]);
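The four steps listed above can also be wrapped in one small reusable function. This is only a sketch: the name `login` and its option fields are made up for illustration, and `page` is assumed to be a puppeteer Page (or any object exposing the same methods). The selectors are the ones used in this post.

```javascript
// Sketch only: `login` and its options are hypothetical names, not part of puppeteer.
// `page` is assumed to be a puppeteer Page or an object with the same methods.
async function login(page, { url, user, password }) {
  await page.goto(url);                    // 1: open URL
  await page.waitForSelector('#txtUser');  // wait until the login form is rendered
  await page.type('#txtUser', user);       // 2: enter student number
  await page.type('#txtPWD', password);    // 3: enter password
  await Promise.all([                      // 4: click login and wait for the next page
    page.waitForNavigation(),
    page.click('#ibtnLogin'),
  ]);
}
```

Because the page is passed in, the same function can be called from a local script or from a cloud function, and it can be exercised with a stub object when no browser is available.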
After logging in successfully, click [transcript]:
const [response] = await Promise.all([
  page.waitForNavigation(),
  page.click('a[href="SearchInfo/Score/ScoreList.aspx"]')
]);
// Get the DOM of the transcript list
const res = await page.$eval('ul.listUl', el => el.outerHTML);
await browser.close();
If you print res, you get the complete HTML of this table.
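The HTML string is usually only an intermediate result. As a rough sketch, the text of each list item could be pulled out of it as shown here. The helper name `extractListItems` is hypothetical, and the post does not show the real markup of the transcript list, so treating every `<li>` as one entry is an assumption.

```javascript
// Sketch only: extracts the visible text of every <li> from an HTML string such as `res`.
// Assumes a flat list without nested <li> elements; the real markup is not shown in the post.
function extractListItems(html) {
  const items = [];
  const liPattern = /<li\b[^>]*>([\s\S]*?)<\/li>/gi;
  let match;
  while ((match = liPattern.exec(html)) !== null) {
    const text = match[1]
      .replace(/<[^>]+>/g, ' ') // drop nested tags
      .replace(/\s+/g, ' ')     // collapse whitespace
      .trim();
    items.push(text);
  }
  return items;
}
```

A regular expression is fragile against real-world HTML. While the page is still open, reading the text inside the browser with `page.$$eval('ul.listUl li', els => els.map(el => el.textContent.trim()))` is likely to be more robust.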
Puppeteer can also take screenshots. Even though headless mode shows no interface, you can still capture what the page looks like:
await page.screenshot({ path: 'c:/temp/temp.png' })
Add this line (before browser.close()) to capture the page and save it in the temp folder on drive C. If the folder does not exist, it will not be created automatically, so you need to create it yourself in advance.
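Since the folder is not created automatically, one option is to create it from the script just before taking the screenshot. The helper name `ensureParentDir` is made up for this sketch, and the `recursive` option of `fs.mkdirSync` requires Node.js 10.12 or later.

```javascript
const fs = require('fs');
const path = require('path');

// Sketch only: creates the parent folder of `filePath` if it is missing and
// returns the path unchanged, so it can be passed straight to page.screenshot().
function ensureParentDir(filePath) {
  // With `recursive: true` no error is thrown if the folder already exists
  fs.mkdirSync(path.dirname(filePath), { recursive: true });
  return filePath;
}
```

It could then be used as `await page.screenshot({ path: ensureParentDir('c:/temp/temp.png') })`.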
Pleasant to use, isn't it? But it is effectively a browser running in the background. Each incoming request starts its own background task that runs separately, so under high concurrency the server could be overwhelmed. Can it be moved into a cloud function? The answer is yes.
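If the code does stay on your own server, a common way to keep it from being overwhelmed is to cap how many browser tasks run at once. This is a different technique from the cloud function approach used in this post, and `createLimiter` is a hypothetical helper, not a puppeteer API.

```javascript
// Sketch only: a tiny concurrency limiter for async tasks.
function createLimiter(maxConcurrent) {
  let active = 0;
  const waiting = [];

  const next = () => {
    if (active >= maxConcurrent || waiting.length === 0) return;
    active++;
    const { task, resolve, reject } = waiting.shift();
    Promise.resolve()
      .then(task)             // run the task; synchronous throws become rejections
      .then(resolve, reject)  // hand the outcome back to the caller
      .then(() => {
        active--;
        next();               // start the next queued task, if any
      });
  };

  // Queues `task` and returns a promise for its result.
  return (task) =>
    new Promise((resolve, reject) => {
      waiting.push({ task, resolve, reject });
      next();
    });
}
```

With `const limit = createLimiter(2)`, each request would call `limit(() => runLogin(...))`, where `runLogin` stands for whatever function launches the browser and performs the login. At most two browsers would then run at the same time, and the other requests wait in the queue.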
It can run directly in a WeChat Mini Program cloud function without installing the library yourself: the official Node.js 10 runtime comes with puppeteer installed by default, so you can simply require it.
Let's look at the elapsed time and memory usage:
It took 2789 ms and used 135.74 MB of memory, which is entirely acceptable.
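To get comparable numbers when running the script locally, a small wrapper like the one sketched here can time any async task. `measure` is a hypothetical helper. Note that `process.memoryUsage().rss` only covers the Node.js process itself, not the separate Chrome process that puppeteer starts, so it will understate the real footprint.

```javascript
// Sketch only: runs an async task and reports wall-clock time and Node.js resident memory.
async function measure(task) {
  const start = process.hrtime();
  const result = await task();
  const [sec, nano] = process.hrtime(start);
  return {
    result,
    elapsedMs: Math.round(sec * 1000 + nano / 1e6),
    // Resident set size of the Node.js process only (Chrome is not included)
    rssMB: Number((process.memoryUsage().rss / 1024 / 1024).toFixed(2)),
  };
}
```

For example, `const stats = await measure(() => runLogin())` would return the result of a hypothetical `runLogin` function together with `elapsedMs` and `rssMB`.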
Conclusion: simulated login with puppeteer works for practically any kind of website, but it is somewhat slow.
Advantages: the biggest advantage is that it can handle websites with anti-crawler measures and JS-rendered websites.
Disadvantages: it is a memory black hole and cannot cope with high concurrency on a single server (cloud functions can solve this). It is also slow: because it simulates manual operation, input values are typed in one character at a time and each page has to finish loading, so it is much slower than calling the request interface directly.