One technique a day: how to change the proxy IP without restarting Puppeteer

Posted by tmyonline on Wed, 22 Dec 2021 16:53:25 +0100

When writing crawlers, always using the same IP makes it easy for a website to recognize and block you, so you need to use proxy IPs and change them frequently.

However, if you search online for how to change the proxy IP in Puppeteer, the solution you find is usually written like this:

const puppeteer = require('puppeteer');

(async() => {
  const browser = await puppeteer.launch({
    args: [ '--proxy-server=123.45.67.89:8888' ]
  });
  const page = await browser.newPage();
  await page.goto('http://httpbin.org/ip');
  await browser.close();
})();

This approach has a problem: the proxy is fixed when the browser is launched, so to change the IP you must restart the crawler. Is there a way to change the proxy IP without restarting the crawler?

There is. In fact, there are two.

Tunnel proxy

For some websites, we can avoid being blocked as long as every visit comes from a different IP, so we can use a tunnel proxy. The tunnel proxy provider gives us a single fixed domain name and port, which we set as the crawler's proxy. The provider automatically changes the outgoing IP for each request on its back end, so we don't have to manage it ourselves.

We use the tunnel proxy of Green Fruit Cloud [1] for the demonstration. It can be tried for free for 2 hours. The proxy address I obtained is: http://D5A913AF:B1DE2C46D321@tunnel.qg.net:11151. So I can change the proxy setting in the Puppeteer code above:

const puppeteer = require('puppeteer-core');

(async () => {
  const browser = await puppeteer.launch({
    args: ['--proxy-server=tunnel.qg.net:11151'],
    headless: false,
    executablePath: '/Applications/Microsoft Edge.app/Contents/MacOS/Microsoft Edge'
  });
  const page = await browser.newPage();
  // If the proxy does not require authentication, you can remove this line
  await page.authenticate({username: 'account', password: 'password'});
  let response = await page.goto('http://httpbin.org/ip');
  console.log('First visit: ', await response.text());
  response = await page.goto('http://httpbin.org/ip');
  console.log('Second visit: ', await response.text());
  await browser.close();
})();

The result is shown in the figure below:
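The tunnel address above carries the account and password inside the URL, while the code passes only the host and port to `--proxy-server` and hands the credentials to `page.authenticate()`. As far as I know, Chromium does not accept credentials in that flag, which is why they are split. The helper below is a minimal sketch of that split using Node's built-in `URL` class; `splitProxyUrl` is a hypothetical name and not part of Puppeteer.

```javascript
// Hypothetical helper (not part of Puppeteer): split a proxy URL that embeds
// credentials into the two pieces used in the example above.
function splitProxyUrl(proxyUrl) {
  const url = new URL(proxyUrl);
  return {
    // Value for the Chromium flag, host and port only
    serverArg: `--proxy-server=${url.host}`,
    // Value for page.authenticate(); null when the URL has no credentials
    credentials: url.username
      ? {
          username: decodeURIComponent(url.username),
          password: decodeURIComponent(url.password),
        }
      : null,
  };
}

const { serverArg, credentials } = splitProxyUrl(
  'http://D5A913AF:B1DE2C46D321@tunnel.qg.net:11151'
);
console.log(serverArg);   // --proxy-server=tunnel.qg.net:11151
console.log(credentials); // { username: 'D5A913AF', password: 'B1DE2C46D321' }
```

With this, `serverArg` would go into the `args` array of `puppeteer.launch()`, and `credentials`, when it is not null, into `page.authenticate()`.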

Dynamically modify proxy IP on demand

Changing the IP more often is not always better. If a website requires login and you keep changing the IP after logging in, this backfires and makes the website more suspicious that you are a crawler. There are also websites, such as Taobao, that automatically perform several 301 redirects when you visit a page. The IP must stay the same across these redirects, otherwise the site will block you.

So sometimes we need to change the proxy IP on demand: the developer decides when the IP should change.

To do this in Puppeteer, we can install a third-party module, puppeteer-page-proxy:
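One way to picture "on demand" is a small pool that keeps returning the same proxy for the whole session and only moves on when the developer asks. This is just a sketch under my own assumptions: `StickyProxyPool` is a hypothetical class, and the addresses are placeholders from a documentation IP range, not real proxies.

```javascript
// Hypothetical helper: hand out one proxy at a time and only move to the
// next one when the caller explicitly asks for it.
class StickyProxyPool {
  constructor(proxies) {
    if (proxies.length === 0) throw new Error('proxy list is empty');
    this.proxies = proxies;
    this.index = 0;
  }

  // Proxy to use for every request in the current session (login, redirects)
  current() {
    return this.proxies[this.index];
  }

  // Switch to the next proxy, wrapping around at the end of the list
  rotate() {
    this.index = (this.index + 1) % this.proxies.length;
    return this.current();
  }
}

// Placeholder addresses, not real proxies
const pool = new StickyProxyPool(['http://192.0.2.10:8080', 'http://192.0.2.11:8080']);
console.log(pool.current()); // http://192.0.2.10:8080
console.log(pool.current()); // http://192.0.2.10:8080 (unchanged until rotate() is called)
console.log(pool.rotate());  // http://192.0.2.11:8080
```

The crawler would call `current()` during a login session or a redirect chain, and `rotate()` only at a point where a new IP is safe, for example after being blocked.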

npm i puppeteer-page-proxy

After installation, let's look at an example:

const puppeteer = require('puppeteer-core')
const useProxy = require('puppeteer-page-proxy')

puppeteer.launch({headless: false, executablePath: '/Applications/Microsoft Edge.app/Contents/MacOS/Microsoft Edge'}).then(
    async browser => {
        console.log('start...')
        const page = await browser.newPage()

        // Route this page through the first proxy
        await useProxy(page, 'http://account:password@119.5.228.105:21477')
        console.log('change proxy success, start to visit url')
        let resp = await page.goto('http://httpbin.org/ip')
        console.log(await resp.text())

        // Switch to a second proxy without restarting the browser
        await useProxy(page, 'http://account:password@119.41.199.19:56214')
        console.log('change proxy success, start to visit url')
        resp = await page.goto('http://httpbin.org/ip')
        console.log(await resp.text())
    }
)

The result is shown in the figure below:

When we need to change the IP, we only need to execute await useProxy(page, 'http://account:password@IP:port') in the code. If your proxy does not require an account and password, change the call to await useProxy(page, 'http://IP:port').

Some people may ask: in the example above, the proxy is hard-coded. What if I need to request a URL to get a new proxy? This is also simple. Install the third-party module axios, use it to send a network request that returns a new proxy IP, and then switch to that proxy:
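Both forms of the proxy string can be produced by one small function. The sketch below is hypothetical (`buildProxyUrl` is my own name). It percent-encodes the credentials so that characters such as `@` or `:` in a password do not break the URL; whether a given proxy client decodes them again is an assumption you should check against your setup.

```javascript
// Hypothetical helper: build the proxy string passed to useProxy().
function buildProxyUrl({ host, port, username, password }) {
  // Assumption: the proxy client decodes percent-encoded credentials
  const auth = username
    ? `${encodeURIComponent(username)}:${encodeURIComponent(password || '')}@`
    : '';
  return `http://${auth}${host}:${port}`;
}

// Placeholder address, not a real proxy
console.log(buildProxyUrl({ host: '192.0.2.10', port: 21477 }));
// http://192.0.2.10:21477
console.log(buildProxyUrl({ host: '192.0.2.10', port: 21477, username: 'user', password: 'p@ss' }));
// http://user:p%40ss@192.0.2.10:21477
```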

npm i axios

Take the short-lived proxy IPs of Green Fruit Cloud as an example. It provides an API endpoint; calling it returns a short-lived IP that is valid for 5 to 15 minutes, as shown in the figure below:

After opening a trial account, you get a URL for extracting proxies, similar to the following:

https://proxy.qg.net/extract?Key=ABCDEFGH&Num=1&AreaId=&Isp=&DataFormat=txt&DataSeparator=%5Cn&Detail=0

Requesting this URL returns the proxy IP, as shown in the figure below:

Now, in Puppeteer, first request the URL to get the proxy, then set that proxy IP on the page, and finally visit the target web page. The corresponding code is as follows:
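Because the URL asks for DataFormat=txt, the body is plain text rather than JSON. The sketch below assumes the response contains one `IP:port` entry per line, which I infer from how the article's code concatenates it; the exact format is not confirmed here. `parseProxyList` is a hypothetical helper, and the sample addresses are placeholders.

```javascript
// Assumption: the txt response contains one "IP:port" entry per line.
function parseProxyList(text) {
  return text
    .split(/\r?\n/)
    .map((line) => line.trim())
    // Keep only lines that look like host:port; drop blanks and error messages
    .filter((line) => /^[^\s:]+:\d+$/.test(line));
}

// Placeholder addresses, not real proxies
console.log(parseProxyList('192.0.2.10:21477\n192.0.2.11:56214\n'));
// [ '192.0.2.10:21477', '192.0.2.11:56214' ]
```

Filtering this way means an unexpected body, such as an error message, yields an empty list instead of being used as a proxy address.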

const puppeteer = require('puppeteer-core')
const useProxy = require('puppeteer-page-proxy')
const axios = require('axios')

async function set_proxy(page) {
    // Fetch a fresh short-lived proxy ("IP:port") from the extraction URL
    const resp = await axios.get('https://proxy.qg.net/extract?Key=ABCDEFGH&Num=1&AreaId=&Isp=&DataFormat=txt&DataSeparator=%5Cn&Detail=0')
    // Trim the trailing separator before building the proxy URL
    const proxy = 'http://account:password@' + String(resp.data).trim()
    console.log('Obtained proxy IP is:', proxy)
    await useProxy(page, proxy)
}

puppeteer.launch({headless: false, executablePath: '/Applications/Microsoft Edge.app/Contents/MacOS/Microsoft Edge'}).then(
    async browser => {
        console.log('start...')
        const [page] = await browser.pages()
        await set_proxy(page)

        console.log('change proxy success, start to visit url')
        const resp = await page.goto('http://httpbin.org/ip')
        console.log(await resp.text())
    }
)

The result is shown in the figure below:
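Since a short-lived IP stays valid for 5 to 15 minutes, there is no need to request a new one before every page visit. The sketch below caches the proxy and refetches it only after a time limit; 5 minutes is used as the conservative bound. `createProxyCache` is a hypothetical helper, `fetchProxy` stands for any async function that returns `IP:port` (for example, the axios call above), and the injectable clock exists only so the behavior can be demonstrated without waiting.

```javascript
// Hypothetical helper: reuse a short-lived proxy until ttlMs has elapsed,
// then fetch a fresh one. fetchProxy is any async function returning "IP:port".
function createProxyCache(fetchProxy, ttlMs, now = Date.now) {
  let proxy = null;
  let fetchedAt = 0;
  return async function getProxy() {
    if (proxy === null || now() - fetchedAt >= ttlMs) {
      proxy = await fetchProxy();
      fetchedAt = now();
    }
    return proxy;
  };
}

// Demonstration with a fake fetcher and a fake clock (placeholder addresses)
async function demo() {
  const fiveMinutes = 5 * 60 * 1000;
  let clock = 0;
  let calls = 0;
  const getProxy = createProxyCache(async () => `192.0.2.${++calls}:8080`, fiveMinutes, () => clock);
  const first = await getProxy();  // fetches: 192.0.2.1:8080
  const second = await getProxy(); // reuses:  192.0.2.1:8080
  clock += fiveMinutes;            // five minutes later
  const third = await getProxy();  // fetches again: 192.0.2.2:8080
  return [first, second, third];
}

demo().then((result) => console.log(result));
```

In the crawler, `set_proxy` could call such a `getProxy()` instead of hitting the extraction URL every time, which also saves extraction quota.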

References

[1] Green Fruit Cloud: https://www.qg.net/business/proxyip/42.html