Amazon anti crawler? Take you step by step to analyze and cross Amazon's anti crawler mechanism

Posted by NateDawg on Sat, 29 Jan 2022 22:42:28 +0100

Hello, I'm lex. I like bullying Superman. Lex

Areas of expertise: python development, network security penetration, Windows domain controlled Exchange architecture

Today's focus: analyze and cross Amazon's anti crawler mechanism step by step

it happened like this

Amazon is the world's largest shopping platform

Many commodity information, user evaluation and so on are the most abundant.

Today, hand in hand to take you across Amazon's anti crawler mechanism

Crawl for the products, reviews and other useful information you want

Anti reptile mechanism

However, when we want to use crawlers to crawl relevant data information

Large shopping malls like Amazon, TBao and JD

In order to protect their data information, they all have a perfect anti crawler mechanism

Try Amazon's anti crawl mechanism first

We use several different python crawler modules to test step by step

Finally, the anti climbing mechanism was successfully crossed.

1, urllib module

The code is as follows:

# -*- coding:utf-8 -*-
import urllib.request
req = urllib.request.urlopen('https://www.amazon.com')
print(req.code)

Return result: status code: 503.

Analysis: Amazon identifies your request as a crawler and refuses to provide service.

With a scientific and rigorous attitude, let's try Baidu with 10000 people.

Return result: status code 200

Analysis: normal access

That shows that the request of urlib module is recognized as a crawler by Amazon and refuses to provide service

2, requests module

1. requests direct crawler access

The effect is as follows ↓↓

The code is as follows ↓↓

import requests
url='https://www.amazon.com/KAVU-Rope-Bag-Denim-Size/product-reviews/xxxxxx'
r = requests.get(url)
print(r.status_code)

Return result: status code: 503.

Analysis: Amazon also rejected the request of requses module

Identify it as a crawler and refuse to provide service.

2. We add cookie s to requests

Add relevant information such as request cookie s

The effect is as follows ↓↓

The code is as follows ↓↓

import requests

url='https://www.amazon.com/KAVU-Rope-Bag-Denim-Size/product-reviews/xxxxxxx'
web_header={
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0',
'Accept': '*/*',
'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
'Accept-Encoding': 'gzip, deflate, br',
'Connection': 'keep-alive',
'Cookie': 'Yours cookie value',
'TE': 'Trailers'}
r = requests.get(url,headers=web_header)
print(r.status_code)

Return result: status code: 200

Analysis: the returned status code is 200. It's normal. It smells like a reptile.

3. Check back page

Through the method of requests + cookies, we get the status code of 200

At present, at least Amazon's servers are providing services normally

We write the crawled page into the text and open it through the browser.

I step on a horse The return status is normal, but an anti crawler verification code page is returned.

Still blocked by Amazon.

3, selenium automation module

Installation of relevant selenium modules

pip install selenium

Introduce selenium into the code and set relevant parameters

import os
from requests.api import options
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

#selenium configuration parameters
options = Options()
#Configure headless parameters, that is, do not open the browser
options.add_argument('--headless')
#Configure selenium driver of Chrome browser 
chromedriver="C:/Users/pacer/AppData/Local/Google/Chrome/Application/chromedriver.exe"
os.environ["webdriver.chrome.driver"] = chromedriver
#Combine parameter setting + browser drive
browser = webdriver.Chrome(chromedriver,chrome_options=options)

Test access

url = "https://www.amazon.com"
print(url)
#Visit Amazon through selenium
browser.get(url)

Return result: status code: 200

Analysis: the returned status code is 200, and the access status is normal. Let's take a look at the information of the web page we climbed to.

Save the web page source code locally

#Write the crawled web page information to the local file
fw=open('E:/amzon.html','w',encoding='utf-8')
fw.write(str(browser.page_source))
browser.close()
fw.close()

Open the local file we crawled and view it,

We have successfully crossed the anti crawler mechanism and entered the home page of Amazon

ending

Through selenium module, we can successfully cross

Amazon's anti crawler mechanism.

Next: let's continue to introduce how to crawl hundreds of thousands of Amazon Product information and comments.

[if you have any questions, please leave a message ~ ~]

Programmer Think