Amazon anti-crawler? A step-by-step analysis of how to bypass Amazon's anti-crawler mechanism

Posted by NateDawg on Sat, 29 Jan 2022 22:42:28 +0100

Hello, I'm Lex. The Lex who enjoys bullying Superman.

Areas of expertise: Python development, network security penetration testing, Windows domain-controlled Exchange architecture

Today's focus: analyzing and bypassing Amazon's anti-crawler mechanism, step by step

How it all started

Amazon is the world's largest shopping platform,

and its product information, user reviews, and so on are among the richest available.

Today I'll take you, hand in hand, past Amazon's anti-crawler mechanism

to crawl the product, review, and other useful information you want.

The anti-crawler mechanism

However, when we want to crawl their data with a crawler,

large shopping platforms like Amazon, Taobao, and JD

all have well-developed anti-crawler mechanisms in place to protect their data.

 

Probing Amazon's anti-crawler mechanism

We test it step by step with several different Python crawler modules

and finally bypass the anti-crawler mechanism successfully.

1. The urllib module

The code is as follows:

# -*- coding:utf-8 -*-
import urllib.request
import urllib.error
try:
    req = urllib.request.urlopen('https://www.amazon.com')
    print(req.code)
except urllib.error.HTTPError as e:
    print(e.code)  # urlopen raises HTTPError on non-2xx responses

Returned result: status code 503.

Analysis: Amazon identifies your request as a crawler and refuses to provide service.

In a scientific and rigorous spirit, let's run the same test against Baidu, which tens of thousands of people use, as a control.
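The control test is the same request with only the URL swapped; a minimal sketch:

import urllib.request

# the same request, pointed at Baidu as a control
req = urllib.request.urlopen('https://www.baidu.com')
print(req.code)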

Returned result: status code 200

Analysis: normal access

 

This shows that Amazon recognizes the urllib module's request as a crawler and refuses to provide service.

 

2. The requests module

2.1 Direct crawler access with requests

The code is as follows:

import requests

url = 'https://www.amazon.com/KAVU-Rope-Bag-Denim-Size/product-reviews/xxxxxx'
# a bare GET request using the requests module's default headers
r = requests.get(url)
print(r.status_code)

Returned result: status code 503.

Analysis: Amazon also rejects the requests module's request,

identifying it as a crawler and refusing to provide service.

 

2.2 Adding cookies to the requests call

We add the request cookie and related header information.

The code is as follows:

import requests

url = 'https://www.amazon.com/KAVU-Rope-Bag-Denim-Size/product-reviews/xxxxxxx'
# headers copied from a real browser session, including the cookie
web_header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0',
    'Accept': '*/*',
    'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive',
    'Cookie': 'your cookie value',
    'TE': 'Trailers'}
r = requests.get(url, headers=web_header)
print(r.status_code)
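Tip: the Cookie value can be copied from your own browser: visit the page while logged in, open the developer tools, and grab the request headers from the Network tab.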

 

Returned result: status code 200

Analysis: the returned status code is 200. Normal so far. Now this is starting to smell like a working crawler.

 

2.3 Checking the returned page

Through requests + cookies, we got a status code of 200,

so at least Amazon's servers are now responding normally.

Let's write the crawled page to a file and open it in a browser.
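A minimal sketch of that step, continuing from the requests example above (the file path is just an example):

# save the response body so it can be inspected in a browser
with open('E:/amazon_check.html', 'w', encoding='utf-8') as fw:
    fw.write(r.text)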

 

 

Well, damn. The return status is normal, but what came back is an anti-crawler verification-code (CAPTCHA) page.

Still blocked by Amazon.
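You can also spot this block without opening a browser. A quick heuristic sketch (the marker string is an assumption; Amazon's robot-check page mentions a captcha in its markup):

# heuristic check on the response body from the requests example above
if 'captcha' in r.text.lower():
    print('Blocked: Amazon returned its anti-crawler verification page')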

 

3. The selenium automation module

Install the selenium module:

pip install selenium

Import selenium in the code and set the relevant parameters:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# selenium configuration parameters
options = Options()
# headless mode: do not open a visible browser window
options.add_argument('--headless')
# path to the chromedriver executable matching the local Chrome version
chromedriver = "C:/Users/pacer/AppData/Local/Google/Chrome/Application/chromedriver.exe"
# combine the driver path with the browser options (selenium 3 style call)
browser = webdriver.Chrome(chromedriver, options=options)

Test access

url = "https://www.amazon.com"
print(url)
# visit Amazon through selenium
browser.get(url)
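Note that selenium does not expose HTTP status codes directly; a simple sanity check, as a sketch, is to print the loaded page's title (a blocked request would typically show something like "Robot Check"):

# print the title of the loaded page as a quick success check
print(browser.title)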

 

Returned result: access normal

Analysis: the page loads normally without being rejected. Let's take a look at the content of the page we crawled.

 

Save the web page source code locally

# write the crawled web page source to a local file
with open('E:/amzon.html', 'w', encoding='utf-8') as fw:
    fw.write(str(browser.page_source))
browser.close()

 

Open the crawled local file and take a look:

we have successfully bypassed the anti-crawler mechanism and reached Amazon's home page.

 

Wrapping up

Through the selenium module, we can successfully bypass Amazon's anti-crawler mechanism.
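Putting the steps together, a minimal end-to-end sketch (the chromedriver and output paths are the examples used in this article):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# headless Chrome with the chromedriver path from earlier (selenium 3 style)
options = Options()
options.add_argument('--headless')
browser = webdriver.Chrome(
    "C:/Users/pacer/AppData/Local/Google/Chrome/Application/chromedriver.exe",
    options=options)

# load the Amazon home page and save its rendered source
browser.get('https://www.amazon.com')
with open('E:/amzon.html', 'w', encoding='utf-8') as fw:
    fw.write(browser.page_source)
browser.close()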

Next up: how to crawl hundreds of thousands of Amazon product listings and reviews.

[If you have any questions, please leave a comment~~]

 

Recommended reading

Python in action

[Python in action] For my ex-girlfriend's wedding, I used Python to crack the venue Wi-Fi and renamed it to

[Python in action] My ex-girlfriend sent an encrypted "520 happy.pdf"; after cracking it with Python, I found that

[Python in action] Last night I used Python to edit a selfie of the girl next door, and then found that...

[Python in action] My girlfriend was working overtime late at night and sent a selfie; her Python boyfriend uncovered an amazing secret with 30 lines of code

[Python in action] Python, you're too much -- record every keystroke with just 30 lines of code

[Python in action] I forgot the password to the goddess's photo album; it only took 20 lines of Python~~~

pygame series articles [subscribe to the column and get the complete source code]

Let's learn pygame together: 30 game development examples (2) -- a tower defense game

Let's learn pygame together: 30 game development examples (4) -- Tetris

Penetration testing practice column

Windows AD/Exchange management column

Linux high-performance server construction

PowerShell automation column

 


 

Topics: Python Selenium crawler