Reverse-engineering NetEase Cloud Music's JS encryption for a crawler (crawling comments)
This is my first blog post on this site. It probably has many shortcomings, and I will keep revising it. Please point out any mistakes you find.
1. Tools to prepare:
- PyCharm
- VS Code (for testing JavaScript)
- Node.js
- Any browser (Chrome or Edge recommended)
Once these are installed, make sure both Python and JavaScript run on your computer.
2. Prerequisite knowledge:
- How to request web pages with Python
- Basic JS syntax (so little is needed that it can almost be ignored)
3. Web page analysis
Website: https://music.163.com/#/discover/toplist?id=3778678 (this example crawls the comments of the NetEase Cloud Music hot songs chart. NetEase Cloud Music uses the same encryption code for music, comments, and other content, so this case serves as a representative; with slight modifications you can crawl other content.)
Open the website, press F12, click Network, refresh the page, and filter to Fetch/XHR only, as shown in the following figure:
Then inspect the content of each request in the Preview tab and locate the get... request, as shown in the figure:
Next, open the Headers tab and check the request headers. They are all ordinary parameters, with nothing that looks encrypted.
We are not done yet. The request headers tell us this is a POST request, so let's look at the form data it submits.
Click Payload.
The payload turns out to contain two parameters, params and encSecKey, and it is obvious at a glance that both are encrypted. So we go to the request's initiator to find where the encryption happens and work it out.
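To make the shape of that payload concrete, here is a minimal Node.js sketch of how two such fields look once they are URL-encoded as form data. The values are placeholders rather than captured data, and the URL-encoded form body is an assumption based on the "form data" view in DevTools.

```javascript
// Sketch: the two encrypted fields as a URL-encoded form body.
// Both values are placeholders, not real captured data.
const form = new URLSearchParams({
  params: 'PLACEHOLDER+base64/text==',
  encSecKey: 'placeholder0hex',
});

// Base64 characters such as "+", "/" and "=" get percent-encoded in the body.
const body = form.toString();
console.log(body);
```

This is why params usually looks different in the raw request body than in the decoded Payload view: the Base64 text is percent-encoded on the wire.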
Open the Initiator tab, as shown in the figure:
Click the first entry, click {} in the lower-left corner to pretty-print the code, and set a breakpoint on the line highlighted by the blue bar.
Repeat this for the first five entries.
Refresh the page and watch the parameter values shown at each breakpoint, as shown in the figure:
Find the encrypted parameters we need here, then see how they are generated. As the figure above shows, the variable to watch is bUM2x: our encrypted parameters come from it. On line 13413 in the figure above, it is generated as follows:
var bUM2x = window.asrsea(JSON.stringify(i6c), bsG7z(["shed tears", "strong"]), bsG7z(WW3x.md), bsG7z(["love", "girl", "terrified", "laugh"]));
Let's break down this line of code (in the site's own source the string literals are Chinese; they appear here in translation):
Encryption function: window.asrsea()
Function arguments:
- JSON.stringify(i6c)
- bsG7z(["shed tears", "strong"])
- bsG7z(WW3x.md)
- bsG7z(["love", "girl", "terrified", "laugh"])
Set the four arguments aside for now and first look at how the function encrypts. If it is simple, rewrite it in Python. If it is too complex, extract the code directly, save it as a JS script, and run it from Python with the execjs package. To find the function, type window.asrsea in the console; the console prints the function and lets you click through to its definition. Alternatively, hover the mouse over window.asrsea in the source and the function location is shown automatically. Click to jump to it, as shown in the figure:
Once there, we find that the function is actually called d(d, e, f, g). Its encryption logic is as follows:
function d(d, e, f, g) {
    var h = {}
      , i = a(16);
    return h.encText = b(d, g),
    h.encText = b(h.encText, i),
    h.encSecKey = c(i, e, f),
    h
}
Let's analyze how the function works:
- It calls three unknown functions: a(), b(), and c().
- Encryption process: i = a(16); h.encText goes through b() twice (first with g, then with i); h.encSecKey goes through c() once.
Now we need to locate these three functions. Use the same method as before: hover the mouse over a function name to reveal its location.
Here are the a(), b(), and c() functions we found:
What do you notice? b() and c() themselves call further unknown functions. In this case, rather than analyzing everything and reimplementing it in Python, we can take another route: extract the encryption-related JS code, put it in a JS file, and obtain the encrypted parameters by executing that file. This requires a JavaScript runtime on your computer: Node.js to run the code, with VS Code as the editor. Configuration details are easy to find with a web search, so I won't explain them here. Paste the extracted JS code together into one file:
// a(): returns a random string of the given length
function a(a) {
    var d, e, b = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789", c = "";
    for (d = 0; a > d; d += 1)
        e = Math.random() * b.length,
        e = Math.floor(e),
        c += b.charAt(e);
    return c
}
// b(): AES-CBC encryption of a with key b and a fixed IV
function b(a, b) {
    var c = CryptoJS.enc.Utf8.parse(b)
      , d = CryptoJS.enc.Utf8.parse("0102030405060708")
      , e = CryptoJS.enc.Utf8.parse(a)
      , f = CryptoJS.AES.encrypt(e, c, {
        iv: d,
        mode: CryptoJS.mode.CBC
    });
    return f.toString()
}
// c(): RSA encryption of a with the key pair built from b and c
function c(a, b, c) {
    var d, e;
    return setMaxDigits(131),
    d = new RSAKeyPair(b,"",c),
    e = encryptedString(d, a)
}
function d(d, e, f, g) {
    var h = {}
      , i = a(16);
    return h.encText = b(d, g),
    h.encText = b(h.encText, i),
    h.encSecKey = c(i, e, f),
    h
}
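As a side check on what a() and b() do, the two-pass AES step can be sketched with Node's built-in crypto module instead of CryptoJS. This is a sketch under assumptions, not the site's code: PLACEHOLDER_G is a made-up 16-character key standing in for g (its real value is not shown in this article), the key is assumed to be 16 ASCII characters so that AES-128 applies, and the helper names are my own. The c() step is left out.

```javascript
const crypto = require('crypto');

// Fixed IV taken from b() in the extracted code.
const IV = Buffer.from('0102030405060708', 'utf8');

// Hypothetical placeholder for g; the real value is not shown in this article.
const PLACEHOLDER_G = '0123456789abcdef';

// Counterpart of a(n): a random string of n alphanumeric characters.
function randomKey(n) {
  const chars = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789';
  let out = '';
  for (let k = 0; k < n; k += 1) {
    out += chars.charAt(Math.floor(Math.random() * chars.length));
  }
  return out;
}

// Counterpart of b(text, key): AES-128-CBC, PKCS#7 padding, Base64 output.
function aesEncrypt(text, key) {
  const cipher = crypto.createCipheriv('aes-128-cbc', Buffer.from(key, 'utf8'), IV);
  return Buffer.concat([cipher.update(text, 'utf8'), cipher.final()]).toString('base64');
}

// Inverse of aesEncrypt, used here only to check the round trip.
function aesDecrypt(b64, key) {
  const decipher = crypto.createDecipheriv('aes-128-cbc', Buffer.from(key, 'utf8'), IV);
  return Buffer.concat([decipher.update(b64, 'base64'), decipher.final()]).toString('utf8');
}

// The encText part of d(): encrypt with g first, then with the random key.
function doubleEncrypt(payload, g, sessionKey) {
  return aesEncrypt(aesEncrypt(payload, g), sessionKey);
}

const sessionKey = randomKey(16);
const encText = doubleEncrypt('{"csrf_token":""}', PLACEHOLDER_G, sessionKey);
console.log(encText);
```

Decrypting in the reverse order (random key first, then g) recovers the original JSON, which is a quick way to confirm that the extracted b() behaves as expected. The rest of this article continues with the extracted CryptoJS version.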
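Then add a run() function at the end to execute it:
function run() {
    // The four arguments are empty for now; their real values are filled in later.
    var dd = '';
    var e = '';
    var f = '';
    var g = '';
    var d1 = d(dd, e, f, g);
    var data = {
        params: d1.encText,
        encSecKey: d1.encSecKey
    };
    console.log(data);
}
run()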
Now run it:
Uncaught ReferenceError: CryptoJS is not defined
As expected, we get an error: CryptoJS cannot be found. It is used in b(), and it comes from an external package, so we need to import it. In the whole script, CryptoJS is the only thing that has to be imported externally; everything else can be copied from the page source.
npm install crypto-js  # run this in the terminal
After installing it, import it at the top of the JS file as well:
const CryptoJS = require('crypto-js');
Run again:
setMaxDigits is not defined
This is expected too. Back in the browser: setMaxDigits() is called inside c(). Hover over the function name, jump to its definition as before, copy the code, and paste it into the JS file. Repeat this for each missing function. Here is a trick for later, when the file has grown: when an error says a function is not defined, use Ctrl+F to find which function references it. For example:
biToHex is not defined
By this point there are already hundreds of lines of code,
so we simply press Ctrl+F:
Then go to the browser and find the encryptedString(a, b) function that references it. Locate it with the same method as before, then copy the code and paste it in.
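The Ctrl+F step can also be automated. Below is a rough sketch of a helper (findCallers is a name invented for this example) that lists which named functions in a piece of source text call a given identifier. It relies on a simple regular-expression heuristic rather than a real parser, so treat the result as a hint only.

```javascript
// Sketch: list the named functions in a source string that call `name`.
// Heuristic only: splits on "function xyz(" declarations, not a real parser.
function findCallers(source, name) {
  const escaped = name.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  const uses = new RegExp('\\b' + escaped + '\\s*\\(');
  const chunks = source.split(/(?=\bfunction\s+[A-Za-z_$][\w$]*\s*\()/);
  const callers = [];
  for (const chunk of chunks) {
    const m = chunk.match(/^function\s+([A-Za-z_$][\w$]*)\s*\(/);
    if (!m || m[1] === name) continue; // skip preamble and the definition itself
    if (uses.test(chunk.slice(m[0].length))) callers.push(m[1]);
  }
  return callers;
}

// Toy source standing in for the extracted file.
const sample = `
function biToHex(x) { return x.toString(16); }
function encryptedString(key, s) { return biToHex(s.length); }
function setMaxDigits(n) { return n; }
`;

const callers = findCallers(sample, 'biToHex');
console.log(callers);
```

To use it on real code, read the extracted file with fs.readFileSync and pass its text as source.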
Finally, at around 370 lines of code, the script runs without errors. But it only runs; it does not output anything. Why?
Let's go back to the run() function. The arguments dd, e, f, and g that we pass in are all empty. Next, we need to find their actual values.
function run() {
    // dd, e, f and g are still empty at this point.
    var dd = '';
    var e = '';
    var f = '';
    var g = '';
    var d1 = d(dd, e, f, g);
    var data = {
        params: d1.encText,
        encSecKey: d1.encSecKey
    };
    console.log(data);
}
run()
Remember how we found the d() function?
d1=d(dd,e,f,g);
Through this:
So we can read the four arguments in the console and copy out their values.
There is one more pitfall here. Look at the Network panel:
Nothing is listed yet, so we don't know which request the encrypted parameters belong to.
Every request here except batch carries encrypted parameters, so we need to capture the parameters one by one and match them to their requests:
Doing this also shows that only the first argument, dd, changes; the other three stay the same.
Here is the dd we found:
'{"rid":"A_PL_0_3778678","threadId":"A_PL_0_3778678","pageNo":"1","pageSize":"20","cursor":"-1","offset":"0","orderType":"1","csrf_token":""}'
Then substitute it into the run() code:
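Instead of pasting the JSON text by hand, dd can be built from an object with JSON.stringify. A minimal sketch follows; buildCommentPayload is a name made up for this example, and the default field values are simply copied from the dd string captured above. How the fields should change for other pages is not covered here.

```javascript
// Sketch: build the dd string from an object instead of pasting it by hand.
// Defaults are copied from the captured dd; pass overrides to change any field.
function buildCommentPayload(threadId, overrides = {}) {
  const payload = {
    rid: threadId,
    threadId: threadId,
    pageNo: '1',
    pageSize: '20',
    cursor: '-1',
    offset: '0',
    orderType: '1',
    csrf_token: '',
    ...overrides,
  };
  return JSON.stringify(payload);
}

const dd = buildCommentPayload('A_PL_0_3778678');
console.log(dd);
```

The resulting string is what goes into run() as the first argument of d().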
The encrypted parameters are obtained successfully. Follow-up articles are coming, so please stay tuned. Thank you. I am still a beginner at web crawling; if I have made mistakes, please point them out. If this helped you, please give it a like. If this article affects your interests in any way, please let me know and I will revise it immediately. This article is also published on my personal blog.