[Python crawler advanced learning] - JS reverse hundred examples - Baidu translation interface parameter reverse

Posted by freenity on Fri, 21 Jan 2022 02:04:02 +0100

 

Reverse target

Reverse process

Packet capture analysis

We can see that the translation result comes out without refreshing the page when we casually enter text on the baidu translation page. It can be inferred that it is loaded by Ajax. Open the developer tool and select XHR to filter Ajax requests. You can see that there is a URL of https://fanyi.baidu.com/v2transapi?from=zh&to=en When we enter "test", the data returned by the POST request is similar to the following structure after Unicode to Chinese:

{
	"trans_result": {
		"data": [{
			"dst": "test",
			"prefixWrap": 0,
			"result": [
				[0, "test", ["0|6"],
					[],
					["0|6"],
					["0|4"]
				]
			],
			"src": "test"
		}],
		"from": "zh",
		"status": 0,
		"to": "en",
		"type": 2
	},
	"dict_result": {
        // slightly
    },
	"liju_result": {
        // slightly
    }
}

trans_result is the result of translation, dict_result is more translation results, liju_result is an example sentence, tag, etc., so this URL is the translation interface we need.

Since it is a POST request, we observe its Form Data:

  • from: language to be translated;
  • to: target language;
  • query: string to be translated;
  • transtype: real-time translation # realtime, manually click translate # translang;
  • simple_means_flag, domain: fixed value;
  • sign: if the string to be translated changes, its value will also change, which needs further analysis;
  • token: its value will not change, but I don't know how it came from. It needs further analysis.

In the process of capturing packets, we also noticed that a URL is https://fanyi.baidu.com/langdetect And the data it returns is as follows:

{"error":0,"msg":"success","lan":"zh"}

Obviously, this is a language that automatically detects the string to be translated. Its Form Data is also very simple. query is the string to be translated. This interface can be used according to the actual scene.

 

Get token

Because the value of token is fixed, we can search directly. We can find it in the source code of the home page and extract it directly by using regular expressions.

 

Get sign

Sign , will change. It is suspected that it is generated dynamically by JS, so we try to search , sign globally. Here is a trick. Only searching , sign , will produce many results. You can add colon or equal sign to narrow the scope. Search , sign: you can search in index_a8b7098.js, we can find that the data at line 8392 is the most comprehensive, which is consistent with the Form Data seen in the previous packet capture.

Click the line number, bury the breakpoint here, and click the translate button to see the successful disconnection. At this time, the value of {sign} is the final value we want:

Here, the string to be translated is passed into the {L} function, the mouse is placed on the} L} function, and directly click the follow-up function. It can be found that the value of {sign} is actually obtained after a series of operations of the {function e(r) function. Directly copy the function for local debugging. During debugging, it can be found that there is a missing value of {i} in the Closure column on the right, Or select , i with the mouse, you can see the value of , i , which is fixed after multiple debugging, and you can write it directly:

 

If you continue to debug function e(r), you will also be prompted that a function n is missing. Then follow up the function directly and copy the function n together.

 

Complete code

baidu_encrypt.js

Get the value of # sign #:

var i = '320305.131321201'

function n(r, o) {
    for (var t = 0; t < o.length - 2; t += 3) {
        var a = o.charAt(t + 2);
        a = a >= "a" ? a.charCodeAt(0) - 87 : Number(a), a = "+" === o.charAt(t + 1) ? r >>> a : r << a, r = "+" === o.charAt(t) ? r + a & 4294967295 : r ^ a
    }
    return r
}

function e(r) {
    var o = r.match(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g);
    if (null === o) {
        var t = r.length;
        t > 30 && (r = "" + r.substr(0, 10) + r.substr(Math.floor(t / 2) - 5, 10) + r.substr(-10, 10))
    } else {
        for (var e = r.split(/[\uD800-\uDBFF][\uDC00-\uDFFF]/), C = 0, h = e.length, f = []; h > C; C++) "" !== e[C] && f.push.apply(f, a(e[C].split(""))), C !== h - 1 && f.push(o[C]);
        var g = f.length;
        g > 30 && (r = f.slice(0, 10).join("") + f.slice(Math.floor(g / 2) - 5, Math.floor(g / 2) + 5).join("") + f.slice(-10).join(""))
    }
    var u = void 0, l = "" + String.fromCharCode(103) + String.fromCharCode(116) + String.fromCharCode(107);
    u = null !== i ? i : (i = window[l] || "") || "";
    for (var d = u.split("."), m = Number(d[0]) || 0, s = Number(d[1]) || 0, S = [], c = 0, v = 0; v < r.length; v++) {
        var A = r.charCodeAt(v);
        128 > A ? S[c++] = A : (2048 > A ? S[c++] = A >> 6 | 192 : (55296 === (64512 & A) && v + 1 < r.length && 56320 === (64512 & r.charCodeAt(v + 1)) ? (A = 65536 + ((1023 & A) << 10) + (1023 & r.charCodeAt(++v)), S[c++] = A >> 18 | 240, S[c++] = A >> 12 & 63 | 128) : S[c++] = A >> 12 | 224, S[c++] = A >> 6 & 63 | 128), S[c++] = 63 & A | 128)
    }
    for (var p = m, F = "" + String.fromCharCode(43) + String.fromCharCode(45) + String.fromCharCode(97) + ("" + String.fromCharCode(94) + String.fromCharCode(43) + String.fromCharCode(54)), D = "" + String.fromCharCode(43) + String.fromCharCode(45) + String.fromCharCode(51) + ("" + String.fromCharCode(94) + String.fromCharCode(43) + String.fromCharCode(98)) + ("" + String.fromCharCode(43) + String.fromCharCode(45) + String.fromCharCode(102)), b = 0; b < S.length; b++) p += S[b], p = n(p, F);
    return p = n(p, D), p ^= s, 0 > p && (p = (2147483647 & p) + 2147483648), p %= 1e6, p.toString() + "." + (p ^ m)
}

// console.log(e('test '))

baidufanyi.py#

#!/usr/bin/env python3
# -*- coding: utf-8 -*-


import re

import execjs
import requests


index_url = 'https://fanyi.baidu.com/'
lang_url = 'https://fanyi.baidu.com/langdetect'
translate_api = 'https://fanyi.baidu.com/v2transapi'
headers = {
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'zh,zh-CN;q=0.9,en-US;q=0.8,en;q=0.7',
    'Connection': 'keep-alive',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'Cookie': 'BIDUPSID=3BE16D933E9C0182F2A6E93D7A9D1424; PSTM=1623723330; BAIDUID=8496908995397662040287D2CE1C4224:FG=1; __yjs_duid=1_779078c2c847bb3217554b8549ad49bd1623728424311; REALTIME_TRANS_SWITCH=1; HISTORY_SWITCH=1; FANYI_WORD_SWITCH=1; SOUND_SPD_SWITCH=1; SOUND_PREFER_SWITCH=1; BDSFRCVID_BFESS=BkFOJeCT5G3_WP5eFqJ2T4D2p2KKN9OTTPjcTR5qJ04BtyCVNKsaEG0PtOgMNBDbJ2MRogKKLgOTHULF_2uxOjjg8UtVJeC6EG0Ptf8g0M5; H_BDCLCKID_SF_BFESS=tJ4toCPMJI_3fP36q45HMt00qxby26PDajn9aJ5nQI5nhU7505oqDJ0Z0ROOWhRute3i2DTvQUbmjRO206oay6O3LlO83h5wW57KKl0MLPbcep68LxODy6DI0xnMBMnr52OnaU513fAKftnOM46JehL3346-35543bRTLnLy5KJYMDF4D5_ae5O3DGRf-b-XKD600PK8Kb7VbUF6qfnkbft7jtteyhbTJCID-UQKQPnc_pC4yURFef473b3B5h3NJ66ZoIbPbPTTSlroKPQpQT8r5-nMWx6G3IrZoq64ab3vOpRTXpO13fAzBN5thURB2DkO-4bCWJ5TMl5jDh3Mb6ksD-FtqjDjJRCOoI--f-3bfTrP-trf5DCShUFs3tnlB2Q-5M-a3KOrSUtGbfjay6D7j-8HbTjiW2_82MbmLncjSM_GKfC2jMD32tbpWfneKmTxoUJ2Bb3Y8loe-xCKXqDebPRiWPb9QgbP2pQ7tt5W8ncFbT7l5hKpbt-q0x-jLTnhVn0MBCK0HPonHjKbDTvL3f; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; Hm_lvt_64ecd82404c51e03dc91cb9e8c025574=1624438637,1624603638,1624928461,1624953786; H_PS_PSSID=34131_34099_31253_34004_33607_34107_34135; delPer=0; PSINO=6; BAIDUID_BFESS=8496908995397662040287D2CE1C4224:FG=1; BDRCVFR[X_XKQks0S63]=mk3SLVN4HKm; Hm_lpvt_64ecd82404c51e03dc91cb9e8c025574=1624962661; __yjs_st=2_MzJhZTMxZGU5MjZjNGJiZTJiZjQwYjVkMWM5ZjYyMGFjZDlkMDJmNTU3OGU5ZTM4N2JjNjNkODAwYWJiY2M3NDA1NWEyODNkMzNkMDEzNThiZTU4NzNhMTQxYzIxOTQyMzg3MjhiMzA5ZjY2MDczZTBhZDdmZDg4YTFhNjVmZTMwZTYyZTRjNmRhMWNmYzg3NDFjODYzYTRlZTE2NzBmODAyMWI4MTI3NTZmNjg1MDk4OWIxZTYzNTc4NzhjY2E3NzU3ZGYyZmI1ODdjZTM5ZDNlOGU0ZGQ2NzE5OGU2NzUzM2ZhZTcxZmVjNjI4MDIyN2Y1N2NlMzZmMmRlY2U4Yl83XzQ5NzQ4ZWE4; ab_sr=1.0.1_MmUwODU0NGE4NjIwZmY4NjgxZmM1NGYxOTI5ZWQwOGU2NjU3ZjgwNzhkMTNjNDI5NWE0ODQwYzlkZDVjY2Q1YWEyZDQyZWI0ZjNkMWQ0NTEyMGFjYzdiNDdmNzYxYjNiMjkxZTI1M2I3Y2VhZGE3NDEzOTgyMjY1MjBlZGM4OGJiZGVjMzFkYTM3ODgyMTRkZjJhMGYzNGM0MGJmMGY1Yg==',
    'Host': 'fanyi.baidu.com',
    'Origin': 'https://fanyi.baidu.com',
    'Referer': 'https://fanyi.baidu.com/',
    'sec-ch-ua': '" Not;A Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"',
    'sec-ch-ua-mobile': '?0',
    'Sec-Fetch-Dest': 'empty',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Site': 'same-origin',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest'
}


def get_token():
    response = requests.get(url=index_url, headers=headers).text
    token = re.findall(r"token: '([0-9a-z]+)", response)[0]
    return token


def get_sign(query):
    with open('baidu_encrypt.js', 'r', encoding='utf-8') as f:
        baidu_js = f.read()
    sign = execjs.compile(baidu_js).call('e', query)
    return sign


def get_result(lang, query, sign, token):
    data = {
        'from': lang,
        'to': 'en',
        'query': query,
        'transtype': 'realtime',
        'simple_means_flag': '3',
        'sign': sign,
        'token': token,
    }
    response = requests.post(url=translate_api, headers=headers, data=data)
    result = response.json()['trans_result']['data'][0]['dst']
    return result


def main():
    query = input('Please enter the text to be translated:')
    response = requests.post(url=lang_url, headers=headers, data={'query': query})
    lang = response.json()['lan']
    token = get_token()
    sign = get_sign(query)
    result = get_result(lang, query, sign, token)
    print('The results translated into English are:', result)


if __name__ == '__main__':
    main()

 

Topics: Javascript