Crawling Baidu translation (can be translated into Chinese and English)

Posted by Dark_AngeL on Tue, 25 Jan 2022 01:41:41 +0100

Due to an introductory course in Python next semester

So I've been groping for myself during the winter vacation. After all, I can't drop out at that time. It's also a water credit

On a whim recently, I plan to try climbing Baidu translation

After a day's work, the liver finally came out

Don't talk too much and start it directly (the environment is Python 3.8 PyCharm Community Edition 2021.3.1)

Basic steps

Baidu translator will recognize crawlers, so you have to hide them with headers

Take chorme browser as an example

Right click on Baidu translation page and select "check" (or directly F12)

The following interface is displayed

Select network fetch / XHR headers

Then we can see the header we need

What we need are cookies and user agent, which are used to indicate that a specific user opens this website through a browser

That is, camouflage reptiles

Then we can copy it into pychar

1 headers = {"User-Agent": Your User-Agent, "Cookie": Your Cookie}
2 # Fill in the information you obtained later User-Agent and Cookie that will do

Submit Form

After the camouflage, you need to prepare the crawler to submit the form to the website

But before we submit, we need to see what data we want to submit

Continue viewing the website

We can see that there is a form data

from: en
to: zh  # From English to Chinese
query: ant  # search ant word
transtype: realtime  # May it mean real-time query?
simple_means_flag: 3
sign: 210056.513977
token: 97c823341ff704dea2625870404fcec4  # Baidu translation key information for identification sign and token
domain: common

This is the data we want to submit

But the form we submit is dynamic, so we need to rewrite the data

that is

data = {
            "from": "en",
            "to": "zh",
            "query": custom_input,
            "transtype": "translang",
            "simple_means_flag": "3",
            "token": '97c823341ff704dea2625870404fcec4'
        }

Get responses and process results

We consider that after submitting the data, we need to receive feedback from the website

So continue to see where the returned translation is

We will find that there seems to be some difference between what we want and what we actually want

The result is yes, but it's not Chinese, it's Unicode

There are always ways

response = requests.post(url='https://fanyi.baidu.com/v2transapi', headers=headers, timeout=1, data=data)
response.encoding = 'utf-8'
print(response.status_code)  # Get status code
print(re.search("[\\u4e00-\\u9fa5]+", response.content.decode('unicode_escape'), flags=re.S)[0])  # Regular expression to find Chinese characters

In this way, what is printed out is Chinese~

It was a surprise

Almost ready to submit!

Then I excitedly submitted the data

Baidu translation gave me a big mouth and profound lesson

Please enter the translation mode to select
[1]Chinese English translation
[2]Chinese English translation
1
Please enter the English to be translated
apple
200
unknown error
 
The process has ended,Exit code 0

What's going on? Apple's translation should be apple, not an unknown error

Then I found that the previous data missed a sign

sign is calculated differently for different words, but it is fixed relative to words

Fortunately, many online giants found the sign algorithm

https://blog.csdn.net/qq_38534107/article/details/90440403?utm_source=app&app_version=5.0.0&code=app_1562916241&uLinkId=usr1mkqgl919blen

If you are interested, you can see the acquisition of sign algorithm

Finally, paste the sign, and it's successful!

Eliminate warning

But there will be a Warning

Please enter the translation mode to select
[1]Chinese English translation
[2]Chinese English translation
1
Please enter the English to be translated
apple
200
Apple
F:/Python/New/main.py:40: DeprecationWarning: invalid escape sequence '\/'
  print(re.search("[\\u4e00-\\u9fa5]+", response.content.decode('unicode_escape'), flags=re.S)[0])
 
The process has ended,Exit code 0

There is a warning under the translation result, which is not good-looking

So I found a way to join this

import warnings
 
warnings.filterwarnings("ignore", category=Warning)  # Close discard error

There will be no mistakes~

So far, the functions in English translation are almost done

Chinese English translation is basically the same, but there are many things returned, which can be filtered through this statement

print(re.findall(pattern='[a-zA-Z]+', string=response.content.decode('unicode_escape'), flags=re.S)[4])

That's about it~

All codes:

main.py

import requests
from sign import sign
import re
import warnings
 
warnings.filterwarnings("ignore", category=Warning)  # Close discard error
 
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) "
                         "Chrome/66.0.3359.139 Safari/537.36",
           "Cookie": 'BIDUPSID=248487DDE4F4874C768DD664800AFB01; '
                     'PSTM=1624632627; '
                     '__yjs_duid'
                     '=1_9e9a49b48ccf294be969148528d703281624677345512; FANYI_WORD_SWITCH=1; HISTORY_SWITCH=1; '
                     'REALTIME_TRANS_SWITCH=1; SOUND_SPD_SWITCH=1; SOUND_PREFER_SWITCH=1; APPGUIDE_10_0_2=1; '
                     'BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; BAIDUID=39C416629357EBAB497629178C0541C1:FG=1; '
                     'BDUSS'
                     '=m9DMm1RUFZTTFBCNmdZUUFhY3lpeUR4Y3NNRW5SdThvb3FpTnZDNWdXNWRyeEJpSVFBQUFBJCQAAAAAAAAAAAEAAACSX1uneHp5MjAwMzIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAF0i6WFdIulhZ; BDUSS_BFESS=m9DMm1RUFZTTFBCNmdZUUFhY3lpeUR4Y3NNRW5SdThvb3FpTnZDNWdXNWRyeEJpSVFBQUFBJCQAAAAAAAAAAAEAAACSX1uneHp5MjAwMzIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAF0i6WFdIulhZ; H_PS_PSSID=35410_35105_31254_35774_34584_35490_35693_35796_35324_26350_35744; BAIDUID_BFESS=39C416629357EBAB497629178C0541C1:FG=1; BCLID=11903837222192425398; BDSFRCVID=meFOJeC627p69AjHgenlU9pUEeQF9_oTH6aoc1Pmnv6SwQ5bF3wEEG0PEM8g0Kub1VDqogKKQgOTHRCF_2uxOjjg8UtVJeC6EG0Ptf8g0f5; H_BDCLCKID_SF=tbFqoKI5JK03J-Fk-R6BMtCbMfQyetJyaR0tXJvvWJ5TMCoJ0-c25-InbPvwblL8-NT42-ovyJ6_ShPC-tnc3M4nKxC82Mb8Qa743qbX3l02Vhvae-t2ynLIQMFLQ-RMW23I0h7mWUoTsxA45J7cM4IseboJLfT-0bc4KKJxbnLWeIJIjjC5DTOXjH8OtTnfb5kXWnbEatD_Hn7zeUDWeM4pbt-qJqTzLNQLWqnjBpRBSDTx3fo1j4tUXxTnBT5KaKTvaCTw5l7KHq32yqKKQlKkQN3TWxuO5bRi5Roy-q3FDn3oypQJXp0n04bly5jtMgOBBJ0yQ4b4OR5JjxonDh83bG7MJPKtfJut_I05JID-bnPk5PQ_b-40Mq0X5-RLfKj-Kq7F5l8-hC3xj6rNMxksbfTQL6cjQmT-blLXXb7xOKQphP-a0-uH5Gjg-h_tKeFeLh5N3KJmsqC9bT3v5tjL34OD2-biWa6M2MbdLqOP_IoG2Mn8M4bb3qOpBtQmJeTxoUJ25DnJhbLGe4bK-TrLjHKftxK; BCLID_BFESS=11903837222192425398; BDSFRCVID_BFESS=meFOJeC627p69AjHgenlU9pUEeQF9_oTH6aoc1Pmnv6SwQ5bF3wEEG0PEM8g0Kub1VDqogKKQgOTHRCF_2uxOjjg8UtVJeC6EG0Ptf8g0f5; H_BDCLCKID_SF_BFESS=tbFqoKI5JK03J-Fk-R6BMtCbMfQyetJyaR0tXJvvWJ5TMCoJ0-c25-InbPvwblL8-NT42-ovyJ6_ShPC-tnc3M4nKxC82Mb8Qa743qbX3l02Vhvae-t2ynLIQMFLQ-RMW23I0h7mWUoTsxA45J7cM4IseboJLfT-0bc4KKJxbnLWeIJIjjC5DTOXjH8OtTnfb5kXWnbEatD_Hn7zeUDWeM4pbt-qJqTzLNQLWqnjBpRBSDTx3fo1j4tUXxTnBT5KaKTvaCTw5l7KHq32yqKKQlKkQN3TWxuO5bRi5Roy-q3FDn3oypQJXp0n04bly5jtMgOBBJ0yQ4b4OR5JjxonDh83bG7MJPKtfJut_I05JID-bnPk5PQ_b-40Mq0X5-RLfKj-Kq7F5l8-hC3xj6rNMxksbfTQL6cjQmT-blLXXb7xOKQphP-a0-uH5Gjg-h_tKeFeLh5N3KJmsqC9bT3v5tjL34OD2-biWa6M2MbdLqOP_IoG2Mn8M4bb3qOpBtQmJeTxoUJ25DnJhbLGe4bK-TrLjHKftxK; delPer=0; PSINO=3; Hm_lvt_64ecd82404c51e03dc91cb9e8c025574=1641456854,1642661186,1642662678,1642687449; Hm_lpvt_64ecd82404c51e03dc91cb9e8c025574=1642688201; BA_HECTOR=248g858580ak84a0u91guirq80q; ab_sr=1.0.1_MjM4OGFjMTZiZjUyMmYxMmU5NDhjY2FkZDkzNzRkMzZkZGUxN2RmMmY1NzEwYzg5ZDlmYTk2YTIzZmM0ODBlMzJlYzAwNDMxNjllNjk3OGMxZDJmMzI1NjNiNjlhNjExNTEzYmNkZTFlZWNjYzI4ZGVmZTA4NDk3ODBjYThlYzM='}
 
if __name__ == '__main__':
    print("Please enter the translation mode to select")
    choose = int(input("[1]Chinese English translation\n[2]Chinese English translation\n"))
    while choose != 1 and choose != 2:
        print("Wrong! Please re-enter")
        choose = int(input("[1]Chinese English translation\n[2]Chinese English translation\n"))
        data = {}
    if choose == 1:
        custom_input = input('Please enter the English to be translated\n')
        data = {
            "from": "en",
            "to": "zh",
            "query": custom_input,
            "transtype": "translang",
            "simple_means_flag": "3",
            "token": '97c823341ff704dea2625870404fcec4',
            "sign": sign(custom_input)
        }
        response = requests.post(url='https://fanyi.baidu.com/v2transapi', headers=headers, timeout=1, data=data)
        response.encoding = 'utf-8'print(re.search("[\\u4e00-\\u9fa5]+", response.content.decode('unicode_escape'), flags=re.S)[0])
    elif choose == 2:
        custom_input = input('Please enter the Chinese to be translated into English\n')
        data = {
            "from": "zh",
            "to": "en",
            "query": custom_input,
            "transtype": "translang",
            "simple_means_flag": "3",
            "token": '97c823341ff704dea2625870404fcec4',
            "sign": sign(custom_input)
        }
        response = requests.post(url='https://fanyi.baidu.com/v2transapi', headers=headers, timeout=1, data=data)
        response.encoding = 'utf-8'
        print(re.findall(pattern='[a-zA-Z]+', string=response.content.decode('unicode_escape'), flags=re.S)[4])

sign.py

import js2py
import requests
import re
 
def sign(word):
    session = requests.Session()
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36"}
    session.headers = headers
    response = session.get("http://fanyi.baidu.com/")
    gtk = re.findall(";window.gtk = ('.*?');", response.content.decode())[0]
    word = word
 
    context = js2py.EvalJs()
    js = r'''
    function a(r) {
          if (Array.isArray(r)) {
              for (var o = 0, t = Array(r.length); o < r.length; o++)
                 t[o] = r[o];
              return t
         }
          return Array.from(r)
     }
      function n(r, o) {
         for (var t = 0; t < o.length - 2; t += 3) {
               var a = o.charAt(t + 2);
              a = a >= "a" ? a.charCodeAt(0) - 87 : Number(a),
                 a = "+" === o.charAt(t + 1) ? r >>> a : r << a,
                   r = "+" === o.charAt(t) ? r + a & 4294967295 : r ^ a
           }
           return r
       }
      function e(r) {
         var o = r.match(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g);
         if (null === o) {
                var t = r.length;
                t > 30 && (r = "" + r.substr(0, 10) + r.substr(Math.floor(t / 2) - 5, 10) + r.substr(-10, 10))
            } else {
              for (var e = r.split(/[\uD800-\uDBFF][\uDC00-\uDFFF]/), C = 0, h = e.length, f = []; h > C; C++)
                   "" !== e[C] && f.push.apply(f, a(e[C].split(""))),
                   C !== h - 1 && f.push(o[C]);
                var g = f.length;
              g > 30 && (r = f.slice(0, 10).join("") + f.slice(Math.floor(g / 2) - 5, Math.floor(g / 2) + 5).join("") + f.slice(-10).join(""))
            }
           var u = void 0
                , l = "" + String.fromCharCode(103) + String.fromCharCode(116) + String.fromCharCode(107);
          u = 'null !== i ? i : (i = window[l] || "") || ""';
          for (var d = u.split("."), m = Number(d[0]) || 0, s = Number(d[1]) || 0, S = [], c = 0, v = 0; v < r.length; v++) {
             var A = r.charCodeAt(v);
             128 > A ? S[c++] = A : (2048 > A ? S[c++] = A >> 6 | 192 : (55296 === (64512 & A) && v + 1 < r.length && 56320 === (64512 & r.charCodeAt(v + 1)) ? (A = 65536 + ((1023 & A) << 10) + (1023 & r.charCodeAt(++v)),
                    S[c++] = A >> 18 | 240,
                   S[c++] = A >> 12 & 63 | 128) : S[c++] = A >> 12 | 224,
                  S[c++] = A >> 6 & 63 | 128),
                 S[c++] = 63 & A | 128)
         }
          for (var p = m, F = "" + String.fromCharCode(43) + String.fromCharCode(45) + String.fromCharCode(97) + ("" + String.fromCharCode(94) + String.fromCharCode(43) + String.fromCharCode(54)), D = "" + String.fromCharCode(43) + String.fromCharCode(45) + String.fromCharCode(51) + ("" + String.fromCharCode(94) + String.fromCharCode(43) + String.fromCharCode(98)) + ("" + String.fromCharCode(43) + String.fromCharCode(45) + String.fromCharCode(102)), b = 0; b < S.length; b++)
               p += S[b],
                   p = n(p, F);
          return p = n(p, D),
               p ^= s,
          0 > p && (p = (2147483647 & p) + 2147483648),
             p %= 1e6,
           p.toString() + "." + (p ^ m)
       }
    '''
    js = js.replace('\'null !== i ? i : (i = window[l] || "") || ""\'', gtk)
    # implement js
    context.execute(js)
    # Call the function to get sign
    sign = context.e(word)
    return sign

Running example:

Topics: Python Python crawler

Programmer Think

Crawling Baidu translation (can be translated into Chinese and English)

Hot Topics