Python full stack development - Python crawler - 09 Introduction to JS reverse engineering

Posted by subkida on Sun, 23 Jan 2022 10:43:02 +0100

1. JavaScript anti-crawler measures: principles and motivation

Crawlers and website security are like spear and shield: one attacks, the other defends.

Is your website safe?

  • First, it depends on whether security measures are in place.
  • Second, it depends on whether the value of your data attracts the attention of crawlers. In other words, unless no crawler is interested in your data, you need to build up anti-crawling measures step by step.

2. Calling JavaScript code from Python

PyExecJS library:

Introduction:

This library runs JS code in a JS runtime installed on the local machine.

Advantages:

  • Many JS runtimes are supported. The official documentation recommends PyV8, Node.js, PhantomJS, and Nashorn.

Disadvantages:

  • A JS runtime must be installed, so the setup is not lightweight, and a runtime process is started when JS is called, which makes it noticeably slow.

Installation:

  • First install a local JS runtime. Node.js is recommended: it is easy to install and executes efficiently.
  • pip install PyExecJS
>>> import execjs
>>> execjs.get().name  # Check which JS runtime will be used
'Node.js (V8)'
>>> # Compile a JS snippet into a reusable context
>>> ctx = execjs.compile("""
...     function add(x, y) {
...         return x + y;
...     }
... """)
>>> ctx.call("add", 1, 2)  # Call the function and pass arguments
3
>>> # Compile and run a JS file (assumes ./test.js defines add)
>>> with open('./test.js') as f:
...     ctx = execjs.compile(f.read())
...     ctx.call('add', 1, 2)

PyV8

  • PyV8 is a Python wrapper for Google's V8 engine, the JavaScript engine used in Chrome.

  • Compared with PyExecJS, it is lightweight and needs no separately installed JS runtime, because V8 itself is the runtime.

  • Because no external runtime has to be started, execution is very fast.

  • Installation: download the prebuilt binary for your system here: emmetio / pyv8 binaries

    >>> import PyV8  # Pay attention to case
    >>> with PyV8.JSContext() as ctx:
    ...     ctx.eval("""
    ...         function add(x, y) {
    ...             return x + y;
    ...         }
    ...     """)
    ...     ctx.locals.add(1, 2)
    

3. Locating encrypted data

When JS code is obfuscated and encrypted, reverse analysis starts with locating the exact place in the JS files where the required data is produced. The general approach is as follows:

  • In the browser developer tools, first search globally for the target keyword with Ctrl+Shift+F;

  • Go through the search results one by one to find the JS file that contains the target keyword, usually under the Sources panel;

  • Once the JS file is found, format (pretty-print) the code and read it;

  • Use Ctrl+F inside the JS file to find the keyword and analyze the surrounding code;

  • If other parameters or functions are referenced, search for those keywords in turn;

  • Repeat until the encryption algorithm for the target data is found, usually a function or part of a function's implementation.
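
Once the JS files have been saved locally, the same keyword search can be scripted. The following is a minimal sketch, not part of the original workflow: the helper name `find_keyword`, the directory layout, and the context width are all assumptions made for illustration.

```python
import re
from pathlib import Path


def find_keyword(js_dir, keyword, context=40):
    """Yield (file name, line number, snippet) for each hit of keyword in *.js files.

    Hypothetical helper: js_dir is assumed to hold JS files saved from the site.
    """
    pattern = re.compile(re.escape(keyword))
    for path in sorted(Path(js_dir).rglob("*.js")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for match in pattern.finditer(text):
            # Line number = newlines before the match, plus one
            line_no = text.count("\n", 0, match.start()) + 1
            # Keep a few characters on each side so the hit can be judged quickly
            start = max(match.start() - context, 0)
            end = min(match.end() + context, len(text))
            snippet = text[start:end].replace("\n", " ")
            yield path.name, line_no, snippet
```

Iterating over `find_keyword("./js", "sign")` would list every file and line where `sign` appears, which narrows down the files worth formatting and reading in the browser. Minified files put everything on one line, so the snippet is usually more useful than the line number.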

Data encoding and encryption

base64 encoding

  • Base64 is one of the most common encodings on the network for transmitting 8-bit byte data. It represents binary data using 64 printable characters. See RFC 2045 to RFC 2049 for the detailed MIME specification.

  • Base64 encoding converts binary data to characters, so it can be used to carry longer identifying information in an HTTP environment. Base64-encoded text is not directly readable; it has to be decoded first.

  • Because of these advantages, Base64 is widely used across computing. However, its output includes "symbol" characters (+, /, =), so various Base64 variants have been developed for different application scenarios. Base62x is regarded as a symbol-free improved version intended to unify and normalize Base64 output.
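
One variant that ships with Python's standard library is the URL-safe alphabet, which replaces + and / with - and _. The short sketch below compares the two alphabets; the input bytes are chosen only so that the symbol characters show up in the output.

```python
import base64

# Three bytes picked so that the standard output contains both '+' and '/'
raw = bytes([0xFB, 0xFF, 0xFE])

standard = base64.b64encode(raw).decode()
url_safe = base64.urlsafe_b64encode(raw).decode()

print(standard)  # +//+
print(url_safe)  # -__-

# Each variant must be decoded with its matching function
assert base64.b64decode(standard) == raw
assert base64.urlsafe_b64decode(url_safe) == raw
```

When a value captured from a request contains - or _ instead of + or /, it was most likely produced with the URL-safe alphabet.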

Implementing Base64 encoding and decoding in Python

Encode
import base64
a = 'HC'.encode()  # Convert the string 'HC' to bytes
b = base64.b64encode(a)  # Base64-encode the bytes: b'SEM='
b.decode()  # Convert the result back to a string: 'SEM='

base64.b64encode('HC'.encode()).decode()  # The same in one line: 'SEM='
Decode
base64.b64decode('SEM=').decode()  # 'HC'
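
Captured Base64 values sometimes arrive with the trailing = padding stripped, and `base64.b64decode` rejects such input. A small sketch of restoring the padding before decoding follows; the helper name `b64decode_unpadded` is made up for this example.

```python
import base64


def b64decode_unpadded(data: str) -> bytes:
    """Decode Base64 text whose trailing '=' padding may be missing.

    Hypothetical helper, not part of the base64 module.
    """
    padding = -len(data) % 4  # number of '=' needed to reach a multiple of 4
    return base64.b64decode(data + "=" * padding)


print(b64decode_unpadded("SEM").decode())   # HC
print(b64decode_unpadded("SEM=").decode())  # HC
```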

MD5 message-digest algorithm

The MD5 message-digest algorithm is a widely used cryptographic hash function that produces a 128-bit (16-byte) hash value, used to check the integrity and consistency of transmitted information. Strictly speaking, MD5 is a one-way hash rather than encryption: the digest cannot be decrypted back into the original data.

Computing an MD5 hash in Python
# The standalone md5 module was removed in Python 3;
# use the hashlib module instead

import hashlib

# Text to hash (named text to avoid shadowing the built-in str)
text = 'this is a md5 test.'

# Create an md5 object
m = hashlib.md5()

# Note: the text must be encoded to bytes first.
# m.update(text) raises a TypeError saying the string must be encoded before hashing,
# because str in Python 3 is Unicode.
# b = bytes(text, encoding='utf-8') is equivalent; both produce bytes.
b = text.encode(encoding='utf-8')
m.update(b)
text_md5 = m.hexdigest()

print('Before MD5 hashing: ' + text)
print('After MD5 hashing: ' + text_md5)

# Alternative: the b prefix creates a bytes literal directly
text_md5 = hashlib.md5(b'this is a md5 test.').hexdigest()
print('After MD5 hashing: ' + text_md5)
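
The properties described above can be checked directly with hashlib. The sketch below is for illustration only and reuses the sample string from the example; the tampered string is made up.

```python
import hashlib

original = b"this is a md5 test."
tampered = b"this is a md5 test!"  # last character changed

digest = hashlib.md5(original).digest()
hex_digest = hashlib.md5(original).hexdigest()

print(len(digest))      # 16 bytes, i.e. 128 bits
print(len(hex_digest))  # 32 hexadecimal characters

# Any change to the input yields a different digest,
# which is what makes MD5 usable as an integrity check
print(hashlib.md5(tampered).hexdigest() == hex_digest)  # False

# update() can be called repeatedly; the result equals hashing the joined data
m = hashlib.md5()
m.update(b"this is ")
m.update(b"a md5 test.")
print(m.hexdigest() == hex_digest)  # True
```

Feeding data through repeated `update()` calls is useful for large inputs, since the whole payload never has to be held in memory at once.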

Topics: Python Javascript