Python coding API (learn to use)

Posted by JamieWAstin on Mon, 17 Jan 2022 16:26:40 +0100

Verification code processing
Learning objectives
Understand what a verification code is
Master the use of an image recognition engine
Know the common coding platforms
Master handling verification codes through a coding platform
1. Picture verification code
1.1 what is a picture verification code

CAPTCHA stands for "Completely Automated Public Turing test to tell Computers and Humans Apart". It is a public, fully automated program that distinguishes whether the user is a computer or a human.
1.2 function of verification code

Verification codes prevent malicious password cracking, vote rigging, forum spam flooding and automated page refreshing. In particular, they stop an attacker from using a program to brute-force repeated login attempts against a specific registered user. Verification codes are in common use on many websites (such as the online personal banking of China Merchants Bank and the Baidu community). They are a relatively simple way to get this protection: login becomes slightly more troublesome, but the feature is still necessary and important for users' password security.
1.3 usage scenario of picture verification code in crawler

Registration
Login
When requests are sent too frequently, the server pops up a verification code to verify the client
1.4 processing scheme of picture verification code

Manual input
This only works when a single manual login keeps the crawler usable afterwards
Image recognition engine
An optical recognition engine processes the data in the picture. At present it is mostly used to extract data from pictures and less often to handle verification codes
Coding platform
The usual verification code solution for crawlers: a third-party service that recognizes the verification code for you
2. Image recognition engine
OCR (Optical Character Recognition) refers to software that takes text captured as image files with a scanner or digital camera, analyzes those image files, and automatically recognizes the text and layout information in them.
2.1 what is tesseract
Tesseract is an OCR engine developed by HP Labs and maintained by Google. It is open source and free, and it supports multiple languages and multiple platforms.
Project address: https://github.com/tesseract-ocr/tesseract
2.2 installation of image recognition engine environment
1 engine installation

On macOS, install directly from the command line

Installation on Windows
Install it with the exe installer; the download address is listed in the wiki of the GitHub project. After installation, remember to add the directory containing the Tesseract executable to the PATH so that later calls can find it.

Installation on Linux


2 installation of the Python library
Run the install command in the terminal inside PyCharm
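
The install commands themselves are not reproduced above. As a rough sanity check once everything is installed, a sketch like the following reports whether the Tesseract executable is on the PATH and whether the Python libraries can be imported. The module names `pytesseract` and `PIL` (Pillow) are assumptions about which libraries this step installs, and `check_ocr_environment` is a hypothetical helper name.

```python
import importlib.util
import shutil


def check_ocr_environment(modules=("pytesseract", "PIL")):
    """Report which parts of the OCR setup are visible from this interpreter."""
    # shutil.which returns None when the executable is not on the PATH
    status = {"tesseract": shutil.which("tesseract") is not None}
    for name in modules:
        # find_spec returns None when a top-level module is not installed
        status[name] = importlib.util.find_spec(name) is not None
    return status


if __name__ == "__main__":
    for part, ok in check_ocr_environment().items():
        print(f"{part}: {'found' if ok else 'MISSING'}")
```

If `tesseract` is reported as missing on Windows, the usual cause is that its install directory was not added to the PATH.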

2.3 use of image recognition engine
The image_to_string method of the pytesseract module extracts the data in an opened picture file as string data.
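
A minimal sketch of that call, assuming Pillow is installed alongside pytesseract and the Tesseract engine is on the PATH. The helper names `read_captcha` and `clean_ocr_text` are hypothetical. Tesseract's raw output usually carries extra whitespace, so the sketch strips all whitespace before returning, which suits verification codes that contain no spaces.

```python
def clean_ocr_text(raw):
    """Drop all whitespace from OCR output (assumes the code contains none)."""
    return "".join(raw.split())


def read_captcha(image_path, lang="eng"):
    """Open an image file and return the text Tesseract recognizes in it."""
    # Imported here so clean_ocr_text stays usable without the OCR stack.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as image:
        raw = pytesseract.image_to_string(image, lang=lang)
    return clean_ocr_text(raw)
```

Calling `read_captcha("captcha.png")` (a hypothetical file name) would return the recognized text. Expect poor results on distorted or noisy verification codes, which is why the coding platform described later is the more common choice.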

2.4 usage expansion of image recognition engine
Simple use and training of Tesseract
Other OCR platforms


3 coding platform
1. Why you need to know how to use a coding platform
Many websites now use verification codes as an anti-crawling measure, so to obtain data reliably we need to know how to handle verification codes in a crawler through a coding platform (a third-party service that recognizes verification codes).

2. Common coding platforms
Cloud coding (Yundama): http://www.yundama.com/
Handles recognition of general verification codes

Geetest verification code intelligent recognition assistant: http://jiyandoc.c2567.com/
Handles recognition of complex verification codes

3. Use of cloud coding
Let's take cloud coding as an example to understand how to use the coding platform

3.1 official interface of cloud coding
The following code is provided by the cloud coding platform. It has been lightly modified to add two functions:

indetify: pass in the binary content of the picture response
indetify_by_filepath: pass in the path of the picture file
The values you need to configure yourself are the account settings near the end of the file: username, password, appid and appkey.

The API officially provided by cloud coding is as follows:
#yundama.py

import requests
import json
import time

class YDMHttp:
    """Client for the cloud coding HTTP API."""

    apiurl = 'http://api.yundama.com/api.php'
    username = ''
    password = ''
    appid = ''
    appkey = ''

    def __init__(self, username, password, appid, appkey):
        self.username = username
        self.password = password
        self.appid = str(appid)
        self.appkey = appkey

    def request(self, fields, files=None):
        # POST the form fields (and optional file) and parse the JSON reply
        response = self.post_url(self.apiurl, fields, files)
        response = json.loads(response)
        return response

    def balance(self):
        # Returns the account balance, or a negative error code
        data = {'method': 'balance', 'username': self.username, 'password': self.password, 'appid': self.appid,
                'appkey': self.appkey}
        response = self.request(data)
        if response:
            if response['ret'] and response['ret'] < 0:
                return response['ret']
            else:
                return response['balance']
        else:
            return -9001

    def login(self):
        # Returns the user id, or a negative error code
        data = {'method': 'login', 'username': self.username, 'password': self.password, 'appid': self.appid,
                'appkey': self.appkey}
        response = self.request(data)
        if response:
            if response['ret'] and response['ret'] < 0:
                return response['ret']
            else:
                return response['uid']
        else:
            return -9001

    def upload(self, filename, codetype, timeout):
        # In this modified version, filename holds the picture content (bytes),
        # which is sent as the uploaded file. Returns the task id (cid).
        data = {'method': 'upload', 'username': self.username, 'password': self.password, 'appid': self.appid,
                'appkey': self.appkey, 'codetype': str(codetype), 'timeout': str(timeout)}
        file = {'file': filename}
        response = self.request(data, file)
        if response:
            if response['ret'] and response['ret'] < 0:
                return response['ret']
            else:
                return response['cid']
        else:
            return -9001

    def result(self, cid):
        # Returns the recognized text, or '' while it is not ready yet
        data = {'method': 'result', 'username': self.username, 'password': self.password, 'appid': self.appid,
                'appkey': self.appkey, 'cid': str(cid)}
        response = self.request(data)
        return response and response['text'] or ''

    def decode(self, filename, codetype, timeout):
        # Upload the picture, then poll once per second until a result arrives
        cid = self.upload(filename, codetype, timeout)
        if cid > 0:
            for i in range(0, timeout):
                result = self.result(cid)
                if result != '':
                    return cid, result
                else:
                    time.sleep(1)
            return -3003, ''
        else:
            return cid, ''

    def post_url(self, url, fields, files=None):
        # files maps field names to file content; the step that opened
        # file paths here is disabled in this modified version
        res = requests.post(url, files=files, data=fields)
        return res.text

username = 'whoarewe'  # username

password = '***'  # password

appid = 4283  # appid

appkey = '02074c64f0d0bb9efb2df455537b01c3'  # appkey

filename = 'getimage.jpg'  # file location

codetype = 1004  # verification code type

timeout = 60  # timeout in seconds

def indetify(response_content):
    # response_content: binary content of the verification code picture
    if (username == 'username'):
        print('Please set relevant parameters before testing')
    else:
        # Initialization
        yundama = YDMHttp(username, password, appid, appkey)

        # Log in to cloud coding
        uid = yundama.login()
        print('uid: %s' % uid)

        # Check the balance
        balance = yundama.balance()
        print('balance: %s' % balance)

        # Start recognition: picture content, verification code type ID, timeout (seconds)
        cid, result = yundama.decode(response_content, codetype, timeout)
        print('cid: %s, result: %s' % (cid, result))
        return result

def indetify_by_filepath(file_path):
    # file_path: path of the verification code picture on disk
    if (username == 'username'):
        print('Please set relevant parameters before testing')
    else:
        # Initialization
        yundama = YDMHttp(username, password, appid, appkey)

        # Log in to cloud coding
        uid = yundama.login()
        print('uid: %s' % uid)

        # Check the balance
        balance = yundama.balance()
        print('balance: %s' % balance)

        # Read the picture content, because upload() sends what it is given as the file body
        with open(file_path, 'rb') as f:
            file_content = f.read()

        # Start recognition: picture content, verification code type ID, timeout (seconds)
        cid, result = yundama.decode(file_content, codetype, timeout)
        print('cid: %s, result: %s' % (cid, result))
        return result

if __name__ == '__main__':
    pass

4 types of common verification codes
4.1 the URL address remains unchanged and the verification code remains unchanged
This is a very simple type of verification code. You only need to obtain the address of the verification code, request it, and have the response recognized by the coding platform
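
Because the picture behind such an address never changes, one recognition is enough and the answer can be reused. The following is a minimal sketch of that idea, not part of the cloud coding API: `make_cached_solver`, `fetch_image` and `solve` are hypothetical names, and both callables are supplied by the caller.

```python
def make_cached_solver(fetch_image, solve):
    """Wrap a solver so each verification code URL is recognized only once."""
    answers = {}

    def solve_url(captcha_url):
        if captcha_url not in answers:
            image_bytes = fetch_image(captcha_url)      # e.g. an HTTP GET
            answers[captcha_url] = solve(image_bytes)   # e.g. the coding platform
        return answers[captcha_url]

    return solve_url
```

With requests, `fetch_image` could be `lambda url: requests.get(url).content`, and `solve` could be the `indetify` function from yundama.py. Repeated calls with the same address then cost only one request to the coding platform.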

4.2 the URL address remains unchanged and the verification code changes
This type of verification code is a more common type. For this type of verification code, you need to think about:

During login, assuming the verification code I entered is correct, how does the server know that the code I entered is the one displayed on my screen rather than some other verification code?

Across fetching the web page, requesting the verification code and submitting it, the server must have some means of confirming that the verification code I obtained earlier and the one I finally submitted are the same. What is that means?

Clearly, it is implemented through cookies. Accordingly, the cookies must stay consistent across requesting the page, requesting the verification code and submitting the verification code. For this, you can use requests.Session
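
A minimal sketch of that flow, under stated assumptions: the three URLs, the form field name `captcha` and the function name `login_with_captcha` are hypothetical, and `session` is expected to behave like a `requests.Session` (its `get` and `post` calls share one cookie jar). The point is only that all three requests go through the same session object.

```python
def login_with_captcha(session, page_url, captcha_url, submit_url,
                       credentials, solve_captcha):
    """Fetch the page, fetch the verification code, and submit, all on one session."""
    # 1. Load the login page; the server typically sets its session cookie here.
    session.get(page_url)

    # 2. Request the verification code picture; the session sends the cookie back,
    #    so the server can tie this picture to the same visitor.
    image_bytes = session.get(captcha_url).content

    # 3. Recognize the picture and submit the form with the same cookies.
    form = dict(credentials)
    form["captcha"] = solve_captcha(image_bytes)  # hypothetical field name
    return session.post(submit_url, data=form)
```

In a crawler this would be called with `session = requests.Session()` and, for example, `solve_captcha=indetify` from yundama.py. Using plain `requests.get` and `requests.post` instead would send each request without the earlier cookies, so the server could not match the submitted code to the picture it served.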