Handling verification codes in Python: an approach to recognizing arithmetic CAPTCHAs with PIL and Tesseract

Posted by archbeta on Sat, 20 Nov 2021 06:33:12 +0100


As shown in the figure, when automating tasks with Python we often run into many kinds of verification codes (CAPTCHAs). This one is an addition problem written in digits.
The interference content itself consists of complete digits and letters, so ordinary OCR may not recognize the picture accurately.
Either way, let's set up the necessary environment and see what Tesseract produces.


  • Installing Tesseract:

First, download the Tesseract installer from this download page:
https://digi.bib.uni-mannheim.de/tesseract/
Many online tutorials recommend installing a release version without "dev" in its name, which is said to be more stable.
 

  • Configuring Tesseract:

After installation, you need to configure environment variables in two steps:
1. Add the installation path, and the path of the tessdata folder inside it, to the PATH environment variable.

2. Create a new system variable named TESSDATA_PREFIX. The variable name is fixed, and its value is the full path of the tessdata folder directly under the installation path, for example E:\Program Files (x86)\Tesseract-OCR\tessdata

Then install pytesseract from the command line:

pip install pytesseract

After completing the above steps, restart the computer.
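After the restart, it may help to confirm that both settings are actually visible to Python. The following is only a minimal sketch and not part of the original setup: the helper name check_tesseract_setup is hypothetical, it uses only the standard library, and it assumes the executable is called tesseract and is found through PATH.

```python
import os
import shutil


def check_tesseract_setup(env=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means OK.

    Hypothetical helper for illustration. `env` and `which` are parameters
    only so the check can be exercised without a real installation.
    """
    env = os.environ if env is None else env
    problems = []

    # Step 1: the installation folder must be on PATH so that the
    # `tesseract` executable can be found.
    if which("tesseract") is None:
        problems.append("tesseract executable not found on PATH")

    # Step 2: TESSDATA_PREFIX must point at the tessdata folder.
    prefix = env.get("TESSDATA_PREFIX")
    if not prefix:
        problems.append("TESSDATA_PREFIX is not set")
    elif not os.path.isdir(prefix):
        problems.append("TESSDATA_PREFIX does not point to a folder: " + prefix)

    return problems


if __name__ == "__main__":
    for problem in check_tesseract_setup():
        print("Problem:", problem)
```

If the script prints nothing, both steps look fine from Python's point of view.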
 

  • Recognition without image processing:

If you call OCR directly to get the result, you only need 3 lines of code:

import pytesseract
text = pytesseract.image_to_string('Picture path or memory picture object')
print(text)

However, the results on this verification code are not very good, for example:

Either there is no result at all, or the output is garbage.
This clearly won't work, so we have to process the picture first.
 

  • Image processing and recognition:


I downloaded 20 verification codes from this website and found the following patterns:
1. The verification code always contains an expression of the form "2 digits + 2 digits" with an equals sign.
2. The color of the verification code content is random.
3. The position of the verification code content is fixed (the plus signs of all 20 pictures are in the same position).
4. The interference content of the picture includes letters, digits and symbols.
5. No interference content has exactly the same color as the main content, but the interference items in every picture include a similar color.

As the picture shows, depending on the font, the body of each character is brown while its edges are slightly lighter. However, in none of the 20 pictures did I find interference content with exactly the same color as the main content.
Because the interference includes colors close to the main color, general-purpose filtering might make the picture harder to process, not easier. My idea is therefore to read the main color directly, replace every pixel that is not exactly that color with white, and run recognition once the interference has been removed.
The main color can be read from a point in the middle of the plus sign, whose position is fixed: (80,23) or (80,24).
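Before the full script, the core idea can be sketched on a tiny grid of RGB tuples without any image library. This is only an illustration under stated assumptions: the function name keep_only_color and the colors in the sample grid are made up, and a real picture would be accessed through PIL.

```python
WHITE = (245, 245, 255)  # near-white fill color; any light color should do


def keep_only_color(pixels, anchor):
    """Return a copy of `pixels` in which every pixel that differs from
    the color at `anchor` (x, y) is replaced by WHITE.

    `pixels` is a list of rows; each row is a list of (R, G, B) tuples.
    """
    ax, ay = anchor
    main_color = pixels[ay][ax]
    return [
        [color if color == main_color else WHITE for color in row]
        for row in pixels
    ]


brown = (120, 60, 20)        # made-up main color
light_brown = (150, 90, 50)  # made-up, similar interference color
grid = [
    [light_brown, brown, WHITE],
    [brown, brown, light_brown],
]
# The anchor plays the role of the fixed point inside the plus sign.
cleaned = keep_only_color(grid, anchor=(1, 1))
```

Only pixels that match the anchor color exactly survive, so the similar but not identical interference color is wiped out along with everything else.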

Python code is as follows:

# -*- coding: utf-8 -*-
"""
Created on Wed Apr 14 16:23:47 2021
 
@author: roshinntou
"""
 
 
 
from PIL import Image
import pytesseract
 
def images_to_string(index):
    # Open the picture; Image.open also accepts a file-like object (an IO stream)
    img = Image.open('index (' + str(index) + ').png')

    # Get the width and height of the picture
    w, h = img.size
    print('Original image size: %sx%s' % (w, h))

    # PNG pixels are not necessarily stored as plain RGB; each pixel may also
    # carry transparency. We don't need the transparency here, so convert the
    # picture to RGB. A JPG picture can be used directly without convert.
    img_rgb = img.convert('RGB')

    # Get a pixel access object for reading and writing pixels
    pixels = img_rgb.load()

    # Get the main color from the middle of the plus sign
    main_color = pixels[80, 23]
    print(main_color)

    # Replace every pixel that is not the main color with near-white
    for x in range(w):
        for y in range(h):
            if pixels[x, y] != main_color:
                pixels[x, y] = (245, 245, 255)

    # Run recognition directly on the in-memory PIL image object
    text = pytesseract.image_to_string(img_rgb)
    # Remove all whitespace (spaces, line breaks, trailing form feed)
    text = ''.join(text.split())
    # Print the result
    print(text)

    # Save the cleaned picture, named after the recognized text
    img_rgb.save(text + '.png')


if __name__ == '__main__':
    for i in range(1, 21):
        images_to_string(i)

The saved files are as follows:

Conclusion:
I checked the results, and the accuracy on these 20 pictures appears to be 100%. With that, the verification code of the target website has been successfully cracked.
This is the general approach to verification code recognition. Of course, my example is a relatively simple verification code, and there are many more troublesome ones. In the future, cropping, convolution, filtering, cleaning and other methods may need to be applied flexibly depending on the situation, but the overall idea is:
Find the pattern in the verification code, remove the interference noise according to that pattern, and then run recognition. I hope this gives you some inspiration.
Finally, you now have the verification code as a string. Computing the result from it is very simple, so I won't do it here. If you are interested, you can try it yourself. I will package all the pictures and source code so that you can download them and try.
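For readers who want to try the calculation, here is one possible sketch. It assumes the recognized string looks like "12+34=", which the original does not state exactly, and the function name solve_addition is hypothetical.

```python
import re


def solve_addition(text):
    """Return the sum for a string such as '12+34=', or None if the
    string does not look like an addition.
    """
    # Assumption: two runs of digits separated by a plus sign.
    match = re.search(r"(\d+)\+(\d+)", text)
    if match is None:
        return None
    return int(match.group(1)) + int(match.group(2))


print(solve_addition("12+34="))
```

Returning None instead of raising makes it easy to skip or retry a picture when the recognition output is not a clean expression.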
Remember: when installing Tesseract, neither of the two environment variable steps can be skipped. If either one is missing, the program will report an error when it runs.


Topics: Python AI