Handling verification codes in Python: an approach to recognizing arithmetic verification codes with PIL and Tesseract
As shown in the figure, when automating tasks with Python we often run into many kinds of verification codes (CAPTCHAs). This one asks for the sum of two numbers.
The interference (noise) in the picture itself contains complete digits and letters, so ordinary OCR may not be very accurate.
Still, let's set up the necessary environment and see what Tesseract recognizes.
- Installing Tesseract:
First, download the Tesseract installation package from:
https://digi.bib.uni-mannheim.de/tesseract/
Many online tutorials recommend installing a release without "dev" in its name, which is said to be more stable.
- Configuring Tesseract:
After installation, the environment variables need to be configured in two steps:
1. Add the installation path, and the path of the tessdata folder inside it, to the Path environment variable.
2. Create a new system variable named TESSDATA_PREFIX, for example with the value E:\Program Files (x86)\Tesseract-OCR\tessdata. The variable name must be exactly TESSDATA_PREFIX, and the value is the full path of the tessdata folder directly under the installation path mentioned above.
Then install pytesseract from the command line:
pip install pytesseract
After completing the above steps, restart the computer.
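The two configuration steps can be checked from Python before running any recognition. The following is only a minimal sketch: `check_tesseract_setup` is a hypothetical helper (not part of pytesseract), and it assumes the executable is called `tesseract` and that the tessdata folder is pointed to by `TESSDATA_PREFIX` as described above.

```python
import os
import shutil


def check_tesseract_setup(env=None, which=shutil.which):
    """Return a list of problems found; an empty list means the setup looks OK.

    `env` and `which` are parameters only so the check can be tried out
    without touching the real environment (an assumption of this sketch).
    """
    env = os.environ if env is None else env
    problems = []
    # Step 1: the installation path must be on Path so the executable is found
    if which("tesseract") is None:
        problems.append("tesseract executable not found on PATH")
    # Step 2: TESSDATA_PREFIX must exist and point to a real folder
    prefix = env.get("TESSDATA_PREFIX")
    if not prefix:
        problems.append("TESSDATA_PREFIX is not set")
    elif not os.path.isdir(prefix):
        problems.append("TESSDATA_PREFIX is not an existing folder: " + prefix)
    return problems


issues = check_tesseract_setup()
print(issues if issues else "Tesseract setup looks OK")
```

If the list is not empty, it is worth revisiting the two steps above before debugging anything else.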
- Recognizing the picture without any processing:
Calling OCR directly on the picture takes only 3 lines of code:
import pytesseract
text = pytesseract.image_to_string('Picture path or memory picture object')
print(text)
However, the result on this kind of verification code is poor, for example:
Either there is no result at all, or the output is garbage.
Clearly this does not work, so the pictures have to be processed first.
- Image processing and recognition:
I downloaded 20 verification codes from this website and found the following rules:
1. The verification code always contains the pattern "2 digits + 2 digits" together with an "=" sign.
2. The color of the verification code content is random.
3. The position of the verification code content appears to be fixed (the plus signs in all 20 pictures are in the same place).
4. The interference content of the picture includes letters, numbers and symbols.
5. The interference is never exactly the same color as the main content, but the interference in every picture does contain similar colors.
Here "trunk" means the main strokes of the characters. Depending on the font, the trunk is, for example, brown while the edges of the characters are slightly lighter. In all 20 pictures, however, no interference item has exactly the same color as the trunk.
Because the interference contains colors close to the trunk color, mainstream filtering methods may actually make the picture harder to process. So my idea is to take the trunk color directly, replace every pixel that is not exactly that color with white (which deletes the interference), and then run recognition.
The trunk color can be read from a point in the middle of the plus sign, whose position is fixed: (80, 23) or (80, 24).
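The exact-color idea can be illustrated on a tiny hand-made grid, without PIL. This is only a sketch: `keep_only_color` is a hypothetical helper, and the colors and the 3x3 "image" are invented for illustration. Note that the grid is a list of rows, so point (x, y) is read as `pixels[y][x]`.

```python
WHITE = (255, 255, 255)


def keep_only_color(pixels, sample_xy, background=WHITE):
    """Return a copy of `pixels` in which every pixel whose color differs
    from the color at `sample_xy` is replaced by `background`."""
    x, y = sample_xy
    trunk = pixels[y][x]  # color sampled at the known, fixed position
    return [[p if p == trunk else background for p in row] for row in pixels]


BROWN = (120, 60, 20)        # stand-in for the trunk color (made up)
LIGHT_BROWN = (150, 90, 50)  # similar but not identical interference color
GREY = (90, 90, 90)          # unrelated interference color

image = [
    [GREY,  LIGHT_BROWN, GREY],
    [BROWN, BROWN,       BROWN],
    [GREY,  BROWN,       LIGHT_BROWN],
]

cleaned = keep_only_color(image, sample_xy=(1, 1))
for row in cleaned:
    print("".join("#" if p == BROWN else "." for p in row))
# ...
# ###
# .#.
```

Because the comparison is exact, the similar light-brown pixels are removed along with the grey ones, which is the property rule 5 relies on.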
Python code is as follows:
# -*- coding: utf-8 -*-
"""
Created on Wed Apr 14 16:23:47 2021
@author: roshinntou
"""
from PIL import Image
import pytesseract


def images_to_string(index):
    # Open the picture; Image.open also accepts a file-like object (IO stream)
    img1 = Image.open('index (' + str(index) + ').png')
    # Get the width and height of the picture
    w, h = img1.size
    print('Original image size: %sx%s' % (w, h))
    # PNG pixels are not necessarily stored as plain RGB: each pixel may
    # also carry transparency. Transparency is not needed for recognition,
    # so the picture is converted to RGB here.
    # A JPG picture can be used directly, without convert.
    img1rgb = img1.convert('RGB')
    # Get access to all pixel data
    src_strlist = img1rgb.load()
    # Get the trunk color from a point inside the fixed plus sign
    data = src_strlist[80, 23]
    print(data)
    # Double loop: replace every pixel that is not the trunk color
    for x in range(0, w):
        for y in range(0, h):
            # Check whether the current pixel equals the trunk color
            co = src_strlist[x, y]
            if co != data:
                # Near-white background
                src_strlist[x, y] = (245, 245, 255)
    # Recognize directly from the PIL image object in memory
    text = pytesseract.image_to_string(img1rgb)
    # Remove all whitespace (spaces, line breaks, trailing form feed)
    text = "".join(text.split())
    # Print the result
    print(text)
    # Save the cleaned picture, named after the recognized text
    img1rgb.save(text + '.png')


if __name__ == '__main__':
    for i in range(1, 21):
        images_to_string(i)
The resulting files are as follows:
Conclusion:
I checked the accuracy, and it should be 100%. With that, the verification code of the target website has been cracked successfully.
The overall approach to verification code recognition should look like this. Of course, my example is a relatively simple verification code, and there are many more troublesome ones. In the future, cropping, convolution, filtering, cleaning and other methods may need to be combined flexibly according to the actual situation, but the overall idea is:
Find the rules behind the verification code, clean the interference noise according to those rules, and then recognize it. I hope this gives you some inspiration.
Finally, you now have the string of the verification code. Computing the result from it is very simple, so I won't do it here. If you are interested, you can try. I will package all the pictures and source code so that you can download them and try.
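For readers who want to try the last step, here is one possible sketch of computing the result from the recognized string. It assumes the text looks like "12+34=" after recognition. `solve_addition` is a hypothetical helper name, and the regular expression is an illustrative choice rather than the author's code.

```python
import re

# Two runs of digits separated by a plus sign
_SUM_PATTERN = re.compile(r"(\d+)\+(\d+)")


def solve_addition(text):
    """Return the sum of the two numbers in text such as '12+34=',
    or None if no 'number+number' pattern is found."""
    compact = "".join(text.split())  # drop any remaining whitespace
    match = _SUM_PATTERN.search(compact)
    if match is None:
        return None
    return int(match.group(1)) + int(match.group(2))


print(solve_addition("12+34="))       # 46
print(solve_addition("7 + 85 = ?"))   # 92
print(solve_addition("no sum here"))  # None
```

Returning None instead of raising makes it easy to detect a failed recognition and request a new verification code.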
Remember: when installing Tesseract, neither of the two environment variable steps can be skipped. If either one is missing, the program will report an error when it runs.