I use Tesseract version 4.00 on 32-bit Windows 7.
The training process has two main parts. The first part takes few commands but a lot of manual labor (there is no new technology here, and I don't know how many thousands of times other people have already done this training): generate a box file, then open the picture in jTessBoxEditor and adjust the characters and their box positions. The second part generates the traineddata file. It takes many commands but no manual work, and it finishes in tens of seconds.
~~~~~
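Start with the recognition command:
tesseract picture text -l language
Here picture is the input image, text is the base name of the output file (Tesseract appends .txt), and the value after -l selects the language engine, that is, which traineddata file is used.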
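As a minimal sketch, the same command can be assembled and run from Python. The helper names `recognition_command` and `recognize` and the file names are made up for illustration; only the `tesseract picture text -l language` shape comes from the command above, and `tesseract` is assumed to be on PATH.

```python
import subprocess


def recognition_command(picture, text, lang):
    """Build the argument list for: tesseract picture text -l lang."""
    return ["tesseract", picture, text, "-l", lang]


def recognize(picture, text, lang):
    """Run Tesseract and wait for it to finish; raises if it exits with an error."""
    subprocess.run(recognition_command(picture, text, lang), check=True)


if __name__ == "__main__":
    # Hypothetical file names; a real run would write the result to out.txt.
    print(recognition_command("test.png", "out", "thai"))
```

Passing the arguments as a list avoids quoting problems when a path contains spaces.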
Training adjusts the correspondence between picture slices and characters to produce a better engine, which is then used for recognition. So the manual part of training really comes down to checking and correcting the relationship between box positions and characters.
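To make the relationship between box position and character concrete: each line of a Tesseract box file holds a symbol, then the left, bottom, right and top pixel coordinates of its box (measured from the bottom-left corner of the image), then a page number. The sketch below reads one such line; `Box` and `parse_box_line` are illustrative names, not part of Tesseract or jTessBoxEditor.

```python
from typing import NamedTuple


class Box(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line):
    """Parse one box-file line: '<symbol> <left> <bottom> <right> <top> <page>'."""
    # Split from the right so the five numbers are taken first and
    # whatever remains on the left is the symbol.
    char, left, bottom, right, top, page = line.rstrip("\n").rsplit(" ", 5)
    return Box(char, int(left), int(bottom), int(right), int(top), int(page))


if __name__ == "__main__":
    box = parse_box_line("A 36 92 54 118 0")
    # Width and height of the box in pixels.
    print(box.char, box.right - box.left, box.top - box.bottom)
```

Correcting a box in jTessBoxEditor amounts to changing the symbol or the four coordinates on one of these lines.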
Second, let's talk about operating jTessBoxEditor 2.0. There are three tabs on the left, and the third one shows the enlarged box and its character. I don't understand why the designer didn't put a character input field close by in this tab, but instead placed the input box in the toolbar at the top, which feels far away. Whenever you want to change a character, you have to travel back and forth. The toolbar can be dragged out, but that looks ugly. In this tab you can use the keyboard keys Q W E R and A S D F to adjust the position of the box. However, you first have to click into the character input box (the one with the small gear) and press Enter before the keyboard responds. (I use an old version. I don't know whether this has been improved since.)
~~~~~~~~~~~
On Windows, the following bat commands cover the whole process. In the example, thai.font combines the language name (thai) and the font name (font). You can choose both names yourself, but within one complete training run they must stay the same and cannot be changed partway through.
::-------------------------------------------------------------1
::tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] batch.nochop makebox
tesseract thai.font.exp0.tif thai.font.exp0 -l thai batch.nochop makebox
pause
::-------------------------------------------------------------2
tesseract thai.font.exp0.tif thai.font.exp0 nobatch box.train
unicharset_extractor thai.font.exp0.box
@echo thai.font 0 0 0 0 0 >font_properties
shapeclustering -F font_properties -U unicharset thai.font.exp0.tr
mftraining -F font_properties -U unicharset -O thai.unicharset thai.font.exp0.tr
cntraining.exe thai.font.exp0.tr
rename normproto thai.normproto
rename inttemp thai.inttemp
rename pffmtable thai.pffmtable
rename shapetable thai.shapetable
combine_tessdata.exe thai
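The naming pattern `[lang].[fontname].exp[num]` used in these commands can be sketched as a small helper. The function name `training_names` is hypothetical; it only assembles strings and does not call Tesseract.

```python
def training_names(lang, font, num):
    """Return the file names one training round uses for [lang].[fontname].exp[num]."""
    base = f"{lang}.{font}.exp{num}"
    return {
        "base": base,                          # output base passed to tesseract
        "image": base + ".tif",                # training picture
        "box": base + ".box",                  # written by makebox
        "tr": base + ".tr",                    # written by box.train
        "traineddata": lang + ".traineddata",  # written by combine_tessdata
    }


if __name__ == "__main__":
    for key, name in training_names("thai", "font", 0).items():
        print(key, name)
```

Keeping the names in one place makes it harder to change the language or font name by accident in the middle of a training run.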
At first I split these commands into several bat files and clicked them one after another, because sometimes a command seemed to start before the previous result was ready, which made the 'pipeline' fail. Maybe | can solve the problem, but I know too little about it.
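If the worry is that a step starts before the previous one has finished, one option is to launch the steps from Python with `subprocess.run`, which blocks until the process exits, and to stop at the first non-zero exit code. This is only a sketch: `run_steps` is an illustrative name, and the demo uses harmless stand-in commands rather than the real training tools.

```python
import subprocess
import sys


def run_steps(steps):
    """Run each command in order, waiting for it to exit; stop at the first failure.

    Returns the number of steps that completed successfully.
    """
    done = 0
    for argv in steps:
        # subprocess.run blocks until the process has exited.
        result = subprocess.run(argv)
        if result.returncode != 0:
            print("step failed:", " ".join(argv))
            break
        done += 1
    return done


if __name__ == "__main__":
    # Stand-ins for the real commands: one that succeeds, one that fails.
    ok = [sys.executable, "-c", "pass"]
    bad = [sys.executable, "-c", "import sys; sys.exit(1)"]
    print(run_steps([ok, bad, ok]))
```

With the real tools, each entry would be an argument list such as `["unicharset_extractor", "thai.font.exp0.box"]`.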
Then I copied them into Python and ran cmd from Python. Hey, I'm just an amateur. Enough talk, here is the code. There are three files; read them for yourself.
from task import *

if __name__ == "__main__":
    while True:
        chose_task = input("""Please select a task:
1 Configure the environment
2 Generate the box file, then open jTessBoxEditor to adjust the box characters
3 Generate the new traineddata
4 Test the new traineddata
5 Exit
""")
        if chose_task == "1":
            configenvironment()
        elif chose_task == "2":
            create_box()
        elif chose_task == "3":
            crate_traindata_after_box_edited()
        elif chose_task == "4":
            # Not implemented yet. Plan, using the tr_conf file:
            # move the trained nlng + ".traineddata" to the folder given by path_tss
            # (first rename any existing traineddata of the same name if you want to keep it),
            # then run: tesseract test_picture test -l nlng
            pass
        elif chose_task == "5":
            break
        else:
            break
The tasks themselves (task.py)
import os
import datetime as dt


def printlist(li):
    # Show a list together with its indices
    for i, k in enumerate(li):
        print(" ", i, k)


train_configue_file = "tr_conf.txt"  # Self-made configuration file


def loadconf():
    # Read the configuration file into a dictionary.
    # (Of course, you could also write the configuration as a Python dictionary and import it.)
    cfdict = {}
    with open(train_configue_file, "r", encoding="utf-8") as sf:
        sl = sf.readlines()
    for i in sl:
        mt = i.strip()
        p = mt.find("#")  # Drop the trailing comment
        if p != -1:
            mt = mt[:p]
        if mt.find("[") != -1:
            a, b = mt.split("[")
            cfdict[a.rstrip()] = b.replace("]", "").strip()
    return cfdict


def rewriteconf(dc):
    # Write the configuration file from a dictionary
    with open(train_configue_file, "w", encoding="utf-8") as df:
        for i in dc:
            df.write(i + " [" + dc[i] + "]\n")


def whereis_tss_jTBE():
    # Find the two paths: the tessdata folder and jTessBoxEditor.jar
    tss = []
    jTBE = []
    conf = []
    print("--Looking for the tessdata and jTessBoxEditor.jar directories")
    for disk in "ABCDEFGHIJKLMNOPQRSTUVWXYZ":
        ds = disk + ":\\"
        if os.path.exists(ds):
            for r, d, f in os.walk(ds):
                for j in d:
                    if j == "tessdata":
                        tss += [r]
                        break
                for k in f:
                    if k == "jTessBoxEditor.jar":
                        jTBE += [r]
                        break
    if len(tss) == 0:
        raise RuntimeError("Please install Tesseract or add the relevant folders")
    print("-------Please select the tessdata path-------------")
    printlist(tss)
    ask = input("choice\n ")
    vv = tss[int(ask)] + "\\tessdata"
    conf += ["path_tss [" + vv + "]"]
    # ------Set the environment variable TESSDATA_PREFIX according to path_tss------
    # HKCU is short for HKEY_CURRENT_USER
    keyname = "HKCU\\Environment"  # Backslash escaped so Python does not treat \E as an escape
    TypenN = "REG_SZ"
    ValueName = "TESSDATA_PREFIX"
    vv = '"' + vv + '"'
    cmd = 'reg add ' + keyname + ' /f /t ' + TypenN + ' /v ' + ValueName + " /d " + vv
    os.system(cmd)
    print("---------Please select the jTessBoxEditor path-------------")
    if len(jTBE) == 0:
        raise RuntimeError("Please install jTessBoxEditor")
    printlist(jTBE)
    ask = input("choice\n ")
    conf += ["path_jTBE [" + jTBE[int(ask)] + "\\jTessBoxEditor.jar]"]
    return conf


def newtrainpicture():
    # Show the pictures in the folder where the script is located and pick one for training
    # ------------------------------------Search for pictures in this folder
    self_path = os.getcwd()  # Current directory
    fm_list = ["jpg", "tif", "bmp", "png"]
    picc = findfiles(self_path, fm_list)
    if picc == []:
        raise RuntimeError("Please have the picture ready")
    # ------------------------------------------------Select the picture
    print(" Select Picture----")
    printlist(picc)
    chsp = picc[int(input("choice "))]
    return chsp


def renamepicture(fnt_name, chsp):
    # If a new picture is selected, copy it under the name required by the training format
    pcfmt = chsp[-4:]
    fnt_picture = fnt_name + pcfmt
    # ----------------------------------------------Picture renaming
    if chsp != fnt_picture:
        cmd = "if exist " + chsp + " copy " + chsp + " " + fnt_picture
        os.system(cmd)
        cfdict = loadconf()
        cfdict["picture"] = fnt_picture
        rewriteconf(cfdict)
    return fnt_picture


def configenvironment():
    # Each entry has the form: name [value] (spaces around the value are stripped)
    r"""
    Main contents of the configuration file
    path_jTBE [ E:\data\jTessBoxEditor-2.2.0\jTessBoxEditor.jar ]  # jTessBoxEditor software path
    path_tss [ C:\Program Files\Tesseract-OCR\tessdata ]  # Where Tesseract stores traineddata
    nlng [ t ]  # Language (newlang)
    nfnt [ w ]  # Font (newfont)
    fnt_pp [ 0 0 0 0 0 ]  # Font properties
    gene [0]  # The number after exp (generation)
    picture []  # Training picture
    """
    ask = input(" Search the disks for the Tesseract location? y/n ")
    if ask != "n":
        conf = whereis_tss_jTBE()
    else:
        conf = []
        if os.path.exists(train_configue_file):
            cfdict = loadconf()
            path_tss = "path_tss [" + cfdict["path_tss"] + "]"
            path_jTBE = "path_jTBE [" + cfdict["path_jTBE"] + "]"
            conf += [path_tss]
            conf += [path_jTBE]
    print("--------Please set the OCR language name-------------")
    conf += ["nlng [" + input("\t").strip() + "]"]
    print("---------Please set your font name-------------")
    conf += ["nfnt [" + input("\t").strip() + "]"]
    print("--------Please set the font properties: 5 flags (0 or 1), default is 5 zeros--------")
    print("<italic?> <bold?> <fixed width?> <serif?> <fraktur?>")
    v = input("\t").strip()[:5]
    if v == "":
        conf += ["fnt_pp [0 0 0 0 0]"]
    else:
        for i in v:
            if i not in "01":
                print("Please enter five consecutive digits, each 0 or 1")
        v = " ".join(list(v))
        conf += ["fnt_pp [" + v + "]"]
    print("----Select Picture-----")
    conf += ["picture [" + newtrainpicture() + "]"]
    # ----------Set the number after exp
    if not os.path.exists(train_configue_file):  # Does an old file exist?
        conf += ['gene [0]']
    else:
        cd = loadconf()
        gn = int(cd.get("gene", 0))
        ask = input('Iteration? y/n ')
        if ask != "n":
            gn += 1
        conf += ["gene [" + str(gn) + "]"]
    # --------------------Write the configuration file (UTF-8, matching loadconf)
    with open(train_configue_file, 'wt', encoding="utf-8") as f:
        for i in conf:
            f.write(i + "\n")
    printlist(conf)
    print("Configuration completed, please check tr_conf.txt")
    # os.system("pause")


# ===============================================
def timenowstr():
    # Unique timestamp for renaming files. Currently unused.
    return dt.datetime.strftime(dt.datetime.now(), "%Y%m%d%H%M%S")


def findfiles(path, fm_list):
    # Walk the path and return the names of files whose extension is in fm_list
    fs = os.walk(path)
    chc = []
    for a, b, c in fs:
        for i in c:
            p = i.rfind(".")
            if i[p + 1:] in fm_list:
                chc += [i]
    print("\n")
    return chc


def cr_filename_soso():
    # --------------------Generate the file names from the configuration
    cfdict = loadconf()
    path_tss = cfdict.get("path_tss", "")
    path_jTBE = cfdict.get("path_jTBE", "")
    nlng = cfdict.get("nlng", "")
    nfnt = cfdict.get("nfnt", "")
    fnt_pp = cfdict.get("fnt_pp", "")
    gene = cfdict.get("gene", "")
    picture = cfdict.get("picture", "")
    for i in [path_tss, path_jTBE, nlng, nfnt, fnt_pp, gene, picture]:
        if i == "":
            raise RuntimeError("The content of the configuration file is incomplete")
    fnt_name = nlng + "." + nfnt + ".exp" + gene  # Base name, chosen by you
    fnt_box = fnt_name + ".box"
    fnt_pp = nlng + "." + nfnt + " " + fnt_pp  # Line for the font_properties file
    fnt_picture = renamepicture(fnt_name, picture)
    mass = fnt_picture + " " + fnt_name
    return fnt_name, fnt_box, fnt_pp, nlng, path_tss, path_jTBE, mass


def create_box():
    fnt_name, fnt_box, fnt_pp, nlng, path_tss, path_jTBE, mass = cr_filename_soso()
    # -----------------------------------Execute the bat command
    ask = input("\n Creating a new box file may overwrite the previous box information, are you sure? y/n ")
    if ask != "n":
        print("Please select the traineddata engine used to generate the box file --- ")
        already_lng = findfiles(path_tss, ["traineddata"])
        print(already_lng)
        already_lng = [i[:i.find(".")] for i in already_lng]
        printlist(already_lng)
        chose = input("choice\t")
        if chose.isdigit():
            chose = int(chose)
            tr_lang_o = " -l " + already_lng[chose]
        else:
            tr_lang_o = ""
        cr_box = "tesseract " + mass + tr_lang_o + " batch.nochop makebox"
        os.system(cr_box)
        os.system("pause")
    ask = input("Open jTessBoxEditor now to adjust the boxes? y/n ")
    if ask != "n":
        os.system(path_jTBE)  # Edit the box file


def crate_traindata_after_box_edited():
    fnt_name, fnt_box, fnt_pp, nlng, path_tss, path_jTBE, mass = cr_filename_soso()
    # ------------------------------------bat commands rewritten according to the training process
    do = ""  # For a dry run, try "echo " (with the trailing space)
    cr_train = "tesseract " + mass + " nobatch box.train"
    cr_unich = "unicharset_extractor " + fnt_box
    cr_font_prpt = "@echo " + fnt_pp + " >font_properties"
    cr_shape_tb = "shapeclustering -F font_properties -U unicharset " + fnt_name + ".tr"
    cr_intmp_pff = "mftraining -F font_properties -U unicharset -O " + nlng + ".unicharset " + fnt_name + ".tr"
    cr_norm = "cntraining.exe " + fnt_name + ".tr"
    renms = ["normproto", "inttemp", "pffmtable", "shapetable"]  # Files to rename
    cr_data = "combine_tessdata.exe " + nlng
    # ---------------------------------------------Execute the commands in order
    os.system(do + cr_train)
    os.system(do + cr_unich)
    os.system(do + cr_font_prpt)
    os.system(do + cr_shape_tb)
    os.system(do + cr_intmp_pff)
    os.system(do + cr_norm)
    for i in renms:
        h = nlng + "." + i
        rmv = "if exist " + h + " del " + h  # Remove the result of an earlier run
        rnm = "if exist " + i + " rename " + i + " " + h
        os.system(do + rmv)
        os.system(do + rnm)
    os.system(do + cr_data)
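The font-property prompt in configenvironment prints a warning for bad input but still writes it to the configuration. A stricter check could look like the sketch below; `parse_font_flags` is a hypothetical helper, not part of the script above.

```python
def parse_font_flags(text):
    """Turn input such as '01000' or '0 1 0 0 0' into '0 1 0 0 0'.

    Empty input gives the default of five zeros; anything else must be
    exactly five digits, each 0 or 1, otherwise ValueError is raised.
    """
    digits = text.replace(" ", "")
    if digits == "":
        return "0 0 0 0 0"
    if len(digits) != 5 or any(d not in "01" for d in digits):
        raise ValueError("expected five flags, each 0 or 1: " + repr(text))
    return " ".join(digits)


if __name__ == "__main__":
    print(parse_font_flags("01000"))
```

The caller could catch the ValueError and ask again instead of continuing with an invalid font_properties line.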
Configuration file example
path_tss [C:\Program Files\Tesseract-OCR\tessdata]
path_jTBE [E:\data\jTessBoxEditor-2.2.0\jTessBoxEditor.jar]
nlng [t]
nfnt [w]
fnt_pp [0 0 0 0 0]
picture [t.w.exp1.jpg]
gene [1]
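For reference, the `name [value]` format can also be read with a regular expression. This is an alternative sketch, not the loadconf function above; `parse_conf` is an illustrative name, and it assumes a value never contains a `]` character.

```python
import re

# A name, optional spaces, then a bracketed value; anything after ']'
# (such as a trailing # comment) is ignored.
ENTRY = re.compile(r"^\s*(\w+)\s*\[(.*?)\]")


def parse_conf(text):
    """Parse lines of the form 'name [value]  # comment' into a dictionary."""
    conf = {}
    for line in text.splitlines():
        m = ENTRY.match(line)
        if m:
            conf[m.group(1)] = m.group(2).strip()
    return conf


if __name__ == "__main__":
    sample = "nlng [t]\nnfnt [ w ]  # Font\nfnt_pp [0 0 0 0 0]\ngene [1]\n"
    print(parse_conf(sample))
```

Lines that do not start with a name, for example lines that are only a comment, are skipped.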