Python script for the general Tesseract training process

Posted by churdaddy on Mon, 24 Jan 2022 17:20:33 +0100

I use Tesseract version 4.00 on 32-bit Windows 7.
The training process has two main parts. The first part involves few commands but a lot of manual work. (There is no new technology here, and I don't know how many thousands of times others have done this.) You generate a box file, then open the picture in jTessBoxEditor and adjust the characters and their box positions. The second part generates the traineddata file. It involves many commands but no manual work, and it takes tens of seconds.
~~~~~
Let's start with the recognition command:
tesseract picture text -l language engine
Here, "language engine" means the traineddata file to use.
Training adjusts the correspondence between picture slices and characters to get a better engine, which is then used for recognition. So training really comes down to checking and correcting the correspondence between box positions and characters.
Second, a few words about using jTessBoxEditor 2.0. There are three tabs on the left; the third shows an enlarged view of the box and its character. I don't understand why the designer did not put a character input field in this tab and instead placed the input box in the toolbar at the top, far away. Whenever you want to change a character, you have to move back and forth. The toolbar can be dragged out, but that looks ugly. In this tab you can adjust the box position with the keys q, w, e, r, a, s, d and f. However, you first have to click into the character input box, which is where there is a pinion, and press Enter before the keyboard responds. (I use an old version. I don't know whether this has been improved since.)
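For reference, each line of a box file written by makebox has the form "<symbol> <left> <bottom> <right> <top> <page>", with pixel coordinates measured from the bottom-left corner of the picture. Below is a minimal Python sketch of reading and rewriting such a line. The names Box, parse_box_line and format_box_line are made up for this illustration; they are not part of Tesseract or jTessBoxEditor.

```python
from typing import NamedTuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    """Parse one box-file line: '<symbol> <left> <bottom> <right> <top> <page>'."""
    # Split from the right so the symbol field is kept intact.
    symbol, left, bottom, right, top, page = line.rstrip("\n").rsplit(" ", 5)
    return Box(symbol, int(left), int(bottom), int(right), int(top), int(page))


def format_box_line(box: Box) -> str:
    """Turn a Box back into the text form used in the .box file."""
    return f"{box.symbol} {box.left} {box.bottom} {box.right} {box.top} {box.page}"


if __name__ == "__main__":
    b = parse_box_line("ก 36 92 53 115 0")
    # Correct the character but keep the position, as one does in jTessBoxEditor.
    fixed = b._replace(symbol="ข")
    print(format_box_line(fixed))
```

Correcting a box file by hand amounts to changing either the symbol or the four coordinates on lines like these.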
~~~~~~~~~~~
On Windows, the following bat commands cover the whole process. In the example below, thai.font is made up of the language name (thai) and the font name (font). You can choose both names yourself, but they must stay the same throughout one complete training run.

::-------------------------------------------------------------1
::tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] batch.nochop makebox  
tesseract thai.font.exp0.tif thai.font.exp0 -l thai batch.nochop makebox  
pause
::-------------------------------------------------------------2
tesseract thai.font.exp0.tif thai.font.exp0 nobatch box.train

unicharset_extractor thai.font.exp0.box

@echo thai.font 0 0 0 0 0  >font_properties

shapeclustering -F font_properties -U unicharset thai.font.exp0.tr

mftraining -F font_properties -U unicharset -O thai.unicharset thai.font.exp0.tr 

cntraining.exe thai.font.exp0.tr 


rename normproto thai.normproto 
rename inttemp thai.inttemp 
rename pffmtable thai.pffmtable 
rename shapetable thai.shapetable  


combine_tessdata.exe thai
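
The [lang].[fontname].exp[num] base name appears in almost every command above, so it helps to build it in one place. A small Python sketch, assuming (as the commented template line suggests) that the language and font parts contain no dots; training_base and split_training_base are hypothetical helper names, not Tesseract tools.

```python
import re

# [lang].[fontname].exp[num], where lang and fontname contain no dots
_NAME_RE = re.compile(r"^(?P<lang>[^.]+)\.(?P<font>[^.]+)\.exp(?P<num>\d+)$")


def training_base(lang: str, font: str, num: int) -> str:
    """Build the base name shared by the .tif, .box and .tr files."""
    for part in (lang, font):
        if not part or "." in part:
            raise ValueError(f"name part must be non-empty and contain no dot: {part!r}")
    return f"{lang}.{font}.exp{num}"


def split_training_base(base: str):
    """Split 'thai.font.exp0' back into ('thai', 'font', 0)."""
    m = _NAME_RE.match(base)
    if m is None:
        raise ValueError(f"not a [lang].[fontname].exp[num] name: {base!r}")
    return m["lang"], m["font"], int(m["num"])


if __name__ == "__main__":
    base = training_base("thai", "font", 0)
    print(base + ".tif", base + ".box", base + ".tr")
```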

At first, I split these commands into several bat files and clicked them in order, because sometimes a command starts before the previous result is ready, which breaks the 'pipeline'. Maybe | could solve the problem, but my knowledge is limited.
Then I copied them into Python and ran the commands from Python. I'm just an amateur. Without further ado, here is the code. There are three files; read them for yourself.

from task import *
    
if __name__ =="__main__":
    while True:
        chose_task=input("""Please select a task:

            1 Configure the environment
            
            2 Generate the box file, then open jTessBoxEditor to correct the boxes and characters
            
            3 Generate the new traineddata

            4 Test the new traineddata

            5 Exit
            
         """)

        if chose_task == "1":  
            configenvironment()
        elif chose_task=="2":
            create_box()
        elif chose_task=="3":
            crate_traindata_after_box_edited()
        elif chose_task=="4":
            # Read the tr_conf file
            # Move the trained nlng + ".traineddata" to the folder given by path_tss
            # Before that, rename any existing traineddata with the same name (if you want to keep it)
            # Then run: tesseract test_picture test -l nlng
            pass
        elif chose_task=="5":
            break
        else:
            break
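
Task 4 is left empty in the script above. The comments in that branch could be implemented roughly as follows. This is only a sketch under assumptions: install_traineddata and recognition_command are hypothetical names, the ".bak" backup suffix is an arbitrary choice, and the file is copied rather than moved so that the training folder keeps its own copy.

```python
import os
import shutil


def install_traineddata(nlng, path_tss, workdir="."):
    """Copy <nlng>.traineddata from workdir into path_tss, keeping a backup.

    Returns the path of the installed file.
    """
    name = nlng + ".traineddata"
    src = os.path.join(workdir, name)
    dst = os.path.join(path_tss, name)
    if not os.path.isfile(src):
        raise FileNotFoundError(src)
    if os.path.exists(dst):
        # Keep the previous engine under a ".bak" name (arbitrary suffix).
        shutil.copy2(dst, dst + ".bak")
    shutil.copy2(src, dst)
    return dst


def recognition_command(picture, nlng, out_base="test"):
    """Same shape as: tesseract test_picture test -l nlng"""
    return ["tesseract", picture, out_base, "-l", nlng]


if __name__ == "__main__":
    # Only prints the command; pass the list to subprocess.run() to execute it.
    print(" ".join(recognition_command("test.jpg", "t")))
```

In the real script, nlng and path_tss would come from the dictionary returned by loadconf().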

The task functions (the task module imported above)

import os
import datetime as dt


def printlist(li): #  List display
    for i,k in enumerate(li):
        print ("  ",i,k)

train_configue_file="tr_conf.txt" # Self-made configuration file
def loadconf():
    # Read the configuration file into a dictionary.
    # (You could also write the configuration as a Python dictionary and import it.)
    cfdict={}
    with open(train_configue_file,"r",encoding="utf-8") as sf:
        for i in sf:
            mt=i.strip()
            p=mt.find("#")          # everything after "#" is a comment
            if p !=-1:
                mt=mt[:p]
            if mt.find("[")!=-1:
                a,b=mt.split("[",1) # split only at the first "["
                cfdict[a.rstrip()]=b.replace("]","").strip()
    return cfdict
def rewriteconf(dc): # Write the dictionary back to the configuration file
    with open(train_configue_file,"w",encoding="utf-8") as df:
        for i in dc:
            df.write(i+" ["+dc[i]+"]\n")
        
def whereis_tss_jTBE(): # Find the two installation paths
        # Search every drive for the tessdata folder and jTessBoxEditor.jar
    tss=[]
    jTBE=[]
    conf=[]
    print("--Looking for the tessdata folder and jTessBoxEditor.jar")
    for disk in "ABCDEFGHIJKLMNOPQRSTUVWXYZ":
        ds=disk+":\\"
        if os.path.exists(ds):
            for r,d,f in os.walk(ds):
                for j in d:
                    if j=="tessdata":
                        tss+=[r]
                        break
                for k in f:
                    if k=="jTessBoxEditor.jar":
                        jTBE+=[r]
                        break
    if len(tss)==0: raise RuntimeError("Please install Tesseract or add the tessdata folder")
    print("-------Please select the tessdata path-------------")
    printlist(tss)
    ask =input("choice\n ")
    vv=tss[int(ask)]+"\\tessdata"
    conf+=["path_tss ["+vv+"]"]

    #------Set the environment variable TESSDATA_PREFIX from path_tss ---#
    # HKCU is short for HKEY_CURRENT_USER
    keyname=r"HKCU\Environment"  # raw string: "\E" is not a valid escape sequence
    TypenN="REG_SZ"
    ValueName="TESSDATA_PREFIX"
    vv='"'+vv +'"' 
    cmd='reg add '+ keyname +' /f /t '+TypenN+ ' /v '+ValueName+" /d "+vv
    os.system(cmd)

    
    print("---------Please select the jTessBoxEditor path-------------")
    if len(jTBE)==0:raise RuntimeError("Please install jTessBoxEditor")
    printlist(jTBE)
    ask =input("choice\n  ")
    conf+=["path_jTBE ["+jTBE[int(ask)]+"\\jTessBoxEditor.jar]"]
    
    return conf
def newtrainpicture():
    # Display the pictures in the folder where the script is located, and select the pictures for training
   
    #------------------------------------Search for pictures in this folder
    self_path = os.getcwd() # Get current directory
    fm_list=["jpg","tif","bmp","png"]  
    picc=findfiles(self_path,fm_list)
    
    if picc==[]:
        raise RuntimeError("Please have the picture ready")
    #------------------------------------------------Select Picture
    print(" Select Picture----")
    printlist(picc)
    chsp=picc[int(input("choice  "))]
    return chsp
def renamepicture(fnt_name,chsp):
    # If a new picture was selected, copy it to a file named in the training format.

    pcfmt=os.path.splitext(chsp)[1]   # extension including the dot, e.g. ".tif" or ".jpeg"
    fnt_picture = fnt_name + pcfmt

    #----------------------------------------------Copy the picture under the new name
    if chsp !=fnt_picture:
        # Quote the names so that file names containing spaces still work
        cmd ='if exist "'+chsp+'" copy "'+chsp+'" "'+ fnt_picture+'"'
        os.system(cmd)
        cfdict=loadconf()
        cfdict["picture"]=fnt_picture
        rewriteconf(cfdict)
        
    return fnt_picture 


def configenvironment():
    # Format: parameter name [value] (spaces around the value are stripped)
    r""" Main contents of the configuration file
    path_jTBE [ E:\data\jTessBoxEditor-2.2.0\jTessBoxEditor.jar ] # Path of the jTessBoxEditor software
    path_tss [ C:\Program Files\Tesseract-OCR\tessdata ]       # Where Tesseract stores traineddata files
    nlng   [ t ]                 # Language (newlang)
    nfnt   [ w ]                 # Font (newfont)
    fnt_pp [ 0 0 0 0 0 ]    # Font properties
    gene [0]                 # Generation: the number after "exp"
    picture []              # Training picture
    """
    ask=input(" Search the disks for Tesseract and jTessBoxEditor? y/n ")
    if ask !="n":
        conf=whereis_tss_jTBE()
    else:
        conf=[]
        if os.path.exists(train_configue_file):
                cfdict=loadconf()
                path_tss="path_tss ["  +  cfdict["path_tss"]   +  "]"
                path_jTBE="path_jTBE ["  +  cfdict["path_jTBE"]  +  "]"
                conf+=[path_tss]
                conf+=[path_jTBE]
                
    print("--------Please set the OCR language name-------------")
    conf+=["nlng ["+input("\t").strip()+"]"]        
    
    print("---------Please set your font name-------------")
    conf+=["nfnt ["+input("\t").strip()+"]"]
    
    print("--------Please set the font properties: five flags, each 0 or 1 (default: five zeros)--------")
    print("<italic?> <bold?> <fixed width?> <serif?> <fraktur?>")
    v=input("\t").replace(" ","")[:5]   # accept "01000" as well as "0 1 0 0 0"
    if v=="":
        conf+=["fnt_pp [0 0 0 0 0]"]
    else:
        # Reject the input instead of writing an invalid value to the configuration
        if len(v)!=5 or any(i not in "01" for i in v):
            raise RuntimeError("Please enter five digits, each 0 or 1")
        v=" ".join(v)
        
        conf+=["fnt_pp ["+v+"]"]
       
    
    print("----Select Picture-----")
    conf+=["picture ["+ newtrainpicture()+"]"]

    # Set the number after "exp" ----------
    if not os.path.exists(train_configue_file): # Is there an old configuration file?
        conf+=['gene [0]']
    else:
        cd=loadconf()
        gn=int(cd.get("gene",0))
        ask=input('Start a new iteration (increase the exp number)? y/n ')
        
        if ask!="n": gn+=1
        conf+=["gene ["+str(gn)+"]"]
    #--------------------Write the configuration file-------
    # Use the same encoding that loadconf() reads with
    with open(train_configue_file, 'wt', encoding="utf-8") as f:
                for i in conf:
                    f.write(i+"\n")
            
    printlist(conf)
    print("Configuration completed, please check tr_conf.txt")
    #os.system("pause")
#===============================================
def timenowstr(): # Used to change the name of the file (unique timestamp). Unused
    return dt.datetime.strftime(dt.datetime.now(),"%Y%m%d%H%M%S")

def findfiles(path,fm_list): # Find files under path (including subfolders) by extension; return the bare file names as a list
    fs=os.walk(path)
    chc=[]
    for a,b,c in fs:
        for i in c :
            p= i.rfind(".")
            if i[p+1:] in fm_list:
                chc+=[i]
    print("\n")
    return chc

def cr_filename_soso():
     #  --------------------Generate a file name according to the configuration information
    cfdict=loadconf()

    path_tss    =cfdict.get("path_tss","")
    path_jTBE   =cfdict.get("path_jTBE","")
    nlng        =cfdict.get("nlng","")
    nfnt        =cfdict.get("nfnt","")
    fnt_pp      =cfdict.get("fnt_pp","")
    gene        =cfdict.get("gene","")
    picture    =cfdict.get("picture","")
    
    for i in [path_tss,path_jTBE, nlng, nfnt, fnt_pp, gene, picture ]:
        if i =="":raise RuntimeError("The content of the configuration file is incomplete")
            
    fnt_name= nlng+"."+nfnt+".exp"+gene # Base name in the form [lang].[fontname].exp[num]
    fnt_box=fnt_name+".box"
    fnt_pp= nlng +"."+nfnt+" "+fnt_pp # Font property settings file
    
    fnt_picture =renamepicture(fnt_name,picture)
    mass= fnt_picture +" "+ fnt_name 
    return  fnt_name,fnt_box,fnt_pp,nlng,path_tss,path_jTBE,mass
    
def create_box():
    fnt_name,fnt_box,fnt_pp,nlng,path_tss,path_jTBE,mass = cr_filename_soso()
    #-----------------------------------Execute bat command

    ask= input("\n Creating a new box file may overwrite the previous box information. Are you sure? y/n ")
    if ask =="n":pass
        #------------------------------------------------------------
    else:
            print("Please select the traineddata engine used to generate the box file --- ")
            already_lng=findfiles(path_tss,["traineddata"])
            print(already_lng)
            already_lng=[i[:i.find(".")] for i in already_lng]
            printlist(already_lng)

            chose=input("choice\t")
            if chose.isdigit():
                chose=int(chose)
                tr_lang_o =" -l "+already_lng [chose] 
            else:  tr_lang_o=""   # no selection: let tesseract use its default language
            
            cr_box="tesseract "+mass + tr_lang_o +" batch.nochop makebox"
            #----------------------------------------------------------
            os.system( cr_box)
            os.system("pause")
            ask=input("Open jTessBoxEditor now to adjust the boxes? y/n ")
            if ask!="n":
                os.system(path_jTBE) # Edit the box file in jTessBoxEditor

            
def crate_traindata_after_box_edited(): 
    fnt_name,fnt_box,fnt_pp,nlng,path_tss,path_jTBE,mass = cr_filename_soso()
    #------------------------------------bat commands rewritten to follow the training process
    do="" # Set this to "echo " (with the trailing space) to print the commands instead of running them
    cr_train="tesseract "+ mass+" nobatch box.train"
    cr_unich="unicharset_extractor "+ fnt_box
    cr_font_prpt="@echo "+fnt_pp+" >font_properties"
    cr_shape_tb="shapeclustering -F font_properties -U unicharset "+fnt_name+".tr"
    cr_intmp_pff="mftraining -F font_properties -U unicharset -O "+nlng+".unicharset "+fnt_name+".tr"
    cr_norm="cntraining.exe "+fnt_name+".tr"

    renms= ["normproto", "inttemp", "pffmtable" , "shapetable"] # Files that get the language prefix

    cr_data="combine_tessdata.exe "+nlng
 #---------------------------------------------Execute bat
    os.system( do +cr_train) # 
    os.system( do +cr_unich)
    os.system( do +cr_font_prpt)
    os.system( do +cr_shape_tb)
    os.system( do +cr_intmp_pff)
    os.system( do +cr_norm)

    for i in renms:
            h=nlng+"."+i
           
            rmv="if exist "+ h +" del "+ h
            rnm="if exist "+i+" rename "+i+" "+ h 
            os.system( do +rmv)
            os.system( do +rnm)

    os.system( do +cr_data)
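
The os.system calls above discard the exit codes, so a failed step does not stop the later ones. One possible alternative, sketched here under assumptions, is to run the steps with subprocess.run and check=True. The function name run_steps and the dry_run switch are made up for this illustration, and the two commands in the demo are simply the first two steps from the bat listing.

```python
import subprocess


def run_steps(steps, dry_run=False):
    """Run each command (a list of arguments) in order; stop at the first failure.

    Returns the number of steps completed.
    """
    done = 0
    for args in steps:
        print(">", " ".join(args))
        if not dry_run:
            # run() blocks until the program exits, so the next step never
            # starts before the previous one has finished writing its files.
            # check=True raises CalledProcessError on a non-zero exit code.
            subprocess.run(args, check=True)
        done += 1
    return done


if __name__ == "__main__":
    base = "thai.font.exp0"
    steps = [
        ["tesseract", base + ".tif", base, "nobatch", "box.train"],
        ["unicharset_extractor", base + ".box"],
    ]
    # Dry run: only prints the commands, nothing is executed.
    run_steps(steps, dry_run=True)
```

Passing the arguments as a list also avoids quoting problems when a file name contains spaces.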


Configuration file example

path_tss [C:\Program Files\Tesseract-OCR\tessdata]
path_jTBE [E:\data\jTessBoxEditor-2.2.0\jTessBoxEditor.jar]
nlng [t]
nfnt [w]
fnt_pp [0 0 0 0 0]
picture [t.w.exp1.jpg]
gene [1]
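
To show what loadconf makes of this format, here is a standalone sketch that parses the same "name [value]" lines from a string instead of a file. parse_conf is a hypothetical name used only for this illustration; it assumes one parameter per line and "#" starting a comment, as in the script.

```python
def parse_conf(text):
    """Parse 'name [value]  # comment' lines into a dictionary."""
    conf = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop a trailing comment
        if "[" not in line:
            continue                          # skip blank or malformed lines
        name, _, rest = line.partition("[")
        conf[name.strip()] = rest.rsplit("]", 1)[0].strip()
    return conf


if __name__ == "__main__":
    sample = r"""
path_tss [C:\Program Files\Tesseract-OCR\tessdata]
nlng [ t ]      # language
nfnt [w]
fnt_pp [0 0 0 0 0]
gene [1]
"""
    for key, value in parse_conf(sample).items():
        print(key, "=", value)
```

All values come back as strings, which is why the script converts gene with int() before incrementing it.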