In a previous post I described how to randomly split samples into train, val and test sets according to a specified ratio. For details, see:
https://blog.csdn.net/lingyunxianhe/article/details/81837978
That approach assumes your categories are spread randomly (or, better, evenly) through the samples, rather than one category being concentrated in a continuous stretch of the data. When a category is concentrated like that, random sampling may leave train, val and test with very poor class distributions.
Because I labelled the data by hand, images of the same category sometimes end up close together, simply because that was convenient while labelling. In my case one category accounts for more than 80% of the samples in a run of more than 500 consecutive images. To allocate train, val and test sensibly, whenever one category dominates a run of consecutive images I record those image names in a TXT file (a small program can write the names to the file). Each of these small sets is then split on its own according to the train/val ratio (my test set is fixed here); if the test set is not fixed, split each small set according to the train/val/test ratio instead. Finally, the small sets are merged into the large sets. The code is as follows:
#!/usr/bin/python
# -*- coding: UTF-8 -*-
# 2018/08/11 by DQ
import os
import random

MidFolder='py-faster-rcnn'
MainFolder=os.path.join('/home/KingMe/project',MidFolder,'data/FABdevkit2017/FAB2017/ImageSets/Main')
AnotFolder=os.path.join('/home/KingMe/project',MidFolder,'data/FABdevkit2017/FAB2017/Annotations')
fileIdLen=6#image names are zero-padded to this many digits
#Total number of images, assuming one annotation file per image
CurImNum=len(os.listdir(AnotFolder))

######################last start#############################
def CreateImIdTxt(ImIdS,FilePath):
    #Mode 'w' truncates an existing file, so no explicit delete is needed
    with open(FilePath,'w') as FId:
        for ImId in ImIdS:
            FId.write(str(ImId).zfill(fileIdLen)+'\n')

#Get the image ids recorded in the specified txt file
def GetPointTxtImIdSet(FilePath):
    ImIdSet=[]
    if os.path.exists(FilePath):
        with open(FilePath) as FId:
            for TxtStr in FId:
                ImId=TxtStr.split()
                if ImId:#skip blank lines
                    ImIdSet.append(int(ImId[0]))
    return ImIdSet

#Shuffle the ids, then split them into train and val by the ratio TrainR
def AssignImIdSetAsRatio(ImIdSet,TrainR):
    random.shuffle(ImIdSet)
    ImNum=len(ImIdSet)
    TrainNum=int(TrainR*ImNum)
    TrainImId=ImIdSet[:TrainNum]#the first TrainNum shuffled ids go to train
    ValImId=ImIdSet[TrainNum:]#the rest go to val
    return TrainImId,ValImId

def WriteImIdSet2TrainValTxt(TrainImId,ValImId,TrainValImId):
    TrainImId.sort()
    ValImId.sort()
    TrainValImId.sort()
    TrainValTestIds={'train':TrainImId,'val':ValImId,'trainval':TrainValImId}
    TrainValTestFiles={'train':'train.txt','val':'val.txt','trainval':'trainval.txt'}
    for Key,FileName in TrainValTestFiles.items():
        print('start create '+Key+' ImSet')
        FilePath=os.path.join(MainFolder,FileName)
        CreateImIdTxt(TrainValTestIds[Key],FilePath)

def FixTestDeassignTrainVal():
    TrainR=0.7
    SubFolder='TestSetOrOtherBackup'

    FileName='test.txt'#The test set is fixed. I have two categories here
    FilePath=os.path.join(MainFolder,SubFolder,FileName)
    TestImIdSet=GetPointTxtImIdSet(FilePath)

    #Image names recorded in txt: runs of consecutive images dominated by one category
    FileName='7480_8594ManyBlis.txt'
    FilePath=os.path.join(MainFolder,SubFolder,FileName)
    ManyBlisImIdSet=GetPointTxtImIdSet(FilePath)

    FileName='8594-8879ManyBreak.txt'
    FilePath=os.path.join(MainFolder,SubFolder,FileName)
    ManyBreakImIdSet=GetPointTxtImIdSet(FilePath)

    #Keep the small sets disjoint from the test set and from each other,
    #so that no image can end up in two splits
    TestSet=set(TestImIdSet)
    ManyBlisImIdSet=sorted(set(ManyBlisImIdSet)-TestSet)
    ManyBreakImIdSet=sorted(set(ManyBreakImIdSet)-TestSet-set(ManyBlisImIdSet))

    ImIdSet0=range(1,CurImNum+1)
    ImIdSet1=list(set(ImIdSet0).difference(TestSet))#Remove the test set from the total set
    ImIdSet2=list(set(ImIdSet1).difference(set(ManyBlisImIdSet)))
    ImIdSet=list(set(ImIdSet2).difference(set(ManyBreakImIdSet)))

    TrainImId,ValImId=AssignImIdSetAsRatio(ImIdSet,TrainR)#Ids not recorded in any txt are split by ratio
    MBlistTrainImId,MBlistValImId=AssignImIdSetAsRatio(ManyBlisImIdSet,TrainR)#Each small txt set is split by ratio on its own
    MBreakTrainImId,MBreakValImId=AssignImIdSetAsRatio(ManyBreakImIdSet,TrainR)#Each small txt set is split by ratio on its own

    #Merge the small sets into the large sets
    TrainImId=TrainImId+MBlistTrainImId+MBreakTrainImId
    ValImId=ValImId+MBlistValImId+MBreakValImId
    TrainValImId=ImIdSet1
    WriteImIdSet2TrainValTxt(TrainImId,ValImId,TrainValImId)
######################last end#############################

FixTestDeassignTrainVal()
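
The script above covers my case, where the test set is fixed. For the other case mentioned earlier, where the test set is not fixed and each small set is split by the train/val/test ratio, here is a minimal sketch of how it could look. It is not part of my original script: the function names `split_group` and `split_by_groups`, the `val_r` and `seed` parameters, and the ratios 0.7/0.2 are all assumptions made for illustration, and the groups are passed in as plain id lists instead of being read from the txt files.

```python
import random


def split_group(ids, train_r, val_r, rng):
    """Shuffle one group of ids and cut it into train/val/test lists."""
    ids = list(ids)  # copy, so the caller's sequence is not reordered
    rng.shuffle(ids)
    n_train = int(train_r * len(ids))
    n_val = int(val_r * len(ids))
    # Whatever is left after train and val goes to test.
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]


def split_by_groups(all_ids, groups, train_r=0.7, val_r=0.2, seed=0):
    """Split each concentrated group on its own, then merge the pieces.

    all_ids: every image id.
    groups:  one id sequence per concentrated run (hypothetical input; in
             the script above these come from the recorded txt files).
    Ids that belong to no group are split together as one last group.
    """
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    remaining = set(all_ids)
    parts = []
    for group in groups:
        # Ignore ids that are unknown or already claimed by an earlier group.
        part = sorted(set(group) & remaining)
        remaining -= set(part)
        parts.append(part)
    parts.append(sorted(remaining))

    merged = {'train': [], 'val': [], 'test': []}
    for part in parts:
        train, val, test = split_group(part, train_r, val_r, rng)
        merged['train'] += train
        merged['val'] += val
        merged['test'] += test
    for name in merged:
        merged[name].sort()
    return merged


# Example: 100 images with two concentrated runs of 20 images each.
sets = split_by_groups(range(1, 101), [range(41, 61), range(61, 81)])
print(len(sets['train']), len(sets['val']), len(sets['test']))
```

Because the counts are truncated with `int()`, any leftover images in a group fall into test, so a very small group may end up with an empty val part. If that matters for your data, check the size of each part before merging.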