When the samples are not randomly distributed in machine learning, the process of creating trace Val test and other files

Posted by jscofield on Sun, 22 Dec 2019 20:40:28 +0100

In the previous blog, I wrote a random learning process for training samples according to the specified proportion. For details, see:

https://blog.csdn.net/lingyunxianhe/article/details/81837978

The premise is that your categories are randomly or more scientifically evenly distributed in the samples, rather than in a certain segment of data in a category set and in a continuous way. In this way, your random samples may make the distribution of train val test very bad

Because of the data manually marked by myself, sometimes for the convenience of marking, the pictures of the same category may be relatively concentrated. I have more than 80% of the samples of the same category in more than 500 consecutive pictures. Therefore, in order to make it reasonable to allocate the train val test, when the samples of a certain category in several consecutive pictures account for a large proportion, they are recorded in a TXT file (using a program Just write the picture name to the txt file), and then allocate the small set according to the allocation proportion of train val (I'm test fixed here). If the test is not fixed, then allocate the small set according to the proportion of train val test. Finally, integrate these small sets into a large set. The specific code is as follows:

#!/usr/bin/python
# -*- coding: UTF-8 -*-
# 2018/08/11 by DQ
 
import os 
import random 

MidFolder='py-faster-rcnn' 
MainFolder=os.path.join('/home/KingMe/project',MidFolder,'data/FABdevkit2017/FAB2017/ImageSets/Main')
AnotFolder=os.path.join('/home/KingMe/project',MidFolder,'data/FABdevkit2017/FAB2017/Annotations')
fileIdLen=6 #
CurImNum=len(os.listdir(AnotFolder)) 

######################last start#############################
def CreateImIdTxt(ImIdS,FilePath):
	if os.path.exists(FilePath):
		os.remove(FilePath)
	with open(FilePath,'w') as FId:
		for ImId in ImIdS:
			ImIdStr=str(ImId).zfill(fileIdLen)+'\n'
			FId.writelines(ImIdStr) 


#Gets the picture collection name of the specified txt document record
def GetPointTxtImIdSet(FilePath):	
	ImIdSet=[]
	if os.path.exists(FilePath):
		with open(FilePath) as FId:
			TxtList=FId.readlines()
			#print TxtList
		for TxtStr in TxtList:
			ImId=TxtStr.split()
			ImIdSet.append(int(ImId[0]))
	return ImIdSet


def AssignImIdSetAsRatio(ImIdSet,TrainR):
	random.shuffle(ImIdSet)
	ImNum=len(ImIdSet)
	TrainNum=int(TrainR*ImNum)	
	TrainImId=ImIdSet[:TrainNum-1]	
	ValImId=list(set(ImIdSet).difference(set(TrainImId)))

	return TrainImId,ValImId


def WriteImIdSet2TrainValTxt(TrainImId,ValImId,TrainValImId):
	TrainImId.sort()
	ValImId.sort()
	TrainValImId.sort()

	TrainValTestIds={}
	TrainValTestIds['train']=TrainImId
	TrainValTestIds['val']=ValImId
	TrainValTestIds['trainval']=TrainValImId

	TrainValTestFiles={'train':'train.txt','val':'val.txt','trainval':'trainval.txt'} 
	for Key,KeyVal in TrainValTestFiles.iteritems():
		print 'start create '+ Key+' ImSet'
		ImIdS=TrainValTestIds[Key]
		FileName=TrainValTestFiles[Key]
		FilePath=os.path.join(MainFolder,FileName)
		CreateImIdTxt(ImIdS,FilePath)


def FixTestDeassignTrainVal():
	TrainR=0.7

	SubFolder='TestSetOrOtherBackup'	
	FileName='test.txt'#The test set is fixed. I have two categories here	 
	FilePath=os.path.join(MainFolder,SubFolder,FileName)
	TestImIdSet=GetPointTxtImIdSet(FilePath)
	FileName='7480_8594ManyBlis.txt'	 
	FilePath=os.path.join(MainFolder,SubFolder,FileName)
	ManyBlisImIdSet=GetPointTxtImIdSet(FilePath)#Obtain the picture name with a large proportion of the sample size of a certain category in consecutive pictures recorded in txt	
	FileName='8594-8879ManyBreak.txt'	 
	FilePath=os.path.join(MainFolder,SubFolder,FileName)
	ManyBreakImIdSet=GetPointTxtImIdSet(FilePath)#Obtain the picture name with a large proportion of the sample size of a certain category in consecutive pictures recorded in txt	

	ImIdSet0=range(1,CurImNum+1)
	ImIdSet1=list(set(ImIdSet0).difference(set(TestImIdSet)))#Remove test set from total set
	ImIdSet2=list(set(ImIdSet1).difference(set(ManyBlisImIdSet)))
	ImIdSet=list(set(ImIdSet2).difference(set(ManyBreakImIdSet)))
	
	TrainImId,ValImId=AssignImIdSetAsRatio(ImIdSet,TrainR)#Set of non txt records is prorated
	MBlistTrainImId,MBlistValImId=AssignImIdSetAsRatio(ManyBlisImIdSet,TrainR)#Small sets of txt records are allocated proportionally separately
	MBreakTrainImId,MBreakValImId=AssignImIdSetAsRatio(ManyBreakImIdSet,TrainR)#Small sets of txt records are allocated proportionally separately
        #Merge small sets into large sets
	TrainImId=TrainImId+MBlistTrainImId+MBreakTrainImId
	ValImId=ValImId+MBlistValImId+MBreakValImId
	TrainValImId=ImIdSet1
	
	WriteImIdSet2TrainValTxt(TrainImId,ValImId,TrainValImId)
######################last end#############################

FixTestDeassignTrainVal()

Topics: Python

Programmer Think

When the samples are not randomly distributed in machine learning, the process of creating trace Val test and other files

Hot Topics