[PyTorch learning notes] Data import

Posted by tysoncane on Tue, 11 Jan 2022 15:35:04 +0100

1. Preface

Data loading in PyTorch relies on torch.utils.data.DataLoader together with torch.utils.data.Dataset (or torch.utils.data.IterableDataset).

2. torch.utils.data.DataLoader learning

As the official torch.utils.data documentation states, torch.utils.data.DataLoader is the core tool for data loading in PyTorch. It returns an iterable over the dataset from which samples can be drawn.

The DataLoader parameters are as follows:

DataLoader(dataset, batch_size=1, shuffle=False, sampler=None,
           batch_sampler=None, num_workers=0, collate_fn=None,
           pin_memory=False, drop_last=False, timeout=0,
           worker_init_fn=None, *, prefetch_factor=2,
           persistent_workers=False)
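
For orientation, here is a minimal usage sketch (using the built-in TensorDataset with made-up data; not part of the original notes):

import torch
from torch.utils.data import TensorDataset, DataLoader

# 100 samples with 10 features each, plus integer labels
features = torch.randn(100, 10)
labels = torch.randint(0, 2, (100,))
dataset = TensorDataset(features, labels)

loader = DataLoader(dataset, batch_size=16, shuffle=True, num_workers=0)
for batch_features, batch_labels in loader:
    print(batch_features.shape, batch_labels.shape)  # torch.Size([16, 10]) torch.Size([16])
    break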

2.1 Dataset

The dataset argument is the dataset object from which data is loaded. PyTorch supports two different types of datasets: map-style datasets and iterable-style datasets.

2.1.1 Map-style datasets

A map-style dataset implements the __getitem__() and __len__() protocols and represents a mapping from indices/keys to data samples. In other words, the dataset is accessed through indices/keys rather than read sequentially: when we access dataset[idx], the idx-th image and its corresponding label can be read from disk, which amounts to random access to the samples on disk.
Here __getitem__() returns the sample corresponding to a given index/key, and __len__() returns the size of the dataset.
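
As a minimal illustration (a toy sketch with made-up data, not part of the original notes), a map-style dataset only needs these two methods:

import torch
from torch.utils.data import Dataset

class SquaresDataset(Dataset):
    """Toy map-style dataset: sample idx is the pair (idx, idx**2)."""
    def __len__(self):
        return 10

    def __getitem__(self, idx):
        return torch.tensor(idx), torch.tensor(idx ** 2)

toy_ds = SquaresDataset()
print(len(toy_ds))   # 10
print(toy_ds[3])     # (tensor(3), tensor(9))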

2.1.2 Iterable-style datasets

An iterable-style dataset implements the __iter__() protocol and represents an iterable over data samples; the dataset reads the data itself, typically as a stream. This type of dataset is particularly suitable when random reads are expensive or even impossible, and when the batch size depends on the data that is fetched.
Note: with naive multi-process loading, each worker iterates over its own complete copy of the dataset, so the same data is read repeatedly.
e.g.:
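The snippet below assumes ds is a minimal iterable-style dataset that simply yields the integers in [3, 7); a possible sketch (the class name is illustrative):

import torch
from torch.utils.data import IterableDataset

class RangeIterableDataset(IterableDataset):
    """Yields the integers in [start, end) one by one."""
    def __init__(self, start, end):
        self.start = start
        self.end = end

    def __iter__(self):
        # Every worker runs this same iterator, which is why duplicates appear below.
        return iter(range(self.start, self.end))

ds = RangeIterableDataset(3, 7)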

# should give same set of data as range(3, 7), i.e., [3, 4, 5, 6].
# Single-process loading
print(list(torch.utils.data.DataLoader(ds, num_workers=0)))
# [3, 4, 5, 6]
# Directly doing multi-process loading yields duplicate data
print(list(torch.utils.data.DataLoader(ds, num_workers=2)))
# [3, 3, 4, 4, 5, 5, 6, 6]

2.1.3 Example

  • Load the CIFAR-10 dataset
import os
import pickle
import numpy as np

def load_CIFAR_batch(filename):
    '''load single batch of cifar'''
    with open(filename, 'rb') as f:
        datadict = pickle.load(f, encoding='latin1')
        x = datadict['data']
        print(x.shape)
        y = datadict['labels']
        X = x.reshape(10000, 3, 32, 32).transpose(0, 2, 3, 1).astype("float")
        Y = np.array(y)
        return X, Y

def load_CIFAR10(path):
    '''load all of cifar'''
    xs = []
    ys = []
    for b in range(1,6):
        f = os.path.join(path,'data_batch_%d' % (b, ))
        X, Y = load_CIFAR_batch(f)
        xs.append(X)
        ys.append(Y)
    Xtr = np.concatenate(xs)
    Ytr = np.concatenate(ys)
    del X, Y
    Xte, Yte = load_CIFAR_batch(os.path.join(path, 'test_batch'))
    return Xtr, Ytr, Xte, Yte

Xtr, Ytr, Xte, Yte = load_CIFAR10('D:\\data set\\cifar-10\\')
  • Map-style and iterable-style datasets
import math

import torch
from torch.utils.data import Dataset, IterableDataset, DataLoader

# Xtr, Xte expansion
Xtr_rows = Xtr.reshape(Xtr.shape[0], 32 * 32 * 3)
Xte_rows = Xte.reshape(Xte.shape[0], 32 * 32 * 3)

# Create subclass
class MyMapstyleDataset(Dataset):
    def __init__(self, datas, labels):
        # super(MyMapstyleDataset).__init__()
        self.datas = datas
        self.labels = labels
        
    def __getitem__(self, index):
        data = torch.tensor(self.datas[index])
        label = torch.tensor(self.labels[index])
        return data, label
    
    def __len__(self):
        return len(self.datas)
    
map_dataset = MyMapstyleDataset(Xtr_rows, Ytr)
train_dataloader = DataLoader(map_dataset, shuffle=False, batch_size=8, num_workers=0)


# Create subclass
class MyIterstyleDataset(IterableDataset):
    def __init__(self, start, end):
        super().__init__()
        assert end > start, "Error"
        self.start = start
        self.end = end
        # self.filepath = filepath
        
    def _sample_generator(self, start, end):
        # When the whole dataset cannot fit into memory at once, this generator streams samples one at a time
        for i in range(end - start):
            sample = {"data": torch.tensor(Xtr[start + i, :]), "label": torch.tensor(Ytr[start + i])}
            yield sample
        
    def __iter__(self):
        worker_info = torch.utils.data.get_worker_info()
        if worker_info is None:# Single Worker
            iter_start = self.start
            iter_end = self.end
        else:# Multiple Workers
            per_worker = int(math.ceil((self.end - self.start) / float(worker_info.num_workers)))
            worker_id = worker_info.id
            iter_start = self.start + worker_id * per_worker
            iter_end = min(iter_start + per_worker, self.end)
        sample_iterator = self._sample_generator(iter_start, iter_end)
        
        return sample_iterator
    
    def __len__(self):
        return self.end - self.start
    

iter_dataset = MyIterstyleDataset(0, len(Xtr_rows))
iter_train_dataloader = DataLoader(iter_dataset, shuffle=False, batch_size=8, num_workers=0)
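
A quick check (a sketch; variable names follow the code above) that both loaders produce batches as expected:

# Iterate a couple of batches from each loader and inspect the shapes.
for i, (data, label) in enumerate(train_dataloader):
    print(data.shape, label.shape)  # expected: torch.Size([8, 3072]) torch.Size([8])
    if i == 1:
        break

for i, batch in enumerate(iter_train_dataloader):
    print(batch["data"].shape, batch["label"].shape)  # expected: torch.Size([8, 32, 32, 3]) torch.Size([8])
    if i == 1:
        break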

2.2 Sampler

Samplers provide different ways of generating the order in which data is read. For iterable-style datasets the reading order is defined by the dataset itself, so samplers are mainly used with map-style datasets. A torch.utils.data.Sampler generates a specific sequence of indices/keys for data reading; it is an iterable over the indices/keys of the dataset.

class Sampler(Generic[T_co]):
	def __init__(self, data_source) -> None:
		pass
	
	def __iter__(self) -> Iterator[T_co]:
		raise NotImplementedError 

class SequentialSampler(Sampler[int]):
'''Sequential sampling; the order is always the same.'''

class RandomSampler(Sampler[int]):
'''Random sampling; if "replacement" is True, samples are drawn with replacement and the number of draws can differ from the dataset size (see "num_samples").'''

class SubsetRandomSampler(Sampler[int]):
'''Samples randomly from a given sequence of indices/keys, without replacement.'''

class WeightedRandomSampler(Sampler[int]):
'''Samples num_samples elements from [0, ..., len(weights)-1], with the probability of each element given by weights.'''

For simple sequential or shuffled sampling, the shuffle argument of DataLoader is enough: True gives shuffled (random) sampling, False gives sequential sampling.
Similarly, users can define a custom sampler that provides an __iter__() method yielding the next index/key on each iteration, as in the sketch below.
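
A possible custom sampler, as a sketch (the class is illustrative, not from the original notes): it simply visits the dataset from back to front.

import torch
from torch.utils.data import DataLoader, Sampler

class ReverseSampler(Sampler):
    """Yields indices from the last sample to the first."""
    def __init__(self, data_source):
        self.data_source = data_source

    def __iter__(self):
        return iter(range(len(self.data_source) - 1, -1, -1))

    def __len__(self):
        return len(self.data_source)

# Usage with the map-style dataset defined earlier (shuffle must stay False when a sampler is given):
# loader = DataLoader(map_dataset, sampler=ReverseSampler(map_dataset), batch_size=8)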

2.3 Automatic batching (batch_sampler)

  • batch_size: how many samples each batch contains (int, default: 1)
  • drop_last: whether to discard the last incomplete batch (bool, default: False)
  • batch_sampler: similar to sampler, but yields a batch of indices/keys at a time.
class BatchSampler(Sampler[List[int]]):
    def __init__(self, sampler, batch_size, drop_last):
        self.sampler = sampler
        self.batch_size = batch_size
        self.drop_last = drop_last

    def __iter__(self) -> Iterator[List[int]]:
        batch = []
        for idx in self.sampler:
            batch.append(idx)
            if len(batch) == self.batch_size:
                yield batch
                batch = []
        if len(batch) > 0 and not self.drop_last:
            yield batch

    def __len__(self) -> int:
        if self.drop_last:
            return len(self.sampler) // self.batch_size
        else:
            return (len(self.sampler) + self.batch_size - 1) // self.batch_size
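
As a usage sketch (building on map_dataset from the earlier example), a batch_sampler can be passed directly to DataLoader; batch_size, shuffle, sampler and drop_last must then be left at their defaults:

from torch.utils.data import BatchSampler, DataLoader, SequentialSampler

sampler = SequentialSampler(map_dataset)
batch_sampler = BatchSampler(sampler, batch_size=8, drop_last=False)
loader = DataLoader(map_dataset, batch_sampler=batch_sampler)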

DataLoader supports automatically collating individually fetched samples into batches: the batch_size and drop_last arguments are used to construct a batch_sampler from the sampler.
Note: when reading from an iterable-style dataset with multiple workers, drop_last drops the last incomplete batch of each worker's copy of the dataset.

'''Loading from a map-style dataset (automatic batching enabled)'''
for indices in batch_sampler:
    yield collate_fn([dataset[i] for i in indices])

'''Loading from an iterable-style dataset (automatic batching enabled)'''
dataset_iter = iter(dataset)
for indices in batch_sampler:
    yield collate_fn([next(dataset_iter) for _ in indices])

Automatic batching is disabled when both batch_size (default 1; it must be set to None explicitly) and batch_sampler (default None) are None.

'''Loading from a map-style dataset (automatic batching disabled)'''
for index in sampler:
    yield collate_fn(dataset[index])

'''Loading from an iterable-style dataset (automatic batching disabled)'''
for data in iter(dataset):
    yield collate_fn(data)
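
For instance (a sketch reusing map_dataset from the earlier example), automatic batching is turned off by passing batch_size=None:

# Each element yielded by the loader is exactly dataset[idx], with no extra batch dimension.
loader = DataLoader(map_dataset, batch_size=None, shuffle=False)
for data, label in loader:
    print(data.shape, label.shape)  # expected: torch.Size([3072]) torch.Size([])
    break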

2.4 collate_fn

After the sampler or batch_sampler has produced indices and the corresponding samples have been fetched, the collate_fn function merges the list of samples into a batch.

When automatic batching is disabled: collate_fn is called with each individual sample; in this case the default collate_fn simply converts NumPy arrays to PyTorch tensors.
When automatic batching is enabled: collate_fn is called with a list of samples and collates that list into a batch.
E.g., if each sample in the dataset is a tuple (image, class_index), the default collate_fn returns a batched image tensor and a batched label tensor.

About the default collate_fn:

  • It prepends a new dimension that serves as the batch dimension;
  • It converts NumPy arrays into PyTorch tensors;
  • It preserves the original data structure, e.g. dictionaries and tuples.

You can customize collate_fn to implement custom batching, for example padding sequences of varying length, as in the sketch below.
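
A possible custom collate_fn, as a sketch (the function and dataset names are illustrative): it pads variable-length 1-D sequences to the length of the longest sequence in the batch.

import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

def pad_collate(batch):
    """batch is a list of (sequence, label) tuples whose sequences have different lengths."""
    seqs, labels = zip(*batch)
    lengths = torch.tensor([len(s) for s in seqs])
    padded = pad_sequence(seqs, batch_first=True)  # shape: (batch, max_len), zero-padded
    return padded, torch.tensor(labels), lengths

# loader = DataLoader(my_sequence_dataset, batch_size=8, collate_fn=pad_collate)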

3. Reference

[1]. torch.utils.data.DataLoader official documentation
[2]. Usage of PyTorch IterableDataset
[3]. Loading the CIFAR-10 dataset

Topics: Machine Learning, PyTorch, Deep Learning