1. Introduction to the dataset:
UCF101 is an action recognition dataset of realistic action videos collected from YouTube; it provides 13,320 videos spanning 101 action categories. Official website: https://www.crcv.ucf.edu/research/data-sets/ucf101/
- Dataset name: UCF-101 (2012)
- Total Videos: 13,320 Videos
- Total duration: 27 hours
- Video source: YouTube collection
- Video category: 101
- Five broad types of action: human-object interaction, body motion only, human-human interaction, playing musical instruments, and sports
- Each category (folder) is divided into 25 groups, each containing 4-7 short clips of varying duration; the filename encodes the category, group, and clip number (see the parsing sketch after this list)
- Specific categories (official English names): ApplyEyeMakeup, ApplyLipstick, Archery, BabyCrawling, BalanceBeam, BandMarching, BaseballPitch, Basketball, BasketballDunk, BenchPress, Biking, Billiards, BlowDryHair, BlowingCandles, BodyWeightSquats, Bowling, BoxingPunchingBag, BoxingSpeedBag, BreastStroke, BrushingTeeth, CleanAndJerk, CliffDiving, CricketBowling, CricketShot, CuttingInKitchen, Diving, Drumming, Fencing, FieldHockeyPenalty, FloorGymnastics, FrisbeeCatch, FrontCrawl, GolfSwing, Haircut, HammerThrow, Hammering, HandstandPushups, HandstandWalking, HeadMassage, HighJump, HorseRace, HorseRiding, HulaHoop, IceDancing, JavelinThrow, JugglingBalls, JumpRope, JumpingJack, Kayaking, Knitting, LongJump, Lunges, MilitaryParade, Mixing, MoppingFloor, Nunchucks, ParallelBars, PizzaTossing, PlayingCello, PlayingDaf, PlayingDhol, PlayingFlute, PlayingGuitar, PlayingPiano, PlayingSitar, PlayingTabla, PlayingViolin, PoleVault, PommelHorse, PullUps, Punch, PushUps, Rafting, RockClimbingIndoor, RopeClimbing, Rowing, SalsaSpin, ShavingBeard, Shotput, SkateBoarding, Skiing, Skijet, SkyDiving, SoccerJuggling, SoccerPenalty, StillRings, SumoWrestling, Surfing, Swing, TableTennisShot, TaiChi, TennisSwing, ThrowDiscus, TrampolineJumping, Typing, UnevenBars, VolleyballSpiking, WalkingWithDog, WallPushups, WritingOnBoard, YoYo
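Inside each category folder, clip filenames follow the pattern v_<Category>_g<group>_c<clip>.avi (for example, v_ApplyEyeMakeup_g01_c01.avi is group 1, clip 1). A minimal sketch for parsing this convention:

```python
import re

def parse_ucf101_name(fname):
    # filenames look like v_<Category>_g<group>_c<clip>.avi
    m = re.match(r'v_(.+)_g(\d+)_c(\d+)\.avi$', fname)
    category, group, clip = m.group(1), int(m.group(2)), int(m.group(3))
    return category, group, clip

print(parse_ucf101_name('v_ApplyEyeMakeup_g01_c01.avi'))  # ('ApplyEyeMakeup', 1, 1)
```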
2. Dataset acquisition and decompression:
1. Data Download
UCF101 data download address: http://crcv.ucf.edu/data/UCF101/UCF101.rar
Official data division download address: https://www.crcv.ucf.edu/wp-content/uploads/2019/03/UCF101TrainTestSplits-RecognitionTask.zip
Note: The dataset is 6.46 GB, and the official archive provides three alternative train/test splits; you can use whichever you prefer.
2. Dataset decompression:
The dataset is a rar archive; cd into the folder containing it and decompress it with rar:

```bash
rar x UCF101.rar
```
After decompression you get the standard directory layout for a classification dataset: each second-level directory is named after a human action category, and the corresponding video files sit inside it.
Each short clip varies in duration (from about a second to more than ten seconds), has a resolution of 320×240 and a frame rate of either 25 or 29 fps, and contains exactly one type of human action.
Note: If rar is not available locally, install it first (see "rar Tool Installation and Common Commands on Linux"). If you lack the permissions to install it, contact the administrator; if the server runs Docker, you can use chmod to adjust permissions inside the container and install it there.
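Once decompressed, you can sanity-check the resolution and frame rate of any clip with OpenCV; a minimal sketch (the sample path is illustrative):

```python
import cv2

cap = cv2.VideoCapture('./UCF-101/ApplyEyeMakeup/v_ApplyEyeMakeup_g01_c01.avi')
print('size  :', int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)), 'x',
      int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)))           # expect 320 x 240
print('fps   :', cap.get(cv2.CAP_PROP_FPS))              # expect 25 or ~29.97
print('frames:', int(cap.get(cv2.CAP_PROP_FRAME_COUNT)))
cap.release()
```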
3. Dataset partition
Unzip the downloaded UCF101TrainTestSplits-RecognitionTask archive. It contains three alternative train/test splits (trainlist01-03.txt and testlist01-03.txt); each trainlist line pairs a video path with a class index, while the testlist lines contain paths only.
Choose whichever split you prefer. This article uses the first split and moves its test videos into a val folder with the following code:
```python
import shutil, os

txtlist = ['testlist01.txt']
dataset_dir = './UCF-101/'  # where the dataset is stored
copy_path = './val/'        # where the validation set will be stored

for txtfile in txtlist:
    for line in open(txtfile, 'r'):
        o_filename = dataset_dir + line.strip()
        n_filename = copy_path + line.strip()
        if not os.path.exists('/'.join(n_filename.split('/')[:-1])):
            os.makedirs('/'.join(n_filename.split('/')[:-1]))
        shutil.move(o_filename, n_filename)
```
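After the move it is worth sanity-checking the file counts; for the first split, the test list holds 3,783 videos, leaving 9,537 for training. A minimal check using the paths above:

```python
import os

def count_videos(root):
    # walk the category folders and count .avi files
    return sum(f.endswith('.avi') for _, _, files in os.walk(root) for f in files)

print('train:', count_videos('./UCF-101/'))  # expect 9537 for split 1
print('val  :', count_videos('./val/'))      # expect 3783 for split 1
```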
4. Dataset preprocessing
The data can be loaded in two ways: either convert the video files into pkl files first and process those, or process the video files directly.
1. Generate pkl file
Converting the video files into pkl files speeds up data reading. Code:
```python
import os
from pathlib import Path
import random
import cv2
import numpy as np
import pickle as pk
from tqdm import tqdm
from PIL import Image
import multiprocessing
import torchvision.transforms as transforms
from torch.utils.data import DataLoader, Dataset


class VideoDataset(Dataset):
    def __init__(self, directory, local_rank, num_local_rank, resize_shape=[168, 168],
                 mode='val', clip_len=8, frame_sample_rate=2):
        folder = Path(directory)  # get the directory of the specified split
        print("Load dataset from folder : ", folder)
        self.clip_len = clip_len
        self.resize_shape = resize_shape
        self.frame_sample_rate = frame_sample_rate
        self.mode = mode
        self.fnames, labels = [], []
        # collect (at most the first 200) category folders and their videos
        for label in sorted(os.listdir(folder))[:200]:
            for fname in os.listdir(os.path.join(folder, label)):
                self.fnames.append(os.path.join(folder, label, fname))
                labels.append(label)
        '''
        random_list = list(zip(self.fnames, labels))
        random.shuffle(random_list)
        self.fnames[:], labels[:] = zip(*random_list)
        '''
        # prepare a mapping between the label names (strings) and indices (ints)
        self.label2index = {label: index for index, label in enumerate(sorted(set(labels)))}
        # convert the list of label names into an array of label indices
        self.label_array = np.array([self.label2index[label] for label in labels], dtype=int)

        label_file = str(len(os.listdir(folder))) + 'class_labels.txt'
        with open(label_file, 'w') as f:
            for id, label in enumerate(sorted(self.label2index)):
                f.writelines(str(id + 1) + ' ' + label + '\n')

        # shard the file list across worker processes
        if mode in ('train', 'val') and num_local_rank > 1:
            single_num_ = len(self.fnames) // num_local_rank
            self.fnames = self.fnames[local_rank * single_num_:(local_rank + 1) * single_num_]
            labels = labels[local_rank * single_num_:(local_rank + 1) * single_num_]

        for file in tqdm(self.fnames, ncols=80):
            fname = file.split("/")
            self.directory = '/root/dataset/{}/{}'.format(fname[-3], fname[-2])
            if os.path.exists('{}/{}.pkl'.format(self.directory, fname[-1])):
                continue
            capture = cv2.VideoCapture(file)
            frame_count = int(capture.get(cv2.CAP_PROP_FRAME_COUNT))
            if frame_count > self.clip_len:
                buffer = self.loadvideo(capture, frame_count, file)
            else:
                # too few frames: fall back to a random other video
                while frame_count < self.clip_len:
                    index = np.random.randint(self.__len__())
                    capture = cv2.VideoCapture(self.fnames[index])
                    frame_count = int(capture.get(cv2.CAP_PROP_FRAME_COUNT))
                buffer = self.loadvideo(capture, frame_count, file)

    def __getitem__(self, index):
        # loading and preprocessing. TODO: move them to transform classes
        return index

    def __len__(self):
        return len(self.fnames)

    def loadvideo(self, capture, frame_count, fname):
        # read video frames into a numpy array and dump it as a pkl file
        self.transform_nor = transforms.Compose([
            transforms.Resize([224, 224]),
        ])
        # sample at most ~300 frames from long videos
        start_idx = 0
        end_idx = frame_count - 1
        frame_count_sample = frame_count // self.frame_sample_rate - 1
        if frame_count > 300:
            end_idx = np.random.randint(300, frame_count)
            start_idx = end_idx - 300
            frame_count_sample = 301 // self.frame_sample_rate - 1
        buffer_normal = np.empty((frame_count_sample, 224, 224, 3), np.dtype('uint8'))

        count = 0
        retaining = True
        sample_count = 0
        # read in each frame, one at a time, into the numpy buffer array
        while count <= end_idx and retaining:
            retaining, frame = capture.read()
            if count < start_idx:
                count += 1
                continue
            if retaining is False or count > end_idx:
                break
            if count % self.frame_sample_rate == self.frame_sample_rate - 1 and sample_count < frame_count_sample:
                frame = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
                buffer_normal[sample_count] = self.transform_nor(frame)
                sample_count += 1
            count += 1

        fname = fname.split("/")
        self.directory = '/root/dataset/{}/{}'.format(fname[-3], fname[-2])
        if not os.path.exists(self.directory):
            os.makedirs(self.directory)
        # save the frame array to a .pkl file
        with open('{}/{}.pkl'.format(self.directory, fname[-1]), 'wb') as Normal_writer:
            pk.dump(buffer_normal, Normal_writer)
        capture.release()
        return buffer_normal


if __name__ == '__main__':
    datapath = '/root/dataset/UCF101'
    process_num = 24
    for i in range(process_num):
        p = multiprocessing.Process(target=VideoDataset, args=(datapath, i, process_num))
        p.start()
    print('CPU core number:' + str(multiprocessing.cpu_count()))
    for p in multiprocessing.active_children():
        print('Subprocess ' + p.name + ' id: ' + str(p.pid))
    print('all done')
```
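After the conversion finishes, it is worth loading one pkl back to confirm what was stored; a minimal sketch, assuming the /root/dataset output root used above (the sample filename is illustrative):

```python
import pickle as pk

# each pkl holds a uint8 array of shape (num_sampled_frames, 224, 224, 3)
with open('/root/dataset/UCF101/ApplyEyeMakeup/v_ApplyEyeMakeup_g01_c01.avi.pkl', 'rb') as f:
    frames = pk.load(f)
print(frames.shape, frames.dtype)  # e.g. (74, 224, 224, 3) uint8
```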
Then load the pkl files for training with a second Dataset class:
```python
import os
from pathlib import Path
import random
import cv2
import numpy as np
import pickle as pk
from tqdm import tqdm
from PIL import Image
import torchvision.transforms as transforms
from torch.utils.data import DataLoader, Dataset


class VideoDataset(Dataset):
    def __init__(self, directory_list, local_rank=0, enable_GPUs_num=0, distributed_load=False,
                 resize_shape=[224, 224], mode='train', clip_len=32, crop_size=160):
        self.clip_len, self.crop_size, self.resize_shape = clip_len, crop_size, resize_shape
        self.mode = mode
        self.fnames, labels = [], []
        # get the directory of the specified split
        for directory in directory_list:
            folder = Path(directory)
            print("Load dataset from folder : ", folder)
            for label in sorted(os.listdir(folder)):
                for fname in (os.listdir(os.path.join(folder, label)) if mode == "train"
                              else os.listdir(os.path.join(folder, label))[:10]):
                    self.fnames.append(os.path.join(folder, label, fname))
                    labels.append(label)

        random_list = list(zip(self.fnames, labels))
        random.shuffle(random_list)
        self.fnames[:], labels[:] = zip(*random_list)
        # self.fnames = self.fnames[:240]
        '''
        if mode == 'train' and distributed_load:
            single_num_ = len(self.fnames) // enable_GPUs_num
            self.fnames = self.fnames[local_rank * single_num_:(local_rank + 1) * single_num_]
            labels = labels[local_rank * single_num_:(local_rank + 1) * single_num_]
        '''
        # prepare a mapping between the label names (strings) and indices (ints)
        self.label2index = {label: index for index, label in enumerate(sorted(set(labels)))}
        # convert the list of label names into an array of label indices
        self.label_array = np.array([self.label2index[label] for label in labels], dtype=int)

    def __getitem__(self, index):
        # loading and preprocessing. TODO: move them to transform classes
        buffer = self.loadvideo(self.fnames[index])
        if self.mode == 'train':
            # random spatial crop for training
            height_index = np.random.randint(buffer.shape[2] - self.crop_size)
            width_index = np.random.randint(buffer.shape[3] - self.crop_size)
            return buffer[:, :, height_index:height_index + self.crop_size,
                          width_index:width_index + self.crop_size], self.label_array[index]
        else:
            return buffer, self.label_array[index]

    def __len__(self):
        return len(self.fnames)

    def loadvideo(self, fname):
        # load the pre-extracted frame array from the pkl file
        with open(fname, 'rb') as Video_reader:
            video = pk.load(Video_reader)
        # if the clip is too short, fall back to a random other sample
        while video.shape[0] < self.clip_len + 2:
            index = np.random.randint(self.__len__())
            with open(self.fnames[index], 'rb') as Video_reader:
                video = pk.load(Video_reader)

        height, width = video.shape[1], video.shape[2]
        center = (height // 2, width // 2)
        flip, flipCode = np.random.random() < 0.5, 1  # horizontal flip for half of the clips
        # rotation, rotationCode = True if np.random.random() < 0.2 else False, random.choice([-270, -180, -90, 90, 180, 270])

        # temporal sampling: randomly speed up long clips during training
        speed_rate = np.random.randint(1, 3) if video.shape[0] > self.clip_len * 2 + 2 and self.mode == "train" else 1
        time_index = np.random.randint(video.shape[0] - self.clip_len * speed_rate)
        video = video[time_index:time_index + (self.clip_len * speed_rate):speed_rate, :, :, :]

        self.transform = transforms.Compose([
            transforms.Resize([self.resize_shape[0], self.resize_shape[1]]),
            transforms.ToTensor(),
            transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])
        ])
        self.transform_val = transforms.Compose([
            transforms.Resize([self.crop_size, self.crop_size]),
            transforms.ToTensor(),
            transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])
        ])

        if self.mode == 'train':
            # create a buffer with dtype float16; PyTorch converts it to a FloatTensor later
            buffer = np.empty((self.clip_len, 3, self.resize_shape[0], self.resize_shape[1]), np.dtype('float16'))
            for idx, frame in enumerate(video):
                if flip:
                    frame = cv2.flip(frame, flipCode=flipCode)
                '''
                if rotation:
                    rot_mat = cv2.getRotationMatrix2D(center, rotationCode, 1)
                    frame = cv2.warpAffine(frame, rot_mat, (height, width))
                '''
                buffer[idx] = self.transform(Image.fromarray(frame))
        elif self.mode == 'validation':
            buffer = np.empty((self.clip_len, 3, self.crop_size, self.crop_size), np.dtype('float16'))
            for idx, frame in enumerate(video):
                buffer[idx] = self.transform_val(Image.fromarray(frame))

        # return clips in (C, T, H, W) layout, as expected by 3D CNNs
        return buffer.transpose((1, 0, 2, 3))


if __name__ == '__main__':
    datapath = ['/root/data2/dataset/UCF-101']
    dataset = VideoDataset(datapath, resize_shape=[224, 224], mode='validation')
    dataloader = DataLoader(dataset, batch_size=16, shuffle=True, num_workers=0)
    bar = tqdm(total=len(dataloader), ncols=80)
    for step, (buffer, labels) in enumerate(dataloader):
        print(buffer.shape)
        print("label: ", labels)
        bar.update(1)
```
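With the settings in the __main__ block (mode='validation', clip_len=32, crop_size=160, batch_size=16), each batch printed by the loop should have shape [16, 3, 32, 160, 160], i.e. (batch, channels, frames, height, width), since loadvideo returns clips in (C, T, H, W) layout.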
2. Processing video files directly
The overall flow is similar to the pkl version, except that the Dataset reads the video files directly. Code:
```python
import os
from pathlib import Path
import random
import numpy as np
import pickle as pk
import cv2
from tqdm import tqdm
from PIL import Image
import torchvision.transforms as transforms
import torch
from prefetch_generator import BackgroundGenerator
from torch.utils.data import DataLoader, Dataset


class VideoDataset(Dataset):
    def __init__(self, directory_list, local_rank=0, enable_GPUs_num=0, distributed_load=False,
                 resize_shape=[224, 224], mode='train', clip_len=32, crop_size=168):
        self.clip_len, self.crop_size, self.resize_shape = clip_len, crop_size, resize_shape
        self.mode = mode
        self.fnames, labels = [], []
        # get the directory of the specified split
        for directory in directory_list:
            folder = Path(directory)
            print("Load dataset from folder : ", folder)
            for label in sorted(os.listdir(folder)):
                for fname in (os.listdir(os.path.join(folder, label)) if mode == "train"
                              else os.listdir(os.path.join(folder, label))[:10]):
                    self.fnames.append(os.path.join(folder, label, fname))
                    labels.append(label)

        random_list = list(zip(self.fnames, labels))
        random.shuffle(random_list)
        self.fnames[:], labels[:] = zip(*random_list)
        # self.fnames = self.fnames[:240]

        # shard the file list across GPUs for distributed training
        if mode == 'train' and distributed_load:
            single_num_ = len(self.fnames) // enable_GPUs_num
            self.fnames = self.fnames[local_rank * single_num_:(local_rank + 1) * single_num_]
            labels = labels[local_rank * single_num_:(local_rank + 1) * single_num_]

        # prepare a mapping between the label names (strings) and indices (ints)
        self.label2index = {label: index for index, label in enumerate(sorted(set(labels)))}
        # convert the list of label names into an array of label indices
        self.label_array = np.array([self.label2index[label] for label in labels], dtype=int)

    def __getitem__(self, index):
        # loading and preprocessing. TODO: move them to transform classes
        buffer = self.loadvideo(self.fnames[index])
        # random spatial crop
        height_index = np.random.randint(buffer.shape[2] - self.crop_size)
        width_index = np.random.randint(buffer.shape[3] - self.crop_size)
        return buffer[:, :, height_index:height_index + self.crop_size,
                      width_index:width_index + self.crop_size], self.label_array[index]

    def __len__(self):
        return len(self.fnames)

    def loadvideo(self, fname):
        # initialize a VideoCapture object to read video data into a numpy array
        self.transform = transforms.Compose([
            transforms.Resize([self.resize_shape[0], self.resize_shape[1]]),
            transforms.ToTensor(),
            transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])
        ])
        # apply a random flip to half of the clips, during training only
        flip, flipCode = ((1, random.choice([-1, 0, 1]))
                          if np.random.random() < 0.5 and self.mode == "train" else (0, 0))

        try:
            video_stream = cv2.VideoCapture(fname)
            frame_count = int(video_stream.get(cv2.CAP_PROP_FRAME_COUNT))
        except RuntimeError:
            index = np.random.randint(self.__len__())
            video_stream = cv2.VideoCapture(self.fnames[index])
            frame_count = int(video_stream.get(cv2.CAP_PROP_FRAME_COUNT))

        # if the clip is too short, fall back to a random other sample
        while frame_count < self.clip_len + 2:
            index = np.random.randint(self.__len__())
            video_stream = cv2.VideoCapture(self.fnames[index])
            frame_count = int(video_stream.get(cv2.CAP_PROP_FRAME_COUNT))

        # temporal sampling: randomly speed up long clips
        speed_rate = np.random.randint(1, 3) if frame_count > self.clip_len * 2 + 2 else 1
        time_index = np.random.randint(frame_count - self.clip_len * speed_rate)
        start_idx, end_idx, final_idx = time_index, time_index + (self.clip_len * speed_rate), frame_count - 1
        count, sample_count, retaining = 0, 0, True

        # create a buffer with dtype float16; PyTorch converts it to a FloatTensor later
        buffer = np.empty((self.clip_len, 3, self.resize_shape[0], self.resize_shape[1]), np.dtype('float16'))
        while count <= end_idx and retaining:
            retaining, frame = video_stream.read()
            if count < start_idx:
                count += 1
                continue
            if count % speed_rate == speed_rate - 1 and count >= start_idx and sample_count < self.clip_len:
                if flip:
                    frame = cv2.flip(frame, flipCode=flipCode)
                try:
                    buffer[sample_count] = self.transform(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
                except cv2.error:
                    continue
                sample_count += 1
            count += 1
        video_stream.release()
        # return clips in (C, T, H, W) layout
        return buffer.transpose((1, 0, 2, 3))


if __name__ == '__main__':
    datapath = ['/root/data1/datasets/UCF-101']
    dataset = VideoDataset(datapath, resize_shape=[224, 224], mode='validation')
    dataloader = DataLoader(dataset, batch_size=8, shuffle=True, num_workers=24, pin_memory=True)
    bar = tqdm(total=len(dataloader), ncols=80)
    # DataPrefetcher is assumed to be defined elsewhere (it is not shown in this article);
    # the commented loop below works with BackgroundGenerator alone
    prefetcher = DataPrefetcher(BackgroundGenerator(dataloader), 0)
    batch = prefetcher.next()
    iter_id = 0
    while batch is not None:
        iter_id += 1
        bar.update(1)
        if iter_id >= len(dataloader):
            break
        batch = prefetcher.next()
        print(batch[0].shape)
        print("label: ", batch[1])
    '''
    for step, (buffer, labels) in enumerate(BackgroundGenerator(dataloader)):
        print(buffer.shape)
        print("label: ", labels)
        bar.update(1)
    '''
```
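The DataPrefetcher used above is never defined in this article. A minimal sketch of a CUDA-stream prefetcher matching the .next() interface the loop expects, based on the common pattern from NVIDIA's apex examples (the class name and behavior here are assumptions, not the author's original code):

```python
import torch

class DataPrefetcher:
    # Hypothetical helper: overlaps host-to-GPU copies with compute by staging
    # the next batch on a side CUDA stream. Not the article's original class.
    def __init__(self, loader, device_id=0):
        self.loader = iter(loader)
        self.device = torch.device('cuda', device_id)
        self.stream = torch.cuda.Stream(self.device)
        self.preload()

    def preload(self):
        try:
            self.next_buffer, self.next_labels = next(self.loader)
        except StopIteration:
            self.next_buffer, self.next_labels = None, None
            return
        with torch.cuda.stream(self.stream):
            # asynchronous copy; pin_memory=True in the DataLoader makes this effective
            self.next_buffer = self.next_buffer.to(self.device, non_blocking=True).float()
            self.next_labels = self.next_labels.to(self.device, non_blocking=True)

    def next(self):
        if self.next_buffer is None:
            return None
        # make the default stream wait until the staged copy has finished
        torch.cuda.current_stream(self.device).wait_stream(self.stream)
        batch = (self.next_buffer, self.next_labels)
        self.preload()
        return batch
```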