Skip to main content

Handling Video Data with the PyTorch Dataset API

Is Video Recognition a Hot Topic Right Now?

With the increasing availability of large-scale video data from platforms like YouTube and TikTok, research on pose estimation and classification in videos using 3DCNN and ViT is thriving. While state-of-the-art models can be complex to implement, existing approaches like CNN+LSTM can actually achieve quite good accuracy. In this article, we will cover how to create datasets, which can be surprisingly tedious when trying out Video Recognition.

Note that video data is generally handled as a collection of individual frames treated as image data, so please perform that conversion separately. You can look into ffmpeg or torchvision.io for this.

Approach

For this example, we assume the task of learning from a few frames of video and predicting what action is being performed. For instance, we receive 3 frames of video and infer what action is taking place. In other words, we need to output any consecutive 4 frames along with their corresponding label.

import torch

class VideoDataset(torch.utils.data.Dataset):
def __init__(self, frames):
self.idxs = [0,1,2,3,4]
self.data = "a b c d e".split()
self.labels = "a b c d e".split()
self.frames = frames

def __len__(self):
return 5 - (self.frames - 1)

def __getitem__(self, idx):
res = [self.data[i] for i in range(idx, idx + self.frames)]
return res, self.labels[idx + (self.frames - 1)]

In the example above, the paths to each frame of the video are represented as a, b, c, d, e, and the dataset outputs a sequence of three frames along with the label corresponding to the third frame.

train = VideoDataset(frames=3)

for x, t in train:
print(x, t)

# out
['a', 'b', 'c'] c
['b', 'c', 'd'] d
['c', 'd', 'e'] e

By modifying the last line of __getitem__, you can change the position of the output label. Which frame to use as the label for a sequence depends on the problem setting, but using the first frame as the label would essentially mean training on future data, so in most cases the last frame is used as the ground truth. We will revisit this topic in a future article on time series data.

Next time, we will write a classification model using this dataset.