Object Tracking In Videos

Harnessing The Power Of Object Tracking In Video

Object tracking in videos, or video object tracking, is the process of locating an object and following it as it moves across the frames of a video. Video object tracking is used for a variety of applications like traffic control, video editing, medical imaging, tracking faces and eyes for human-computer interaction, and tracking objects or people for surveillance and security.

How exactly is object tracking in videos carried out by a neural network? What strategies and algorithms can be applied to create robust object tracking for videos?

Photo: (https://pixabay.com/illustrations/film-photo-slides-cinema-1668918/) by geralt via Pixabay, Pixabay License (https://pixabay.com/service/license/)

Object Tracking Procedure

Before we look at the different algorithms that can be used to carry out video object tracking, let’s make sure we understand the general process of video object tracking. Object tracking can be broken down into three different sections: initial object detection, assigning IDs, and tracking the object across frames.

To begin with, an initial set of object detections is created. This is typically done by taking a set of bounding box coordinates and using them as inputs for the network. After this, a unique ID is created for each of these initial object detections. 

Next, the objects are continually detected as the frames advance and the objects move, and the unique IDs are maintained. Because unique IDs are assigned to the objects, many different objects can be tracked throughout the video.
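The ID-assignment step above can be sketched in a few lines. This is a hypothetical helper, not taken from any particular library: a counter hands out a fresh unique ID for each initial detection, and the mapping from ID to bounding box is what persists across frames.

```python
# A minimal sketch (hypothetical helper, not from any library) of how unique
# IDs can be assigned to an initial set of detections and looked up later.

next_object_id = 0          # monotonically increasing ID counter
tracked_objects = {}        # maps object ID -> bounding box (x, y, w, h)

def register(bbox):
    """Assign a fresh unique ID to a newly detected bounding box."""
    global next_object_id
    tracked_objects[next_object_id] = bbox
    next_object_id += 1
    return next_object_id - 1

def deregister(object_id):
    """Drop an object that has left the frame or been lost."""
    del tracked_objects[object_id]

# Initial detections from the first frame (coordinates are made up):
first_id = register((10, 20, 50, 80))
second_id = register((200, 40, 60, 90))
```

Because each object keeps its ID as frames advance, the tracker can follow many objects at once and simply deregister an ID when its object disappears.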

The best object tracking algorithms require only a single object detection phase, which keeps the running time of the algorithm low. A video object tracking algorithm should also be able to handle instances where an object moves outside the boundaries of the frame (or is occluded by something), as well as be able to pick an object up again if it has “lost” it.

The most commonly used object tracking framework is OpenCV, an open-source computer vision library. OpenCV is compatible with deep learning frameworks like PyTorch, TensorFlow, and Caffe, and it includes predefined algorithms for use in tasks like facial recognition, human-computer interaction, mobile robotics, and of course object tracking.

Object Tracking For Videos - Algorithms

There are various algorithms that can be used to track objects in videos; let’s take a look at a few of them.


GOTURN

GOTURN stands for Generic Object Tracking Using Regression Networks, and it uses deep neural networks to track objects in an offline fashion. This is notable because most tracking algorithms train online, which is to say the algorithm learns how the object appears only at runtime. In contrast, GOTURN is trained on thousands of chunks of video before runtime, and as a result, it doesn’t need to train at all during runtime.

GOTURN operates by taking two different frames as the input and outputting the bounding box around the object within the second frame. The first frame is the “previous frame”, and in this frame, the location of the object is already known, while in the second frame the location of the object needs to be predicted. The first frame will always have the object centered, but since the object can move it need not be in the center of the second frame. A Convolutional Neural Network is used to predict the location of the bounding box within the second frame.
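The two-frame input described above can be illustrated with a simplified crop routine. This is a hypothetical sketch, not GOTURN’s actual code: the same search region, centered on the object’s known location in the previous frame, is cropped from both frames before being passed to the CNN.

```python
# A simplified sketch (hypothetical, not the actual GOTURN implementation) of
# preparing the two-frame input: a search region centered on the object's
# known previous location is cropped from both the previous and current frame.
import numpy as np

def crop_search_region(frame, bbox, padding=2.0):
    """Crop a region around bbox = (x, y, w, h), enlarged by `padding`."""
    x, y, w, h = bbox
    cx, cy = x + w / 2, y + h / 2
    half_w, half_h = (w * padding) / 2, (h * padding) / 2
    x0 = max(int(cx - half_w), 0)
    y0 = max(int(cy - half_h), 0)
    x1 = min(int(cx + half_w), frame.shape[1])
    y1 = min(int(cy + half_h), frame.shape[0])
    return frame[y0:y1, x0:x1]

prev_frame = np.zeros((240, 320, 3), dtype=np.uint8)
curr_frame = np.zeros((240, 320, 3), dtype=np.uint8)
prev_bbox = (100, 80, 40, 40)  # known object location in the previous frame

# Both crops would be resized and fed to the CNN, which regresses the
# object's bounding box within the current-frame crop.
prev_crop = crop_search_region(prev_frame, prev_bbox)
curr_crop = crop_search_region(curr_frame, prev_bbox)
```

Because both crops use the previous frame’s bounding box, the object sits at the center of the first crop but may have drifted off-center in the second, which is exactly what the regression network learns to correct.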


MDNet

Online trackers are those that learn the features of objects at runtime, in contrast to offline trackers like GOTURN. One example of an online video object tracker is the Multi-Domain Network, or MDNet. Because training a deep neural network is computationally expensive, small networks are typically used when training must happen around deployment time. The drawback of small networks is that they lack the classification/discrimination power of larger networks.

In order to deal with the fact that networks which train at runtime have lower discriminatory power, the training of the network can be split into different steps. 
For instance, the entire network can be trained before runtime, but during runtime, the first few layers of the network are used as feature extractors and only the last few layers of the network have their weights adjusted. Essentially, the CNNs are trained beforehand and used to extract features, while the last layers can quickly be trained online. Theoretically, this creates a multi-domain CNN that can be used in many different scenarios, capable of discriminating between background and target.
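The frozen-extractor/trainable-head split can be illustrated with a toy numpy example. This is not MDNet itself, just an assumption-laden stand-in: a fixed random projection plays the role of the pretrained feature extractor, and a single linear layer is updated online with gradient steps.

```python
# A toy numpy illustration (not MDNet itself) of the split described above:
# a frozen "feature extractor" plus a small trainable output layer. The
# extractor's weights stay fixed while the head is updated at runtime.
import numpy as np

rng = np.random.default_rng(0)

# Pretrained, frozen feature extractor: a fixed random projection here.
W_frozen = rng.standard_normal((16, 8))

def extract_features(x):
    return np.tanh(x @ W_frozen)   # these weights are never updated at runtime

# Trainable head: a single linear layer adjusted with online gradient steps.
w_head = np.zeros(8)

def train_head_step(x, y, lr=0.1):
    """One online gradient step on squared error, updating the head only."""
    global w_head
    feats = extract_features(x)
    pred = feats @ w_head
    grad = feats * (pred - y)      # d/dw of 0.5 * (pred - y)**2
    w_head = w_head - lr * grad

x_sample = rng.standard_normal(16)
before = extract_features(x_sample) @ w_head   # head starts at zero
train_head_step(x_sample, y=1.0)
after = extract_features(x_sample) @ w_head    # prediction moves toward y
```

Only the small head is trained online, which is what keeps runtime training cheap while the frozen layers supply the discriminative features.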

In practice, the background of one video could be the target of a different video, and so the CNN must have some method of discriminating between these two situations. MDNet handles possible confusion from similar targets and backgrounds by dividing the network into two portions, a shared portion and a portion that remains independent for every domain.

Every domain has its own training video, and the network is trained iteratively across all of the different domains. After training is complete, the layers specific to the different domains are removed, and the result is a feature extractor capable of interpreting any given background/object pair. At inference time, the removed domain-specific layers are replaced with a single binary classification layer that separates target from background.



Combining CNNs And LSTMs

There is a third type of video object tracking algorithm, one that combines convolutional neural networks with Long Short-Term Memory (LSTM) networks.

One kind of object detection algorithm is the You Only Look Once, or YOLO, algorithm. The YOLO algorithm functions by dividing an image into a grid and scoring bounding boxes by their Intersection Over Union (IOU), the amount of overlap between two boxes. The boxes being compared are the anchor boxes, which carry the ground-truth label and position of the object, and the predicted bounding boxes. The two are compared for similarity, with the network keeping the predictions that have the most overlap with the ground truth.
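The IOU score at the heart of this comparison is simple to compute. Here is a minimal version for two axis-aligned boxes given as corner coordinates (the coordinate convention is an assumption; detectors differ on box formats):

```python
# A minimal IOU (Intersection Over Union) computation for two axis-aligned
# bounding boxes given as (x1, y1, x2, y2) corner coordinates.
def iou(box_a, box_b):
    # Coordinates of the intersection rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Two boxes that half-overlap score about one third:
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # → ~0.333
```

An IOU of 1.0 means a perfect match and 0.0 means no overlap, so keeping the highest-IOU predictions amounts to keeping the boxes closest to the ground truth.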

Recurrent YOLO, or ROLO, is a kind of network which combines LSTMs and convolutional layers in an online, detection-based algorithm. The YOLO network is used for the purpose of object detection, while the LSTM network is responsible for determining the target object’s direction of movement. The inclusion of an LSTM means that the network is extremely good at learning historical patterns and relatively inexpensive computation-wise, making it well suited to visual object tracking.

The frames are first put through the YOLO network, and two different outputs are extracted by this network. Bounding box coordinates and image features are both extracted from the input frame. Both of these outputs then go on to enter the LSTM portion of the network, and the LSTM outputs the trajectories of the bounding boxes so that the object can be tracked.

When the YOLO network performs location inference, it assists the LSTM in attending to specific visual elements. The ROLO algorithm exploits both the location history and the temporal history of the object. ROLO tracking manages to stay stable even when the YOLO network’s observations are inaccurate due to severe motion blur.

Multiple Object Tracking With OpenCV

As previously mentioned, one of the most common methods of implementing video object tracking is using the OpenCV framework and tracking centroids. Centroid tracking in OpenCV operates by determining the Euclidean distance between existing, already known/labeled object centroids, and new object centroids over the subsequent frames of a video. 

First, an object detector is used to create bounding boxes, and once these bounding boxes are accepted, the centroids of the objects can be computed. Region-based CNNs (such as the R-CNN family) are common network choices for producing the bounding boxes.

After the initial bounding boxes have been made, the Euclidean distance between the existing objects and the new bounding boxes can be calculated. If an existing object and a new bounding box can be associated with one another, the coordinates of the existing object are updated to the new bounding box location. While a given object will move between frames, the distance between its centroid in one frame and its centroid in the next frame should be smaller than the distance between its centroid and the centroids of the surrounding objects, which is what makes nearest-centroid association work. If there are more detections than the current number of objects being tracked, new objects are registered.
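The matching step just described can be sketched compactly. This is a simplified greedy version (a hypothetical helper, assuming centroids as `(x, y)` tuples): each known object grabs the closest new centroid by Euclidean distance, and leftover detections are registered with new IDs. Production trackers often use a globally optimal assignment instead of this greedy pass.

```python
# A compact greedy sketch of the centroid-matching step described above: each
# known object is associated with the closest new centroid by Euclidean
# distance, and leftover detections are registered as new objects.
import math

def update_tracks(tracked, detections):
    """tracked: dict of id -> (x, y) centroid; detections: list of (x, y)."""
    tracked = dict(tracked)
    unmatched = list(detections)
    for obj_id, centroid in sorted(tracked.items()):
        if not unmatched:
            break
        # Pick the nearest new centroid for this existing object.
        nearest = min(unmatched, key=lambda c: math.dist(centroid, c))
        tracked[obj_id] = nearest
        unmatched.remove(nearest)
    # Any detections left over become newly registered objects.
    next_id = max(tracked, default=-1) + 1
    for centroid in unmatched:
        tracked[next_id] = centroid
        next_id += 1
    return tracked

tracks = {0: (10.0, 10.0), 1: (100.0, 100.0)}
tracks = update_tracks(tracks, [(102.0, 98.0), (12.0, 11.0), (200.0, 50.0)])
# Object 0 follows the nearby centroid, object 1 likewise, and the third
# detection is registered under a new ID.
```

A full implementation would also count how many consecutive frames an object goes unmatched and deregister it once that count exceeds a threshold, which is how occlusions and exits are handled.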

Object tracking for videos is an extremely important aspect of computer vision, used in everything from autonomous driving to human-computer interaction. There are a number of different algorithms that can be used to facilitate object tracking for videos, like GOTURN, MDNet, and ROLO. The algorithm you choose for object tracking should reflect your needs, as different algorithms perform better under different circumstances.