In terms of deep learning, object detection refers to the process of having a deep neural network recognize different objects within an image. This can be done in several different ways, but no matter how the task is carried out, object detection is critical for applications like autonomous driving, robot item sorting, and facial recognition.
So what is deep learning exactly? It is an extension of machine learning, and machine learning is the study of techniques that let machines carry out various tasks without being explicitly programmed to do so.
A simple neural network, the kind of machine learning system that deep learning builds on, has three principal components: an input, a node (or neuron), and an output. The input is the data that is fed into the system, and the node applies a mathematical function to that data. Finally, the output is the network's decision or inference about the data after the node has transformed it. Together, these three components make up a simple neural network.
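As a rough illustration of the input, neuron, and output described above, the following minimal Python sketch computes the output of a single neuron. The sigmoid activation, the function name `neuron`, and the sample weights are assumptions chosen for illustration; they are not taken from any particular library.

```python
import math

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, squashed into (0, 1) by a sigmoid."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical input values and weights, chosen only for illustration.
output = neuron(inputs=[0.5, -1.0, 2.0], weights=[0.4, 0.3, 0.1], bias=0.0)
print(output)
```

Here the list passed as `inputs` is the input, the `neuron` function is the node, and the returned number (a little above 0.5 for these values) is the output.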
A deep neural network is the term applied to many simple neural networks that have been linked together. Deep neural networks have multiple layers, where the output of the first layer becomes the input of the second layer, the output of the second layer becomes the input of the third layer, and so on. A neural network operates by analyzing the relevant features of the input data, detecting patterns in that data, and then making predictions about that data or similar data. The deeper a neural network is, that is, the more layers it has, the more complex the patterns it can distinguish.
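The hand-off from one layer to the next can be sketched in a few lines of Python. This is a minimal illustration, not a real framework: the ReLU activation, the helper names, and all weight values are assumptions made for the example.

```python
def relu(values):
    """Keep positive values, replace negative values with zero."""
    return [max(0.0, v) for v in values]

def dense(inputs, weights, biases):
    """One fully connected layer: each row of `weights` feeds one neuron."""
    return [sum(x * w for x, w in zip(inputs, row)) + b
            for row, b in zip(weights, biases)]

def forward(inputs, layers):
    """Pass data through every layer; each output becomes the next input."""
    activations = inputs
    for weights, biases in layers[:-1]:
        activations = relu(dense(activations, weights, biases))
    weights, biases = layers[-1]
    return dense(activations, weights, biases)

# Hypothetical two-layer network.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),  # layer 1: 2 inputs -> 2 neurons
    ([[1.0, 1.0]], [0.1]),                    # layer 2: 2 inputs -> 1 neuron
]
prediction = forward([2.0, 1.0], layers)
print(prediction)
```

Adding more `(weights, biases)` pairs to `layers` makes the network deeper without changing `forward`.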
The term "deep learning" just refers to using deep neural networks to analyze data, discern the relevant patterns in the data, and make predictions about the data.
Object detection is a technique used in the field of computer vision, which is concerned with getting a computer to analyze an image and distinguish the image’s content in a manner similar to humans. Humans look at an image and recognize individual objects and regions within that image. Without object detection, computers would not be able to carry out the incredibly complex tasks that relate to computer vision. They would not be able to recognize people, animals, or other objects in an image, and therefore we wouldn't be able to do things like create autonomous vehicles, carry out facial recognition, or have robots interact with objects.
There are three steps to object detection in computer vision: pattern recognition, feature extraction, and classification.
Pattern recognition is the stage where the deep neural network analyzes the entire image. The goal of pattern recognition is to discern relevant patterns within the image and memorize those patterns so that feature extraction can be done.
Feature extraction refers to breaking the general patterns that have been found down into distinct features. The larger patterns and shapes that exist within the image are broken up into smaller regions and features, and the network will focus on the features or patterns it believes are important, ignoring other parts of the image.
The final step is classification. After the notable features of an image have been extracted, the network joins these features together into the shape of an object, or of multiple objects. Once this representation of an object has been created, the amalgamation of edges and shapes is compared against the network's knowledge of objects. As an example, if the network is provided with an image of a cat, the features of the cat, like the shape of the head, the whiskers, the eyes, and the fur, will be used to recognize the subject as a cat by comparing them with what the network has learned from other images of cats.
When doing object detection with deep learning techniques, one of two different approaches is typically used. Either a custom object detection system is created from scratch, or a pre-trained object detector is used and simply tweaked slightly for the user's needs.
The advantage of creating a custom object detection network is that it is typically more accurate than a pre-trained object detector on the task it was built for, resulting in better detection and classification. The drawback to this approach is that a very large labeled data set is necessary to train the custom network, and both labeling the data and manually selecting parameters for the deep neural network can be very time-intensive.
In contrast to creating a custom object detector, a pre-trained deep neural network can be used, which is an instance of transfer learning. Pre-trained networks come with a considerable advantage: the architecture and most of the weights have already been selected and trained. This dramatically reduces the time needed to build an object detection system, but performance can be weaker when compared to a fully customized deep neural network.
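The idea of reusing trained weights and tweaking only a small part of the network can be sketched as follows. This is a toy illustration under stated assumptions: `FROZEN_WEIGHTS` stands in for weights that would normally be loaded from a published model, the "head" is a single sigmoid neuron, and the data set is invented. Only the head is trained, while the frozen part is never updated.

```python
import math

# Stand-in for pre-trained weights (hypothetical values, not a real model).
FROZEN_WEIGHTS = [[1.0, -1.0], [-1.0, 1.0]]

def backbone(x):
    """Frozen feature extractor: its weights are never updated."""
    return [max(0.0, sum(xi * w for xi, w in zip(x, row)))
            for row in FROZEN_WEIGHTS]

def head(features, weights, bias):
    """Small trainable classifier placed on top of the frozen features."""
    z = sum(f * w for f, w in zip(features, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(data, weights, bias):
    """Average binary cross-entropy of the head over the data set."""
    total = 0.0
    for x, y in data:
        p = head(backbone(x), weights, bias)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(data)

def fine_tune(data, steps=200, lr=0.1):
    """Gradient descent on the head only; the backbone stays untouched."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for x, y in data:
            feats = backbone(x)
            err = head(feats, weights, bias) - y
            grad_w = [g + err * f for g, f in zip(grad_w, feats)]
            grad_b += err
        weights = [w - lr * g / len(data) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(data)
    return weights, bias

# Tiny invented data set: (input, label) pairs.
data = [([2.0, 0.0], 1), ([1.5, 0.5], 1), ([0.0, 2.0], 0), ([0.5, 1.5], 0)]
weights, bias = fine_tune(data)
print(log_loss(data, [0.0, 0.0], 0.0), "->", log_loss(data, weights, bias))
```

Because only two head weights and one bias are learned, training needs far less data and time than learning every weight from scratch, which is the trade-off described above.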
There are two different types of networks used for object detection: two-stage networks and single-stage networks.
R-CNN and its variants are two-stage object detection networks. Two-stage networks operate by first generating region proposals, sections of the image that could potentially contain an object. This is the first stage of the network; the second stage is responsible for classifying the objects found within the proposed regions. The advantage of two-stage networks is that they are typically more accurate than single-stage networks; however, they are also noticeably slower.
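The propose-then-classify control flow can be shown with a deliberately simplified toy. This is not how R-CNN is implemented: the brightness threshold in `propose_regions` and the rule in `classify` are hypothetical stand-ins for the learned components of a real detector, and the image is a small invented grid.

```python
def propose_regions(image, window=2, min_mean=0.5):
    """Stage 1: keep windows bright enough that they might hold an object."""
    proposals = []
    for top in range(0, len(image) - window + 1, window):
        for left in range(0, len(image[0]) - window + 1, window):
            patch = [row[left:left + window] for row in image[top:top + window]]
            mean = sum(map(sum, patch)) / (window * window)
            if mean >= min_mean:
                proposals.append((top, left, patch))
    return proposals

def classify(patch):
    """Stage 2: label a proposed region (a toy rule stands in for a CNN)."""
    flat = [value for row in patch for value in row]
    return "solid" if min(flat) == 1 else "partial"

image = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 1, 1],
]
detections = [(top, left, classify(patch))
              for top, left, patch in propose_regions(image)]
print(detections)
```

The second stage runs once per proposal, which hints at why two-stage detectors tend to be slower: the work grows with the number of proposed regions.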
In contrast, single-stage networks make their predictions for the entire image in a single pass, rather than first splitting the image up into discrete sub-regions. Single-stage networks tend to utilize versions of the You Only Look Once (YOLO) algorithm, and many of them rely on tools called anchor boxes. Anchor boxes are predefined boxes of fixed sizes and aspect ratios placed across the image; the network predicts bounding boxes relative to these anchors, and during training the predicted boxes are compared with the ground-truth locations of the objects. The advantage of single-stage networks is that they are typically much faster than two-stage networks, but they are often less accurate and can have trouble detecting small objects in particular.
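To make the comparison between boxes concrete, here is a minimal sketch that assumes anchor boxes are predefined reference boxes and that overlap is measured with intersection over union (IoU), a common choice that the text above does not name. The anchor shapes, the box coordinates, and the helper names are invented for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union for boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def make_anchors(center_x, center_y, shapes):
    """Build anchor boxes of several (width, height) shapes around one point."""
    return [(center_x - w / 2, center_y - h / 2, center_x + w / 2, center_y + h / 2)
            for w, h in shapes]

# Three hypothetical anchors at one location: square, wide, and tall.
anchors = make_anchors(50, 50, [(40, 40), (80, 40), (40, 80)])
ground_truth = (32, 12, 68, 88)  # a tall object, invented for the example

overlaps = [iou(anchor, ground_truth) for anchor in anchors]
best = max(range(len(anchors)), key=overlaps.__getitem__)
print(best, overlaps)
```

For this tall object, the tall anchor (index 2) overlaps the ground-truth box the most, so it would be the natural reference for predicting that object's bounding box.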
It’s also possible to do object detection using standard machine learning algorithms instead of deep neural networks. Some of the most commonly used machine learning approaches to object detection include Aggregate Channel Features, the Viola-Jones algorithm (primarily used for detecting people, particularly faces and upper bodies), and support vector machine (SVM) classifiers trained on histogram of oriented gradients (HOG) features.
Much like with deep learning object detection, you'll begin by using one of two methods: creating a customized object detector or using a pre-trained object detector. However, while deep learning object detection frameworks select features automatically, with a plain machine learning object detection framework you'll need to select the identifying features yourself.
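As an example of what selecting a feature by hand can look like, the sketch below computes a simplified histogram of gradient orientations for a tiny grayscale image. It is loosely inspired by histogram of oriented gradients features but omits the cells, blocks, and normalization of the full method; the function name, bin layout, and test images are assumptions made for illustration.

```python
import math

def orientation_histogram(image, bins=4):
    """Histogram of unsigned gradient orientations, weighted by gradient magnitude."""
    bin_width = 180.0 / bins
    hist = [0.0] * bins
    for y in range(1, len(image) - 1):
        for x in range(1, len(image[0]) - 1):
            gx = image[y][x + 1] - image[y][x - 1]  # horizontal change
            gy = image[y + 1][x] - image[y - 1][x]  # vertical change
            magnitude = math.hypot(gx, gy)
            if magnitude == 0:
                continue
            angle = math.degrees(math.atan2(gy, gx))
            # Shift by half a bin so bins are centered on 0, 45, 90, 135 degrees.
            shifted = (angle + bin_width / 2) % 180.0
            index = min(int(shifted / bin_width), bins - 1)
            hist[index] += magnitude
    return hist

# Invented 4x4 images: brightness changes left-to-right, then top-to-bottom.
vertical_edge = [[0, 0, 1, 1] for _ in range(4)]
horizontal_edge = [[0, 0, 0, 0], [0, 0, 0, 0], [1, 1, 1, 1], [1, 1, 1, 1]]

print(orientation_histogram(vertical_edge))
print(orientation_histogram(horizontal_edge))
```

The two images produce histograms that peak in different bins, and a fixed-length vector like this is the kind of input a classical classifier such as an SVM could then be trained on.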
Deep learning object detection systems typically perform better than regular machine learning approaches. However, if processing power is a substantial limiting factor and if you don't have a lot of labeled training images to work with, you may wish to use a machine learning approach instead.