Image Annotation For Deep Learning And Computer Vision

Making Sense of Image Annotation: Deep Learning Architectures and Annotation Platforms

Many of the most powerful and transformative computer vision applications would not be possible without deep learning and image annotation. Deep learning gives computers the ability to recognize images and objects, while image annotation supplies the labeled examples a network learns from, which improves its predictive power. Both are powerful tools for computer vision, but it can be hard to know where to start with them.

How does deep learning work exactly? What are the most popular deep learning architectures for computer vision? What image annotation tools work the best?

To make successful use of deep learning and image annotation techniques, you’ll need to know the answers to these questions.

What Is Deep Learning?

Before we cover image annotation for deep learning, let’s make sure we’re clear on what deep learning is. Deep learning is a subdiscipline of machine learning, which can be described as a set of techniques that enable computers to carry out tasks without being explicitly programmed to do so. Machine learning systems often use neural networks, which analyze data and extract relevant patterns from it.

Simple neural networks are divided into three components: an input layer, a hidden layer, and an output layer. The input layer takes the data into the network, the hidden layer applies mathematical functions to transform the data, and the output layer returns the transformed data for analysis. More complex neural networks have additional hidden layers in the middle, and each hidden layer can be thought of as a small network in its own right.
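The three-part structure described above can be sketched in a few lines of plain Python. This is only an illustration: the weights, the layer sizes (two inputs, three hidden units, one output), and the choice of a sigmoid function are assumptions made for the example, not values from any trained network.

```python
import math

def dense(inputs, weights, biases):
    """One layer: each output is a weighted sum of the inputs plus a bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def sigmoid(values):
    """Squash each value into the range (0, 1)."""
    return [1.0 / (1.0 + math.exp(-v)) for v in values]

# Hypothetical weights: 2 inputs -> 3 hidden units -> 1 output.
HIDDEN_W = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
HIDDEN_B = [0.0, 0.1, -0.1]
OUTPUT_W = [[0.7, -0.5, 0.2]]
OUTPUT_B = [0.05]

def forward(inputs):
    """Input layer -> hidden layer -> output layer."""
    hidden = sigmoid(dense(inputs, HIDDEN_W, HIDDEN_B))  # hidden layer transforms the data
    return sigmoid(dense(hidden, OUTPUT_W, OUTPUT_B))    # output layer returns the result

prediction = forward([1.0, 2.0])
```

Adding more hidden layers would simply mean chaining more `dense` calls between the input and the output.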

When many of these small networks are stacked as layers, the result is a deep neural network. As neural networks get deeper, or gain more layers, they can distinguish more complex patterns. Deep learning systems can be built with various architectures, and different architectures have different specialties.

Some of the different deep learning architectures are: Recurrent Neural Networks, Long Short-Term Memory networks, and Convolutional Neural Networks. 

Recurrent Neural Networks (RNNs) are called “recurrent” because they contain loops: information taken in by the network at one step is passed on to the next step, as if to a successive copy of the same network. This chain-like structure makes RNNs well suited to processing lists and sequences with chronological ordering.
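The loop can be illustrated with a minimal sketch of a recurrent cell that keeps a single number as its state. The weights `w_x`, `w_h`, and `b` are hypothetical values chosen for the example, and a real RNN would use vectors and learned weight matrices instead of scalars.

```python
import math

def rnn_step(x, h_prev, w_x, w_h, b):
    """One step: the new state mixes the current input with the previous state."""
    return math.tanh(w_x * x + w_h * h_prev + b)

def run_rnn(sequence, w_x=0.5, w_h=0.8, b=0.0):
    """Apply the same cell to every element, carrying the state forward."""
    h = 0.0
    states = []
    for x in sequence:
        h = rnn_step(x, h, w_x, w_h, b)
        states.append(h)
    return states

# Only the first input is non-zero, yet later states are still non-zero:
# the earlier input is remembered through the recurrent connection.
states = run_rnn([1.0, 0.0, 0.0])
```

The state fades a little at each step, which hints at why plain RNNs struggle to remember information over long sequences.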

RNNs are useful for learning tasks where previous events or occurrences must be accounted for. Long Short-Term Memory (LSTM) networks are a special kind of RNN that adds a gated memory cell, which lets the network retain information over longer sequences than a regular RNN can. Both regular RNNs and LSTMs are used for tasks like language modeling and speech recognition.

Convolutional Neural Networks (CNNs) are architectures designed to process grid-like, two-dimensional data, which makes them useful for interpreting image data.

How Are CNNs Used In Computer Vision?

Convolutional neural networks are the most commonly used deep learning architectures in computer vision, and so most image annotation for deep learning will be done for a CNN. 

CNNs are used to classify images into a set of predefined categories, or classes. Traditional fully connected deep neural networks take a long time to process images because of the massive amount of data images contain, so CNNs have several components intended to reduce the complexity of the image data before passing it into the fully connected layers of the network.

Convolutional neural networks have four major components: 

  • Convolutional layers

  • Nonlinearities

  • Pooling layers

  • Fully connected layers 

Convolutional layers are responsible for creating representations of the image being fed into the network. A small filter is passed over the image, and at each position the filter’s values are multiplied with the pixel values beneath it and summed. The resulting values, derived from the combination of the filter matrix and the image values, represent the image and are passed onward.
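As a rough illustration of a filter sliding over an image, here is a plain-Python sketch. The 4 x 4 image and the 2 x 2 vertical-edge filter are made-up values, and the function assumes no padding and a stride of 1. Like most deep learning libraries, it does not flip the filter, so strictly speaking it computes a cross-correlation.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and
    sum the element-wise products at each position."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            for di in range(k_h):
                for dj in range(k_w):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output

# Hypothetical image: dark on the left, bright on the right.
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]
# Hypothetical filter that responds to dark-to-bright vertical edges.
vertical_edge = [
    [-1, 1],
    [-1, 1],
]

feature_map = convolve2d(image, vertical_edge)
# feature_map == [[0, 18, 0], [0, 18, 0], [0, 18, 0]]
```

The filter produces a large value only in the middle column, where the dark and bright regions meet. In a trained CNN the filter values are learned rather than hand-written.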

The representation of the image created by the convolutional layers is linear, but images are nonlinear (the relationship between inputs and outputs is not proportional), so nonlinear layers are added to the network. The nonlinear components enable the network to find relevant patterns within the representation of the image. Common nonlinear activation functions include the Rectified Linear Unit (ReLU) and Tanh.
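Both activation functions are simple to write down. The sketch below applies them to a handful of made-up values and includes a small check of what "nonlinear" means in practice: the output for a sum is not the sum of the outputs.

```python
import math

def relu(x):
    """Rectified Linear Unit: negative values become zero, positive values pass through."""
    return max(0.0, x)

# Hypothetical values coming out of a convolutional layer.
feature_values = [-2.0, -0.5, 0.0, 1.5, 3.0]

relu_out = [relu(v) for v in feature_values]       # [0.0, 0.0, 0.0, 1.5, 3.0]
tanh_out = [math.tanh(v) for v in feature_values]  # every value squashed into (-1, 1)

# A linear function f would satisfy f(a + b) == f(a) + f(b). ReLU does not:
combined = relu(-2.0 + 3.0)        # 1.0
separate = relu(-2.0) + relu(3.0)  # 3.0
```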

The function of the pooling layers is to simplify the data matrix, or make the network’s representation of the image simpler. The pooling layers work by analyzing different regions of the image and downsampling them, which reduces their size but preserves the most important information in them. There are different types of pooling, such as Average Pooling, which takes the average of a set of values, but the most commonly used type is Max Pooling, which takes the greatest value in the set.
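The two pooling types differ only in how each region is reduced to a single number, as this small sketch shows. The 4 x 4 feature map is invented for the example, and the windows are assumed to be 2 x 2 and non-overlapping.

```python
def pool2d(matrix, size, reduce_fn):
    """Non-overlapping pooling: apply reduce_fn to each size x size window."""
    pooled = []
    for i in range(0, len(matrix) - size + 1, size):
        row = []
        for j in range(0, len(matrix[0]) - size + 1, size):
            window = [matrix[i + di][j + dj]
                      for di in range(size) for dj in range(size)]
            row.append(reduce_fn(window))
        pooled.append(row)
    return pooled

def average(values):
    return sum(values) / len(values)

# Hypothetical 4 x 4 feature map.
feature_map = [
    [1, 3, 2, 0],
    [5, 7, 4, 6],
    [0, 2, 8, 4],
    [2, 0, 4, 0],
]

max_pooled = pool2d(feature_map, 2, max)      # [[7, 6], [2, 8]]
avg_pooled = pool2d(feature_map, 2, average)  # [[4.0, 3.0], [1.0, 4.0]]
```

Either way, the 4 x 4 input shrinks to 2 x 2, which is a quarter of the data for the later layers to process.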

In the fully connected layers, the data is flattened and the features extracted by the earlier layers are analyzed for relevant patterns. The fully connected layers combine these features and apply an activation function to classify the image.
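A minimal sketch of this last stage, assuming a single fully connected layer followed by a softmax activation, might look like the following. The pooled feature map, the weights, and the "cat" and "dog" labels are all hypothetical.

```python
import math

def flatten(feature_map):
    """Turn a 2D feature map into a flat list of features."""
    return [v for row in feature_map for v in row]

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def classify(feature_map, weights, biases, labels):
    """Flatten, apply one fully connected layer, then pick the most probable label."""
    features = flatten(feature_map)
    scores = [sum(w * f for w, f in zip(row, features)) + b
              for row, b in zip(weights, biases)]
    probabilities = softmax(scores)
    best = max(range(len(labels)), key=lambda k: probabilities[k])
    return labels[best], probabilities

# Hypothetical pooled features and weights (one row of weights per class).
pooled = [[7, 6], [2, 8]]
WEIGHTS = [[0.1, 0.1, 0.0, 0.2], [0.0, -0.1, 0.3, 0.0]]
BIASES = [0.0, 0.5]

label, probabilities = classify(pooled, WEIGHTS, BIASES, ["cat", "dog"])
```

With these made-up weights, the first class receives the higher score, so `label` is `"cat"`.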

Deep Learning Computer Vision Architectures

Before doing image annotation for deep learning, you’ll want to select an appropriate deep learning architecture. Some of the most commonly used deep learning architectures for computer vision include: AlexNet, GoogLeNet/Inception, VGGNet, and ResNet. 


AlexNet

AlexNet is a deep learning architecture comprised of five convolutional layers and three dense (fully connected) layers. A ReLU activation is applied after every convolutional layer and after every fully connected layer except the last, which feeds the classifier.

Overlapping Max Pooling layers follow the first, second, and fifth convolutional layers. The output of the convolutional layers goes into two fully connected layers, and a final fully connected layer feeds a softmax classifier with 1000 class labels.

Dropout is used only with the first and second fully connected layers. There are around 62.3 million parameters in the architecture, and a forward pass requires around 1.1 billion computations.
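The parameter count can be reproduced with some simple arithmetic. The layer shapes below are assumptions taken from the commonly cited single-tower AlexNet configuration (they are not listed in this article): a 3-channel input, filters of size 11, 5, 3, 3, and 3, and a 6 x 6 x 256 feature map entering the first fully connected layer.

```python
def conv_params(kernel, in_channels, out_channels):
    """Weights plus one bias per output channel."""
    return kernel * kernel * in_channels * out_channels + out_channels

def dense_params(in_features, out_features):
    """Weights plus one bias per output unit."""
    return in_features * out_features + out_features

# Assumed layer shapes: (kernel size, input channels, output channels).
CONV_LAYERS = [(11, 3, 96), (5, 96, 256), (3, 256, 384), (3, 384, 384), (3, 384, 256)]
# Assumed layer shapes: (input features, output features).
DENSE_LAYERS = [(6 * 6 * 256, 4096), (4096, 4096), (4096, 1000)]

conv_total = sum(conv_params(*layer) for layer in CONV_LAYERS)     # 3,747,200
dense_total = sum(dense_params(*layer) for layer in DENSE_LAYERS)  # 58,631,144
total_params = conv_total + dense_total                            # 62,378,344
```

Under these assumptions the total is in line with the figure of around 62.3 million, and the fully connected layers account for the vast majority of it.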


GoogLeNet/Inception

GoogLeNet is an architecture that makes use of Inception modules. An Inception module applies filters of several sizes at the same level, which makes the network wider instead of deeper. Once the outputs of the filters are calculated, they are concatenated and sent to the next Inception module in the architecture.

Within an Inception module, 1 x 1 convolutional layers are placed in front of the larger 3 x 3 and 5 x 5 convolutions and after the pooling branch, which helps reduce the number of channels these more expensive operations have to handle. In the GoogLeNet architecture, nine of these Inception modules are stacked linearly, which makes for 22 layers total, or 27 layers if the pooling layers are included. At the end of the network, Global Average Pooling is used in place of large fully connected layers.
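A quick calculation shows why the 1 x 1 convolution is worth having. The channel counts below (192 input channels, reduced to 16, then expanded to 32 by a 5 x 5 convolution) are assumptions chosen for illustration, and biases are ignored to keep the arithmetic simple.

```python
def conv_weights(kernel, in_channels, out_channels):
    """Number of weights in a convolutional layer (biases ignored)."""
    return kernel * kernel * in_channels * out_channels

IN_CHANNELS = 192  # hypothetical number of channels entering the module
REDUCED = 16       # hypothetical number of channels after the 1 x 1 reduction
OUT_CHANNELS = 32  # hypothetical number of 5 x 5 filters

# A 5 x 5 convolution applied directly to all input channels.
direct = conv_weights(5, IN_CHANNELS, OUT_CHANNELS)  # 153,600

# A 1 x 1 convolution first, then the 5 x 5 convolution on fewer channels.
reduced = (conv_weights(1, IN_CHANNELS, REDUCED)
           + conv_weights(5, REDUCED, OUT_CHANNELS))  # 3,072 + 12,800 = 15,872
```

With these numbers, the reduced path needs roughly a tenth of the weights of the direct path.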


VGGNet

VGGNet achieved very high accuracy on the ImageNet dataset, owing to its deep architecture. There are multiple variations of the VGG architecture, but VGG16 is commonly used. In VGG16 there are 16 weight layers, thirteen of which are convolutional layers with 3 x 3 filters. The padding and stride of the convolutional layers are set to 1 pixel, and the convolutional layers are divided into five groups, each followed by a max-pooling layer. The VGGNet series performs very well at extracting features from images.
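The layer arithmetic for VGG16 can be checked with a short sketch. The channel widths per group, the 224 x 224 input size, the 2 x 2 pooling with stride 2, and the three fully connected layers are assumptions based on the commonly cited VGG16 configuration rather than details given in this article.

```python
def conv_output_size(size, kernel=3, padding=1, stride=1):
    """Spatial size after a convolution."""
    return (size + 2 * padding - kernel) // stride + 1

# Assumed output channels of each convolutional layer, grouped as in VGG16.
VGG16_CONV_GROUPS = [
    [64, 64],
    [128, 128],
    [256, 256, 256],
    [512, 512, 512],
    [512, 512, 512],
]
FULLY_CONNECTED_LAYERS = 3  # assumed

size = 224  # assumed input width and height
conv_count = 0
for group in VGG16_CONV_GROUPS:
    for _ in group:
        size = conv_output_size(size)  # 3 x 3 filter, padding 1, stride 1: size is unchanged
        conv_count += 1
    size //= 2                         # the max-pooling layer after each group halves the size

weight_layers = conv_count + FULLY_CONNECTED_LAYERS
# conv_count == 13, weight_layers == 16, size == 7
```

Because a 3 x 3 filter with a padding and stride of 1 leaves the width and height unchanged, only the five pooling layers shrink the image.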


ResNet

ResNets are an architecture designed to deal with the vanishing gradient problem, where gradients shrink toward zero as they are propagated back through many layers, and the network ceases learning. ResNets deal with the vanishing gradient problem by using “skip connections”, also referred to as shortcut connections, which let a block’s input bypass its layers and be added to its output.

ResNets are made out of several “blocks” joined together. As the ResNet gets deeper, the number of operations in a block increases, but the number of layers within a block remains constant. Every operation consists of a convolution, a batch normalization, and a ReLU activation. The convolutional layers in the network use fixed 3 x 3 filters, and there are only four feature map dimensions: 64, 128, 256, and 512. The input is bypassed every two convolutions, and the width and height of the feature maps remain constant across the entire layer.
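The idea of bypassing the input every two operations can be sketched as follows. To keep the example short and runnable without any libraries, plain matrix multiplications stand in for the convolutions and batch normalization is left out, so this is a simplification rather than a real ResNet block. The weights are hypothetical.

```python
def relu(values):
    return [max(0.0, v) for v in values]

def linear(values, weights):
    """Stand-in for a convolution: a plain weighted sum per output."""
    return [sum(w * v for w, v in zip(row, values)) for row in weights]

def residual_block(x, w1, w2):
    """Two layers, with the block's input added back before the final activation."""
    out = relu(linear(x, w1))
    out = linear(out, w2)
    skipped = [o + xi for o, xi in zip(out, x)]  # the skip connection
    return relu(skipped)

# Even if the two layers contribute nothing (all-zero weights),
# the input still reaches the output through the skip connection.
ZERO_WEIGHTS = [[0.0, 0.0], [0.0, 0.0]]
x = [1.0, 2.0]
passthrough = residual_block(x, ZERO_WEIGHTS, ZERO_WEIGHTS)  # [1.0, 2.0]
```

Because the input always has a direct path to the output, information (and, during training, the gradient) is not forced to pass through every layer.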

These architectures have different specialties and optimal use cases. For semantic segmentation in particular, a VGG or ResNet architecture is typically used as the feature extractor, or the encoder of the network.

Image Annotation Platforms

There are a variety of image annotation platforms you can use to prepare your data for deep learning and computer vision. Some of the most notable image annotation platforms are: LabelImg, LabelBox, VGG Image Annotator, and RectLabel.


LabelImg

LabelImg is open source and comes with Windows binaries, making it easy to install and set up. While LabelImg is easy to use, it only supports the creation of bounding boxes and no other forms of annotation.


LabelBox

LabelBox is web-based and allows you to import the predictions from your model so that you can compare your human-generated annotations with model-generated predictions. It has many different tools to help you annotate your images, such as tools to create polygons, lines, and points, as well as a semantic segmentation brush.

VGG Image Annotator

VGG Image Annotator is also open source, and while it doesn’t offer much in terms of project management options, it can be used right from your browser. VGG Image Annotator supports video annotation as well as image annotation, and it has support for polygons, lines, and circles in addition to bounding boxes.


RectLabel

RectLabel runs on macOS and allows you to create not only bounding boxes but also segmented regions. It can use Core ML models to automatically create labels for certain images. RectLabel lets you export your annotations in CSV, JSON, KITTI, and YOLO formats.
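To give a sense of what an export format involves, here is a minimal sketch that converts a bounding box from pixel coordinates into a line in the YOLO text format, where each line holds a class index followed by the box center, width, and height, all normalized by the image size. The function name, the box coordinates, and the image size are hypothetical, and this is not code from any of the tools above.

```python
def to_yolo(box, image_width, image_height, class_id):
    """Convert a pixel box (x_min, y_min, x_max, y_max) to a YOLO label line."""
    x_min, y_min, x_max, y_max = box
    x_center = (x_min + x_max) / 2 / image_width
    y_center = (y_min + y_max) / 2 / image_height
    width = (x_max - x_min) / image_width
    height = (y_max - y_min) / image_height
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"

# Hypothetical annotation: one box of class 0 on a 640 x 480 image.
line = to_yolo((100, 200, 300, 400), image_width=640, image_height=480, class_id=0)
# line == "0 0.312500 0.625000 0.312500 0.416667"
```

Other formats store the same box differently, so it is worth checking which format your training code expects before you start annotating.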

Whenever you are creating a computer vision application that relies on annotated data, be sure that you use the proper image annotation tools and techniques. In addition, be sure that the architecture you choose suits your computer vision task: different architectures transform data differently, and this affects the performance of your application.