Image annotation is a complicated and time-consuming task. For this reason, automatic image annotation techniques have been developed to reduce the amount of time needed to annotate images. Automatic image annotation techniques have their uses, but they also have their limitations, such as typically being less accurate than human-made annotations. A closer examination of automatic image annotation techniques, along with their advantages and disadvantages, will help clarify the appropriate use cases for automatic image annotation.
Before we get into how automatic image annotation is done, let’s make sure that we understand what image annotation is. Image annotation is the process of adding metadata to an image, and this metadata assists a deep neural network classifier in interpreting the image and learning from it.
The metadata added to the image includes details about where the network should look for an object as well as the class/label of the object, which lets the classifier determine what the object is. Image annotation is done to enhance the performance of an image recognition system, with the added data allowing the classifier to better understand the features of an object within an image.
There are several different types of image annotation, the most common of which are:

- Bounding boxes are the simplest type of annotation: boxes drawn around objects that indicate where the classifier should look for each object.
- Semantic segmentation assigns a label not just to an object in an image, but to every pixel that comprises a certain semantic region of the image. For instance, every pixel that makes up grass in an image is labeled as grass.
- Instance segmentation follows the same principle as semantic segmentation, but assigns a distinct label to each individual instance of an object, rather than grouping all objects of the same class under a single label.
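To make the three annotation types concrete, here is a sketch of what each one looks like as data. The field names and values are purely illustrative (loosely inspired by common formats like COCO, but not a real schema):

```python
# Bounding box: a class label plus a rectangle (x, y, width, height) in pixels.
# (Hypothetical example values, not from a real dataset.)
bbox_annotation = {
    "label": "dog",
    "bbox": [48, 240, 130, 95],  # x, y, width, height
}

# Semantic segmentation: one class id per pixel. Here a tiny 4x4 mask
# where 0 = background and 1 = grass.
semantic_mask = [
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
]

# Instance segmentation: the same pixel-wise idea, but each object
# instance gets its own id (here two separate objects, ids 1 and 2).
instance_mask = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 2],
    [0, 0, 2, 2],
]

# Count how many distinct instances the instance mask contains.
instance_ids = {v for row in instance_mask for v in row if v != 0}
print(len(instance_ids))  # 2
```

Note the key difference in the last two masks: the semantic mask would label both objects with the same class id, while the instance mask tells them apart.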
As previously discussed, the biggest reason that automatic image annotation techniques are employed is that manual image annotation is complex and time-consuming. Image annotation is complex because of all the variables that must be taken into account.
When annotating images, the correct labels must be assigned to the correct objects, and if any of the objects are mislabeled the classifier will be negatively impacted. It can be easy to mislabel objects when a dataset contains many visually similar yet distinct objects. For instance, when annotating fashion images, there may be items like jackets that look very similar yet have different pockets and are classified as different objects.
Annotations must also be of high quality, meaning they should cover only the parts of the image that are relevant to the object being classified. If pixels are annotated as belonging to an object when they don’t, the performance of the classifier will suffer.
Bounding boxes, in particular, are easy to size incorrectly. If a bounding box is too loose, substantial portions of the image that don’t belong to the target object have been included in the box. In contrast, a bounding box is too tight if portions of the target object have been left outside it. The bounding box must be just the right size in order to ensure optimal classifier performance.
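A common way to quantify how well a box fits is intersection-over-union (IoU): the overlap between two boxes divided by their combined area, where 1.0 is a perfect fit. Below is a minimal sketch of this metric (an illustrative helper with made-up box coordinates, not code from any particular library):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap rectangle (may be empty, in which case inter is 0).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union else 0.0

ground_truth = (10, 10, 50, 50)
loose_box = (0, 0, 60, 60)    # too loose: extra background included
tight_box = (15, 15, 45, 45)  # too tight: part of the object cut off

print(round(iou(ground_truth, loose_box), 3))  # 0.444
print(round(iou(ground_truth, tight_box), 3))  # 0.562
```

Both the loose and the tight box score well below 1.0, which is exactly the signal a quality check on annotations would use to flag them for review.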
Every annotation must be accurate and complete, meaning that the complexity of an image annotation task grows in proportion to the size of the dataset. Annotating images with both accuracy and speed can be difficult, and for this reason automatic image annotation systems are developed to ease the strain on human annotators.
Automatic image annotation is typically done using algorithms that can distinguish semantic content. More specifically, the goal of automatic image annotation is to interpret the content of an image and compare it to images found within a database. When a match is found with an image in the database, that image’s label is applied to the content of the target image. To put it another way, semantic content information is mapped to visual content information automatically.
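The matching step described above can be sketched as a nearest-neighbor lookup over feature vectors: extract features from the target image, find the most similar entry in the labeled database, and borrow its label. Everything below is a toy illustration; the three-number "feature vectors" stand in for the much richer descriptors a real system would extract:

```python
import math

# Hypothetical labeled database: (label, feature vector) pairs.
database = [
    ("cat",  [0.9, 0.1, 0.3]),
    ("dog",  [0.2, 0.8, 0.5]),
    ("bird", [0.1, 0.2, 0.9]),
]

def cosine_similarity(a, b):
    # Similarity of two feature vectors: 1.0 means identical direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def annotate(target_features):
    # Transfer the label of the most similar database entry to the target.
    best = max(database, key=lambda item: cosine_similarity(item[1], target_features))
    return best[0]

print(annotate([0.85, 0.15, 0.25]))  # closest to the "cat" entry
```

This also makes the failure mode visible: if two classes have very similar feature vectors, the nearest match can easily be the wrong one, which is why visually similar objects trip up automatic annotation.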
The algorithms used to distinguish features in a target image are the same as those used by most image recognition systems, Convolutional Neural Networks (CNNs). CNNs excel at interpreting image data, thanks to the convolutional sections of the network. These convolutional layers/regions interpret the pixel values within an image, extracting features relevant to the recognition of the image as they do so.
The convolutional layers are what actually extract the features from the image, while other functions called Max Pooling functions simplify the representations of the image. It’s important that representations of an image analyzed by a neural network are as simple as possible, because processing time and expense scales with the complexity of the image.
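The feature-extraction step can be sketched as follows: a small kernel slides over a grid of pixel values, and at each position the element-wise products are summed. This is a single convolutional filter with no padding or stride tricks, and both the image and the kernel values are invented for illustration:

```python
# Tiny 4x5 "image": a dark region (0s) meeting a bright region (1s).
image = [
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
]

# A simple vertical-edge kernel: it responds strongly where dark meets light.
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

def convolve2d(img, ker):
    # Slide the kernel over every valid position and sum the products.
    kh, kw = len(ker), len(ker[0])
    out = []
    for i in range(len(img) - kh + 1):
        row = []
        for j in range(len(img[0]) - kw + 1):
            row.append(sum(
                img[i + di][j + dj] * ker[di][dj]
                for di in range(kh)
                for dj in range(kw)
            ))
        out.append(row)
    return out

feature_map = convolve2d(image, kernel)
print(feature_map)  # [[0, 3, 3], [0, 3, 3]]
```

The nonzero responses appear only where the kernel's window spans the dark-to-light boundary; this is the sense in which a convolutional layer "extracts" an edge feature from raw pixel values.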
Max Pooling functions are capable of preserving just the features of the image that are relevant to the image’s recognition and classification, abstracting away other parts of the image in a process known as downsampling. This helps the convolutional network maintain efficiency when analyzing large numbers of images.
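Downsampling via max pooling can be shown with a minimal sketch: a 2x2 max pool keeps only the strongest activation in each 2x2 block, halving both dimensions of the feature map. The values below are illustrative:

```python
# Hypothetical 4x4 feature map (e.g. the output of a convolutional layer).
feature_map = [
    [1, 3, 2, 0],
    [4, 6, 1, 2],
    [0, 1, 8, 5],
    [2, 2, 3, 7],
]

def max_pool_2x2(fmap):
    # Keep the maximum of each non-overlapping 2x2 block.
    pooled = []
    for i in range(0, len(fmap), 2):
        row = []
        for j in range(0, len(fmap[0]), 2):
            row.append(max(
                fmap[i][j], fmap[i][j + 1],
                fmap[i + 1][j], fmap[i + 1][j + 1],
            ))
        pooled.append(row)
    return pooled

print(max_pool_2x2(feature_map))  # [[6, 2], [2, 8]]
```

The 4x4 map shrinks to 2x2 while the strongest activations (6 and 8) survive, which is why pooling reduces processing cost without discarding the features that matter most for recognition.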
The main limitation of automatic image annotation systems is that the annotations they produce lack the quality of those produced by human annotators. Humans can easily distinguish between objects that share many similar features, whereas automatic annotation algorithms may confuse these similar objects. Automatic image annotation systems are also more likely to incorrectly place bounding boxes or give pixels the wrong label when doing semantic segmentation.
Although automatic image annotation algorithms are often less reliable than human annotators, these algorithms still annotate images faster than humans do. To combine the strengths of human annotators and computer annotation algorithms, semi-automatic annotation systems have been created. Semi-automatic systems help human annotators create annotations more quickly, while the annotators can still use their human intuition to make sure the annotations are accurate. Semi-automatic image annotation is usually faster than purely manual annotation and higher quality than fully automatic annotation.