Major Challenges In Image/Video Annotation

Understanding the Biggest Challenges in Image and Video Annotation

Image and video annotation techniques, and computer vision in general, have progressed rapidly over the past five years and are now applied across many different fields. However, there are still many challenges to deal with when annotating images and videos. What are the major challenges in image and video annotation?

Photo: Hughesperreault via Wikimedia Commons, CC BY-SA 4.0

Major Challenges In Image Annotation

When it comes to image annotation, the biggest challenge is also a conceptually simple one: time. Annotating images is a complex, time-consuming task. Images have to be annotated with precision and careful attention to detail, and the time invested grows with the number of images that require annotation. There are also many variables to consider when annotating images, such as edge cases and damaged images.

After time, the other major challenge for image annotation is annotation quality. The annotations/labels themselves need to be accurate, both in terms of location and classification. Confusion can occur when an image contains many similar-looking objects that belong to different classes. Bounding boxes can also be misaligned, which can confuse the classifier. Misaligned semantic segmentation regions can be even more damaging to a classifier's performance, as the error affects every pixel in the region rather than just the contents of a bounding box. These mistakes require additional time to correct, and correcting them calls for experienced professional annotators. The problems grow and compound as the image database gets larger.
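
To give a concrete sense of what "misaligned" means in practice, here is a minimal sketch of an intersection-over-union (IoU) check that a review step might use to flag suspicious boxes; the box coordinates and the 0.5 threshold are illustrative assumptions, not a fixed standard.

def iou(box_a, box_b):
    """Intersection-over-union of two (x_min, y_min, x_max, y_max) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# Compare a new annotation against a trusted reference box for the same object.
reference = (48, 30, 210, 175)   # invented example coordinates
candidate = (60, 42, 220, 190)
if iou(reference, candidate) < 0.5:   # assumed review threshold
    print("Flag for re-annotation: boxes disagree too much")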

When it comes to semantic segmentation and semantic prediction, capturing the context of a scene proves to be a particular challenge. Regions of an image can be misclassified if the environment or context is misinterpreted. Object recognition for semantic prediction remains challenging because the context of the environment surrounding a particular object needs to be accounted for. Training a semantic segmentation network also requires extremely accurate image annotation, as each and every pixel must be given a label.
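
As a rough illustration of how demanding per-pixel labelling is, the sketch below compares two hypothetical annotators' masks pixel by pixel; the tiny masks and class IDs are invented purely for the example.

import numpy as np

# Two annotators' label masks for the same small image region (made-up values).
mask_a = np.array([[0, 0, 1], [0, 1, 1], [2, 2, 1]])  # annotator A's labels
mask_b = np.array([[0, 0, 1], [0, 1, 2], [2, 2, 2]])  # annotator B's labels

# Fraction of pixels on which the two annotators agree.
agreement = (mask_a == mask_b).mean()
print(f"Pixel-level agreement: {agreement:.2%}")

# Per-class IoU shows which classes the annotators disagree on most.
for cls in np.unique(np.concatenate([mask_a.ravel(), mask_b.ravel()])):
    inter = np.logical_and(mask_a == cls, mask_b == cls).sum()
    union = np.logical_or(mask_a == cls, mask_b == cls).sum()
    print(f"class {cls}: IoU = {inter / union:.2f}")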

Annotating images for face recognition and analysis also remains challenging. Detection is a different problem from recognition: while simply detecting faces in images has become much easier in recent years, getting a classifier to recognize individual people or specific emotions is still a tough problem. After faces have been annotated and the classifier can detect them, recognition must follow. This involves deeper levels of analysis, like classifying emotions or predicting gender, and correctly annotating these concepts in images of people's faces can be difficult.
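
To illustrate the gap between detection and recognition, here is a minimal detection-only sketch using OpenCV's bundled Haar cascade; the image path is a placeholder, and identity or emotion labels would still have to be annotated by hand before any recognizer could be trained.

import cv2

# Load OpenCV's pre-trained frontal-face Haar cascade.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)
image = cv2.imread("group_photo.jpg")          # placeholder path
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

# Each detection is only a box; identity and emotion labels still have to be
# annotated by a human before a recognizer can be trained on top of this.
for (x, y, w, h) in faces:
    print(f"face at x={x}, y={y}, w={w}, h={h}")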

Major Challenges In Video Annotation

In general, video annotation proves to be more difficult than image annotation because of the sheer length and complexity of video compared to images. For this reason, video is typically converted into a series of smaller clips or GIFs and then annotated.
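
For instance, a common first step is simply splitting a clip into individual frames so they can be annotated like still images; the sketch below does this with OpenCV, and the file name and sampling rate are assumptions.

import os
import cv2

os.makedirs("frames", exist_ok=True)
cap = cv2.VideoCapture("traffic_clip.mp4")   # placeholder video file
frame_index = 0
saved = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if frame_index % 10 == 0:                # keep every 10th frame (assumed rate)
        cv2.imwrite(f"frames/frame_{saved:05d}.jpg", frame)
        saved += 1
    frame_index += 1
cap.release()
print(f"Saved {saved} frames for annotation")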

Objects move across time and space during a video, and as a result, applying the different types of image annotation to video proves more difficult than applying them to images. It is particularly difficult to apply semantic segmentation and instance segmentation to video, as moving objects make it harder to maintain discrete regions than in a still image. For this reason, bounding boxes and other forms of annotation such as line annotation or point annotation are frequently used instead of semantic segmentation when video annotation is required.
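
One widely used shortcut for box annotation in video is keyframe interpolation: annotate a few keyframes by hand and fill in the boxes for the frames in between. The sketch below shows the idea with linear interpolation; the frame numbers and coordinates are invented.

def interpolate_boxes(box_start, box_end, num_frames):
    """Linearly interpolate (x_min, y_min, x_max, y_max) boxes across frames."""
    boxes = []
    for i in range(num_frames):
        t = i / (num_frames - 1)
        boxes.append(tuple(
            round(a + t * (b - a)) for a, b in zip(box_start, box_end)
        ))
    return boxes

# Hand-annotated boxes on keyframes 0 and 30 for the same moving object.
keyframe_0 = (100, 80, 180, 200)
keyframe_30 = (220, 90, 300, 215)
for frame, box in enumerate(interpolate_boxes(keyframe_0, keyframe_30, 31)):
    print(frame, box)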

What About Automated Image Annotation?

Because of the sheer amount of time and effort required to annotate images, autonomous and semi-autonomous image annotation techniques have been developed. These algorithms detect important features in an image and annotate the image automatically.

The advantage of automated image annotation software is that it lets AI developers annotate many more images in a shorter time frame. However, the annotations produced by AI-based annotation tools are frequently less accurate than those produced by humans, and they often require human correction. Automatic image annotation software is frequently employed when the database of images needing annotation is extremely large, as it lets a user run a query for images and have the system annotate those images automatically.

Meanwhile, semi-autonomous annotation tools are intended to enhance the efficiency of human annotators by suggesting tags and bounding boxes for images. Evolutionary and genetic algorithms are sometimes used to create autonomous and semi-automated image annotation software.
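
As a rough sketch of what such suggestions can look like, the example below uses a pre-trained torchvision detector to propose boxes for human review; the image path and the 0.7 confidence cutoff are illustrative assumptions, and a production tool would wire this into its own annotation interface.

import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.io import read_image

# Load a detector pre-trained on COCO to generate candidate annotations.
model = fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = read_image("warehouse_shelf.jpg").float() / 255.0   # placeholder image
with torch.no_grad():
    predictions = model([image])[0]

# Keep only confident suggestions; everything else goes back to a human.
for box, label, score in zip(predictions["boxes"],
                             predictions["labels"],
                             predictions["scores"]):
    if score >= 0.7:
        print(f"suggested box {box.tolist()} (class {label.item()}, "
              f"confidence {score:.2f}), awaiting human review")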

Improving The Quality Of Automatic Image Annotation

When automated image annotation techniques and human annotators work in concert, both the speed and the accuracy of annotation can improve. AI-based annotation systems, in turn, improve as they are trained on annotations created by humans.

Here’s an example of improving an AI-augmented annotation system that uses bounding boxes:

First, unannotated data is fed into a pre-trained system to extract objects. Next, the extracted objects are annotated with bounding boxes by professional human annotators. The annotated data is then passed back into the machine learning algorithm, and after training, the algorithm's object detection should improve. More unannotated data can then be fed to the system and the steps repeated iteratively until the detector's accuracy approaches peak performance. This optimized object detector can then be used to label/annotate objects automatically, and the resulting labels should require relatively few corrections.
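
The control flow of that loop might look something like the sketch below. Every helper function here is a hypothetical placeholder standing in for a team's own tooling, stubbed out so the loop runs end to end, and the accuracy numbers are invented.

def load_unannotated_batch():          # placeholder: fetch new raw images
    return ["img_001.jpg", "img_002.jpg"]

def propose_boxes(detector, batch):    # placeholder: model suggests boxes
    return {img: [(10, 10, 50, 50)] for img in batch}

def human_correct(proposals):          # placeholder: annotators fix the boxes
    return proposals

def retrain(detector, annotations):    # placeholder: one training round
    return {"accuracy": detector["accuracy"] + 0.05}

detector = {"accuracy": 0.70}          # start from a pre-trained model
TARGET = 0.90                          # assumed stopping point

while detector["accuracy"] < TARGET:
    batch = load_unannotated_batch()            # unannotated data in
    proposals = propose_boxes(detector, batch)  # detector proposes annotations
    corrected = human_correct(proposals)        # humans correct them
    detector = retrain(detector, corrected)     # retrain on corrected labels

print(f"Stopping: detector accuracy {detector['accuracy']:.2f} >= {TARGET}")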

Similar interplays between human annotators and AI detection methods could potentially be used to train other forms of automatic image annotation systems, like segmentation, line/spline, or point-based annotators.

Summing Up

The biggest challenge when it comes to image and video annotation is the sheer amount of time and effort needed to properly annotate media for use by computer vision applications. Images are time-consuming to annotate and video even more so. Maintaining quality image annotations while still annotating a large database in a timely fashion is a daunting task, although it is possible to make the process somewhat easier by augmenting human annotators with automated annotation systems. 

Because of the amount of time and resources that have to be invested in solving the major challenges of image/video annotation, it's a smart idea to have well-trained professional annotators handle the annotation of your dataset.