Artificial Intelligence Image Recognition

Learning What Makes Artificial Intelligence Image Recognition Happen

Have you ever wondered how image recognition driven by artificial intelligence works? There are two ingredients to image recognition with artificial intelligence: neural networks and image data. Let’s take a close look at both of these concepts individually, and then see how they come together to accomplish image recognition.

Photo: via Flickr, CC BY SA 2.0

Deep Learning And Neural Networks

Most of the recent advances in artificial intelligence image recognition, and in computer vision generally, are driven by deep learning systems. Deep learning is a subset of machine learning, which is itself a type of artificial intelligence. Machine learning algorithms take in data, analyze it to extract relevant patterns, and then make a prediction about new data. Deep learning algorithms can apply this general principle to more complex forms of data, like images and video. Without deep learning algorithms, training artificial intelligence systems to recognize images and drive automated vehicles would be very difficult or impossible.

Deep learning systems are built on neural networks. A neural network is composed of layers of simple computational units (often called neurons or nodes), arranged so that each layer takes the output of the layer before it as its input. This structure is loosely modeled on the type of information processing that is found in the human brain. There are various neural network architectures that deep learning systems use, and different architectures have different specialties. For the purposes of image recognition, the kind of neural network used is a Convolutional Neural Network, or CNN.

Convolutional Neural Networks: Image Learning Architectures

The function of a Convolutional Neural Network is to extract patterns from images. When we look at an object, our brains recognize the object by detecting relevant shapes and patterns within the object. A Convolutional Neural Network (CNN) functions similarly to this. 

A CNN takes an image as its input in the form of a grid of numbers, with values that represent the pixels in the image. It then applies a series of filters to the image, with each filter forming a representation of a portion of the image. These filtered representations are combined to create a complete representation. The CNN can then take this representation and make a prediction about its contents. Neural networks must be trained, meaning they are fed a large set of example images and allowed to learn from them, before they can make accurate predictions.
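
To make the idea of an image as numbers concrete, here is a minimal sketch in plain Python. The 3x3 grayscale image and the normalize helper are made up for illustration, and scaling pixel values into the 0 to 1 range is a common preprocessing step that is assumed here rather than taken from any particular system.

```python
# A hypothetical 3x3 grayscale image. Each value is one pixel's brightness,
# from 0 (black) to 255 (white).
image = [
    [0, 128, 255],
    [64, 192, 32],
    [255, 0, 16],
]


def normalize(pixels):
    """Scale 0-255 pixel values into the 0.0-1.0 range (assumed preprocessing)."""
    return [[value / 255 for value in row] for row in pixels]


scaled = normalize(image)
```

A color image would typically carry three such grids, one each for the red, green and blue channels.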

Creating Image Datasets

The other part of the equation in creating image recognition applications, apart from deep learning architectures, is the image dataset. Deep learning architectures and algorithms are only as good as the data that supports them.

Image recognition has gotten much better in recent years, largely because of datasets that are larger and more complex than previous datasets. ImageNet is one of the most widely used image datasets for the training of AI image classifiers, and it consists of over 14 million images organized into thousands of classes. These images are used to train AI models, and because of the variety of images within the dataset, classifiers are capable of recognizing more types of objects.

Recognizing The Images In The Dataset

Now that we’ve looked at the ingredients necessary for the creation of an image recognition AI, let’s see how these ingredients come together into an image recognition system.

The steps involved in the process of image recognition are: the extraction of features from the image, the analysis of these features, and the classification of the image based on these features.

To begin with, convolutional neural networks move a “filter” over the image that is being input into the network. The filter is a small grid of numbers. At each position, the pixel values under the filter are multiplied by the corresponding values in the filter and summed, producing a single number that represents that portion of the image. The filter is then moved across the rest of the image until a complete representation of the image has been created. This is the function of the convolutional layer.
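
The sliding-filter step can be sketched in a few lines of plain Python. This is an illustrative sketch, not how any production library implements it: the convolve2d function, the 4x4 image and the 2x2 edge filter are all hypothetical, and the sketch assumes a stride of 1 with no padding. In a real CNN the filter values are learned during training rather than written by hand.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (stride 1, no padding).

    At each position, multiply the overlapping values and sum them,
    producing one number in the output feature map.
    """
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    out_h = len(image) - kernel_h + 1
    out_w = len(image[0]) - kernel_w + 1
    feature_map = []
    for top in range(out_h):
        row = []
        for left in range(out_w):
            total = 0
            for i in range(kernel_h):
                for j in range(kernel_w):
                    total += image[top + i][left + j] * kernel[i][j]
            row.append(total)
        feature_map.append(row)
    return feature_map


# Hypothetical image: dark on the left half, bright on the right half.
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]

# Hypothetical hand-written filter that responds to a dark-to-bright vertical edge.
kernel = [
    [-1, 1],
    [-1, 1],
]

feature_map = convolve2d(image, kernel)
# feature_map is [[0, 18, 0], [0, 18, 0], [0, 18, 0]]
```

The filter produces a large value only in the middle column, where the image changes from dark to bright, and zero everywhere the image is uniform.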

Next, the information that represents the image must be transformed. The patterns found in images are nonlinear, but the output of a convolutional layer is only a linear combination of its inputs, so activation functions are applied to introduce nonlinearity into the representation. These nonlinear activation functions enable the rest of the network to continue analysis of the data and the extraction of features. There are a variety of nonlinear activation functions, such as TanH, Sigmoid and the Rectified Linear Unit (ReLU). Of these, ReLU is the most commonly used activation function.
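
The three activation functions named above are simple enough to write out directly. The following is a minimal sketch using only the Python standard library; the apply_activation helper and the sample feature map are hypothetical names and values chosen for illustration.

```python
import math


def relu(x):
    """Rectified Linear Unit: negative values become 0, the rest pass through."""
    return max(0.0, x)


def sigmoid(x):
    """Squash any value into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    """Squash any value into the range (-1, 1)."""
    return math.tanh(x)


def apply_activation(feature_map, activation):
    """Apply an activation function to every value in a 2D feature map."""
    return [[activation(value) for value in row] for row in feature_map]


# Hypothetical output of a convolutional layer.
feature_map = [
    [-18, 0, 18],
    [4, -2, 7],
]

activated = apply_activation(feature_map, relu)
# Negative values are zeroed: [[0, 0, 18], [4, 0, 7]]
```

Part of ReLU's appeal is visible here: it is a single comparison per value, which makes it cheap to compute across very large feature maps.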

After the nonlinear activation function is employed, the data is passed into another layer of the CNN. The next layer is called a pooling layer. Images contain large amounts of data, and analyzing the image can take a long time. Pooling layers reduce the amount of processing an image requires by simplifying the representation of the image. The pooling layer gets its name from the fact that it selects a small region of the image, just a few pixels on a side, and chooses a single value to represent that region. This process is repeated across the entire image. The effect is that the image is “downsampled”, or scaled down, but the critical information that represents the image is maintained.

There are numerous pooling functions that can be used to downsample an image, such as Average Pooling and Max Pooling. Average Pooling captures the average value of a region, while Max Pooling captures the maximum value of a region of pixels. In practice, Max Pooling tends to work best and therefore it is the most commonly used pooling function.
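
Both pooling functions can be sketched with one small helper. This is an illustrative sketch in plain Python: the pool2d and average functions and the 4x4 feature map are hypothetical, and the sketch assumes non-overlapping 2x2 regions, which is a common choice but not the only one.

```python
def pool2d(feature_map, size=2, reducer=max):
    """Downsample by replacing each non-overlapping size x size region
    with a single value chosen by the reducer function."""
    pooled = []
    for top in range(0, len(feature_map) - size + 1, size):
        row = []
        for left in range(0, len(feature_map[0]) - size + 1, size):
            region = [
                feature_map[top + i][left + j]
                for i in range(size)
                for j in range(size)
            ]
            row.append(reducer(region))
        pooled.append(row)
    return pooled


def average(values):
    """Mean of a list of numbers, used for Average Pooling."""
    return sum(values) / len(values)


# Hypothetical 4x4 feature map.
feature_map = [
    [1, 3, 2, 0],
    [5, 7, 4, 6],
    [8, 2, 1, 1],
    [0, 4, 3, 9],
]

max_pooled = pool2d(feature_map, reducer=max)      # [[7, 6], [8, 9]]
avg_pooled = pool2d(feature_map, reducer=average)  # [[4.0, 3.0], [3.5, 3.5]]
```

Either way, the 4x4 input becomes a 2x2 output, so later layers have a quarter as many values to process.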

Fully Connected Layers

In the fully connected layers, every neuron receives an input from every neuron in the previous layer. The fully connected layers are what actually do the discriminative learning in the CNN. To put that another way, they are what learn the patterns relevant to the image and enable classification. The fully connected portion of the CNN learns to discriminate between classes by adjusting the weights of the inputs, where each weight reflects how strongly an input is related to a class.

In a convolutional neural network, the image data from the convolutional layers must be flattened heading into the fully connected layers. Flattening unrolls the data into a single long feature vector that can be used by the fully connected layers.
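
Flattening and a single fully connected layer can be sketched as follows. Everything in this example is hypothetical: the flatten and fully_connected helpers, the two small feature maps, and especially the weights and biases, which are hand-picked here so the arithmetic is easy to follow. In a trained network those values would be learned from data.

```python
def flatten(feature_maps):
    """Unroll a list of 2D feature maps into one long feature vector."""
    return [value for fmap in feature_maps for row in fmap for value in row]


def fully_connected(inputs, weights, biases):
    """Compute one score per output neuron.

    Every output neuron sees every input: its score is the weighted sum
    of all inputs plus a bias.
    """
    return [
        sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        for neuron_weights, bias in zip(weights, biases)
    ]


# Two hypothetical 2x2 feature maps coming out of the earlier layers.
feature_maps = [
    [[1, 0], [0, 1]],
    [[2, 2], [0, 0]],
]

features = flatten(feature_maps)  # [1, 0, 0, 1, 2, 2, 0, 0]

# Hand-picked weights for two classes (one row of weights per class).
weights = [
    [1, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 0, 0],
]
biases = [0, -1]

scores = fully_connected(features, weights, biases)  # [2, 3]
predicted_class = scores.index(max(scores))          # class 1 has the higher score
```

Training consists of nudging those weights and biases until the highest score lines up with the correct class for as many training images as possible.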

Image Annotation

When preparing a dataset to be fed into a Convolutional Neural Network, the image data can be annotated. Image annotation is the process of adding metadata to the image, which can provide the CNN with extra information that helps it distinguish objects from one another.

The bounding box is the most common type of image annotation, and it defines where in an image the learning algorithm should look for an object, in addition to telling the algorithm what that object is. Other types of annotation include point annotation, which tracks specific points in an image, and semantic segmentation, which assigns a class to every pixel in a semantically defined region. Image annotation can greatly improve the accuracy and performance of an image classifier.
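
As a rough illustration of what this metadata can look like, here is a sketch of a bounding box annotation and a segmentation mask in plain Python. The BoundingBox class, its field names, the file name and the class IDs are all hypothetical; real annotation tools and datasets each define their own formats.

```python
from dataclasses import dataclass


@dataclass
class BoundingBox:
    """A labeled rectangle, in pixel coordinates (hypothetical format)."""
    label: str
    x_min: int
    y_min: int
    x_max: int
    y_max: int

    def area(self):
        """Number of pixels covered by the box."""
        return (self.x_max - self.x_min) * (self.y_max - self.y_min)

    def contains(self, x, y):
        """True if the pixel at (x, y) falls inside the box."""
        return self.x_min <= x < self.x_max and self.y_min <= y < self.y_max


# Bounding box annotations for one hypothetical image.
annotation = {
    "image": "street.jpg",
    "boxes": [
        BoundingBox("car", 10, 20, 110, 80),
        BoundingBox("person", 120, 30, 150, 100),
    ],
}

# Semantic segmentation instead stores one class ID per pixel.
# Hypothetical IDs: 0 = background, 1 = car.
segmentation_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
]

car = annotation["boxes"][0]
car_area = car.area()  # 100 * 60 = 6000 pixels
```

The bounding box is compact, four numbers and a label per object, while the segmentation mask is far more detailed and correspondingly more laborious to produce.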