You might have heard that machine learning algorithms are only as good as the data they are trained on. This reflects the fact that the data provided to the algorithm will determine what patterns the algorithm learns, and thus what content it may correctly recognize in the future. To quote a well known concept in computer science: “Garbage in, garbage out!” Consequently, it is important to use quality image datasets for the creation of image classification and computer vision systems. But how do you assess a dataset’s quality?
Thankfully, there are a number of image datasets available that have been used extensively for the training of impressive machine learning systems. These tried and tested datasets ensure that your image classification application is trained using high quality data.
The following image datasets have proven themselves useful and reliable. Note that they are listed in no particular order. Some of these datasets also come in slightly different variants. Where different versions of a dataset exist, they will be mentioned.
Despite its simplicity, MNIST is one of the most popular datasets for machine learning applications. MNIST is a collection of 70,000 images of handwritten digits. It is frequently used as a sanity-check dataset because it is easy to use and gives quick insight into the performance of an image recognition algorithm.
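To illustrate why MNIST is considered easy to work with, here is a minimal sketch of reading images from a raw byte buffer. It assumes the IDX layout in which MNIST is commonly distributed (a big-endian header with a magic number of 2051 for image files, followed by the image count, row count, column count, and then one byte per pixel). The function name parse_idx_images and the tiny synthetic buffer are illustrative only, not part of any official tooling.

```python
import struct


def parse_idx_images(data: bytes):
    """Parse an IDX image buffer into a list of images (each a list of pixel rows).

    Assumed layout: 16-byte big-endian header (magic, count, rows, cols),
    followed by count * rows * cols unsigned bytes.
    """
    magic, count, rows, cols = struct.unpack(">IIII", data[:16])
    if magic != 2051:
        raise ValueError(f"unexpected magic number {magic}, expected 2051")

    pixels = data[16:]
    if len(pixels) != count * rows * cols:
        raise ValueError("pixel data does not match the sizes in the header")

    images = []
    for i in range(count):
        start = i * rows * cols
        image = [
            list(pixels[start + r * cols : start + (r + 1) * cols])
            for r in range(rows)
        ]
        images.append(image)
    return images


# Synthetic stand-in for a real file: two 2x2 "images".
fake_buffer = struct.pack(">IIII", 2051, 2, 2, 2) + bytes([0, 255, 10, 20, 1, 2, 3, 4])
images = parse_idx_images(fake_buffer)
```

With a real file, the same function would be called on the decompressed file contents; most machine learning libraries also ship their own MNIST loaders, which are usually the more convenient choice.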
COCO, which stands for Common Objects in Context, contains 330,000 images in total, more than 200,000 of which are labeled. It covers 80 object categories shown within their natural context. The dataset supports object segmentation, recognition in context, and superpixel segmentation.
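As a rough sketch of what working with COCO-style labels looks like, the snippet below builds a tiny in-memory annotation dictionary and counts labeled objects per category. It assumes the usual COCO JSON layout with "images", "categories", and "annotations" lists, where each bounding box is stored as [x, y, width, height]. The sample records and the helper names are made up for illustration.

```python
from collections import Counter

# Hypothetical miniature annotation file in COCO-style layout.
coco_like = {
    "images": [{"id": 1, "file_name": "kitchen.jpg", "width": 640, "height": 480}],
    "categories": [{"id": 1, "name": "person"}, {"id": 2, "name": "cup"}],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 1, "bbox": [100.0, 50.0, 200.0, 300.0]},
        {"id": 11, "image_id": 1, "category_id": 2, "bbox": [400.0, 300.0, 40.0, 60.0]},
        {"id": 12, "image_id": 1, "category_id": 2, "bbox": [460.0, 310.0, 38.0, 55.0]},
    ],
}


def count_objects_per_category(dataset):
    """Count annotated objects for each category name."""
    names = {c["id"]: c["name"] for c in dataset["categories"]}
    return Counter(names[a["category_id"]] for a in dataset["annotations"])


def bbox_to_corners(bbox):
    """Convert [x, y, width, height] to (left, top, right, bottom)."""
    x, y, w, h = bbox
    return (x, y, x + w, y + h)


counts = count_objects_per_category(coco_like)
corners = bbox_to_corners(coco_like["annotations"][0]["bbox"])
```

For a real annotation file, the dictionary would come from json.load on the downloaded file instead of being written by hand.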
ImageNet is one of the most popular datasets for the training of new algorithms, thanks to its sheer size and high variation. The dataset is organized in accordance with the WordNet hierarchy, which contains over 100,000 phrases, and ImageNet aims to represent each phrase in the hierarchy with roughly 1,000 images. In total there are more than one and a half million images. ImageNet constitutes the gold standard to which other image databases are compared.
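To make the idea of a hierarchically organized dataset concrete, here is a small sketch that walks a category tree and collects everything below a given node. The CHILDREN mapping is a hypothetical miniature hierarchy invented for this example; the real WordNet hierarchy used by ImageNet is far larger and uses its own identifiers.

```python
# Hypothetical miniature hierarchy: parent category -> direct children.
CHILDREN = {
    "animal": ["dog", "cat"],
    "dog": ["beagle", "poodle"],
    "cat": ["tabby"],
}


def descendants(category):
    """Return every category below `category`, in depth-first order."""
    result = []
    for child in CHILDREN.get(category, []):
        result.append(child)
        result.extend(descendants(child))
    return result


def leaf_categories(category):
    """Return the most specific categories under `category`.

    A category with no children is its own leaf.
    """
    leaves = [c for c in descendants(category) if c not in CHILDREN]
    return leaves or [category]


all_animals = descendants("animal")
dog_breeds = leaf_categories("dog")
```

A structure like this lets you train on a broad class such as "dog" by pooling the images of all of its more specific subcategories.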
Google’s Open Images dataset is a massive repository of links to over 9 million images that have been annotated using more than 6,000 class labels. What is more, many of the images are available under a Creative Commons license. The 9 million training images, 40,000 validation images, and 125,000 test images have all been annotated using bounding boxes.
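Bounding box annotations usually need to be converted into pixel coordinates before they can be drawn or cropped. The sketch below assumes boxes are given as normalized XMin, XMax, YMin, YMax values between 0 and 1, which is how the Open Images box annotations are typically described. The function name to_pixel_box and the sample values are illustrative.

```python
def to_pixel_box(xmin, xmax, ymin, ymax, width, height):
    """Convert a normalized box to integer pixel corners.

    Inputs are fractions of the image size in [0, 1].
    Returns (left, top, right, bottom) in pixels.
    """
    for value in (xmin, xmax, ymin, ymax):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"normalized coordinate out of range: {value}")
    if xmin > xmax or ymin > ymax:
        raise ValueError("min coordinate is larger than max coordinate")
    return (
        round(xmin * width),
        round(ymin * height),
        round(xmax * width),
        round(ymax * height),
    )


# Hypothetical annotation for a 1024 x 768 image.
box = to_pixel_box(0.25, 0.75, 0.125, 0.5, width=1024, height=768)
```

Keeping the coordinates normalized in the annotation file means the same box stays valid if the image is resized, which is convenient when the dataset only stores links to images of varying sizes.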
The CIFAR-10 and CIFAR-100 datasets are small, both in the number of images they contain and in the size of the images themselves, which are only 32 x 32 pixels. The CIFAR-10 dataset was published first and includes 60,000 images covering 10 different classes. The CIFAR-100 dataset extends it and offers 100 classes with 600 images each. These datasets are regularly used as sanity-check datasets because they are versatile and easy to process. Their small size lets researchers try out different algorithms quickly. However, they are not well suited to training a sophisticated model.
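A quick back-of-the-envelope calculation shows just how small these datasets are. The sketch below assumes three color channels with one byte per channel (CIFAR images are color images); the constant names are chosen for this example.

```python
IMAGE_SIDE = 32       # images are 32 x 32 pixels
CHANNELS = 3          # assumption: RGB, one byte per channel
CIFAR10_IMAGES = 60_000
CIFAR10_CLASSES = 10

# Raw, uncompressed size of a single image and of the whole dataset.
bytes_per_image = IMAGE_SIDE * IMAGE_SIDE * CHANNELS
total_bytes = bytes_per_image * CIFAR10_IMAGES
total_mib = total_bytes / 2**20

# CIFAR-10: images are spread evenly over the classes.
images_per_class_cifar10 = CIFAR10_IMAGES // CIFAR10_CLASSES

# CIFAR-100: 100 classes with 600 images each.
cifar100_images = 100 * 600

print(f"{bytes_per_image} bytes per image, about {total_mib:.0f} MiB in total")
```

At well under 200 MiB of raw pixel data, the whole of CIFAR-10 fits comfortably in memory on an ordinary laptop, which is a large part of why experiments on it run quickly.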
The developers of the Fashion-MNIST dataset believed that MNIST was overused, so they created this dataset as a replacement for it. All images in the Fashion-MNIST dataset are grayscale, and each is labeled with one of 10 different classes. There are 60,000 training images and 10,000 test images in this dataset.
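Labels in Fashion-MNIST are stored as integers, so a small lookup table is handy when inspecting predictions. The sketch below uses the ten class names as they are commonly documented for the dataset, in label order 0 to 9; the helper label_name is a hypothetical convenience function for this example.

```python
# Class names in label order (0-9), as commonly documented for Fashion-MNIST.
FASHION_MNIST_CLASSES = (
    "T-shirt/top",
    "Trouser",
    "Pullover",
    "Dress",
    "Coat",
    "Sandal",
    "Shirt",
    "Sneaker",
    "Bag",
    "Ankle boot",
)

TRAIN_IMAGES = 60_000
TEST_IMAGES = 10_000


def label_name(label: int) -> str:
    """Translate an integer label into a human-readable class name."""
    if not 0 <= label < len(FASHION_MNIST_CLASSES):
        raise ValueError(f"label must be between 0 and 9, got {label}")
    return FASHION_MNIST_CLASSES[label]


# Share of the data held out for testing (one seventh of all images).
test_fraction = TEST_IMAGES / (TRAIN_IMAGES + TEST_IMAGES)
```

Because the split sizes and the number of classes match MNIST, code written for MNIST can usually be pointed at Fashion-MNIST with only the label names changed.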
The Caltech 101 and Caltech 256 datasets are used frequently because they cover a wide array of categories and represent each class reasonably well. Every class has around 50 images. Caltech 101 has 101 image classes and around 5,000 images in total, while Caltech 256 includes 256 classes with over 30,000 images in total.
LabelMe was created by the Computer Science and Artificial Intelligence Lab at MIT. LabelMe consists of almost 190,000 images, with over 60,000 annotated images. Moreover, there are more than 658,000 labeled objects in the dataset. This extensive labeling helps algorithms to recognize multi-faceted classes of objects. LabelMe contains diverse images with complex annotations, all of which are non-copyrighted. LabelMe also comes with a tool that allows users to annotate their own images and post them to the LabelMe database.
The COIL-100 dataset was compiled by the Columbia University Computer Science department and consists of images of 100 different toys. The toys were placed on a turntable and images were captured at different angles. Each toy was photographed in 72 poses, for a total of 7,200 images.
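The structure of COIL-100 can be reproduced with a few lines of arithmetic. The sketch below assumes the 72 poses are spaced evenly over a full turntable rotation, which works out to one image every 5 degrees; the variable names and the 1-based object numbering are illustrative.

```python
NUM_OBJECTS = 100
POSES_PER_OBJECT = 72

# Assumption: poses are evenly spaced over a full 360-degree rotation.
step_degrees = 360 // POSES_PER_OBJECT
angles = list(range(0, 360, step_degrees))

# One (object_id, angle) pair per image in the dataset.
views = [
    (object_id, angle)
    for object_id in range(1, NUM_OBJECTS + 1)
    for angle in angles
]

print(f"{len(views)} views, one every {step_degrees} degrees")
```

An index like this is useful for pose-aware experiments, for example training on every other angle and testing on the angles in between.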
The Labeled Faces in the Wild (LFW) dataset was designed specifically for unconstrained face recognition tasks. The database was compiled by the University of Massachusetts, and more than 5,700 unique faces are represented by more than 13,000 images in the dataset. The faces were detected using the Viola-Jones face detection method, and the images have been deep-funneled (aligned) to improve results for most face recognition algorithms.
While the image datasets listed above are widely used and have demonstrated their effectiveness, they are mainly general-purpose image datasets. You may find that your problem requires a more specific image dataset. In that case, it is important to know where to look for such datasets. Here are some valuable resources:
1. Kaggle - Kaggle is a data science website that collects datasets contributed by its users from all around the world. Kaggle is home to many well known image datasets as well as some unique ones. The various datasets include flowers, X-rays, fruits, blood cells, dogs, and many more. Most datasets on Kaggle are publicly available, although some have special terms and conditions you need to abide by in order to get access to the dataset.
2. UCI Machine Learning Repository - A large repository of datasets maintained by the University of California, Irvine. Most of the datasets available on this site are clean and require very little preprocessing. No registration is needed to download the datasets from the repository.
3. Google Cloud Public Dataset - A repository of free public datasets hosted in BigQuery and available through Google’s cloud platform. Image datasets available here include the 2017 Eclipse Megamovie Dataset and the Met Public Domain ArtWorks dataset.
4. VisualData - VisualData describes itself as a search engine for image data, enabling the user to search for image datasets by category and filter their search results with a variety of options.
MNIST - MNIST has been made available under the Creative Commons Attribution-Share Alike 3.0 license.
ImageNet - Users of ImageNet must abide by the licenses for the individual image URLs. Most images come from Flickr or other open source equivalents.
Open Images Dataset - Annotations are available under CC BY 4.0. Most images are available under Creative Commons licenses, but some are not. There is licensing information attached to individual images.
CIFAR-10/CIFAR-100 - The CIFAR sets are open and free for anyone to use. They were collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton, and are part of the 80 million tiny images dataset created by MIT.
Fashion-MNIST - The Fashion MNIST dataset is available to everyone free of charge and without limitation.
Caltech 101/256 - The Caltech datasets are available for use for any purpose, though the dataset organizers should be cited.
LabelMe - Any images uploaded to LabelMe are considered public domain and available for use without restrictions, though use of the dataset should be cited.
COIL - The COIL dataset is intended for non-commercial uses and should be cited according to the instructions provided with the dataset.
Labeled Faces in the Wild - This dataset is available for use without restrictions, although the different versions of the dataset have different citation requirements.