Self-Supervised Learning in Computer Vision: Image Classification
By: Johan Fernandes, Statistics Canada
Introduction
Computer Vision (CV) comprises tasks such as Image Classification, Object Detection and Image SegmentationFootnote 1. Image Classification involves assigning an entire image to one of a finite set of classes. For example, if a "Dog" occupies 90% of an image, the image is labeled as "Dog". Multiple Deep Learning (DL) models built on Neural Networks (NN) have been developed to classify images with high accuracy. The state-of-the-art models for this task utilize NNs of various depths and widths.
These DL models are trained on multiple images of various classes to develop their classification capabilities. Like training a human child to distinguish between images of a "Car" and a "Bike", these models need to be shown multiple images of classes such as "Car" and "Bike" to build this knowledge. However, humans have the additional advantage of developing context by observing their surroundings. Our minds can pick up sensory signals (audio and visual) that help us develop this knowledge for all types of objectsFootnote 2. For instance, when we observe a car on the road, our minds can generate contextual knowledge about the object (car) through visual features such as location, color, shape, the lighting surrounding the object, and the shadow it creates.
A DL model for CV, on the other hand, must be trained to develop such knowledge, which is stored in the weights and biases of its architecture and updated during training. The most popular training process, called Supervised Learning, involves training the model with each image and its corresponding label to improve its classification capability. However, generating labels for all images is time consuming and costly, as it requires human annotators to label each image manually. Self-Supervised Learning (SSL) is a newer training paradigm that can be used to train DL models to classify images without the bottleneck of having well-defined labels for each image during training. In this work, I will describe the current state of SSL and its impact on image classification.
Significance of Self-Supervised Learning (SSL)
SSL aims to set up an environment in which the DL model is trained to extract as many features, or signals, as possible from the image. Recent studies have shown that the feature extraction capability of DL models is restricted when they are trained with labels, as they must pick only the signals that help them develop a pattern to associate similar images with that labelFootnote 2Footnote 3. With SSL, the model is trained to understand the sensory signals (e.g., shape and outline of objects) from the input images without being shown the associated labels.
Additionally, since SSL does not limit the model to developing a discrete representation (label) of an image, it can learn to extract much richer features from an image than its supervised counterpart. It has more freedom to improve how it represents an image, as it no longer needs to be trained to associate a label with an imageFootnote 3. Instead, the model can focus on developing a representation of the images through the enhanced features it extracts and on identifying a pattern so that images from the same class can be grouped together.
SSL uses more feedback signals to improve its knowledge of an image than supervised learning doesFootnote 2. As a result, the term self-supervised is increasingly used in place of unsupervised learning: an argument can be made that DL models receive their supervisory signals from the data itself rather than from labels, so they do have some form of supervision and are not completely unsupervised during training.
These signals are enhanced through a technique known as data augmentation, in which the image is cropped, certain sections of the image are hidden, or the color scheme of the image is modified. With each augmentation, the DL model receives a different image that still belongs to the same class or category as the original image. By exposing the model to such augmented images, it can be trained to extract rich features based on the visible sections of the imageFootnote 4. Furthermore, this training method removes the overhead of generating labels for all images, opening up the possibility of adopting image classification in fields where labels are not readily available. In the next sections, I will describe the components needed for self-supervised learning.
Components of self-supervised learning methods:
Encoder / Feature Extractor:
As humans, when we look at an image, we can automatically identify features such as the outline and colour of objects to determine the type of object in the image. For a machine to perform such a task, we utilize a DL model, which we refer to as an encoder or a feature extractor since it can automatically encode and extract features of an image. The encoder consists of sequentially ordered NN layers, as shown in Fig 1.
An image contains multiple features. The encoder's job is to extract only the essential features, ignore the noise, and convert these features into a vector representation. This encoded representation of the image can be projected into n-dimensional or latent space, depending on the size of the vector. As a result, for each image, the encoder generates a vector to represent the image in that latent space. The underlying principle is to ensure that vectors of images from the same class can be grouped together in that latent space. Consequently, vectors of "Cats" will be clustered together while vectors of "Dogs" will form a separate group, with both groups of vectors distinctly separated from each other.
The encoders are trained to improve their representation of images so that they can encode richer features into vectors that are easier to distinguish in latent space. The vectors generated by encoders can be used to address multiple CV tasks, such as image classification and object detection. The NN layers in the encoder would traditionally be convolutional neural network (CNN) layers, as shown in Fig 1; however, the latest DL models utilize attention layers in their architecture. These encoders are called Transformers, and recent works have begun to use them for image classification due to the impact they have had in the field of natural language processing. The vectors can be fed to classification models, which can be a series of NN layers or a simpler neighbourhood-based model such as a K-Nearest Neighbor (KNN) classifier. Current literature on self-supervised learning frequently uses KNN classifiers to group images, as they only require the number of neighbours as a parameter and do not need any further training of the encoder.
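To make this concrete, below is a minimal sketch in Python (with PyTorch, torchvision and scikit-learn) of using an encoder as a feature extractor and classifying its output vectors with a KNN model. The ImageNet-pretrained ResNet-50 is only a stand-in for an encoder trained with SSL, and the random tensors and variable names are illustrative assumptions.

```python
# Minimal sketch: an encoder as a feature extractor, with a KNN classifier on
# its output vectors. The pretrained ResNet-50 is a stand-in for an SSL-trained
# encoder; the random tensors below are placeholders for real image batches.
import torch
import torchvision.models as models
from sklearn.neighbors import KNeighborsClassifier

# Encoder: a ResNet-50 whose classification head is removed, so it outputs a
# 2048-dimensional vector for each image.
encoder = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
encoder.fc = torch.nn.Identity()
encoder.eval()

@torch.no_grad()
def embed(images: torch.Tensor) -> torch.Tensor:
    """Map a batch of images (N, 3, 224, 224) to latent vectors (N, 2048)."""
    return encoder(images)

# Hypothetical stand-ins for a small labelled reference set and a query set.
reference_images = torch.randn(32, 3, 224, 224)
reference_labels = torch.randint(0, 2, (32,))        # e.g., 0 = "Cat", 1 = "Dog"
query_images = torch.randn(8, 3, 224, 224)

# Fit the KNN model on the encoder's vectors and classify new images by
# looking at their nearest neighbours in latent space.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(embed(reference_images).numpy(), reference_labels.numpy())
predictions = knn.predict(embed(query_images).numpy())
print(predictions)
```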
Data Augmentation:
Labels of images are not provided to encoders trained in a self-supervised format. Consequently, the representation capability of encoders has to be improved solely from the images they receive. As humans, we can look at objects from different angles and perspectives to understand their shape and outline. Similarly, augmented images assist encoders by providing different perspectives of the original training images. These image perspectives can be developed by applying strategies such as Resized Crop and Color Jitter to the image, as shown in Fig 2. Augmented images enhance the encoder's ability to extract rich features from an image: the encoder learns from one section or patch of the image and applies that knowledge to predict other sections of the imageFootnote 4.
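As an illustration, the sketch below uses torchvision transforms to generate two augmented views of the same image, in the spirit of the Resized Crop and Color Jitter strategies mentioned above. The exact parameter values and the input file name are illustrative assumptions rather than the settings of any particular method.

```python
# Minimal sketch: two differently augmented "views" of the same image.
from PIL import Image
import torchvision.transforms as T

augment = T.Compose([
    T.RandomResizedCrop(224, scale=(0.2, 1.0)),  # random crop, resized to 224x224
    T.RandomHorizontalFlip(p=0.5),
    T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
    T.RandomGrayscale(p=0.2),
    T.ToTensor(),
])

image = Image.open("example.jpg").convert("RGB")  # hypothetical input image
view_1 = augment(image)  # one random perspective of the image
view_2 = augment(image)  # a second, differently augmented perspective
```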
Siamese Network architecture:
Many self-supervised learning methods use the Siamese Network architecture to train encoders. As shown in Fig 3, a Siamese Network consists of two encoders that can share the same architecture (for example, ResNet-50 for both encoders)Footnote 3. Both encoders receive batches of images during training (training batches). From each batch, both encoders receive the same image, but with different augmentation strategies applied. Consider the two encoders E1 and E2 in Fig 3. In this network, an image (x) is augmented by two different strategies to generate x1 and x2, which are fed to E1 and E2 respectively. Each encoder then produces a vector representation of the image, which can be used to measure similarity and calculate a loss.
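The sketch below illustrates this forward pass, with two ResNet-18 backbones standing in for E1 and E2. The random tensors that stand in for the augmented views x1 and x2, and the cosine similarity used to compare the two output vectors, are illustrative assumptions rather than the exact setup of any specific method.

```python
# Minimal sketch: a Siamese forward pass with two encoders of the same architecture.
import copy
import torch
import torch.nn.functional as F
import torchvision.models as models

def make_encoder() -> torch.nn.Module:
    backbone = models.resnet18(weights=None)
    backbone.fc = torch.nn.Identity()  # output a 512-dimensional vector per image
    return backbone

student = make_encoder()            # E1: updated by backpropagation
teacher = copy.deepcopy(student)    # E2: same architecture and initial weights

# Stand-ins for two differently augmented views (x1, x2) of the same batch of images.
x = torch.randn(4, 3, 224, 224)
x1 = x + 0.1 * torch.randn_like(x)
x2 = x + 0.1 * torch.randn_like(x)

z1 = student(x1)          # vector representation from E1
with torch.no_grad():
    z2 = teacher(x2)      # vector representation from E2 (no gradients)

# Similarity between the two representations; a training loss would be derived from it.
similarity = F.cosine_similarity(z1, z2, dim=-1).mean()
print(similarity.item())
```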
During the training phase, the weights of the two encoders are updated through a process known as knowledge distillation, which follows a student-teacher training format. The student encoder is trained in an online format in which it undergoes forward and backward propagation, while the weights of the teacher encoder are updated at regular intervals from the stable weights of the student using techniques such as the Exponential Moving Average (EMA)Footnote 3.
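A minimal sketch of such an EMA update is shown below; two small linear layers stand in for the student and teacher encoders, and the momentum value of 0.996 is a typical but illustrative choice.

```python
# Minimal sketch: updating a teacher encoder as an exponential moving average (EMA)
# of a student encoder, instead of backpropagating through the teacher.
import copy
import torch

@torch.no_grad()
def ema_update(student: torch.nn.Module, teacher: torch.nn.Module, momentum: float = 0.996) -> None:
    """Move each teacher parameter a small step towards the matching student parameter."""
    for p_student, p_teacher in zip(student.parameters(), teacher.parameters()):
        p_teacher.data.mul_(momentum).add_(p_student.data, alpha=1.0 - momentum)

# Small stand-ins for the student and teacher encoders.
student = torch.nn.Linear(8, 4)
teacher = copy.deepcopy(student)

# Called once per training step, after the student's forward and backward pass.
ema_update(student, teacher)
```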
Contrastive vs Non-contrastive SSL methods:
Available SSL methods build on these components, each with additional changes through which it tries to improve on the others' performance. These learning methods can be grouped into two categories:
Contrastive learning methods
These methods require positive and negative pairs of each image to train and improve the representation capability of encoders. They utilize contrastive loss to train the encoders in a Siamese network with knowledge distillation. As shown in Fig 4, a positive pair would be an augmented image or patch from the same class as the original image. A negative pair would be an image or patch from another image that belongs to a different class. The underlying function of all contrastive learning methods is to help encoders generate vectors so that vectors of positive pairs are closer to each other, while those of negative pairs are further away from each other in latent space.
Many popular methods, such as SimCLRFootnote 4 and MoCoFootnote 5, are based on this principle and work efficiently on large natural object datasets like ImageNet. Positive and negative pairs of images are provided in each training batch to prevent the encoders from collapsing into a state where they produce vectors of only a single class. However, to train the encoders with negative pairs of images, these methods rely on large batch sizes (upwards of 4096 images per training batch). Furthermore, many datasets, unlike ImageNet, do not have multiple images per class, making the generation of negative pairs in each batch a difficult, if not impossible, task. Consequently, recent research is leaning towards non-contrastive methods.
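The sketch below shows a simplified contrastive (InfoNCE-style) loss in the spirit of SimCLR: vectors at the same index in the two batches form a positive pair, and every other image in the batch serves as a negative pair. The temperature and vector sizes are illustrative assumptions, and the full SimCLR formulation, which contrasts both views over the whole batch of 2N images, is omitted for brevity.

```python
# Minimal sketch: a simplified InfoNCE-style contrastive loss.
# z1[i] and z2[i] are the vectors of two augmented views of image i (positive pair);
# all other rows act as negative pairs.
import torch
import torch.nn.functional as F

def contrastive_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature    # (N, N) matrix of pairwise similarities
    targets = torch.arange(z1.size(0))    # the diagonal entries are the positive pairs
    # Cross entropy pulls each positive pair together and pushes negative pairs apart.
    return F.cross_entropy(logits, targets)

# Example usage with random stand-in vectors for a batch of 8 images.
loss = contrastive_loss(torch.randn(8, 128), torch.randn(8, 128))
print(loss.item())
```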
Non-Contrastive learning methods
Methods such as DINOFootnote 3, BYOLFootnote 6 and BarlowTwinsFootnote 7 train encoders in a self-supervised format without the need to distinguish images as positive and negative pairs in their training batches. Methods like DINO continue to use the Siamese Network in a student-teacher format and rely on heavy data augmentation. However, they improve on contrastive methods with a few enhancements:
- Patches of images provide a local view of the image to the student and a global view of the image to the teacher encoderFootnote 3.
- A prediction layer is added to the student encoder to generate a probability-based outputFootnote 3. This layer is only used during training.
- Instead of calculating a contrastive loss between pairs of images, the output from the encoders is used to calculate a classification-style loss, such as cross entropy or L2 loss, to determine whether the output vectors from the student and teacher encoders are similar (a simplified sketch of this loss follows this list)Footnote 3, Footnote 6, Footnote 7, Footnote 8.
- EMA, or another moving average method, is employed to update the teacher network's weights from the online weights of the student network, while backpropagation is avoided on the teacher networkFootnote 3, Footnote 6, Footnote 7, Footnote 8.
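Below is a minimal sketch of the distillation-style loss mentioned in the list above: the teacher's output is turned into a soft target distribution, and the student is trained to match it with cross entropy. The centering and multi-crop machinery of methods such as DINO is omitted, and the temperature values and output dimension are illustrative assumptions.

```python
# Minimal sketch: a non-contrastive, distillation-style loss between the
# probability outputs of a student and a teacher encoder.
import torch
import torch.nn.functional as F

def distillation_loss(student_out: torch.Tensor,
                      teacher_out: torch.Tensor,
                      student_temp: float = 0.1,
                      teacher_temp: float = 0.04) -> torch.Tensor:
    # The teacher provides soft targets; no gradients flow through it.
    teacher_probs = F.softmax(teacher_out.detach() / teacher_temp, dim=-1)
    student_log_probs = F.log_softmax(student_out / student_temp, dim=-1)
    # Cross entropy between the two distributions, averaged over the batch.
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()

# Example usage with random stand-in outputs from a 1024-dimensional prediction layer.
loss = distillation_loss(torch.randn(16, 1024), torch.randn(16, 1024))
print(loss.item())
```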
Unlike contrastive methods, these methods do not require a large batch size for training and do not incur the additional overhead of ensuring negative pairs in each training batch. Additionally, DL models such as the Vision Transformer, which can learn from a local view of an image and predict other similar local views while also considering the global view, have replaced conventional CNN encoders. These models have enhanced non-contrastive methods to the point where they surpass the image classification accuracies of supervised learning techniques.
Conclusion
Self-supervised learning is a training process that can help DL models train more efficiently than popular supervised learning methods without the use of labels. This efficiency is evident in the higher accuracy that DL models have achieved on popular datasets such as ImageNet when trained in a self-supervised setup compared to a supervised setup. Furthermore, self-supervised learning eliminates the need for labeling images before training, providing an additional advantage. The future looks bright for solutions that adopt this type of learning for image classification tasks as more research is being conducted on its applications in fields that do not involve natural objects, such as medical and document images.
Meet the Data Scientist
If you have any questions about my article or would like to discuss this further, I invite you to Meet the Data Scientist, an event where authors meet the readers, present their topic and discuss their findings.
Thursday, June 15
1:00 to 4:00 p.m. ET
MS Teams – link will be provided to the registrants by email
Register for the Data Science Network's Meet the Data Scientist Presentation. We hope to see you there!
Subscribe to the Data Science Network for the Federal Public Service newsletter to keep up with the latest data science news.