
Weakly Supervised Learning Unlocks Medical Imaging Insights

Image: 3D rendering of an AI robot analyzing X-ray brain tomography

Image Source: Phonlamai Photo/Shutterstock.com

By Becks Simpson for Mouser Electronics

Published September 2, 2021

Artificial Intelligence (AI) has advanced image-based medical diagnostics by detecting and measuring abnormalities faster, and often more accurately, than human experts. Building high-quality AI models that generalize across populations is imperative to improving patient outcomes and personalizing treatments. However, such models have typically required vast amounts of data and intricate dataset labels from which machines can learn.

Today, a branch of Deep Learning (DL) known as weakly supervised learning is helping physicians garner more insights with less effort by reducing the need for complete, exact, and accurate data labels. Weakly supervised learning works by leveraging more readily available coarse labels—such as labels at the image level rather than segmentations of regions of interest within the image—and allows pre-trained models and common interpretability methods to be reused. In the following, we’ll examine the role data management plays in weakly supervised learning.

Labeling Is Difficult in Medical Imaging

Labeling images is particularly challenging in the medical industry. To begin, labeled data is limited and hard to come by because medical images and data about results and outcomes are generally stored in separate systems. For example, images from computed tomography (CT) or magnetic resonance imaging (MRI) might be available in hospital data, but the results of biopsies or tumor removals are typically stored in a pathology lab—often a private clinic outside the hospital. Although it is possible to reconcile images and labels in some cases, accessing and aggregating the data can become quite time-consuming, especially when more than one private clinic is involved.

Additionally, finding and labeling signs of disease and its progression—called biomarkers—within images is notoriously time-consuming and complex because data must be labeled pixel by pixel, resulting in thousands of labels per image. This is especially true in applications where an algorithm is expected to segment areas of an image or produce specific localizations of a region, such as a lesion or surgical boundaries. It is often costly because expert knowledge is usually required and labels are needed in three dimensions, as with MRI and CT image volumes. Taken together, these downsides make generating labels for imaging data an expensive exercise, and they limit the likelihood of being able to outsource the labeling process.

Because of the expertise needed, label quality can vary and affect the final performance of the DL model. Label accuracy is one issue. Commonly, less-experienced radiologists or medical residents must annotate data for training, and their results are not as accurate as those of a clinician with decades of experience. Inter-reader and intra-reader variability also come into play: the former describes how annotations differ slightly between readers, while the latter refers to how a single reader asked to segment the same image at two different points in time will also produce slightly different results.

Finally, human labeling can limit results as well. One benefit of machine learning is that a model can derive insights that humans never could, and constraining labels to what humans input potentially forfeits that benefit. For instance, the AI would only learn to replicate what humans think for certain tasks, meaning it can unintentionally reproduce a particular annotator's bias. Additionally, features in other areas of the input data can be predictive yet discarded because they do not fall directly within the selected region of interest. For example, indications of disease might be evident in surrounding tissue or in a nearby organ.

Leveraging Weakly Supervised Learning

In these cases, it is often more beneficial to use a coarser label, such as whether an image contains cancer or some other disease of interest, and allow the model to find the most discriminative features on its own (Figure 1). This is where weakly supervised learning comes in.


Figure 1: Example of automated annotation using weakly supervised learning where the AI found predictive features that pathologists did not detect. (Source: Pathology Informatics Team, RIKEN Center for Advanced Intelligence Project)

Weakly supervised learning describes the branch of DL that aims to reduce the number of labels or the level of detail required to produce a well-performing DL model. The approach can be roughly separated into three main categories: incomplete, inexact, and inaccurate labels. The word “roughly” applies because multiple labeling situations can occur in a single dataset, and weakly supervised techniques aim to help with any combination as required:

  • Incomplete labels generally arise when part of the dataset is labeled and the rest is not.
  • Inexact labels use the overall outcome for an image without segmenting the specific region of interest.
  • Inaccurate labels stem from humans' lack of expertise and from ambiguity or uncertainty between certain disease indicators.

Interestingly, inexact labels can be more useful than incomplete or inaccurate labels when a coarser, more readily available label produces good results. Inexact labels are easier to get right because they don’t require the same level of detail as other labels, and they are often easier to obtain; for example, extracting the cancer stage from a report to indicate that a scan contains cancer is far quicker than manually highlighting the cancerous regions in a 3D image volume. With inexact labels, the dataset will likely have more labels available and at a higher level of accuracy. In particular, this reduces the need for a high level of expertise to highlight all relevant pixels, and it improves label accuracy because it’s easier to give a binary answer than to detail all the features contributing to an outcome.
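To make this concrete, below is a minimal, hypothetical sketch of deriving such an inexact, image-level label from a free-text pathology report rather than hand-segmenting the scan. The report phrasing and the stage-matching pattern are illustrative assumptions, not a real clinical parsing pipeline.

```python
import re

def coarse_label_from_report(report_text: str) -> int:
    """Return 1 if the report mentions a cancer stage, else 0.

    A hypothetical image-level (inexact) label: one binary answer
    per scan instead of thousands of per-pixel annotations.
    """
    # e.g., "Findings consistent with adenocarcinoma, Stage II."
    pattern = r"\bstage\s+(0|I{1,3}V?|IV|\d)"
    return 1 if re.search(pattern, report_text, re.IGNORECASE) else 0

print(coarse_label_from_report("Adenocarcinoma, Stage II."))   # 1
print(coarse_label_from_report("No evidence of malignancy."))  # 0
```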

A popular way to leverage these inexact labels for the most common medical-imaging use cases, namely detecting and localizing regions of interest, is a two-step process:

  • Training a backbone, such as a DL model that predicts the classes described by the inexact labels.
  • Applying pixel-attribution methods—also known as saliency or interpretability methods—once the model is developed, to highlight the regions most relevant to its predictions on particular scans.

Figure 2 illustrates examples of the different gradient-based pixel attribution methods.


Figure 2: Two input images (goldfish and bear) with examples of the gradient-based pixel attribution methods available for performing segmentation during weakly supervised learning. (Source: TF Keras Vis on GitHub)

Convolutional Neural Networks as the Backbone

Because medical use cases so often involve imaging data, it’s no surprise that Convolutional Neural Networks (CNNs) are the primary DL architecture used as the basis for weakly supervised learning. CNNs work by learning to reduce the millions of pixels in a medical scan—typically a three-dimensional volume—to a low-dimensional representation and then mapping that representation to class labels.
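As a rough illustration, the sketch below builds a small 3D CNN backbone in Keras that compresses a volume into a compact representation before mapping it to coarse class labels. The input shape, filter counts, and two-class setup are illustrative assumptions, not a tuned architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_backbone(input_shape=(64, 64, 64, 1), num_classes=2):
    # Input: a single-channel 3D volume (sizes assumed for illustration).
    inputs = tf.keras.Input(shape=input_shape)
    x = inputs
    # Stacked convolution blocks progressively reduce millions of voxels
    # to a compact feature representation.
    for filters in (16, 32, 64):
        x = layers.Conv3D(filters, 3, padding="same", activation="relu")(x)
        x = layers.MaxPooling3D(2)(x)
    x = layers.GlobalAveragePooling3D()(x)  # low-dimensional summary
    # Map the representation to the coarse, image-level class labels.
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)

model = build_backbone()
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```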

In weakly supervised learning, it is possible to combine approaches. A new network could be trained from scratch on your dataset, or a pre-trained network could be used to perform transfer learning on the new task, which affords the benefits of other, similar data sources. For example, ResNet50 and VGG16 are two CNN architectures trained on millions of images of everyday objects and scenes. Although they were not trained on medical images, they can still be useful because the convolutional filters learned in the earlier layers tend to capture generic features such as lines, shapes, and textures, which are just as useful for medical imaging.

Using one of these models for transfer learning is as simple as removing the final class-prediction layer and reinitializing it with a layer representing the classes for the new medical-imaging task. Even though the end goal is to highlight the relevant objects and regions of interest in the images, the first step is merely to predict whether those regions of interest exist in the image at all.
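A minimal sketch of that head swap in Keras might look like the following, assuming 2D scan slices resized to 224×224 RGB and a single binary disease-present label; the input handling (including ImageNet-style preprocessing, omitted here) and the sigmoid output are assumptions for illustration.

```python
import tensorflow as tf

# Load ResNet50 with its generic ImageNet filters but without the
# original 1000-class prediction layer.
base = tf.keras.applications.ResNet50(
    weights="imagenet",
    include_top=False,
    input_shape=(224, 224, 3),
    pooling="avg",
)
base.trainable = False  # keep the generic early-layer features frozen

# Re-initialize the head for the new medical-imaging classes.
inputs = tf.keras.Input(shape=(224, 224, 3))
x = base(inputs, training=False)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```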

AI Interpretability for Weakly Supervised Localization

Once the DL backbone is trained and can predict the classes of interest with good accuracy, the next step is to use one of the many AI interpretability methods to produce segmentations of the regions of interest. These interpretability methods—also called pixel-attribution methods—were developed to gain insight into what a deep-learning model was looking at in an image when it made a certain prediction. The output is some form of visualization—often called a saliency map—which can be calculated in several different ways depending on the end goal.

One of the most popular approaches uses gradient-based saliency maps. At its core, this involves taking the output prediction and inspecting the neurons that contributed to it. Depending on the method, this inspection can go all the way back to the input layer (Vanilla Gradients), or it can stop at a later layer, such as the last convolutional layer in the network architecture (GradCAM, Figure 3). Other variations produce smoother regions of interest, address the limitations of the simpler variants, or generate tighter segmentations around the desired features.
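The sketch below shows the core of GradCAM under those ideas, assuming a flat 2D Keras classifier and that `last_conv_name` names its final convolutional layer (for nested transfer-learning models or 3D backbones, the layer lookup and gradient axes would need adjusting).

```python
import numpy as np
import tensorflow as tf

def grad_cam(model, image, last_conv_name, class_index):
    # Model that maps the input to both the last conv activations
    # and the final class predictions.
    grad_model = tf.keras.Model(
        model.inputs,
        [model.get_layer(last_conv_name).output, model.output],
    )
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[np.newaxis, ...])
        class_score = preds[:, class_index]
    # Gradient of the class score with respect to the conv feature maps.
    grads = tape.gradient(class_score, conv_out)
    # Average the gradients per channel to get importance weights.
    weights = tf.reduce_mean(grads, axis=(1, 2))
    # Weighted sum of the feature maps, keeping only positive evidence.
    cam = tf.einsum("bijc,bc->bij", conv_out, weights)
    cam = tf.nn.relu(cam)[0]
    # Normalize to [0, 1] so it can be overlaid as a saliency map.
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()
```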


Figure 3: GradCAM, an ML interpretability method that can be used to segment features in weakly supervised learning, takes the gradients of the output class with respect to the last convolutional layer. (Source: Zhou et al. from Computer Science and Artificial Intelligence Laboratory, MIT)

Conclusion

Until recently, identifying biomarkers in medical images required large volumes of intricately labeled imaging data. Techniques such as weakly supervised learning, however, reduce the need for complete, exact, and accurate data labels, unlocking insights that were previously too costly in time and expertise to attain. Weakly supervised learning works by leveraging more readily available coarse labels, such as labels at the image level rather than segmentations of regions of interest within the image. It allows the reuse of pre-trained CNN models and then applies common interpretability methods to highlight regions of interest based on the predicted class. Together, these two points allow models to be trained on medical imaging data for various applications without extensive, pixel-level annotations. This saves time and money and can uncover predictive features previously unknown to clinicians, improving diagnostic accuracy and patient outcomes.

About the Author

Becks is a Machine Learning Lead at AlleyCorp Nord, where developers, product designers, and ML specialists work alongside clients to bring their AI product dreams to life. She has worked across the spectrum of deep learning and machine learning, from investigating novel deep-learning methods and applying research directly to real-world problems, to architecting pipelines and platforms for training and deploying AI models in the wild, to advising startups on their AI and data strategies.
