
Beyond Unimodal: Embracing Multimodal Deep Learning for Enhanced AI

Authors: Sérgio Moisés Macarringue, Nilesh Barla


Published on: May 19, 2023


There are five basic human senses: hearing, touch, smell, taste, and sight. We perceive and understand the world around us through these five modalities. Thus, "multimodal" refers to combining different channels of information simultaneously to understand our surroundings.


Using the human learning process as a model, artificial intelligence (AI) researchers also try to combine different modalities when training deep learning models.

Multimodal deep learning is a subfield of machine learning that combines multiple sources of data or modalities, such as images, text, audio, and video. The goal of multimodal deep learning is to create models that can process and integrate these modalities in order to improve performance on various tasks. These tasks can include natural language processing, computer vision, speech recognition, and many others.

It involves training deep neural networks to process and integrate different types of data, in order to improve performance on tasks that require a more comprehensive understanding of the input data.


Akbari, H., Yuan, L., Qian, R., Chuang, W., Chang, S., Cui, Y., & Gong, B. (2021). VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text. ArXiv. /abs/2104.11178

In traditional deep learning, neural networks are trained on a single type of data, such as images or text; this is known as unimodal. However, in real-world applications, data often comes in multiple modalities, and the combination of different types of data can provide a more complete understanding of the task at hand.



How does multimodal deep learning work?

Multimodal deep learning works by combining information from multiple modalities, such as images, text, audio, and video, to create a joint representation of the data. This joint representation is then used to perform various tasks, such as classification, generation, or retrieval. The overall task can be divided into three main phases: individual feature learning, information fusion, and testing.


We’ll need the following:

  • At least two information sources

  • An information processing model for each source

  • A learning model for the combined information

Given these prerequisites, let’s take a look at the steps involved in multimodal learning in more detail. The process of multimodal deep learning typically involves the following steps:

  • Data Preprocessing: The data from each modality is preprocessed into a form the networks can consume. This can involve steps such as image resizing and normalization, audio signal processing, and text tokenization, along with other techniques specific to each modality.

  • Feature Extraction: Each modality is fed into a separate deep neural network to extract features. These features can be in the form of embeddings, which represent a compact and semantically meaningful representation of the input data.

  • Modality Integration: The features from each modality are combined to create a joint representation of the data. This can be done through techniques such as early fusion, where the features are concatenated before being fed into a final classifier, or late fusion, where the output of multiple classifiers is combined at a later stage.

  • Task-specific Training: The joint representation is used to perform a specific task, such as classification or retrieval. The deep neural network is trained to optimize for the task-specific objective using labeled data.

Li, J., Selvaraju, R. R., Gotmare, A. D., Joty, S., Xiong, C., & Hoi, S. (2021). Align before Fuse: Vision and Language Representation Learning with Momentum Distillation. ArXiv. /abs/2107.07651

When we build the model, we must keep in mind that we first generate feature representations for the different modalities. For example, if we want to learn a joint representation of an image and its accompanying text, we can build two separate modules: one that transforms the text into text embeddings and one that transforms the image into image embeddings. Once the two embeddings are created, we can concatenate them.


In the example below, we will use BERT to generate text embeddings.


import torch
import transformers

# Load the pre-trained BERT model and tokenizer
model_name = 'bert-base-uncased'
model = transformers.BertModel.from_pretrained(model_name)
tokenizer = transformers.BertTokenizer.from_pretrained(model_name)

# Define a function to create text embeddings using BERT
def create_text_embedding(text):
    # Tokenize input text
    input_ids = torch.tensor([tokenizer.encode(text, add_special_tokens=True)])

    # Obtain the BERT model's output
    with torch.no_grad():
        outputs = model(input_ids)

    # Extract the final hidden states from the BERT model
    hidden_states = outputs[0]

    # Compute the text embedding by averaging the hidden states
    text_embedding = torch.mean(hidden_states, dim=1)

    return text_embedding

Similarly, to generate image embeddings we can use any CNN- or transformer-based model. Let's use VGG16 in this case.


import torch
import torchvision.models as models
import torchvision.transforms as transforms
from PIL import Image

# Load the pre-trained VGG16 model
model = models.vgg16(pretrained=True)

# Define the image preprocessing transformation
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

# Define a function to create image embeddings using VGG16
def create_image_embedding(image_path):
    # Load and preprocess the image
    image = Image.open(image_path).convert("RGB")
    input_tensor = preprocess(image)
    input_batch = input_tensor.unsqueeze(0)

    # Move the model and the input tensor to GPU if available
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    input_batch = input_batch.to(device)

    # Set the model to evaluation mode
    model.eval()

    # Disable gradient calculation
    with torch.no_grad():
        # Forward pass through the model
        features = model(input_batch)

    # Flatten the features
    image_embedding = torch.flatten(features, start_dim=1)

    return image_embedding

Now that we have created the text and image embeddings, we can join them. The code below brings both embedding functions together into a single pipeline; a short sketch of the actual concatenation step follows the flow chart.

# Load the pre-trained models
bert_model_name = 'bert-base-uncased'
bert_model = transformers.BertModel.from_pretrained(bert_model_name)

vgg_model = models.vgg16(pretrained=True)

# Define the image preprocessing transformation
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

# Define a function to create text embeddings using BERT
def create_text_embedding(text):
    tokenizer = transformers.BertTokenizer.from_pretrained(bert_model_name)

    # Tokenize input text
    input_ids = torch.tensor([tokenizer.encode(text, add_special_tokens=True)])

    # Obtain the BERT model's output
    with torch.no_grad():
        outputs = bert_model(input_ids)

    # Extract the final hidden states from the BERT model
    hidden_states = outputs[0]

    # Compute the text embedding by averaging the hidden states
    text_embedding = torch.mean(hidden_states, dim=1)

    return text_embedding

# Define a function to create image embeddings using VGG16
def create_image_embedding(image_path):
    # Load and preprocess the image
    image = Image.open(image_path).convert("RGB")
    input_tensor = preprocess(image)
    input_batch = input_tensor.unsqueeze(0)

    # Move the model and the input tensor to GPU if available
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    vgg_model.to(device)
    input_batch = input_batch.to(device)

    # Set the model to evaluation mode
    vgg_model.eval()

    # Disable gradient calculation
    with torch.no_grad():
        # Forward pass through the model
        features = vgg_model(input_batch)

    # Flatten the features
    image_embedding = torch.flatten(features, start_dim=1)

    return image_embedding


Flow chart of the code above.
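
To make the joining step explicit, here is a minimal sketch that calls the two functions above and concatenates their outputs. The example text and the image path ("dog.jpg") are placeholders, not files from this article.


# Hypothetical inputs for illustration only
text_embedding = create_text_embedding("A dog playing in the park")  # shape: (1, 768)
image_embedding = create_image_embedding("dog.jpg")                  # shape: (1, 1000)

# Bring both tensors onto the CPU and concatenate them into one joint vector
joint_embedding = torch.cat((text_embedding.cpu(), image_embedding.cpu()), dim=1)
print(joint_embedding.shape)  # torch.Size([1, 1768])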

These types of models are usually trained with a contrastive learning approach. Although computationally expensive, it is very effective and yields good results when trained for a long period on a large dataset.
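
As a rough illustration, below is a minimal sketch of a CLIP-style symmetric contrastive loss. It assumes the text and image embeddings have already been projected to a common dimension; the function name and temperature value are illustrative and not taken from any of the papers cited here.


import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(text_embeddings, image_embeddings, temperature=0.07):
    # text_embeddings, image_embeddings: tensors of shape (batch_size, dim),
    # assumed to have been projected to the same dimension beforehand
    text_embeddings = F.normalize(text_embeddings, dim=1)
    image_embeddings = F.normalize(image_embeddings, dim=1)

    # Cosine similarities between every text/image pair in the batch
    logits = text_embeddings @ image_embeddings.t() / temperature

    # Matching pairs lie on the diagonal
    targets = torch.arange(logits.size(0), device=logits.device)

    loss_text_to_image = F.cross_entropy(logits, targets)
    loss_image_to_text = F.cross_entropy(logits.t(), targets)
    return (loss_text_to_image + loss_image_to_text) / 2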


Types of Multimodal Deep Learning

There are several types of multimodal deep learning that have been developed to handle different types of multimodal data and tasks. Here are some of the most common types:

  1. Late Fusion: In this approach, each modality is processed by a separate deep neural network, and the resulting feature vectors are concatenated and fed into a final layer that produces the output. Late fusion is commonly used in image and text classification tasks.

  2. Early Fusion: Early fusion involves merging the input data from different modalities before it is fed into the neural network. This approach is used in tasks that require a joint representation of multiple modalities, such as speech recognition.

  3. Cross-modal Retrieval: Cross-modal retrieval involves finding the relationship between two different modalities. For example, given a textual query, the goal might be to retrieve relevant images. Cross-modal retrieval is commonly used in multimedia search and recommendation systems.

  4. Multimodal Sequence-to-Sequence: This approach is used in tasks where the input and output data are both sequential, such as speech-to-text or machine translation. The input sequence may consist of both text and audio, and the output sequence may be text.

  5. Multimodal Generative Models: Generative models can learn to generate new samples from multiple modalities, such as text and images. This approach has been used in applications such as image captioning and video prediction.


Huang, S.-C., Pareek, A., Seyyedi, S., Banerjee, I., & Lungren, M. P. (2020). Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. npj Digital Medicine, 3. https://doi.org/10.1038/s41746-020-00341-z

There are many variations and combinations of these types of multimodal deep learning, and researchers continue to develop new techniques and architectures to handle different types of data and tasks.


Late Fusion

Late fusion is a common approach used in multimodal deep learning to combine information from multiple modalities. In late fusion, the features extracted from each modality are processed by separate deep neural networks, and the resulting feature vectors are concatenated and fed into a final layer that produces the output.


The late fusion approach is typically used in tasks such as image and text classification, where the input data consists of both image and text modalities. For example, in a scene classification task, the input might be an image and a caption describing the scene, and the output might be a classification label for the scene category.


In the late fusion approach, the features from the image modality are extracted using a convolutional neural network (CNN), and the features from the text modality are extracted using a recurrent neural network (RNN) or a transformer network. The features from each modality are then concatenated and fed into a fully connected layer for classification.
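
Here is a minimal sketch of such a late fusion classifier. The encoder modules, feature dimensions, and number of classes are assumptions passed in by the caller rather than a specific published architecture.


import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    def __init__(self, image_encoder, text_encoder, image_dim, text_dim, num_classes):
        super().__init__()
        # The encoders (e.g. a CNN and an RNN/transformer) are passed in and
        # are assumed to output (batch, image_dim) and (batch, text_dim)
        self.image_encoder = image_encoder
        self.text_encoder = text_encoder
        self.classifier = nn.Linear(image_dim + text_dim, num_classes)

    def forward(self, image, text):
        image_features = self.image_encoder(image)
        text_features = self.text_encoder(text)
        # Concatenate the per-modality features and classify the fused vector
        fused = torch.cat([image_features, text_features], dim=1)
        return self.classifier(fused)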


Late fusion has several advantages over early fusion, which is another common approach for combining multiple modalities. In early fusion, the input from different modalities is combined before processing, which can lead to a larger input size and slower training. Late fusion allows each modality to be processed separately and can improve performance by allowing the neural networks to learn more complex representations.


However, late fusion has some limitations, such as the inability to capture the interactions between modalities at lower layers. To address this limitation, other approaches, such as cross-modal attention mechanisms and multimodal transformers, have been developed that allow for more direct interactions between modalities at lower layers.


Early Fusion

Early fusion is a common approach used in multimodal deep learning to combine information from multiple modalities. In early fusion, the input data from different modalities is concatenated into a single tensor and fed into a deep neural network for processing.


The early fusion approach is typically used in tasks such as image and video classification, where the input data consists of both visual and audio modalities. For example, in a video classification task, the input might be a sequence of frames and an audio clip, and the output might be a classification label for the video category.


In the early fusion approach, the data from each modality may be lightly preprocessed (for example, converting audio to a spectrogram or normalizing video frames), but the modalities are combined at or near the input level rather than after deep, modality-specific feature extraction. The concatenated input is then processed by a single shared network, such as a convolutional neural network (CNN) or a transformer, followed by a fully connected layer for classification.
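
Below is a minimal sketch of input-level fusion, assuming the audio and visual inputs have already been aligned and flattened into fixed-size vectors; all dimensions are illustrative.


import torch
import torch.nn as nn

class EarlyFusionClassifier(nn.Module):
    def __init__(self, audio_dim, visual_dim, hidden_dim, num_classes):
        super().__init__()
        # A single shared network processes the concatenated input
        self.net = nn.Sequential(
            nn.Linear(audio_dim + visual_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, audio_features, visual_features):
        # Combine the (already aligned) modalities at the input level
        fused_input = torch.cat([audio_features, visual_features], dim=1)
        return self.net(fused_input)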


Early fusion has several advantages over late fusion, which is another common approach for combining multiple modalities. In early fusion, the input from different modalities is combined before processing, which allows for more direct interactions between modalities at lower layers. This can lead to better performance, especially when the modalities are highly correlated.


However, early fusion can also have some limitations, such as the potential for a larger input size and slower training due to the increased complexity of the input data. Additionally, early fusion may not be suitable for all types of multimodal data, particularly when the modalities have different dimensions or scales.


Overall, the choice between early fusion and late fusion depends on the specific task and the characteristics of the input data. Both approaches have their own advantages and limitations, and selecting the appropriate approach is an important consideration in developing a successful multimodal deep learning model.


Cross-modal Retrieval

Cross-modal retrieval is a task in multimodal deep learning where the goal is to retrieve data from one modality based on a query from another modality. For example, given an image as a query, the task may involve retrieving text descriptions or audio descriptions that correspond to that image. Conversely, given a text query, the task may involve retrieving images that correspond to the query.


Cross-modal retrieval is typically approached using a joint embedding space, where the data from different modalities are mapped into a common feature space. This allows for easy comparison and retrieval of data from different modalities based on their similarity in the joint space.


The joint embedding space can be learned using a variety of multimodal deep learning techniques, including Siamese networks, cross-modal hashing, and cross-modal attention mechanisms. In Siamese networks, the same deep neural network is used to extract features from data in each modality, and the resulting feature vectors are compared in the joint space to determine similarity. In cross-modal hashing, binary codes are generated for data in each modality, and these codes are compared to retrieve similar data in the joint space. In cross-modal attention mechanisms, attention weights are learned to weight the importance of each modality during feature extraction, and the resulting feature vectors are compared in the joint space.
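
The retrieval step itself can be sketched as follows, assuming a query embedding and a set of candidate embeddings that already live in the same joint space (for example, produced by a Siamese network). The function name and shapes are illustrative.


import torch
import torch.nn.functional as F

def retrieve_top_k(query_embedding, candidate_embeddings, k=5):
    # query_embedding: (1, dim), candidate_embeddings: (num_candidates, dim),
    # both assumed to come from the same joint embedding space
    query = F.normalize(query_embedding, dim=1)
    candidates = F.normalize(candidate_embeddings, dim=1)

    # Cosine similarity between the query and every candidate
    similarities = (query @ candidates.t()).squeeze(0)

    # Scores and indices of the k most similar candidates
    top_scores, top_indices = similarities.topk(k)
    return top_indices, top_scores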


Cross-modal retrieval has numerous applications in areas such as image and text retrieval, recommender systems, and multimedia information retrieval. It enables more efficient and effective retrieval of information across modalities and can help users to better access and utilize diverse sources of information.


Multimodal Sequence-to-Sequence

Multimodal sequence-to-sequence (seq2seq) is a type of deep learning model that generates a sequence of outputs from a sequence of inputs spanning different modalities. The model can take inputs from multiple modalities, such as text, images, and audio, and produce an output sequence across those modalities.


The multimodal seq2seq model is typically based on an encoder-decoder architecture, where the inputs from each modality are first encoded into a set of latent representations or feature vectors. The encoder is typically a deep neural network that processes the input sequence in each modality and produces a fixed-size feature vector that captures the relevant information in the input. The encoder for each modality can be a separate network or a shared network that processes all modalities.


The encoded feature vectors from each modality are then combined into a joint representation using a fusion mechanism, such as concatenation or multiplication. The joint representation captures the relationships between the different modalities and is used as input to a decoder, which generates the output sequence for each modality.


The decoder is also typically a deep neural network that takes in the joint representation and generates the output sequence one element at a time. The output sequence can be generated in a single modality or across multiple modalities, depending on the task.
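
A minimal sketch of this encoder-fusion-decoder pattern (with teacher forcing) is shown below. The encoder modules, vocabulary size, and dimensions are illustrative assumptions, not a specific published model.


import torch
import torch.nn as nn

class MultimodalSeq2Seq(nn.Module):
    def __init__(self, text_encoder, audio_encoder, text_dim, audio_dim,
                 vocab_size, hidden_dim):
        super().__init__()
        # Modality-specific encoders are assumed to map their inputs to
        # fixed-size vectors of shape (batch, text_dim) and (batch, audio_dim)
        self.text_encoder = text_encoder
        self.audio_encoder = audio_encoder
        self.fusion = nn.Linear(text_dim + audio_dim, hidden_dim)
        self.embedding = nn.Embedding(vocab_size, hidden_dim)
        self.decoder = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.output_layer = nn.Linear(hidden_dim, vocab_size)

    def forward(self, text_input, audio_input, target_tokens):
        # Encode each modality and fuse the feature vectors by concatenation
        text_feat = self.text_encoder(text_input)
        audio_feat = self.audio_encoder(audio_input)
        joint = torch.tanh(self.fusion(torch.cat([text_feat, audio_feat], dim=1)))

        # Use the joint representation as the decoder's initial hidden state
        hidden = joint.unsqueeze(0)                    # (1, batch, hidden_dim)
        decoder_input = self.embedding(target_tokens)  # teacher forcing
        outputs, _ = self.decoder(decoder_input, hidden)
        return self.output_layer(outputs)              # (batch, seq_len, vocab_size)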


Multimodal seq2seq models have been used in a variety of applications, such as speech translation, image captioning, and video captioning. They have been shown to be effective in capturing the complex relationships between multiple modalities and generating accurate and informative output sequences. However, they can also be computationally intensive and require large amounts of training data to perform well.


Multimodal Generative Models

Multimodal generative models are deep learning models that can generate outputs across multiple modalities, such as images, text, and audio. These models are capable of learning complex relationships between modalities and generating diverse and high-quality outputs.



Lee, M., Zhu, Y., Srinivasan, K., Shah, P., Savarese, S., Fei-Fei, L., Garg, A., & Bohg, J. (2019). Making Sense of Vision and Touch: Self-Supervised Learning of Multimodal Representations for Contact-Rich Tasks. Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 8943–8950. https://doi.org/10.1109/ICRA.2019.8793485

One type of multimodal generative model is the multimodal variational autoencoder (MVAE), which is an extension of the traditional variational autoencoder (VAE) that can handle multiple modalities. In MVAE, the encoder network maps the inputs from different modalities into a joint latent space, and the decoder network generates output samples from this space. The MVAE model can learn to generate diverse and realistic outputs across multiple modalities, even for inputs that have never been seen before.
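
To make the idea concrete, here is a heavily simplified sketch of a multimodal VAE with a shared latent space. The concatenation-based joint encoder is an assumption made for brevity; published MVAEs often combine modality-specific posteriors with a product of experts instead, and all dimensions here are illustrative.


import torch
import torch.nn as nn

class MultimodalVAE(nn.Module):
    def __init__(self, x1_dim, x2_dim, hidden_dim, latent_dim):
        super().__init__()
        # Joint encoder over the concatenated modalities (a simplification;
        # product-of-experts posteriors are common in the MVAE literature)
        self.encoder = nn.Sequential(nn.Linear(x1_dim + x2_dim, hidden_dim), nn.ReLU())
        self.to_mu = nn.Linear(hidden_dim, latent_dim)
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)
        # One decoder per modality, both reading from the shared latent space
        self.decoder_x1 = nn.Sequential(nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
                                        nn.Linear(hidden_dim, x1_dim))
        self.decoder_x2 = nn.Sequential(nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
                                        nn.Linear(hidden_dim, x2_dim))

    def forward(self, x1, x2):
        h = self.encoder(torch.cat([x1, x2], dim=1))
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: sample from the joint latent space
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.decoder_x1(z), self.decoder_x2(z), mu, logvar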


Another type of multimodal generative model is the generative adversarial network (GAN), which can also generate diverse and high-quality outputs across multiple modalities. In GAN, two deep neural networks, a generator and a discriminator, are trained simultaneously. The generator network generates samples from a random noise vector, while the discriminator network learns to distinguish between real and generated samples. The two networks are trained together in a minimax game, where the generator aims to produce realistic samples that can fool the discriminator, and the discriminator aims to correctly distinguish between real and generated samples. The resulting generator can produce high-quality samples across multiple modalities.
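
As a rough sketch, a single GAN training step can look like the following; the generator, discriminator, optimizers, and latent dimension are assumed to be defined by the caller, and the discriminator is assumed to output one logit per sample.


import torch
import torch.nn as nn

def gan_training_step(generator, discriminator, real_samples,
                      g_optimizer, d_optimizer, latent_dim=100):
    # One step of the minimax game described above; the discriminator is
    # assumed to return one logit per sample, shaped (batch_size, 1)
    criterion = nn.BCEWithLogitsLoss()
    batch_size = real_samples.size(0)
    device = real_samples.device
    real_labels = torch.ones(batch_size, 1, device=device)
    fake_labels = torch.zeros(batch_size, 1, device=device)
    noise = torch.randn(batch_size, latent_dim, device=device)

    # Discriminator update: distinguish real samples from generated ones
    d_optimizer.zero_grad()
    fake_samples = generator(noise).detach()
    d_loss = (criterion(discriminator(real_samples), real_labels)
              + criterion(discriminator(fake_samples), fake_labels))
    d_loss.backward()
    d_optimizer.step()

    # Generator update: try to make generated samples look real
    g_optimizer.zero_grad()
    g_loss = criterion(discriminator(generator(noise)), real_labels)
    g_loss.backward()
    g_optimizer.step()

    return d_loss.item(), g_loss.item()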


Conclusion

Multimodal deep learning is all about understanding and analyzing data from different sources. It's like putting together puzzle pieces from text, images, audio, and more to get a complete picture. There are different ways to approach multimodal deep learning, like merging the data early or late in the process, retrieving information across different modes, creating sequences that involve multiple modes, or even generating new multimodal content. Each approach has its own strengths and weaknesses, and they can be used in various situations. Multimodal deep learning is incredibly versatile, with applications in speech recognition, natural language processing, image and video recognition, robotics, and more. It's all about unlocking the power of combining different types of data to make sense of the world around us.


References

  • Pandeya, Y. R., Lee, J. Deep learning-based late fusion of multimodal information for emotion classification of music video. Multimed Tools Appl 80, 2887–2905 (2021). https://doi.org/10.1007/s11042-020-08836-3

  • Baltrusaitis T, Ahuja C, Morency LP. Multimodal Machine Learning: A Survey and Taxonomy. IEEE Trans Pattern Anal Mach Intell. 2019 Feb;41(2):423-443. doi: 10.1109/TPAMI.2018.2798607. Epub 2018 Jan 25. PMID: 29994351.

  • Dai, W., Dai, C., Qu, S., Li, J., & Das, S. (2016). Very Deep Convolutional Neural Networks for Raw Waveforms. ArXiv. /abs/1610.00087

  • Akkus, C., Chu, L., Djakovic, V., Koch, P., Loss, G., Marquardt, C., Moldovan, M., Sauter, N., Schneider, M., Schulte, R., Urbanczyk, K., Goschenhofer, J., Heumann, C., Hvingelby, R., Schalk, D., & Aßenmacher, M. (2023). Multimodal Deep Learning. ArXiv. /abs/2301.04856

  • Lee, M., Zhu, Y., Srinivasan, K., Shah, P., Savarese, S., Fei-Fei, L., Garg, A., & Bohg, J. (2019). Making Sense of Vision and Touch: Self-Supervised Learning of Multimodal Representations for Contact-Rich Tasks. Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 8943–8950. https://doi.org/10.1109/ICRA.2019.8793485

  • Li, J., Selvaraju, R. R., Gotmare, A. D., Joty, S., Xiong, C., & Hoi, S. (2021). Align before Fuse: Vision and Language Representation Learning with Momentum Distillation. ArXiv. /abs/2107.07651

  • Akbari, H., Yuan, L., Qian, R., Chuang, W., Chang, S., Cui, Y., & Gong, B. (2021). VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text. ArXiv. /abs/2104.11178


