<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[Fabián Souto]]></title><description><![CDATA[Thoughts, stories and ideas.]]></description><link>https://api.blog.fabiansouto.com/</link><image><url>https://api.blog.fabiansouto.com/favicon.png</url><title>Fabián Souto</title><link>https://api.blog.fabiansouto.com/</link></image><generator>Ghost 3.13</generator><lastBuildDate>Thu, 12 Feb 2026 21:55:44 GMT</lastBuildDate><atom:link href="https://api.blog.fabiansouto.com/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[MMBT: Supervised Multimodal Bitransformers for Classifying Images and Text]]></title><description><![CDATA[How to use a BERT-like model with a convolutional network as an image encoder to perform a classification task using images, text and self-attention over both modalities at the same time.]]></description><link>https://api.blog.fabiansouto.com/supervised-multimodal-bitransformers-for-classifying-images-and-text/</link><guid isPermaLink="false">5ea21525ace7b60001a7e6eb</guid><category><![CDATA[deep learning]]></category><category><![CDATA[AI]]></category><category><![CDATA[transformers]]></category><category><![CDATA[cnn]]></category><category><![CDATA[resnet]]></category><category><![CDATA[BERT]]></category><dc:creator><![CDATA[Fabián Souto Herrera]]></dc:creator><pubDate>Fri, 24 Apr 2020 01:15:11 GMT</pubDate><media:content url="https://api.blog.fabiansouto.com/content/images/2020/04/karl-lee-rP-9lYm2CVE-unsplash.jpg" medium="image"/><content:encoded><![CDATA[<img src="https://api.blog.fabiansouto.com/content/images/2020/04/karl-lee-rP-9lYm2CVE-unsplash.jpg" alt="MMBT: Supervised Multimodal Bitransformers for Classifying Images and Text"><p>Keeping your knowledge up-to-date is very difficult in novel areas 
like deep learning. I know, I know... It's an amazing world, but a very big and fast-moving one too. Today I want to give you some insight into an advanced topic full of great novel ideas that you can use in your work or research.</p><p>We are going to talk about a (relatively) new <em>deep learning architecture</em>, <em>neural network</em> or simply <em>model</em> that performs classification over inputs of a different nature (multimodal), specifically images and text, in one shot, i.e. a single pass through the network.</p><p>If you are not confident working with <em>convolutional neural networks</em> and attention models like the <em>transformer</em>, you'll probably feel a little overwhelmed, but don't be scared! We all started like that. The first time I read the "<a href="https://arxiv.org/abs/1706.03762">Attention is all you need</a>" paper (Ashish Vaswani et al.) I didn't understand anything. I had to read it at least three times to realize that I wouldn't understand a (for me) completely new area with a single paper; I had to keep reading posts like this one, reading other papers and writing some code!</p><p>So don't be afraid; you only need to keep reading, stay motivated and make things happen! At the very least, this post will give you some <em>keywords</em> that you can use to search for related content and read more about the topic. One of the most important parts of learning is knowing what you don't know, so I'm going to give you some <em>keywords</em> and <em>key phrases</em> to look up in Google and dig deeper!</p><h1 id="motivation">Motivation</h1><p><a href="https://arxiv.org/abs/1810.04805">BERT</a> has stolen the spotlight in the NLP landscape. Its amazing results (and those of related models) are moving the entire field to use, or at least try, the <em>transformer</em> architecture. But the modern digital world is increasingly multimodal: textual information is often accompanied by other modalities like images or videos. 
Using all of this information can be very useful to increase a model's performance. That was the objective of Douwe Kiela et al. at <a href="https://ai.facebook.com/">Facebook AI Research</a>. They developed a <em>state-of-the-art</em> multimodal bitransformer (MMBT) model to classify images and text.</p><h1 id="multimodal-bitransformer-in-simple-terms">Multimodal Bitransformer in simple terms</h1><figure class="kg-card kg-image-card kg-width-wide"><img src="https://api.blog.fabiansouto.com/content/images/2020/04/Screenshot-from-2020-04-23-20-08-30.png" class="kg-image" alt="MMBT: Supervised Multimodal Bitransformers for Classifying Images and Text"></figure><p>If you have worked with BERT you know that the inputs are the tokens of the texts, right? So how can we add the images? A naive but highly competitive approach is to simply extract the image features with a CNN like ResNet, extract the text-only features with a transformer like BERT, concatenate them and forward them through a simple MLP (or a bigger model) to get the final classification logits.</p><p>The authors argue that the bitransformer's ability to employ self-attention gives the model the possibility to look at the text and the image at the same time, using attention over both modalities.</p><p>So, in simple terms, we are going to take the image, extract its features with a CNN and use those features as inputs (like token embeddings!) for the bitransformer. That's what you can see in the image above: we give the model the sentence and the image as input so it can use attention over both modalities at the same time.</p><p>For me this was one of those moments when you think "how did I not have this idea before?!". It is very, very simple and at the same time very powerful.</p><p>That's all if you only wanted to know how this works. Now you can use this idea in your next project that involves images and text. 
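</p><p>To make the idea concrete, here is a minimal PyTorch sketch of how the multimodal input sequence can be built. This is not the authors' implementation; the tensor names and sequence lengths are illustrative, and the embeddings are random placeholders:</p><pre><code class="language-python">import torch
import torch.nn as nn

hidden_size = 768      # BERT base hidden size
num_sections = 25      # image "tokens" from a 5x5 grid
num_text_tokens = 16   # illustrative text length

# Placeholder inputs: image section embeddings already projected to the
# transformer's hidden size, and text token embeddings from BERT's lookup.
image_embeddings = torch.randn(1, num_sections, hidden_size)
text_embeddings = torch.randn(1, num_text_tokens, hidden_size)

# Segment ids distinguish the two modalities (0 = image, 1 = text).
segment_embedding = nn.Embedding(2, hidden_size)
segments = torch.cat([
    torch.zeros(1, num_sections, dtype=torch.long),
    torch.ones(1, num_text_tokens, dtype=torch.long),
], dim=1)

# A single multimodal sequence for the transformer, where self-attention
# can look at both modalities at the same time.
sequence = torch.cat([image_embeddings, text_embeddings], dim=1)
sequence = sequence + segment_embedding(segments)
# sequence has shape (1, 41, 768)</code></pre><p>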
If you want more details, keep reading.</p><h1 id="advanced-concepts">Advanced concepts</h1><p>We know that the transformer model must be pre-trained in a self-supervised fashion, and the image encoder (the CNN) must be pre-trained too. It's practically a standard that (unless you are Google, Facebook or Nvidia, with hundreds or thousands of GPUs) you don't even think about training your own BERT model; you just apply transfer learning and fine-tuning to use it in your task. And obviously, that is what the authors did.</p><h2 id="image-encoder">Image encoder</h2><figure class="kg-card kg-image-card"><img src="https://api.blog.fabiansouto.com/content/images/2020/04/image-4.png" class="kg-image" alt="MMBT: Supervised Multimodal Bitransformers for Classifying Images and Text"></figure><p>The authors used a ResNet-152 pre-trained on <a href="http://www.image-net.org/">ImageNet</a>. The network has a stride of 32 and generates a feature map with 2048 channels. This means that if you use a simple image of 320x320 pixels, the output of the network will be a 10x10 feature map with 2048 channels. In PyTorch notation you will have a tensor with shape (2048, 10, 10). This is the same for the larger ResNet models; the exceptions are ResNet-18 and ResNet-34, which only generate 512 channels.</p><p>Over the feature map we can apply adaptive average pooling to transform it and get HxW = K sections. In my implementation I used a final grid of (5, 5), which leads to 25 feature vectors, so we can think of our "image sentence" as being composed of 25 "image section embeddings", but you can use whatever final size fits your needs.</p><p>To put it more simply, our image encoder takes the grid size that you want, applies the convolutions and average pooling over the image and generates a feature vector for each section. 
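</p><p>This pooling and projection step can be sketched in a few lines of PyTorch. This is only an illustration, not the authors' code: instead of running the actual network, I start from a random tensor with the shape that ResNet-152 would produce for the 320x320 example above:</p><pre><code class="language-python">import torch
import torch.nn as nn

# A 2048-channel feature map, with the shape ResNet-152 produces for a
# 320x320 image (stride 32 gives a 10x10 spatial grid). Random placeholder.
feature_map = torch.randn(1, 2048, 10, 10)

# Adaptive average pooling to a fixed (5, 5) grid, i.e. 25 sections.
pooled = nn.AdaptiveAvgPool2d((5, 5))(feature_map)       # (1, 2048, 5, 5)

# Flatten the grid into 25 "image section embeddings" of length 2048.
sections = pooled.flatten(start_dim=2).transpose(1, 2)   # (1, 25, 2048)

# Linear projection to the transformer's hidden size (768 for BERT base).
image_embeddings = nn.Linear(2048, 768)(sections)        # (1, 25, 768)</code></pre><p>Changing the pooled grid size simply trades off the length of the "image sentence" against the spatial detail of each section.</p><p>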
So if we use a grid of (5, 5) it will generate 25 embeddings, one for each section in the image.</p><p>Finally, we use a simple linear layer to adjust the size of the image embeddings to fit the bitransformer's hidden size. In a standard BERT and ResNet configuration this means transforming our length-2048 vectors into length-768 vectors with the linear layer, because the BERT model has a hidden size of 768.</p><h2 id="multimodal-transformer">Multimodal transformer</h2><p>One of the components involved in BERT's training is the <em>segment embeddings</em>, which are used to differentiate between the first sentence and the second sentence. A clever idea the authors had was to use these same segment embeddings to differentiate between the two modalities instead: text and image. This can be extended to any number of modalities; the segment embeddings can differentiate between the different types of input.</p><p>Finally, the classification logits are computed with the same logic as in any BERT instance: a fully connected layer receives as input the output vector of the [CLS] token (which is always the first token when you are using BERT) and transforms it to the desired number of classes. Use a Softmax with a cross-entropy loss for single-label outputs, or a Sigmoid with a binary cross-entropy loss for multi-label outputs. I usually use the <a href="https://arxiv.org/abs/1708.02002">Focal Loss</a> in my projects, which gives me better results (we can talk about it in another post).</p><h1 id="results">Results</h1><p>The authors tested the network on three different datasets: MM-IMDB, FOOD101 and V-SNLI. 
The baselines they used were:</p><ul><li>A simple bag of words (BOW) using the 300-dimensional word embeddings obtained with GloVe.</li><li>A text-only model (BERT).</li><li>An image-only model (ResNet-152).</li><li>A concatenation of the feature vectors from BOW and ResNet.</li><li>And the concatenation of the BERT feature vectors with the ResNet feature vectors.</li></ul><p>The results can be observed in the following tables (extracted from the paper):</p><figure class="kg-card kg-image-card"><img src="https://api.blog.fabiansouto.com/content/images/2020/04/Screenshot-from-2020-04-23-21-00-09.png" class="kg-image" alt="MMBT: Supervised Multimodal Bitransformers for Classifying Images and Text"></figure><h1 id="conclusion">Conclusion</h1><p>We talked about a novel model that uses self-attention over inputs from different modalities (images and text) to perform classification. The clever idea of using a single transformer's attention modules to combine the embeddings of the different modalities leads to a simple but powerful model that obtains better results than looking only at the image or only at the text.</p><p>I've been using this model in production at my work and it performs pretty well! Indeed, better than my text-only BERT classifier and my image-only ResNet classifier.</p><p>You can find more information in the <a href="https://arxiv.org/abs/1909.02950">paper</a> and its <a href="https://github.com/facebookresearch/mmbt">GitHub</a> repository.</p><p>Have a nice day and keep reading, learning and happy coding!</p>]]></content:encoded></item><item><title><![CDATA[How I created this app]]></title><description><![CDATA[Which technologies are involved here? Do you want to know how to create your own site like this one? 
Let's check how I created this server-side rendered page.]]></description><link>https://api.blog.fabiansouto.com/how-i-created-this-app/</link><guid isPermaLink="false">5e9cd684a59fbb0001324dc2</guid><category><![CDATA[web]]></category><category><![CDATA[vue]]></category><category><![CDATA[ghost]]></category><category><![CDATA[nuxt]]></category><dc:creator><![CDATA[Fabián Souto Herrera]]></dc:creator><pubDate>Sun, 19 Apr 2020 22:59:55 GMT</pubDate><media:content url="https://api.blog.fabiansouto.com/content/images/2020/04/patrick-fore-0gkw_9fy0eQ-unsplash.jpg" medium="image"/><content:encoded><![CDATA[<img src="https://api.blog.fabiansouto.com/content/images/2020/04/patrick-fore-0gkw_9fy0eQ-unsplash.jpg" alt="How I created this app"><p>Creating an entire blog-style web app can be kind of difficult and very time-consuming. So we have to decide what kind of technology we want to use and at what level of abstraction. Maybe a simple WordPress site, or a Ghost one with a template from their store, would be sufficient. But that is not the objective of this page. I mean, obviously I want to reuse code and not reinvent the wheel, but I also want to demonstrate how to create a <em>cool</em> web app with modern technology.</p><p>I thought about creating my own backend using the <a href="https://feathersjs.com/">Feathers framework</a> (have you ever tried it? Give it a chance!); I've been working with it for some years and I love it. Its rich ecosystem and real-time API are at another level. Or maybe a simple <a href="https://expressjs.com/">Express</a> app (Feathers uses Express under the hood) with <a href="https://mongoosejs.com/">Mongoose</a>. A classic choice, right?</p><p>But I wanted to focus more on the front end and not have to think about security issues, modeling the database or writing a powerful content editor that only I am going to see and use. 
That's why I decided to host my own <a href="https://ghost.org/">Ghost</a> instance and create a custom front end app for it using <a href="https://vuejs.org/">Vue</a> with <a href="https://nuxtjs.org/">Nuxt</a>. Let's start by checking the technology stack.</p><h1 id="technology-stack">Technology stack</h1><ul><li>A <strong>Ghost</strong> app for the backend.</li><li>A <strong>MySQL</strong> database to store the content of the posts.</li><li><strong>Docker</strong> and <strong>Docker swarm</strong> to operate the Ghost app with its database.</li><li><strong>Nginx</strong> to work with my SSL certificates and secure the data.</li><li><strong>Digital Ocean</strong> to have cloud infrastructure as a service.</li><li><strong>Vue</strong> + <strong>Nuxt + Vuetify</strong> for the front end app.</li><li><strong>Netlify</strong> for front end devops.</li></ul><h1 id="backend">Backend</h1><figure class="kg-card kg-image-card"><img src="https://api.blog.fabiansouto.com/content/images/2020/04/image.png" class="kg-image" alt="How I created this app"></figure><p>The backend was really easy: I had it running in a couple of minutes thanks to the ready-to-use Docker images. This is my <strong>docker-compose.yml</strong> file:</p><pre><code class="language-docker">version: "3.3"
services:
  ghost:
    image: "ghost:3"
    deploy:
      restart_policy:
        condition: on-failure
    environment:
      admin__url: &lt;my-api-url&gt;
      database__client: mysql
      database__connection__host: db
      database__connection__user: &lt;my-db-user&gt;
      database__connection__password: &lt;my-db-password&gt;
      database__connection__database: ghost
    ports:
      - 2368:2368
    volumes:
      - content:/var/lib/ghost/content
  db:
    image: mysql:5.7
    deploy:
      restart_policy:
        condition: on-failure
    environment:
      MYSQL_ROOT_PASSWORD: &lt;my-db-password&gt;
    volumes:
      - database:/var/lib/mysql
volumes:
  content:
  database:</code></pre><p>And with a simple <strong>docker stack deploy blog -c docker-compose.yml </strong>you can have your own Ghost instance running.</p><p>To get the SSL certificates working I used <a href="https://certbot.eff.org/">Certbot</a> and followed its simple instructions to make it work with an Nginx instance running on the host machine (not inside the containers).</p><h1 id="front-end">Front end</h1><p>I love writing front end code; it's like a hobby. And that is thanks to <a href="https://vuejs.org/">Vue</a>. What an amazing progressive framework. I've been using it for a while: I started back in 2018 making some simple widgets and web apps, and I fell in love with it.</p><figure class="kg-card kg-image-card"><img src="https://api.blog.fabiansouto.com/content/images/2020/04/image-1.png" class="kg-image" alt="How I created this app"></figure><h2 id="nuxt">Nuxt</h2><p>Now, for this app, I wanted to try new things, and one of the first to come to my mind was <a href="https://nuxtjs.org/">Nuxt</a>, the <em>progressive Vue framework</em>. What? A framework for a framework? That escalated quickly. </p><figure class="kg-card kg-image-card"><img src="https://api.blog.fabiansouto.com/content/images/2020/04/image-2.png" class="kg-image" alt="How I created this app"></figure><p>Yeah, it's weird to have a <em>framework for a framework</em> (welcome to JavaScript?), but in this case it can be very useful. Apart from giving your code more structure, it provides <em>server-side rendering</em>, which can reduce the time it takes for the user to see something on the screen.</p><p>Without going deeper into server-side rendering, you can think of it, in simple terms, like this: when the user requests a page of your app, the server renders the components of the view on its side and sends an HTML document with more than the simple <strong>div</strong> where Vue has to load your app. 
It's faster than simply sending your JS files because the user immediately sees content; they do not have to wait for the first render after all the components and file chunks are fetched from the backend.</p><h2 id="composition-api">Composition API</h2><p>If you already know Vue you probably know that version 3 is coming! Indeed, the <a href="https://github.com/vuejs/vue-next/releases/tag/v3.0.0-beta.1">beta is already out</a>. One of the great things that will come with the third major version of the framework is the <a href="https://composition-api.vuejs.org/">composition API</a>.</p><p>If you have worked with React you have probably heard about the new <a href="https://reactjs.org/docs/hooks-intro.html">hooks</a>. Well, Vue is making its own effort to provide the <em>composition API</em>. You can read more in the <a href="https://composition-api.vuejs.org">RFC</a>, but in simple terms you can see the power of the new API in a single image:</p><figure class="kg-card kg-image-card"><img src="https://api.blog.fabiansouto.com/content/images/2020/04/image-3.png" class="kg-image" alt="How I created this app"></figure><p>You have probably been in this situation:</p><blockquote>You started with a very simple component that performs a single action like <em>search</em>. You have your input and a simple search button.</blockquote><blockquote>Then you want to add filters to your search component, so the user can search with some tags or hashtags for example, and you have to add new methods, maybe new computed properties, and the new data for this functionality.</blockquote><p>This is a typical case where the <em>options API</em> has a problem. The component is doing two things that are related (so they are in the same component), but their logic is scattered across the options of the component. And what if you now want to split the component because it is getting bigger and bigger, and you want to reuse your code and logic in different components? 
You would probably have to extract some data properties, then some methods, and start creating the new components.</p><p>But that's the problem: your logic is distributed all along the big component. That is what the image explains. The color indicates related logic/code; on the left (with the options API) you can see how it is scattered, and on the right you have all your logic in a single place.</p><p>This is because with the composition API you can have simple <em>composition functions</em> that contain all the logic related to a single concern, without having to spread it across the component's options like data and methods.</p><p>A very simple example is this composition function that I used here, in this exact app:</p><pre><code class="language-javascript">import { ref } from '@vue/composition-api'

/**
 * A composition function to change the color of the component
 * when the user scrolls down the app.
 *
 * It must be used with the `v-scroll` directive of vuetify.
 * See: https://vuetifyjs.com/en/directives/scrolling/
 *
 * Example:
 *
 * ```javascript
 * &lt;template&gt;
 *   &lt;div v-scroll="setColorOnScroll"&gt;{{ color }}&lt;/div&gt;
 * &lt;/template&gt;
 *
 * &lt;script&gt;
 * export default {
 *   setup() {
 *     return {
 *       ...useSetColorOnScroll({ start: 'transparent', end: 'blue' })
 *     }
 *   }
 * }
 * &lt;/script&gt;
 * ```
 *
 * @param {Object} [options]
 * @param {String} [options.start] color, when the app is at the top.
 * @param {String} [options.end] color to set when the user scrolls down.
 *
 * @returns {Object} with the `color` to use in the app and a callback to
 *  change the color called `setColor` that receives the event generated
 *  by the `v-scroll` directive.
 */
export default function useSetColorOnScroll ({ start = 'transparent', end = 'blue' } = {}) {
  const color = ref(start)

  /**
   * Set the color according to the `scrollTop` property of the target
   * present in the scroll event.
   *
   * @param {Event} event triggered by the scroll action.
   * See: https://vuetifyjs.com/en/directives/scrolling/
   */
  function setColorOnScroll (event) {
    color.value = !event.target.documentElement.scrollTop ? start : end
  }

  return { color, setColorOnScroll }
}
</code></pre><p>And now every component that wants to change its color when the user scrolls down can use it! As simple as that. Indeed, I used it to set the color of the app bar: when you enter this site on the home page the app bar doesn't have a color, it's transparent. Cool, right?</p><p>Here is another example, to easily get the context of a module in the store:</p><pre><code class="language-javascript">/**
 * Get the `state` object and the `commit`, `dispatch` and `getters` functions for the
 * module with the given namespace from the store.
 *
 * @param {String} namespace of the module of the store.
 * @param {Object} context of the setup method of the component.
 * See: https://composition-api.vuejs.org/api.html#setup
 */
export default (namespace, context) =&gt; {
  const store = context.root.$store
  const state = store.state[namespace]
  const commit = (mutation, payload) =&gt; store.commit(`${namespace}/${mutation}`, payload)
  const dispatch = (action, payload) =&gt; store.dispatch(`${namespace}/${action}`, payload)
  const getters = getter =&gt; store.getters[`${namespace}/${getter}`]

  return { commit, dispatch, getters, state }
}
</code></pre><p>You can even use composition functions inside other composition functions:</p><pre><code class="language-javascript">import { computed } from '@vue/composition-api'
import useStore from '~/compositions/useStore'

/**
 * Get the post to use from the store.
 *
 * @param {Object} context provided in the setup method
 * to get the store.
 * @returns {Object} with the `post` instance and the `id`
 * of the post.
 */
function usePost (context) {
  const id = context.root.$route.params.id
  const { state, dispatch } = useStore('posts', context)
  const post = computed(() =&gt; state.keyedById[id])

  // Dispatch the get action. The action is smart and will not trigger
  // the api call if the item is already in the store.
  dispatch('get', { id, include: 'tags' })

  return { id, post }
}</code></pre><h1 id="vuetify">Vuetify</h1><p>To avoid reinventing the wheel I started with the awesome <a href="https://vuetifyjs.com/en/">Vuetify</a> component framework for Vue. It has really cool ready-to-use components and some very useful style standards for working with themes and difficult responsive layouts.</p><h1 id="source-code">Source code</h1><p>I did not describe <em>all</em> the decisions I made, how I wrote the components, how I manage my page transitions or all the future plans and ideas that I want to implement on this site. But if you are curious and want to inspect the front end code, well, you can! I published it on <a href="https://github.com/SetaSouto/ghost-front">GitHub</a> so anyone can use the ideas or code that I've been using here. Give it a look!</p><p>While I develop the components to show my contact information you can contact me at my email <a href="mailto:fab.souto@gmail.com">fab.souto@gmail.com</a>. Don't hesitate to ask me anything; I can't promise to respond quickly, but I will answer you!</p>]]></content:encoded></item></channel></rss>