<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[Fabián Souto]]></title><description><![CDATA[Thoughts, stories and ideas.]]></description><link>https://api.blog.fabiansouto.com/</link><image><url>https://api.blog.fabiansouto.com/favicon.png</url><title>Fabián Souto</title><link>https://api.blog.fabiansouto.com/</link></image><generator>Ghost 3.13</generator><lastBuildDate>Thu, 12 Feb 2026 21:55:44 GMT</lastBuildDate><atom:link href="https://api.blog.fabiansouto.com/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[MMBT: Supervised Multimodal Bitransformers for Classifying Images and Text]]></title><description><![CDATA[How to use a BERT-like model with a convolutional network as an image encoder to perform a classification task using images, text and self-attention over both modalities at the same time.]]></description><link>https://api.blog.fabiansouto.com/supervised-multimodal-bitransformers-for-classifying-images-and-text/</link><guid isPermaLink="false">5ea21525ace7b60001a7e6eb</guid><category><![CDATA[deep learning]]></category><category><![CDATA[AI]]></category><category><![CDATA[transformers]]></category><category><![CDATA[cnn]]></category><category><![CDATA[resnet]]></category><category><![CDATA[BERT]]></category><dc:creator><![CDATA[Fabián Souto Herrera]]></dc:creator><pubDate>Fri, 24 Apr 2020 01:15:11 GMT</pubDate><media:content url="https://api.blog.fabiansouto.com/content/images/2020/04/karl-lee-rP-9lYm2CVE-unsplash.jpg" medium="image"/><content:encoded><![CDATA[<img src="https://api.blog.fabiansouto.com/content/images/2020/04/karl-lee-rP-9lYm2CVE-unsplash.jpg" alt="MMBT: Supervised Multimodal Bitransformers for Classifying Images and Text"><p>Keeping your knowledge up-to-date is very difficult in novel areas 
like deep learning. I know, I know... It's an amazing world, but a very big and fast-moving one too. Today I want to give you some insight into an advanced topic full of great novel ideas that you can use in your work or research.</p><p>We are going to talk about a (relatively) new <em>deep learning architecture</em>, <em>neural network</em> or simply <em>model</em> that performs classification over inputs of a different nature (multimodal), specifically images and text, in one shot, i.e. a single pass through the network.</p><p>If you are not confident working with <em>convolutional neural networks</em> and attention models like the <em>transformer</em>, you'll probably feel a little overwhelmed, but don't be scared! We all started like that. The first time I read the "<a href="https://arxiv.org/abs/1706.03762">Attention is all you need</a>" paper (Ashish Vaswani et al.) I didn't understand anything. I had to read it at least three times to realize that I wouldn't understand a (for me) completely new area with a single paper; I had to keep reading posts like this one, reading other papers and writing some code!</p><p>So don't be afraid; you only need to keep reading, stay motivated and make things happen! At the very least, this post will give you some <em>keywords</em> that you can use to search for related content and read more about the topic. One of the most important parts of learning is knowing what you don't know, so I'm going to give you some <em>keywords</em> and <em>key phrases</em> to look up in Google and dig deeper!</p><h1 id="motivation">Motivation</h1><p><a href="https://arxiv.org/abs/1810.04805">BERT</a> has stolen the spotlight in the NLP landscape. Its amazing results (and those of related models) are moving the entire field to use, or at least try, the <em>transformer</em> architecture. But the modern digital world is increasingly multimodal: textual information is often accompanied by other modalities like images or videos. 
Using all of this information can be very useful to increase a model's performance. That was the objective of Douwe Kiela et al. at <a href="https://ai.facebook.com/">Facebook AI Research</a>. They developed a <em>state-of-the-art</em> multimodal bitransformer (MMBT) model to classify images and text.</p><h1 id="multimodal-bitransformer-in-simple-terms">Multimodal Bitransformer in simple terms</h1><figure class="kg-card kg-image-card kg-width-wide"><img src="https://api.blog.fabiansouto.com/content/images/2020/04/Screenshot-from-2020-04-23-20-08-30.png" class="kg-image" alt="MMBT: Supervised Multimodal Bitransformers for Classifying Images and Text"></figure><p>If you have worked with BERT you know that the inputs are the tokens of the texts, right? So how can we add the images? A naive but highly competitive approach is to simply extract the image features with a CNN like ResNet, extract the text-only features with a transformer like BERT, concatenate them and forward them through a simple MLP (or a bigger model) to get the final classification logits.</p><p>The authors argue that the bitransformer's ability to employ self-attention gives the model the possibility to look at the text and the image at the same time, using attention over both modalities.</p><p>So, in simple terms, we are going to take the image, extract its features with a CNN and use those features as inputs (like token embeddings!) for the bitransformer. That's what you can see in the image above: we give the model the sentence and the image as input so it can use attention over both modalities at the same time.</p><p>For me this was one of those moments when you think "how did I not have this idea before?!". It is very, very simple and at the same time very powerful.</p><p>That's all if you only wanted to know how this works. Now you can use this idea in your next project that involves images and text. 
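</p><p>To make the idea concrete, here is a minimal PyTorch sketch of how the multimodal input sequence can be built. This is not the authors' implementation; the tensor names and sequence lengths are illustrative, and the embeddings are random placeholders:</p><pre><code class="language-python">import torch
import torch.nn as nn

hidden_size = 768      # BERT base hidden size
num_sections = 25      # image "tokens" from a 5x5 grid
num_text_tokens = 16   # illustrative text length

# Placeholder inputs: image section embeddings already projected to the
# transformer's hidden size, and text token embeddings from BERT's lookup.
image_embeddings = torch.randn(1, num_sections, hidden_size)
text_embeddings = torch.randn(1, num_text_tokens, hidden_size)

# Segment ids distinguish the two modalities (0 = image, 1 = text).
segment_embedding = nn.Embedding(2, hidden_size)
segments = torch.cat([
    torch.zeros(1, num_sections, dtype=torch.long),
    torch.ones(1, num_text_tokens, dtype=torch.long),
], dim=1)

# A single multimodal sequence for the transformer, where self-attention
# can look at both modalities at the same time.
sequence = torch.cat([image_embeddings, text_embeddings], dim=1)
sequence = sequence + segment_embedding(segments)
# sequence has shape (1, 41, 768)</code></pre><p>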
If you want more details, keep reading.</p><h1 id="advanced-concepts">Advanced concepts</h1><p>We know that the transformer model must be pre-trained in a self-supervised fashion, and the image encoder (the CNN) must be pre-trained too. It's practically a standard that (unless you are Google, Facebook or Nvidia, with hundreds or thousands of GPUs) you don't even think about training your own BERT model; you just apply transfer learning and fine-tuning to use it in your task. And obviously, that is what the authors did.</p><h2 id="image-encoder">Image encoder</h2><figure class="kg-card kg-image-card"><img src="https://api.blog.fabiansouto.com/content/images/2020/04/image-4.png" class="kg-image" alt="MMBT: Supervised Multimodal Bitransformers for Classifying Images and Text"></figure><p>The authors used a ResNet-152 pre-trained on <a href="http://www.image-net.org/">ImageNet</a>. The network has a stride of 32 and generates a feature map with 2048 channels. This means that if you use a simple image of 320x320 pixels, the output of the network will be a 10x10 feature map with 2048 channels. In PyTorch notation you will have a tensor with shape (2048, 10, 10). This is the same for the larger ResNet models; the exceptions are ResNet-18 and ResNet-34, which only generate 512 channels.</p><p>Over the feature map we can apply adaptive average pooling to transform it and get HxW = K sections. In my implementation I used a final grid of (5, 5), which leads to 25 feature vectors, so we can think of our "image sentence" as being composed of 25 "image section embeddings", but you can use whatever final size fits your needs.</p><p>To put it more simply, our image encoder takes the grid size that you want, applies the convolutions and average pooling over the image and generates a feature vector for each section. 
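</p><p>This pooling and projection step can be sketched in a few lines of PyTorch. This is only an illustration, not the authors' code: instead of running the actual network, I start from a random tensor with the shape that ResNet-152 would produce for the 320x320 example above:</p><pre><code class="language-python">import torch
import torch.nn as nn

# A 2048-channel feature map, with the shape ResNet-152 produces for a
# 320x320 image (stride 32 gives a 10x10 spatial grid). Random placeholder.
feature_map = torch.randn(1, 2048, 10, 10)

# Adaptive average pooling to a fixed (5, 5) grid, i.e. 25 sections.
pooled = nn.AdaptiveAvgPool2d((5, 5))(feature_map)       # (1, 2048, 5, 5)

# Flatten the grid into 25 "image section embeddings" of length 2048.
sections = pooled.flatten(start_dim=2).transpose(1, 2)   # (1, 25, 2048)

# Linear projection to the transformer's hidden size (768 for BERT base).
image_embeddings = nn.Linear(2048, 768)(sections)        # (1, 25, 768)</code></pre><p>Changing the pooled grid size simply trades off the length of the "image sentence" against the spatial detail of each section.</p><p>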
So if we use a grid of (5, 5) it will generate 25 embeddings, one for each section in the image.</p><p>Finally, we use a simple linear layer to adjust the size of the image embeddings to fit the bitransformer's hidden size. In a standard BERT and ResNet configuration this means transforming our length-2048 vectors into length-768 vectors with the linear layer, because the BERT model has a hidden size of 768.</p><h2 id="multimodal-transformer">Multimodal transformer</h2><p>One of the components involved in BERT's training is the <em>segment embeddings</em>, which are used to differentiate between the first sentence and the second sentence. A clever idea the authors had was to use these same segment embeddings to differentiate between the two modalities instead: text and image. This can be extended to any number of modalities; the segment embeddings can differentiate between the different types of input.</p><p>Finally, the classification logits are computed with the same logic as in any BERT instance: a fully connected layer receives as input the output vector of the [CLS] token (which is always the first token when you are using BERT) and transforms it to the desired number of classes. Use a Softmax with a cross-entropy loss for single-label outputs, or a Sigmoid with a binary cross-entropy loss for multi-label outputs. I usually use the <a href="https://arxiv.org/abs/1708.02002">Focal Loss</a> in my projects, which gives me better results (we can talk about it in another post).</p><h1 id="results">Results</h1><p>The authors tested the network on three different datasets: MM-IMDB, FOOD101 and V-SNLI. 
The baselines they used were:</p><ul><li>A simple bag of words (BOW) using the 300-dimensional word embeddings obtained with GloVe.</li><li>A text-only model (BERT).</li><li>An image-only model (ResNet-152).</li><li>A concatenation of the feature vectors from BOW and ResNet.</li><li>And the concatenation of the BERT feature vectors with the ResNet feature vectors.</li></ul><p>The results can be observed in the following tables (extracted from the paper):</p><figure class="kg-card kg-image-card"><img src="https://api.blog.fabiansouto.com/content/images/2020/04/Screenshot-from-2020-04-23-21-00-09.png" class="kg-image" alt="MMBT: Supervised Multimodal Bitransformers for Classifying Images and Text"></figure><h1 id="conclusion">Conclusion</h1><p>We talked about a novel model that uses self-attention over inputs from different modalities (images and text) to perform classification. The clever idea of using a single transformer's attention modules to combine the embeddings of the different modalities leads to a simple but powerful model that obtains better results than looking only at the image or only at the text.</p><p>I've been using this model in production at my work and it performs pretty well! Indeed, better than my text-only BERT classifier and my image-only ResNet classifier.</p><p>You can find more information in the <a href="https://arxiv.org/abs/1909.02950">paper</a> and its <a href="https://github.com/facebookresearch/mmbt">GitHub</a> repository.</p><p>Have a nice day and keep reading, learning and happy coding!</p>]]></content:encoded></item><item><title><![CDATA[How I created this app]]></title><description><![CDATA[Which technologies are involved here? Do you want to know how to create your own site like this one? 
Let's check how I created this server-side rendered page.]]></description><link>https://api.blog.fabiansouto.com/how-i-created-this-app/</link><guid isPermaLink="false">5e9cd684a59fbb0001324dc2</guid><category><![CDATA[web]]></category><category><![CDATA[vue]]></category><category><![CDATA[ghost]]></category><category><![CDATA[nuxt]]></category><dc:creator><![CDATA[Fabián Souto Herrera]]></dc:creator><pubDate>Sun, 19 Apr 2020 22:59:55 GMT</pubDate><media:content url="https://api.blog.fabiansouto.com/content/images/2020/04/patrick-fore-0gkw_9fy0eQ-unsplash.jpg" medium="image"/><content:encoded><![CDATA[<img src="https://api.blog.fabiansouto.com/content/images/2020/04/patrick-fore-0gkw_9fy0eQ-unsplash.jpg" alt="How I created this app"><p>Creating an entire blog-style web app can be kind of difficult and very time-consuming. So we have to decide what kind of technology we want to use and at what level of abstraction. Maybe a simple WordPress site, or a Ghost one with a template from their store, would be sufficient. But that is not the objective of this page. I mean, obviously I want to reuse code and not reinvent the wheel, but I also want to demonstrate how to create a <em>cool</em> web app with modern technology.</p><p>I thought about creating my own backend using the <a href="https://feathersjs.com/">Feathers framework</a> (have you ever tried it? Give it a chance!); I've been working with it for some years and I love it. Its rich ecosystem and real-time API are at another level. Or maybe a simple <a href="https://expressjs.com/">Express</a> app (Feathers uses Express under the hood) with <a href="https://mongoosejs.com/">Mongoose</a>. A classic choice, right?</p><p>But I wanted to focus more on the front end and not have to think about security issues, modeling the database or writing a powerful content editor that only I am going to see and use. 
That's why I decided to host my own <a href="https://ghost.org/">Ghost</a> instance and create a custom front end app for it using <a href="https://vuejs.org/">Vue</a> with <a href="https://nuxtjs.org/">Nuxt</a>. Let's start by checking the technology stack.</p><h1 id="technology-stack">Technology stack</h1><ul><li>A <strong>Ghost</strong> app for the backend.</li><li>A <strong>MySQL</strong> database to store the content of the posts.</li><li><strong>Docker</strong> and <strong>Docker swarm</strong> to operate the Ghost app with its database.</li><li><strong>Nginx</strong> to work with my SSL certificates and secure the data.</li><li><strong>Digital Ocean</strong> to have cloud infrastructure as a service.</li><li><strong>Vue</strong> + <strong>Nuxt + Vuetify</strong> for the front end app.</li><li><strong>Netlify</strong> for front end devops.</li></ul><h1 id="backend">Backend</h1><figure class="kg-card kg-image-card"><img src="https://api.blog.fabiansouto.com/content/images/2020/04/image.png" class="kg-image" alt="How I created this app"></figure><p>The backend was really easy: I had it running in a couple of minutes thanks to the ready-to-use Docker images. This is my <strong>docker-compose.yml</strong> file:</p><pre><code class="language-docker">version: "3.3"
services:
  ghost:
    image: "ghost:3"
    deploy:
      restart_policy:
        condition: on-failure
    environment:
      admin__url: &lt;my-api-url&gt;
      database__client: mysql
      database__connection__host: db
      database__connection__user: &lt;my-db-user&gt;
      database__connection__password: &lt;my-db-password&gt;
      database__connection__database: ghost
    ports:
      - 2368:2368
    volumes:
      - content:/var/lib/ghost/content
  db:
    image: mysql:5.7
    deploy:
      restart_policy:
        condition: on-failure
    environment:
      MYSQL_ROOT_PASSWORD: &lt;my-db-password&gt;
    volumes:
      - database:/var/lib/mysql
volumes:
  content:
  database:</code></pre><p>And with a simple <strong>docker stack deploy blog -c docker-compose.yml </strong>you can have your own Ghost instance running.</p><p>To get the SSL certificates working I used <a href="https://certbot.eff.org/">Certbot</a> and followed its simple instructions to make it work with an Nginx instance running on the host machine (not inside the containers).</p><h1 id="front-end">Front end</h1><p>I love writing front end code; it's like a hobby. And that is thanks to <a href="https://vuejs.org/">Vue</a>. What an amazing progressive framework. I've been using it for a while: I started back in 2018 making some simple widgets and web apps, and I fell in love with it.</p><figure class="kg-card kg-image-card"><img src="https://api.blog.fabiansouto.com/content/images/2020/04/image-1.png" class="kg-image" alt="How I created this app"></figure><h2 id="nuxt">Nuxt</h2><p>Now, for this app, I wanted to try new things, and one of the first to come to my mind was <a href="https://nuxtjs.org/">Nuxt</a>, the <em>progressive Vue framework</em>. What? A framework for a framework? That escalated quickly. </p><figure class="kg-card kg-image-card"><img src="https://api.blog.fabiansouto.com/content/images/2020/04/image-2.png" class="kg-image" alt="How I created this app"></figure><p>Yeah, it's weird to have a <em>framework for a framework</em> (welcome to JavaScript?), but in this case it can be very useful. Apart from giving your code more structure, it provides <em>server-side rendering</em>, which can reduce the time it takes for the user to see something on the screen.</p><p>Without going deeper into server-side rendering, you can think of it, in simple terms, like this: when the user requests a page of your app, the server renders the components of the view on its side and sends an HTML document with more than the simple <strong>div</strong> where Vue has to load your app. 
It's faster than simply sending your JS files because the user immediately sees content; they do not have to wait for the first render after all the components and file chunks are fetched from the backend.</p><h2 id="composition-api">Composition API</h2><p>If you already know Vue you probably know that version 3 is coming! Indeed, the <a href="https://github.com/vuejs/vue-next/releases/tag/v3.0.0-beta.1">beta is already out</a>. One of the great things that will come with the third major version of the framework is the <a href="https://composition-api.vuejs.org/">composition API</a>.</p><p>If you have worked with React you have probably heard about the new <a href="https://reactjs.org/docs/hooks-intro.html">hooks</a>. Well, Vue is making its own effort to provide the <em>composition API</em>. You can read more in the <a href="https://composition-api.vuejs.org">RFC</a>, but in simple terms you can see the power of the new API in a single image:</p><figure class="kg-card kg-image-card"><img src="https://api.blog.fabiansouto.com/content/images/2020/04/image-3.png" class="kg-image" alt="How I created this app"></figure><p>You have probably been in this situation:</p><blockquote>You started with a very simple component that performs a single action like <em>search</em>. You have your input and a simple search button.</blockquote><blockquote>Then you want to add filters to your search component, so the user can search with some tags or hashtags for example, and you have to add new methods, maybe new computed properties, and the new data for this functionality.</blockquote><p>This is a typical case where the <em>options API</em> has a problem. The component is doing two things that are related (so they are in the same component), but their logic is scattered across the options of the component. And what if you now want to split the component because it is getting bigger and bigger, and you want to reuse your code and logic in different components? 
You would probably have to extract some data properties, then some methods, and start creating the new components.</p><p>But that's the problem: your logic is distributed all along the big component. That is what the image explains. The color indicates related logic/code; on the left (with the options API) you can see how it is scattered, and on the right you have all your logic in a single place.</p><p>This is because with the composition API you can have simple <em>composition functions</em> that contain all the logic related to a single concern, without having to spread it across the component's options like data and methods.</p><p>A very simple example is this composition function that I used here, in this exact app:</p><pre><code class="language-javascript">import { ref } from '@vue/composition-api'

/**
 * A composition function to change the color of the component
 * when the user scrolls down the app.
 *
 * It must be used with the `v-scroll` directive of vuetify.
 * See: https://vuetifyjs.com/en/directives/scrolling/
 *
 * Example:
 *
 * ```javascript
 * &lt;template&gt;
 *   &lt;div v-scroll="setColorOnScroll"&gt;{{ color }}&lt;/div&gt;
 * &lt;/template&gt;
 *
 * &lt;script&gt;
 * export default {
 *   setup() {
 *     return {
 *       ...useSetColorOnScroll({ start: 'transparent', end: 'blue' })
 *     }
 *   }
 * }
 * &lt;/script&gt;
 * ```
 *
 * @param {Object} [options]
 * @param {String} [options.start] color, when the app is at the top.
 * @param {String} [options.end] color to set when the user scrolls down.
 *
 * @returns {Object} with the `color` to use in the app and a callback to
 *  change the color called `setColor` that receives the event generated
 *  by the `v-scroll` directive.
 */
export default function useSetColorOnScroll ({ start = 'transparent', end = 'blue' } = {}) {
  const color = ref(start)

  /**
   * Set the color according to the `scrollTop` property of the target
   * present in the scroll event.
   *
   * @param {Event} event triggered by the scroll action.
   * See: https://vuetifyjs.com/en/directives/scrolling/
   */
  function setColorOnScroll (event) {
    color.value = !event.target.documentElement.scrollTop ? start : end
  }

  return { color, setColorOnScroll }
}
</code></pre><p>And now every component that wants to change its color when the user scrolls down can use it! As simple as that. Indeed, I used it to set the color of the app bar: when you enter this site on the home page the app bar doesn't have a color, it's transparent. Cool, right?</p><p>Here is another example, to easily get the context of a module in the store:</p><pre><code class="language-javascript">/**
 * Get the `state` object and the `commit`, `dispatch` and `getters` functions for the
 * module with the given namespace from the store.
 *
 * @param {String} namespace of the module of the store.
 * @param {Object} context of the setup method of the component.
 * See: https://composition-api.vuejs.org/api.html#setup
 */
export default (namespace, context) =&gt; {
  const store = context.root.$store
  const state = store.state[namespace]
  const commit = (mutation, payload) =&gt; store.commit(`${namespace}/${mutation}`, payload)
  const dispatch = (action, payload) =&gt; store.dispatch(`${namespace}/${action}`, payload)
  const getters = getter =&gt; store.getters[`${namespace}/${getter}`]

  return { commit, dispatch, getters, state }
}
</code></pre><p>You can even use composition functions inside other composition functions:</p><pre><code class="language-javascript">import { computed } from '@vue/composition-api'
import useStore from '~/compositions/useStore'

/**
 * Get the post to use from the store.
 *
 * @param {Object} context provided in the setup method
 * to get the store.
 * @returns {Object} with the `post` instance and the `id`
 * of the post.
 */
function usePost (context) {
  const id = context.root.$route.params.id
  const { state, dispatch } = useStore('posts', context)
  const post = computed(() =&gt; state.keyedById[id])

  // Dispatch the get action. The action is smart and will not trigger
  // the api call if the item is already in the store.
  dispatch('get', { id, include: 'tags' })

  return { id, post }
}</code></pre><h1 id="vuetify">Vuetify</h1><p>To avoid reinventing the wheel I started with the awesome <a href="https://vuetifyjs.com/en/">Vuetify</a> component framework for Vue. It has really cool ready-to-use components and some very useful style standards for working with themes and difficult responsive layouts.</p><h1 id="source-code">Source code</h1><p>I did not describe <em>all</em> the decisions I made, how I wrote the components, how I manage my page transitions or all the future plans and ideas that I want to implement on this site. But if you are curious and want to inspect the front end code, well, you can! I published it on <a href="https://github.com/SetaSouto/ghost-front">GitHub</a> so anyone can use the ideas or code that I've been using here. Give it a look!</p><p>While I develop the components to show my contact information you can contact me at my email <a href="mailto:fab.souto@gmail.com">fab.souto@gmail.com</a>. Don't hesitate to ask me anything; I can't promise to respond quickly, but I will answer you!</p>]]></content:encoded></item></channel></rss>