deep learning MMBT: Supervised Multimodal Bitransformers for Classifying Images and Text How to use a BERT-like model with a convolutional network as image encoder to perform a classification task using images, texts and self attention over both modalities at the same time.