deep learning - Fabián Souto

MMBT: Supervised Multimodal Bitransformers for Classifying Images and Text

How to use a BERT-like model with a convolutional network as image encoder to perform a classification task using images, texts and self attention over both modalities at the same time.