Multilingual Visual Question Answering
We are currently planning to use ViT + mBERT to handle the two modalities, images and text, for the Multilingual Visual Question Answering task.
3. Model
The pre-trained models ViT and mBERT can be used for this task.
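As a starting point, both backbones can be loaded with the Flax model classes from transformers. The snippet below is a minimal sketch: the checkpoint names (google/vit-base-patch16-224-in21k and bert-base-multilingual-cased), the example image path, and the sample question are our assumptions and can be swapped for any ViT / multilingual BERT checkpoints that ship Flax weights (or loaded with from_pt=True otherwise).

```python
# Minimal sketch: load the pre-trained backbones and encode one image/question pair.
# Checkpoint names, the image path, and the sample question are placeholder assumptions.
from PIL import Image
from transformers import (
    FlaxViTModel,
    ViTFeatureExtractor,
    FlaxBertModel,
    BertTokenizerFast,
)

vit = FlaxViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
feature_extractor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")

mbert = FlaxBertModel.from_pretrained("bert-base-multilingual-cased")
tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")

image = Image.open("example.jpg")  # placeholder image
pixel_values = feature_extractor(images=image, return_tensors="np")["pixel_values"]
question = tokenizer("¿Cuántos perros hay en la imagen?", return_tensors="np")  # a non-English question

image_embeds = vit(pixel_values).last_hidden_state  # (1, num_patches + 1, 768)
text_embeds = mbert(**question).last_hidden_state   # (1, seq_len, 768)
```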
4. Datasets
So far we have come across one dataset, WIT, which is a large multimodal multilingual dataset. We can update this list if we encounter more suitable datasets for the task.
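For a first look at the data, WIT can be pulled with the datasets library in streaming mode, since the full corpus is very large. The dataset identifier below (wikimedia/wit_base) is an assumption; the exact name and configuration on the Hub should be checked first.

```python
# Minimal sketch: stream a few WIT examples to inspect the available fields.
# The dataset identifier is an assumption; verify it on the Hugging Face Hub.
from datasets import load_dataset

wit = load_dataset("wikimedia/wit_base", split="train", streaming=True)
sample = next(iter(wit))
print(sample.keys())  # inspect which image/caption/language fields are available
```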
5. Training scripts
Since this is a Seq2Seq model, the run_summarization_flax.py script can be used for training with some modifications.
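Concretely, the Flax example scripts are built around a jit-compiled train step (value_and_grad over the loss plus an optax update), and the modified script would keep that shape while feeding both image and text inputs. A rough sketch of that pattern, with placeholder batch keys, follows.

```python
# Rough sketch of the train-step pattern used by the Flax example scripts, which
# run_summarization_flax.py also follows. `state` is assumed to be a flax TrainState
# wrapping the adapted multimodal model; the batch keys are placeholders.
import jax
import optax


def loss_fn(logits, labels):
    # Token-level cross-entropy over the generated answer.
    one_hot = jax.nn.one_hot(labels, logits.shape[-1])
    return optax.softmax_cross_entropy(logits, one_hot).mean()


@jax.jit
def train_step(state, batch):
    def compute_loss(params):
        logits = state.apply_fn(
            {"params": params},
            batch["pixel_values"],       # image inputs for the ViT branch (assumed key)
            batch["input_ids"],          # tokenized multilingual question (assumed key)
            batch["decoder_input_ids"],  # shifted answer tokens (assumed key)
        )
        return loss_fn(logits, batch["labels"])

    loss, grads = jax.value_and_grad(compute_loss)(state.params)
    return state.apply_gradients(grads=grads), loss
```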
6. (Optional) Challenges
- How to combine the two modalities, i.e. image and text. One approach, similar to VisualBERT, is to use ViT for the images and feed its embeddings into mBERT alongside the word embeddings. Another way to let the two modalities interact is through co-attentional transformer layers, as in ViLBERT. We need to decide which is better suited to our task; a rough sketch of the first option is shown below.
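As a rough sketch of the first (VisualBERT-style) option, the ViT patch embeddings could be projected into mBERT's hidden size and prepended to the question's word embeddings before running the encoder over the combined sequence. The module below is only illustrative; the name, default sizes, and shapes are assumptions, not a finished design.

```python
# Illustrative sketch of VisualBERT-style fusion: project ViT patch embeddings
# into the text embedding space and prepend them to the word embeddings.
# Module name, default sizes, and shapes are assumptions.
import jax.numpy as jnp
import flax.linen as nn


class VisualTokenFusion(nn.Module):
    hidden_size: int = 768  # mBERT hidden size

    @nn.compact
    def __call__(self, image_embeds, text_embeds):
        # image_embeds: (batch, num_patches + 1, vit_dim) from FlaxViTModel
        # text_embeds:  (batch, seq_len, hidden_size) from mBERT's embedding layer
        visual_tokens = nn.Dense(self.hidden_size)(image_embeds)
        # The combined sequence would then be fed to the multilingual encoder.
        return jnp.concatenate([visual_tokens, text_embeds], axis=1)
```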
7. (Optional) Desired project outcome
The final goal is to have an end-to-end model that can perform the Visual QA task in multiple languages.
We are also planning to benchmark few-shot/zero-shot performance on VQA, VizWiz, and GQA.

and want to get into Transformers via a project. What I would like to take away is knowledge of what it takes to tune a pre-trained model to another task using Hugging Face's infrastructure/pipeline, and to learn some Flax basics.