Multilingual Visual Question Answering
We are currently planning to use ViT + mBERT to handle the two modalities, images and text, for the Multilingual Visual Question Answering task.
3. Model
The pre-trained models ViT and mBERT can be used for this task.
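As a starting point, both backbones can be loaded with the Flax model classes from transformers. The snippet below is a minimal sketch: the checkpoint names (google/vit-base-patch16-224-in21k and bert-base-multilingual-cased), the example image path, and the sample question are our assumptions and can be swapped for any ViT / multilingual BERT checkpoints that ship Flax weights (or loaded with from_pt=True otherwise).

```python
# Minimal sketch: load the pre-trained backbones and encode one image/question pair.
# Checkpoint names, the image path, and the sample question are placeholder assumptions.
from PIL import Image
from transformers import (
    FlaxViTModel,
    ViTFeatureExtractor,
    FlaxBertModel,
    BertTokenizerFast,
)

vit = FlaxViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
feature_extractor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")

mbert = FlaxBertModel.from_pretrained("bert-base-multilingual-cased")
tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")

image = Image.open("example.jpg")  # placeholder image
pixel_values = feature_extractor(images=image, return_tensors="np")["pixel_values"]
question = tokenizer("¿Cuántos perros hay en la imagen?", return_tensors="np")  # a non-English question

image_embeds = vit(pixel_values).last_hidden_state  # (1, num_patches + 1, 768)
text_embeds = mbert(**question).last_hidden_state   # (1, seq_len, 768)
```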
4. Datasets
So far we have come across one dataset, WIT, which is a large multimodal multilingual dataset. We can update this list if we encounter more suitable datasets for the task.
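For a first look at the data, WIT can be pulled with the datasets library in streaming mode, since the full corpus is very large. The dataset identifier below (wikimedia/wit_base) is an assumption; the exact name and configuration on the Hub should be checked first.

```python
# Minimal sketch: stream a few WIT examples to inspect the available fields.
# The dataset identifier is an assumption; verify it on the Hugging Face Hub.
from datasets import load_dataset

wit = load_dataset("wikimedia/wit_base", split="train", streaming=True)
sample = next(iter(wit))
print(sample.keys())  # inspect which image/caption/language fields are available
```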
5. Training scripts
Since this is a Seq2Seq model, the run_summarization_flax.py script can be used for training with some modifications.
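Concretely, the Flax example scripts are built around a jit-compiled train step (value_and_grad over the loss plus an optax update), and the modified script would keep that shape while feeding both image and text inputs. A rough sketch of that pattern, with placeholder batch keys, follows.

```python
# Rough sketch of the train-step pattern used by the Flax example scripts, which
# run_summarization_flax.py also follows. `state` is assumed to be a flax TrainState
# wrapping the adapted multimodal model; the batch keys are placeholders.
import jax
import optax


def loss_fn(logits, labels):
    # Token-level cross-entropy over the generated answer.
    one_hot = jax.nn.one_hot(labels, logits.shape[-1])
    return optax.softmax_cross_entropy(logits, one_hot).mean()


@jax.jit
def train_step(state, batch):
    def compute_loss(params):
        logits = state.apply_fn(
            {"params": params},
            batch["pixel_values"],       # image inputs for the ViT branch (assumed key)
            batch["input_ids"],          # tokenized multilingual question (assumed key)
            batch["decoder_input_ids"],  # shifted answer tokens (assumed key)
        )
        return loss_fn(logits, batch["labels"])

    loss, grads = jax.value_and_grad(compute_loss)(state.params)
    return state.apply_gradients(grads=grads), loss
```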
6. (Optional) Challenges
- How to combine the two modalities, i.e. image and text. One approach, similar to VisualBERT, is to use ViT for the images and feed its embeddings into mBERT alongside the word embeddings. Another way to let the two modalities interact is through co-attentional transformer layers, as in ViLBERT. We need to decide which is better suited to our task; a rough sketch of the first option is shown below.
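As a rough sketch of the first (VisualBERT-style) option, the ViT patch embeddings could be projected into mBERT's hidden size and prepended to the question's word embeddings before running the encoder over the combined sequence. The module below is only illustrative; the name, default sizes, and shapes are assumptions, not a finished design.

```python
# Illustrative sketch of VisualBERT-style fusion: project ViT patch embeddings
# into the text embedding space and prepend them to the word embeddings.
# Module name, default sizes, and shapes are assumptions.
import jax.numpy as jnp
import flax.linen as nn


class VisualTokenFusion(nn.Module):
    hidden_size: int = 768  # mBERT hidden size

    @nn.compact
    def __call__(self, image_embeds, text_embeds):
        # image_embeds: (batch, num_patches + 1, vit_dim) from FlaxViTModel
        # text_embeds:  (batch, seq_len, hidden_size) from mBERT's embedding layer
        visual_tokens = nn.Dense(self.hidden_size)(image_embeds)
        # The combined sequence would then be fed to the multilingual encoder.
        return jnp.concatenate([visual_tokens, text_embeds], axis=1)
```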
7. (Optional) Desired project outcome
The final goal is to have an end-to-end model that can perform the Visual QA task in multiple languages.
We are also planning to benchmark few-shot/zero-shot performance on VQA, VizWiz, and GQA.

and want to get into Transformers via a project. What I would like to take away is knowledge of what it takes to tune a pre-trained model to another task using Hugging Face's infrastructure/pipeline, and to learn some Flax basics.