Question 1

What is the best vision-language model for image captioning?

Accepted Answer

The repository's image-based section highlights Oscar and VinVL for captioning. Because the list has not been updated since 2021, check recent benchmarks for current state-of-the-art options such as Florence or newer models.

Question 2

How do I find the code for a specific paper in this repository?

Accepted Answer

Navigate the structured sections by modality, e.g., image-based VL-PTMs, and look for entries with [code] links. For example, the ViLBERT entry includes a GitHub link for direct access to implementation.
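
If you have the README saved locally, the lookup described above can be scripted. The following is a minimal sketch, assuming entries are markdown list items whose code link is labeled `[code]` or `[[code]]`; the helper name `find_code_links` and the sample entries are hypothetical, and the real README may be formatted differently.

```python
import re

# Matches a markdown link labeled "code", written as [code](url) or [[code]](url).
CODE_LINK = re.compile(r"\[\[?code\]?\]\((https?://[^)\s]+)\)", re.IGNORECASE)


def find_code_links(readme_text, keyword):
    """Return (line, url) pairs for lines that mention keyword and carry a code link."""
    results = []
    for line in readme_text.splitlines():
        if keyword.lower() not in line.lower():
            continue
        for url in CODE_LINK.findall(line):
            results.append((line.strip(), url))
    return results


# Hypothetical README excerpt; the real entries and URLs will differ.
SAMPLE = """\
## Image-based VL-PTMs
- ExampleModel-A: placeholder entry [[paper]](https://example.org/a.pdf) [[code]](https://example.org/a-code)
- ExampleModel-B: placeholder entry without code [[paper]](https://example.org/b.pdf)
"""

for line, url in find_code_links(SAMPLE, "ExampleModel-A"):
    print(url)
```

Entries that have a paper link but no code link produce no match, so an empty result for a paper you know is listed usually means no implementation was linked.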

Question 3

ViLT vs VisualBERT: which one is better for visual question answering?

Accepted Answer

ViLT is more recent and more efficient because it embeds image patches directly, with no convolutional backbone or region detector, while VisualBERT is a foundational baseline. The repository lists both. Performance varies by dataset, so review the papers for task-specific results.

Question 4

Are there video-text pretraining models with open-source code available?

Accepted Answer

Yes, papers like ActBERT and HERO in the video-based section have code links, but verify the GitHub repositories for availability, maintenance, and compatibility with your setup.
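
One way to check maintenance is to look at a repository's metadata. The sketch below assumes you already have the JSON object returned by the GitHub REST API for a repository, which includes the `archived` and `pushed_at` fields. The function name `repo_health` and the one-year staleness threshold are arbitrary choices for illustration, not a standard.

```python
from datetime import datetime, timezone


def repo_health(meta, now=None, stale_days=365):
    """Summarize availability and maintenance from a repository metadata dict."""
    now = now or datetime.now(timezone.utc)
    # pushed_at is an ISO 8601 UTC timestamp such as "2021-03-01T12:00:00Z".
    pushed = datetime.strptime(meta["pushed_at"], "%Y-%m-%dT%H:%M:%SZ")
    pushed = pushed.replace(tzinfo=timezone.utc)
    age_days = (now - pushed).days
    return {
        "archived": bool(meta.get("archived", False)),
        "days_since_push": age_days,
        "stale": age_days > stale_days,  # assumed threshold; adjust to taste
    }


# Hypothetical metadata; real values come from the API response for the repo.
example_meta = {"archived": False, "pushed_at": "2021-03-01T12:00:00Z"}
print(repo_health(example_meta))
```

A stale or archived repository can still be usable, but expect to pin older framework versions to match the original setup.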

Question 5

How can I contribute or suggest updates to this repository?

Accepted Answer

The README doesn't specify contribution guidelines; it's maintained by a single author. You might need to contact the maintainer via email or fork the repository for personal updates, as it shows no recent activity.

Question 6

What datasets are commonly used in these vision-language papers?

Accepted Answer

The repository doesn't list datasets explicitly, but papers often reference COCO, Visual Genome, or VQA datasets. Check individual paper links or the supplementary surveys for detailed dataset information.

Awesome Vision + Language
