A curated list of recent research papers and resources on Vision and Language Pre-trained Models (VL-PTMs).
awesome-vision-language-pretraining-papers is a curated GitHub repository that aggregates research papers, code, and resources on Vision and Language Pre-trained Models (VL-PTMs). It provides a structured overview of multimodal AI models that learn joint representations from visual and textual data, supporting tasks like visual question answering, image captioning, and cross-modal retrieval. The repository works as a survey-style index, tracking progress from early models like ViLBERT up to its last update in June 2021.
It is aimed at AI researchers, graduate students, and machine learning engineers working on multimodal learning, computer vision, or natural language processing who need a consolidated reference for VL-PTM literature.
It saves significant literature review time by offering a meticulously organized, modality-wise taxonomy of papers with direct links to arXiv and code. Unlike generic paper lists, it focuses specifically on pretrained vision-language models, includes task-specific adaptations, and points to critical analysis studies.
Recent Advances in Vision and Language PreTrained Models (VL-PTMs)
Organizes papers by modality (image, video, speech) and research focus (e.g., representation learning), making literature review efficient, as seen in the table of contents.
Includes GitHub links for many models like ViLBERT and UNITER, aiding in implementation and reproducibility, as highlighted in the entries with [code] tags.
Covers modalities beyond images, with papers such as VideoBERT for video and SpeechBERT for speech, giving a broader view of multimodal pre-training.
Points to related surveys and reading lists, like the JAIR 2021 survey, offering extended learning beyond individual papers.
Last updated in June 2021, so it misses significant advances from 2022 onward, such as newer foundation models, which limits its relevance for current research.
Merely lists papers without summaries, comparisons, or insights, requiring users to parse dense academic material independently for evaluation.
Focuses on research papers with minimal implementation tips or deployment advice, so it is not ideal for quick prototyping or real-world applications.