A Torch deep learning implementation of the VIS+LSTM model for answering natural language questions about images.
neural-vqa is a Torch-based deep learning model for Visual Question Answering (VQA): it answers natural language questions about images. It implements the VIS+LSTM architecture from Ren et al.'s 2015 paper "Exploring Models and Data for Image Question Answering," combining convolutional neural network (CNN) image features with a long short-term memory (LSTM) network so the model can reason over both visual and textual input.
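For intuition, here is a minimal sketch of how a VIS+LSTM forward pass can be assembled in Torch: the image's CNN feature is projected into the word-embedding space and fed to the LSTM as the first "token," the question words follow, and the final hidden state is classified into an answer. This is not the repository's actual code; the module choices (FastLSTM from the Element-Research rnn package) and the dimensions (4096-d VGG fc7 features, 512-d embeddings, 1000 answer classes) are assumptions for illustration.

```lua
require 'nn'
require 'rnn'  -- Element-Research rnn package (assumed installed via luarocks)

-- Assumed sizes: 4096-d VGG-19 fc7 features, 512-d embeddings, 1000 answer classes.
local vocab_size, embed_dim, hidden_dim, num_answers = 10000, 512, 512, 1000

-- Project the CNN image feature into the word-embedding space.
local img_embed = nn.Sequential()
  :add(nn.Linear(4096, embed_dim))
  :add(nn.Tanh())

-- Embed question word indices.
local word_embed = nn.LookupTable(vocab_size, embed_dim)

-- LSTM that reads the image embedding as the first "token",
-- followed by the embedded question words.
local lstm = nn.Sequencer(nn.FastLSTM(embed_dim, hidden_dim))

-- Map the final hidden state to answer scores.
local classifier = nn.Sequential()
  :add(nn.Linear(hidden_dim, num_answers))
  :add(nn.LogSoftMax())

-- Forward pass for one example: img_feat is a 4096-d Tensor,
-- question is a 1-D LongTensor of word ids.
local function answer_scores(img_feat, question)
  local tokens = { img_embed:forward(img_feat) }
  local q_emb = word_embed:forward(question)   -- (question length x embed_dim)
  for t = 1, q_emb:size(1) do
    table.insert(tokens, q_emb[t])
  end
  local states = lstm:forward(tokens)          -- table of hidden states, one per step
  return classifier:forward(states[#states])   -- log-probabilities over candidate answers
end
```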
Researchers and students in machine learning, computer vision, and NLP who want to experiment with or understand multimodal AI models for visual question answering.
It offers a clean, documented implementation of a published VQA model with pre-trained checkpoints, making it easier to reproduce research results or build upon the architecture for new experiments.
:grey_question: Visual Question Answering in Torch
Faithfully replicates the VIS+LSTM architecture from the 2015 paper, providing a reproducible baseline for academic study and experimentation.
Includes downloadable checkpoints (e.g., vqa_epoch23.26_0.4610.t7) for immediate inference, saving time on training from scratch (see the loading sketch after this list).
Seamlessly works with MSCOCO and VQA datasets using provided scripts, aligning with common benchmarks in the field.
Configurable for both GPU and CPU execution via the gpuid option, allowing experimentation on varied hardware setups.
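As a rough, hypothetical sketch of how a pre-trained checkpoint and a gpuid-style option are typically wired up in Torch code (the repository's actual flag handling and checkpoint contents may differ, so consult its README for exact commands):

```lua
require 'torch'
require 'nn'

-- Hypothetical option table; in the repository, gpuid is passed on the command line.
local opt = { gpuid = 0, checkpoint = 'vqa_epoch23.26_0.4610.t7' }

if opt.gpuid >= 0 then
  -- GPU path: needs the CUDA-enabled Torch packages.
  require 'cutorch'
  require 'cunn'
  cutorch.setDevice(opt.gpuid + 1)  -- Torch GPU ids are 1-indexed
else
  -- CPU path: plain float tensors, no CUDA dependencies.
  torch.setdefaulttensortype('torch.FloatTensor')
end

-- Load the downloaded checkpoint; what it contains (model weights, vocabulary, etc.)
-- is repository-specific.
local checkpoint = torch.load(opt.checkpoint)
```

The convention of a negative gpuid meaning "run on CPU" is common in Torch projects but is an assumption here; verify it against the repository's training and evaluation scripts.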
Built on the Lua-based Torch framework, which has been largely superseded by PyTorch, leading to compatibility issues and a shrinking ecosystem.
Requires manual download of large datasets and the VGG-19 model, and LuaJIT's memory limits necessitate workarounds such as running Torch with plain Lua 5.1.
Based on older research that lacks modern techniques, it achieves ~46% accuracy on the VQA validation set, below current state-of-the-art models.