Official JAX/Flax implementation of Vision Transformer (ViT) and MLP-Mixer for image recognition, with pre-trained models.
Vision Transformer is an open-source implementation of transformer-based models for image recognition, including ViT and MLP-Mixer architectures. It provides pre-trained models and tools for fine-tuning on various vision tasks, enabling scalable and efficient image classification. The project also includes LiT models for zero-shot transfer learning with image-text alignment.
Machine learning researchers and practitioners working on computer vision tasks, especially those interested in transformer architectures, model fine-tuning, and reproducible experiments with state-of-the-art vision models.
Developers choose this project for its official implementation of groundbreaking vision models, extensive collection of pre-trained checkpoints, and seamless integration with JAX/Flax for high-performance training on GPUs and TPUs.
This repository provides the official implementation of Vision Transformer (ViT) and MLP-Mixer architectures for image recognition, based on seminal research papers from Google Research. It includes pre-trained models on datasets like ImageNet and ImageNet-21k, along with code for fine-tuning on custom datasets using JAX and Flax.
The project emphasizes reproducibility and accessibility of state-of-the-art vision models, offering well-documented code and pre-trained checkpoints to facilitate research and practical applications in computer vision.
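The central idea behind ViT is to treat an image as a sequence of fixed-size patches that a standard transformer can consume as tokens. A minimal sketch in JAX of that patch-extraction step (the `patchify` name and 16-pixel patch size are illustrative, not the repository's API):

```python
import jax.numpy as jnp

def patchify(image, patch_size=16):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    h, w, c = image.shape
    gh, gw = h // patch_size, w // patch_size
    # Reshape into a grid of patches, then flatten each patch to a vector.
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(gh * gw, -1)
    return patches

image = jnp.zeros((224, 224, 3))
tokens = patchify(image)
print(tokens.shape)  # (196, 768)
```

For a 224×224 RGB input with 16×16 patches this yields 196 tokens of dimension 768, which matches the token layout of a ViT-B/16 model.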
Provides exact code from seminal Google Research papers, ensuring reproducibility and state-of-the-art accuracy as shown in detailed performance tables.
Includes more than 50,000 checkpoints, covering ViT and Mixer variants pre-trained on ImageNet-21k, with AugReg models offering best-in-class accuracy for transfer learning.
Offers detailed scripts and guides for training on Google Cloud VMs with GPU or TPU accelerators, enabling large-scale experiments.
Configurable scripts for datasets like CIFAR-10/100 and custom datasets via TensorFlow Datasets, with examples for trade-offs in accuracy and compute.
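As a sketch of what such a fine-tuning launch looks like, based on the entry point documented in the repository's README (exact config names, flags, and bucket paths may differ across versions):

```shell
# Hedged sketch: fine-tune a ViT-B/16 checkpoint on CIFAR-10.
# Flag and config names follow the README but may change between releases.
python -m vit_jax.main --workdir=/tmp/vit-cifar10 \
  --config=$(pwd)/vit_jax/configs/vit.py:b16,cifar10 \
  --config.pretrained_dir='gs://vit_models/imagenet21k'
```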
Installation differs for GPU and TPU setups and requires specific Python versions; known issues, such as slow network throughput between Colab and TPUs, further hinder ease of use.
Optimal training relies on Google Cloud VMs and storage buckets (e.g., gs://vit_models), creating vendor dependency and potential cost barriers.
Large models such as ViT-L/16 require tuning batch sizes and gradient-accumulation steps to avoid out-of-memory errors, as noted in the README's remarks on memory.
Primarily JAX/Flax-based with minimal native integration for other frameworks; though some checkpoints work with PyTorch's timm, core training code is JAX-only.
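The memory trade-off mentioned above can be sketched generically in JAX: split a large batch into micro-batches, compute a gradient per micro-batch, and average them, which reproduces the full-batch gradient while reducing peak memory. The `loss_fn` and parameters below are toy stand-ins, not the repository's training code:

```python
import jax
import jax.numpy as jnp

# Toy linear-regression loss standing in for a real training objective.
def loss_fn(params, batch):
    x, y = batch
    pred = x @ params["w"]
    return jnp.mean((pred - y) ** 2)

grad_fn = jax.grad(loss_fn)

def accumulated_grad(params, micro_batches):
    """Average gradients over micro-batches (gradient accumulation)."""
    grads = [grad_fn(params, b) for b in micro_batches]
    # Average leaf-wise across the per-micro-batch gradient pytrees.
    return jax.tree_util.tree_map(lambda *g: sum(g) / len(g), *grads)

params = {"w": jnp.ones((4, 1))}
x = jnp.ones((8, 4))
y = jnp.zeros((8, 1))
# Split one batch of 8 examples into two micro-batches of 4.
micro = [(x[:4], y[:4]), (x[4:], y[4:])]
g_accum = accumulated_grad(params, micro)
g_full = grad_fn(params, (x, y))
print(bool(jnp.allclose(g_accum["w"], g_full["w"])))  # True
```

Because each micro-batch gradient is computed and released before the next, peak activation memory scales with the micro-batch size rather than the effective batch size.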