How do I use NesT with PyTorch instead of JAX?

Use the timm library, which includes a PyTorch implementation of NesT with pre-trained models. Install timm and follow its documentation. Note that this is a re-implementation, so it may differ from the original JAX code in performance or features.
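A minimal sketch of loading NesT through timm follows. It assumes timm registers its NesT variants under names beginning with `nest_` or `jx_nest_`; the exact names have varied between timm releases, so the sketch asks timm for its registry rather than hard-coding one. `pick_nest_model` and `build_nest` are hypothetical helpers written for this example, not part of timm.

```python
def pick_nest_model(available, size="tiny"):
    """Pick a NesT model name of the requested size from a list of registry names.

    Only names starting with "nest_" or "jx_nest_" are considered, so that
    unrelated families such as ResNeSt ("resnest50d") are not matched.
    """
    candidates = sorted(
        name
        for name in available
        if name.startswith(("nest_", "jx_nest_")) and size in name
    )
    if not candidates:
        raise ValueError(f"no NesT model of size {size!r} found")
    return candidates[0]


def build_nest(num_classes, size="tiny", pretrained=True):
    """Create a NesT classifier through timm with a head of num_classes outputs."""
    import timm  # third-party: pip install timm (requires torch)

    # Query timm's registry instead of assuming a fixed model name.
    names = timm.list_models("*nest*", pretrained=pretrained)
    name = pick_nest_model(names, size)
    return timm.create_model(name, pretrained=pretrained, num_classes=num_classes)
```

With timm and torch installed, `build_nest(num_classes=10, pretrained=False)` should return a randomly initialized model that accepts a batch of 224x224 RGB images. Passing `pretrained=True` asks timm to download weights, if it ships them for the selected variant.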

NesT vs Vision Transformer (ViT): which is better for small datasets?

NesT is specifically designed for data efficiency and scales better to small datasets, often matching convnet accuracy where ViT might underperform. For limited data scenarios, NesT is generally the superior choice due to its nested hierarchical structure.

What hardware is needed to train NesT from scratch?

Training is optimized for TPUs; the README describes a TPUv2 8x8 setup. On GPUs, the code supports up to 8 GPUs for variants such as NesT-T, but multi-node training is not supported. Full-scale experiments therefore require substantial compute resources.

How do I fine-tune NesT on a custom image dataset?

Adapt the provided ImageNet or CIFAR configuration files in the configs directory, update the dataset paths, and run the main.py script with JAX. Detailed guidance for custom datasets is limited, so expect to tune hyperparameters for your data.

Is NesT suitable for real-time inference?

The architecture focuses on accuracy and data efficiency, not inference optimization. While pre-trained models can be used for inference, speed depends on the variant and hardware; you might need additional engineering for real-time applications.
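Because speed depends on the variant and hardware, it helps to measure it directly. The sketch below is a generic timing harness, not something the NesT repository provides: `measure_latency` and `meets_budget` are hypothetical helpers, and `fn` stands for any zero-argument callable that runs one forward pass and blocks until the result is ready.

```python
import statistics
import time


def measure_latency(fn, warmup=3, repeats=20):
    """Return median and worst wall-clock time in seconds of calling fn().

    fn must block until its result is ready. On accelerators that run
    asynchronously, that means synchronizing inside fn, for example with
    torch.cuda.synchronize() or a JAX array's block_until_ready().
    """
    # Warm-up calls absorb one-off costs such as compilation and cache fills.
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return {"median_s": statistics.median(timings), "max_s": max(timings)}


def meets_budget(stats, budget_s):
    """True if the median latency fits within the per-frame budget."""
    return stats["median_s"] <= budget_s
```

For a 30 frames-per-second target, the budget is roughly 0.033 seconds per frame; comparing NesT-T, NesT-S, and NesT-B on the deployment hardware this way shows whether any variant fits before investing in further optimization.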

How does NesT compare to ResNet for image classification?

NesT aims to rival convnets like ResNet on small datasets, offering similar accuracy with transformer benefits. On ImageNet, NesT-B achieves 83.8% top-1 accuracy, comparable to many ResNet variants, but it may require more compute to train because of its transformer-based design.

NesT

Apache-2.0 · Jupyter Notebook

A vision transformer architecture that aggregates nested local transformers on image blocks for better accuracy, data efficiency, and convergence.


What is NesT?

Nested Hierarchical Transformer (NesT) is a vision transformer architecture that aggregates nested local transformers on image blocks to improve image classification. It addresses limitations in standard vision transformers by enhancing accuracy, data efficiency, and convergence, particularly on benchmarks like ImageNet. The method is designed to scale effectively to smaller datasets, matching the performance of convolutional neural networks.
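The nesting can be illustrated with some simple arithmetic. The sketch below assumes a 224x224 input, 4x4 patches, three hierarchy levels, and a 2x spatial reduction at each aggregation step; these values are assumptions chosen for illustration and should be checked against the paper and the repository configs. `nest_hierarchy` is a hypothetical helper, not code from the repository.

```python
def nest_hierarchy(image_size=224, patch_size=4, num_levels=3):
    """Describe how many blocks, and tokens per block, each level attends over.

    Attention is local to each block. Between levels, neighbouring blocks are
    merged 2x2 while the token grid is downsampled by 2 per side, so the
    number of tokens per block stays constant under these assumptions.
    """
    if image_size % patch_size:
        raise ValueError("image_size must be divisible by patch_size")
    grid = image_size // patch_size  # tokens per side at the bottom level
    levels = []
    for level in range(num_levels):
        blocks_per_side = 2 ** (num_levels - 1 - level)
        if grid % blocks_per_side:
            raise ValueError("token grid must divide evenly into blocks")
        tokens_per_side = grid // blocks_per_side
        levels.append(
            {
                "level": level,
                "num_blocks": blocks_per_side**2,
                "tokens_per_block": tokens_per_side**2,
            }
        )
        grid //= 2  # aggregation halves the spatial resolution
    return levels


if __name__ == "__main__":
    for row in nest_hierarchy():
        print(row)
```

Under these assumptions the bottom level runs a transformer independently on each of many small blocks, and the top level is a single block covering the whole (downsampled) image, which is where global information is mixed.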

Target Audience

Researchers and practitioners in computer vision and deep learning who are working on image classification, vision transformer improvements, or efficient model architectures for limited data scenarios.

Value Proposition

Developers choose NesT for its simple yet effective hierarchical design that boosts vision transformer performance without complex modifications. Its pre-trained models and compatibility with frameworks like Jax and PyTorch (via timm) make it accessible for both research and practical applications.

Overview

Nested Hierarchical Transformer https://arxiv.org/pdf/2105.12723.pdf

Use Cases

Best For

Improving vision transformer accuracy on ImageNet

Related Projects


204 stars · 27 forks · 0 contributors

Training vision models with limited datasets

Research on hierarchical neural network architectures

Image classification tasks requiring data efficiency

Comparing transformer-based models against convolutional networks

Implementing nested transformer designs in Jax or PyTorch

Not Ideal For

Production deployments requiring extensive documentation and official support
Teams exclusively using PyTorch without tolerance for Jax dependencies or third-party re-implementations
Applications beyond image classification, such as object detection or video analysis
Small teams lacking access to TPUs or high-end GPU clusters for optimal training

Pros & Cons

Pros

Enhanced Data Efficiency

Achieves high accuracy with less training data than standard vision transformers, closing the gap with convolutional networks on smaller datasets.

Improved Convergence

Improves training convergence relative to standard vision transformers. Pre-trained models such as NesT-B reach 83.8% top-1 accuracy on ImageNet.

Scalability to Small Datasets

Designed to match convolutional neural network accuracy even with limited data, making it effective for research or applications where data is scarce.

Available Pre-trained Models

Includes checkpoints for NesT-B, NesT-S, and NesT-T variants with reported ImageNet accuracies, facilitating quick evaluation and fine-tuning without full retraining.

Cons

Complex Setup for Training

Optimal performance requires a TPU setup configured with specific IP addresses and the JAX backend. GPU training is limited to 8 GPUs because the codebase does not support multi-node training, which limits scalability.

Limited Framework Support

Primary implementation is in Jax, with PyTorch support only through third-party libraries like timm, which may lack full feature parity or official updates.

Research-Focused, Not Production-Ready

The README states that this is 'not an officially supported Google product,' so expect sparse documentation, minimal deployment support, and potential breaking changes.

Frequently Asked Questions


Vision Transformer

This repository provides the official implementation of Vision Transformer (ViT) and MLP-Mixer architectures for image recognition, based on seminal research papers from Google Research. It includes pre-trained models on datasets like ImageNet and ImageNet-21k, along with code for fine-tuning on custom datasets using JAX and Flax.

## Key Features

- **Vision Transformer (ViT)**: Applies transformer architecture to image patches for scalable image recognition.
- **MLP-Mixer**: An all-MLP architecture for vision tasks, offering an alternative to convolutional networks.
- **Pre-trained Models**: Includes a wide variety of ViT and Mixer models (e.g., ViT-B/16, ViT-L/16, Mixer-B/16) pre-trained on ImageNet and ImageNet-21k.
- **Fine-tuning Support**: Provides configurable scripts to fine-tune models on datasets like CIFAR-10, CIFAR-100, and custom datasets.
- **LiT Models**: Includes Locked-image text Tuning models for zero-shot transfer learning with image-text alignment.
- **Cloud Integration**: Supports training on Google Cloud VMs with GPU or TPU accelerators.

## Philosophy

The project emphasizes reproducibility and accessibility of state-of-the-art vision models, offering well-documented code and pre-trained checkpoints to facilitate research and practical applications in computer vision.

Stars: 12,636

Forks: 1,476

Last commit: 13 days ago

Big Transfer (BiT)

Official repository for the "Big Transfer (BiT): General Visual Representation Learning" paper.

Stars: 1,543

Forks: 175

Last commit: 2 years ago

mip-NeRF

Mip-NeRF is an extension of Neural Radiance Fields (NeRF) that addresses aliasing artifacts by representing scenes at continuously-valued scales. It renders anti-aliased conical frustums instead of single rays, enabling higher-quality synthesis of novel views from 2D images while being faster and more compact than the original NeRF.

## Key Features

- **Multiscale Scene Representation**: Models scenes at continuous scales to handle varying image resolutions.
- **Anti-Aliased Rendering**: Renders conical frustums instead of rays, reducing blur and aliasing artifacts.
- **Improved Detail Preservation**: Significantly enhances NeRF's ability to capture fine details.
- **Computational Efficiency**: 7% faster than NeRF and half the model size, while reducing error rates by 17-60%.
- **Scalable Performance**: Matches brute-force supersampled NeRF accuracy while being 22x faster on multiscale datasets.

## Philosophy

Mip-NeRF is designed to efficiently solve the aliasing problem in neural rendering by integrating multiscale representation directly into the NeRF framework, prioritizing both rendering quality and computational performance.

Stars: 939

Forks: 112

Last commit: 3 years ago