A PyTorch implementation of self-supervised monocular depth estimation using 3D packing for high-resolution, real-time depth prediction.
PackNet-SfM is a self-supervised monocular depth estimation framework that predicts depth maps from single images or video sequences without requiring labeled depth data. It enables accurate 3D scene understanding for applications such as autonomous driving by learning from video alone, using a novel 3D packing architecture that preserves fine detail while supporting real-time inference.
Computer vision researchers and engineers working on autonomous driving, robotics, and 3D scene understanding who need accurate, efficient depth estimation without costly ground-truth data.
Developers choose PackNet-SfM for its state-of-the-art self-supervised performance, ability to generalize across camera models (including non-pinhole), and real-time inference capabilities, all while being open-source and backed by extensive research from Toyota Research Institute.
TRI-ML Monocular Depth Estimation Repository
Uses symmetric packing and unpacking blocks with 3D convolutions to compress detail-preserving representations, enabling high-resolution depth prediction as shown in the CVPR 2020 paper.
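The packing operation can be sketched in PyTorch. The block below is an illustrative simplification, not the repository's implementation (channel counts, the downscale factor `r`, the 3D filter count `d`, and the exact layer arrangement are assumptions): space-to-depth folds spatial detail into channels instead of discarding it via pooling, and a 3D convolution then compresses the folded volume.

```python
import torch
import torch.nn as nn

class PackingBlock(nn.Module):
    """Illustrative sketch of a 3D packing block (not the repo's exact code)."""

    def __init__(self, in_channels, out_channels, r=2, d=8):
        super().__init__()
        # Space-to-depth: H x W -> H/r x W/r, channels grow by r^2 (detail is kept)
        self.unshuffle = nn.PixelUnshuffle(r)
        # 3D conv compresses the folded feature volume
        self.conv3d = nn.Conv3d(1, d, kernel_size=3, padding=1)
        # 2D conv projects back to the desired channel count
        self.conv2d = nn.Conv2d(in_channels * r * r * d, out_channels,
                                kernel_size=3, padding=1)

    def forward(self, x):
        x = self.unshuffle(x)            # (B, C*r^2, H/r, W/r)
        x = self.conv3d(x.unsqueeze(1))  # (B, d, C*r^2, H/r, W/r)
        b, d, c, h, w = x.shape
        x = x.reshape(b, d * c, h, w)    # fold 3D features back into 2D channels
        return self.conv2d(x)

x = torch.randn(1, 16, 64, 64)
block = PackingBlock(16, 32)
out = block(x)
print(out.shape)  # torch.Size([1, 32, 32, 32])
```

Unpacking in the decoder mirrors this with depth-to-space (pixel shuffle), which is what lets the network recover high-resolution depth.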
Trained purely self-supervised on monocular videos, eliminating the need for expensive depth labeling, a core advantage highlighted in the framework's description.
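The self-supervision signal comes from view synthesis: a source frame is warped into the target view using the predicted depth and camera pose, and the reconstruction is scored photometrically. Below is a minimal sketch of the standard SSIM + L1 photometric term used in this family of methods (the warping step is omitted; `alpha` and the 3x3 window follow common practice and are not necessarily this repository's settings):

```python
import torch
import torch.nn.functional as F

def photometric_loss(pred, target, alpha=0.85):
    """Sketch of an SSIM + L1 photometric loss for self-supervised depth.

    pred, target: (B, 3, H, W) synthesized and observed images.
    Returns a per-pixel loss map of shape (B, 1, H, W).
    """
    # L1 term, averaged over color channels
    l1 = (pred - target).abs().mean(1, keepdim=True)

    # Simplified SSIM using 3x3 average pooling as the local window
    mu_x = F.avg_pool2d(pred, 3, 1, 1)
    mu_y = F.avg_pool2d(target, 3, 1, 1)
    sigma_x = F.avg_pool2d(pred ** 2, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(target ** 2, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(pred * target, 3, 1, 1) - mu_x * mu_y
    c1, c2 = 0.01 ** 2, 0.03 ** 2
    ssim = ((2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2))
    ssim_loss = torch.clamp((1 - ssim) / 2, 0, 1).mean(1, keepdim=True)

    return alpha * ssim_loss + (1 - alpha) * l1
```

Since the loss only compares images, no depth labels are ever needed; depth and pose emerge as the quantities that make the warped frame match the observed one.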
Optimized for real-time performance using TensorRT, making it suitable for autonomous driving applications where speed is critical, as noted in the README.
Extends to non-pinhole cameras like fisheye through Neural Ray Surfaces, allowing depth estimation beyond traditional models, based on the 3DV 2020 implementation.
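The Neural Ray Surfaces idea replaces the pinhole unprojection (K^-1 applied to homogeneous pixel coordinates) with a learned per-pixel ray direction, so fisheye and other non-pinhole cameras are handled uniformly. A schematic sketch, with the function name and tensor shapes chosen for illustration rather than taken from the repository:

```python
import torch

def unproject_with_ray_surface(depth, rays):
    """Sketch of ray-surface unprojection (illustrative, not the repo's code).

    depth: (B, 1, H, W) predicted depth
    rays:  (B, 3, H, W) learned per-pixel ray directions
    Returns (B, 3, H, W) 3D points in the camera frame.
    """
    # Each pixel's 3D point lies along its own learned ray,
    # with no fixed camera model assumed.
    return depth * rays

depth = torch.rand(1, 1, 8, 8)
rays = torch.rand(1, 3, 8, 8)
points = unproject_with_ray_surface(depth, rays)
print(points.shape)  # torch.Size([1, 3, 8, 8])
```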
Requires Docker and is tested only on Ubuntu 18.04, with additional configuration needed for AWS and Weights & Biases (WANDB), making initial setup cumbersome and error-prone.
Needs at least 6GB of GPU memory, and more for larger models or higher resolutions, which can be prohibitive for resource-constrained environments.
The README notes that future development has moved to a new repository (vidar), so this version receives limited updates and support, potentially leaving users with outdated tools.
Detectron2 is a platform for object detection, segmentation and other visual recognition tasks.
Ready-to-use OCR with 80+ supported languages and all popular writing scripts, including Latin, Chinese, Arabic, Devanagari, and Cyrillic.
Image augmentation for machine learning experiments.
Node-based Visual Programming Toolbox