A neural network for real-time 6D object pose tracking in video using RGB-D data, trained only on synthetic images.
se(3)-TrackNet is a neural network architecture for 6D pose tracking that estimates the position and orientation of known objects in video sequences using RGB-D data. It targets three challenges in long-term tracking: occlusion, the lack of annotated real-world data, and error drift. Training relies exclusively on synthetic images, and at each frame the network estimates the relative pose between the current observation and a synthetic rendering conditioned on the previous pose estimate.
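The per-frame update described above can be sketched as a simple loop: predict a relative transform, compose it with the previous estimate, repeat. This is a minimal illustration, not the project's code. `predict_relative_pose` is a hypothetical stand-in for the network, and composing by left-multiplication is an assumption made here for clarity.

```python
import numpy as np


def compose_pose(prev_pose, delta_pose):
    """Apply a predicted relative 4x4 transform to the previous 4x4 pose.

    Left-multiplication is an assumption of this sketch.
    """
    return delta_pose @ prev_pose


def track(initial_pose, frames, predict_relative_pose):
    """Propagate a pose through a sequence, one relative update per frame.

    predict_relative_pose is a hypothetical callable standing in for the
    network: it receives (frame, previous_pose) and returns a 4x4 relative
    transform. In the real system the previous pose would be used to render
    a synthetic view that is compared against the current observation.
    """
    pose = initial_pose.copy()
    trajectory = [pose]
    for frame in frames:
        delta = predict_relative_pose(frame, pose)
        pose = compose_pose(pose, delta)
        trajectory.append(pose)
    return trajectory


def dummy_predictor(frame, prev_pose):
    """Toy stand-in: pretend the object moves 1 cm along x in every frame."""
    delta = np.eye(4)
    delta[0, 3] = 0.01
    return delta


trajectory = track(np.eye(4), frames=[None] * 5,
                   predict_relative_pose=dummy_predictor)
final_x = trajectory[-1][0, 3]
```

Because each estimate is built from the previous one, any error in a single step carries forward, which is why drift matters for long sequences.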
Researchers and engineers in robotics, computer vision, and AR/VR who need real-time, robust 6D pose tracking for manipulation, model-based reinforcement learning, or human-robot interaction.
According to the paper, it outperforms alternatives in robustness and runs at 90.9 Hz, while requiring only synthetic data for training, which removes the need for costly real-world pose annotations. Its architecture is designed to reduce domain shift and uses a Lie algebra representation for 3D orientation.
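To illustrate what a Lie algebra orientation representation means in practice, the sketch below maps a 3-vector in so(3) to a rotation matrix with the Rodrigues formula and packs it into a 4x4 transform. The function names are hypothetical, and placing the translation directly into the matrix (rather than applying the full se(3) exponential) is a simplifying assumption of this sketch, not a statement about the project's implementation.

```python
import numpy as np


def skew(w):
    """Return the 3x3 skew-symmetric matrix [w]_x of a 3-vector."""
    return np.array([
        [0.0, -w[2], w[1]],
        [w[2], 0.0, -w[0]],
        [-w[1], w[0], 0.0],
    ])


def so3_exp(w, eps=1e-8):
    """Exponential map from so(3) (axis-angle vector) to a rotation matrix."""
    w = np.asarray(w, dtype=float)
    theta = np.linalg.norm(w)
    K = skew(w)
    if theta < eps:
        # First-order approximation near the identity avoids dividing by ~0.
        return np.eye(3) + K
    return (np.eye(3)
            + (np.sin(theta) / theta) * K
            + ((1.0 - np.cos(theta)) / theta ** 2) * (K @ K))


def relative_transform(t, w):
    """Build a 4x4 transform from a translation t and an so(3) vector w.

    Assumption: t is used directly as the translation column.
    """
    T = np.eye(4)
    T[:3, :3] = so3_exp(w)
    T[:3, 3] = np.asarray(t, dtype=float)
    return T


# A quarter turn about the z-axis combined with a 10 cm shift along x.
T = relative_transform(t=[0.1, 0.0, 0.0], w=[0.0, 0.0, np.pi / 2])
rotated_x = T[:3, :3] @ np.array([1.0, 0.0, 0.0])
```

One appeal of this parameterization is that a network can regress three unconstrained numbers for rotation, and the exponential map yields a valid rotation matrix without a separate normalization step.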
[IROS 2020] se(3)-TrackNet: Data-driven 6D Pose Tracking by Calibrating Image Residuals in Synthetic Domains
Trained entirely on synthetic data, eliminating the need for costly real-world annotations; benchmark results show it generalizes effectively to real RGB-D images.
Achieves a tracking frequency of 90.9 Hz, as reported in the paper, making it suitable for real-time applications like robotic manipulation and AR/VR.
Designed to handle significant occlusions common in manipulation tasks, with performance validated on the YCBInEOAT dataset featuring real robotic interactions.
Uses a Lie algebra representation for 3D orientation within the network architecture, which improves pose estimation accuracy.
Requires a precise CAD model for each target object, which limits use to known objects and adds model preparation work; the project refers users to BundleTrack for unknown objects.
Setup involves Docker, large dataset downloads (e.g., 15 GB for YCB_Video), and synthetic data generation, which can be time-consuming and resource-intensive and makes quick prototyping harder.
Relies on depth data from RGB-D cameras, so it's not applicable to standard RGB-only video streams, restricting use in environments without depth sensors.