A domain-specific language and C++ library for automatically synthesizing high-performance machine learning kernels.
Tensor Comprehensions is a domain-specific language and C++ library for automatically generating high-performance machine learning kernels. It allows researchers and engineers to express tensor operations at a high level and automatically synthesizes optimized GPU or CPU implementations, eliminating the need for manual kernel programming.
Machine learning researchers and engineers working on performance-critical model components, particularly those needing to deploy custom operations across different hardware platforms.
It dramatically reduces the time and expertise required to create optimized kernels, provides framework-agnostic portability, and achieves near-peak hardware performance through autotuning driven by evolutionary search.
A domain-specific language to express machine learning workloads.
Translates high-level tensor operations into optimized GPU/CPU kernels using Halide and ISL, eliminating the need for manual low-level kernel programming.
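To give a flavor of the notation, here is a matrix-multiply kernel sketched in the TC language. It is held as a plain Python string so the snippet stays self-contained; actually compiling it would go through the `tensor_comprehensions` bindings, shown only in comments below as a hedged illustration of the historical PyTorch API.

```python
# A matrix multiplication expressed in Tensor Comprehensions notation.
# `+=!` denotes a reduction that initializes the accumulator; loop
# bounds over m, n, k are inferred from the tensor shapes.
matmul_tc = """
def matmul(float(M, K) A, float(K, N) B) -> (C) {
    C(m, n) +=! A(m, k) * B(k, n)
}
"""

# Illustrative usage via the PyTorch bindings (not executed here):
#   import tensor_comprehensions as tc
#   matmul = tc.define(matmul_tc, name="matmul")
#   C = matmul(A, B)  # synthesizes a kernel for A's and B's sizes
```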
Uses genetic search to automatically find high-performance configurations, achieving up to 80% of peak shared memory bandwidth per the README's performance claims.
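The idea behind the genetic search can be illustrated with a toy sketch: a population of candidate kernel configurations is scored, the fittest survive into the next generation, and the rest are resampled. The tile/unroll parameters and cost function here are hypothetical stand-ins, not TC's actual tuning space.

```python
import random

random.seed(0)  # deterministic for the sketch

# Hypothetical search space of kernel configurations.
TILE_SIZES = [8, 16, 32, 64]
UNROLL_FACTORS = [1, 2, 4, 8]

def cost(cfg):
    # Stand-in for benchmarking a compiled kernel: pretend that
    # tile=32, unroll=4 is the optimum on this hardware.
    tile, unroll = cfg
    return abs(tile - 32) + abs(unroll - 4)

def evolve(generations=20, pop_size=16, elite=4):
    # Random initial population of configurations.
    pop = [(random.choice(TILE_SIZES), random.choice(UNROLL_FACTORS))
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=cost)              # benchmark and rank candidates
        survivors = pop[:elite]         # keep the best configurations
        # Refill the population with fresh random candidates.
        pop = survivors + [
            (random.choice(TILE_SIZES), random.choice(UNROLL_FACTORS))
            for _ in range(pop_size - elite)
        ]
    return min(pop, key=cost)

best = evolve()
```

A real autotuner benchmarks each candidate on the device, which is why tuning takes wall-clock time proportional to population size times generations.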
Works with any tensor library supporting memory operations, with ready integrations for Caffe2 and PyTorch, enabling cross-framework portability.
Compiles kernels on-demand for specific tensor sizes, allowing adaptive optimization and reuse of autotuned options across different problem sizes.
The README explicitly states that 'solid register-level optimizations are still in the work,' meaning peak performance for some operations may not be fully realized.
Requires building against multiple C++ dependencies such as Halide, ISL, and CUDA/LLVM; the documentation points to Conda or Docker to ease installation, indicating a non-trivial setup process.
Autotuning involves generations of evolutionary search, taking significant time (e.g., 27 seconds in the provided example), which can slow down rapid prototyping iterations.
Only integrates directly with Caffe2 and PyTorch; other frameworks require manual effort for tensor management, reducing immediate usability for broader ecosystems.