TensorFlow binding for Apache Spark DataFrames, enabling TensorFlow program execution on Spark data.
TensorFrames is an experimental library that integrates TensorFlow with Apache Spark DataFrames, so that TensorFlow computations run in a distributed fashion directly on Spark clusters. Users manipulate Spark DataFrames with TensorFlow programs, which bridges distributed data processing and machine learning operations. The project is deprecated, and pandas UDFs are the recommended alternative.
TensorFrames is aimed at data engineers and data scientists working with Apache Spark who need to run TensorFlow computations on distributed datasets within Spark workflows. It is particularly relevant for those using Scala or Python (via PySpark) in Spark 2.4+ environments.
Developers choose TensorFrames because it executes TensorFlow graphs on Spark DataFrames without extensive infrastructure changes, with automatic shape inference and block-wise operations for efficient processing. It connects Spark's distributed data processing to TensorFlow's machine learning capabilities, though it is now deprecated in favor of pandas UDFs.
[DEPRECATED] TensorFlow wrapper for DataFrames on Apache Spark
Provides direct integration between TensorFlow and Apache Spark DataFrames, enabling distributed TensorFlow computations without major infrastructure changes, in line with the philosophy stated in the README.
Offers interfaces for both Scala and Python, with Python support via PySpark, giving flexibility for developers in different ecosystems.
Supports efficient map and reduce operations on DataFrame blocks, optimizing data processing for scalable machine learning tasks.
Infers tensor shapes from DataFrame schemas, reducing manual configuration and potential errors, as demonstrated in the example code.
The project is no longer maintained, and pandas UDFs are recommended as the alternative, so bugs and potential security vulnerabilities will go unfixed.
The README explicitly states 'there are still some areas of low performance,' which can limit efficiency in data-intensive applications.
Officially supports only 64-bit Linux platforms, so deployments on any other operating system or architecture are unsupported.
Requires additional dependencies like protoc and specific Python environments, making compilation and installation cumbersome, as detailed in the developer instructions.
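To make the block-wise map and reduce operations and the shape inference mentioned above more concrete, here is a minimal pure-Python sketch of the idea. It does not use Spark, TensorFlow, or the TensorFrames API; the helper names (`split_into_blocks`, `map_blocks`, `reduce_blocks`, `infer_block_shape`) are hypothetical stand-ins chosen for illustration, and plain Python lists stand in for DataFrame partitions.

```python
from typing import Callable, Iterator, List, Optional, Sequence

Block = List[float]


def split_into_blocks(rows: Sequence[float], block_size: int) -> Iterator[Block]:
    """Yield consecutive chunks of rows (a stand-in for DataFrame partitions)."""
    for start in range(0, len(rows), block_size):
        yield list(rows[start:start + block_size])


def map_blocks(fn: Callable[[Block], Block], blocks: Sequence[Block]) -> List[Block]:
    """Apply fn once per block instead of once per row."""
    return [fn(block) for block in blocks]


def reduce_blocks(fn: Callable[[Block], float], blocks: Sequence[Block]) -> float:
    """Reduce each block to a partial result, then combine the partials with fn."""
    partials = [fn(block) for block in blocks]
    return fn(partials)


def infer_block_shape(column: Sequence) -> List[Optional[int]]:
    """Infer a block shape from the first cell of a column.

    The leading dimension is None because the number of rows per block is
    not known in advance; a vector cell contributes its length.
    """
    first = column[0]
    cell_shape = [len(first)] if isinstance(first, (list, tuple)) else []
    return [None] + cell_shape


# Hypothetical data: one column of ten doubles, processed in blocks of four.
rows = [float(i) for i in range(10)]
blocks = list(split_into_blocks(rows, block_size=4))

# Map: add 3 to every value, one block at a time.
added = map_blocks(lambda block: [x + 3.0 for x in block], blocks)

# Reduce: sum within each block, then sum the per-block partial results.
total = reduce_blocks(sum, added)

print(infer_block_shape(rows))  # shape of a scalar column block
print(total)
```

Working on whole blocks lets a single function call cover many rows, which is the motivation for block-wise operations; the reduce step only works this way when the combining function is associative, as `sum` is here.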