An experimental Rust client for Apache Spark Connect, providing a DataFrame API to interact with Spark clusters.
spark-connect-rs is an experimental Rust client for Apache Spark Connect that allows Rust developers to interact with Apache Spark clusters using a DataFrame API. It solves the problem of integrating Rust applications with Spark's distributed data processing capabilities, providing a native alternative to existing Spark clients in other languages. The project enables data manipulation, streaming, and SQL operations directly from Rust code.
It is aimed at Rust developers and data engineers who need to integrate Rust applications with Apache Spark for distributed data processing, ETL pipelines, or analytics workloads. It's also relevant for teams exploring Rust's potential in big data ecosystems.
Developers choose spark-connect-rs to leverage Rust's performance and safety features while accessing Spark's distributed computing power. It offers a proof-of-concept for Rust-Spark integration, with a growing API surface that mirrors familiar Spark patterns, though it's not yet production-ready.
Apache Spark Connect Client for Rust
Provides a Rust-native DataFrame API that closely mirrors the PySpark and Spark Scala APIs, with many methods already implemented, such as select, filter, and join; the README tables list the status of each.
Includes DataStreamReader and DataStreamWriter for structured streaming, with key APIs like start, trigger, and output modes marked as done, enabling real-time data processing.
Offers conversions to Polars and DataFusion formats via to_polars and to_datafusion methods, facilitating integration with other popular Rust data tools.
Covers much of the Spark Connect protocol surface, including catalog management, session handling, and multi-format I/O, with numerous functions and data types implemented.
Explicitly labeled as 'highly experimental' and not for production, with missing critical features like UDFs and foreach methods, reducing reliability for real-world use.
Requires non-trivial setup of a Spark Connect server, plus dependencies like cmake and protobuf, making initial configuration more cumbersome than with mature clients.
Many Spark APIs are marked as open or partial, such as UDF registration, mapInPandas, and some window functions, limiting functionality for advanced workflows.
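To make the server setup requirement noted above a little more concrete: Spark Connect clients generally reach the server through a connection string of the form `sc://host:port/;key=value`, with 15002 as the default port. The sketch below is not part of spark-connect-rs and does not use its API. `Endpoint`, `parse_endpoint`, and `DEFAULT_PORT` are hypothetical names introduced only for illustration, and the sample address and `user_id` value are assumptions. It uses only the Rust standard library to show how such a string breaks down, which can help when checking a local configuration.

```rust
use std::collections::HashMap;

/// Parsed pieces of a Spark Connect connection string (hypothetical helper type).
#[derive(Debug, PartialEq)]
struct Endpoint {
    host: String,
    port: u16,
    params: HashMap<String, String>,
}

/// Default Spark Connect port, used when the string omits one.
const DEFAULT_PORT: u16 = 15002;

/// Splits `sc://host[:port]/;key=value;...` into its parts.
/// This is a simplified sketch: IPv6 literals are not handled.
fn parse_endpoint(s: &str) -> Result<Endpoint, String> {
    let rest = s
        .strip_prefix("sc://")
        .ok_or_else(|| format!("expected an sc:// scheme in {s:?}"))?;

    // Everything before the first '/' is host[:port]; parameters follow it.
    let (authority, tail) = match rest.split_once('/') {
        Some((a, t)) => (a, t),
        None => (rest, ""),
    };

    // Fall back to the default port when none is given.
    let (host, port) = match authority.rsplit_once(':') {
        Some((h, p)) => {
            let port = p
                .parse::<u16>()
                .map_err(|e| format!("bad port {p:?}: {e}"))?;
            (h, port)
        }
        None => (authority, DEFAULT_PORT),
    };
    if host.is_empty() {
        return Err("missing host".to_string());
    }

    // Parameters are ';'-separated key=value pairs; empty segments are skipped.
    let mut params = HashMap::new();
    for pair in tail.split(';').filter(|p| !p.is_empty()) {
        let (k, v) = pair
            .split_once('=')
            .ok_or_else(|| format!("parameter {pair:?} is not key=value"))?;
        params.insert(k.to_string(), v.to_string());
    }

    Ok(Endpoint {
        host: host.to_string(),
        port,
        params,
    })
}

fn main() {
    // Assumed local server address and user id, for illustration only.
    match parse_endpoint("sc://127.0.0.1:15002/;user_id=example_rs") {
        Ok(ep) => println!("host={} port={} params={:?}", ep.host, ep.port, ep.params),
        Err(e) => eprintln!("invalid connection string: {e}"),
    }
}
```

Validating the string up front like this is a design convenience rather than a requirement; a real client performs its own parsing when the session is built.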