A connector that enables Apache Spark to read from and write to Apache Cassandra databases for distributed data processing.
The Apache Cassandra Spark Connector is a library that lets Apache Spark read data from and write data to Apache Cassandra databases. It integrates distributed data processing with scalable NoSQL storage by exposing Cassandra tables as Spark RDDs and DataFrames, so data can move between the two systems directly.
It is aimed at data engineers and developers building analytics pipelines that process large-scale data stored in Cassandra using Spark's distributed computing capabilities.
Developers choose this connector because it provides native integration between Spark and Cassandra with optimized data type conversions, server-side filtering, and efficient join operations, enabling high-performance analytics on Cassandra data without complex ETL processes.
Apache Spark to Apache Cassandra connector
Exposes Cassandra tables directly as Spark RDDs and DataFrames, enabling distributed processing without manual ETL, as highlighted in the README's key features.
Handles all Cassandra data types, including collections and vectors, mapping them accurately between Cassandra and Spark, with recent updates adding support for AI and retrieval-augmented generation (RAG) data.
Filters data on Cassandra servers using CQL WHERE clauses, reducing network overhead and improving performance, as specified in the server-side filtering feature.
Supports DataFrames API in Python and R, broadening usability beyond Scala, making it accessible for diverse development teams.
The connector has multiple branches for different Spark and Cassandra versions, leading to confusion and maintenance challenges during upgrades, as shown in the compatibility table.
Integration tests require installing CCM (Cassandra Cluster Manager) and applying specific configurations, adding complexity to development and deployment, as noted in the Testing section.
While multi-language support exists, the primary APIs and documentation are Scala-focused, which may make the connector harder to pick up for Java or Python developers.
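For Python users, the DataFrame access described above can be sketched roughly as follows. This is a minimal sketch, not the connector's own example: it assumes the connector package is already on the Spark classpath and that `spark` is an existing SparkSession configured with `spark.cassandra.connection.host`. The helper names (`cassandra_options`, `read_table`, `append_to_table`) are hypothetical and introduced only for illustration; the data source name and the `keyspace`/`table` options are the ones the connector uses for DataFrames.

```python
from typing import Dict

# Data source name the connector registers for the DataFrame API.
CASSANDRA_FORMAT = "org.apache.spark.sql.cassandra"


def cassandra_options(keyspace: str, table: str) -> Dict[str, str]:
    """Build the option map that identifies one Cassandra table (hypothetical helper)."""
    if not keyspace or not table:
        raise ValueError("keyspace and table must be non-empty")
    return {"keyspace": keyspace, "table": table}


def read_table(spark, keyspace: str, table: str):
    """Load a Cassandra table as a DataFrame.

    `spark` is assumed to be an existing SparkSession with the connector available.
    """
    return (
        spark.read.format(CASSANDRA_FORMAT)
        .options(**cassandra_options(keyspace, table))
        .load()
    )


def append_to_table(df, keyspace: str, table: str) -> None:
    """Append the rows of a DataFrame to an existing Cassandra table."""
    (
        df.write.format(CASSANDRA_FORMAT)
        .options(**cassandra_options(keyspace, table))
        .mode("append")
        .save()
    )
```

With a live cluster, a call such as `read_table(spark, "analytics", "events").filter("user_id = 42")` (keyspace, table, and column names are placeholders) would return a filtered DataFrame. Whether that predicate is actually evaluated on the Cassandra side depends on the table's keys and indexes, so it is worth checking the query plan rather than assuming pushdown.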