A thin integration layer connecting Apache Spark with various NoSQL datastores and JDBC databases.
Deep (Deep-Spark) is an open-source integration layer that connects Apache Spark with various NoSQL datastores and JDBC databases. It enables Spark to read from and write to systems like Cassandra, MongoDB, Elasticsearch, and Aerospike, providing a unified API for data processing across heterogeneous sources. The project simplifies big data workflows by abstracting connector complexities and allowing developers to work with Spark RDDs directly mapped to database entities.
Deep is aimed at data engineers and developers who use Apache Spark and need to integrate multiple data stores (NoSQL and SQL) into their processing pipelines. It is particularly useful for teams managing polyglot persistence environments.
Deep offers a single, consistent API for multiple datastores, reducing the need for custom connectors. Its dual interface (ORM and cell-based) provides flexibility for both structured and schema-less data, and it optimizes data fetching to leverage Spark's distributed processing capabilities efficiently.
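To make the dual-interface idea concrete, here is a minimal plain-Java sketch of how one record can be seen both as an annotated entity and as generic cells. It does not use Deep's actual classes: the `Column` annotation, the `Tweet` entity, and the `toCells` helper are hypothetical names invented for illustration, and the real library's annotations and signatures may differ.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.LinkedHashMap;
import java.util.Map;

public class DualApiSketch {

    // Hypothetical annotation (not Deep's real one): marks a field as a mapped column.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    public @interface Column {
        String name();
    }

    // Entity-style view: a typed class whose mapped fields carry annotations.
    public static class Tweet {
        @Column(name = "tweet_id")
        String id;

        @Column(name = "author")
        String author;

        // Not annotated, so it is ignored by the mapping below.
        String scratch = "local only";

        public Tweet(String id, String author) {
            this.id = id;
            this.author = author;
        }
    }

    // Cell-style view: a generic map from column name to value,
    // built by reading the annotated fields through reflection.
    public static Map<String, Object> toCells(Object entity) {
        Map<String, Object> cells = new LinkedHashMap<>();
        for (Field field : entity.getClass().getDeclaredFields()) {
            Column column = field.getAnnotation(Column.class);
            if (column == null) {
                continue; // skip unmapped fields
            }
            try {
                field.setAccessible(true);
                cells.put(column.name(), field.get(entity));
            } catch (IllegalAccessException e) {
                throw new IllegalStateException("Cannot read field " + field.getName(), e);
            }
        }
        return cells;
    }

    public static void main(String[] args) {
        // Prints the two mapped columns; the unannotated field is absent.
        System.out.println(toCells(new Tweet("42", "alice")));
    }
}
```

The entity form gives compile-time types for well-known schemas, while the cell form works when column names are only known at run time, which is roughly the trade-off the two interfaces are meant to cover.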
Connecting Apache Spark with different data stores [DEPRECATED]
Provides a single API for Cassandra, MongoDB, Elasticsearch, Aerospike, and JDBC sources, reducing connector complexity, as per the key features listing.
Offers both an ORM-like entity API with annotation-driven mapping and a generic cell API for schema-less data, detailed in the Cassandra integration section.
Creates Spark RDDs directly mapped to databases and optimizes data fetching to leverage Spark's computational capabilities, as mentioned in key features.
Includes working Java and Scala examples for all supported datastores in the deep-examples subproject, easing onboarding.
Project was deprecated in 2015, meaning no bug fixes, updates, or support for newer technologies, as stated in the README disclaimer.
Requires compiling from source, manually installing dependencies such as the Oracle JDBC driver, and running distribution scripts, as outlined in the installation steps.
Only supports specific old versions like Spark 1.1.1 and databases such as Elasticsearch 1.3.0+, making it incompatible with modern stacks.
Encourages use of Stratio's platform (e.g., their VM) and tools, potentially increasing dependence on Stratio's ecosystem and reducing portability.