Is Deep Spark still maintained?

No, Deep Spark was deprecated in 2015 and is no longer actively developed or supported. For current projects, consider alternative connectors like Spark's native ones or community-maintained libraries.

How to connect Spark to Cassandra with Deep?

Deep provides configuration objects and factory methods to create RDDs mapped to Cassandra tables, using either entity annotations or the cell API, as shown in the first steps with Spark and Cassandra section.
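The "configuration object plus factory method" pattern described above can be sketched in plain Java. This is a minimal illustration only: every name here (ConfigFactorySketch, TableConfig, Builder, forEntity, forCells, Tweet) is hypothetical and is not Deep's actual API, and the host, port, keyspace, and table values are made-up examples. For the real classes, consult the project README.

```java
import java.util.Objects;

public class ConfigFactorySketch {

    /** Hypothetical immutable description of one table to expose to Spark. */
    static final class TableConfig {
        final String host;
        final int port;
        final String keyspace;
        final String table;
        final Class<?> entityType; // null means "use generic cells"

        private TableConfig(Builder b) {
            this.host = b.host;
            this.port = b.port;
            this.keyspace = b.keyspace;
            this.table = b.table;
            this.entityType = b.entityType;
        }

        boolean isEntityBased() {
            return entityType != null;
        }
    }

    /** Hypothetical fluent builder; each setter returns this for chaining. */
    static final class Builder {
        private final Class<?> entityType;
        private String host;
        private int port;
        private String keyspace;
        private String table;

        private Builder(Class<?> entityType) {
            this.entityType = entityType;
        }

        Builder host(String host) { this.host = host; return this; }
        Builder port(int port) { this.port = port; return this; }
        Builder keyspace(String keyspace) { this.keyspace = keyspace; return this; }
        Builder table(String table) { this.table = table; return this; }

        /** Validates the required settings and freezes them into a config. */
        TableConfig build() {
            Objects.requireNonNull(host, "host");
            Objects.requireNonNull(keyspace, "keyspace");
            Objects.requireNonNull(table, "table");
            if (port <= 0) {
                throw new IllegalStateException("port must be positive");
            }
            return new TableConfig(this);
        }
    }

    /** Factory method for the annotation-driven (entity) style. */
    static Builder forEntity(Class<?> entityType) {
        return new Builder(Objects.requireNonNull(entityType, "entityType"));
    }

    /** Factory method for the generic cell style. */
    static Builder forCells() {
        return new Builder(null);
    }

    /** Hypothetical entity class used only for this example. */
    static class Tweet { }

    public static void main(String[] args) {
        TableConfig entityConfig = forEntity(Tweet.class)
                .host("127.0.0.1").port(9042).keyspace("demo").table("tweets")
                .build();
        TableConfig cellConfig = forCells()
                .host("127.0.0.1").port(9042).keyspace("demo").table("tweets")
                .build();
        System.out.println("entity-based: " + entityConfig.isEntityBased());
        System.out.println("cell-based: " + !cellConfig.isEntityBased());
    }
}
```

In a design like this, the resulting config object would be handed to a context object that builds the RDD, so the choice between entities and cells is made once, at configuration time.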

Deep Spark vs DataStax Spark Cassandra connector?

Deep Spark supports multiple datastores beyond Cassandra, but it's deprecated and less optimized than specialized connectors like DataStax's, which are actively maintained and support newer Spark versions.

What databases does Deep Spark support?

It connects to Cassandra, MongoDB, Elasticsearch, Aerospike, HDFS, S3, and any JDBC-compatible database, as listed in the key features and README.

How to install Deep Spark from source?

Clone the repository, manually install dependencies such as the Oracle JDBC driver, compile with Maven from the deep-parent directory, and run the distribution script, as detailed in the compiling and distribution sections.

Can Deep Spark handle Elasticsearch 7.x?

No, Deep Spark only supports Elasticsearch 1.3.0+, so it's incompatible with newer versions like 7.x, limiting its use in modern data stacks.

Should I migrate from Deep Spark to newer alternatives?

Yes. For most use cases, Spark's native connectors or community projects are better choices, though they may lack Deep's unified API. Migration involves rewriting integration code to use the updated libraries.

Open-Awesome

Deep Spark

Apache-2.0 · Java

A thin integration layer connecting Apache Spark with various NoSQL datastores and JDBC databases.


197 stars · 43 forks · 0 contributors

What is Deep Spark?

Deep (Deep-Spark) is an open-source integration layer that connects Apache Spark with various NoSQL datastores and JDBC databases. It enables Spark to read from and write to systems like Cassandra, MongoDB, Elasticsearch, and Aerospike, providing a unified API for data processing across heterogeneous sources. The project simplifies big data workflows by abstracting connector complexities and allowing developers to work with Spark RDDs directly mapped to database entities.

Target Audience

Data engineers and developers working with Apache Spark who need to integrate multiple data stores (NoSQL and SQL) into their Spark processing pipelines. It is particularly useful for teams managing polyglot persistence environments.

Value Proposition

Deep offers a single, consistent API for multiple datastores, reducing the need for custom connectors. Its dual interface (ORM and cell-based) provides flexibility for both structured and schema-less data, and it optimizes data fetching to leverage Spark's distributed processing capabilities efficiently.

Overview

Connecting Apache Spark with different data stores [DEPRECATED]

Use Cases

Best For

Processing data from Cassandra column families in Spark applications
Integrating MongoDB collections with Spark for analytical workloads
Connecting Elasticsearch indices to Spark for data transformation
Using Aerospike as a data source or sink for Spark jobs
Accessing JDBC databases through Spark for ETL pipelines
Unifying data access across multiple NoSQL stores in a single Spark job

Not Ideal For

Projects using Spark 2.x or later, as Deep only supports Spark 1.1.1
Teams needing active maintenance or security updates, as the project has been deprecated since 2015
Simple ETL tasks with single data sources where Spark's built-in connectors suffice
Environments with modern database versions (e.g., Elasticsearch 7.x, Cassandra 3.x) not listed in requirements

Pros & Cons

Pros

Unified Multi-Datastore API

Provides a single API for Cassandra, MongoDB, Elasticsearch, Aerospike, and JDBC sources, reducing connector complexity, as per the key features listing.

Flexible Data Access Models

Offers both an ORM-like entity API with annotation-driven mapping and a generic cell API for schema-less data, detailed in the Cassandra integration section.
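The difference between the two access models can be sketched with plain Java reflection. This is an illustration under stated assumptions, not Deep's implementation: the Column annotation, the Tweet entity, the Cell class, and the toEntity and toCells helpers are all hypothetical names, and a row is modeled as a simple Map from column name to value.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Constructor;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DualAccessSketch {

    /** Hypothetical annotation: binds a field to a named column. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface Column {
        String value();
    }

    /** Hypothetical entity for the annotation-driven (ORM-like) style. */
    static class Tweet {
        @Column("tweet_id") long id;
        @Column("author") String author;
    }

    /** Hypothetical generic cell for the schema-less style. */
    static final class Cell {
        final String name;
        final Object value;

        Cell(String name, Object value) {
            this.name = name;
            this.value = value;
        }
    }

    /** Entity style: copy each annotated field from the matching column. */
    static <T> T toEntity(Map<String, Object> row, Class<T> type) {
        try {
            Constructor<T> ctor = type.getDeclaredConstructor();
            ctor.setAccessible(true);
            T entity = ctor.newInstance();
            for (Field field : type.getDeclaredFields()) {
                Column column = field.getAnnotation(Column.class);
                if (column != null && row.containsKey(column.value())) {
                    field.setAccessible(true);
                    field.set(entity, row.get(column.value()));
                }
            }
            return entity;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot map row to " + type, e);
        }
    }

    /** Cell style: no schema class needed, every column becomes a cell. */
    static List<Cell> toCells(Map<String, Object> row) {
        List<Cell> cells = new ArrayList<>();
        for (Map.Entry<String, Object> entry : row.entrySet()) {
            cells.add(new Cell(entry.getKey(), entry.getValue()));
        }
        return cells;
    }

    public static void main(String[] args) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("tweet_id", 42L);
        row.put("author", "alice");

        Tweet tweet = toEntity(row, Tweet.class);
        System.out.println(tweet.id + " by " + tweet.author);

        for (Cell cell : toCells(row)) {
            System.out.println(cell.name + " = " + cell.value);
        }
    }
}
```

The entity style gives typed fields at the cost of declaring a class per table, while the cell style works for any row shape but leaves type handling to the caller.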

Optimized Spark Integration

Creates Spark RDDs directly mapped to databases and optimizes data fetching to leverage Spark's computational capabilities, as mentioned in key features.

Comprehensive Example Suite

Includes working Java and Scala examples for all supported datastores in the deep-examples subproject, easing onboarding.

Cons

Deprecated and Unmaintained

The project was deprecated in 2015, so it receives no bug fixes, updates, or support for newer technologies, as stated in the README disclaimer.

Complex and Manual Setup

Requires compiling from source, manually installing dependencies such as the Oracle JDBC driver, and running distribution scripts, as outlined in the installation steps.

Limited Version Compatibility

Only supports specific old versions like Spark 1.1.1 and databases such as Elasticsearch 1.3.0+, making it incompatible with modern stacks.

Ecosystem Lock-in Risk

Encourages use of Stratio's platform (e.g., their VM) and tools, potentially increasing dependency and reducing portability.


Related Projects

PyHive

Python interface to Hive and Presto. 🐝

Stars: 1,696 · Forks: 546 · Last commit: 3 months ago

Substation

Substation is a toolkit for routing, normalizing, and enriching security event and audit logs.

Stars: 403 · Forks: 35 · Last commit: 6 months ago

Delight

A Spark UI and Spark History Server alternative with CPU and Memory metrics! Delight is free, cross-platform, and open-source.

Stars: 345 · Forks: 58 · Last commit: 2 years ago

Hivemall

Mirror of Apache Hivemall (incubating)

Stars: 313 · Forks: 111 · Last commit: 3 years ago
