A bi-directional connector enabling Apache Spark to read from and write to Neo4j graph databases using Spark DataSource APIs.
Neo4j Connector for Apache Spark is an open-source library that enables bi-directional data transfer between Apache Spark and Neo4j graph databases. It allows users to read graph data from Neo4j into Spark DataFrames for distributed processing and write processed results back to Neo4j. This solves the problem of integrating graph database operations with large-scale data analytics pipelines.
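The round trip described above might look roughly like the following PySpark sketch. It assumes the connector is already on the Spark classpath; the helper names (`connection_options`, `read_nodes`, `write_nodes`) are hypothetical and exist only for illustration, while the format name and option keys are the ones the connector documents.

```python
# Sketch only: helper names are hypothetical, and the URL, credentials,
# and labels passed in by the caller are placeholders.
NEO4J_FORMAT = "org.neo4j.spark.DataSource"  # the connector's DataSource name


def connection_options(url, user, password):
    """Build the option map shared by reads and writes."""
    return {
        "url": url,
        "authentication.basic.username": user,
        "authentication.basic.password": password,
    }


def read_nodes(spark, conn, label):
    """Load nodes carrying `label` from Neo4j into a Spark DataFrame."""
    return (
        spark.read.format(NEO4J_FORMAT)
        .options(**conn)
        .option("labels", label)
        .load()
    )


def write_nodes(df, conn, label, key_column):
    """Write DataFrame rows back to Neo4j as nodes keyed on `key_column`."""
    (
        df.write.format(NEO4J_FORMAT)
        .mode("overwrite")  # overwrite mode needs node.keys to match existing nodes
        .options(**conn)
        .option("labels", label)
        .option("node.keys", key_column)
        .save()
    )
```

With a running `SparkSession` named `spark`, a call such as `read_nodes(spark, connection_options("neo4j://localhost:7687", "neo4j", "secret"), "Person")` would return a DataFrame that can be transformed with ordinary Spark operations and then handed to `write_nodes`. The URL, credentials, and label here are placeholders.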
The connector is aimed at data engineers, data scientists, and developers who use both Apache Spark for big data processing and Neo4j for graph data storage, particularly those building ETL pipelines or performing graph analytics at scale.
Developers choose this connector because it provides a standardized, efficient way to integrate Neo4j with Spark's ecosystem using the DataSource API, eliminating the need for custom integration code. It supports multiple Spark versions and Scala variants, ensuring compatibility with existing Spark deployments.
Reads Neo4j data into Spark DataFrames and writes processed results back, supporting ETL pipelines and graph analytics workflows.
Uses Spark's DataSource API for consistent data access, so it plugs into existing Spark applications without custom integration code.
Compatible with Spark 3.x and with Scala 2.12 and 2.13, as shown in the project's build instructions and integration examples.
Can be integrated via JAR files, Spark Packages, or dependency managers like Maven and sbt, simplifying setup across different environments, as detailed in the README.
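As a rough illustration of how the Scala variant enters the dependency coordinate, the sketch below assembles a Maven-style coordinate string. The `connector_coordinate` helper is hypothetical, the `_for_spark_3` suffix pattern is an assumption based on the project's published Spark 3 artifacts, and `X.Y.Z` stands in for a real release number; the README should be checked for the exact coordinate.

```python
def connector_coordinate(scala_version, connector_version):
    """Return a Maven-style coordinate for the Spark 3 build of the connector.

    Hypothetical helper: the naming pattern is an assumption and should be
    verified against the project's README before use.
    """
    if scala_version not in ("2.12", "2.13"):
        raise ValueError(f"unsupported Scala version: {scala_version}")
    # The Scala binary version is part of the artifact name.
    artifact = f"neo4j-connector-apache-spark_{scala_version}"
    return f"org.neo4j:{artifact}:{connector_version}_for_spark_3"
```

A string produced this way can be passed to `spark-submit --packages`, or set as the `spark.jars.packages` configuration value when building a `SparkSession`, so that Spark resolves the connector at startup. The Scala suffix has to match the Scala version of the Spark distribution in use.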
Documentation is hosted in a different repository (docs-spark), which can make it harder to access and maintain compared to integrated docs, potentially slowing down troubleshooting.
Artifacts are versioned per Spark and Scala variant (e.g., _2.12 or _2.13), which may lead to dependency conflicts in complex projects and requires careful dependency management.
Transferring data between Neo4j and Spark can introduce latency for large datasets, especially compared to in-memory processing, which might impact real-time or high-throughput use cases.