How do I read data from Neo4j into Spark using this connector?

Use the connector's DataSource API to name Neo4j as the source of a Spark DataFrame or Spark SQL table, supplying connection details such as the URI and credentials along with what to read (for example, a node label). The documentation lists the full set of configuration options.
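As a minimal PySpark sketch of such a read: the format name `org.neo4j.spark.DataSource` and the `url`, `authentication.basic.*`, and `labels` options follow the connector's documented usage, while the helper names, the `Person` label, and the connection values are placeholders chosen for illustration. It assumes an existing SparkSession with the connector on its classpath and a reachable Neo4j instance.

```python
from typing import Dict

NEO4J_FORMAT = "org.neo4j.spark.DataSource"


def neo4j_read_options(url: str, user: str, password: str, labels: str) -> Dict[str, str]:
    """Collect the options for reading all nodes that carry the given label."""
    return {
        "url": url,                                 # e.g. "neo4j://localhost:7687" (placeholder)
        "authentication.basic.username": user,
        "authentication.basic.password": password,
        "labels": labels,                           # e.g. "Person" (placeholder label)
    }


def read_nodes(spark, options: Dict[str, str]):
    """Load Neo4j nodes into a DataFrame; `spark` is an existing SparkSession."""
    return spark.read.format(NEO4J_FORMAT).options(**options).load()
```

With a session named `spark`, calling `read_nodes(spark, neo4j_read_options("neo4j://localhost:7687", "neo4j", "password", "Person"))` would return a DataFrame with one row per matching node.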

Can I write Spark DataFrames back to Neo4j with this connector?

Yes. The connector is bi-directional: it can write a Spark DataFrame to Neo4j as nodes or relationships, with the DataFrame's columns becoming their properties. Make sure the columns map onto the intended labels, keys, and properties, and use the write options described in the docs.
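A write might be sketched as follows. The option names (`labels`, `node.keys`, `relationship`, `relationship.save.strategy`, and the `relationship.source.*` / `relationship.target.*` keys) are taken from the connector's documented write options; the helper names and any labels or key columns passed in are hypothetical, and the save-mode behavior noted in the comments should be verified against the docs for the version in use.

```python
from typing import Dict

NEO4J_FORMAT = "org.neo4j.spark.DataSource"


def connection_options(url: str, user: str, password: str) -> Dict[str, str]:
    """Connection settings shared by every write."""
    return {
        "url": url,
        "authentication.basic.username": user,
        "authentication.basic.password": password,
    }


def node_write_options(conn: Dict[str, str], labels: str, node_keys: str) -> Dict[str, str]:
    """Options for writing each DataFrame row as one node."""
    # node.keys names the column(s) that identify a node, e.g. "id".
    return {**conn, "labels": labels, "node.keys": node_keys}


def relationship_write_options(conn: Dict[str, str], rel_type: str,
                               source_labels: str, source_keys: str,
                               target_labels: str, target_keys: str) -> Dict[str, str]:
    """Options for writing each DataFrame row as one relationship."""
    return {
        **conn,
        "relationship": rel_type,
        "relationship.save.strategy": "keys",
        "relationship.source.labels": source_labels,
        "relationship.source.node.keys": source_keys,   # "dfColumn:nodeProperty"
        "relationship.target.labels": target_labels,
        "relationship.target.node.keys": target_keys,
    }


def write_to_neo4j(df, options: Dict[str, str], mode: str = "append") -> None:
    """Write `df` (a Spark DataFrame) to Neo4j with the given options."""
    # Assumption to verify in the docs: "overwrite" merges nodes on node.keys,
    # while "append" creates new ones.
    df.write.format(NEO4J_FORMAT).mode(mode).options(**options).save()
```

For example, `write_to_neo4j(people_df, node_write_options(conn, ":Person", "id"), mode="overwrite")` would upsert one `Person` node per row, where `people_df` and its `id` column are illustrative.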

What Spark versions are supported by the Neo4j connector?

It supports Spark 3.x, with builds for both Scala 2.12 and 2.13, as indicated in the building instructions. Check the compatibility guide for the exact versions supported by each release.

Neo4j Spark connector vs. custom integration: which is better?

The connector is the better default: it gives a standardized, maintainable integration through Spark's own APIs and reduces the custom code you have to own. A custom integration only makes sense for highly specialized use cases that the connector's features do not cover.

How do I handle schema mapping between Neo4j and Spark?

The connector infers a schema from the Neo4j data automatically, but complex graph structures may need explicit mapping configuration. Review the documentation for guidance on property types and relationship handling.
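When an inferred column type is not the one you want, one connector-agnostic option is to cast on the Spark side after reading. The sketch below builds `selectExpr` expressions from a column-to-type mapping; the helper names and the example columns are hypothetical, and it assumes only the standard `DataFrame.columns` and `DataFrame.selectExpr` APIs. Column names are backtick-quoted so that metadata columns such as `<id>` survive the expression parser.

```python
from typing import Dict, List


def cast_expressions(columns: List[str], type_overrides: Dict[str, str]) -> List[str]:
    """Build one Spark SQL expression per column, casting only the overridden ones."""
    exprs = []
    for name in columns:
        if name in type_overrides:
            exprs.append(f"CAST(`{name}` AS {type_overrides[name]}) AS `{name}`")
        else:
            exprs.append(f"`{name}`")   # keep the inferred type
    return exprs


def apply_schema_overrides(df, type_overrides: Dict[str, str]):
    """Return `df` (a Spark DataFrame) with the requested columns cast."""
    return df.selectExpr(*cast_expressions(df.columns, type_overrides))
```

For instance, `apply_schema_overrides(df, {"age": "BIGINT"})` leaves every column untouched except `age`, which is cast to a 64-bit integer.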

Is there support for streaming data with Neo4j and Spark?

The material summarized here describes batch reads and writes through DataFrames and does not cover streaming. Check the connector documentation for Structured Streaming support in the version you use before falling back to Spark streaming with custom logic.

neo4j-spark-connector — Bi-Directional Neo4j Spark Connector

What is neo4j-spark-connector?

Neo4j Connector for Apache Spark is an open-source library that enables bi-directional data transfer between Apache Spark and Neo4j graph databases. It allows users to read graph data from Neo4j into Spark DataFrames for distributed processing and write processed results back to Neo4j. This solves the problem of integrating graph database operations with large-scale data analytics pipelines.

Target Audience

Data engineers, data scientists, and developers working with both Apache Spark for big data processing and Neo4j for graph data storage, particularly those building ETL pipelines or performing graph analytics at scale.

Value Proposition

Developers choose this connector because it provides a standardized, efficient way to integrate Neo4j with Spark's ecosystem using the DataSource API, eliminating the need for custom integration code. It supports multiple Spark versions and Scala variants, ensuring compatibility with existing Spark deployments.

In short, the Neo4j Connector for Apache Spark provides bi-directional read/write access to Neo4j from Spark using the Spark DataSource APIs.

Use Cases

Best For

Performing large-scale graph analytics using Spark's distributed computing capabilities
Building ETL pipelines that move data between Neo4j and Spark DataFrames
Integrating Neo4j graph data with machine learning workflows in Spark MLlib
Migrating or syncing data between Neo4j and other data stores via Spark
Running complex graph queries on Neo4j data using Spark SQL
Developing data applications that require both graph database and batch processing capabilities

Not Ideal For

Real-time applications requiring low-latency graph queries without batch processing overhead
Small projects where Neo4j's native Cypher queries suffice and Spark adds unnecessary complexity
Teams working exclusively in non-JVM languages such as Python that cannot use PySpark or take on heavy Scala/Java dependencies
Environments with strict dependency management that conflict with specific Spark or Scala versions

Pros & Cons

Pros

Bi-directional Data Transfer

Reads Neo4j data into Spark DataFrames and writes processed results back, so a single library covers both directions of an ETL pipeline or graph analytics workflow.

Standard Spark Integration

Uses Spark's DataSource API, so Neo4j is read and written with the same access patterns as any other Spark source and fits into existing Spark applications without custom integration code.

Multi-Version Support

Compatible with Spark 3.x and built for both Scala 2.12 and 2.13, as shown in the README's building instructions and integration examples, which gives flexibility across deployments.

Flexible Deployment Options

Can be added as a JAR file, through Spark Packages, or through dependency managers such as Maven and sbt, as detailed in the README, which simplifies setup across different environments.
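As an illustration of the package-based route, the sketch below assembles a Maven coordinate that could be passed to `spark-submit --packages` or set as `spark.jars.packages`. The naming pattern used here (`neo4j-connector-apache-spark_<scala>` with a `<version>_for_spark_3` version string) is an assumption based on how the connector's artifacts have been published; confirm the exact coordinate and version against the README or Maven Central. The function name is hypothetical.

```python
def connector_coordinate(scala_version: str, connector_version: str,
                         spark_major: str = "3") -> str:
    """Build a Maven coordinate for the connector (naming pattern is an assumption)."""
    if scala_version not in ("2.12", "2.13"):
        raise ValueError(f"unsupported Scala version: {scala_version}")
    artifact = f"neo4j-connector-apache-spark_{scala_version}"
    version = f"{connector_version}_for_spark_{spark_major}"
    return f"org.neo4j:{artifact}:{version}"
```

The result would be used as `spark-submit --packages <coordinate> app.py`, or passed to `SparkSession.builder.config("spark.jars.packages", <coordinate>)` before the session is created. The Scala suffix must match the Scala version of the Spark distribution.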

Cons

Separate Documentation

Documentation is hosted in a different repository (docs-spark), which can make it harder to access and maintain compared to integrated docs, potentially slowing down troubleshooting.

Version Complexity

Artifacts are versioned per Spark release and per Scala variant (e.g., _2.12 or _2.13), so the wrong combination can cause dependency conflicts in complex projects. Match them carefully against the compatibility section.

Performance Overhead

Every read and write moves data between Neo4j and Spark, which adds latency for large datasets compared with processing that stays in memory and can hurt real-time or high-throughput use cases.
