Question 1

How do I set up Nessie with Apache Spark for data processing?

Accepted Answer

To integrate Nessie with Spark, register an Iceberg catalog that is backed by Nessie. Set 'spark.sql.catalog.nessie' to 'org.apache.iceberg.spark.SparkCatalog' and 'spark.sql.catalog.nessie.catalog-impl' to 'org.apache.iceberg.nessie.NessieCatalog', then point the catalog at your server with the 'uri', 'ref', and 'warehouse' properties under the same prefix. Detailed steps are available in the Nessie documentation for Spark via Iceberg, which includes examples for different Spark versions.
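As a rough illustration, the properties can be assembled in one place and then passed to Spark. This is a minimal sketch, not the official setup: the helper names, the server URI, and the warehouse path are assumptions for a local test instance, and the matching Iceberg and Nessie runtime jars still have to be on the Spark classpath.

```python
def nessie_spark_conf(uri, warehouse, ref="main", catalog="nessie"):
    """Return Spark properties that register an Iceberg catalog backed by Nessie.

    `catalog` is the catalog name used in SQL (e.g. nessie.db.table);
    `ref` is the Nessie branch or tag the session starts on.
    """
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.nessie.NessieCatalog",
        f"{prefix}.uri": uri,
        f"{prefix}.ref": ref,
        f"{prefix}.warehouse": warehouse,
    }


def as_submit_args(conf):
    """Turn the property dict into `--conf key=value` arguments for spark-submit."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


# Assumed values for a local test server; replace with your own.
conf = nessie_spark_conf("http://localhost:19120/api/v1", "/tmp/warehouse")
for arg in as_submit_args(conf):
    print(arg)
```

The same dictionary can be applied to a SparkSession builder by calling its config method once per key and value.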

Question 2

Nessie vs Delta Lake: which is better for data versioning?

Accepted Answer

Nessie provides Git-like semantics with branching and merging at the catalog level, ideal for collaborative data development across multiple tables. Delta Lake offers table-level versioning with time travel but lacks full branching capabilities. Choose Nessie if you need complex version control workflows similar to Git; Delta Lake may suffice for simpler, table-specific versioning.
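To make the Git-like workflow concrete, here is a sketch of the statements a Spark session might run when the Nessie SQL extensions are enabled. The helper name, the branch name 'etl_dev', and the catalog name 'nessie' are assumptions for illustration; check the statement syntax against the Nessie version you run.

```python
def branch_workflow(branch, catalog="nessie", base="main"):
    """Return the SQL statements for a create / switch / merge cycle.

    Table writes would run between the USE REFERENCE and MERGE BRANCH
    statements, isolated on the branch until the merge.
    """
    return [
        f"CREATE BRANCH IF NOT EXISTS {branch} IN {catalog} FROM {base}",
        f"USE REFERENCE {branch} IN {catalog}",
        f"MERGE BRANCH {branch} INTO {base} IN {catalog}",
    ]


statements = branch_workflow("etl_dev")
for sql in statements:
    print(sql)  # with a live session: spark.sql(sql)
```

Because the merge happens at the catalog level, changes to several tables made on the branch become visible on the base branch together.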

Question 3

Can I use Nessie with Hive if I'm not using Iceberg tables?

Accepted Answer

No, Nessie's integration with Hive is specifically through Iceberg, so you need to configure Hive to work with Iceberg tables to leverage Nessie. If you're using other Hive table formats, Nessie won't provide version control features without migrating to Iceberg.

Question 4

How to enable authentication in a self-hosted Nessie Docker instance?

Accepted Answer

Enable authentication by setting environment variables when running the Docker image, such as 'NESSIE_SERVER_AUTHENTICATION_ENABLED=true', 'QUARKUS_OIDC_CLIENT_ID', and 'QUARKUS_OIDC_AUTH_SERVER_URL'. This configures OpenID Connect for bearer token validation, as detailed in the README's authentication section.
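A small sketch of how those variables might be passed to the container. The image name, port, client ID, and auth server URL below are assumptions for illustration; substitute the values for your registry and identity provider.

```python
import shlex


def nessie_docker_command(client_id, auth_server_url,
                          image="ghcr.io/projectnessie/nessie", port=19120):
    """Build a `docker run` argument list with the authentication variables set."""
    env = {
        "NESSIE_SERVER_AUTHENTICATION_ENABLED": "true",
        "QUARKUS_OIDC_CLIENT_ID": client_id,
        "QUARKUS_OIDC_AUTH_SERVER_URL": auth_server_url,
    }
    cmd = ["docker", "run", "--rm", "-p", f"{port}:{port}"]
    for name, value in env.items():
        cmd += ["-e", f"{name}={value}"]
    cmd.append(image)
    return cmd


# Hypothetical identity provider settings.
cmd = nessie_docker_command("nessie-client", "https://idp.example.com/realms/demo")
print(shlex.join(cmd))  # paste into a shell, or pass `cmd` to subprocess.run
```

Building the command as a list rather than a single string avoids quoting problems if a value contains spaces or shell metacharacters.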

Question 5

What are the performance impacts of using Nessie in a data lake?

Accepted Answer

Nessie adds a transactional layer, so every commit carries some versioning overhead that can show up as added latency, especially in high-throughput scenarios. It is built to scale with distributed systems, but actual performance depends on factors such as network latency and the underlying storage, so benchmark with your own workload before drawing conclusions.
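One way to run such a benchmark is a small timing harness around the operation you care about. This is a generic sketch, not a Nessie tool: 'measure_latency' and 'placeholder_commit' are hypothetical names, and the placeholder only sleeps so the example runs without a server. Swap in a real write or commit against your catalog.

```python
import math
import statistics
import time


def measure_latency(operation, runs=50, warmup=5):
    """Time `operation` repeatedly and summarize latency in milliseconds."""
    for _ in range(warmup):
        operation()  # warm caches and connections before measuring
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        operation()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    # Nearest-rank 95th percentile.
    p95_index = min(len(samples) - 1, max(0, math.ceil(0.95 * len(samples)) - 1))
    return {
        "runs": len(samples),
        "median_ms": statistics.median(samples),
        "p95_ms": samples[p95_index],
        "max_ms": samples[-1],
    }


def placeholder_commit():
    """Stand-in for a real commit; replace with your own workload."""
    time.sleep(0.001)


stats = measure_latency(placeholder_commit, runs=20, warmup=2)
print(stats)
```

Comparing the median and 95th percentile with and without Nessie in the path gives a first estimate of the overhead for your workload.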

Question 6

Is Nessie suitable for real-time data pipelines with low latency requirements?

Accepted Answer

Nessie is designed for batch-oriented data lakes and may not be ideal for real-time pipelines that need sub-second latency, because the transactional and versioning overhead adds time to each operation. For real-time use cases, consider a lighter-weight catalog, or measure Nessie against your specific latency thresholds before committing to it.
