A distributed data integration framework for big data ecosystems, handling ingestion, replication, organization, and lifecycle management for both streaming and batch data.
Apache Gobblin is a distributed data integration framework that simplifies big data integration tasks such as data ingestion, replication, organization, and lifecycle management. It supports both streaming and batch data ecosystems, enabling seamless data movement across heterogeneous sources and sinks like HDFS, S3, Kafka, and external APIs. The framework is optimized for ELT patterns and handles petabyte-scale data workflows in production environments.
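A typical ingestion flow is declared as a job configuration file rather than code. The sketch below outlines a minimal Kafka-to-HDFS job based on Gobblin's Kafka quickstart; the broker address, topic name, and output paths are placeholder assumptions, and fully qualified class names may differ across Gobblin versions.

```
# Hedged sketch of a Gobblin Kafka-to-HDFS ingestion job (.pull file).
# Broker, topic, and paths below are illustrative placeholders.
job.name=KafkaToHdfsExample
job.group=ExampleGroup
job.description=Pull records from a Kafka topic and publish them to HDFS

# Source: read from Kafka
kafka.brokers=localhost:9092
topic.whitelist=my_topic
source.class=org.apache.gobblin.source.extractor.extract.kafka.KafkaSimpleSource
extract.namespace=org.apache.gobblin.extract.kafka
bootstrap.with.offset=earliest

# Sink: write plain-text files to HDFS, one directory per topic
writer.builder.class=org.apache.gobblin.writer.SimpleDataWriterBuilder
writer.file.path.type=tablename
writer.destination.type=HDFS
writer.output.format=txt

# Publisher moves completed task output to the final directory
data.publisher.type=org.apache.gobblin.publisher.BaseDataPublisher
```

The same job file can be run in standalone, MapReduce, or streaming execution modes; only the launcher settings change, not the source/sink declaration.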
Data engineers and architects working in big data ecosystems who need to manage data integration, replication, and lifecycle tasks across diverse data sources and storage systems. It is particularly suited for organizations with large-scale data lakes and compliance requirements.
Developers choose Apache Gobblin for its battle-tested scalability, support for both stream and batch execution, and comprehensive feature set including fault tolerance, data quality checking, and a control plane for orchestration. It integrates with existing data systems without replacing them, focusing specifically on data integration and management tasks.
A distributed data integration framework that simplifies common aspects of big data integration such as data ingestion, replication, organization and lifecycle management for both streaming and batch data ecosystems.
Battle-tested at petabyte scale in production at companies like LinkedIn and PayPal, as the README highlights, demonstrating reliability for large deployments.
Supports both streaming and batch execution modes, enabling flexible data processing patterns such as continuous Kafka ingestion and periodic bulk loads.
Handles ingestion, organization, lifecycle management, and compliance tasks such as GDPR deletions, covering data workflows end to end.
Includes task partitioning, state management for incremental processing, and atomic data publishing, ensuring robustness in heterogeneous ecosystems.
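The incremental-processing and atomic-publish guarantees above are driven by configuration. The fragment below is a hedged sketch of the relevant settings; all directory paths and the namenode URI are placeholder assumptions, and exact property names may vary by Gobblin version.

```
# Hedged sketch: state store lets each run resume from the previous
# run's watermarks (incremental processing). Paths are placeholders.
state.store.enabled=true
state.store.fs.uri=hdfs://namenode:8020
state.store.dir=/gobblin/state-store

# Tasks write to staging, then output; the publisher moves data to the
# final directory only after all tasks succeed (atomic publishing).
writer.staging.dir=/gobblin/task-staging
writer.output.dir=/gobblin/task-output
data.publisher.type=org.apache.gobblin.publisher.BaseDataPublisher
data.publisher.final.dir=/gobblin/published
```

Because intermediate data never lands in the final directory until the job commits, downstream consumers see either the complete new dataset or none of it.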
Building from source requires manually downloading the Gradle wrapper and using specific Java/Maven versions, as the build instructions note, adding initial setup overhead.
The README notes that Gobblin is not a general-purpose data transformation engine, so complex ETL requires integrating external systems such as Spark.
Configuring and managing diverse sources and sinks can be challenging, given the framework's extensive feature set and the heterogeneity of the systems it supports.