Open-Awesome

Gobblin from LinkedIn

Apache-2.0 · Java · gobblin_0.11.0

A distributed data integration framework for big data ecosystems, handling ingestion, replication, organization, and lifecycle management for both streaming and batch data.

2.3k stars · 749 forks · 0 contributors

What is Gobblin from LinkedIn?

Apache Gobblin is a distributed data integration framework that simplifies big data integration tasks such as data ingestion, replication, organization, and lifecycle management. It supports both streaming and batch data ecosystems, enabling seamless data movement across heterogeneous sources and sinks like HDFS, S3, Kafka, and external APIs. The framework is optimized for ELT patterns and handles petabyte-scale data workflows in production environments.

Target Audience

Data engineers and architects working in big data ecosystems who need to manage data integration, replication, and lifecycle tasks across diverse data sources and storage systems. It is particularly suited for organizations with large-scale data lakes and compliance requirements.

Value Proposition

Developers choose Apache Gobblin for its battle-tested scalability, support for both stream and batch execution, and comprehensive feature set including fault tolerance, data quality checking, and a control plane for orchestration. It integrates with existing data systems without replacing them, focusing specifically on data integration and management tasks.

Overview

A distributed data integration framework that simplifies common aspects of big data integration such as data ingestion, replication, organization and lifecycle management for both streaming and batch data ecosystems.

Use Cases

Best For

  • Streaming and batch ingestion from Kafka to data lakes like HDFS or S3
  • Bulk-loading serving stores from data lakes (e.g., HDFS to Couchbase)
  • Synchronizing data across federated data lakes (e.g., HDFS to S3)
  • Integrating external vendor APIs (e.g., Salesforce) with data stores
  • Enforcing data retention policies and GDPR compliance deletions
  • Managing data lifecycle and organization in heterogeneous data ecosystems
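As a rough illustration of the retention use case listed above, the sketch below selects paths whose last-modified time falls outside a retention window. It is plain Java and does not use Gobblin's retention API; the class name `RetentionSketch`, the method `selectExpired`, and the example paths are hypothetical names chosen for this example.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: a time-based retention check, not Gobblin's API. */
public class RetentionSketch {

    /**
     * Returns the paths whose last-modified time is strictly older than
     * (now - retention). The caller would then delete or archive them.
     */
    public static List<String> selectExpired(Map<String, Instant> lastModifiedByPath,
                                             Duration retention,
                                             Instant now) {
        Instant cutoff = now.minus(retention);
        List<String> expired = new ArrayList<>();
        for (Map.Entry<String, Instant> entry : lastModifiedByPath.entrySet()) {
            if (entry.getValue().isBefore(cutoff)) {
                expired.add(entry.getKey());
            }
        }
        return expired;
    }

    public static void main(String[] args) {
        // Hypothetical dataset partitions with their last-modified times.
        Map<String, Instant> partitions = new LinkedHashMap<>();
        partitions.put("/data/events/2023-12-01", Instant.parse("2023-12-01T00:00:00Z"));
        partitions.put("/data/events/2024-01-30", Instant.parse("2024-01-30T00:00:00Z"));

        Instant now = Instant.parse("2024-01-31T00:00:00Z");
        System.out.println(selectExpired(partitions, Duration.ofDays(30), now));
    }
}
```

Passing `now` in explicitly, rather than calling `Instant.now()` inside the method, keeps the selection deterministic and easy to test.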

Not Ideal For

  • Projects requiring complex data transformations or real-time analytics, as Gobblin focuses on ELT with inline transformations and delegates heavy processing to systems like Spark.
  • Teams needing a general-purpose workflow orchestrator for non-data integration tasks, since Gobblin is not designed for broad workflow execution like Airflow.
  • Small-scale data integration use cases where simpler ETL tools or scripts would be more cost-effective and easier to manage.

Pros & Cons

Pros

Proven at Scale

Runs in production at petabyte scale at companies such as LinkedIn and PayPal, according to the README, which is evidence of reliability in large deployments.

Stream and Batch Support

Supports both stream and batch execution modes, which covers patterns such as continuous Kafka ingestion and bulk loading of serving stores.

Comprehensive Data Management

Handles ingestion, organization, lifecycle, and compliance tasks like GDPR deletions, providing end-to-end data workflow solutions.

Fault Tolerance Features

Includes task partitioning, state management for incremental processing, and atomic data publishing, ensuring robustness in heterogeneous ecosystems.
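To make the idea of atomic data publishing concrete, here is a minimal sketch of the general staging-then-move pattern, assuming a local filesystem where the staging and output directories sit on the same file store. It is not Gobblin's publisher implementation; the class name `AtomicPublishSketch` and the method `publish` are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;

/** Illustrative only: the staging-then-move pattern, not Gobblin's API. */
public class AtomicPublishSketch {

    /**
     * Writes all records to a file in the staging directory, then moves the
     * finished file into the output directory in one step, so readers of the
     * output directory never observe a partially written file.
     */
    public static Path publish(Path stagingDir, Path outputDir,
                               String fileName, List<String> records) throws IOException {
        Files.createDirectories(stagingDir);
        Files.createDirectories(outputDir);

        // Step 1: write the complete file where consumers do not look.
        Path staged = stagingDir.resolve(fileName);
        Files.write(staged, records, StandardCharsets.UTF_8);

        // Step 2: move it into place. ATOMIC_MOVE throws
        // AtomicMoveNotSupportedException if the file system cannot do this
        // atomically (for example, across different file stores).
        Path published = outputDir.resolve(fileName);
        return Files.move(staged, published, StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("publish-sketch");
        Path out = publish(root.resolve("staging"), root.resolve("output"),
                "part-0.txt", List.of("record-1", "record-2"));
        System.out.println("Published " + out);
    }
}
```

If the write in step 1 fails, nothing has reached the output directory, so a retry can overwrite the staged file without leaving partial data visible to consumers.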

Cons

Complex Setup Process

Building from source requires downloading the Gradle wrapper manually and using specific Java and Maven versions, per the build instructions, which adds initial overhead.

Limited Transformation Engine

The README states that Gobblin is not a general-purpose data transformation engine, so complex ETL requires integration with external systems such as Spark.

Steep Learning Curve

Configuring and managing diverse data sources and sinks can be challenging due to the framework's extensive feature set and heterogeneous support.

Quick Stats

Stars: 2,264
Forks: 749
Contributors: 0
Open Issues: 0
Last commit: 2 days ago
Created: Since 2014

Tags

#stream-processing #apache #batch-processing #replication #data-integration #data-replication #elt #apache-project #data-lake #management #big-data #data #data-ingestion

Built With

  • Java
  • Docker
  • Gradle

Links & Resources

Website

Included in

Hadoop · 1.1k
Auto-fetched 1 day ago