A distributed data integration framework for big data ecosystems, handling ingestion, replication, organization, and lifecycle management for both streaming and batch data.
Apache Gobblin is a distributed data integration framework designed to simplify common big data integration tasks such as data ingestion, replication, organization, and lifecycle management. It handles both streaming and batch data ecosystems, providing a scalable way to manage structured and byte-oriented data across heterogeneous environments. The framework supports lightweight inline transformations during ingestion (while delegating heavy processing to dedicated engines) and is battle-tested at petabyte scale in production environments.
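To make the ingestion model concrete, a Gobblin job is described by a properties file that wires a source, optional converters, and a writer together. The sketch below follows the Wikipedia getting-started example from the Gobblin documentation; treat the specific class names and paths as illustrative rather than a drop-in config for your environment.

```properties
# Minimal Gobblin job configuration (sketch, based on the Wikipedia example)
job.name=PullFromWikipedia
job.group=Wikipedia
job.description=A getting-started ingestion job

# Source: where records come from
source.class=org.apache.gobblin.example.wikipedia.WikipediaSource
source.page.titles=LinkedIn
source.revisions.cnt=5

# Converters: lightweight inline transformations applied per record
converter.classes=org.apache.gobblin.example.wikipedia.WikipediaConverter
extract.namespace=org.apache.gobblin.example.wikipedia

# Writer: where records land
writer.builder.class=org.apache.gobblin.writer.SimpleDataWriterBuilder
writer.file.path.type=tablename
writer.destination.type=HDFS
writer.output.format=txt

# Publisher: atomically moves staged output to the final location
data.publisher.type=org.apache.gobblin.publisher.BaseDataPublisher
```

The same source/converter/writer pipeline structure applies whether the job runs in batch or streaming execution mode.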
Data engineers and architects working with large-scale data ecosystems who need reliable ingestion, replication, and lifecycle management across diverse data sources and sinks. Organizations with complex data integration requirements across Hadoop, cloud storage, and external APIs will benefit most.
Developers choose Apache Gobblin for its proven scalability in production environments, comprehensive feature set for data management, and flexibility in supporting both stream and batch execution modes. Its ability to handle petabyte-scale workflows while providing fault tolerance, data quality checking, and compliance management makes it a robust alternative to building custom integration solutions.
Battle-tested at petabyte scale by companies such as LinkedIn and PayPal, which the README highlights as evidence of reliability for large data workflows.
Offers end-to-end capabilities, including ingestion, compaction, deduplication, and lifecycle management, covering complex data integration needs described in the README.
Supports both stream and batch execution modes, allowing data processing workflows to adapt to either pattern, as noted in the README highlights.
Includes task partitioning, state management, and atomic data publishing, which improve reliability in distributed environments per the README.
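The reliability features above map to a handful of job properties. The snippet below is a sketch using configuration keys documented by Gobblin; the directory paths are placeholders you would replace with your own.

```properties
# State management: watermarks and task state checkpointed between runs
state.store.dir=/gobblin/state-store

# Atomic publishing: writers stage output, then the publisher moves it
# to the final directory in one step so consumers never see partial data
writer.staging.dir=/gobblin/task-staging
writer.output.dir=/gobblin/task-output
data.publisher.type=org.apache.gobblin.publisher.BaseDataPublisher
data.publisher.final.dir=/data/output

# Task partitioning: cap parallelism when running in MapReduce mode
mr.job.max.mappers=20
```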
Delegates complex data processing to external engines such as Spark, adding a dependency and operational overhead, as the README's 'Apache Gobblin is NOT' section acknowledges.
Building from source requires Gradle and a non-trivial set of instructions, which can be daunting for new users, as the build requirements show.
Best suited to Hadoop-centric or cloud-storage environments; teams running fully cloud-native stacks may face extensive integration work.
Time spent in the Apache Incubator suggests ongoing development and the possibility of breaking changes between releases, which can affect stability for production use.