An open data lakehouse platform for incremental data processing with upserts, deletes, and time-travel queries.
Apache Hudi is an open data lakehouse platform that provides transactional capabilities for big data workloads. It enables efficient upserts, deletes, and incremental data processing on data lakes, allowing users to build real-time data pipelines with time-travel and change data capture features.
Data engineers and platform teams building and managing large-scale data lakes or lakehouses, particularly those needing incremental processing, ACID transactions, and efficient data management on cloud storage.
Developers choose Hudi for its ability to bring database-like features (upserts, deletes, transactions) to data lakes, enabling incremental pipelines, reducing ETL complexity, and providing time-travel and CDC capabilities without proprietary lock-in.
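The time-travel capability mentioned above can be sketched with Hudi's Spark datasource read option. This is a minimal illustration, not a full example: the helper name is ours, and the instant value shown in the comment is a placeholder commit timestamp.

```python
def time_travel_options(as_of_instant: str) -> dict:
    """Read options pinning a Hudi query to a past commit instant.

    Hudi's Spark datasource accepts "as.of.instant" with a commit
    timestamp (e.g. "20240101123045000") to query the table as of
    that point in time.
    """
    return {"as.of.instant": as_of_instant}

# Usage sketch (requires a Spark session with the Hudi bundle; path is hypothetical):
# spark.read.format("hudi").options(**time_travel_options("20240101123045000")).load("/data/hudi/trips")
```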
Upserts, Deletes And Incremental Processing on Big Data.
Hudi's built-in indexing enables fast record-level upserts and deletes, bringing database-like transactional operations to data lake storage.
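An upsert is expressed through Hudi's Spark datasource write options. The sketch below builds the minimal option set and wraps the write call; the table name, record key, and path are hypothetical, and running the write itself requires a Spark session with the Hudi bundle on the classpath.

```python
def hudi_upsert_options(table_name: str, record_key: str, precombine_field: str) -> dict:
    """Minimal option set for a Hudi upsert write via the Spark datasource.

    The record key drives Hudi's index lookup; when two incoming records
    share a key, the one with the larger precombine field value wins.
    """
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.operation": "upsert",
    }

def upsert(df, path: str, opts: dict) -> None:
    """Upsert a Spark DataFrame into the Hudi table at `path` (needs a live Spark session)."""
    df.write.format("hudi").options(**opts).mode("append").save(path)
```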
Incremental queries process only the data that changed since a given point in time, reducing ETL complexity and improving pipeline efficiency.
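An incremental read is just a snapshot read with two extra datasource options: the query type and a begin instant. A minimal sketch, with the helper names and path being illustrative:

```python
def hudi_incremental_options(begin_instant: str) -> dict:
    """Read options that return only records changed after `begin_instant`.

    `begin_instant` is a Hudi commit timestamp; commits at or before it
    are excluded from the result.
    """
    return {
        "hoodie.datasource.query.type": "incremental",
        "hoodie.datasource.read.begin.instanttime": begin_instant,
    }

def read_incremental(spark, path: str, begin_instant: str):
    """Load only the changes since `begin_instant` (needs a live Spark session)."""
    return (
        spark.read.format("hudi")
        .options(**hudi_incremental_options(begin_instant))
        .load(path)
    )
```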
Automatic compaction, clustering, and cleaning services with configurable scheduling keep data layout and storage optimized.
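These table services are driven by write-time configuration. The sketch below assembles an illustrative, non-exhaustive subset of the `hoodie.*` options: inline compaction for merge-on-read tables and automatic cleaning of old file versions; the default values here are our own choices, not Hudi's defaults.

```python
def table_service_options(max_delta_commits: int = 5, commits_retained: int = 10) -> dict:
    """Illustrative table-service settings for a Hudi write.

    Compaction merges row-based delta logs into columnar base files every
    `max_delta_commits` commits; the cleaner removes file versions older
    than the last `commits_retained` commits.
    """
    return {
        "hoodie.compact.inline": "true",
        "hoodie.compact.inline.max.delta.commits": str(max_delta_commits),
        "hoodie.clean.automatic": "true",
        "hoodie.cleaner.commits.retained": str(commits_retained),
    }
```

These options are merged into the same options dict passed to the Hudi write, alongside the table name and record key settings.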
Hudi works seamlessly with Apache Spark, Apache Flink, and a range of query engines such as Trino, Presto, and Hive, allowing flexible data processing across platforms.
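Because a Hudi table is files plus metadata on shared storage, the same table can be queried by different engines; from Spark, the view of the table is selected by a query-type option. A sketch of the three query types Hudi's Spark datasource understands (the helper is ours):

```python
# Query types accepted by Hudi's Spark datasource: "snapshot" (latest merged
# state), "incremental" (changes since an instant), and "read_optimized"
# (base files only, for merge-on-read tables).
QUERY_TYPES = {"snapshot", "incremental", "read_optimized"}

def hudi_read_options(query_type: str = "snapshot") -> dict:
    """Build the query-type option for a Hudi read, validating the value."""
    if query_type not in QUERY_TYPES:
        raise ValueError(f"unknown Hudi query type: {query_type}")
    return {"hoodie.datasource.query.type": query_type}
```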
Building from source requires specific Java versions, Maven profiles for different Spark/Flink versions, and detailed configuration, which can be daunting and error-prone for new users.
Primarily designed for JVM-based frameworks like Spark and Flink, limiting ease of adoption for teams preferring lightweight, non-JVM ecosystems without additional integration work.
The indexing and transactional features add computational and storage overhead that may be unnecessary for append-only or low-throughput pipelines, impacting cost-effectiveness.