A distributed stream processing framework built on Apache Kafka and Apache Hadoop YARN for fault-tolerant, stateful processing.
Apache Samza is a distributed stream processing framework that enables developers to build stateful, high-throughput applications for real-time data processing. It solves the challenge of processing continuous data streams with strong guarantees around fault tolerance, durability, and scalability by leveraging Apache Kafka for messaging and Apache Hadoop YARN for resource management.
Samza is aimed at data engineers and developers building real-time analytics pipelines, event-driven applications, and stateful stream processing systems that need reliable, scalable processing with managed state.
Developers choose Samza for its simple API, robust managed state capabilities, and seamless integration with the Kafka ecosystem, providing a production-ready framework that handles complex distributed systems concerns like fault tolerance and scalability out of the box.
Mirror of Apache Samza
Provides a straightforward callback-based interface comparable to MapReduce, which keeps it accessible to developers, as highlighted in the README's emphasis on simplicity.
Automatically snapshots processor state and restores it to a consistent snapshot after a failure, and is built for large state on the order of gigabytes per partition, per the key features.
Works with YARN to transparently migrate tasks to another machine when a machine in the cluster fails, offering high availability without manual intervention, as described in the README.
Uses Kafka to process messages in the order they were written to each partition and to ensure no messages are lost, providing durability guarantees out of the box, as specified in the features.
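To make the callback-style interface and managed state described above more concrete, here is a minimal, dependency-free sketch in Java. It is not Samza's actual API: `TaskSketch`, `CountingTask`, `CallbackTaskSketch`, and the `snapshot`/`restore` methods are hypothetical stand-ins, and the in-memory map only imitates the idea of framework-managed state. The sketch assumes messages from one partition arrive in order, as the durability point above describes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a callback-style task interface; not Samza's actual API.
interface TaskSketch {
    // Called once per incoming message; results are appended to output.
    void process(String key, String value, List<String> output);
}

// Counts messages per key, keeping the counts as local task state.
class CountingTask implements TaskSketch {
    private Map<String, Integer> state = new HashMap<>();

    @Override
    public void process(String key, String value, List<String> output) {
        // merge returns the updated count for this key.
        int count = state.merge(key, 1, Integer::sum);
        output.add(key + "=" + count);
    }

    // Copy of the current state, standing in for a framework-managed snapshot.
    Map<String, Integer> snapshot() {
        return new HashMap<>(state);
    }

    // Replace the state with a previously taken snapshot, as after a restart.
    void restore(Map<String, Integer> snapshot) {
        state = new HashMap<>(snapshot);
    }
}

class CallbackTaskSketch {
    // Feeds messages to the task in order and returns what it emitted.
    static List<String> run(TaskSketch task, String[][] messages) {
        List<String> output = new ArrayList<>();
        for (String[] m : messages) {
            task.process(m[0], m[1], output);
        }
        return output;
    }

    public static void main(String[] args) {
        CountingTask task = new CountingTask();
        String[][] firstBatch = {{"user1", "click"}, {"user2", "view"}, {"user1", "view"}};
        System.out.println(run(task, firstBatch));

        // Simulate a failure: snapshot the state, start a fresh task, restore into it.
        Map<String, Integer> saved = task.snapshot();
        CountingTask restarted = new CountingTask();
        restarted.restore(saved);
        String[][] secondBatch = {{"user1", "click"}};
        System.out.println(run(restarted, secondBatch));
    }
}
```

In this sketch the restarted task continues counting from the restored state instead of starting from zero, which is the behavior the managed-state feature is meant to provide. In a real deployment the framework, not the application, would take and restore the snapshot.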
Requires specific versions of Java, Scala, and YARN, with some modules not supporting Java 11, adding setup hurdles and compatibility issues, as noted in the README.
Core functionality depends on Apache YARN for resource management, limiting deployment flexibility in non-Hadoop or cloud-native environments.
Building from source involves multiple Gradle commands and environment configurations, which can be cumbersome for new users, as seen in the build instructions.
Python ETL framework for stream processing, real-time analytics, LLM pipelines, and RAG.
Incremental engine for long horizon agents 🌟 Star if you like it!
Event streaming platform for agentic AI. Continuously ingest, transform, and serve event streams in real time, at scale.
Distributed stream processing engine in Rust