Open-Awesome

© 2026 Open-Awesome. Curated for the developer elite.

Apache Samza

Apache-2.0 · Java

A distributed stream processing framework built on Apache Kafka and Apache Hadoop YARN for fault-tolerant, stateful processing.

843 stars · 332 forks · 0 contributors

What is Apache Samza?

Apache Samza is a distributed stream processing framework that enables developers to build stateful, high-throughput applications for real-time data processing. It solves the challenge of processing continuous data streams with strong guarantees around fault tolerance, durability, and scalability by leveraging Apache Kafka for messaging and Apache Hadoop YARN for resource management.

Target Audience

Data engineers and developers building real-time analytics pipelines, event-driven applications, and stateful stream processing systems that require reliable, scalable data processing with managed state.

Value Proposition

Developers choose Samza for its simple API, robust managed state capabilities, and seamless integration with the Kafka ecosystem, providing a production-ready framework that handles complex distributed systems concerns like fault tolerance and scalability out of the box.

Overview

Mirror of Apache Samza

Use Cases

Best For

  • Building real-time analytics pipelines with at-least-once processing semantics
  • Developing stateful stream processing applications that require large, managed state
  • Creating fault-tolerant event-driven architectures using Apache Kafka
  • Implementing scalable data processing jobs on Hadoop YARN clusters
  • Processing high-volume data streams with strong durability guarantees
  • Building pluggable stream processing systems that can integrate with multiple messaging backends

Not Ideal For

  • Environments not using Apache Kafka or Hadoop YARN, as Samza's core architecture is optimized for these technologies.
  • Stateless stream processing tasks, where a lighter library such as Kafka Streams may suffice without the overhead of running a cluster manager.
  • Cloud-native serverless deployments where managed services like AWS Lambda or Google Cloud Dataflow are preferred over self-managed YARN clusters.
  • Projects requiring rapid prototyping with minimal setup, due to Samza's complex build process and dependency management.

Pros & Cons

Pros

Simple Streaming API

Provides a straightforward callback-based interface similar to MapReduce, making it accessible for developers, as highlighted in the README's emphasis on simplicity.
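
The callback style can be illustrated with a minimal sketch. Samza's own low-level interface is `StreamTask`, whose `process` method is invoked for each incoming message. The `Envelope`, `Collector`, `Task`, and `run` names below are hypothetical stand-ins invented for this example so that it runs without any Samza dependency; they are not the library's API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration only; NOT Samza's classes.
class CallbackSketch {
    /** One incoming message: a key plus its payload. */
    static final class Envelope {
        final String key;
        final String message;

        Envelope(String key, String message) {
            this.key = key;
            this.message = message;
        }
    }

    /** Receives whatever the task decides to emit downstream. */
    interface Collector {
        void send(String key, String message);
    }

    /** The run loop calls process() once per message, in arrival order. */
    interface Task {
        void process(Envelope envelope, Collector collector);
    }

    /** Counts messages per key and emits the running count. */
    static final class CountingTask implements Task {
        private final Map<String, Integer> counts = new HashMap<>();

        @Override
        public void process(Envelope envelope, Collector collector) {
            int updated = counts.merge(envelope.key, 1, Integer::sum);
            collector.send(envelope.key, Integer.toString(updated));
        }
    }

    /** Stand-in for a framework run loop: feed each message to the task. */
    static List<String> run(Task task, List<Envelope> input) {
        List<String> output = new ArrayList<>();
        Collector collector = (key, message) -> output.add(key + "=" + message);
        for (Envelope envelope : input) {
            task.process(envelope, collector);
        }
        return output;
    }

    public static void main(String[] args) {
        List<Envelope> input = Arrays.asList(
            new Envelope("a", "x"), new Envelope("b", "y"), new Envelope("a", "z"));
        System.out.println(run(new CountingTask(), input)); // prints [a=1, b=1, a=2]
    }
}
```

The task author writes only `process`; delivery, ordering, and output routing stay with the surrounding loop, which is the division of labor the MapReduce comparison points at.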

Managed State Recovery

Automatically snapshots and restores processor state, which can grow to many gigabytes per partition, so that a restarted processor resumes from a consistent snapshot, per the key features.
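
One common way to make local state recoverable is to record every write in a durable log and replay that log after a restart. The sketch below models only that idea, under stated assumptions: the `Store`, `Change`, and `restore` names are hypothetical, and an in-memory list stands in for a durable log. It does not reproduce Samza's actual storage engine or its API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of log-backed state; NOT Samza's implementation.
class ChangelogSketch {
    /** One recorded write; a null value marks a delete. */
    static final class Change {
        final String key;
        final String value;

        Change(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    /** In-memory store that appends every write to a log (a list here). */
    static final class Store {
        private final Map<String, String> data = new HashMap<>();
        private final List<Change> changelog;

        Store(List<Change> changelog) {
            this.changelog = changelog;
        }

        void put(String key, String value) {
            data.put(key, value);
            changelog.add(new Change(key, value));
        }

        void delete(String key) {
            data.remove(key);
            changelog.add(new Change(key, null));
        }

        String get(String key) {
            return data.get(key);
        }

        /** Rebuilds a fresh store by replaying the log from the beginning. */
        static Store restore(List<Change> changelog) {
            Store store = new Store(changelog);
            for (Change change : changelog) {
                if (change.value == null) {
                    store.data.remove(change.key);
                } else {
                    store.data.put(change.key, change.value);
                }
            }
            return store;
        }
    }

    public static void main(String[] args) {
        List<Change> log = new ArrayList<>();
        Store original = new Store(log);
        original.put("a", "1");
        original.put("b", "2");
        original.put("a", "3");
        original.delete("b");

        // Simulate a crash: discard 'original' and rebuild from the log alone.
        Store recovered = Store.restore(log);
        System.out.println(recovered.get("a")); // prints 3
        System.out.println(recovered.get("b")); // prints null
    }
}
```

Replaying from the start gets slower as the log grows, which is why real systems pair this approach with compaction or periodic snapshots.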

Strong Fault Tolerance

Integrates with YARN to transparently migrate tasks during cluster failures, offering high availability without manual intervention, as described in the README.

Seamless Kafka Integration

Uses Kafka so that messages are processed in the order they were written to each partition and none are lost, providing durability guarantees out of the box, as specified in the features.
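
Ordering in a partitioned log holds within a partition, so messages that must stay in order need to share a partition, usually by sharing a key. The sketch below illustrates that routing with hypothetical `partitionFor` and `route` helpers. It uses `String.hashCode` purely for illustration, which is an assumption and not the hash Kafka's default partitioner applies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of key-based partitioning; not Kafka's partitioner.
class PartitionSketch {
    /** Maps a key to a partition; floorMod keeps the result non-negative. */
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    /** Appends each {key, value} pair to its partition, preserving arrival order. */
    static List<List<String>> route(List<String[]> messages, int numPartitions) {
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (String[] message : messages) {
            int partition = partitionFor(message[0], numPartitions);
            partitions.get(partition).add(message[0] + ":" + message[1]);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<String[]> messages = Arrays.asList(
            new String[] {"a", "1"},
            new String[] {"b", "1"},
            new String[] {"a", "2"});
        // Both "a" messages land in the same partition, in their original order.
        System.out.println(route(messages, 2)); // prints [[b:1], [a:1, a:2]]
    }
}
```

No order is implied between `b:1` and the `a` messages, since they sit in different partitions.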

Cons

Complex Dependency Management

Requires specific versions of Java, Scala, and YARN, with some modules not supporting Java 11, adding setup hurdles and compatibility issues, as noted in the README.

Heavy YARN Reliance

Core functionality depends on Apache YARN for resource management, limiting deployment flexibility in non-Hadoop or cloud-native environments.

Steep Initial Setup

Building from source involves multiple Gradle commands and environment configurations, which can be cumbersome for new users, as seen in the build instructions.

Quick Stats

Stars: 843
Forks: 332
Contributors: 0
Open Issues: 0
Last commit: 24 days ago
Created: Since 2015

Tags

#real-time-analytics · #fault-tolerance · #scalable-architecture · #java framework · #scala · #big-data · #data-pipelines · #apache-kafka · #state-management

Built With

  • Scala
  • Java
  • Apache Kafka
  • Gradle

Included in

Streaming · 3.0k
Auto-fetched 1 day ago

Related Projects

Pathway

Python ETL framework for stream processing, real-time analytics, LLM pipelines, and RAG.

Stars: 63,065
Forks: 1,679
Last commit: 1 day ago

CocoIndex

Incremental engine for long horizon agents 🌟 Star if you like it!

Stars: 10,215
Forks: 801
Last commit: 1 day ago

RisingWave

Event streaming platform for agentic AI. Continuously ingest, transform, and serve event streams in real time, at scale.

Stars: 9,067
Forks: 775
Last commit: 1 day ago

Arroyo

Distributed stream processing engine in Rust

Stars: 4,933
Forks: 361
Last commit: 3 days ago