A library for writing MapReduce programs that execute on distributed platforms like Storm and Scalding using Scala/Java collection-like syntax.
Summingbird is a library that lets developers write MapReduce programs using Scala or Java collection-like syntax, which can then be executed on distributed platforms like Storm (for real-time processing) and Scalding (for batch processing). It solves the problem of maintaining separate codebases for batch and streaming data pipelines by providing a unified API.
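The collection-like style can be illustrated without Summingbird itself. What follows is a minimal Java sketch of a word count over an in-memory collection, using only the standard Streams API. It is an analogy for the programming model, not Summingbird's API; the class name `WordCountSketch` and the helper `toWords` are hypothetical names chosen for this example.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class WordCountSketch {

    // Hypothetical helper: split a sentence into lowercase words.
    static Stream<String> toWords(String sentence) {
        return Arrays.stream(sentence.toLowerCase().split("\\s+"))
                .filter(word -> !word.isEmpty());
    }

    // flatMap each sentence into words, then sum a count of 1 per word by key.
    static Map<String, Long> wordCount(Collection<String> sentences) {
        return sentences.stream()
                .flatMap(WordCountSketch::toWords)
                .collect(Collectors.groupingBy(
                        Function.identity(),
                        Collectors.summingLong(word -> 1L)));
    }

    public static void main(String[] args) {
        Map<String, Long> counts =
                wordCount(Arrays.asList("the quick fox", "the lazy dog"));
        System.out.println(counts);
    }
}
```

In a system like the one described, the same flatMap-then-sum shape would be handed to a batch or streaming engine rather than evaluated in memory.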
Summingbird is aimed at data engineers and Scala/Java developers building large-scale, fault-tolerant data processing systems that need both batch and real-time capabilities.
Developers choose Summingbird because it abstracts away the complexities of distributed platforms, allows code reuse across batch and streaming contexts, and provides strong fault-tolerance guarantees through its hybrid execution mode.
Streaming MapReduce with Scalding and Storm
Enables writing data processing logic once and running it on both batch (Scalding) and real-time (Storm) engines, as demonstrated in the word count example where the same code works for multiple modes.
Uses Scala/Java collection transformations like flatMap and sumByKey, making code intuitive for developers; the README shows how Summingbird code closely mirrors native Scala collections.
Supports a hybrid execution mode that combines batch and real-time processing to keep results consistent; the project's feature list advertises production-ready primitives for building fault-tolerant systems.
Abstracts underlying platforms such as Storm and Scalding, allowing business logic to remain decoupled from execution engines, which simplifies code reuse across different processing contexts.
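The idea behind combining batch and real-time results can be sketched conceptually. The Java example below assumes, purely for illustration, that a batch job has produced per-key counts up to some cutoff and that a real-time job holds counts for events after that cutoff; a read then sums the two views per key. This is not Summingbird's implementation, and `HybridMergeSketch` and `merge` are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

public class HybridMergeSketch {

    // Combine a batch view and a real-time view by summing values per key.
    // Assumption: the two views cover disjoint sets of events, so adding
    // their counts does not double-count anything.
    static Map<String, Long> merge(Map<String, Long> batch,
                                   Map<String, Long> realtime) {
        Map<String, Long> merged = new HashMap<>(batch); // copy, leave inputs untouched
        realtime.forEach((key, value) -> merged.merge(key, value, Long::sum));
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Long> batch = new HashMap<>();
        batch.put("fox", 10L);
        batch.put("dog", 4L);

        Map<String, Long> realtime = new HashMap<>();
        realtime.put("fox", 2L);
        realtime.put("cat", 1L);

        System.out.println(merge(batch, realtime));
    }
}
```

Because addition is associative, the merged total does not depend on where the cutoff between the two views falls, as long as each event is counted in exactly one of them.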
Marked as 'retired' in the README badge, indicating no active development or support; bugs are unlikely to be fixed, which makes it a risky long-term dependency.
Getting started requires installing and configuring multiple external services like Memcached and Storm, as detailed in the example setup, adding overhead compared to more integrated solutions.
Only integrates with Storm and Scalding, excluding newer and more widely adopted frameworks like Apache Spark or Apache Flink, which limits flexibility and ecosystem benefits.
Supports only Scala and Java, which restricts adoption for teams working in other languages.