A Spark Streaming library for mining big data streams with incremental learning algorithms.
streamDM is a library for mining big data streams using Spark Streaming. It provides implementations of stream learning algorithms that can handle data where distributions may change over time and examples must be processed efficiently with minimal memory footprint. The library addresses the unique challenges of stream mining compared to traditional batch learning.
streamDM is aimed at data scientists and engineers who work with real-time data streams and need to perform machine learning tasks such as classification, clustering, and regression on continuously arriving data.
streamDM offers a collection of theoretically sound stream mining algorithms integrated with Spark Streaming, providing scalability and efficiency for big data stream processing. It fills a gap in the Spark ecosystem by focusing specifically on incremental learning methods suitable for dynamic data streams.
Stream Data Mining Library for Spark Streaming
Leverages Spark Streaming for horizontal scaling on big data streams, processing each stream as a sequence of RDDs (a DStream) as described in the README.
Includes methods like Hoeffding Decision Trees with theoretical guarantees for stream learning, addressing data distribution changes over time as highlighted in the project philosophy.
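The theoretical guarantee behind Hoeffding trees comes from the Hoeffding bound: after n observations of a quantity with range R, the observed mean is within epsilon = sqrt(R^2 ln(1/delta) / (2n)) of the true mean with probability 1 - delta. The following is a minimal plain-Python sketch of that bound and the resulting split test, not streamDM's Scala implementation; the function names `hoeffding_bound` and `should_split` are hypothetical and chosen only for illustration.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Return epsilon such that, with probability 1 - delta, the true mean
    lies within epsilon of the mean observed over n examples."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain: float, second_gain: float,
                 value_range: float, delta: float, n: int) -> bool:
    """Split a leaf only when the best attribute beats the runner-up by
    more than the Hoeffding bound for the examples seen so far."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


# Illustrative values (assumptions): information gain on a two-class
# problem has range 1.0; delta = 1e-7 is a commonly used confidence setting.
epsilon = hoeffding_bound(value_range=1.0, delta=1e-7, n=1000)
decision = should_split(0.30, 0.15, value_range=1.0, delta=1e-7, n=1000)
```

Because epsilon shrinks as n grows, a leaf that cannot yet separate two close attributes simply waits for more examples instead of storing the ones it has already seen.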
Provides synthetic data generators (e.g., HyperplaneGenerator) for testing and simulation, facilitating model validation without external data sources.
Designed specifically for stream data where examples are processed once with minimal memory, overcoming batch learning limitations as outlined in the big data stream learning section.
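To make the single-pass, constant-memory idea concrete, here is a small plain-Python sketch that draws labelled examples from a fixed hyperplane and evaluates an online perceptron in test-then-train (prequential) fashion. It illustrates the general technique only: `hyperplane_stream`, `OnlinePerceptron`, and `prequential_accuracy` are hypothetical names, and the generator is a simplified stand-in, not streamDM's HyperplaneGenerator or any other part of its API.

```python
import random
from typing import Iterator, List, Tuple

Example = Tuple[List[float], int]


def hyperplane_stream(weights: List[float], n_examples: int,
                      seed: int = 0) -> Iterator[Example]:
    """Yield (features, label) pairs; the label says on which side of a
    fixed hyperplane a uniformly random point in [0, 1]^d falls."""
    rng = random.Random(seed)
    threshold = sum(weights) / 2.0
    for _ in range(n_examples):
        x = [rng.random() for _ in weights]
        score = sum(w * v for w, v in zip(weights, x))
        yield x, (1 if score >= threshold else 0)


class OnlinePerceptron:
    """Keeps only a weight vector and a bias, so memory use is constant
    regardless of how many examples have been processed."""

    def __init__(self, n_features: int, learning_rate: float = 0.1) -> None:
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, x: List[float]) -> int:
        activation = self.bias + sum(w * v for w, v in zip(self.weights, x))
        return 1 if activation >= 0.0 else 0

    def update(self, x: List[float], label: int) -> None:
        # Adjust the model only when the current prediction is wrong.
        error = label - self.predict(x)
        if error != 0:
            step = self.learning_rate * error
            self.bias += step
            self.weights = [w + step * v for w, v in zip(self.weights, x)]


def prequential_accuracy(stream: Iterator[Example],
                         model: OnlinePerceptron) -> float:
    """Test-then-train: predict on each example first, then learn from it.
    Every example is visited exactly once and then discarded."""
    correct = 0
    total = 0
    for x, label in stream:
        correct += int(model.predict(x) == label)
        model.update(x, label)
        total += 1
    return correct / total if total else 0.0


model = OnlinePerceptron(n_features=3)
accuracy = prequential_accuracy(
    hyperplane_stream([0.5, -0.3, 0.8], n_examples=5000, seed=42), model)
```

The stream is consumed through an iterator, so no example is held after `update` returns; a drifting stream could be simulated by changing `weights` between batches, which is an extension this sketch does not implement.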
Missing key methods like regression and advanced clustering, with future plans (e.g., Hoeffding Regression Tree) still unimplemented, reducing versatility for diverse use cases.
Tied to Spark 2.3.2 and Scala 2.11, both of which are outdated, so it may not integrate cleanly with newer Spark or Scala versions.
Requires a fully configured Spark Streaming environment, including Java 8+ and SBT, which adds significant setup complexity compared to lighter stream mining libraries.