An experimental integration of Apache Samza and Luwak for scalable real-time document matching against large query sets.
Samza-Luwak is an experimental integration that combines Apache Samza's distributed stream processing capabilities with Luwak's optimized Lucene-based query matching. It enables scalable real-time matching of streaming documents against large sets of stored search queries, addressing use cases like media monitoring or alert systems. The project explores partitioning and pipeline architectures to handle high throughput and query volumes efficiently.
It is aimed at developers and engineers building scalable real-time search or alerting systems, particularly those dealing with high-volume document streams and large query sets. It is also relevant to researchers and teams experimenting with distributed stream processing and search technologies.
It explores a distributed, fault-tolerant architecture for real-time document-query matching that is designed to scale with both document throughput and the number of complex queries. Its focus is on performance through partitioning strategies, building on Samza's strengths in stream processing.
Integration of Samza and Luwak
Uses Apache Samza for distributed, fault-tolerant processing, so that high-throughput document matching can be spread across a cluster.
Integrates the Luwak library, which builds on Lucene and is optimized for matching documents against hundreds of thousands of queries.
Supports real-time addition, modification, and removal of queries via a dedicated input stream, allowing flexible alerting systems.
Investigates strategies like query-set partitioning and multi-stage pipelines to optimize performance for scalable real-time search.
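The query-update stream and query-set partitioning described above can be illustrated with a small, self-contained sketch. It does not use the Samza or Luwak APIs: `QueryPartition`, `PartitionedMatcher`, `applyQueryUpdate`, and `matchDocument` are hypothetical names, the "query" is a toy all-terms-must-appear check standing in for Luwak's matching, and the routing scheme (hash each query ID to one partition, show every document to all partitions) is only one plausible reading of query-set partitioning, not necessarily what the project does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical stand-in for one matcher task that owns a slice of the query set. */
class QueryPartition {
    // queryId -> terms that must all appear in a document (toy conjunctive query)
    private final Map<String, Set<String>> queries = new HashMap<>();

    void upsert(String queryId, String queryText) {
        queries.put(queryId, tokenize(queryText)); // add or overwrite
    }

    void remove(String queryId) {
        queries.remove(queryId);
    }

    List<String> match(String document) {
        Set<String> docTerms = tokenize(document);
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : queries.entrySet()) {
            if (docTerms.containsAll(e.getValue())) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }

    static Set<String> tokenize(String text) {
        return new HashSet<>(Arrays.asList(text.toLowerCase().trim().split("\\s+")));
    }
}

/** Hypothetical coordinator: routes query updates, fans documents out to all partitions. */
class PartitionedMatcher {
    private final List<QueryPartition> partitions = new ArrayList<>();

    PartitionedMatcher(int numPartitions) {
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new QueryPartition());
        }
    }

    // Each query lives in exactly one partition, chosen by hashing its ID.
    private QueryPartition owner(String queryId) {
        return partitions.get(Math.floorMod(queryId.hashCode(), partitions.size()));
    }

    /** Query stream: non-null text adds or modifies a query; null text removes it (assumed convention). */
    void applyQueryUpdate(String queryId, String queryText) {
        if (queryText == null) {
            owner(queryId).remove(queryId);
        } else {
            owner(queryId).upsert(queryId, queryText);
        }
    }

    /** Document stream: every partition sees the document; hits are merged and sorted. */
    TreeSet<String> matchDocument(String document) {
        TreeSet<String> merged = new TreeSet<>();
        for (QueryPartition p : partitions) {
            merged.addAll(p.match(document));
        }
        return merged;
    }

    public static void main(String[] args) {
        PartitionedMatcher m = new PartitionedMatcher(4);
        m.applyQueryUpdate("q1", "samza luwak");
        m.applyQueryUpdate("q2", "stream processing");
        System.out.println(m.matchDocument("Samza and Luwak for stream matching")); // prints [q1]
        m.applyQueryUpdate("q1", null); // remove q1 via the query stream
        System.out.println(m.matchDocument("Samza and Luwak for stream matching")); // prints []
    }
}
```

In this sketch each partition holds only a fraction of the queries, so per-document work in a partition shrinks as partitions are added, at the cost of delivering every document to every partition.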
The README explicitly states it's a proof-of-concept, 'very hacky and experimental, and may not work at all,' making it risky for any serious use.
Requires manual building and installation of multiple dependencies, including forks like Lucene with unreleased components, and has compatibility issues such as not working with JDK8.
Only handles streaming documents and cannot match against historical data, limiting its use to real-time-only scenarios as noted in the README.