An open-source framework for developing large-scale anomaly detection models using Apache Spark.
Yurita is an open-source anomaly detection framework developed by PayPal for building large-scale statistical models. It enables developers and data scientists to detect outliers in data streams using configurable pipelines and Apache Spark for distributed processing. The framework solves the problem of identifying anomalies in high-volume, time-series data common in monitoring, fraud detection, and security applications.
Yurita is aimed at data engineers and data scientists working with big data platforms who need scalable, customizable anomaly detection. It is particularly suited to teams that already use Apache Spark and need real-time or batch analysis of temporal data.
Developers choose Yurita because it runs on Apache Spark, has a modular pipeline design, and scales to large datasets. Because it is open source and was developed at PayPal, it is easier to extend than proprietary or less modular alternatives.
Anomaly detection framework @ PayPal
Leverages Apache Spark for distributed processing of large datasets, which fits big data workflows and matches the README's focus on large-scale models.
Uses a builder pattern for configurable statistical methods such as categorical averaging and entropy, allowing data scientists to compose custom anomaly detection workflows, as shown in the README's sample code.
Supports fixed and sliding window definitions for time-series analysis, which is critical for monitoring and security applications dealing with streaming data.
Handles both categorical and numerical columns with tailored functions, enhancing versatility in detecting anomalies across diverse data types as highlighted in the key features.
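To make the difference between fixed and sliding windows concrete, here is a minimal sketch in plain Python. It does not use Yurita's API: the helper names `fixed_window` and `sliding_windows` are hypothetical, and aligning windows to the Unix epoch is an assumption made only for this illustration.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)  # assumed alignment point for all windows


def fixed_window(ts, size):
    """Return the start of the single non-overlapping window that contains ts."""
    index = (ts - EPOCH) // size  # timedelta // timedelta yields an int
    return EPOCH + index * size


def sliding_windows(ts, size, slide):
    """Return the starts of every overlapping window [start, start + size) that contains ts."""
    # Latest window start at or before ts, aligned to the slide interval.
    start = EPOCH + ((ts - EPOCH) // slide) * slide
    starts = []
    # Walk backwards while the window still covers ts.
    while start > ts - size:
        starts.append(start)
        start -= slide
    return sorted(starts)


event = datetime(2020, 1, 1, 10, 25)
print(fixed_window(event, timedelta(hours=1)))
print(sliding_windows(event, timedelta(hours=1), timedelta(minutes=30)))
```

With a fixed window each event belongs to exactly one window, while with a sliding window the same event contributes to every overlapping window, which smooths the reference statistics at the cost of more computation.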
Requires building from source with Gradle because the artifact is not yet available on Maven Central. Dependencies on specific Spark versions (e.g., 2.4.1) add configuration overhead and potential compatibility issues.
Focuses on statistical models such as categorical averaging and entropy and lacks built-in support for machine learning or deep learning algorithms, so it may not suffice for more complex anomaly detection tasks.
Documentation is hosted on ReadTheDocs, but the project is still evolving and Maven Central availability is pending. This suggests possible instability, incomplete features, or reliance on community contributions for advanced use cases.
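For readers unfamiliar with the statistical approach mentioned above, the following plain-Python sketch shows one plausible reading of "categorical averaging" and "entropy": average the category distributions of past windows into a reference, then compare the Shannon entropy of the current window against it. This is an illustrative assumption, not Yurita's implementation; every function name here is hypothetical, and the entropy-difference score is chosen only for simplicity.

```python
import math
from collections import Counter


def distribution(values):
    """Turn raw categorical values into relative frequencies."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def average_distributions(dists):
    """Average several category distributions, treating a missing category as 0."""
    categories = set().union(*dists)
    return {c: sum(d.get(c, 0.0) for d in dists) / len(dists) for c in categories}


def entropy(dist):
    """Shannon entropy in bits of a category distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


# Hypothetical data: two historical windows and one current window of country codes.
history = [distribution(["US", "US", "DE", "FR"]), distribution(["US", "DE", "DE", "FR"])]
reference = average_distributions(history)
current = distribution(["US", "US", "US", "US"])

# Assumed score: absolute entropy difference between current window and reference.
score = abs(entropy(current) - entropy(reference))
print(f"reference entropy={entropy(reference):.3f}, current entropy={entropy(current):.3f}, score={score:.3f}")
```

A large score means the current window's category mix is much more or much less concentrated than the historical reference. Choosing the threshold at which a score counts as an anomaly is left to the user.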