A distributed Spark/Scala implementation of Isolation Forest and Extended Isolation Forest algorithms for scalable unsupervised outlier detection.
This library implements the Isolation Forest and Extended Isolation Forest algorithms for unsupervised outlier detection on Spark, in Scala. Training and scoring both run distributed, so anomalies can be identified in datasets too large for a single machine. Trained standard Isolation Forest models can also be exported to ONNX for deployment outside Spark.
It is aimed at data engineers and machine learning practitioners who work on big data platforms such as Apache Spark and need scalable anomaly detection. It also suits researchers implementing Isolation Forest variants in distributed environments.
Developers choose this library because it offers production-ready, distributed implementations of both the standard and extended Isolation Forest algorithms with native Spark integration. Its ONNX export makes trained standard models portable for inference across platforms and languages.
Builds on Spark's ML library for scalable training and scoring on large datasets, with support for pipelines and HDFS persistence, as shown in the README's distributed training examples.
Includes Extended Isolation Forest, which splits on random hyperplanes instead of single features to remove axis-aligned bias and improve detection in correlated feature spaces, as illustrated by the README's heatmap comparisons and benchmarks.
Supports ONNX export for standard Isolation Forest models, enabling portable inference across languages and platforms via the Python-based converter detailed in the README.
Inherits from Spark's Estimator and Model classes for seamless integration into existing ML workflows, with model persistence and parameter tuning features.
ONNX export is only available for the standard Isolation Forest, not the Extended variant, restricting portability for the more advanced algorithm as admitted in the README.
Requires building with Gradle and managing Spark dependencies, making it cumbersome for teams not already invested in the Spark ecosystem, with no pre-built binaries for quick adoption.
Benchmarks show that Extended Isolation Forest can underperform the standard algorithm on some datasets, such as ForestCover and Mulcross, and that it adds computational cost without a guaranteed improvement.
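The difference between the two split rules mentioned above can be sketched in a few lines. This is a minimal illustration in plain Python, not the library's Scala code: the function names are hypothetical, and the hyperplane rule follows the usual Extended Isolation Forest formulation (a normal vector drawn from a standard normal distribution and an intercept point drawn uniformly from the data's bounding box), which is assumed here rather than taken from the project.

```python
import random


def axis_aligned_split(point, feature, threshold):
    """Standard Isolation Forest test: compare one coordinate to a threshold."""
    return point[feature] < threshold


def hyperplane_split(point, normal, intercept):
    """Extended Isolation Forest test: which side of a hyperplane the point is on.

    The point goes to the left branch when (point - intercept) . normal <= 0.
    """
    return sum((x - p) * n for x, p, n in zip(point, intercept, normal)) <= 0


def random_hyperplane(points, rng):
    """Draw a random split hyperplane for the given points (illustrative helper)."""
    dim = len(points[0])
    # Random direction: each component drawn from a standard normal distribution.
    normal = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    # Intercept point: uniform inside the bounding box of the data.
    intercept = [
        rng.uniform(min(p[i] for p in points), max(p[i] for p in points))
        for i in range(dim)
    ]
    return normal, intercept


if __name__ == "__main__":
    rng = random.Random(42)
    data = [(0.0, 0.0), (1.0, 1.2), (2.0, 1.9), (3.0, 3.1)]
    normal, intercept = random_hyperplane(data, rng)
    for p in data:
        print(p, hyperplane_split(p, normal, intercept))
```

The axis-aligned test reads a single coordinate, while the hyperplane test evaluates a dot product over every feature. That is one source of the extra computational cost noted above, and it is also why the extended variant can separate points along diagonal directions that a single-feature threshold cannot.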