A fast feature selection algorithm for tree-based models like XGBoost, designed to outperform Boruta in speed and performance.
BoostARoota is a fast feature selection algorithm designed for tree-based machine learning models like XGBoost. It addresses slow feature selection on high-dimensional datasets by adding shadow features (shuffled copies of the original columns) and iteratively removing variables whose importance does not clear a threshold derived from those shadows. The algorithm significantly reduces computation time while maintaining or improving model performance compared to alternatives like Boruta.
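The shadow-feature idea can be sketched in a few lines of dependency-free Python. This is an illustration only, not BoostARoota's implementation: the function names are hypothetical, absolute Pearson correlation stands in for the XGBoost feature importance the real algorithm uses, and the default values for cutoff, delta, iters and max_rounds are assumptions.

```python
import random


def abs_correlation(xs, ys):
    """Absolute Pearson correlation, a stand-in for model-based importance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no signal
    return abs(cov) / (var_x * var_y) ** 0.5


def select_features(columns, target, cutoff=4.0, delta=0.1, iters=10,
                    max_rounds=100, seed=0):
    """Return the names of columns that survive shadow-based elimination."""
    rng = random.Random(seed)
    kept = dict(columns)
    for _ in range(max_rounds):
        if not kept:
            break
        real = {name: abs_correlation(values, target)
                for name, values in kept.items()}
        # Average the shadow importance over several independent shuffles.
        shadow_mean = 0.0
        for _ in range(iters):
            scores = []
            for values in kept.values():
                shadow = list(values)  # copy, so the real column is untouched
                rng.shuffle(shadow)    # shuffling breaks any link to the target
                scores.append(abs_correlation(shadow, target))
            shadow_mean += sum(scores) / len(scores) / iters
        threshold = shadow_mean / cutoff
        survivors = {n: v for n, v in kept.items() if real[n] > threshold}
        n_before, removed = len(kept), len(kept) - len(survivors)
        kept = survivors
        # Stop once a round removes less than a `delta` share of the features.
        if removed < delta * n_before:
            break
    return list(kept)


data_rng = random.Random(1)
signal = [data_rng.gauss(0, 1) for _ in range(200)]
noise = [data_rng.gauss(0, 1) for _ in range(200)]
constant = [1.0] * 200
kept = select_features(
    {"signal": signal, "noise": noise, "constant": constant}, target=signal
)
```

Here the signal column is always retained and the constant column, which scores zero, is always dropped. The noise column may or may not survive, because the threshold is only a fraction of the mean shadow importance.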
Data scientists and machine learning engineers working with high-dimensional datasets who need efficient feature selection for tree-based models like XGBoost or sklearn classifiers.
Developers choose BoostARoota because it runs up to 100x faster than Boruta in the project's benchmarks, integrates with sklearn tree-based classifiers, and exposes configurable parameters for tuning how aggressively features are removed. Its removal process targets boosting algorithms, where traditional feature selection methods tend to perform poorly.
A fast xgboost feature selection algorithm
Leverages parallel processing to run up to 100x faster than Boruta, as shown in benchmark tests on datasets like LSVT with runtime reductions from 50 seconds to under 0.5 seconds.
Uses XGBoost as the default base model for robust feature importance estimation, specifically addressing poor performance of traditional methods on boosting algorithms.
Supports any sklearn tree-based classifier, such as ExtraTreesClassifier, allowing easy integration into existing machine learning workflows without locking into XGBoost.
Parameters like cutoff and delta let users balance feature removal rates, adapting to different dataset needs, with defaults optimized for wide applicability.
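How cutoff and delta trade off removal rates can be shown with a little arithmetic. The sketch below assumes the behaviour described in the project's README as best recalled here: a feature is dropped when its importance falls below the mean shadow importance divided by cutoff, and a new round starts only if at least a delta share of the features was removed. The helper names and the default values (cutoff=4, delta=0.1) are assumptions for illustration.

```python
def removal_threshold(shadow_importances, cutoff=4.0):
    """Importance a real feature must exceed to be kept."""
    mean_shadow = sum(shadow_importances) / len(shadow_importances)
    return mean_shadow / cutoff


def starts_next_round(n_before, n_after, delta=0.1):
    """True if enough features were removed to justify another round."""
    removed_fraction = (n_before - n_after) / n_before
    return removed_fraction >= delta


shadow = [0.25, 0.5, 0.75]                            # mean shadow importance = 0.5
conservative = removal_threshold(shadow, cutoff=4.0)  # 0.125: fewer features fall below it
aggressive = removal_threshold(shadow, cutoff=2.0)    # 0.25: more features fall below it

keeps_going = starts_next_round(100, 80)  # 20% removed, at least delta
stops = starts_next_round(100, 95)        # only 5% removed, below delta
```

Under these assumptions a larger cutoff lowers the threshold and removes fewer features per round, while a larger delta makes it harder to trigger another round.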
Requires input data to be one-hot-encoded, which can cause dataframes to explode in size if numeric variables are misinterpreted as categorical, leading to memory issues and unexpected results.
For multi-class classification, only the 'mlogloss' evaluation metric is supported, restricting flexibility compared to other metrics available in XGBoost or sklearn.
When using non-default classifiers, optimal parameters like cutoff and iterations require manual experimentation, as the README admits these haven't been fully tested and need user trial and error.
The significant speed improvements rely on multicore processing; on single-core systems or without parallel capabilities, the performance gains may be less dramatic than advertised.
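The one-hot encoding limitation listed above is easy to quantify. The following dependency-free sketch uses a hypothetical helper to count the columns produced when a numeric column is mistakenly treated as categorical. In practice the encoding would usually be done with a library call such as pandas.get_dummies, which is not used here so that the example stays self-contained.

```python
def one_hot_column_count(rows, categorical):
    """Number of columns after one-hot encoding the named columns."""
    count = 0
    for name in rows[0]:
        if name in categorical:
            # One new column per distinct value.
            count += len({row[name] for row in rows})
        else:
            count += 1  # numeric columns pass through unchanged
    return count


colors = ["red", "green", "blue"]
rows = [{"age": 20 + i, "color": colors[i % 3]} for i in range(1000)]

intended = one_hot_column_count(rows, {"color"})          # 1 + 3 = 4 columns
exploded = one_hot_column_count(rows, {"color", "age"})   # 1000 + 3 = 1003 columns
```

Checking column dtypes before encoding avoids this kind of blow-up.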
BoostARoota is an open-source alternative to the following products: