Python implementation of the Boruta all-relevant feature selection method with scikit-learn compatibility.
BorutaPy is a Python implementation of the Boruta all-relevant feature selection algorithm that identifies all features carrying information usable for prediction. It provides a scikit-learn compatible interface for selecting features that contribute to understanding data-generating phenomena, rather than just finding a minimal optimal subset for a specific classifier.
Data scientists, machine learning engineers, and researchers working with high-dimensional datasets who need comprehensive feature selection for model interpretation and understanding underlying data patterns.
Developers choose BorutaPy because it offers a more comprehensive approach to feature selection than minimal-optimal methods, provides a familiar scikit-learn interface, includes performance improvements over the original R implementation, and offers flexible statistical corrections suitable for various data types including biological data.
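The core idea behind Boruta can be sketched without the library itself: duplicate every feature, shuffle the copies to destroy their relationship with the target ("shadow features"), fit a random forest on the augmented data, and count a real feature as a "hit" when its importance beats the strongest shadow. The sketch below is a single Boruta-style iteration using only numpy and scikit-learn; it illustrates the mechanism and is not BorutaPy's actual implementation:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Toy data: the first 5 features are informative, the last 5 are pure noise.
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_redundant=0, shuffle=False, random_state=0)

# One Boruta-style iteration: append a shuffled "shadow" copy of every column.
shadows = rng.permuted(X, axis=0)  # permute each column independently
X_aug = np.hstack([X, shadows])

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_aug, y)
real_imp = rf.feature_importances_[:X.shape[1]]
shadow_imp = rf.feature_importances_[X.shape[1]:]

# A real feature scores a "hit" if it outperforms the strongest shadow.
hits = real_imp > shadow_imp.max()
print(hits)
```

BorutaPy repeats this over many iterations and decides each feature's fate with a statistical test on its accumulated hit count, which is what the alpha parameter controls.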
Identifies all features that carry predictive information, not just a minimal set, which aligns with its philosophy of understanding data-generating phenomena as stated in the README.
Provides a fit/transform interface identical to scikit-learn methods, making it easy to incorporate into existing pipelines, as demonstrated in the example code with RandomForestClassifier.
Offers adjustable percentile thresholds (perc parameter) and a two-step correction process, allowing users to control false discovery rates more appropriately for various data types like biological datasets.
Leverages scikit-learn's efficient ensemble implementations, resulting in faster run times compared to the original R package, as highlighted in the key features.
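The role of the perc parameter can be illustrated in isolation: rather than requiring a real feature to beat the maximum shadow importance, the comparison threshold is a chosen percentile of the shadow importances, so lowering perc makes the test more lenient and admits more features. A toy numpy illustration of that thresholding (the importance values below are made up for demonstration):

```python
import numpy as np

# Hypothetical importances from one iteration: 10 real features, 10 shadows.
real_imp = np.array([0.20, 0.10, 0.065, 0.055, 0.045,
                     0.03, 0.02, 0.015, 0.01, 0.005])
shadow_imp = np.array([0.07, 0.06, 0.05, 0.05, 0.04,
                       0.04, 0.03, 0.03, 0.02, 0.01])

for perc in (100, 90, 70):
    # perc=100 reproduces the strict "beat the best shadow" rule.
    threshold = np.percentile(shadow_imp, perc)
    n_hits = int((real_imp > threshold).sum())
    print(f"perc={perc}: threshold={threshold:.3f}, hits={n_hits}")
```

Here perc=100 yields 2 hits while perc=70 yields 4, showing how a lower percentile trades a stricter false-discovery guarantee for higher sensitivity, which can suit noisy domains such as biological data.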
Only compatible with scikit-learn estimators that have a feature_importances_ attribute, primarily tree-based models, which restricts model choice and may not work with all algorithms.
The iterative process requires fitting a random forest many times, which can be prohibitively slow on large or high-dimensional datasets and limits scalability.
Parameters like perc and alpha require careful tuning and some statistical knowledge to balance sensitivity against false discoveries, adding complexity for non-expert users.
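The feature_importances_ constraint noted above is easy to check before committing to a model: fit the candidate estimator on a small sample and test for the attribute. A quick sketch with two standard scikit-learn estimators (tree ensembles expose the attribute after fitting; linear models do not):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=5, random_state=0)

for est in (RandomForestClassifier(n_estimators=10, random_state=0),
            LogisticRegression(max_iter=1000)):
    est.fit(X, y)
    # Only estimators exposing feature_importances_ are usable with BorutaPy.
    ok = hasattr(est, "feature_importances_")
    print(f"{type(est).__name__}: feature_importances_ present: {ok}")
```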