Fast, flexible, multi-threaded ensembles of decision trees for machine learning in pure Go.
CloudForest is a machine learning library implementing ensembles of decision trees in pure Go. It is designed for high performance on heterogeneous numerical and categorical data with missing values, making it particularly suitable for domains like genetic and clinical studies. The library supports multiple algorithms including Random Forest, AdaBoost, Gradient Boosting, and Hellinger Distance Trees.
CloudForest suits Go developers and data scientists working with heterogeneous datasets that contain missing values, especially in fields like genomics and clinical research where high-dimensional, noisy data is common. It also suits researchers who need specialized algorithms for unbalanced classes or feature selection.
Developers choose CloudForest for its optimized performance on modern CPUs with cache-friendly memory utilization and separate paths for different data types. It offers unique capabilities like robust missing value handling, artificial contrast feature selection (ACE), and specialized methods for unbalanced data that aren't commonly found in other implementations.
Ensembles of decision trees in Go (golang).
Achieves faster training times than many implementations due to CPU cache-friendly memory utilization and separate, optimized paths for binary, numerical, and categorical data, as shown in benchmarks on heterogeneous clinical data.
Applies a bias correction by default to handle missing data without imputation, reducing bias towards features with many missing values as described in the Missing Values section; three-way splitting is also available as an experimental option (-splitmissing).
Implements roughly balanced bagging, cost-weighted classification, and weighted Gini impurity for predicting rare events in unbalanced classes, which is critical for genetic and clinical studies.
Offers artificial contrasts with ensembles (ACE) for improved feature selection with p-values, helping combat overfitting in noisy, high-dimensional data as detailed in the Importance section.
Some features, such as three-way splitting for missing values (-splitmissing), are experimental and, as the README acknowledges, yield mixed results, so their effect on performance can be unpredictable.
Supports only the TSV-based AFM format, ARFF, and LibSVM, with no native support for common formats like CSV or Parquet, so data in those formats must be converted before use.
Being pure Go, it lacks seamless integration with popular data science tools in Python or R, making it less accessible for teams not invested in the Go ecosystem.
With numerous advanced options and experimental algorithms, tuning for optimal performance requires deep expertise and careful parameter sweeps, increasing setup complexity.