A scalable machine learning library for training Generalized Linear Models and GLMix models on Apache Spark.
Photon ML is a scalable machine learning library built on Apache Spark for training Generalized Linear Models (GLMs) and Generalized Linear Mixed Models (GLMMs). It solves the problem of large-scale response prediction, enabling personalized recommendations and ranking systems by efficiently handling models with hundreds of billions of coefficients.
Photon ML is aimed at data scientists and machine learning engineers who work on large-scale recommendation systems, ad targeting, or ranking problems within Spark ecosystems.
Developers choose Photon ML for its production use at LinkedIn, its dedicated support for GLMix models, and its integration with Apache Spark, which together give them production-tested tooling for personalized prediction tasks.
Implements the GAME algorithm to train Generalized Linear Mixed Models with hundreds of billions of coefficients, enabling per-user and per-item personalization as used in LinkedIn's recommendation systems.
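As a rough illustration of what per-user and per-item personalization means in a GLMix model, the sketch below scores one example by adding a global (fixed-effect) term to per-user and per-item (random-effect) terms and passing the sum through a logistic link. This is not Photon ML's API: the function names, the dictionary layout, and the simplification that all three terms share one feature vector are assumptions made for the example.

```python
import math

def dot(weights, features):
    """Sparse dot product over the feature names found in `weights`."""
    return sum(w * features.get(name, 0.0) for name, w in weights.items())

def glmix_probability(features, user_id, item_id, fixed, per_user, per_item):
    """Logistic GLMix-style score: global term plus per-user and per-item terms.

    Hypothetical helper for illustration only, not part of Photon ML.
    """
    score = dot(fixed, features)
    # Users or items without their own coefficients contribute nothing,
    # so the prediction falls back to the global model alone.
    score += dot(per_user.get(user_id, {}), features)
    score += dot(per_item.get(item_id, {}), features)
    return 1.0 / (1.0 + math.exp(-score))

# Toy coefficients (made up for the example).
fixed = {"bias": -1.0, "title_match": 2.0}
per_user = {"u1": {"title_match": 0.5}}
per_item = {"j9": {"bias": 0.5}}
x = {"bias": 1.0, "title_match": 1.0}

p_known = glmix_probability(x, "u1", "j9", fixed, per_user, per_item)
p_cold = glmix_probability(x, "new_user", "new_item", fixed, per_user, per_item)
```

In this toy setup the known user and item each shift the score upward, so p_known is higher than p_cold, which uses the global coefficients only.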
Built on Apache Spark, Photon ML deploys on existing Spark clusters and has been used in production at LinkedIn for large-scale machine learning tasks such as ad CTR prediction and ranking.
Supports multiple GLM types (logistic, linear, Poisson), configurable optimizers (L-BFGS, TRON), and L1, L2, and elastic-net regularization to control overfitting.
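To make the regularization options concrete, here is a minimal sketch of a logistic loss with an elastic-net penalty, where alpha = 1 gives pure L1 and alpha = 0 gives pure L2. The function names and the lam/alpha parameterization are assumptions chosen for the example; Photon ML's own configuration keys and scaling conventions may differ.

```python
import math

def logistic_loss(weights, rows):
    """Mean negative log-likelihood for labels in {0, 1}."""
    total = 0.0
    for features, label in rows:
        z = sum(w * x for w, x in zip(weights, features))
        # log(1 + exp(z)) - label * z, arranged to stay stable for large |z|.
        total += max(z, 0.0) + math.log1p(math.exp(-abs(z))) - label * z
    return total / len(rows)

def elastic_net_penalty(weights, lam, alpha):
    """lam * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2)."""
    l1 = sum(abs(w) for w in weights)
    l2 = sum(w * w for w in weights)
    return lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2)

def objective(weights, rows, lam, alpha):
    """Regularized objective that an optimizer would minimize."""
    return logistic_loss(weights, rows) + elastic_net_penalty(weights, lam, alpha)

# Toy data: (features, label) pairs, made up for the example.
rows = [([1.0, 0.0], 1), ([1.0, 1.0], 0)]
w = [1.0, -2.0]

penalty_l1 = elastic_net_penalty(w, lam=0.1, alpha=1.0)   # pure L1
penalty_l2 = elastic_net_penalty(w, lam=0.1, alpha=0.0)   # pure L2
penalty_mix = elastic_net_penalty(w, lam=0.1, alpha=0.5)  # elastic net
total = objective(w, rows, lam=0.1, alpha=0.5)
```

The L1 part pushes individual coefficients to exactly zero, while the L2 part shrinks all of them smoothly; the mix trades off between the two.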
Warm-start training initializes from an existing model, and partial retraining locks some coefficients while updating the rest, so models can be updated incrementally without the cost of full retraining from scratch.
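The idea behind warm starts and coefficient locking can be sketched in a few lines. The example below uses plain gradient descent on a logistic loss rather than L-BFGS or TRON, and the train_logistic function with its init and locked parameters is hypothetical, not Photon ML's interface.

```python
import math

def train_logistic(rows, init, locked=(), steps=200, lr=0.5):
    """Gradient descent on mean logistic loss, starting from `init`.

    Indices listed in `locked` keep their initial value throughout.
    Hypothetical helper for illustration only.
    """
    w = list(init)
    frozen = set(locked)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, y in rows:
            z = sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))
            for i, xi in enumerate(x):
                grad[i] += (p - y) * xi / len(rows)
        for i in range(len(w)):
            if i not in frozen:  # locked coefficients are never updated
                w[i] -= lr * grad[i]
    return w

# Toy data: (features, label) pairs, made up for the example.
rows = [([1.0, 0.0], 1), ([1.0, 1.0], 0), ([1.0, 2.0], 0), ([1.0, 0.5], 1)]

# Coefficients from an earlier training run (assumed values).
previous = [0.3, -0.8]

# Warm start from `previous`, keep coefficient 0 locked, refit coefficient 1.
updated = train_logistic(rows, previous, locked=[0], steps=50)
```

Starting from the previous coefficients means fewer iterations are needed than starting from zeros, and locking restricts the work to the part of the model that actually needs to change.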
Input data is expected primarily in Avro format; the README states that support for other formats depends on community contributions, which can be a barrier for teams using different data systems.
The GAME driver takes many parameters and is intricate to set up, and the legacy Photon driver is deprecated; together these make for a steep learning curve and can confuse new users.
Features such as the smoothed hinge loss SVM and hyperparameter auto-tuning are labeled experimental and not fully tested, which makes them less reliable for production environments.