A book teaching practical patterns for building scalable and reliable distributed machine learning systems using Kubernetes, TensorFlow, Kubeflow, and Argo Workflows.
Distributed Machine Learning Patterns is a repository containing references and code for the book of the same name, which teaches how to scale machine learning models from personal devices to large distributed clusters. It provides practical patterns and techniques for building production-ready, cloud-native ML systems, addressing common challenges like distributed training, failure handling, and dynamic serving.
Aimed at data analysts, data scientists, and software engineers who know the basics of machine learning algorithms and of running ML in production, and who need to design scalable distributed systems. Readers should have basic knowledge of Bash, Python, and Docker.
Developers choose this for its pattern-based approach to designing distributed ML systems, offering repeatable solutions that balance scalability, reliability, and maintainability. It provides real-world examples and hands-on guidance using industry-standard tools like Kubernetes, Kubeflow, and Argo Workflows, written by an author who is a key maintainer of and contributor to those projects.
Distributed Machine Learning Patterns from Manning Publications by Yuan Tang https://bit.ly/2RKv8Zo
Written by Yuan Tang, a key maintainer of Kubeflow and Argo, providing firsthand insights into best practices and future trends in cloud-native ML.
Covers end-to-end workflows from data ingestion to serving, with real-world scenarios that illustrate trade-offs, helping practitioners design robust systems.
Leverages industry-standard tools like Kubernetes, Kubeflow, and Argo Workflows, ensuring relevance for modern, scalable ML infrastructure deployments.
Includes executable code snippets in the repository that demonstrate pattern implementation, bridging theoretical concepts with hands-on practice.
Heavily centered on the Kubernetes ecosystem and TensorFlow, with limited guidance for serverless platforms or alternative frameworks like PyTorch.
Assumes prior knowledge of Docker, Bash, and production ML, so readers new to distributed systems will need additional background before the material is accessible.
Implementing these patterns requires significant setup and maintenance of distributed clusters, which can be resource-intensive for small teams.