A Python framework for creating reproducible, maintainable, and modular data engineering and data science pipelines.
Kedro is an open-source Python framework that applies software engineering best practices to data science workflows. It helps teams build production-ready data pipelines that are reproducible, maintainable, and modular, addressing common shortcomings of Jupyter notebooks and one-off scripts.
Data scientists and data engineers working in teams to create production-ready data and machine learning pipelines that require maintainable, collaborative code.
Developers choose Kedro for its standardized project structure and pipeline abstraction, which enforce modularity and resolve dependencies automatically, bridging the gap between data engineering and data science with built-in tools for testing, documentation, and deployment.
Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
Uses a Cookiecutter-based template to enforce consistent project setup, reducing onboarding time and improving team collaboration as highlighted in the README.
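A new project scaffolded with `kedro new` typically follows a layout along these lines (abridged sketch; exact folders depend on the Kedro version and starter chosen):

```
my-project/
├── conf/               # configuration: catalog, parameters, credentials
│   ├── base/           # shared, version-controlled config
│   └── local/          # machine-specific overrides (gitignored)
├── data/               # layered data folders (01_raw, 02_intermediate, ...)
├── notebooks/          # exploratory work
├── src/my_project/     # pipeline code: nodes and pipelines
└── tests/              # pytest test suite
```

Because every Kedro project shares this shape, a new team member knows where configuration, data, and pipeline code live before reading a line of it.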
Leverages pure Python functions to automatically resolve pipeline dependencies, minimizing manual errors and enabling clear visualization with Kedro-Viz.
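The core idea is that each node declares which datasets it consumes and produces, and the framework orders execution from those declarations alone. A minimal sketch of that mechanism (not Kedro's actual implementation) using a topological sort over hypothetical node declarations:

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each entry names a pure function, its input
# datasets, and its output datasets, mirroring the shape of Kedro's
# node(func, inputs, outputs) declaration.
def clean(raw):
    return [r for r in raw if r is not None]

def total(cleaned):
    return sum(cleaned)

nodes = {
    "total_node": (total, ["cleaned"], ["total"]),  # listed out of order
    "clean_node": (clean, ["raw"], ["cleaned"]),
}

# Map each dataset to the node that produces it, then topologically
# sort nodes so every input is computed before it is consumed.
produced_by = {out: name for name, (_, _, outs) in nodes.items() for out in outs}
deps = {
    name: {produced_by[i] for i in ins if i in produced_by}
    for name, (_, ins, _) in nodes.items()
}
order = list(TopologicalSorter(deps).static_order())

# Run the pipeline against an in-memory stand-in for the Data Catalog.
catalog = {"raw": [1, None, 2, 3]}
for name in order:
    func, ins, outs = nodes[name]
    catalog[outs[0]] = func(*(catalog[i] for i in ins))

print(order)             # clean_node runs before total_node
print(catalog["total"])  # 6
```

Because ordering is derived from the declared inputs and outputs, nodes can be registered in any order, and the same dependency graph is what Kedro-Viz renders.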
Data Catalog includes versioning for file-based systems, supporting reproducible data workflows and model tracking as mentioned in the features.
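Versioning is switched on per dataset in the catalog configuration. A hypothetical `conf/base/catalog.yml` entry (dataset type names come from the separate kedro-datasets package and vary by version):

```yaml
model_input_table:
  type: pandas.ParquetDataset
  filepath: data/03_primary/model_input_table.parquet
  versioned: true  # each run saves under a timestamped subdirectory
```

With `versioned: true`, every pipeline run writes a new timestamped copy instead of overwriting the file, so past runs remain reproducible.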
Comes with pytest for test-driven development and ruff for linting, enforcing coding standards out of the box to maintain code quality.
Supports deployment on various platforms like AWS Batch, Databricks, and Kubernetes with Argo, offering flexibility for production environments.
Requires adoption of software engineering practices, which can be challenging for data scientists accustomed to Jupyter notebooks or quick scripts.
Primarily designed for batch processing pipelines; real-time streaming is not a core feature, and the documented deployment strategies focus on scheduled jobs.
Because kedro and kedro-datasets are separate packages with differing Python version policies, keeping the two compatible can add maintenance overhead.