A Python framework for building and deploying serverless data and ML pipelines on AWS using AWS CDK.
Datajob is a Python framework built on AWS CDK that enables developers to define, deploy, and run serverless data and machine learning pipelines on AWS with minimal effort. It abstracts the underlying AWS services like Glue, Step Functions, and SageMaker, allowing users to focus on pipeline logic rather than infrastructure configuration. The framework handles resource provisioning, orchestration, and deployment automatically.
Datajob is aimed at data engineers and developers who build ETL/ELT pipelines, machine learning workflows, or batch processing jobs on AWS and want a code-centric alternative to manual AWS console configuration or hand-written CloudFormation templates.
Developers choose Datajob because it cuts the boilerplate involved in deploying serverless data pipelines on AWS. Its Python API and high-level abstractions over AWS CDK make pipelines quicker to define and deploy, and easier to maintain, than raw CloudFormation templates or hand-written CDK constructs.
Build and deploy a serverless data pipeline on AWS with no effort.
Simplifies Step Functions workflow definition with a Pythonic `task1 >> task2` syntax, reducing the need to write state machine definitions by hand in JSON or YAML, as shown in the project's quickstart example.
Automatically provisions dedicated S3 buckets for data and for deployment artifacts. The data bucket name is exposed as `datajob_stack.context.data_bucket_name`, so no manual bucket setup or configuration is needed.
Supports packaging projects as wheels using Poetry or setup.py and deploying them to AWS, demonstrated in examples where `project_root` is specified for easy distribution.
Abstracts AWS Glue job deployment for Python shell and PySpark jobs with minimal configuration, allowing focus on job logic rather than infrastructure details.
Deeply integrated with AWS services like Glue and Step Functions, making it unsuitable for multi-cloud or hybrid deployments without significant rework.
Requires aws-cdk@1.109.0 to be installed via npm. Pinning this exact version can conflict with other CDK installations and rules out newer CDK releases until the framework itself is updated.
Only supports Glue version 2.0 for PySpark jobs, as noted in the README, restricting access to newer features or versions available in AWS Glue.
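The `task1 >> task2` chaining syntax mentioned above relies on Python operator overloading. The sketch below illustrates the general technique only; it is not Datajob's implementation. The `Task` and `Workflow` classes and their methods are hypothetical names invented for this example, and no AWS resources are involved.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


class Task:
    """Hypothetical stand-in for a pipeline step (not a Datajob class)."""

    def __init__(self, name, workflow):
        self.name = name
        self.workflow = workflow

    def __rshift__(self, other):
        # `a >> b` records that b must run after a.
        self.workflow.edges.append((self.name, other.name))
        # Returning the right-hand task lets chains like a >> b >> c
        # be evaluated left to right.
        return other


class Workflow:
    """Hypothetical container that collects dependencies between tasks."""

    def __init__(self):
        self.edges = []

    def task(self, name):
        return Task(name, self)

    def execution_order(self):
        # Order the tasks so that every task comes after its upstream tasks.
        sorter = TopologicalSorter()
        for upstream, downstream in self.edges:
            sorter.add(downstream, upstream)
        return list(sorter.static_order())


workflow = Workflow()
task1 = workflow.task("task1")
task2 = workflow.task("task2")
task3 = workflow.task("task3")

# Declare the pipeline order with the shift operator.
task1 >> task2 >> task3

print(workflow.execution_order())
```

A framework built this way can translate the recorded edges into an orchestration definition, which is what lets the pipeline order be declared in one readable line instead of a hand-written state machine document.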