Question 1

How to deploy a PySpark job with Datajob?

Accepted Answer

Define a `GlueJob` with `job_type='glueetl'` and pass the source and destination S3 paths through its arguments, as shown in the PySpark example. Datajob provisions the infrastructure, and you can package your code as a wheel for deployment.
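
On the script side, Glue hands job arguments to the script as command-line flags. Below is a minimal sketch of how a job script might read them. It assumes the flags are named `--source` and `--destination`. The `parse_job_args` helper and the bucket paths are made up for illustration and are not part of Datajob. It uses the standard library's `argparse` rather than Glue's own helper.

```python
import argparse


def parse_job_args(argv):
    """Read the source and destination S3 paths from the job's command line."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--source", required=True)
    parser.add_argument("--destination", required=True)
    # The runtime may add flags of its own, so ignore anything unrecognised.
    args, _unknown = parser.parse_known_args(argv)
    return args


# Example invocation with placeholder values (assumed, for illustration only).
args = parse_job_args(
    [
        "--source", "s3://example-bucket/raw/input.csv",
        "--destination", "s3://example-bucket/target/output.parquet",
        "--JOB_NAME", "pyspark-job",
    ]
)
print(args.source)
print(args.destination)
```

In a real job you would pass `sys.argv[1:]` instead of the hard-coded list.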

Question 2

Datajob vs AWS CDK for data pipelines?

Accepted Answer

Datajob provides a higher-level abstraction over AWS CDK, simplifying common patterns like Glue job deployment and Step Functions orchestration with less boilerplate. However, direct CDK offers more granular control for custom or complex AWS resource configurations.

Question 3

Can I use Datajob with AWS Lambda or ECS?

Accepted Answer

Not yet. Datajob currently supports Glue, SageMaker, and Step Functions. The README mentions plans for Lambda and ECS Fargate integration, but these are not implemented, so you are limited to those three services for now.

Question 4

How to set up email notifications for pipeline failures?

Accepted Answer

Pass a `notification` argument containing the email address(es) to the `StepfunctionsWorkflow` constructor. Datajob then creates an SNS topic that sends an alert when the workflow succeeds or fails, as detailed in the README's notification section.

Question 5

What's the cost of using Datajob on AWS?

Accepted Answer

Datajob itself is open-source and free, but you pay for underlying AWS services like Glue, Step Functions, and S3. It can reduce operational overhead, but costs depend on resource usage and pipeline scale.
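
As a rough illustration of how usage drives the bill, here is a back-of-the-envelope sketch for a single Glue job run, which AWS bills by DPU-hour. The function name and every input value are placeholders, not real prices. The sketch ignores minimum billing durations as well as Step Functions and S3 charges.

```python
def estimate_glue_cost(workers, dpu_per_worker, runtime_hours, rate_per_dpu_hour):
    """Estimate one Glue job run: total DPUs x hours x price per DPU-hour."""
    return workers * dpu_per_worker * runtime_hours * rate_per_dpu_hour


# All inputs are placeholders; look up the current rate for your region.
cost = estimate_glue_cost(
    workers=2, dpu_per_worker=1, runtime_hours=0.5, rate_per_dpu_hour=0.5
)
print(f"{cost:.2f}")
```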

Question 6

How to handle parallel tasks in Datajob workflows?

Accepted Answer

Use the `>>` operator to define dependencies between tasks. Tasks that do not depend on one another can run in parallel, as demonstrated in the parallel execution example with multiple task chains.
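
To make the idea concrete, here is a standalone sketch of how a `>>` operator can record dependencies and how independent chains end up in the same parallel stage. The `Task` class and `parallel_stages` function are hypothetical stand-ins written for this answer. They are not Datajob's implementation, which builds a Step Functions state machine instead.

```python
class Task:
    """Hypothetical stand-in for a workflow task; not Datajob's own class."""

    def __init__(self, name, graph):
        self.name = name
        self.graph = graph
        graph.setdefault(name, set())

    def __rshift__(self, other):
        # `a >> b` records that b must wait for a.
        self.graph[other.name].add(self.name)
        return other  # returning `other` lets chains like a >> b >> c work


def parallel_stages(graph):
    """Group tasks into stages; tasks within a stage have no unmet dependencies."""
    remaining = {name: set(deps) for name, deps in graph.items()}
    stages = []
    while remaining:
        ready = sorted(name for name, deps in remaining.items() if not deps)
        if not ready:
            raise ValueError("cycle detected in task dependencies")
        stages.append(ready)
        for name in ready:
            del remaining[name]
        for deps in remaining.values():
            deps.difference_update(ready)
    return stages


graph = {}
task1, task2, task3, task4, task5 = (Task(f"task{i}", graph) for i in range(1, 6))

# Two independent chains that join at task5.
task1 >> task2 >> task5
task3 >> task4 >> task5

stages = parallel_stages(graph)
print(stages)  # [['task1', 'task3'], ['task2', 'task4'], ['task5']]
```

Here `task1` and `task3` share a stage because neither depends on the other, while `task5` waits for both chains to finish.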
