A Python ETL framework for stream processing, real-time analytics, and building live LLM/RAG pipelines, powered by a scalable Rust engine.
A platform to programmatically author, schedule, and monitor workflows as code.
An open-source, event-driven orchestration platform for building reliable scheduled and real-time workflows using declarative YAML.
A workflow orchestration framework for building resilient data pipelines in Python.
Open-source data integration platform for building ELT pipelines from APIs, databases, and files to data warehouses, lakes, and lakehouses.
An open source container-native workflow engine for orchestrating parallel jobs on Kubernetes.
An orchestration platform for developing, deploying, and monitoring data pipelines and assets.
A transformation tool that enables data analysts and engineers to transform data using software engineering best practices.
A transformation workflow that enables data teams to transform data in their warehouse using SQL and software engineering best practices.
A Kafka-compatible streaming data platform designed for lower latency and simpler operation, with no dependency on ZooKeeper or the JVM.
A Python library for data quality testing and validation using expressive, extensible Expectations.
A Python framework for creating reproducible, maintainable, and modular data engineering and data science pipelines.
A curated list of data engineering tools, frameworks, databases, and resources for software developers.
A Python tool for parameterizing, executing, and analyzing Jupyter Notebooks at scale.
A real-time data integration platform that creates and continually updates consistent views of transactional data using SQL.
A lean distributed data streaming engine and stream processing framework written in Rust for building responsive data-intensive applications.
A distributed stream processing engine in Rust that performs stateful computations on real-time data with subsecond results.
A batch workflow job scheduler created at LinkedIn to manage Hadoop jobs.
A flexible and expressive API for performing statistical data validation on dataframe-like objects.
A framework for building data pipelines that supports fast, iterative development and deployment to any environment.
A library built on Apache Spark for defining unit tests to measure data quality in large datasets.
A curated list of awesome ETL frameworks, libraries, and software for data integration and pipeline development.
A Scala API for Cascading that simplifies writing Hadoop MapReduce jobs.
A distributed computation system written in Go for parallel and cluster processing, similar to Hadoop MapReduce and Spark.
A Python library for defining portable, modular, and testable data transformation DAGs with built-in lineage and metadata.
A lightweight Python library for creating portable, expressive, and testable data transformation DAGs with built-in lineage and metadata.
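The "DAG from plain functions" idea behind the two entries above can be sketched with the standard library alone: dependencies are inferred by matching a function's parameter names against other function names. This is a toy illustration of the pattern, not the API of any listed library:

```python
import inspect


def resolve(funcs: dict, name: str, inputs: dict):
    """Compute `name` by recursively resolving its parameter names."""
    if name in inputs:
        return inputs[name]
    fn = funcs[name]
    kwargs = {
        param: resolve(funcs, param, inputs)
        for param in inspect.signature(fn).parameters
    }
    return fn(**kwargs)


def raw_total(raw: list) -> int:
    # Depends on the external input "raw".
    return sum(raw)


def scaled_total(raw_total: int, factor: int) -> int:
    # Depends on the node "raw_total" and the external input "factor".
    return raw_total * factor


funcs = {"raw_total": raw_total, "scaled_total": scaled_total}
result = resolve(funcs, "scaled_total", {"raw": [1, 2, 3], "factor": 10})
# result == 60
```

Because the graph is derived from signatures, renaming a parameter rewires the pipeline, which is what makes this style easy to test and to trace for lineage.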
A purely functional, effectful, and polymorphic stream processing library for Scala built on Cats and Cats-Effect.
A Python framework and Rust-based distributed processing engine for stateful event and stream processing.
A smarter shell and scripting environment with advanced features for usability, safety, and productivity in DevOps tooling.
A unified data pipeline tool for ingestion, transformation with SQL/Python/R, and data quality checks across major platforms.
A portable execution engine for deploying machine learning pipelines trained in Spark and scikit-learn without their runtime dependencies.
Open-Awesome is built by the community, for the community. Submit a project, suggest an awesome list, or help improve the catalog on GitHub.