A Python ETL framework for stream processing, real-time analytics, and building live LLM/RAG pipelines, powered by a scalable Rust engine.
A platform to programmatically author, schedule, and monitor workflows as code.
An open-source, event-driven orchestration platform for building reliable scheduled and real-time workflows using declarative YAML.
A workflow orchestration framework for building resilient data pipelines in Python.
Open-source data integration platform for building ELT pipelines from APIs, databases, and files to data warehouses, lakes, and lakehouses.
An open source container-native workflow engine for orchestrating parallel jobs on Kubernetes.
An orchestration platform for developing, deploying, and monitoring data pipelines and assets.
A transformation tool that enables data analysts and engineers to transform data using software engineering best practices.
A transformation workflow that enables data teams to transform data in their warehouse using SQL and software engineering best practices.
A Kafka-compatible streaming data platform designed for lower latency and simpler operation, with no dependency on ZooKeeper or the JVM.
A Python library for data quality testing and validation using expressive, extensible Expectations.
A Python framework for creating reproducible, maintainable, and modular data engineering and data science pipelines.
A curated list of data engineering tools, frameworks, databases, and resources for software developers.
A Python tool for parameterizing, executing, and analyzing Jupyter Notebooks at scale.
A real-time data integration platform that creates and continually updates consistent views of transactional data using SQL.
A lean distributed data streaming engine and stream processing framework written in Rust for building responsive data-intensive applications.
A distributed stream processing engine in Rust that performs stateful computations on real-time data with subsecond results.
A batch workflow job scheduler created at LinkedIn to manage Hadoop jobs.
A flexible and expressive API for performing statistical data validation on dataframe-like objects.
A framework for building data pipelines that supports fast, iterative development and deployment to any environment.
A library built on Apache Spark for defining unit tests to measure data quality in large datasets.
A curated list of awesome ETL frameworks, libraries, and software for data integration and pipeline development.
A Scala API for Cascading that simplifies writing Hadoop MapReduce jobs.
A distributed computation system written in Go for parallel and cluster processing, similar to Hadoop MapReduce and Spark.
A Python library for defining portable, modular, and testable data transformation DAGs with built-in lineage and metadata.
A lightweight Python library for creating portable, expressive, and testable data transformation DAGs with built-in lineage and metadata.
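The "DAG from plain functions" idea behind the two entries above can be sketched with the standard library alone: dependencies are inferred by matching a function's parameter names against other function names. This is a toy illustration of the pattern, not the API of any listed library:

```python
import inspect


def resolve(funcs: dict, name: str, inputs: dict):
    """Compute `name` by recursively resolving its parameter names."""
    if name in inputs:
        return inputs[name]
    fn = funcs[name]
    kwargs = {
        param: resolve(funcs, param, inputs)
        for param in inspect.signature(fn).parameters
    }
    return fn(**kwargs)


def raw_total(raw: list) -> int:
    # Depends on the external input "raw".
    return sum(raw)


def scaled_total(raw_total: int, factor: int) -> int:
    # Depends on the node "raw_total" and the external input "factor".
    return raw_total * factor


funcs = {"raw_total": raw_total, "scaled_total": scaled_total}
result = resolve(funcs, "scaled_total", {"raw": [1, 2, 3], "factor": 10})
# result == 60
```

Because the graph is derived from signatures, renaming a parameter rewires the pipeline, which is what makes this style easy to test and to trace for lineage.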
A purely functional, effectful, and polymorphic stream processing library for Scala built on Cats and Cats-Effect.
A Python framework and Rust-based distributed processing engine for stateful event and stream processing.
A smarter shell and scripting environment with advanced features for usability, safety, and productivity in DevOps tooling.
A unified data pipeline tool for ingestion, transformation with SQL/Python/R, and data quality checks across major platforms.
A portable execution engine for deploying machine learning pipelines trained in Spark and scikit-learn without their runtime dependencies.
Open-Awesome is built by the community, for the community. Submit a project, suggest an awesome list, or help improve the catalog on GitHub.