A platform to programmatically author, schedule, and monitor workflows as code.
Open-source data integration platform for building ELT pipelines from APIs, databases, and files to data warehouses, lakes, and lakehouses.
An orchestration platform for developing, deploying, and monitoring data pipelines and assets.
A server-side data processing pipeline that ingests, transforms, and ships logs and events from multiple sources.
A transformation tool that enables data analysts and engineers to transform data using software engineering best practices.
A low-latency platform for change data capture (CDC) that streams row-level changes from databases to applications.
A high-performance, resilient stream processor that connects various sources and sinks, performs data transformations, and guarantees at-least-once delivery.
A high-performance, declarative stream processor that connects various sources and sinks with built-in data transformation capabilities.
A curated list of data engineering tools, frameworks, databases, and resources for software developers.
An open-source ETL (Extract, Transform, Load) tool for data integration and migration.
An ultra-performant data transformation framework for AI, with incremental processing and data lineage built-in.
A data loading and migration tool for PostgreSQL that handles errors gracefully and transforms data from various sources.
Open-source data pipelines for cloud asset inventory, CSPM, FinOps, and vulnerability management across AWS, Azure, GCP, and 70+ sources.
Open-source data pipelines to sync cloud infrastructure metadata from AWS, Azure, GCP, and 70+ sources into your data warehouse.
An easy-to-use, powerful, and reliable system to process and distribute data across cybersecurity, observability, and AI pipelines.
Azkaban is a batch workflow job scheduler created at LinkedIn to manage Hadoop jobs.
A Python library using machine learning for accurate and scalable fuzzy matching, record deduplication, and entity resolution on structured data.
A MySQL change data capture daemon that streams database changes as JSON to Kafka, Kinesis, and other platforms.
A Python library that simplifies data integration between pandas and AWS services like Athena, S3, Redshift, and more.
A blazing-fast command-line toolkit for querying, slicing, analyzing, transforming, and validating tabular data (CSV, Excel, JSONL, etc.).
A curated list of awesome ETL frameworks, libraries, and software for data integration and pipeline development.
A command-line tool to efficiently and securely sync data between PostgreSQL databases with parallel transfers and data masking.
A CLI tool to copy data between any databases and platforms with a single command, no code required.
Koalas provides the pandas DataFrame API on Apache Spark, enabling data scientists to work with big data using familiar pandas syntax.
A Java-based tool for importing tabular data from JDBC sources into Elasticsearch for indexing.
A collection of utilities, scripts, and views for managing, optimizing, and automating Amazon Redshift data warehouse operations.
A Scala API for Apache Beam and Google Cloud Dataflow, enabling unified batch and streaming data processing.
A Python library for defining portable, modular, and testable data transformation DAGs with built-in lineage and metadata.
A lightweight Python library for creating portable, expressive, and testable data transformation DAGs with built-in lineage and metadata.
Instill Core is a full-stack AI infrastructure tool for data, model, and pipeline orchestration to build versatile AI-first applications.
A single C++ binary SQL engine for high-performance stream processing, analytics, observability, and AI/ML pipelines.
A lightweight and efficient stream processing library for Go, providing a declarative DSL to build data pipelines.
A masterless, cloud-scale, fault-tolerant distributed computation system, written in Clojure, for batch and stream processing.
A Python CLI utility and library for manipulating SQLite databases, including importing JSON/CSV and running in-memory queries.