A modern, enterprise-ready business intelligence web application for data visualization and exploration.
A platform to programmatically author, schedule, and monitor workflows as code.
A visual roadmap outlining the skills, technologies, and learning paths to become an Artificial Intelligence expert in 2022.
A curated collection of papers and articles from companies sharing real-world data science and machine learning applications in production.
A workflow orchestration framework for building resilient data pipelines in Python.
Open-source data integration platform for building ELT pipelines from APIs, databases, and files to data warehouses, lakes, and lakehouses.
An orchestration platform for developing, deploying, and monitoring data pipelines and assets.
A curated list of awesome big data frameworks, resources, and tools across various categories.
A visual roadmap and study guide covering the modern data engineering landscape for aspiring data engineers in 2021.
A transformation tool that enables data analysts and engineers to transform data using software engineering best practices.
A transformation workflow that enables data teams to transform data in their warehouse using SQL and software engineering best practices.
A Kafka-compatible streaming data platform, advertised by its maintainers as 10x faster, with no ZooKeeper or JVM dependencies.
A Python library for data quality testing and validation using expressive, extensible Expectations.
A Python framework for creating reproducible, maintainable, and modular data engineering and data science pipelines.
A no-dependency Python SQL parser, transpiler, optimizer, and engine that translates between 31+ SQL dialects.
Enterprise-grade event streaming platform that continuously ingests, processes, and serves real-time data with Apache Iceberg™ integration.
An enterprise-grade event streaming platform that ingests, processes, and manages real-time event data with PostgreSQL compatibility and Apache Iceberg™ integration.
A high-performance table format for huge analytic datasets, enabling multiple engines to safely work with the same tables simultaneously.
An open-source storage framework that enables building a Lakehouse architecture with ACID transactions and scalable metadata handling.
A curated list of data engineering tools, frameworks, databases, and resources for software developers.
A curated list of essential academic papers for understanding database fundamentals and building modern data systems.
An open-source feature store for managing and serving machine learning features for training and online inference.
A portable Python dataframe library that compiles to SQL and works with over 20 backends for unified data manipulation.
An open data lakehouse platform for incremental data processing with upserts, deletes, and time-travel queries.
An open-source tool that transforms object storage into a Git-like repository for versioned, atomic, and repeatable data lake operations.
A metadata-driven data discovery and catalog platform that helps data teams find, understand, and trust their data resources.
Azkaban is a batch workflow job scheduler created at LinkedIn to manage Hadoop jobs.
An open-source, privacy-focused customer data platform (CDP) that collects, processes, and routes event data to warehouses and tools.
A Python library that simplifies data integration between pandas and AWS services like Athena, S3, Redshift, and more.
The fastest way to build data pipelines with iterative development and deployment anywhere.
A blazing-fast command-line toolkit for querying, slicing, analyzing, transforming, and validating tabular data (CSV, Excel, JSONL, etc.).
A library built on Apache Spark for defining unit tests to measure data quality in large datasets.
Open-Awesome is built by the community, for the community. Submit a project, suggest an awesome list, or help improve the catalog on GitHub.