Data Engineering

Apache Superset (Python)

A modern, enterprise-ready business intelligence web application for data visualization and exploration.

#bi-tool #apache #data-viz

A platform to programmatically author, schedule, and monitor workflows as code.

#apache #airflow #devops

Stars 46.2k

Forks 17.4k

AI Expert Roadmap (JavaScript)

A visual roadmap outlining the skills, technologies, and learning paths to become an Artificial Intelligence expert in 2022.

#ai #neural-network #roadmap

Stars 31.1k

Forks 2.6k

Last commit 10 months ago

eugeneyan/applied-ml GitHub repository

A curated collection of papers and articles from companies sharing real-world data science and machine learning applications in production.

#search #applied-machine-learning #tech-blogs

Stars 29.9k

Forks 4.0k

Last commit 2 years ago

prefect (Python)

A workflow orchestration framework for building resilient data pipelines in Python.

#retry-logic #devops #workflow

Stars 23.4k

Forks 2.4k

#open-source #pipeline #data-integration

Airbyte (Python)

Open-source data integration platform for building ELT pipelines from APIs, databases, and files to data warehouses, lakes, and lakehouses.

An orchestration platform for developing, deploying, and monitoring data pipelines and assets.

#data-orchestration #data-assets #devops

Stars 15.9k

Forks 2.2k

#database #data-science #distributed-systems

awesome-bigdata

A curated list of awesome big data frameworks, resources, and tools across various categories.

Stars 14.5k

Forks 2.6k

Last commit 2 months ago

dbt-core (Python)

A transformation workflow that enables data teams to transform data in their warehouse using SQL and software engineering best practices.

#data-documentation #pypa #business-intelligence

Stars 13.5k

Forks 2.5k

#roadmap #skill-development #data-engineering

Data Engineer Roadmap

A visual roadmap and study guide covering the modern data engineering landscape for aspiring data engineers in 2021.

Stars 12.8k

Forks 1.3k

Last commit 4 years ago

Redpanda (C++)

A Kafka-compatible streaming data platform that's 10x faster, with no ZooKeeper or JVM dependencies.

#realtime #event-driven #high-performance

Stars 12.4k

Forks 770

#data-testing #datacleaning #open-source

Great Expectations (Python)

A Python library for data quality testing and validation using expressive, extensible Expectations.

Stars 11.7k

Forks 1.8k

#agentic-workflow #hacktoberfest #agentic-ai

Kedro (Python)

A Python framework for creating reproducible, maintainable, and modular data engineering and data science pipelines.

Stars 10.9k

Forks 1.1k

Last commit 4 days ago

xonsh (Python)

🐚 Python-powered shell. Full-featured, cross-platform and AI-friendly.

#devops #alacritty #iterm2

A no-dependency Python SQL parser, transpiler, optimizer, and engine that translates between 31+ SQL dialects.

#sql-engine #postgres #sql-optimizer

Stars 9.4k

Forks 1.2k

#stream-processing #etl-pipeline #database

RisingWave (Rust)

An enterprise-grade event streaming platform that ingests, processes, and manages real-time event data with PostgreSQL compatibility and Apache Iceberg™ integration.

Stars 9.2k

Forks 797

A high-performance table format for huge analytic datasets, enabling multiple engines to safely work with the same tables simultaneously.

#apache-flink #hacktoberfest #apache-spark

An open-source storage framework that enables building a Lakehouse architecture with ACID transactions and scalable metadata handling.

#apache-spark #parquet #data-versioning

Stars 8.9k

Forks 2.1k

#stream-processing #workflow-orchestration #awesome-list

awesome-data-engineering

A curated list of data engineering tools, frameworks, databases, and resources for software developers.

Stars 8.9k

Forks 1.6k

#database-architecture #distributed-systems #database-fundamentals

db-readings

A curated list of essential academic papers for understanding database fundamentals and building modern data systems.

Stars 8.1k

Forks 925

Last commit 1 year ago

Feast - A Feature Store for ML for GCP by Gojek/Google (Python)

An open-source feature store for managing and serving machine learning features for training and online inference.

#features #batch-processing #data-science

A portable Python dataframe library that compiles to SQL and works with over 20 backends for unified data manipulation.

#database #python-dataframe #sql-compilation

An open data lakehouse platform for incremental data processing with upserts, deletes, and time-travel queries.

#apache-flink #upsert-delete #stream-processing

An open-source tool that transforms object storage into a Git-like repository for versioned, atomic, and repeatable data lake operations.

#multi-cloud #data-versioning #azure-blob-storage

Stars 5.5k

Forks 464

#data-lineage #data-catalog #data-engineering

Amundsen (Python)

A metadata-driven data discovery and catalog platform that helps data teams find, understand, and trust their data resources.

Stars 4.8k

Forks 965

Last commit 20 days ago

Azkaban (Java)

Azkaban is a batch workflow job scheduler created at LinkedIn to manage Hadoop jobs.

#hacktoberfest #gradle #batch-processing

Stars 4.5k

Forks 1.6k

Last commit 2 years ago

RudderStack (Go)

An open-source, privacy-focused customer data platform (CDP) that collects, processes, and routes event data to warehouses and tools.

#event-collection #segment-alternative #warehouse-management

#apache-arrow #data-science #redshift

aws-data-wrangler (Python)

A Python library that simplifies data integration between pandas and AWS services like Athena, S3, Redshift, and more.

Stars 4.1k

Forks 737