Data Engineering

103 projects

deequ (Scala)

A library built on Apache Spark for defining unit tests to measure data quality in large datasets.

#data-testing #apache-spark #spark

Stars: 3.6k

Forks: 584

Last commit: 3 days ago

Ploomber (Python)

The fastest way to build data pipelines with iterative development and deployment anywhere.

#deployment #pipelines #airflow

Stars: 3.6k

Forks: 242

Last commit: 1 year ago

awesome-etl list

A curated list of awesome ETL frameworks, libraries, and software for data integration and pipeline development.

#open-source #workflow-orchestration #data-integration

Stars: 3.6k

Forks: 372

Last commit: 2 months ago

clickhouse-go (Go)

A high-performance Go driver for ClickHouse offering both native and standard database/sql interfaces.

#columnar-database #database-driver #database

High-performance datastore optimized for time series and tick data storage and retrieval.

#database #data-storage #quantitative-analysis

Stars: 3.1k

Forks: 574

Last commit: 2 years ago

Streaming

A curated list of awesome streaming frameworks, applications, readings, and resources for stream processing.

#stream-processing #message-queue #real-time-analytics

Fast tool for comparing datasets within or across SQL databases to identify differences.

#database #python-library #data-science

Stars: 3.0k

Forks: 310

Last commit: 2 years ago

Meltano (Python)

A declarative code-first data integration engine that unlocks 600+ APIs and databases, eliminating the need to write and maintain custom API integrations.

#data-orchestration #meltano-sdk #pipelines

Stars: 2.6k

Forks: 258

Last commit: 1 day ago

Hamilton (Jupyter Notebook)

A Python library for defining portable, modular, and testable data transformation DAGs with built-in lineage and metadata.

#data-lineage #etl-pipeline #python-library

Stars: 2.6k

Forks: 201

Last commit: 5 days ago

Bytewax (Python)

A Python framework and Rust-based distributed processing engine for stateful event and stream processing.

#stream-processing #event-driven #data-science

Stars: 2.0k

Forks: 112

Last commit: 1 month ago

Apache Spark (Shell)

A curated list of awesome Apache Spark packages, libraries, and resources for data engineers and scientists.

#apache-spark #data-science #spark-ecosystem

Stars: 1.9k

Forks: 346

Last commit: 4 months ago

Kiba (Ruby)

A Ruby framework for writing reliable, concise, and maintainable ETL (Extract-Transform-Load) data processing jobs.

#rubydatascience #etl-ruby #ruby-gem

Stars: 1.8k

Forks: 90

Last commit: 6 months ago

SQLLineage (Python)

A Python-powered SQL lineage analysis tool that extracts source and target tables from SQL commands without deep parser knowledge.

#ast-analysis #data-lineage #data-engineering

Stars: 1.7k

Forks: 281

Last commit: 7 days ago

Multiwoven (Ruby)

An open-source Reverse ETL platform for syncing data from warehouses to business tools like Salesforce, HubSpot, and Slack.

#open-source #reverse-etl #data-integration

Stars: 1.7k

Forks: 92

Last commit: 22 hours ago

Quix Streams (Python)

A Python framework for building real-time data pipelines and event-driven microservices on Apache Kafka using a Streaming DataFrame API.

#stream-processing #streaming-data-processing #event-driven-architecture

Stars: 1.6k

Forks: 107

Last commit

spark-testing-base (Scala)

Base classes for writing Apache Spark tests in Scala and Python, simplifying test setup and teardown.

#apache-spark #unit-testing #integration-testing

Stars: 1.6k

Forks: 356

Last commit: 3 months ago

pyjanitor (Python)

A Python library providing clean, chainable functions for data cleaning and manipulation with pandas DataFrames.

#data-cleaning #hacktoberfest #python-library

Stars: 1.5k

Forks: 189

Last commit: 4 days ago

Project Nessie (Java)

A transactional catalog for data lakes with Git-like semantics, enabling version control and branching for data assets.

#version-control #iceberg #data-versioning

Stars: 1.5k

Forks: 183

Last commit: 21 hours ago

Hadoop

A curated list of amazingly awesome Hadoop and Hadoop ecosystem resources.

#awesome-list #data-engineering #big-data

Stars: 1.1k

Forks: 254

Last commit: 2 years ago

aws-big-data-blog (Java)

Code samples and examples from AWS Big Data Blog posts for implementing data analytics solutions on AWS.

#code-samples #aws-services #data-engineering

Stars: 893

Forks: 613

Last commit: 4 years ago

activeWorkflow (Ruby)

A polyglot workflow automation platform that orchestrates self-contained agents written in any language, enabling periodic execution, polling, and event-driven orchestration.

#event-driven #scheduled-tasks #activeworkflow

A pure Python HDFS client and Hadoop minicluster wrapper for interacting with Hadoop Distributed File System.

#python-hdfs-client #python-library #distributed-storage

Stars: 857

Forks: 213

Last commit: 4 years ago

chispa (Python)

A PySpark testing library providing fast helper methods with descriptive, color-coded error messages for DataFrame and column comparisons.

#apache-spark #unit-testing #dataframe

Stars: 771

Forks: 80

Last commit: 12 days ago

spark-daria (Scala)

A Scala library providing essential Spark extensions, helper methods, and custom transformations to maximize developer productivity.

#apache-spark #spark-extensions #spark

Stars: 767

Forks: 150

Last commit: 1 month ago

HStreamDB (Haskell)

An open-source, cloud-native streaming database designed for real-time data processing and IoT applications.

#stream-processing #iot #haskell

Stars: 721

Forks: 54

Last commit: 1 year ago

Carefully Curated 70 Spark Questions with Additional Optimization Guides (First in the series)

A comprehensive learning guide and interview refresher for Apache Spark, covering core concepts, architecture, and performance optimization.

#apache-spark #spark #performance-optimization

Stars: 691

Forks: 80

Last commit: 4 years ago

quinn (Python)

A PySpark library providing helper methods for DataFrame validation, column transformations, and schema utilities to boost developer productivity.

#dataframe-utilities #apache-spark #spark-extensions

Stars: 687

Forks: 95

Last commit: 1 month ago

Flintrock (Python)

A command-line tool for launching Apache Spark clusters on AWS EC2 with fast, configurable deployments.

#apache-spark #devops #apache-spark-cluster

Stars: 651

Forks: 120

Last commit: 1 year ago

desbordante (C++)

A high-performance data profiler for discovering and validating complex patterns like functional dependencies, inclusion dependencies, and association rules.

#data-cleaning #pattern-discovery #data-science

Stars: 492

Forks: 101

Last commit: 5 days ago