Data Engineering

The "Awesome Data Engineering" project is a curated collection of resources for data engineers, who design and build systems for collecting, storing, and analyzing data. The list covers data pipelines, ETL tools, data warehousing solutions, frameworks, best practices, tutorials, and community resources. It is intended both for beginners learning the fundamentals and for experienced engineers looking for advanced techniques, and it can help you find tools and methods that streamline data workflows and improve data management.

data-pipelines · etl-tools · data-warehousing · big-data · data-architecture · data-integration · analytics · data-management

8.5k stars · 1.5k forks · 0 contributors

Databases

59 projects

RQLite

A lightweight, fault-tolerant distributed relational database built on SQLite, designed for high availability with minimal operational effort.

An open-source, cloud-native, distributed SQL database offering MySQL compatibility, horizontal scalability, and HTAP capabilities.

A collection of Python scripts for automating MySQL server lifecycle management, backups, failovers, and replication monitoring in production environments.

Relational Database Service (RDS)

A lightweight, high-performance network server for the Kyoto Cabinet key-value database with replication and memcached protocol support.

C++ · 279 stars · 2 years ago

IonDB

A key-value datastore for Arduino and resource-constrained embedded systems with disk-based persistent storage.

A Python tool to easily create, manage, and destroy local Apache Cassandra clusters for testing.

Python · 1,235 stars · 3 months ago

ScyllaDB

A high-performance NoSQL database compatible with Apache Cassandra and Amazon DynamoDB, built on a shared-nothing architecture.

C++ · 15,671 stars · 22 hours ago

A distributed, Prometheus-compatible, real-time, in-memory time series database designed for massive scalability and low-latency operational metrics.

Scala · 1,468 stars · 21 hours ago

Percona Server for MongoDB

percona.com

MemDB

A distributed transactional in-memory database that adds ACID transactions to MongoDB while maintaining scalability.

JavaScript · 593 stars · 8 years ago

titan.thinkaurelius.com

FlockDB

Scala · 3,317 stars · 9 years ago

Actionbase

A specialized database for social interactions (likes, views, follows) that precomputes data at write time for real-time, high-scale reads.

Kotlin · 223 stars · 15 hours ago
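
The write-time precomputation mentioned in this description can be illustrated with a small in-memory sketch. This is not Actionbase's API: the `InteractionStore` class and its method names are hypothetical, and the example only assumes that counts are updated when an interaction is written so that reads become a single lookup.

```python
from collections import defaultdict


class InteractionStore:
    """Toy store that keeps like counts up to date at write time.

    Hypothetical illustration only; it does not mirror any real database API.
    """

    def __init__(self):
        self._likes = defaultdict(set)        # item -> users who liked it
        self._like_counts = defaultdict(int)  # item -> precomputed count

    def like(self, user, item):
        # Idempotent write: the count changes only when a new edge is added.
        if user not in self._likes[item]:
            self._likes[item].add(user)
            self._like_counts[item] += 1

    def unlike(self, user, item):
        # Removing an existing edge decrements the precomputed count.
        if user in self._likes[item]:
            self._likes[item].remove(user)
            self._like_counts[item] -= 1

    def like_count(self, item):
        # Read path: one dictionary lookup, no scan or aggregation.
        return self._like_counts.get(item, 0)

    def has_liked(self, user, item):
        return user in self._likes.get(item, ())
```

The trade-off in this sketch is extra work on every write in exchange for constant-time reads, which suits workloads where reads far outnumber writes.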

A graph database framework for storing and querying large-scale graphs with rich properties and in-database aggregation.

Java · 1,789 stars · 1 year ago

InfluxDB

A scalable time series database optimized for real-time metrics, events, and analytics with fast query response.

A fast distributed scalable time series database built on top of Cassandra.

Java · 1,763 stars · 4 months ago

Heroic

A scalable time series database built on Bigtable, Cassandra, and Elasticsearch for high-volume metrics.

Java · 846 stars · 5 years ago

Druid

A high-performance real-time analytics database designed for fast queries and ingest to reduce time to insight.

Java · 14,035 stars · 17 hours ago

Riak-TS

basho.com

Akumuli

A high-performance time-series database optimized for modern hardware, supporting both metrics and events with efficient compression.

C++ · 839 stars · 4 years ago

Dalmatiner DB

A fast, low-overhead metric database written in pure Erlang, optimized for time-series data storage and querying.

Erlang · 692 stars · 7 years ago

Blueflood

A multi-tenant distributed system for ingesting, rolling up, and serving time series metrics at massive scale.

Java · 598 stars · 1 year ago
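
As a rough illustration of what "rolling up" time series metrics means, the sketch below groups raw points into fixed windows and stores a summary per window. It is a generic example, not Blueflood's implementation; the `rollup` function, its parameters, and the choice of summary statistics are assumptions, and timestamps are assumed to be integer seconds.

```python
from collections import defaultdict


def rollup(points, window_seconds):
    """Summarize (timestamp, value) points per fixed-size time window.

    Hypothetical helper: timestamps are integer seconds, and each window
    is identified by the timestamp at which it starts.
    """
    buckets = defaultdict(list)
    for ts, value in points:
        # Align the timestamp to the start of its window.
        buckets[ts - ts % window_seconds].append(value)
    return {
        start: {
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "avg": sum(values) / len(values),
        }
        for start, values in sorted(buckets.items())
    }
```

Calling `rollup(points, 60)` on per-second data would yield one summary per minute, which is the kind of coarser series a metrics system can serve for long time ranges.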

Timely

A secure time series database backed by Apache Accumulo with Grafana integration for data visualization.

Java · 394 stars · 2 months ago

Tarantool

An in-memory computing platform combining a high-performance database and Lua application server for scalable web components.

Lua · 3,655 stars · 1 day ago

cayley

An open-source graph database for linked data, inspired by Google's Knowledge Graph.

Go · 15,049 stars · 2 days ago

Snappydata

A distributed, in-memory optimized analytics database that fuses Apache Spark and Apache Geode for unified stream, transaction, and analytic workloads.

Scala · 1,032 stars · 3 years ago

Comparison

4 projects

datacompy

A Python library for comparing Pandas, Polars, Spark, and Snowpark DataFrames with detailed reporting and flexible matching.

Python · 654 stars · 3 days ago

dvt

A Python CLI tool for comparing data across heterogeneous databases and data warehouses to ensure migration accuracy.

Python · 515 stars · 19 hours ago

koala-diff

A blazingly fast data comparison tool for Python that instantly compares massive CSV/Parquet datasets, powered by Rust.

Python · 7 stars · 4 months ago

everyrow

A Python SDK for deploying teams of AI research agents to forecast, score, classify, and gather data at scale.

Python · 49 stars · 1 day ago

Ingestion

40 projects

ingestr

A CLI tool to copy data between any databases and platforms with a single command, no code required.

Change data capture from PostgreSQL into Kafka using logical decoding, enabling real-time data streaming.

C · 6 stars · 3 years ago

kafkat

Simplified command-line administration tool for Kafka brokers, providing essential management operations.

Ruby · 502 stars · 7 years ago

kafkacat

A lightweight, non-JVM command-line tool for producing, consuming, and inspecting Apache Kafka messages.

C · 5,768 stars · 2 years ago

pg-kafka

A PostgreSQL extension that enables sending messages directly to Apache Kafka from within the database.

C · 112 stars · 11 years ago

librdkafka

A high-performance C/C++ client library for Apache Kafka, supporting producers, consumers, and admin operations.

C · 1,010 stars · 2 days ago

kafka-docker

A Docker image and configuration for running Apache Kafka in containerized environments.

Shell · 6,963 stars · 2 years ago

kafka-manager

A web-based tool for managing Apache Kafka clusters, enabling cluster inspection, topic management, and partition operations.

Scala · 11,926 stars · 3 years ago

kafka-node

A Node.js client for Apache Kafka 0.9 and later, providing producers, consumers, and administrative APIs.

JavaScript · 2,652 stars · 2 years ago

Secor

A fault-tolerant service that persists Kafka log data to cloud storage like S3, GCS, Azure Blob Storage, and OpenStack Swift.

Java · 1,857 stars · 4 months ago

Kafka-logger

A Kafka transport for Winston that enables logging to Apache Kafka topics via REST proxy.

JavaScript · 45 stars · 7 years ago

Kroxylicious

A snappy open-source proxy for Apache Kafka that enables encryption, multi-tenancy, and schema validation.

A deprecated tool for collecting, processing, and delivering data from multiple sources with Go and Lua plugin support.

Go · 3,399 stars · 2 years ago

Gobblin

A distributed data integration framework for big data ecosystems, handling ingestion, replication, organization, and lifecycle management for both streaming and batch data.

A Python library that simplifies data integration between pandas and AWS services like Athena, S3, Redshift, and more.

Python · 4,116 stars · 3 days ago

A PHP library for live importing Google Sheets data into data warehouses with periodic delta loads.

A lightweight Node.js ETL framework for extracting data from databases and loading it into data lakes and warehouses.

TypeScript · 2 stars · 10 months ago

Kreuzberg

A polyglot document intelligence framework with a Rust core for extracting text, metadata, and structured data from 91+ file formats.

Rust · 8,691 stars · 16 hours ago

A CRDT-based merge library that guarantees mathematical convergence for DataFrames, JSON, ML models, and distributed agents.

Python513 days ago
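
The convergence guarantee mentioned in this description comes from merge functions that are commutative, associative, and idempotent. The sketch below shows those properties on the simplest CRDT, a grow-only counter. It is a generic illustration and not the API of the library listed here; `merge_gcounter` and `counter_value` are hypothetical names.

```python
def merge_gcounter(a, b):
    """Merge two grow-only counters by taking the per-replica maximum.

    Each counter maps a replica id to the number of increments that
    replica has recorded. Hypothetical helper for illustration only.
    """
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}


def counter_value(counter):
    """Total count is the sum over all replicas."""
    return sum(counter.values())


# Three replicas that have seen different subsets of the updates.
r1 = {"a": 3, "b": 1}
r2 = {"b": 4, "c": 2}
r3 = {"a": 1, "c": 5}

# Merging in any order or grouping yields the same state.
left = merge_gcounter(merge_gcounter(r1, r2), r3)
right = merge_gcounter(r1, merge_gcounter(r3, r2))
```

Because the merge is a per-key maximum, replicas can exchange state in any order, any number of times, and still end up with identical counters.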

File System

14 projects

HDFS

hadoop.apache.org

Snakebite

A pure Python HDFS client and Hadoop minicluster wrapper for interacting with Hadoop Distributed File System.

Python · 857 stars · 4 years ago

Simple Storage Service (S3)

aws.amazon.com

smart_open

Python · 3,453 stars · 9 days ago

A high-performance distributed POSIX file system for cloud-native environments, storing data in object storage and metadata in databases.

A lightweight, HDFS-compatible file system built over Cassandra with a fat driver design for easy deployment.

A fast distributed storage system for blobs, objects, files, and data lakes, optimized for billions of files with O(1) disk seek.

Serialization format

9 projects

A fast compression/decompression library optimized for speed over maximum compression.

A language-neutral, platform-neutral, extensible mechanism for serializing structured data developed by Google.

C++ · 71,621 stars · 15 hours ago

SequenceFile

wiki.apache.org

Kryo

A fast and efficient binary object graph serialization and cloning framework for Java.

HTML · 6,535 stars · 4 days ago

Related Awesome Lists

Public Datasets

The "Awesome Public Datasets" project is a curated collection of publicly available datasets across various domains, including government, healthcare, finance, and social sciences. This list features datasets in multiple formats, along with links to tools and platforms that facilitate data analysis and visualization. It is an invaluable resource for researchers, data scientists, and students looking to access high-quality data for their projects or studies. By providing a wide array of datasets, this collection empowers users to explore, analyze, and derive insights from real-world data. Dive in to discover the wealth of information available for your next data-driven endeavor!

73.8k stars

Big Data

The "Awesome Big Data" project is a curated collection of resources focused on big data technologies and practices that enable the processing and analysis of vast amounts of data. This list encompasses a variety of categories, including frameworks, tools, libraries, databases, and tutorials that cater to both beginners and experienced data professionals. Users can explore resources related to data storage, processing, analytics, and visualization, making it an invaluable asset for data scientists, engineers, and researchers. Whether you're looking to enhance your big data skills or find the right tools for your projects, this collection provides a comprehensive guide to navigating the big data landscape.

14.3k stars

Network Analysis

The "Awesome Network Analysis" project is a curated collection of resources focused on the study and analysis of networks, which are structures made up of interconnected elements. This list encompasses a variety of tools, libraries, datasets, and tutorials that facilitate the exploration of network theory, graph analysis, and visualization techniques. It serves as a valuable resource for researchers, data scientists, and enthusiasts interested in understanding complex systems, social networks, and data relationships. Whether you are a beginner looking to grasp the basics or an experienced analyst seeking advanced methodologies, this collection provides essential tools and insights to enhance your network analysis projects.

4.0k stars

Streaming

The "Awesome Streaming" project is a curated collection of resources focused on streaming technologies, which enable the real-time processing and distribution of data. The list covers frameworks, libraries, tools, tutorials, and community resources for different streaming protocols and architectures. It is useful for developers, data engineers, and researchers who want to implement or improve streaming solutions in their applications, and it offers information and tools for managing and analyzing streaming data effectively.

3.0k stars
