The "Awesome Data Engineering" project is a curated collection of resources aimed at supporting professionals in the field of data engineering, which involves the design and construction of systems for collecting, storing, and analyzing data. This list encompasses a variety of categories, including data pipelines, ETL tools, data warehousing solutions, frameworks, and best practices, as well as tutorials and community resources. Whether you are a beginner looking to understand the fundamentals or an experienced engineer seeking advanced techniques, this list offers valuable insights and tools to enhance your data engineering projects. Dive into this collection to discover the tools and methodologies that can streamline your data workflows and improve your data management capabilities.
The "Awesome Public Datasets" project is a curated collection of publicly available datasets across various domains, including government, healthcare, finance, and social sciences. This list features datasets in multiple formats, along with links to tools and platforms that facilitate data analysis and visualization. It is an invaluable resource for researchers, data scientists, and students looking to access high-quality data for their projects or studies. By providing a wide array of datasets, this collection empowers users to explore, analyze, and derive insights from real-world data. Dive in to discover the wealth of information available for your next data-driven endeavor!
The "Awesome Big Data" project is a curated collection of resources focused on big data technologies and practices that enable the processing and analysis of vast amounts of data. This list encompasses a variety of categories, including frameworks, tools, libraries, databases, and tutorials that cater to both beginners and experienced data professionals. It covers data storage, processing, analytics, and visualization, which makes it a useful reference for data scientists, engineers, and researchers. Whether you're looking to enhance your big data skills or find the right tools for your projects, this collection provides a comprehensive guide to navigating the big data landscape.
The "Awesome Network Analysis" project is a curated collection of resources focused on the study and analysis of networks, which are structures made up of interconnected elements. This list encompasses a variety of tools, libraries, datasets, and tutorials that facilitate the exploration of network theory, graph analysis, and visualization techniques. It serves as a valuable resource for researchers, data scientists, and enthusiasts interested in understanding complex systems, social networks, and data relationships. Whether you are a beginner looking to grasp the basics or an experienced analyst seeking advanced methodologies, this collection provides essential tools and insights to enhance your network analysis projects.
The "Awesome Streaming" project is a curated collection of resources focused on streaming technologies, which enable the real-time processing and distribution of data. This list encompasses a variety of categories, including frameworks, libraries, tools, tutorials, and community resources that cater to different streaming protocols and architectures. It is beneficial for developers, data engineers, and researchers who are looking to implement or enhance streaming solutions in their applications. With this information and tooling, readers can explore ways to manage and analyze streaming data effectively.
A lightweight, fault-tolerant distributed relational database built on SQLite, designed for high availability with minimal operational effort.
An open-source, cloud-native, distributed SQL database offering MySQL compatibility, horizontal scalability, and HTAP capabilities.
A collection of Python scripts for automating MySQL server lifecycle management, backups, failovers, and replication monitoring in production environments.
A lightweight, high-performance network server for the Kyoto Cabinet key-value database with replication and memcached protocol support.
A key-value datastore for Arduino and resource-constrained embedded systems with disk-based persistent storage.
A Python tool to easily create, manage, and destroy local Apache Cassandra clusters for testing.
A high-performance NoSQL database compatible with Apache Cassandra and Amazon DynamoDB, built on a shared-nothing architecture.
A distributed, Prometheus-compatible, real-time, in-memory time series database designed for massive scalability and low-latency operational metrics.
A distributed transactional in-memory database that adds ACID transactions to MongoDB while maintaining scalability.
A specialized database for social interactions (likes, views, follows) that precomputes data at write time for real-time, high-scale reads.
A graph database framework for storing and querying large-scale graphs with rich properties and in-database aggregation.
A scalable time series database optimized for real-time metrics, events, and analytics with fast query response.
A fast, distributed, scalable time series database built on top of Cassandra.
A scalable time series database built on Bigtable, Cassandra, and Elasticsearch for high-volume metrics.
A high-performance real-time analytics database designed for fast queries and fast ingestion to reduce time to insight.
A high-performance time-series database optimized for modern hardware, supporting both metrics and events with efficient compression.
A fast, low-overhead metric database written in pure Erlang, optimized for time-series data storage and querying.
A multi-tenant distributed system for ingesting, rolling up, and serving time series metrics at massive scale.
A secure time series database backed by Apache Accumulo with Grafana integration for data visualization.
An in-memory computing platform combining a high-performance database and Lua application server for scalable web components.
An open-source graph database for linked data, inspired by Google's Knowledge Graph.
A distributed, in-memory optimized analytics database that fuses Apache Spark and Apache Geode for unified stream, transaction, and analytic workloads.
A CLI tool to copy data between any databases and platforms with a single command, no code required.
A simplified command-line administration tool for Kafka brokers, providing essential management operations.
A lightweight, non-JVM command-line tool for producing, consuming, and inspecting Apache Kafka messages.
A PostgreSQL extension that enables sending messages directly to Apache Kafka from within the database.
A high-performance C/C++ client library for Apache Kafka, supporting producers, consumers, and admin operations.
A Docker image and configuration for running Apache Kafka in containerized environments.
A web-based tool for managing Apache Kafka clusters, enabling cluster inspection, topic management, and partition operations.
A Node.js client for Apache Kafka 0.9 and later, providing producers, consumers, and administrative APIs.
A fault-tolerant service that persists Kafka log data to cloud storage such as S3, GCS, Azure Blob Storage, and OpenStack Swift.
A snappy open-source proxy for Apache Kafka that enables encryption, multi-tenancy, and schema validation.
A deprecated tool for collecting, processing, and delivering data from multiple sources with Go and Lua plugin support.
A distributed data integration framework for big data ecosystems, handling ingestion, replication, organization, and lifecycle management for both streaming and batch data.
A Python library that simplifies data integration between pandas and AWS services such as Athena, S3, and Redshift.
A polyglot document intelligence framework with a Rust core for extracting text, metadata, and structured data from 91+ file formats.
A pure Python HDFS client and Hadoop minicluster wrapper for interacting with Hadoop Distributed File System.
A high-performance distributed POSIX file system for cloud-native environments, storing data in object storage and metadata in databases.
A fast compression/decompression library optimized for speed over maximum compression.
A language-neutral, platform-neutral, extensible mechanism for serializing structured data developed by Google.
A fast and efficient binary object graph serialization and cloning framework for Java.