An organized reading list of patterns, case studies, and articles on building scalable, reliable, and performant large-scale systems.
A distributed event streaming platform for building high-performance data pipelines, streaming analytics, and data integration.
A distributed storage system for object storage (S3), file systems, and Iceberg tables, optimized for billions of files with O(1) disk access.
A comprehensive collection of data science Python notebooks covering deep learning, machine learning, big data, visualization, and essential tools.
A scalable, portable, and distributed gradient boosting library for efficient machine learning across multiple languages and platforms.
A decentralized graph database and synchronization protocol for building real-time, offline-first applications with end-to-end encryption.
A high-performance NoSQL database compatible with Apache Cassandra and Amazon DynamoDB, built on a shared-nothing architecture.
A curated list of awesome big data frameworks, resources, and tools across various categories.
A high-performance real-time analytics database designed for fast queries and fast ingestion to reduce time to insight.
A high-performance distributed POSIX file system for cloud-native environments, storing data in object storage and metadata in databases.
A fast distributed SQL query engine for big data analytics, enabling interactive queries across diverse data sources.
An open source machine learning server for developers and data scientists, supporting event collection, algorithm deployment, and REST API queries.
A web-based tool for managing Apache Kafka clusters, enabling cluster inspection, topic management, and partition operations.
A cloud-native search engine optimized for observability data like logs and traces, offering sub-second search on cloud storage.
A drop-in replacement for pandas that scales data analysis workflows to use all CPU cores and handle out-of-memory datasets.
A high-performance table format for huge analytic datasets, enabling multiple engines to safely work with the same tables simultaneously.
An open-source storage framework that enables building a Lakehouse architecture with ACID transactions and scalable metadata handling.
An extensible SQL query engine written in Rust, using Apache Arrow as its in-memory format for building fast database and analytic systems.
A curated list of data engineering tools, frameworks, databases, and resources for software developers.
A high-performance Python DataFrame library for lazy out-of-core processing and visualization of billion-row datasets at interactive speeds.
A fast, concurrent, evicting in-memory cache for Go designed to store gigabytes of data with minimal GC overhead.
An open-source, in-memory platform for distributed and scalable machine learning with support for a wide range of algorithms and big data technologies.
An open-source, large-scale network packet capture, indexing, and analysis system with a web interface.
An open-source, large-scale network packet capture, indexing, and analysis system for security and network monitoring.
A distributed caching platform that bridges computation frameworks and storage systems for large-scale analytics and ML workloads.
An open-source feature store for managing and serving machine learning features for training and online inference.
A unified real-time data platform combining stream processing with a fast data store for instant action on data-in-motion.
An open data lakehouse platform for incremental data processing with upserts, deletes, and time-travel queries.
A high-performance, multi-database compatible .NET ORM framework with low-code features and enterprise-ready solutions.
An open-source, distributed graph database optimized for storing and querying large graphs with billions of vertices and edges.
A cluster manager that provides efficient resource isolation and sharing across distributed applications on a shared pool of nodes.
A distributed database for high-performance computing with in-memory speed, ACID compliance, and ANSI SQL support.
A Vue component for rendering large lists with high performance using virtual scrolling.
A high-performance R package for fast data manipulation of large datasets, extending data.frame with concise syntax and memory efficiency.
A framework that enables distributed TensorFlow training and inference on Apache Spark and Hadoop clusters with minimal code changes.