A compressed bitmap data structure for Java that outperforms alternatives like WAH, EWAH, and Concise in speed and compression.
Apache Heron is a real-time, distributed, fault-tolerant stream processing engine developed by Twitter.
A library built on Apache Spark for defining unit tests to measure data quality in large datasets.
A high-performance distributed map/reduce system with DAG execution, written in Go, supporting standalone or distributed modes.
A curated list of awesome ETL frameworks, libraries, and software for data integration and pipeline development.
A Scala API for Cascading that simplifies writing Hadoop MapReduce jobs with Scala integration.
Koalas provides the pandas DataFrame API on Apache Spark, enabling data scientists to work with big data using familiar pandas syntax.
A distributed computation system written in Go for parallel and cluster processing, similar to Hadoop MapReduce and Spark.
A fast, highly-scalable graph database supporting over 10 billion vertices and edges with OLTP capabilities and dual Gremlin/Cypher query language support.
A Go implementation of Roaring bitmaps, a compressed bitmap data structure for fast set operations on large integer datasets.
A RESTful job server for Apache Spark that provides a service interface for submitting and managing Spark jobs, jars, and contexts.
A distributed service for efficiently collecting, aggregating, and moving large amounts of log-like data.
A real-time distributed analytical database built entirely on bitmaps for low-latency queries on fresh data.
A Python library for loading, shaping, embedding, and exploring large graphs with GPU-accelerated visualization and analytics.
A fast, thread-safe in-memory cache for Go designed to handle massive entry counts with minimal garbage collection overhead.
A distributed, multi-tenant gateway providing serverless SQL on data warehouses and lakehouses.
A cluster computing framework for processing large-scale geospatial data within Apache Spark, Flink, and other big data systems.
A Scala library providing abstract algebra types and structures for building aggregation systems and analytics.
A distributed data integration framework for big data ecosystems, handling ingestion, replication, organization, and lifecycle management for both streaming and batch data.
A Big Data IDE for discovering, creating, and sharing data analyses, queries, and tables with collocated metadata.
.NET for Apache Spark provides high-performance .NET APIs for Apache Spark, enabling C# and F# developers to work with structured and streaming data.
A distributed query execution engine that extends Apache DataFusion to run SQL queries in parallel across multiple nodes.
Native integration library for using Elasticsearch with Hadoop, Spark, and Hive for real-time search and analytics on big data.
A connector that enables Apache Spark to read from and write to Apache Cassandra databases for distributed data processing.
A high-performance Python package for fast, multi-threaded manipulation of large tabular datasets, inspired by R's data.table.
A curated list of awesome Apache Spark packages, libraries, and resources for data engineers and scientists.
A fault-tolerant service that persists Kafka log data to cloud storage like S3, GCS, Azure Blob Storage, and OpenStack Swift.
A graph database framework for storing and querying large-scale graphs with rich properties and in-database aggregation.
A federated Big Data orchestration service that simplifies job execution across distributed clusters by abstracting infrastructure complexity.
A JavaScript library for fast n-dimensional filtering and grouping of large multivariate datasets in the browser.
An open source, serverless security data lake for AWS that normalizes logs, enables detection-as-code, and supports petabyte-scale threat hunting.
A library enabling MongoDB to serve as an input source or output destination for Hadoop MapReduce tasks and ecosystem tools.
Elephas is a Keras extension for distributed deep learning on Apache Spark, enabling data-parallel training at scale.
Base classes for writing Apache Spark tests in Scala and Python, simplifying test setup and teardown.
An embedded database for serverless and edge runtimes, storing data as Parquet on S3 with stateless compute.