Showing 36 of 219 projects
An embedded database for serverless and edge runtimes, storing data as Parquet on S3 with stateless compute.
A one-stop, full-scenario integration framework for massive data, supporting data ingestion, synchronization, and subscription.
A comprehensive benchmark suite for evaluating speed, throughput, and resource utilization of big data frameworks like Hadoop, Spark, and streaming engines.
A transactional catalog for data lakes with Git-like semantics, enabling version control and branching for data assets.
A native Go client library and command-line tool for HDFS that connects directly to the namenode via protocol buffers.
A Java library for disseminating in-memory datasets from a single producer to many consumers for high-performance read-only access.
Jupyter magics and kernels for interactively working with remote Spark clusters via Livy, Lighter, or Ilum.
A curated collection of 500+ resources for data analysis and data science, covering Python, SQL, ML, visualization, roadmaps, and interview prep.
A free software AI accelerator that speeds up scikit-learn applications by 10-100x on CPUs and GPUs with no code changes.
A memory-efficient PHP stream parser for large JSON files and streams, enabling iteration without loading entire documents.
A high-performance one-pass in-memory streaming analytics engine for temporal and streaming data.
A SQL-based streaming analytics platform that scales to process hundreds of billions of real-time events daily.
A DataFrame-based graph processing library for Apache Spark, enabling scalable graph analytics and algorithms.
A curated list of amazingly awesome Hadoop and Hadoop ecosystem resources.
An open-source, in-memory, distributed batch and stream processing engine for Java applications.
An open-source Java framework for rapid development of machine learning and statistical applications with large dataset support.
A genomics analysis platform that uses Apache Spark to parallelize genomic data processing across clusters, replacing traditional file-based workflows.
An open specification for storing geospatial vector data (points, lines, polygons) in the Apache Parquet columnar storage format.
An improved HyperLogLog implementation with LogLog-Beta bias correction, sparse representation, and flexible precision for cardinality estimation.
A distributed, in-memory optimized analytics database that fuses Apache Spark and Apache Geode for unified stream, transaction, and analytic workloads.
A scalable, mature, and versatile web crawler built on Apache Storm for building low-latency, distributed crawling systems.
An R interface for Apache Spark that enables distributed data processing, machine learning, and SQL queries using familiar R syntax.
A REST interface for interacting with Apache Spark from anywhere, enabling remote code execution and job submissions.
A Java library of stochastic streaming algorithms (sketches) for approximate analysis of massive datasets.
C# and F# language bindings and extensions for Apache Spark, enabling .NET developers to write Spark driver programs and data processing operations.
Code samples and examples from AWS Big Data Blog posts for implementing data analytics solutions on AWS.
LinkedIn's previous-generation Kafka-to-HDFS pipeline for batch data ingestion.
A learned index structure enabling fast lookups, range searches, and updates on billions of items with minimal space usage.
A centralized platform for security monitoring and analysis, integrating big data technologies for log aggregation, threat detection, and behavioral analytics.
A pure Python HDFS client and Hadoop minicluster wrapper for interacting with Hadoop Distributed File System.
A distributed stream processing framework built on Apache Kafka and Apache Hadoop YARN for fault-tolerant, stateful processing.
A Python API for Deequ, enabling data quality testing and validation on large datasets using Apache Spark.
A Scala library providing essential Spark extensions, helper methods, and custom transformations to maximize developer productivity.
A collection of R packages for interacting with Hadoop ecosystems, enabling big data analysis from R.
A lightweight real-time big data streaming engine built on Akka for high-throughput, low-latency data processing.
A pure Go client library for interacting with HBase databases, supporting HBase >= 1.0.