Big Data

258 projects

Showing 36 of 258 projects

Tonbo (Rust)

An embedded database for serverless and edge runtimes, storing data as Parquet on S3 with stateless compute.

#parquet #database #offline-first

Stars: 1.6k

Forks: 100

Last commit: 6 days ago

Elephas (Python)

Elephas is a Keras extension for distributed deep learning on Apache Spark, enabling data-parallel training at scale.

#apache-spark #model-training #spark

Stars: 1.6k

Forks: 303

Last commit: 3 years ago

mongo-hadoop (Java)

A library enabling MongoDB to serve as input source or output destination for Hadoop MapReduce tasks and ecosystem tools.

#mapreduce #bson #spark

Stars: 1.6k

Forks: 588

Last commit: 4 years ago

spark-testing-base (Scala)

Base classes for writing Apache Spark tests in Scala and Python, simplifying test setup and teardown.

#apache-spark #unit-testing #integration-testing

Stars: 1.6k

Forks: 356

Last commit: 3 months ago

Apache InLong (Java)

A one-stop, full-scenario integration framework for massive data, supporting data ingestion, synchronization, and subscription.

#massive-data-integration #stream-processing #batch-processing

Stars: 1.5k

Forks: 571

Last commit: 2 days ago

HiBench (Java)

A comprehensive benchmark suite for evaluating speed, throughput, and resource utilization of big data frameworks like Hadoop, Spark, and streaming engines.

#apache-spark #performance-testing #distributed-systems

Stars: 1.5k

Forks: 766

Last commit: 7 months ago

Project Nessie (Java)

A transactional catalog for data lakes with Git-like semantics, enabling version control and branching for data assets.

#version-control #iceberg #data-versioning

Stars: 1.5k

Forks: 183

Last commit: 22 hours ago

hdfs - A native go client for HDFS (Go)

A native Go client library and command-line tool for HDFS that connects directly to the namenode via protocol buffers.

#distributed-storage #command-line-tool #protocol-buffers

A Java library for disseminating in-memory datasets from a single producer to many consumers for high-performance read-only access.

#java-library #high-performance #data-synchronization

Jupyter magics and kernels for interactively working with remote Spark clusters via Livy, Lighter, or Ilum.

#apache-spark #spark #notebook

Stars: 1.4k

Forks: 443

Last commit: 10 months ago

Intel(R) Extension for Scikit-learn (Python)

A free software AI accelerator that speeds up scikit-learn applications by 10-100x on CPUs and GPUs with no code changes.

#oneapi #ai-machine-learning #ai-accelerator

A memory-efficient PHP stream parser for large JSON files and streams, enabling iteration without loading entire documents.

#parsing #stream-processing #json-pointer

Stars: 1.3k

Forks: 74

Last commit: 3 months ago

Trill (C#)

A high-performance one-pass in-memory streaming analytics engine for temporal and streaming data.

#query-processor #real-time-processing #in-memory-engine

Stars: 1.3k

Forks: 133

Last commit: 2 years ago

AthenaX (Java)

SQL-based streaming analytics platform that scales to process hundreds of billions of real-time events daily.

#apache-flink #event-processing #flink

Stars: 1.2k

Forks: 281

Last commit: 6 years ago

GraphFrames (Scala)

A DataFrame-based graph processing library for Apache Spark, enabling scalable graph analytics and algorithms.

#graph-processing #apache-spark #network-motifs

Stars: 1.2k

Forks: 268

Last commit: 20 hours ago

Hadoop

A curated list of amazingly awesome Hadoop and Hadoop ecosystem resources.

#awesome-list #data-engineering #big-data

Stars: 1.1k

Forks: 254

Last commit: 2 years ago

Hazelcast Jet (Java)

An open-source, in-memory, distributed batch and stream processing engine for Java applications.

#stream-processing #event-processing #hacktoberfest

Stars: 1.1k

Forks: 203

Last commit: 1 year ago

Datumbox (Java)

An open-source Java framework for rapid development of machine learning and statistical applications with large dataset support.

#regression-analysis #statistical-analysis #large-datasets

Stars: 1.1k

Forks: 279

Last commit: 2 years ago

geoparquet (Python)

An open specification for storing geospatial vector data (points, lines, polygons) in the Apache Parquet columnar storage format.

#geospatial #gis #columnar-storage

Stars: 1.1k

Forks: 73

Last commit: 5 days ago

ADAM (Scala)

A genomics analysis platform that uses Apache Spark to parallelize genomic data processing across clusters, replacing traditional file-based workflows.

#genomic-data #apache-spark #parquet

Stars: 1.1k

Forks: 312

Last commit: 4 months ago

hyperloglog (Go)

An improved HyperLogLog implementation with LogLog-Beta bias correction, sparse representation, and flexible precision for cardinality estimation.

#probabilistic-data-structures #stream-processing #data-sketching

A distributed, in-memory optimized analytics database that fuses Apache Spark and Apache Geode for unified stream, transaction, and analytic workloads.

#stream-processing #in-memory-analytics #apache-spark

Stars: 1.0k

Forks: 198

Last commit: 3 years ago

storm-crawler (Java)

A scalable, mature, and versatile web crawler built on Apache Storm for building low-latency, distributed crawling systems.

#distributed #real-time-processing #distributed-systems

Sparkling Water provides H2O functionality inside a Spark cluster.

#h2o #spark #pysparkling

Stars: 979

Forks: 361

Last commit: 8 months ago

sparklyr (R)

An R interface for Apache Spark that enables distributed data processing, machine learning, and SQL queries using familiar R syntax.

#apache-spark #distributed #dplyr

Stars: 971

Forks: 308

Last commit: 22 days ago

Livy (Scala)

A REST interface for interacting with Apache Spark from anywhere, enabling remote code execution and job submissions.

#apache-spark #spark #interactive-computing

Stars: 958

Forks: 625

Last commit: 15 days ago

DataSketches (Java)

A Java library of stochastic streaming algorithms (sketches) for approximate analysis of massive datasets.

#statistical-analysis #data-sketching #java-library

Stars: 958

Forks: 223

Last commit: 1 day ago

Mobius: C# API for Spark (C#)

C# and F# language binding and extensions for Apache Spark, enabling .NET developers to write Spark driver programs and data processing operations.

#rdd #apache-spark #spark

Stars: 947

Forks: 209

Last commit: 7 months ago

aws-big-data-blog (Java)

Code samples and examples from AWS Big Data Blog posts for implementing data analytics solutions on AWS.

#code-samples #aws-services #data-engineering

Stars: 893

Forks: 613

Last commit: 4 years ago

camus (Java)

LinkedIn's previous-generation Kafka-to-HDFS pipeline for batch data ingestion.

#batch-processing #linkedin #kafka

Stars: 881

Forks: 451

Last commit: 5 years ago

PGM-index (C++)

A learned index structure enabling fast lookups, range searches, and updates on billions of items with minimal space usage.

#b-tree #range-queries #database

Stars: 873

Forks: 103

Last commit: 1 year ago

Apache Metron (incubating) (Java)

A centralized platform for security monitoring and analysis, integrating big data technologies for log aggregation, threat detection, and behavioral analytics.

#stream-processing #security-analytics #behavioral-analytics

Stars: 870

Forks: 503

Last commit