Big Data

258 projects

Showing 36 of 258 projects

data.table R

A high-performance R package for fast data manipulation of large datasets, extending data.frame with concise syntax and memory efficiency.

#parallel-computing #high-performance #r-package

A compressed bitmap data structure for Java that outperforms alternatives like WAH, EWAH, and Concise in speed and compression.

#apache-spark #java-library #compressed-bitmap

Stars 3.9k

Forks 594

Last commit 1 day ago

TensorFlowOnSpark Python

Enables distributed TensorFlow training and inferencing on Apache Spark and Hadoop clusters with minimal code changes.

#apache-spark #yahoo #model-training

Stars 3.8k

Forks 939

Last commit 3 years ago

deequ Scala

A library built on Apache Spark for defining unit tests to measure data quality in large datasets.

#data-testing #apache-spark #spark

Stars 3.6k

Forks 584

Last commit 3 days ago

Apache Heron (incubating) Java

Apache Heron is a real-time, distributed, fault-tolerant stream processing engine developed by Twitter.

#stream-processing #real-time-analytics #distributed-systems

Stars 3.6k

Forks 581

Last commit 3 years ago

awesome-etl list

A curated list of awesome ETL frameworks, libraries, and software for data integration and pipeline development.

#open-source #workflow-orchestration #data-integration

Stars 3.6k

Forks 372

Last commit 2 months ago

gleam Go

A high-performance distributed map/reduce system with DAG execution, written in Go, supporting standalone or distributed modes.

#stream-processing #cluster-computing #distributed-systems

Stars 3.6k

Forks 292

Last commit 14 days ago

Scalding Scala

A Scala API for Cascading that simplifies writing Hadoop MapReduce jobs with Scala integration.

#cascading #mapreduce #functional-programming

Stars 3.5k

Forks 699

Last commit 3 years ago

Koalas Python

Koalas provides the pandas DataFrame API on Apache Spark, enabling data scientists to work with big data using familiar pandas syntax.

#apache-spark #spark #mlflow

Stars 3.4k

Forks 369

Last commit 2 years ago

glow Go

A distributed computation system written in Go for parallel and cluster processing, similar to Hadoop MapReduce and Spark.

#mapreduce #cluster-computing #go-library

Stars 3.2k

Forks 249

Last commit 7 years ago

HugeGraph Java

A fast, highly-scalable graph database supporting over 10 billion vertices and edges with OLTP capabilities and dual Gremlin/Cypher query language support.

#database #graph #rocksdb

Stars 3.1k

Forks 625

Last commit 18 hours ago

roaring Go

A Go implementation of Roaring bitmaps, a compressed bitmap data structure for fast set operations on large integer datasets.

#performance-optimization #roaring-bitmaps #go-library

A RESTful job server for Apache Spark that provides a service interface for submitting and managing Spark jobs, jars, and contexts.

#apache-spark #spark #rest-api

Stars 2.8k

Forks 970

Last commit 4 months ago

flume Java

A distributed service for efficiently collecting, aggregating, and moving large amounts of log-like data.

#stream-processing #apache #library

Stars 2.6k

Forks 1.5k

Last commit 2 days ago

PyGraphistry Python

A Python library for loading, shaping, embedding, and exploring large graphs with GPU-accelerated visualization and analytics.

#networkx #graph #graph-query-language

A real-time distributed analytical database built entirely on bitmaps for low-latency queries on fresh data.

#real-time-database #stream-processing #bitmap

Stars 2.5k

Forks 239

Last commit 2 years ago

fastcache Go

A fast, thread-safe in-memory cache for Go designed to handle massive entry counts with minimal garbage collection overhead.

#systems-programming #in-memory-cache #fast

Stars 2.4k

Forks 195

Last commit 1 month ago

Apache Sedona Java

A cluster computing framework for processing large-scale geospatial data within Apache Spark, Flink, and other big data systems.

#apache-flink #hacktoberfest #apache-spark

A distributed, multi-tenant gateway providing serverless SQL on data warehouses and lakehouses.

#hiveserver2-alternative #hacktoberfest #spark

A Scala library providing abstract algebra types and structures for building aggregation systems and analytics.

#functional-programming #monoids #distributed-systems

Stars 2.3k

Forks 346

Last commit 8 months ago

Gobblin from LinkedIn Java

A distributed data integration framework for big data ecosystems, handling ingestion, replication, organization, and lifecycle management for both streaming and batch data.

#stream-processing #apache #data-lifecycle-management

Stars 2.3k

Forks 749

Last commit