Open-Awesome

© 2026 Open-Awesome. Curated for the developer elite.


Big Data

219 projects

Showing 36 of 219 projects

PredictionIO Ruby SDK · Ruby

A Ruby SDK for integrating applications with Apache PredictionIO's Event Server and Engine APIs.

#personalization #event-tracking #predictionio
Stars: 191
Forks: 37
Last commit: 7 years ago
HBase

A curated list of awesome HBase projects, clients, frameworks, tools, and resources.

#data-storage #data-integration #hbase
Stars: 178
Forks: 41
Last commit: 22 days ago
treeviz · TypeScript

A JavaScript library for creating interactive tree diagrams with dynamic data updates and customizable visualizations.

#family-tree #represent-tree-diagrams #svg
Stars: 174
Forks: 28
Last commit: 2 years ago
Big Data For Chimps · Ruby

A practical guide to exploratory data analytics using Hadoop with Pig and Ruby for terabyte-scale data processing.

#exploratory-analysis #data-science #terabyte-processing
Stars: 169
Forks: 63
Last commit: 11 years ago
Crossdata · Scala

A distributed framework extending Apache Spark with unified SQL access to multiple datastores, optimized connectors, and streaming support.

#apache-spark #scala-library #data-integration
Stars: 169
Forks: 51
Last commit: 6 years ago
Streamline · Java

A visual development platform for building, deploying, and managing streaming analytics applications with multiple engine bindings.

#stream-processing #flink #storm
Stars: 167
Forks: 95
Last commit: 2 years ago
DistributedR · R

A scalable high-performance platform for R that enables large-scale machine learning, statistical analysis, and graph processing across clusters.

#statistical-analysis #graph-processing #high-performance-computing
Stars: 162
Forks: 54
Last commit: 10 years ago
Haeinsa · Java

A linearly scalable multi-row, multi-table transaction library for HBase with serializable isolation.

#transaction-library #database #concurrency-control
Stars: 160
Forks: 41
Last commit: 9 years ago
Archives Unleashed Toolkit · Scala

An open-source toolkit for analyzing web archives at scale using Apache Spark.

#apache-spark #web-archives #cultural-heritage
Stars: 158
Forks: 34
Last commit: 6 months ago
amazon-kinesis-aggregators · Java

A Java framework for creating real-time time series aggregations from Amazon Kinesis streams.

#time-series-aggregation #real-time-analytics #amazon-kinesis
Stars: 152
Forks: 29
Last commit: 5 years ago
Kyrix · JavaScript

A framework for creating interactive, details-on-demand data visualizations that scale to millions of records with a declarative API.

#pan-zoom #web-embedding #visualization-grammar
Stars: 151
Forks: 25
Last commit: 3 years ago
TeeBI · Pascal

A multi-platform data-mining and visualization library for RAD Studio, supporting in-memory databases, pivot tables, and big data.

#database #complex-structures #pivot-tables
Stars: 149
Forks: 60
Last commit: 2 days ago
binarypig · JavaScript

A scalable malware processing and analytics platform built on Hadoop Pig for binary data extraction and analysis.

#security-analytics #malware-analysis #binary-analysis
Stars: 144
Forks: 42
Last commit: 12 years ago
low-gc-membuffers · Java

A Java library for creating in-memory circular buffers using direct ByteBuffers to minimize garbage collection overhead.

#low-gc #direct-bytebuffer #java-library
Stars: 142
Forks: 16
Last commit: 4 years ago
UCLA: Tools in Data Science (STATS 418) · HTML

Course materials for UCLA's STATS 418 - Tools in Data Science covering R packages, machine learning libraries, databases, and reproducibility tools.

#analytical-databases #data-science #r-programming
Stars: 138
Forks: 63
Last commit: 9 years ago
Apex · Java

Operator and codec library for building real-time streaming applications on Apache Apex.

#apex #java #operator-library
Stars: 135
Forks: 142
Last commit: 6 years ago
Hyperion History API · TypeScript

A scalable full history and state API solution for Antelope (formerly EOSIO) blockchain networks.

#history #api #eos
Stars: 134
Forks: 87
Last commit: 3 days ago
bigmemory · C++

An R package for creating, storing, and manipulating massive matrices using shared memory and memory-mapped files.

#parallel-computing #r-package #shared-memory
Stars: 133
Forks: 25
Last commit: 1 day ago
h-rider · Java

A UI application for viewing and manipulating data stored in Apache HBase distributed databases.

#hbase #java #database-gui
Stars: 133
Forks: 44
Last commit: 9 years ago
mupd8(muppet) · Scala

A MapReduce-style framework for processing fast/streaming data, implementing the MapUpdate model.

#stream-processing #mapreduce #data-framework
Stars: 128
Forks: 35
Last commit: 5 years ago
parquet · Go

A Go library that generates type-safe Parquet readers and writers from Go structs or existing Parquet files.

#parquet #data-serialization #dremel
Stars: 127
Forks: 13
Last commit: 1 year ago
avatica · Go

A Go database/sql driver for Apache Avatica server, enabling Go applications to connect to Phoenix and other Avatica-backed databases.

#database-driver #geospatial #hbase
Stars: 124
Forks: 35
Last commit: 8 days ago
Apache DataFu · Java

A collection of libraries for large-scale data processing in Hadoop ecosystems, including Spark, Pig, and incremental MapReduce.

#apache-spark #mapreduce #user-defined-functions
Stars: 124
Forks: 65
Last commit: 21 days ago
RHive · R

An R extension for distributed computing using Apache Hive, enabling HQL queries in R and R functions in Hive.

#cluster-computing #apache-hive #rserve
Stars: 122
Forks: 62
Last commit: 9 years ago
ddR · R

A unified R API for writing parallel and distributed applications across different backends like parallel, HP Distributed R, and SparkR.

#parallel-computing #high-performance-computing #api
Stars: 119
Forks: 17
Last commit: 8 years ago
spark-connect-rs · Rust

An experimental Rust client for Apache Spark Connect, providing a DataFrame API to interact with Spark clusters.

#spark-connect #apache-spark #spark
Stars: 116
Forks: 24
Last commit: 1 year ago
mpich2-yarn · Java

Run MPI programs on Hadoop YARN clusters using MPICH-3.1.2 and SSH for distributed computing.

#high-performance-computing #mpi #cluster-computing
Stars: 115
Forks: 58
Last commit: 8 years ago
yurita · Scala

An open-source framework for developing large-scale anomaly detection models using Apache Spark.

#statistical-models #apache-spark #security-analytics
Stars: 109
Forks: 30
Last commit: 6 years ago
ganitha · Scala

A Scalding library for machine learning and statistical analysis, featuring Mahout vector integration, K-Means clustering, and Naive-Bayes classifiers.

#statistical-analysis #classification #scalding
Stars: 109
Forks: 12
Last commit: 11 years ago
GaussianMixtures · Julia

A Julia package for efficient large-scale Gaussian Mixture Models with support for diagonal/full covariance, parallel training, and variational Bayes.

#julia #parallel-computing #expectation-maximization
Stars: 107
Forks: 40
Last commit: 5 months ago
TDengineGUI for 2.x & 3.x · JavaScript

A cross-platform desktop GUI for managing and querying TDengine databases.

#iot #desktop-application #data-management
Stars: 92
Forks: 16
Last commit: 3 years ago
Docker for beginners · Jupyter Notebook

A collection of interactive Jupyter notebooks for learning Hadoop, Spark, and MapReduce with hands-on tutorials and demos.

#google-colab #mapreduce-bash #apache-spark
Stars: 84
Forks: 27
Last commit: 1 month ago
php-tdengine · PHP

A PHP client extension for the TDengine big data engine, with Swoole coroutine support.

#database-driver #swoole #tdengine-client
Stars: 78
Forks: 9
Last commit: 3 years ago
akela · Java

Mozilla's utility library for Hadoop, HBase, Pig, and related big data technologies.

#mapreduce #hbase #java
Stars: 77
Forks: 31
Last commit: 12 years ago
Beetest · Java

A simple utility for testing Apache Hive scripts locally without requiring Java development skills.

#unit-testing #apache-hive #data-engineering
Stars: 73
Forks: 23
Last commit: 9 years ago
count-min-log · Go

Go implementation of Count-Min-Log sketch for improved approximate counting of low-frequency events.

#probabilistic-data-structures #stream-processing #go-library
Stars: 70
Forks: 6
Last commit: 1 year ago
Page 6 of 7

Related Tags

#Apache Spark (59)
#Data Processing (58)
#Distributed Computing (50)
#Hadoop (41)
#Spark (40)
#Machine Learning (39)
#Scala (37)
#Distributed Systems (32)
#Data Science (29)
#Data Engineering (29)
#Java (29)
#Stream Processing (27)