Open-Awesome

© 2026 Open-Awesome. Curated for the developer elite.

Big Data

219 projects

Showing 36 of 219 projects

Timely
Language: Java

A secure time series database backed by Apache Accumulo with Grafana integration for data visualization.

#hacktoberfest #secure-database #distributed-storage
Stars: 392
Forks: 110
Last commit: 26 days ago
spatial-framework-for-hadoop
Language: Java

A framework enabling spatial data analysis within Hadoop ecosystems using Hive and SparkSQL.

#geospatial #java #gis
Stars: 376
Forks: 158
Last commit: 13 days ago
amazon-kinesis-client-python
Language: Python

A Python interface to the Amazon Kinesis Client Library for building distributed applications that process streaming data reliably at scale.

#stream-processing #kinesis #python-library
Stars: 376
Forks: 228
Last commit
SparklingPandas
Language: Python

A Python library that provides a Pandas-like API on top of Apache Spark DataFrames for distributed data analysis.

#apache-spark #dataframe #python
Stars: 361
Forks: 79
Last commit: 2 years ago
Conjecture
Language: Java

A framework for building scalable machine learning models in Hadoop using the Scalding DSL.

#recommender-systems #classification #cross-validation
Stars: 360
Forks: 56
Last commit: 8 years ago
Apache Spot (incubating)
Language: Python

Open-source platform for network security analytics using flow and packet analysis to detect unknown threats at cloud scale.

#security-analytics #telemetry #spot
Stars: 356
Forks: 226
Last commit: 3 years ago
Apache Apex
Language: Java

A unified platform for big data stream and batch processing on Hadoop YARN with enterprise-grade operability.

#stream-processing #batch-processing #real-time-analytics
Stars: 350
Forks: 170
Last commit: 5 years ago
Delight
Language: Scala

A free, open-source alternative to Spark UI and Spark History Server with enhanced CPU and memory metrics visualizations.

#apache-spark #spark #delight
Stars: 346
Forks: 58
Last commit: 2 years ago
Hydrosphere Mist
Language: Scala

A serverless proxy for Spark clusters that provides a functional programming framework and deployment model for Spark applications.

#apache-spark #api #spark
Stars: 325
Forks: 70
Last commit: 1 month ago
neo4j-spark-connector
Language: Scala

A bi-directional connector enabling Apache Spark to read from and write to Neo4j graph databases using Spark DataSource APIs.

#hacktoberfest #apache-spark #neo4j-driver
Stars: 319
Forks: 119
Last commit: 3 days ago
JsonSurfer
Language: Java

A streaming JsonPath processor for Java that extracts JSON data without loading entire documents into memory.

#java-library #non-blocking #streaming-json
Stars: 316
Forks: 58
Last commit: 2 years ago
Hivemall
Language: Java

A scalable machine learning library that runs on Apache Hive, Spark, and Pig for distributed ML directly in SQL.

#apache-spark #data-science #apache-hive
Stars: 313
Forks: 111
Last commit: 3 years ago
Spark-MongoDB
Language: Scala

A Spark library for reading and writing data between Spark SQL and MongoDB collections.

#apache-spark #data-integration #dataframe
Stars: 305
Forks: 94
Last commit: 9 years ago
packetpig
Language: Python

An open-source big data security analytics tool that analyzes network packet capture (pcap) files using Apache Pig.

#security-analytics #intrusion-detection #data-visualization
Stars: 298
Forks: 84
Last commit: 8 years ago
Gearpump
Language: Scala

A lightweight real-time big data streaming engine built on Akka for high-throughput, low-latency data processing.

#akka #distributed-systems #low-latency
Stars: 297
Forks: 89
Last commit: 7 years ago
Geni
Language: Clojure

An idiomatic Clojure dataframe library that runs on Apache Spark, providing a seamless interface for data processing and machine learning.

#apache-spark #high-performance-computing #spark
Stars: 295
Forks: 26
Last commit: 2 years ago
OpenMessaging Spec

A vendor-neutral, language-independent specification for building interoperable messaging and streaming applications across heterogeneous systems.

#tracing #iot #push
Stars: 287
Forks: 54
Last commit: 2 years ago
tigon
Language: C++

An open-source real-time stream processing framework combining high-throughput event processing with low-latency SQL-like streaming queries.

#stream-processing #event-processing #real-time-analytics
Stars: 284
Forks: 33
Last commit: 9 years ago
externalsortinginjava
Language: Java

A Java library for sorting very large files using external-memory algorithms and multiple cores.

#multi-core #csv-processing #java-library
Stars: 263
Forks: 103
Last commit: 4 months ago
isolation-forest
Language: Scala

A distributed Spark/Scala implementation of Isolation Forest and Extended Isolation Forest algorithms for scalable unsupervised outlier detection.

#apache-spark #spark #linkedin
Stars: 259
Forks: 54
Last commit: 1 month ago
ferry
Language: Python

Define, run, and deploy big data applications on AWS, OpenStack, and local machines using Docker.

#devops #spark #data-science
Stars: 254
Forks: 25
Last commit: 11 years ago
spark-connect-go
Language: Go

An experimental Go client for Apache Spark Connect, enabling Go applications to interact with Spark clusters via gRPC.

#spark-connect #apache-spark #protocol-buffers
Stars: 253
Forks: 50
Last commit: 24 days ago
Apache Samoa
Language: Java

A distributed streaming machine learning framework for mining big data streams with abstraction over processing engines.

#apache-s4 #real-time-analytics #samoa
Stars: 251
Forks: 102
Last commit: 24 days ago
Gradoop
Language: Java

An open-source research framework for distributed temporal graph analytics built on Apache Flink.

#apache-flink #graph #temporal-graphs
Stars: 251
Forks: 86
Last commit: 4 months ago
Petrel
Language: Python

A Python toolkit for developing, testing, and managing Apache Storm streaming data processing topologies.

#stream-processing #real-time-analytics #distributed-systems
Stars: 247
Forks: 68
Last commit: 3 years ago
Morpheus
Language: Java

A high-performance, type-safe DataFrame library for the JVM enabling large-scale data analysis with parallel processing capabilities.

#scientific-computing #parallel-computing #finance
Stars: 245
Forks: 24
Last commit: 2 years ago
Kafka
Language: Scala

A collection of connectors enabling Apache HBase integration with Kafka, Spark, and other data processing systems.

#database #kafka-connector #data-integration
Stars: 244
Forks: 179
Last commit: 24 days ago
Presto
Language: Java

A high-performance Presto connector for querying HBase with 10-100x faster performance than other open-source alternatives.

#batch-gets #performance-optimization #hbase
Stars: 242
Forks: 102
Last commit: 3 years ago
hdfs-du
Language: JavaScript

Interactive visualization tool for monitoring Hadoop HDFS cluster usage and file storage efficiency.

#d3-js #javascript-infovis-toolkit #storage-optimization
Stars: 228
Forks: 82
Last commit: 5 years ago
ruby-spark
Language: Ruby

A Ruby wrapper for Apache Spark, enabling large-scale data processing with Ruby's expressive syntax.

#rdd #apache-spark #distributed
Stars: 226
Forks: 28
Last commit: 8 years ago
pycascading
Language: Python

A Python wrapper for Cascading that enables building and controlling Hadoop data processing workflows entirely in Python.

#cascading #mapreduce #workflow-engine
Stars: 221
Forks: 35
Last commit: 6 years ago
TDengineGUI
Language: JavaScript

A cross-platform desktop GUI for managing and querying TDengine time-series databases.

#iot #desktop-application #database-gui
Stars: 220
Forks: 79
Last commit: 2 years ago
hadoop-pcap
Language: Java

A Hadoop library for reading and processing packet capture (PCAP) files in MapReduce jobs and Hive queries.

#mapreduce #serde #pcap
Stars: 216
Forks: 101
Last commit: 3 years ago
Crunch
Language: Go

A Go-based toolkit for fast ETL and feature extraction on Hadoop, optimized for rapid development and execution.

#hive #pig #feature-extraction
Stars: 212
Forks: 16
Last commit: 11 years ago
inviso
Language: JavaScript

A lightweight tool for searching Hadoop jobs, visualizing performance, and viewing cluster utilization.

#job-visualization #rest-api #performance-analysis
Stars: 205
Forks: 64
Last commit: 3 years ago
Deep Spark
Language: Java

A thin integration layer connecting Apache Spark with various NoSQL datastores and JDBC databases.

#apache-spark #data-integration #nosql
Stars: 197
Forks: 42
Last commit: 10 years ago
Related Tags

#Apache Spark (59)
#Data Processing (58)
#Distributed Computing (50)
#Hadoop (41)
#Spark (40)
#Machine Learning (39)
#Scala (37)
#Distributed Systems (32)
#Data Science (29)
#Data Engineering (29)
#Java (29)
#Stream Processing (27)