Big Data

258 projects

Showing 36 of 258 projects

A library for writing Apache Spark applications in Haskell, enabling resilient analytics that scale to thousands of nodes.

#haskell #apache-spark #functional-programming

Stars: 449

Forks: 27

Last commit: 11 months ago

sparkling · Clojure

A fast, fully-featured, and developer-friendly Clojure API for Apache Spark.

#apache-spark #functional-programming #data-engineering

Stars: 447

Forks: 68

Last commit: 4 years ago

DataStax PHP Driver · C

A modern, feature-rich PHP client library for Apache Cassandra using Cassandra's binary protocol and CQL v3.

#database-driver #nosql #async-io

Stars: 436

Forks: 151

Last commit: 2 years ago

amazon-kinesis-producer · C++

A Java library for building efficient and reliable producer applications for Amazon Kinesis Data Streams.

#java-library #real-time-processing #producer

A secure time series database backed by Apache Accumulo with Grafana integration for data visualization.

#hacktoberfest #secure-database #distributed-storage

Stars: 394

Forks: 110

Last commit: 2 months ago

amazon-kinesis-client-python · Python

A Python interface to the Amazon Kinesis Client Library for building distributed applications that process streaming data reliably at scale.

#stream-processing #kinesis #python-library

Stars: 376

Forks: 228

Last commit

spatial-framework-for-hadoop · Java

A framework enabling spatial data analysis within Hadoop ecosystems using Hive and SparkSQL.

#geospatial #java #gis

Stars: 376

Forks: 158

Last commit: 14 days ago

SparklingPandas · Python

A Python library that provides a Pandas-like API on top of Apache Spark DataFrames for distributed data analysis.

#apache-spark #dataframe #python

Stars: 361

Forks: 79

Last commit: 3 years ago

Conjecture · Java

A framework for building scalable machine learning models in Hadoop using the Scalding DSL.

#recommender-systems #classification #cross-validation

Stars: 359

Forks: 56

Last commit: 8 years ago

Apache Spot (incubating) · Python

Open-source platform for network security analytics using flow and packet analysis to detect unknown threats at cloud scale.

#security-analytics #telemetry #spot

Stars: 356

Forks: 226

Last commit: 3 years ago

Apache Apex · Java

A unified platform for big data stream and batch processing on Hadoop YARN with enterprise-grade operability.

#stream-processing #batch-processing #real-time-analytics

Stars: 350

Forks: 170

Last commit: 5 years ago

Delight · Scala

A free, open-source alternative to Spark UI and Spark History Server with enhanced CPU and memory metrics visualizations.

#apache-spark #spark #delight

Stars: 345

Forks: 58

Last commit: 2 years ago

Hydrosphere Mist · Scala

A serverless proxy for Spark clusters that provides a functional programming framework and deployment model for Spark applications.

#apache-spark #api #spark

Stars: 325

Forks: 69

Last commit: 3 months ago

neo4j-spark-connector · Scala

A bi-directional connector enabling Apache Spark to read from and write to Neo4j graph databases using Spark DataSource APIs.

#hacktoberfest #apache-spark #neo4j-driver

A streaming JsonPath processor for Java that extracts JSON data without loading entire documents into memory.

#java-library #non-blocking #streaming-json

Stars: 317

Forks: 58

Last commit: 2 years ago

Hivemall · Java

A scalable machine learning library that runs on Apache Hive, Spark, and Pig for distributed ML directly in SQL.

#apache-spark #data-science #apache-hive

Stars: 313

Forks: 111

Last commit: 3 years ago

Spark-MongoDB · Scala

A Spark library for reading and writing data between Spark SQL and MongoDB collections.

#apache-spark #data-integration #dataframe

Stars: 306

Forks: 94

Last commit: 10 years ago

packetpig · Python

An open-source big data security analytics tool that analyzes network packet capture (pcap) files using Apache Pig.

#security-analytics #intrusion-detection #data-visualization

Stars: 298

Forks: 84

Last commit: 8 years ago

Gearpump · Scala

A lightweight real-time big data streaming engine built on Akka for high-throughput, low-latency data processing.

#akka #distributed-systems #low-latency

Stars: 297

Forks: 89

Last commit: 8 years ago

Geni · Clojure

An idiomatic Clojure dataframe library that runs on Apache Spark, providing a seamless interface for data processing and machine learning.

#apache-spark #high-performance-computing #spark

Stars: 294

Forks: 26

Last commit: 2 years ago

OpenMessaging Spec

A vendor-neutral, language-independent specification for building interoperable messaging and streaming applications across heterogeneous systems.

#tracing #iot #push

Stars: 287

Forks: 54

Last commit: 3 years ago

tigon · C++

An open-source real-time stream processing framework combining high-throughput event processing with low-latency SQL-like streaming queries.

#stream-processing #event-processing #real-time-analytics

Stars: 284

Forks: 33

Last commit: 9 years ago

externalsortinginjava · Java

A Java library for sorting very large files using external-memory algorithms and multiple cores.

#multi-core #csv-processing #java-library

Stars: 262

Forks: 103

Last commit: 5 months ago

isolation-forest · Scala

A distributed Spark/Scala implementation of Isolation Forest and Extended Isolation Forest algorithms for scalable unsupervised outlier detection.

#apache-spark #spark #linkedin

Stars: 260

Forks: 54

Last commit: 1 month ago

ferry · Python

Define, run, and deploy big data applications on AWS, OpenStack, and local machines using Docker.

#devops #spark #data-science

Stars: 254

Forks: 25

Last commit: 11 years ago

spark-connect-go · Go

An experimental Go client for Apache Spark Connect, enabling Go applications to interact with Spark clusters via gRPC.

#spark-connect #apache-spark #protocol-buffers

Stars: 253

Forks: 50

Last commit: 2 months ago

Gradoop · Java

An open-source research framework for distributed temporal graph analytics built on Apache Flink.

#apache-flink #graph #temporal-graphs

Stars: 251

Forks: 85

Last commit: 6 months ago

Apache Samoa · Java

A distributed streaming machine learning framework for mining big data streams with abstraction over processing engines.

#apache-s4 #real-time-analytics #samoa

Stars: 251

Forks: 102

Last commit: 2 months ago

Petrel · Python

A Python toolkit for developing, testing, and managing Apache Storm streaming data processing topologies.

#stream-processing #real-time-analytics #distributed-systems

Stars: 247

Forks: 68

Last commit: 3 years ago

Kafka · Scala

A collection of connectors enabling Apache HBase integration with Kafka, Spark, and other data processing systems.

#database #kafka-connector #data-integration

Stars: 246

Forks: 179

Last commit: 11 days ago

Morpheus · Java

A high-performance, type-safe DataFrame library for the JVM enabling large-scale data analysis with parallel processing capabilities.

#scientific-computing #parallel-computing #finance

Stars: 245

Forks: 24

Last commit: 2 years ago

Presto · Java

A high-performance Presto connector for querying HBase with 10-100x faster performance than other open-source alternatives.

#batch-gets #performance-optimization #hbase

Stars: 242

Forks: 102

Last commit: 3 years ago

hdfs-du · JavaScript

Interactive visualization tool for monitoring Hadoop HDFS cluster usage and file storage efficiency.

#d3-js #javascript-infovis-toolkit #storage-optimization

Stars: 228

Forks: 82

Last commit: 5 years ago

ruby-spark · Ruby

A Ruby wrapper for Apache Spark, enabling large-scale data processing with Ruby's expressive syntax.

#rdd #apache-spark #distributed

Stars: 226

Forks: 28

Last commit: 9 years ago

pycascading · Python

A Python wrapper for Cascading that enables building and controlling Hadoop data processing workflows entirely in Python.

#cascading #mapreduce #workflow-engine

Stars: 220

Forks: 35

Last commit: 6 years ago

TDengineGUI · JavaScript

A cross-platform desktop GUI for managing and querying TDengine time-series databases.

#iot #desktop-application #database-gui

Stars: 220

Forks: 78

Last commit: 27 days ago
