Big Data

258 projects

Showing 36 of 258 projects

RHadoop

A collection of R packages for interacting with Hadoop ecosystems, enabling big data analysis from R.

#mapreduce #data-science #hbase

Stars: 760

Forks: 275

Last commit: 10 years ago

gohbase (Go)

A pure Go client library for interacting with HBase databases, supporting HBase >= 1.0.

#database-driver #go-library #data-access

Stars: 759

Forks: 224

Last commit: 4 days ago

docker-spark (Shell)

A Docker image for Apache Spark on YARN, built on Hadoop and CentOS for easy deployment.

#apache-spark #containerization #cluster-computing

Stars: 757

Forks: 277

Last commit: 5 years ago

Gearpump (Scala)

A lightweight real-time big data streaming engine built on Akka for high-throughput, low-latency data processing.

#stream-processing #akka #cluster-computing

Stars: 756

Forks: 150

Last commit: 4 days ago

Apache Toree (Scala)

A Jupyter Notebook kernel for interactive data exploration and analysis using Apache Spark with Scala.

#apache-spark #spark-integration #jupyter-kernel

Stars: 751

Forks: 225

Last commit: 7 days ago

TensorFrames (Scala)

TensorFlow binding for Apache Spark DataFrames, enabling TensorFlow program execution on Spark data.

#apache-spark #python #tensorflow

Stars: 744

Forks: 160

Last commit: 2 years ago

Nussknacker (Scala)

A low-code visual tool for domain experts to build, run, and monitor real-time decision algorithms on streaming data.

#apache-flink #stream-processing #touk

Official connector for integrating Apache Spark with MongoDB, enabling distributed data processing on MongoDB data.

#apache-spark #connector #spark

Stars: 730

Forks: 320

Last commit: 4 days ago

PySpark Cheatsheet

A quick reference guide to the most commonly used patterns and functions in PySpark SQL.

#apache-spark #reference-guide #data-science

Stars: 696

Forks: 211

Last commit: 3 years ago

Carefully Curated 70 Spark Questions with Additional Optimization Guides (First in the series)

A comprehensive learning guide and interview refresher for Apache Spark, covering core concepts, architecture, and performance optimization.

#apache-spark #spark #performance-optimization

Stars: 691

Forks: 80

Last commit: 4 years ago

quinn (Python)

A PySpark library providing helper methods for DataFrame validation, column transformations, and schema utilities to boost developer productivity.

#dataframe-utilities #apache-spark #spark-extensions

Stars: 687

Forks: 95

Last commit: 1 month ago

BlinkDB (Scala)

A large-scale data warehouse system that provides approximate query answers with error bounds on massive datasets up to 300x faster than Hive.

#spark #sampling #performance-optimization

Stars: 660

Forks: 121

Last commit: 12 years ago

Intel® oneAPI Data Analytics Library (C++)

A high-performance C++/DPC++ library for accelerated machine learning on CPUs, GPUs, and distributed systems.

#oneapi #hacktoberfest #ai-machine-learning

A command-line tool for launching Apache Spark clusters on AWS EC2 with fast, configurable deployments.

#apache-spark #devops #apache-spark-cluster

Stars: 651

Forks: 120

Last commit: 1 year ago

Peloton (Go)

A unified resource scheduler for co-scheduling batch, stateless, and stateful workloads in a single cluster to maximize resource utilization.

#stateless-workloads #container-orchestration #batch-processing

Stars: 646

Forks: 64

Last commit: 3 years ago

vroom (C++)

The fastest delimited file reader for R, using lazy loading and multi-threading to achieve speeds over 1 GB/sec.

#delimited-files #csv-reader #high-performance

Stars: 642

Forks: 74

Last commit: 1 month ago

SparkR (R)

An R package providing a lightweight frontend to use Apache Spark for distributed data processing from R.

#apache-spark #r-package #data-science

Mirror of Apache Giraph

#java #big-data #giraph

Stars: 620

Forks: 297

Last commit: 3 years ago

emr-bootstrap-actions (Shell)

A collection of sample bootstrap action scripts for configuring applications on Amazon EMR clusters.

#infrastructure-automation #hadoop-ecosystem #cloud-computing

A fully asynchronous, non-blocking, thread-safe, high-performance Java client for HBase.

#java-library #high-performance #non-blocking

Stars: 610

Forks: 301

Last commit: 3 years ago

flambo (Clojure)

A Clojure DSL for Apache Spark that enables distributed data processing using idiomatic Clojure.

#rdd #apache-spark #mapreduce

Stars: 600

Forks: 83

Last commit: 8 years ago

hindex (Java)

A server-side secondary index implementation for Apache HBase 0.94.8 using co-processors to enable efficient indexed queries.

#hbase #java #secondary-index

Stars: 589

Forks: 284

Last commit: 9 years ago

OpenSOC

An open-source security analytics platform that integrates big data technologies for centralized security monitoring, threat detection, and investigation.

#security-analytics #real-time-processing #behavioral-analytics

Stars: 584

Forks: 187

Last commit: 6 years ago

xgboost (C++)

An optimized distributed gradient boosting library for fast and accurate machine learning on large datasets.

#parallel-computing #gbdt #ml-library

A high-performance, disk-backed queue library using memory-mapped files for fast, persistent, and thread-safe data processing.

#java-library #message-queue #persistent-queue

Stars: 567

Forks: 217

Last commit: 4 years ago

PigPen (Clojure)

A Clojure library for writing map-reduce queries that compile to Apache Pig or Cascading, enabling distributed data processing with Clojure syntax.

#cascading #clojure #big-data

Stars: 564

Forks: 52

Last commit: 3 years ago

Eventsim (Scala)

A Scala-based event data simulator that generates realistic web traffic for a fake music streaming service.

#stream-processing #performance-testing #fake-data

Stars: 547

Forks: 142

Last commit: 5 months ago

Spark (Scala)

A library enabling Apache Spark to read from and write to Apache HBase tables as external data sources using DataFrames and SQL.

#apache-spark #data-integration #dataframe

Stars: 546

Forks: 273

Last commit: 5 years ago

Streamiz (C#)

A .NET stream processing library for Apache Kafka, providing a Kafka Streams-like API for building real-time applications.

#stream-processing #event-driven #kafka-streams-dotnet

A collection of GIS tools for spatial analysis of big data using Hadoop, integrating with ArcGIS Geoprocessing.

#arcgis #geospatial #apache-hive

Stars: 524

Forks: 251

Last commit: 4 years ago

Spark XML (Scala)

A library for parsing and querying XML data with Apache Spark SQL and DataFrames.

#apache-spark #dataframe #xml-parser

Stars: 513

Forks: 223

Last commit: 1 year ago

streamDM (Scala)

A Spark Streaming library for mining big data streams with incremental learning algorithms.

#classification #stream-mining #data-streams

Stars: 497

Forks: 141

Last commit: 3 years ago

Kotlin for Apache Spark (Kotlin)

Kotlin bindings and extensions for Apache Spark, enabling idiomatic Kotlin development with data classes, lambdas, and null safety.

#apache-spark #spark #nullability

Stars: 481

Forks: 37

Last commit: 1 month ago

CogCompNLP (Java)

A comprehensive suite of Java NLP libraries and tools for text annotation, feature extraction, and language processing tasks.

#part-of-speech-tagging #cogcomp #java-library

Stars: 479

Forks: 143

Last commit: 3 years ago

Lipstick (JavaScript)

A visualization framework for Apache Pig workflows that combines graphical depictions with real-time execution information.

#hadoop-ecosystem #data-engineering #big-data

Stars: 466

Forks: 133

Last commit: 3 years ago

spark-fast-tests (Scala)

A fast Apache Spark testing helper library with beautifully formatted error messages for Scala applications.

#apache-spark #spark #unit-testing

Stars: 457

Forks: 77

Last commit: 3 months ago
