A Docker image for Apache Spark on YARN, built on Hadoop and CentOS for easy deployment.
A Jupyter Notebook kernel for interactive data exploration and analysis using Apache Spark with Scala.
TensorFlow binding for Apache Spark DataFrames, enabling TensorFlow program execution on Spark data.
A low-code visual tool for domain experts to build, run, and monitor real-time decision algorithms on streaming data.
Official connector for integrating Apache Spark with MongoDB, enabling distributed data processing on MongoDB data.
A comprehensive learning guide and interview refresher for Apache Spark, covering core concepts, architecture, and performance optimization.
A quick reference guide to the most commonly used patterns and functions in PySpark SQL.
A PySpark library providing helper methods for DataFrame validation, column transformations, and schema utilities to boost developer productivity.
A large-scale data warehouse system that returns approximate query answers with error bounds on massive datasets, up to 300x faster than Hive.
A command-line tool for launching Apache Spark clusters on AWS EC2 with fast, configurable deployments.
A high-performance C++/DPC++ library for accelerated machine learning on CPUs, GPUs, and distributed systems.
A unified resource scheduler for co-scheduling batch, stateless, and stateful workloads in a single cluster to maximize resource utilization.
The fastest delimited file reader for R, using lazy loading and multi-threading to achieve speeds over 1 GB/sec.
An R package providing a lightweight frontend to use Apache Spark for distributed data processing from R.
A collection of sample bootstrap action scripts for configuring applications on Amazon EMR clusters.
A fully asynchronous, non-blocking, thread-safe, high-performance Java client for HBase.
A Clojure DSL for Apache Spark that enables distributed data processing using idiomatic Clojure.
A server-side secondary index implementation for Apache HBase 0.94.8 that uses coprocessors to enable efficient indexed queries.
An open-source security analytics platform that integrates big data technologies for centralized security monitoring, threat detection, and investigation.
An optimized distributed gradient boosting library for fast and accurate machine learning on large datasets.
A high-performance, disk-backed queue library using memory-mapped files for fast, persistent, and thread-safe data processing.
A Clojure library for writing map-reduce queries that compile to Apache Pig or Cascading, enabling distributed data processing with Clojure syntax.
A library enabling Apache Spark to read from and write to Apache HBase tables as external data sources using DataFrames and SQL.
A Scala-based event data simulator that generates realistic web traffic for a fake music streaming service.
A .NET stream processing library for Apache Kafka, providing a Kafka Streams-like API for building real-time applications.
A collection of GIS tools for spatial analysis of big data using Hadoop, integrating with ArcGIS Geoprocessing.
A library for parsing and querying XML data with Apache Spark SQL and DataFrames.
A Spark Streaming library for mining big data streams with incremental learning algorithms.
Kotlin bindings and extensions for Apache Spark, enabling idiomatic Kotlin development with data classes, lambdas, and null safety.
A comprehensive suite of Java NLP libraries and tools for text annotation, feature extraction, and language processing tasks.
A visualization framework for Apache Pig workflows that combines graphical depictions with real-time execution information.
A fast Apache Spark testing helper library with beautifully formatted error messages for Scala applications.
A library for writing Apache Spark applications in Haskell, enabling resilient analytics that scale to thousands of nodes.
A fast, fully-featured, and developer-friendly Clojure API for Apache Spark.
A modern, feature-rich PHP client library for Apache Cassandra using Cassandra's binary protocol and CQL v3.
A Java library for building efficient and reliable producer applications for Amazon Kinesis Data Streams.