A Docker image for Apache Spark on YARN, built on Hadoop and CentOS for easy deployment.
A Jupyter Notebook kernel for interactive data exploration and analysis using Apache Spark with Scala.
A TensorFlow binding for Apache Spark DataFrames that lets TensorFlow programs run on Spark data.
The official connector for integrating Apache Spark with MongoDB, enabling distributed data processing on MongoDB data.
A comprehensive learning guide and interview refresher for Apache Spark, covering core concepts, architecture, and performance optimization.
A PySpark library providing helper methods for DataFrame validation, column transformations, and schema utilities to boost developer productivity.
A quick reference guide to the most commonly used patterns and functions in PySpark SQL.
A command-line tool for launching Apache Spark clusters on AWS EC2 with fast, configurable deployments.
A Python library for comparing Pandas, Polars, Spark, and Snowpark DataFrames with detailed reporting and flexible matching.
An R package providing a lightweight frontend to use Apache Spark for distributed data processing from R.
A Clojure DSL for Apache Spark that enables distributed data processing using idiomatic Clojure.
A library enabling Apache Spark to read from and write to Apache HBase tables as external data sources using DataFrames and SQL.
A library for parsing and querying XML data with Apache Spark SQL and DataFrames.
Kotlin bindings and extensions for Apache Spark, enabling idiomatic Kotlin development with data classes, lambdas, and null safety.
A fast Apache Spark testing helper library with beautifully formatted error messages for Scala applications.
A library for writing Apache Spark applications in Haskell, enabling resilient analytics that scale to thousands of nodes.
A fast, fully-featured, and developer-friendly Clojure API for Apache Spark.
A Python library that provides a Pandas-like API on top of Apache Spark DataFrames for distributed data analysis.
A free, open-source alternative to Spark UI and Spark History Server with enhanced CPU and memory metrics visualizations.
A serverless proxy for Spark clusters that provides a functional programming framework and deployment model for Spark applications.
A bi-directional connector enabling Apache Spark to read from and write to Neo4j graph databases using Spark DataSource APIs.
A scalable machine learning library that runs on Apache Hive, Spark, and Pig for distributed ML directly in SQL.
A Spark library for reading and writing data between Spark SQL and MongoDB collections.
An idiomatic Clojure dataframe library that runs on Apache Spark, providing a seamless interface for data processing and machine learning.
A distributed Spark/Scala implementation of Isolation Forest and Extended Isolation Forest algorithms for scalable unsupervised outlier detection.
An experimental Go client for Apache Spark Connect, enabling Go applications to interact with Spark clusters via gRPC.
A joblib backend that enables Python parallel computing tasks to run on Apache Spark clusters.
A Ruby wrapper for Apache Spark, enabling large-scale data processing with Ruby's expressive syntax.
A Scala and JVM machine learning toolbox for research, education, and industry with an interactive REPL and end-to-end pipelines.
A thin integration layer connecting Apache Spark with various NoSQL datastores and JDBC databases.
A Scala/Spark library for measuring fairness and mitigating bias in large-scale machine learning workflows.
A distributed framework extending Apache Spark with unified SQL access to multiple datastores, optimized connectors, and streaming support.
An Apache Spark framework for efficient data processing, extraction, and derivation from web archives and archival collections.
An open-source toolkit for analyzing web archives at scale using Apache Spark.
A Spark library for reading from and writing to Google BigQuery using DataFrames and SQL.
A collection of libraries for large-scale data processing in Hadoop ecosystems, including Spark, Pig, and incremental MapReduce.