Apache Spark

93 projects


A Scala library providing essential Spark extensions, helper methods, and custom transformations to maximize developer productivity.

#apache-spark #spark-extensions #spark

Stars: 767

Forks: 150

Last commit: 1 month ago

docker-spark (Shell)

A Docker image for Apache Spark on YARN, built on Hadoop and CentOS for easy deployment.

#apache-spark #containerization #cluster-computing

Stars: 757

Forks: 277

Last commit: 5 years ago

Apache Toree (Scala)

A Jupyter Notebook kernel for interactive data exploration and analysis using Apache Spark with Scala.

#apache-spark #spark-integration #jupyter-kernel

Stars: 751

Forks: 225

Last commit: 6 days ago

TensorFrames (Scala)

A TensorFlow binding for Apache Spark DataFrames, enabling TensorFlow program execution on Spark data.

#apache-spark #python #tensorflow

Stars: 744

Forks: 160

Last commit: 2 years ago

Mongo-Spark (Java)

The official connector for integrating Apache Spark with MongoDB, enabling distributed data processing on MongoDB data.

#apache-spark #connector #spark

Stars: 730

Forks: 320

Last commit: 3 days ago

PySpark Cheatsheet

A quick reference guide to the most commonly used patterns and functions in PySpark SQL.

#apache-spark #reference-guide #data-science

Stars: 696

Forks: 211

Last commit: 3 years ago

Carefully Curated 70 Spark Questions with Additional Optimization Guides (First in the series)

A comprehensive learning guide and interview refresher for Apache Spark, covering core concepts, architecture, and performance optimization.

#apache-spark #spark #performance-optimization

Stars: 691

Forks: 80

Last commit: 4 years ago

quinn (Python)

A PySpark library providing helper methods for DataFrame validation, column transformations, and schema utilities to boost developer productivity.

#dataframe-utilities #apache-spark #spark-extensions

Stars: 687

Forks: 95

Last commit: 1 month ago

datacompy (Python)

A Python library for comparing Pandas, Polars, Spark, and Snowpark DataFrames with detailed reporting and flexible matching.

#apache-spark #fugue #spark

Stars: 654

Forks: 162

Last commit: 2 days ago

Flintrock (Python)

A command-line tool for launching Apache Spark clusters on AWS EC2 with fast, configurable deployments.

#apache-spark #devops #apache-spark-cluster

Stars: 651

Forks: 120

Last commit: 1 year ago

SparkR (R)

An R package providing a lightweight frontend to use Apache Spark for distributed data processing from R.

#apache-spark #r-package #data-science

A Clojure DSL for Apache Spark that enables distributed data processing using idiomatic Clojure.

#rdd #apache-spark #mapreduce

Stars: 600

Forks: 83

Last commit: 8 years ago

Spark (Scala)

A library enabling Apache Spark to read from and write to Apache HBase tables as external data sources using DataFrames and SQL.

#apache-spark #data-integration #dataframe

Stars: 546

Forks: 273

Last commit: 5 years ago

Spark XML (Scala)

A library for parsing and querying XML data with Apache Spark SQL and DataFrames.

#apache-spark #dataframe #xml-parser

Stars: 513

Forks: 223

Last commit: 1 year ago

Kotlin for Apache Spark (Kotlin)

Kotlin bindings and extensions for Apache Spark, enabling idiomatic Kotlin development with data classes, lambdas, and null safety.

#apache-spark #spark #nullability

Stars: 481

Forks: 37

Last commit: 1 month ago

spark-fast-tests (Scala)

A fast Apache Spark testing helper library with beautifully formatted error messages for Scala applications.

#apache-spark #spark #unit-testing

Stars: 457

Forks: 77

Last commit: 3 months ago

sparkle (Haskell)

A library for writing Apache Spark applications in Haskell, enabling resilient analytics that scale to thousands of nodes.

#haskell #apache-spark #functional-programming

Stars: 449

Forks: 27

Last commit: 11 months ago

sparkling (Clojure)

A fast, fully featured, and developer-friendly Clojure API for Apache Spark.

#apache-spark #functional-programming #data-engineering

Stars: 447

Forks: 68

Last commit: 4 years ago

SparklingPandas (Python)

A Python library that provides a Pandas-like API on top of Apache Spark DataFrames for distributed data analysis.

#apache-spark #dataframe #python

Stars: 361

Forks: 79

Last commit: 3 years ago

Delight (Scala)

A free, open-source alternative to Spark UI and Spark History Server with enhanced CPU and memory metrics visualizations.

#apache-spark #spark #delight

Stars: 345

Forks: 58

Last commit: 2 years ago

Hydrosphere Mist (Scala)

A serverless proxy for Spark clusters that provides a functional programming framework and deployment model for Spark applications.

#apache-spark #api #spark

Stars: 325

Forks: 69

Last commit: 3 months ago

neo4j-spark-connector (Scala)

A bi-directional connector enabling Apache Spark to read from and write to Neo4j graph databases using Spark DataSource APIs.

#hacktoberfest #apache-spark #neo4j-driver

Stars: 322

Forks: 119

Last commit: 20 hours ago

Hivemall (Java)

A scalable machine learning library that runs on Apache Hive, Spark, and Pig for distributed ML directly in SQL.

#apache-spark #data-science #apache-hive

Stars: 313

Forks: 111

Last commit: 3 years ago

Spark-MongoDB (Scala)

A Spark library for reading and writing data between Spark SQL and MongoDB collections.

#apache-spark #data-integration #dataframe

Stars: 306

Forks: 94

Last commit: 10 years ago

Geni (Clojure)

An idiomatic Clojure dataframe library that runs on Apache Spark, providing a seamless interface for data processing and machine learning.

#apache-spark #high-performance-computing #spark

Stars: 294

Forks: 26

Last commit: 2 years ago

pysparkling (Python)

A pure Python implementation of Apache Spark's RDD and DStream interfaces.

#apache-spark #data-science #python

Stars: 270

Forks: 45

Last commit: 1 year ago

isolation-forest (Scala)

A distributed Spark/Scala implementation of Isolation Forest and Extended Isolation Forest algorithms for scalable unsupervised outlier detection.

#apache-spark #spark #linkedin

Stars: 260

Forks: 54

Last commit: 1 month ago

spark-connect-go (Go)

An experimental Go client for Apache Spark Connect, enabling Go applications to interact with Spark clusters via gRPC.

#spark-connect #apache-spark #protocol-buffers

Stars: 253

Forks: 50

Last commit: 2 months ago

Joblib Apache Spark Backend (Python)

A joblib backend that enables Python parallel computing tasks to run on Apache Spark clusters.

#apache-spark #parallel-computing #joblib

Stars: 250

Forks: 24

Last commit: 4 months ago

ruby-spark (Ruby)

A Ruby wrapper for Apache Spark, enabling large-scale data processing with Ruby's expressive syntax.

#rdd #apache-spark #distributed

Stars: 226

Forks: 28

Last commit: 9 years ago

DynaML (Scala)

A Scala and JVM machine learning toolbox for research, education, and industry with an interactive REPL and end-to-end pipelines.

#research-tool #apache-spark #scala-library

Stars: 202

Forks: 45

Last commit: 3 years ago

Deep Spark (Java)

A thin integration layer connecting Apache Spark with various NoSQL datastores and JDBC databases.

#apache-spark #data-integration #nosql

Stars: 197

Forks: 43

Last commit: 10 years ago

LiFT (Scala)

A Scala/Spark library for measuring fairness and mitigating bias in large-scale machine learning workflows.

#fairness-ml #apache-spark #spark

Stars: 173

Forks: 22

Last commit: 7 months ago

Crossdata (Scala)

A distributed framework extending Apache Spark with unified SQL access to multiple datastores, optimized connectors, and streaming support.

#apache-spark #scala-library #data-integration

Stars: 169

Forks: 51

Last commit: 6 years ago

ArchiveSpark (Scala)

An Apache Spark framework for efficient data processing, extraction, and derivation from web archives and archival collections.

#data-lineage #apache-spark #web-archives

Stars: 161

Forks: 19

Last commit: 9 months ago

Archives Unleashed Toolkit (Scala)

An open-source toolkit for analyzing web archives at scale using Apache Spark.

#apache-spark #web-archives #cultural-heritage

Stars: 158

Forks: 33

Last commit: 7 months ago
