Apache Spark

"Awesome Apache Spark" is a curated list of resources for developers and data engineers working with Apache Spark, an open-source unified analytics engine for large-scale data processing. It covers libraries, frameworks, tutorials, and tools for data processing, machine learning, and stream processing, and is aimed at both beginners and experienced practitioners.

big-data · data-processing · machine-learning · stream-processing · data-engineering · spark-sql · data-analytics

1.9k stars · 345 forks · 0 contributors · Updated

Community-curated · Updated weekly · 100% open source

Found a gem we're missing?

Open-Awesome is built by the community, for the community. Submit a project, suggest an awesome list, or help improve the catalog on GitHub.

Language Bindings

7 projects

Kotlin for Apache Spark

Kotlin bindings and extensions for Apache Spark, enabling idiomatic Kotlin development with data classes, lambdas, and null safety.

#apache-spark #spark #nullability

Stars: 481

Forks: 37

Last commit: 1 month ago

.NET for Apache Spark

.NET for Apache Spark provides high-performance .NET APIs for Apache Spark, enabling C# and F# developers to work with structured and streaming data.

#apache-spark #spark #dataframe

Stars: 2,097

Forks: 333

Last commit: 2 months ago

sparklyr

An R interface for Apache Spark that enables distributed data processing, machine learning, and SQL queries using familiar R syntax.

#apache-spark #distributed #dplyr

Stars: 971

Forks: 308

Last commit: 22 days ago

sparkle

A library for writing Apache Spark applications in Haskell, enabling resilient analytics that scale to thousands of nodes.

#haskell #apache-spark #functional-programming

Stars: 449

Forks: 27

Last commit: 11 months ago

spark-connect-rs

An experimental Rust client for Apache Spark Connect, providing a DataFrame API to interact with Spark clusters.

#spark-connect #apache-spark #spark

Stars: 116

Forks: 24

Last commit: 1 year ago

spark-connect-go

An experimental Go client for Apache Spark Connect, enabling Go applications to interact with Spark clusters via gRPC.

#spark-connect #apache-spark #protocol-buffers

Stars: 253

Forks: 50

Last commit: 2 months ago

spark-connect-csharp

A thin C# gRPC client for communicating with Apache Spark Connect servers, enabling .NET applications to interact with Spark clusters.

#spark-connect #apache-spark #apache

Stars: 2

Forks: 0

Last commit: 2 years ago

Notebooks and IDEs

4 projects

almond

almond.sh

Apache Zeppelin

zeppelin.incubator.apache.org

Polynote

polynote.org

sparkmagic

Jupyter magics and kernels for interactively working with remote Spark clusters via Livy, Lighter, or Ilum.

#apache-spark #spark #notebook

Stars: 1,364

Forks: 443

Last commit: 10 months ago

General Purpose Libraries

5 projects

itachi

#postgres #spark #trino

Stars: 63

Forks: 8

Last commit: 2 years ago

spark-daria

A Scala library providing essential Spark extensions, helper methods, and custom transformations to maximize developer productivity.

#apache-spark #spark-extensions #spark

Stars: 767

Forks: 150

Last commit: 1 month ago

quinn

A PySpark library providing helper methods for DataFrame validation, column transformations, and schema utilities to boost developer productivity.

#dataframe-utilities #apache-spark #spark-extensions

Stars: 687

Forks: 95

Last commit: 1 month ago

Apache DataFu

A collection of libraries for large-scale data processing in Hadoop ecosystems, including Spark, Pig, and incremental MapReduce.

#apache-spark #mapreduce #user-defined-functions

Stars: 124

Forks: 66

Last commit: 15 days ago

Joblib Apache Spark Backend

A joblib backend that enables Python parallel computing tasks to run on Apache Spark clusters.

#apache-spark #parallel-computing #joblib

Stars: 250

Forks: 24

Last commit: 4 months ago

SQL Data Sources

3 projects

Spark XML

A library for parsing and querying XML data with Apache Spark SQL and DataFrames.

#apache-spark #dataframe #xml-parser

Stars: 513

Forks: 223

Last commit: 1 year ago

DataStax Spark Cassandra Connector

A connector that enables Apache Spark to read from and write to Apache Cassandra databases for distributed data processing.

#apache-spark #spark #scala-library

Stars: 1,950

Forks: 930

Last commit: 1 year ago

Mongo-Spark

Official connector for integrating Apache Spark with MongoDB, enabling distributed data processing on MongoDB data.

#apache-spark #connector #spark

Stars: 730

Forks: 320

Last commit: 4 days ago

Storage

4 projects

Delta Lake

An open-source storage framework that enables building a Lakehouse architecture with ACID transactions and scalable metadata handling.

#apache-spark #parquet #data-versioning

Stars: 8,925

Forks: 2,142

Last commit: 23 hours ago

Apache Hudi

An open data lakehouse platform for incremental data processing with upserts, deletes, and time-travel queries.

#apache-flink #upsert-delete #stream-processing

Stars: 6,194

Forks: 2,495

Last commit: 18 hours ago

Apache Iceberg

A high-performance table format for huge analytic datasets, enabling multiple engines to safely work with the same tables simultaneously.

#apache-flink #hacktoberfest #apache-spark

Stars: 9,075

Forks: 3,409

Last commit: 23 hours ago

lakeFS

docs.lakefs.io

Related Awesome Lists

Public Datasets

The "Awesome Public Datasets" project is a curated collection of publicly available datasets across various domains, including government, healthcare, finance, and social sciences. This list features datasets in multiple formats, along with links to tools and platforms that facilitate data analysis and visualization. It is an invaluable resource for researchers, data scientists, and students looking to access high-quality data for their projects or studies. By providing a wide array of datasets, this collection empowers users to explore, analyze, and derive insights from real-world data. Dive in to discover the wealth of information available for your next data-driven endeavor!

73.8k stars

Big Data

The "Awesome Big Data" project is a curated collection of resources focused on big data technologies and practices that enable the processing and analysis of vast amounts of data. This list encompasses a variety of categories, including frameworks, tools, libraries, databases, and tutorials that cater to both beginners and experienced data professionals. Users can explore resources related to data storage, processing, analytics, and visualization, making it an invaluable asset for data scientists, engineers, and researchers. Whether you're looking to enhance your big data skills or find the right tools for your projects, this collection provides a comprehensive guide to navigating the big data landscape.

14.3k stars

Data Engineering

The "Awesome Data Engineering" project is a curated collection of resources aimed at supporting professionals in the field of data engineering, which involves the design and construction of systems for collecting, storing, and analyzing data. This list encompasses a variety of categories, including data pipelines, ETL tools, data warehousing solutions, frameworks, and best practices, as well as tutorials and community resources. Whether you are a beginner looking to understand the fundamentals or an experienced engineer seeking advanced techniques, this list offers valuable insights and tools to enhance your data engineering projects. Dive into this collection to discover the tools and methodologies that can streamline your data workflows and improve your data management capabilities.

8.5k stars

Network Analysis

The "Awesome Network Analysis" project is a curated collection of resources focused on the study and analysis of networks, which are structures made up of interconnected elements. This list encompasses a variety of tools, libraries, datasets, and tutorials that facilitate the exploration of network theory, graph analysis, and visualization techniques. It serves as a valuable resource for researchers, data scientists, and enthusiasts interested in understanding complex systems, social networks, and data relationships. Whether you are a beginner looking to grasp the basics or an experienced analyst seeking advanced methodologies, this collection provides essential tools and insights to enhance your network analysis projects.

4.0k stars
