PySpark

21 projects

Generate comprehensive data quality profiling and exploratory data analysis reports for Pandas and Spark DataFrames with a single line of code.

#python-library #pandas-profiling #data-science

A portable Python dataframe library that compiles to SQL and works with over 20 backends for unified data manipulation.

#database #python-dataframe #sql-compilation

Stars: 6.6k · Forks: 745 · Last commit: 18 hours ago

Microsoft ML for Apache Spark (Scala)

An open-source library for building massively scalable machine learning pipelines on Apache Spark.

#apache-spark #microsoft #spark

Stars: 5.2k · Forks: 863 · Last commit: 17 days ago

pandera (Python)

A flexible and expressive API for performing statistical data validation on dataframe-like objects.

#data-cleaning #pandas-validation #python-library

Stars: 4.4k · Forks: 421 · Last commit: 6 days ago

spark-nlp (Scala)

A state-of-the-art Natural Language Processing library built on Apache Spark, offering 100,000+ pretrained models and pipelines in 200+ languages.

#apache-spark #spark #transformer-models

Stars: 4.1k · Forks: 743 · Last commit: 2 days ago

Apache Spark (Shell)

A curated list of awesome Apache Spark packages, libraries, and resources for data engineers and scientists.

#apache-spark #data-science #spark-ecosystem

Stars: 1.9k · Forks: 346 · Last commit: 4 months ago

Optimus (Python)

A Python library for agile data preparation workflows that works with Pandas, Dask, cuDF, Dask-cuDF, Vaex, and PySpark.

#data-cleaning #cudf #spark

Stars: 1.5k · Forks: 232 · Last commit: 1 year ago

sparkmagic (Python)

Jupyter magics and kernels for interactively working with remote Spark clusters via Livy, Lighter, or Ilum.

#apache-spark #spark #notebook

Stars: 1.4k · Forks: 443 · Last commit: 10 months ago

Sparkling Water (Scala)

A library that provides H2O functionality inside a Spark cluster.

#h2o #spark #pysparkling

Stars: 979 · Forks: 361 · Last commit: 8 months ago

chispa (Python)

A PySpark testing library providing fast helper methods with descriptive, color-coded error messages for DataFrame and column comparisons.

#apache-spark #unit-testing #dataframe

Stars: 771 · Forks: 80 · Last commit: 12 days ago

PySpark Cheatsheet

A quick reference guide to the most commonly used patterns and functions in PySpark SQL.

#apache-spark #reference-guide #data-science

Stars: 696 · Forks: 211 · Last commit: 3 years ago

Carefully Curated 70 Spark Questions with Additional Optimization Guides (First in the series)

A comprehensive learning guide and interview refresher for Apache Spark, covering core concepts, architecture, and performance optimization.

#apache-spark #spark #performance-optimization

Stars: 691 · Forks: 80 · Last commit: 4 years ago

quinn (Python)

A PySpark library providing helper methods for DataFrame validation, column transformations, and schema utilities to boost developer productivity.

#dataframe-utilities #apache-spark #spark-extensions

Stars: 687 · Forks: 95 · Last commit: 1 month ago

datacompy (Python)

A Python library for comparing Pandas, Polars, Spark, and Snowpark DataFrames with detailed reporting and flexible matching.

#apache-spark #fugue #spark

Stars: 654 · Forks: 162 · Last commit: 3 days ago

SparklingPandas (Python)

A Python library that provides a Pandas-like API on top of Apache Spark DataFrames for distributed data analysis.

#apache-spark #dataframe #python

Stars: 361 · Forks: 79 · Last commit: 3 years ago

Joblib Apache Spark Backend (Python)

A joblib backend that enables Python parallel computing tasks to run on Apache Spark clusters.

#apache-spark #parallel-computing #joblib

Stars: 250 · Forks: 24 · Last commit: 4 months ago

Archives Unleashed Toolkit (Scala)

An open-source toolkit for analyzing web archives at scale using Apache Spark.

#apache-spark #web-archives #cultural-heritage

Stars: 158 · Forks: 33 · Last commit: 7 months ago

kafka-sparkstreaming-cassandra (Jupyter Notebook)

A Docker container providing a complete streaming environment for experimenting with Kafka, Spark Streaming, and Cassandra.

#apache-spark #experimentation #real-time-processing

Stars: 96 · Forks: 59

Last commit

Docker for beginners (Jupyter Notebook)

A collection of interactive Jupyter notebooks for learning Hadoop, Spark, and MapReduce with hands-on tutorials and demos.

#google-colab #mapreduce-bash #apache-spark

Stars: 87 · Forks: 27 · Last commit: 2 months ago

sparkly (Python)

Helpers & syntactic sugar for PySpark.

#spark #python #pyspark

Stars: 62 · Forks: 9 · Last commit: 7 months ago

Tweet Archives Unleashed Toolkit (Scala)

An open-source toolkit for analyzing line-oriented JSON Twitter archives using Apache Spark.

#twitter-analysis #apache-spark #spark

Stars: 10 · Forks: 2 · Last commit: 4 months ago

