Generate comprehensive data quality profiling and exploratory data analysis reports for Pandas and Spark DataFrames with a single line of code.
A portable Python dataframe library that compiles to SQL and works with over 20 backends for unified data manipulation.
An open-source library for building massively scalable machine learning pipelines on Apache Spark.
A flexible and expressive API for performing statistical data validation on dataframe-like objects.
A state-of-the-art Natural Language Processing library built on Apache Spark, offering 100,000+ pretrained models and pipelines in 200+ languages.
A curated list of awesome Apache Spark packages, libraries, and resources for data engineers and scientists.
A Python library for agile data preparation workflows that works with Pandas, Dask, cuDF, Dask-cuDF, Vaex, and PySpark.
Jupyter magics and kernels for interactively working with remote Spark clusters via Livy, Lighter, or Ilum.
A PySpark testing library providing fast helper methods with descriptive, color-coded error messages for DataFrame and column comparisons.
A comprehensive learning guide and interview refresher for Apache Spark, covering core concepts, architecture, and performance optimization.
A quick reference guide to the most commonly used patterns and functions in PySpark SQL.
A PySpark library providing helper methods for DataFrame validation, column transformations, and schema utilities to boost developer productivity.
A Python library for comparing Pandas, Polars, Spark, and Snowpark DataFrames with detailed reporting and flexible matching.
A Python library that provides a Pandas-like API on top of Apache Spark DataFrames for distributed data analysis.
A joblib backend that enables Python parallel computing tasks to run on Apache Spark clusters.
An open-source toolkit for analyzing web archives at scale using Apache Spark.
A Docker container providing a complete streaming environment for experimenting with Kafka, Spark Streaming, and Cassandra.
A collection of interactive Jupyter notebooks for learning Hadoop, Spark, and MapReduce with hands-on tutorials and demos.