Spark

88 projects

A comprehensive JVM-based deep learning ecosystem for building, training, and deploying models with support for model import and distributed training.

#distributed-training #intellij #spark-integration

Stars: 14.2k

Forks: 3.8k

Last commit

Pandas Profiling (Python)

Generate comprehensive data quality profiles and exploratory data analysis reports for Pandas and Spark DataFrames with a single line of code.

#spark #python-library #pandas-profiling

Stars: 13.7k

Forks: 1.8k

Last commit: 3 months ago

PredictionIO (Scala)

An open source machine learning server for developers and data scientists, supporting event collection, algorithm deployment, and REST API queries.

#event-collection #spark #hbase

Stars: 12.5k

Forks: 1.9k

Last commit: 5 years ago

Delta Lake (Scala)

An open-source storage framework that enables building a Lakehouse architecture with ACID transactions and scalable metadata handling.

#apache-spark #parquet #data-versioning

Stars: 8.9k

Forks: 2.1k

Last commit: 10 hours ago

Alluxio (Java)

A distributed caching platform that bridges computation frameworks and storage systems for large-scale analytics and ML workloads.

#data-orchestration #spark #memory-speed

Stars: 7.2k

Forks: 2.9k

Last commit: 1 year ago

dev-setup (Python)

Automated scripts and instructions for setting up a comprehensive macOS development environment with tools for Python, web, data, and cloud development.

#developer-tools #spark #automation-scripts

Stars: 6.3k

Forks: 1.1k

Last commit: 3 years ago

Microsoft ML for Apache Spark (Scala)

An open-source library for building massively scalable machine learning pipelines on Apache Spark.

#apache-spark #microsoft #spark

Stars: 5.2k

Forks: 863

Last commit: 17 days ago

spark-nlp (Scala)

A state-of-the-art Natural Language Processing library built on Apache Spark, offering 100,000+ pretrained models and pipelines in 200+ languages.

#apache-spark #spark #transformer-models

A compressed bitmap data structure for Java that outperforms alternatives like WAH, EWAH, and Concise in speed and compression.

#apache-spark #java-library #compressed-bitmap

Stars: 3.9k

Forks: 594

Last commit: 15 hours ago

TensorFlowOnSpark (Python)

Enables distributed TensorFlow training and inference on Apache Spark and Hadoop clusters with minimal code changes.

#apache-spark #yahoo #model-training

Stars: 3.8k

Forks: 939

Last commit: 3 years ago

deequ (Scala)

A library built on Apache Spark for defining unit tests to measure data quality in large datasets.

#data-testing #apache-spark #spark

Stars: 3.6k

Forks: 584

Last commit: 2 days ago

Koalas (Python)

Koalas provides the pandas DataFrame API on Apache Spark, enabling data scientists to work with big data using familiar pandas syntax.

#apache-spark #spark #mlflow

Stars: 3.4k

Forks: 369

Last commit: 2 years ago

spark-jobserver (Scala)

A RESTful job server for Apache Spark that provides a service interface for submitting and managing Spark jobs, jars, and contexts.

#apache-spark #spark #rest-api

Stars: 2.8k

Forks: 970

Last commit: 4 months ago

Apache Kyuubi (Scala)

A distributed, multi-tenant gateway providing serverless SQL on data warehouses and lakehouses.

#hiveserver2-alternative #hacktoberfest #spark

Stars: 2.4k

Forks: 1.0k

Last commit: 13 hours ago

.NET for Apache Spark (C#)

.NET for Apache Spark provides high-performance .NET APIs for Apache Spark, enabling C# and F# developers to work with structured and streaming data.

#apache-spark #spark #dataframe

Stars: 2.1k

Forks: 333

Last commit: 2 months ago

DataStax Spark Cassandra Connector (Scala)

A connector that enables Apache Spark to read from and write to Apache Cassandra databases for distributed data processing.

#apache-spark #spark #scala-library

Stars: 1.9k

Forks: 930

Last commit

Szilard's machine learning benchmark (R)

A minimal benchmark comparing scalability, speed, and accuracy of popular open-source machine learning libraries for binary classification.

#h2o #random-forest #open-source

Stars: 1.9k

Forks: 327

Last commit: 3 years ago

Gaffer (Java)

A graph database framework for storing and querying large-scale graphs with rich properties and in-database aggregation.

#apache-spark #parquet #entity-relation

Stars: 1.8k

Forks: 363

Last commit: 1 year ago

Genie (Java)

A federated Big Data orchestration service that simplifies job execution across distributed clusters by abstracting infrastructure complexity.

#data-orchestration #spark #netflixoss

Stars: 1.8k

Forks: 375

Last commit: 10 days ago

Elephas (Python)

Elephas is a Keras extension for distributed deep learning on Apache Spark, enabling data-parallel training at scale.

#apache-spark #model-training #spark

Stars: 1.6k

Forks: 303

Last commit: 3 years ago

mongo-hadoop (Java)

A library that lets MongoDB serve as an input source or output destination for Hadoop MapReduce tasks and ecosystem tools.

#mapreduce #bson #spark

Stars: 1.6k

Forks: 588

Last commit: 4 years ago

MLeap (Scala)

MLeap is a portable execution engine for deploying machine learning pipelines from Spark and Scikit-learn without their runtime dependencies.

#apache-spark #spark #production-ml

Stars: 1.5k

Forks: 317

Last commit: 2 days ago

Optimus (Python)

A Python library for agile data preparation workflows that works with Pandas, Dask, cuDF, Dask-cuDF, Vaex, and PySpark.

#data-cleaning #cudf #spark

Stars: 1.5k

Forks: 232

Last commit: 1 year ago

Project Nessie (Java)

A transactional catalog for data lakes with Git-like semantics, enabling version control and branching for data assets.

#version-control #iceberg #data-versioning

Stars: 1.5k

Forks: 183

Last commit: 9 hours ago

sparkmagic (Python)

Jupyter magics and kernels for interactively working with remote Spark clusters via Livy, Lighter, or Ilum.

#apache-spark #spark #notebook

Stars: 1.4k

Forks: 443

Last commit: 10 months ago

GraphFrames (Scala)

A DataFrame-based graph processing library for Apache Spark, enabling scalable graph analytics and algorithms.

#graph-processing #apache-spark #network-motifs

Stars: 1.2k

Forks: 268

Last commit: 7 hours ago

Hail (Python)

An open-source, Python-based data analysis tool with specialized data types and methods for genomic data at scale.

#scientific-computing #spark #python-library

Stars: 1.1k

Forks: 266

Last commit: 14 hours ago

ADAM (Scala)

A genomics analysis platform that uses Apache Spark to parallelize genomic data processing across clusters, replacing traditional file-based workflows.

#genomic-data #apache-spark #parquet

Stars: 1.1k

Forks: 312

Last commit: 4 months ago

Snappydata (Scala)

A distributed, in-memory optimized analytics database that fuses Apache Spark and Apache Geode for unified stream, transaction, and analytic workloads.

#stream-processing #in-memory-analytics #apache-spark

Stars: 1.0k

Forks: 198

Last commit: 3 years ago