Apache Spark

93 projects

An open-source AI engineering platform for debugging, evaluating, monitoring, and optimizing production LLM applications and machine learning models.

#ai-gateway #apache-spark #ai

Stars: 27.2k

Forks: 6.0k

Last commit: 18 hours ago

Apache Iceberg

Language: Java

A high-performance table format for huge analytic datasets, enabling multiple engines to safely work with the same tables simultaneously.

#apache-flink #hacktoberfest #apache-spark

Stars: 9.1k

Forks: 3.4k

Last commit: 22 hours ago

CatBoost

Language: C++

A high-performance gradient boosting library with best-in-class handling of categorical features and support for CPU/GPU training.

#apache-spark #gbdt #python-library

Stars: 9.0k

Forks: 1.3k

Last commit: 2 days ago

Delta Lake

Language: Scala

An open-source storage framework that enables building a Lakehouse architecture with ACID transactions and scalable metadata handling.

#apache-spark #parquet #data-versioning

Stars: 8.9k

Forks: 2.1k

Last commit: 23 hours ago

Apache Hudi

Language: Java

An open data lakehouse platform for incremental data processing with upserts, deletes, and time-travel queries.

#apache-flink #upsert-delete #stream-processing

Stars: 6.2k

Forks: 2.5k

Last commit: 18 hours ago

Microsoft ML for Apache Spark

Language: Scala

An open-source library for building massively scalable machine learning pipelines on Apache Spark.

#apache-spark #microsoft #spark

Stars: 5.2k

Forks: 863

Last commit: 18 days ago

spark-nlp

Language: Scala

A state-of-the-art Natural Language Processing library built on Apache Spark, offering 100,000+ pretrained models and pipelines in 200+ languages.

#apache-spark #spark #transformer-models

Stars: 4.1k

Forks: 743

Last commit: 2 days ago

Hunting ELK (HELK)

Language: Jupyter Notebook

An open-source threat hunting platform with advanced analytics capabilities built on the ELK stack, Apache Spark, and Jupyter notebooks.

#apache-spark #elk-stack #security-analytics

Stars: 3.9k

Forks: 690

Last commit: 2 years ago

RoaringBitmap

Language: Java

A compressed bitmap data structure for Java that outperforms alternatives like WAH, EWAH, and Concise in speed and compression.

#apache-spark #java-library #compressed-bitmap

Stars: 3.9k

Forks: 594

Last commit: 1 day ago

TensorFlowOnSpark

Language: Python

Enables distributed TensorFlow training and inference on Apache Spark and Hadoop clusters with minimal code changes.

#apache-spark #yahoo #model-training

Stars: 3.8k

Forks: 939

Last commit: 3 years ago

deequ

Language: Scala

A library built on Apache Spark for defining unit tests to measure data quality in large datasets.

#data-testing #apache-spark #spark

Stars: 3.6k

Forks: 584

Last commit: 3 days ago

Koalas

Language: Python

Koalas provides the pandas DataFrame API on Apache Spark, enabling data scientists to work with big data using familiar pandas syntax.

#apache-spark #spark #mlflow

Stars: 3.4k

Forks: 369

Last commit: 2 years ago

spark-jobserver

Language: Scala

A RESTful job server for Apache Spark that provides a service interface for submitting and managing Spark jobs, jars, and contexts.

#apache-spark #spark #rest-api

Stars: 2.8k

Forks: 970

Last commit: 4 months ago

Apache Sedona

Language: Java

A cluster computing framework for processing large-scale geospatial data within Apache Spark, Flink, and other big data systems.

#apache-flink #hacktoberfest #apache-spark

Stars: 2.4k

Forks: 773

Last commit: 1 day ago

.NET for Apache Spark

Language: C#

.NET for Apache Spark provides high-performance .NET APIs for Apache Spark, enabling C# and F# developers to work with structured and streaming data.

#apache-spark #spark #dataframe

Stars: 2.1k

Forks: 333

Last commit: 2 months ago

Elasticsearch Hadoop

Language: Java

A native integration library for using Elasticsearch with Hadoop, Spark, and Hive for real-time search and analytics on big data.

#apache-spark #mapreduce #data-integration

Stars: 2.0k

Forks: 1.0k

Last commit: 1 day ago

DataStax Spark Cassandra Connector

Language: Scala

A connector that enables Apache Spark to read from and write to Apache Cassandra databases for distributed data processing.

#apache-spark #spark #scala-library

A curated list of awesome Apache Spark packages, libraries, and resources for data engineers and scientists.

#apache-spark #data-science #spark-ecosystem

Stars: 1.9k

Forks: 346

Last commit: 4 months ago

Gaffer

Language: Java

A graph database framework for storing and querying large-scale graphs with rich properties and in-database aggregation.

#apache-spark #parquet #entity-relation

Stars: 1.8k

Forks: 363

Last commit: 1 year ago

Oryx 2

Language: Java

A lambda architecture framework on Apache Spark and Kafka for building and deploying real-time large-scale machine learning applications.

#apache-spark #large-scale-ml #classification

Stars: 1.8k

Forks: 401

Last commit: 5 years ago

Elephas

Language: Python

Elephas is a Keras extension for distributed deep learning on Apache Spark, enabling data-parallel training at scale.

#apache-spark #model-training #spark

Stars: 1.6k

Forks: 303

Last commit: 3 years ago

spark-testing-base

Language: Scala

Base classes for writing Apache Spark tests in Scala and Python, simplifying test setup and teardown.

#apache-spark #unit-testing #integration-testing

Stars: 1.6k

Forks: 356

Last commit: 3 months ago

MLeap

Language: Scala

MLeap is a portable execution engine for deploying machine learning pipelines from Spark and Scikit-learn without their runtime dependencies.

#apache-spark #spark #production-ml

Stars: 1.5k

Forks: 317

Last commit: 3 days ago

HiBench

Language: Java

A comprehensive benchmark suite for evaluating speed, throughput, and resource utilization of big data frameworks like Hadoop, Spark, and streaming engines.

#apache-spark #performance-testing #distributed-systems

Stars: 1.5k

Forks: 766

Last commit: 7 months ago

sparkmagic

Language: Python

Jupyter magics and kernels for interactively working with remote Spark clusters via Livy, Lighter, or Ilum.

#apache-spark #spark #notebook

Stars: 1.4k

Forks: 443

Last commit: 10 months ago

GraphFrames

Language: Scala

A DataFrame-based graph processing library for Apache Spark, enabling scalable graph analytics and algorithms.

#graph-processing #apache-spark #network-motifs

Stars: 1.2k

Forks: 268

Last commit: 19 hours ago

Sparkit-learn

Language: Python

PySpark + Scikit-learn = Sparkit-learn

#apache-spark #python #scikit-learn

Stars: 1.2k

Forks: 254

Last commit: 5 years ago

SystemML

Language: Java

An open-source machine learning system for the end-to-end data science lifecycle from data preparation to model serving.

#federated-learning #apache-spark #data-science

A genomics analysis platform that uses Apache Spark to parallelize genomic data processing across clusters, replacing traditional file-based workflows.

#genomic-data #apache-spark #parquet

Stars: 1.1k

Forks: 312

Last commit: 4 months ago

Snappydata

Language: Scala

A distributed, in-memory optimized analytics database that fuses Apache Spark and Apache Geode for unified stream, transaction, and analytic workloads.

#stream-processing #in-memory-analytics #apache-spark

Stars: 1.0k

Forks: 198

Last commit: 3 years ago