Spark

Showing 36 of 88 projects

A large-scale data warehouse system that provides approximate query answers with error bounds on massive datasets up to 300x faster than Hive.

#spark #sampling #performance-optimization

Stars: 660

Forks: 121

Last commit: 12 years ago

datacompy (Python)

A Python library for comparing Pandas, Polars, Spark, and Snowpark DataFrames with detailed reporting and flexible matching.

#apache-spark #fugue #spark

Stars: 654

Forks: 162

Last commit: 2 days ago

Wiki2Vec. Getting Word2vec vectors for entities and word from Wikipedia Dumps (Java)

Generate Word2Vec vectors for DBpedia entities from Wikipedia dumps, linking words and topics to structured knowledge.

#semantic-analysis #word2vec #entity-embeddings

An engine for ML/data tracking, visualization, explainability, drift detection, and dashboards, integrated with Polyaxon.

#spark #matplotlib #data-science

Stars: 534

Forks: 47

Last commit: 1 month ago

Kotlin for Apache Spark (Kotlin)

Kotlin bindings and extensions for Apache Spark, enabling idiomatic Kotlin development with data classes, lambdas, and null safety.

#apache-spark #spark #nullability

Stars: 481

Forks: 37

Last commit: 1 month ago

spark-fast-tests (Scala)

A fast Apache Spark testing helper library with beautifully formatted error messages for Scala applications.

#apache-spark #spark #unit-testing

Stars: 457

Forks: 77

Last commit: 3 months ago

sparkle (Haskell)

A library for writing Apache Spark applications in Haskell, enabling resilient analytics that scale to thousands of nodes.

#haskell #apache-spark #functional-programming

Stars: 449

Forks: 27

Last commit: 11 months ago

spark.fish (Shell)

A Fish shell plugin for generating sparklines in the terminal with improved performance and additional flags.

#developer-tools #open-source #spark

Stars: 378

Forks: 6

Last commit: 5 years ago

Delight (Scala)

A free, open-source alternative to Spark UI and Spark History Server with enhanced CPU and memory metrics visualizations.

#apache-spark #spark #delight

Stars: 345

Forks: 58

Last commit: 2 years ago

Hydrosphere Mist (Scala)

A serverless proxy for Spark clusters that provides a functional programming framework and deployment model for Spark applications.

#apache-spark #api #spark

Stars: 325

Forks: 69

Last commit: 3 months ago

neo4j-spark-connector (Scala)

A bi-directional connector enabling Apache Spark to read from and write to Neo4j graph databases using Spark DataSource APIs.

#hacktoberfest #apache-spark #neo4j-driver

Stars: 322

Forks: 119

Last commit: 20 hours ago

ada-language-server (Ada)

A language server implementing the Microsoft Language Server Protocol for Ada, SPARK, and GPR project files.

#libadalang #spark #gpr

Stars: 300

Forks: 70

Last commit: 2 days ago

Geni (Clojure)

An idiomatic Clojure dataframe library that runs on Apache Spark, providing a seamless interface for data processing and machine learning.

#apache-spark #high-performance-computing #spark

Stars: 294

Forks: 26

Last commit: 2 years ago

Spark (C#)

An open-source FHIR server developed in C#, supporting multiple FHIR versions for healthcare data interoperability.

#nuget #stu3 #spark

Stars: 280

Forks: 167

Last commit: 6 hours ago

isolation-forest (Scala)

A distributed Spark/Scala implementation of Isolation Forest and Extended Isolation Forest algorithms for scalable unsupervised outlier detection.

#apache-spark #spark #linkedin

Stars: 260

Forks: 54

Last commit: 1 month ago

ferry (Python)

Define, run, and deploy big data applications on AWS, OpenStack, and local machines using Docker.

#devops #spark #data-science

Stars: 254

Forks: 25

Last commit: 11 years ago

ruby-spark (Ruby)

A Ruby wrapper for Apache Spark, enabling large-scale data processing with Ruby's expressive syntax.

#rdd #apache-spark #distributed

Stars: 226

Forks: 28

Last commit: 9 years ago

LiFT (Scala)

A Scala/Spark library for measuring fairness and mitigating bias in large-scale machine learning workflows.

#fairness-ml #apache-spark #spark

Stars: 173

Forks: 22

Last commit: 7 months ago

ArchiveSpark (Scala)

An Apache Spark framework for efficient data processing, extraction, and derivation from web archives and archival collections.

#data-lineage #apache-spark #web-archives

Stars: 161

Forks: 19

Last commit: 9 months ago

Archives Unleashed Toolkit (Scala)

An open-source toolkit for analyzing web archives at scale using Apache Spark.

#apache-spark #web-archives #cultural-heritage

Stars: 158

Forks: 33

Last commit: 7 months ago

record-flux (Ada)

A toolset for formal specification and generation of verifiable binary parsers, message generators, and protocol state machines.

#binary-parser #state-machines #spark

Stars: 129

Forks: 10

Last commit: 6 months ago

spark-connect-rs (Rust)

An experimental Rust client for Apache Spark Connect, providing a DataFrame API to interact with Spark clusters.

#spark-connect #apache-spark #spark

Stars: 116

Forks: 24

Last commit: 1 year ago

dl4clj (Clojure)

A Clojure wrapper for Deeplearning4j, providing idiomatic access to neural networks, data import, and distributed training.

#spark #wrapper-library #data-science

Stars: 99

Forks: 18

Last commit: 8 years ago

iLID (Python)

A deep learning system for automatic spoken language identification from audio files using TensorFlow and Caffe.

#spark #deep-learning #neural-networks

Stars: 90

Forks: 24

Last commit: 7 years ago

cubit (Ada)

A multi-processor, 64-bit, formally-verified general-purpose operating system for x86-64, written in SPARK/Ada.

#spark #memory-management #ada

Stars: 88

Forks: 4

Last commit: 2 months ago

Docker for beginners (Jupyter Notebook)

A collection of interactive Jupyter notebooks for learning Hadoop, Spark, and MapReduce with hands-on tutorials and demos.

#google-colab #mapreduce-bash #apache-spark

Stars: 87

Forks: 27

Last commit: 2 months ago

Scylla-Migrator (Scala)

A Spark application for migrating data to ScyllaDB from CQL-compatible databases or DynamoDB via Alternator.

#apache-spark #parquet #migration

Stars: 73

Forks: 50

Last commit: 7 days ago

itachi (Scala)

A library that brings useful functions from various modern database management systems to Apache Spark.

#postgres #spark #trino

Stars: 63

Forks: 8

Last commit: 2 years ago

sparkly (Python)

Helpers & syntactic sugar for PySpark.

#spark #python #pyspark

Stars: 62

Forks: 9

Last commit: 7 months ago

Map/Reduce implementations of common ML algorithms (Jupyter Notebook)

Jupyter notebooks for hands-on Big Data Analytics exercise classes covering Spark ML, Map/Reduce algorithms, and deep learning.

#spark #educational #data-science

A Python library providing tools to process and analyze OMOP-standardized clinical data from AP-HP's Clinical Data Warehouse.

#clinical-data #spark #data-science

Stars: 45

Forks: 6

Last commit: 1 year ago

certiflie (Ada)

Ada and SPARK firmware for the Crazyflie 2.0 nano quadcopter, targeting the STM32F4 ARM chip.

#embedded-systems #spark #arm

Stars: 36

Forks: 18

Last commit: 7 years ago

ada-traits-containers (Ada)

A flexible Ada library offering generic containers and algorithms with SPARK compatibility and performance control.

#spark #memory-management #graph-algorithms

Stars: 35

Forks: 14

Last commit: 1 year ago

havk (Ada)

A minimalistic, security-focused x86-64 operating system kernel written in Ada/SPARK with formal verification.

#spark #uefi #ada

Stars: 29

Forks: 3

Last commit: 5 years ago

Archives Unleashed Notebooks (Jupyter Notebook)

Example notebooks for analyzing web archives using the Archives Unleashed Toolkit.

#web-archives #spark #pyspark-notebook

Stars: 26

Forks: 5

Last commit: 3 years ago

ada-actions (JavaScript)

GitHub Action to set up Ada and SPARK development environments for CI/CD workflows.

#embedded-systems #spark #ada

Stars: 24

Forks: 6

Last commit: 4 years ago
