An open-source AI engineering platform for debugging, evaluating, monitoring, and optimizing production LLM applications and machine learning models.
A high-performance gradient boosting library with best-in-class handling of categorical features and support for CPU/GPU training.
A high-performance table format for huge analytic datasets, enabling multiple engines to safely work with the same tables simultaneously.
An open-source storage framework that enables building a Lakehouse architecture with ACID transactions and scalable metadata handling.
An open data lakehouse platform for incremental data processing with upserts, deletes, and time-travel queries.
An open-source library for building massively scalable machine learning pipelines on Apache Spark.
A state-of-the-art Natural Language Processing library built on Apache Spark, offering 100,000+ pretrained models and pipelines in 200+ languages.
An open-source threat hunting platform with advanced analytics capabilities built on the ELK stack, Apache Spark, and Jupyter notebooks.
A compressed bitmap data structure for Java that outperforms alternatives like WAH, EWAH, and Concise in speed and compression.
Enables distributed TensorFlow training and inference on Apache Spark and Hadoop clusters with minimal code changes.
A library built on Apache Spark for defining unit tests to measure data quality in large datasets.
Koalas provides the pandas DataFrame API on Apache Spark, enabling data scientists to work with big data using familiar pandas syntax.
A RESTful job server for Apache Spark that provides a service interface for submitting and managing Spark jobs, jars, and contexts.
A cluster computing framework for processing large-scale geospatial data within Apache Spark, Flink, and other big data systems.
.NET for Apache Spark provides high-performance .NET APIs for Apache Spark, enabling C# and F# developers to work with structured and streaming data.
Native integration library for using Elasticsearch with Hadoop, Spark, and Hive for real-time search and analytics on big data.
A connector that enables Apache Spark to read from and write to Apache Cassandra databases for distributed data processing.
A curated list of awesome Apache Spark packages, libraries, and resources for data engineers and scientists.
A graph database framework for storing and querying large-scale graphs with rich properties and in-database aggregation.
A lambda architecture framework on Apache Spark and Kafka for building and deploying real-time large-scale machine learning applications.
Elephas is a Keras extension for distributed deep learning on Apache Spark, enabling data-parallel training at scale.
Base classes for writing Apache Spark tests in Scala and Python, simplifying test setup and teardown.
MLeap is a portable execution engine for deploying machine learning pipelines from Spark and Scikit-learn without their runtime dependencies.
A comprehensive benchmark suite for evaluating speed, throughput, and resource utilization of big data frameworks like Hadoop, Spark, and streaming engines.
Jupyter magics and kernels for interactively working with remote Spark clusters via Livy, Lighter, or Ilum.
A DataFrame-based graph processing library for Apache Spark, enabling scalable graph analytics and algorithms.
An open-source machine learning system for the end-to-end data science lifecycle from data preparation to model serving.
A genomics analysis platform that uses Apache Spark to parallelize genomic data processing across clusters, replacing traditional file-based workflows.
A distributed, in-memory optimized analytics database that fuses Apache Spark and Apache Geode for unified stream, transaction, and analytic workloads.
An R interface for Apache Spark that enables distributed data processing, machine learning, and SQL queries using familiar R syntax.
A REST interface for interacting with Apache Spark from anywhere, enabling remote code execution and job submissions.
C# and F# language bindings and extensions for Apache Spark, enabling .NET developers to write Spark driver programs and data processing operations.
A Python API for Deequ, enabling data quality testing and validation on large datasets using Apache Spark.
A scalable machine learning library for training Generalized Linear Models and GLMix models on Apache Spark.
A PySpark testing library providing fast helper methods with descriptive, color-coded error messages for DataFrame and column comparisons.
A Scala library providing essential Spark extensions, helper methods, and custom transformations to maximize developer productivity.