An open-source AI engineering platform for debugging, evaluating, monitoring, and optimizing production LLM applications and machine learning models.
A high-performance gradient boosting library with best-in-class handling of categorical features and support for CPU/GPU training.
A high-performance table format for huge analytic datasets, enabling multiple engines to safely work with the same tables simultaneously.
An open-source storage framework that enables building a Lakehouse architecture with ACID transactions and scalable metadata handling.
An open data lakehouse platform for incremental data processing with upserts, deletes, and time-travel queries.
An open-source library for building massively scalable machine learning pipelines on Apache Spark.
A state-of-the-art Natural Language Processing library built on Apache Spark, offering 100,000+ pretrained models and pipelines in 200+ languages.
An open-source threat hunting platform with advanced analytics capabilities, built on the ELK stack, Apache Spark, and Jupyter notebooks.
Enables distributed TensorFlow training and inference on Apache Spark and Hadoop clusters with minimal code changes.
A compressed bitmap data structure for Java that outperforms alternatives like WAH, EWAH, and Concise in speed and compression.
A library built on Apache Spark for defining unit tests to measure data quality in large datasets.
Koalas provides the pandas DataFrame API on Apache Spark, enabling data scientists to work with big data using familiar pandas syntax.
A RESTful job server for Apache Spark that provides a service interface for submitting and managing Spark jobs, jars, and contexts.
A cluster computing framework for processing large-scale geospatial data within Apache Spark, Flink, and other big data systems.
.NET for Apache Spark provides high-performance .NET APIs for Apache Spark, enabling C# and F# developers to work with structured and streaming data.
Native integration library for using Elasticsearch with Hadoop, Spark, and Hive for real-time search and analytics on big data.
A connector that enables Apache Spark to read from and write to Apache Cassandra databases for distributed data processing.
A curated list of awesome Apache Spark packages, libraries, and resources for data engineers and scientists.
A graph database framework for storing and querying large-scale graphs with rich properties and in-database aggregation.
A lambda architecture framework on Apache Spark and Kafka for building and deploying real-time large-scale machine learning applications.
Elephas is a Keras extension for distributed deep learning on Apache Spark, enabling data-parallel training at scale.
Base classes for writing Apache Spark tests in Scala and Python, simplifying test setup and teardown.
MLeap is a portable execution engine for deploying machine learning pipelines from Spark and Scikit-learn without their runtime dependencies.