A comprehensive JVM-based deep learning ecosystem for building, training, and deploying models with support for model import and distributed training.
Generate comprehensive data quality profiles and exploratory data analysis reports for Pandas and Spark DataFrames with a single line of code.
An open source machine learning server for developers and data scientists, supporting event collection, algorithm deployment, and REST API queries.
An open-source storage framework that enables building a Lakehouse architecture with ACID transactions and scalable metadata handling.
A distributed caching platform that bridges computation frameworks and storage systems for large-scale analytics and ML workloads.
Automated scripts and instructions for setting up a comprehensive macOS development environment with tools for Python, web, data, and cloud development.
An open-source library for building massively scalable machine learning pipelines on Apache Spark.
A state-of-the-art Natural Language Processing library built on Apache Spark, offering 100,000+ pretrained models and pipelines in 200+ languages.
Enables distributed TensorFlow training and inference on Apache Spark and Hadoop clusters with minimal code changes.
A compressed bitmap data structure for Java that outperforms alternatives like WAH, EWAH, and Concise in speed and compression.
A library built on Apache Spark for defining unit tests to measure data quality in large datasets.
Koalas provides the pandas DataFrame API on Apache Spark, enabling data scientists to work with big data using familiar pandas syntax.
A RESTful job server for Apache Spark that provides a service interface for submitting and managing Spark jobs, jars, and contexts.
A distributed, multi-tenant gateway providing serverless SQL on data warehouses and lakehouses.
.NET for Apache Spark provides high-performance .NET APIs for Apache Spark, enabling C# and F# developers to work with structured and streaming data.
A connector that enables Apache Spark to read from and write to Apache Cassandra databases for distributed data processing.
A minimal benchmark comparing scalability, speed, and accuracy of popular open-source machine learning libraries for binary classification.
A graph database framework for storing and querying large-scale graphs with rich properties and in-database aggregation.
A federated Big Data orchestration service that simplifies job execution across distributed clusters by abstracting infrastructure complexity.
Elephas is a Keras extension for distributed deep learning on Apache Spark, enabling data-parallel training at scale.
A library enabling MongoDB to serve as an input source or output destination for Hadoop MapReduce tasks and ecosystem tools.
MLeap is a portable execution engine for deploying machine learning pipelines from Spark and Scikit-learn without their runtime dependencies.
A Python library for agile data preparation workflows that works with Pandas, Dask, cuDF, Dask-cuDF, Vaex, and PySpark.