Showing 30 of 30 projects
An orchestration platform for developing, deploying, and monitoring data pipelines and assets.
A unified open-source metadata platform for data discovery, observability, and governance with column-level lineage and team collaboration.
Generate comprehensive data quality profiles and exploratory data analysis reports for Pandas and Spark DataFrames with a single line of code.
Generate comprehensive data quality profiling and exploratory data analysis reports for Pandas and Spark DataFrames with a single line of code.
A transformation workflow that enables data teams to transform data in their warehouse using SQL and software engineering best practices.
A transformation tool that enables data analysts and engineers to transform data using software engineering best practices.
A Python library for data quality testing and validation using expressive, extensible Expectations.
An open-source data-centric AI library for automatically detecting and fixing data quality issues in machine learning datasets.
An open-source Python framework to evaluate, test, and monitor ML and LLM systems with 100+ built-in metrics.
An open-source feature store for managing and serving machine learning features for training and online inference.
An open-source tool that transforms object storage into a Git-like repository for versioned, atomic, and repeatable data lake operations.
A Python library using machine learning for accurate and scalable fuzzy matching, record deduplication, and entity resolution on structured data.
A flexible and expressive API for performing statistical data validation on dataframe-like objects.
A Python library for visualizing missing data in pandas DataFrames using matrix, bar, heatmap, and dendrogram plots.
A library built on Apache Spark for defining unit tests to measure data quality in large datasets.
A fast tool for comparing datasets within or across SQL databases to identify differences.
Automatically visualize any dataset with a single line of code, including data quality assessment and fixes.
A unified data pipeline tool for ingestion, transformation with SQL/Python/R, and data quality checks across major platforms.
A Go library for email verification without sending emails, featuring syntax validation, SMTP checks, disposable email detection, and domain typo suggestions.
A Python library that automatically extracts schema, statistics, and sensitive entities (PII/NPI) from datasets.
A canonical index of common brand names, operators, and features for consistent tagging in OpenStreetMap.
A Python API for Deequ, enabling data quality testing and validation on large datasets using Apache Spark.
A scalable library for exploring, validating, and monitoring machine learning data, integrated with TensorFlow and TFX.
A Python library for comparing Pandas, Polars, Spark, and Snowpark DataFrames with detailed reporting and flexible matching.
A high-performance data profiler for discovering and validating complex patterns like functional dependencies, inclusion dependencies, and association rules.
A high-performance data profiler for discovering and validating complex patterns in datasets, enabling data cleaning and quality analysis.
A Python data validation toolkit that finds data quality issues and generates beautiful, shareable reports for team collaboration.
A complete, fast, standards-based validation tool for GeoJSON data.
A Go library and CLI tool for validating CSV files against RFC 4180.
A DataOps-friendly data quality monitoring platform with customizable checks, dashboards, and incident management for multiple data sources.