Data Profiling

17 projects

A unified open-source metadata platform for data discovery, observability, and governance with column-level lineage and team collaboration.

#data-collaboration #data-lineage #open-source

Stars: 14.5k

Forks: 2.2k

Last commit: 18 hours ago

Pandas Profiling (Python)

Generate comprehensive data quality profiles and exploratory data analysis reports for Pandas and Spark DataFrames with a single line of code.

#spark #python-library #pandas-profiling

Stars: 13.7k

Forks: 1.8k

Last commit: 3 months ago

YData Profiling (Python)

Generate comprehensive data quality profiling and exploratory data analysis reports for Pandas and Spark DataFrames with a single line of code.

#python-library #pandas-profiling #data-science

Stars: 13.7k

Forks: 1.8k

Last commit

Great Expectations (Python)

A Python library for data quality testing and validation using expressive, extensible Expectations.

#data-testing #datacleaning #open-source

An open-source data-centric AI library for automatically detecting and fixing data quality issues in machine learning datasets.

#data-cleaning #data-centric-ai #out-of-distribution-detection

A library built on Apache Spark for defining unit tests to measure data quality in large datasets.

#data-testing #apache-spark #spark

Stars: 3.6k

Forks: 584

Last commit: 3 days ago

SweetViz (Python)

A Python library for automated exploratory data analysis (EDA) with high-density visualizations and target analysis in two lines of code.

#statistical-analysis #data-science #automated-reporting

Stars: 3.1k

Forks: 288

Last commit: 3 months ago

Variety (JavaScript)

A lightweight MongoDB schema analyzer that reveals document structure, field frequencies, and data outliers.

#bson #devops #schema-analyzer

Stars: 1.8k

Forks: 243

Last commit: 19 hours ago

Data Profiler (Python)

A Python library that automatically extracts schema, statistics, and sensitive entities (PII/NPI) from datasets.

#sensitive-data-detection #data-labels #python-library

Stars: 1.6k

Forks: 187

Last commit: 3 days ago

python-deequ (Jupyter Notebook)

A Python API for Deequ, enabling data quality testing and validation on large datasets using Apache Spark.

#data-testing #apache-spark #python-api

Stars: 824

Forks: 156

Last commit: 3 days ago

DataExplorer (R)

An R package that automates exploratory data analysis and data treatment with one-line reports and visualizations.

#r-package #data-science #statistics

Stars: 545

Forks: 95

Last commit: 4 months ago

pandas_summary (Python)

An engine for ML/data tracking, visualization, explainability, drift detection, and dashboards, integrated with Polyaxon.

#spark #matplotlib #data-science

Stars: 534

Forks: 47

Last commit: 1 month ago

Documentation website from Jupyter Notebook (Python)

A lightweight Python tool for generating rich summary statistics of pandas and Polars dataframes directly in the console.

#data-science #statistics #eda

Stars: 514

Forks: 29

Last commit: 4 days ago

desbordante (C++)

A high-performance data profiler for discovering and validating complex patterns like functional dependencies, inclusion dependencies, and association rules.

#data-cleaning #pattern-discovery #data-science

A high-performance data profiler for discovering and validating complex patterns in datasets, enabling data cleaning and quality analysis.

#data-cleaning #cpp-library #data-science

Stars: 492

Forks: 101

Last commit: 5 days ago

DQOps (Java)

A DataOps-friendly data quality monitoring platform with customizable checks, dashboards, and incident management for multiple data sources.

#data-quality-report #data-observability #data-quality-checks

Stars: 194

Forks: 37

Last commit: 6 months ago

Mongoeye (Go)

A fast schema and data analyzer for MongoDB that provides detailed insights into database structure and content.

#database-tool #statistics #schema

Stars: 173

Forks: 8

Last commit: 4 years ago
