Open-Awesome

desbordante

AGPL-3.0 · C++ · v2.4.1

A high-performance data profiler for discovering and validating complex patterns like functional dependencies, inclusion dependencies, and association rules.

482 stars · 100 forks · 0 contributors

What is desbordante?

Desbordante is a high-performance data profiler that discovers and validates complex patterns in datasets, such as functional dependencies, inclusion dependencies, and association rules. It helps users uncover hidden relationships, improve data quality, and prepare data for analysis by identifying errors, duplicates, and integrity constraints.

Target Audience

Data scientists, data engineers, and researchers who need to perform deep data profiling, ensure data quality, or explore datasets for scientific or business insights. It is also suitable for database administrators looking to recover schema constraints.

Value Proposition

Developers choose Desbordante for its extensive support of over 20 pattern types, high-performance dynamic algorithms, and flexible interfaces (console, Python, web). Its ability to explain validation failures and support real-world data cleaning scenarios sets it apart from simpler profiling tools.

Overview

Desbordante is a high-performance data profiler that can discover many different patterns in data using various algorithms. It also lets users run data cleaning scenarios built on these algorithms. Desbordante has a console version and an easy-to-use web application.

Use Cases

Best For

  • Discovering functional dependencies to identify primary and foreign keys in databases
  • Validating data integrity constraints for data quality assurance
  • Performing data deduplication and typo detection in messy datasets
  • Exploring scientific datasets to formulate hypotheses based on discovered patterns
  • Preparing training data for machine learning through feature engineering and anomaly detection
  • Interactive data profiling and visualization via its web application
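To make the first use case concrete, here is a minimal plain-Python sketch of what key discovery means: finding the smallest column sets whose values are unique across all rows. It does not use Desbordante's API; the function name `minimal_unique_column_sets` and the sample rows are hypothetical, and the brute-force search is only practical for small tables.

```python
from itertools import combinations


def minimal_unique_column_sets(rows, columns):
    """Return minimal column sets whose value combinations are unique across rows."""
    found = []
    for size in range(1, len(columns) + 1):
        for combo in combinations(columns, size):
            # A superset of an already-found key is unique too, but not minimal.
            if any(set(key) <= set(combo) for key in found):
                continue
            projected = [tuple(row[c] for c in combo) for row in rows]
            if len(set(projected)) == len(projected):
                found.append(combo)
    return found


# Hypothetical sample data, for illustration only.
rows = [
    {"id": 1, "email": "a@x.org", "city": "Oslo"},
    {"id": 2, "email": "b@x.org", "city": "Oslo"},
    {"id": 3, "email": "c@x.org", "city": "Rome"},
]
keys = minimal_unique_column_sets(rows, ["id", "email", "city"])
print(keys)
```

A column set that is unique functionally determines every other column, which is why discovered dependencies can point to candidate primary keys.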

Not Ideal For

  • Projects requiring only basic data quality checks like duplicate detection or format validation without complex pattern analysis
  • Teams that prioritize a fully-featured, interactive web GUI for all data profiling tasks
  • Environments with outdated system dependencies or incompatible compiler versions (e.g., older than GCC 10 or Boost 1.85)
  • Users seeking plug-and-play data profiling with minimal configuration and immediate out-of-the-box usage

Pros & Cons

Pros

Extensive Pattern Library

Supports over 20 pattern types, including exact/approximate functional dependencies, inclusion dependencies, and association rules, with linked Colab notebooks for each, enabling deep data exploration.
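As an illustration of one of these pattern types, an inclusion dependency says that every value in one column also appears in another column, which is the property a foreign key relies on. The sketch below is plain Python, not Desbordante's API; `inclusion_violations` and the sample tables are hypothetical names chosen for the example.

```python
def inclusion_violations(dependent_rows, dep_col, referenced_rows, ref_col):
    """Return the values of dep_col that never appear in ref_col, sorted."""
    referenced = {row[ref_col] for row in referenced_rows}
    dependent = {row[dep_col] for row in dependent_rows}
    return sorted(dependent - referenced)


# Hypothetical sample data, for illustration only.
orders = [
    {"order": 10, "customer_id": 1},
    {"order": 11, "customer_id": 4},
    {"order": 12, "customer_id": 2},
]
customers = [{"id": 1}, {"id": 2}, {"id": 3}]

# The dependency orders.customer_id ⊆ customers.id holds only if this list is empty.
missing = inclusion_violations(orders, "customer_id", customers, "id")
print(missing)
```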

High-Performance Dynamic Algorithms

Offers dynamic validation that incrementally updates results after data changes, giving speedups of up to several orders of magnitude over static recomputation, as highlighted in the task definitions.
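The idea behind incremental validation can be sketched in a few lines of plain Python. This is not Desbordante's algorithm or API: the class `IncrementalFDChecker` is a hypothetical illustration that keeps per-group counts, so each insert or delete updates the verdict without rescanning the whole table.

```python
from collections import Counter, defaultdict


class IncrementalFDChecker:
    """Tracks whether lhs -> rhs holds while rows are inserted and deleted."""

    def __init__(self, lhs, rhs):
        self.lhs, self.rhs = lhs, rhs
        self.groups = defaultdict(Counter)  # lhs value -> counts of rhs values
        self.violating = 0  # number of lhs values mapped to more than one rhs value

    def _split(self, row):
        return tuple(row[c] for c in self.lhs), row[self.rhs]

    def insert(self, row):
        left, right = self._split(row)
        counts = self.groups[left]
        before = len(counts)
        counts[right] += 1
        # The group starts violating when a second distinct rhs value appears.
        if before == 1 and len(counts) == 2:
            self.violating += 1

    def delete(self, row):
        left, right = self._split(row)
        counts = self.groups.get(left)
        if not counts or right not in counts:
            raise KeyError("row was never inserted")
        before = len(counts)
        counts[right] -= 1
        if counts[right] == 0:
            del counts[right]
        # The group stops violating when only one distinct rhs value remains.
        if before == 2 and len(counts) == 1:
            self.violating -= 1
        if not counts:
            del self.groups[left]

    @property
    def holds(self):
        return self.violating == 0


checker = IncrementalFDChecker(lhs=["zip"], rhs="city")
checker.insert({"zip": "10115", "city": "Berlin"})
checker.insert({"zip": "75001", "city": "Paris"})
state_clean = checker.holds

typo = {"zip": "10115", "city": "Berlln"}  # hypothetical dirty row
checker.insert(typo)
state_dirty = checker.holds

checker.delete(typo)
state_fixed = checker.holds
print(state_clean, state_dirty, state_fixed)
```

Each update touches a single group, so its cost does not grow with the size of the table, whereas a static check has to reread every row.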

Flexible Multi-Interface Support

Provides console, Python bindings with pandas DataFrame integration, and a web application, allowing adaptation to various workflows, though the web app is limited in scope.

Detailed Validation Explanations

Validation tasks return not just true/false but also explanations like conflicting rows or values, aiding in debugging data quality issues, as emphasized in the pattern descriptions.
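A minimal sketch of what validation with an explanation can look like, in plain Python rather than Desbordante's own interface: the hypothetical function `validate_fd` returns a verdict together with the row indices that conflict, grouped by the left-hand-side value they share.

```python
from collections import defaultdict


def validate_fd(rows, lhs, rhs):
    """Check lhs -> rhs; return (holds, conflicts).

    conflicts maps each violating lhs value to {rhs value: [row indices]}.
    """
    seen = defaultdict(lambda: defaultdict(list))
    for index, row in enumerate(rows):
        seen[tuple(row[c] for c in lhs)][row[rhs]].append(index)
    conflicts = {
        left: dict(by_rhs) for left, by_rhs in seen.items() if len(by_rhs) > 1
    }
    return not conflicts, conflicts


# Hypothetical sample data with one typo, for illustration only.
rows = [
    {"zip": "10115", "city": "Berlin"},
    {"zip": "10115", "city": "Berlln"},
    {"zip": "75001", "city": "Paris"},
]
holds, conflicts = validate_fd(rows, ["zip"], "city")
print(holds, conflicts)
```

Here the conflicting rows 0 and 1 point straight at the misspelled city, which is the kind of detail a bare true/false answer would hide.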

Cons

Limited Web Application

The web interface currently supports only a limited number of patterns and is described as more of an interactive demo, with time and memory limits enforced in the deployed version.

Complex Installation and Dependencies

Pip installation may fail on unsupported systems, requiring manual building with specific compiler versions (e.g., GCC 10+) and Boost libraries, which can be cumbersome and error-prone.

Steep Learning Curve

Patterns are based on academic research, and the README admits a lack of comprehensive guides, directing users to research papers for understanding, which may deter non-expert users.

Quick Stats

Stars: 482
Forks: 100
Contributors: 0
Open Issues: 34
Last commit: 3 days ago
Created: 2020

Tags

#data-cleaning #data-science #data-wrangling #data-engineering #c-plus-plus #data-profiling #data-quality #anomaly-detection #data-cleansing #data-exploration #data-preprocessing #python-bindings #data-analytics #data-mining

Built With

  • pandas
  • CMake
  • Python
  • spdlog
  • Pybind11
  • Boost
  • C++

Included in

Python · 290.8k
Auto-fetched 1 day ago

Related Projects

openbb

Financial data platform for analysts, quants and AI agents.

Stars: 68,285
Forks: 6,876
Last commit: 2 days ago

Pathway

Python ETL framework for stream processing, real-time analytics, LLM pipelines, and RAG.

Stars: 63,179
Forks: 1,673
Last commit: 2 days ago

pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more.

Stars: 48,879
Forks: 19,964
Last commit: 1 day ago

polars

Extremely fast Query Engine for DataFrames, written in Rust.

Stars: 38,630
Forks: 2,858
Last commit: 2 days ago