A Python library for comparing Pandas, Polars, Spark, and Snowpark DataFrames with detailed reporting and flexible matching.
DataComPy is a Python library for comparing DataFrames across multiple backends like Pandas, Polars, Spark, and Snowpark. It solves the problem of validating data consistency and quality by providing detailed mismatch reports and configurable comparison logic, going beyond basic equality checks. It serves as a modern, open-source alternative to proprietary data comparison tools.
DataComPy is aimed at data engineers, data scientists, and analysts working in the Python data ecosystem who need to validate, reconcile, or audit datasets across different processing frameworks or environments.
Developers choose DataComPy for its multi-backend support, detailed reporting, and flexibility in matching criteria, offering a more informative and adaptable solution than native DataFrame equality methods or limited proprietary alternatives.
Pandas, Polars, Spark, and Snowpark DataFrame comparison for humans and more!
Supports multiple data processing frameworks such as Pandas, Polars, Spark, and Snowflake (via Snowpark), with Fugue integration enabling unified comparisons across diverse environments, as highlighted in the README.
Generates comprehensive statistics and specific mismatch details, providing actionable insights beyond basic equality checks, which is core to its value proposition.
Allows adjustments for tolerances in numeric comparisons and column matching criteria, essential for real-world data validation with slight variations.
Designed as an open-source alternative to SAS's PROC COMPARE, easing migration for teams moving from proprietary systems to Python ecosystems.
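As a rough illustration of what tolerance-based matching on join columns means, the sketch below uses plain pandas and NumPy rather than DataComPy's own API. The function name `compare_with_tolerance`, its parameters, and the sample data are all hypothetical and chosen only for this example.

```python
import numpy as np
import pandas as pd


def compare_with_tolerance(left, right, key, column, abs_tol=0.0):
    """Join two frames on `key` and flag values of `column` that differ by more than `abs_tol`.

    Hypothetical helper for illustration; not part of DataComPy.
    """
    # An outer join with indicator=True records which side each key came from.
    merged = left.merge(
        right, on=key, how="outer", suffixes=("_left", "_right"), indicator=True
    )

    # Only rows present on both sides can be compared value by value.
    both = merged[merged["_merge"] == "both"].copy()
    both["match"] = np.isclose(
        both[f"{column}_left"], both[f"{column}_right"], rtol=0.0, atol=abs_tol
    )

    return {
        "only_in_left": merged.loc[merged["_merge"] == "left_only", key].tolist(),
        "only_in_right": merged.loc[merged["_merge"] == "right_only", key].tolist(),
        "mismatched": both.loc[~both["match"], key].tolist(),
    }


# Made-up sample data: id 2 differs by 0.004 (within tolerance), id 3 differs by 1.0.
left = pd.DataFrame({"id": [1, 2, 3], "amount": [10.00, 20.00, 30.00]})
right = pd.DataFrame({"id": [2, 3, 4], "amount": [20.004, 31.00, 40.00]})

result = compare_with_tolerance(left, right, key="id", column="amount", abs_tol=0.01)
print(result)
```

A plain equality check would only report that the two frames differ. Separating unmatched keys from out-of-tolerance values is the kind of detail a comparison report is meant to surface.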
Support for backends such as Spark or Snowflake requires installing optional pip extras, and the compatibility matrices show limitations with Python 3.12 and newer Pandas versions, which adds setup overhead.
The detailed reporting and abstraction layers, especially through Fugue, may introduce performance penalties compared to native backend-specific comparisons, particularly for large datasets.
The project is transitioning to v1, with a support branch for older versions, which signals potential breaking changes and reduced stability for current users, as noted in the README.
datacompy is an open-source alternative to the following products: