A library built on Apache Spark for defining unit tests to measure data quality in large datasets.
Deequ is an open-source library built on Apache Spark that allows data teams to define and run unit tests for data quality at scale. It helps validate assumptions about large datasets (such as completeness, uniqueness, and value constraints) before the data is used in analytics or machine learning pipelines. By catching data errors early, it helps prevent downstream issues and supports more reliable data products.
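To make the idea of a "unit test for data" concrete, here is a minimal plain-Python sketch of the three kinds of assumption named above: completeness, uniqueness, and a value constraint. It is not Deequ's API and does not use Spark. The helper names (is_complete, is_unique, is_contained_in, run_checks) and the sample rows are hypothetical, and treating None as "missing" is an assumption of this sketch.

```python
# Plain-Python illustration of "unit tests for data". This is NOT Deequ's API;
# every helper below is a hypothetical stand-in written for this example.
from typing import Any, Callable, Dict, List, Set

Row = Dict[str, Any]


def is_complete(rows: List[Row], column: str) -> bool:
    """True when no row has a missing (None) value in `column`."""
    return all(row.get(column) is not None for row in rows)


def is_unique(rows: List[Row], column: str) -> bool:
    """True when no non-missing value in `column` appears more than once."""
    values = [row[column] for row in rows if row.get(column) is not None]
    return len(values) == len(set(values))


def is_contained_in(rows: List[Row], column: str, allowed: Set[Any]) -> bool:
    """True when every non-missing value in `column` is one of `allowed`."""
    return all(
        row[column] in allowed for row in rows if row.get(column) is not None
    )


def run_checks(
    rows: List[Row], checks: Dict[str, Callable[[List[Row]], bool]]
) -> Dict[str, bool]:
    """Evaluate each named check against the rows and collect pass/fail."""
    return {name: check(rows) for name, check in checks.items()}


# Hypothetical sample data: the third row is missing its priority.
rows = [
    {"id": 1, "priority": "high"},
    {"id": 2, "priority": "low"},
    {"id": 3, "priority": None},
]

results = run_checks(rows, {
    "size is 3": lambda r: len(r) == 3,
    "id is complete": lambda r: is_complete(r, "id"),
    "id is unique": lambda r: is_unique(r, "id"),
    "priority is complete": lambda r: is_complete(r, "priority"),
    "priority in {high, low}": lambda r: is_contained_in(r, "priority", {"high", "low"}),
})

for name, passed in results.items():
    print(("PASS" if passed else "FAIL") + "  " + name)
```

In this sample only the "priority is complete" check fails, which is the kind of violated assumption such a test is meant to surface before the data reaches a report or a model. Deequ expresses comparable checks declaratively and evaluates them on Spark rather than in local Python loops.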
Data engineers, data scientists, and analytics engineers working with large-scale data pipelines on Apache Spark who need to ensure data quality and reliability.
Developers choose Deequ because it provides a scalable, programmatic way to enforce data quality checks directly within Spark workflows, reducing manual validation efforts and catching data issues before they impact business decisions or models.
Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
Because it runs on Apache Spark, Deequ distributes its quality computations across a cluster and can process billions of rows, which makes it a good fit for large-scale data pipelines.
Offers a wide range of built-in checks for completeness, uniqueness, value ranges, and custom patterns, demonstrated in the basic example with checks like hasSize and isContainedIn.
Provides a Data Quality Definition Language (DQDL) for expressing rules in a simple, readable format, improving maintainability and reducing code verbosity, as shown in the DQDL examples.
Includes data profiling, anomaly detection, and a metrics repository for historical tracking, enabling proactive data quality monitoring beyond basic validation.
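The combination of a metrics repository and anomaly detection mentioned above can be sketched in plain Python as follows. This is an illustration of the general idea only, not Deequ's implementation or API: the in-memory dictionary standing in for the repository, the record and rate_of_change_ok helpers, the thresholds, and the row counts are all hypothetical.

```python
# Conceptual sketch only; NOT Deequ's API. A dictionary stands in for a
# metrics repository, and the anomaly rule is a simple rate-of-change test.
from typing import Dict, List, Optional

# metric name -> values recorded by earlier runs, oldest first
MetricsHistory = Dict[str, List[float]]


def record(history: MetricsHistory, metric: str, value: float) -> None:
    """Store the metric value computed by the current run."""
    history.setdefault(metric, []).append(value)


def rate_of_change_ok(
    history: MetricsHistory,
    metric: str,
    new_value: float,
    max_increase: Optional[float] = None,
    max_decrease: Optional[float] = None,
) -> bool:
    """Compare new_value with the most recently recorded value.

    The ratio new_value / previous must not exceed max_increase and must not
    fall below max_decrease. With no usable baseline the check passes.
    """
    previous_values = history.get(metric, [])
    if not previous_values or previous_values[-1] == 0:
        return True
    ratio = new_value / previous_values[-1]
    if max_increase is not None and ratio > max_increase:
        return False
    if max_decrease is not None and ratio < max_decrease:
        return False
    return True


# Hypothetical row counts from three earlier pipeline runs.
history: MetricsHistory = {}
for row_count in (1000, 1020, 990):
    record(history, "row_count", row_count)

# A small fluctuation passes; a jump to more than double the last value does not.
print(rate_of_change_ok(history, "row_count", 1005, max_increase=2.0, max_decrease=0.5))  # True
print(rate_of_change_ok(history, "row_count", 2500, max_increase=2.0, max_decrease=0.5))  # False
```

Comparing against stored history is what distinguishes this kind of monitoring from a fixed-threshold check: the rule adapts to the dataset's own past behavior instead of relying on a hard-coded expected size.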
Requires an Apache Spark environment and a Deequ release that matches the Spark version (e.g., Deequ 2.x runs only with Spark 3.1), which adds infrastructure overhead and makes it a poor fit for teams that do not already use Spark.
Core library is in Scala/Java; PyDeequ provides a Python wrapper but may lag in features or require additional setup, as noted in the README's separate PyDeequ section.
Designed for batch data validation on Spark DataFrames; not suitable for real-time streaming use cases without significant workarounds or integration efforts.