A flexible and expressive API for performing statistical data validation on dataframe-like objects.
Pandera is a lightweight, flexible, and expressive statistical data testing library for validating dataframe-like objects. It provides an API for defining schemas and checks that verify data correctness in processing pipelines, and it supports multiple dataframe libraries, including pandas, polars, and pyspark. The framework helps scientists, engineers, and analysts catch data errors early by validating dataframes against declared types and statistical constraints.
Pandera is aimed at data scientists, data engineers, and analysts who work with dataframe-like objects in Python and need to ensure data quality and correctness in their pipelines. It is particularly useful for teams building robust data processing or machine learning workflows.
Developers choose Pandera for an expressive, flexible API that works across multiple dataframe libraries, keeps validation rules readable, and fits into existing workflows. Its main differentiator is statistical typing combined with both object-based and class-based schema definitions, which makes it more robust than hand-written checks or less expressive validation tools.
Validates across pandas, polars, and pyspark DataFrames, as shown in the README's installation and examples, making it versatile for diverse data workflows.
Offers both object-based (dictionary-like) and class-based (Python classes with type hints) schema definitions, catering to different coding preferences and enhancing readability.
Supports statistical constraints (e.g., ge, lt), custom lambda functions, and complex logic, enabling detailed data quality enforcement beyond basic type checking.
Backed by Union.ai and updated regularly. README badges for CI, documentation, and pyOpenSci indicate active maintenance and community support.
The README warns that top-level imports are deprecated as of v0.24.0, so existing code has to migrate to the library-specific modules, which adds maintenance overhead for current users.
Validation adds computational cost, which can slow down pipelines with very large datasets or high-frequency checks, though benchmarking is provided via asv.
Requires installing extras (e.g., 'pandera[pandas]') for different libraries, increasing setup complexity and potential conflicts in constrained environments.