A Python library that automatically extracts schema, statistics, and sensitive entities (PII/NPI) from datasets.
DataProfiler is a Python library that automates data analysis, monitoring, and sensitive data detection. It automatically loads various data formats, extracts detailed statistics and schema, and identifies personally identifiable information (PII) using a pre-trained deep learning model. The library generates comprehensive profiles for structured, unstructured, and graph data to support downstream applications and improve data governance.
Data engineers, data scientists, and analysts who need to quickly profile datasets, monitor data quality, and detect sensitive information across CSV, JSON, Avro, Parquet, text, and other formats. It is also suitable for teams requiring automated data insights for compliance and governance.
Developers choose DataProfiler for its automated, all-in-one profiling that combines statistical analysis with built-in sensitive data detection, reducing manual effort. Its ability to update, merge, and diff profiles supports incremental and distributed analysis, making it versatile for ongoing data monitoring.
What's in your data? Extract schema, statistics and entities from datasets
Supports CSV, JSON, Avro, Parquet, text, URLs, and Pandas DataFrames through a single Data() call, eliminating manual parsing for common data types.
Includes a pre-trained deep learning model to automatically identify sensitive entities like credit cards and emails, reducing manual effort for compliance and data governance.
Allows updating, merging, and diffing profiles, enabling incremental analysis and distributed processing for large or evolving datasets.
Generates structured, unstructured (text), and graph profiles tailored to different data types, providing comprehensive insights beyond tabular data.
The full installation requires TensorFlow and other ML packages, which can be bulky and cause version conflicts; slimmer installs omit them, disabling ML-backed features and limiting functionality.
Adding new PII entities or modifying the detection model requires extending the pre-trained pipeline or inserting new models, which demands deep learning expertise and code changes.
Profiling with deep learning models relies on sampling and can be slow on large datasets, and expensive features such as correlation matrices are disabled by default, indicating potential bottlenecks.