Generate comprehensive data quality profiling and exploratory data analysis reports for Pandas and Spark DataFrames with a single line of code.
ydata-profiling is a Python library that automates exploratory data analysis (EDA) and data quality assessment for datasets. It extends pandas' `df.describe()` to generate detailed, interactive reports with minimal code, helping users quickly understand data structure, detect issues, and visualize distributions. The library supports various data types including time-series and text, and exports reports in HTML, JSON, or as Jupyter widgets.
Data scientists, data analysts, and machine learning engineers who work with pandas DataFrames and need to perform rapid, automated EDA to assess data quality and generate shareable reports. It is also suitable for teams requiring consistent profiling across datasets or integrating EDA into pipelines.
Developers choose ydata-profiling for its one-line EDA experience that delivers comprehensive insights beyond basic statistics, including automatic type inference, quality warnings, and multivariate analysis. Its unique selling point is the ability to generate interactive, publication-ready reports with minimal effort, supporting advanced features like dataset comparison and Spark integration for scalability.
Automatically infers column data types (categorical, numerical, date, and more), reducing manual inspection effort.
Flags data quality issues such as missing values, skewness, and duplicates with warnings, providing actionable insights directly in the report.
Exports reports to HTML, JSON, or Jupyter widgets, enabling easy sharing and integration into a variety of workflows.
Includes time-series and text analysis, such as autocorrelation and most-common-category summaries, extending profiling beyond basic numerical statistics.
The project's own documentation offers configuration tips for profiling large datasets, an acknowledgement that runs can be slow or memory-heavy without tuning, which limits out-of-the-box scalability.
Spark support is flagged as an area where the project is still seeking contributions, suggesting it may be incomplete or not fully production-ready for distributed computing environments.
Installing with extras such as `notebook` or `pyspark` pulls in multiple additional dependencies, which can bloat environments or cause version conflicts in constrained setups.
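For reference, extras are selected at install time; a typical invocation looks like the following (extra names as commonly documented, though they may vary by version):

```shell
# Core library only
pip install -U ydata-profiling

# With Jupyter widget support and Spark integration (heavier footprint)
pip install -U "ydata-profiling[notebook,pyspark]"
```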