Generate comprehensive data quality profiles and exploratory data analysis reports for Pandas and Spark DataFrames with a single line of code.
ydata-profiling is an open-source Python library that generates detailed exploratory data analysis (EDA) and data quality profiling reports for Pandas and Spark DataFrames. It automates the initial data assessment process by providing comprehensive statistics, visualizations, and alerts about potential data issues, all with minimal code. The tool helps data scientists and analysts quickly understand dataset structure, identify problems, and share insights through interactive HTML reports.
Data scientists, data analysts, and machine learning engineers who work with Pandas or Spark DataFrames and need to perform rapid, automated exploratory data analysis and data quality checks.
Developers choose ydata-profiling because it drastically reduces the time and code required for initial data exploration, offering a one-line solution that goes beyond basic pandas summaries. Its automated detection of data issues, support for various data types (including time-series and text), and flexible output formats make it a comprehensive and efficient alternative to manual EDA scripting.
One-line data quality profiling and exploratory data analysis for Pandas and Spark DataFrames.
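The one-line usage described above can be sketched roughly as follows. This is a minimal sketch, assuming ydata-profiling is installed and that `df` is a pandas DataFrame; the helper name `write_profile` and the output file names are hypothetical and chosen only for illustration. The import sits inside the function so the snippet still loads where the package is absent.

```python
from pathlib import Path


def write_profile(df, out_dir="."):
    """Profile a pandas DataFrame and write HTML and JSON reports.

    Hypothetical helper: only ProfileReport and its to_file method
    come from ydata-profiling; everything else is illustrative.
    """
    # Lazy import so this module can be loaded without the package.
    from ydata_profiling import ProfileReport

    out = Path(out_dir)
    # The "one line": build the profile from the DataFrame.
    profile = ProfileReport(df, title="Profiling Report")

    html_path = out / "report.html"
    json_path = out / "report.json"
    # to_file chooses the output format from the file extension.
    profile.to_file(html_path)
    profile.to_file(json_path)
    return html_path, json_path
```

Calling `write_profile(pd.DataFrame({"a": [1, 2, None]}))` should leave both files in the current directory. In a Jupyter Notebook, `profile.to_widgets()` or `profile.to_notebook_iframe()` would typically be used instead of writing files.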
Flags potential issues such as missing data, skewness, and duplicates in the report's 'Alerts' section, reducing manual inspection effort.
Includes univariate statistics, multivariate correlations, time-series insights, and text profiling in a single report, extending far beyond pandas describe().
Exports to interactive HTML, JSON for automation, and embeds as widgets in Jupyter Notebooks, enabling easy sharing and integration into workflows.
Generates detailed profiles with minimal code, such as ProfileReport(df), making initial data assessment fast and consistent.
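To make the kinds of checks behind such alerts concrete, here is a small standard-library sketch of missing-value, skewness, and duplicate-row detection. It is not ydata-profiling's implementation: the function names, the threshold of 1.0, and the alert wording are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def column_alerts(values, skew_threshold=1.0):
    """Return alert strings for one numeric column.

    `values` is a list where None marks a missing entry.
    The threshold is a hypothetical default, not the library's.
    """
    alerts = []
    present = [v for v in values if v is not None]
    missing = len(values) - len(present)
    if missing:
        alerts.append(f"missing: {missing}/{len(values)}")

    if len(present) > 2:
        mu = mean(present)
        sigma = pstdev(present)
        if sigma > 0:
            # Population skewness: mean of cubed standardized values.
            skew = mean([((v - mu) / sigma) ** 3 for v in present])
            if abs(skew) > skew_threshold:
                alerts.append(f"skewed: {skew:.2f}")
    return alerts


def duplicate_rows(rows):
    """Count rows that repeat an earlier row exactly."""
    seen = set()
    duplicates = 0
    for row in rows:
        key = tuple(row)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates
```

A column with one large outlier and one missing entry triggers both column alerts, while a symmetric, complete column triggers none. A real profiling report covers far more than this, including correlations and type-specific statistics.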
Can be slow and memory-intensive for very large datasets, and the Spark support that might help is labeled a 'work in progress' in the README.
Interactive HTML reports need a modern browser to view, which headless or server-side environments may lack, limiting deployment options.
Features like PySpark or Unicode analysis need extra installations (e.g., via pip extras), adding configuration steps and potential dependency conflicts.
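Because some features depend on optional packages, one defensive pattern is to check for them before use. The sketch below relies only on the standard library; the helper name `missing_modules` is hypothetical, and which import names a given feature needs is an assumption to verify against the project's installation notes.

```python
from importlib.util import find_spec


def missing_modules(module_names):
    """Return the top-level module names that cannot be imported here."""
    return [name for name in module_names if find_spec(name) is None]


# Example: check for PySpark (import name "pyspark") before
# attempting to profile a Spark DataFrame.
if missing_modules(["pyspark"]):
    print("PySpark is not installed; Spark DataFrames cannot be profiled.")
```

Checking up front gives a clear message at startup rather than an import error partway through a pipeline.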
Extremely fast Query Engine for DataFrames, written in Rust
🎨 Python Echarts Plotting Library
Modin: Scale your Pandas workflows by changing a single line of code
cuDF - GPU DataFrame Library