Python Data Analysis: Exploratory Data Analysis (EDA) in Pandas

Pandas Profiling: Get your hands clean with dirty data

Prabhakar Pandey
5 min read · Sep 19, 2020

Know Your Data’s Power using Pandas Profiling

Exploratory Data Analysis (EDA) in Pandas

When we get a new data set, the first thing we do is build an understanding of the data. We run a basic analysis with Pandas or NumPy and, step by step, look for patterns in the data before moving on to more elaborate work such as customized EDA or modeling. For each variable, we determine the number of unique values, the data type, and the percentage of missing values.
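
The first-pass checks described above can be sketched with plain pandas. This is only an illustration: the toy DataFrame and its column names (`loan_amount`, `grade`) are made up, not taken from the author's data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a freshly loaded data set.
df = pd.DataFrame({
    "loan_amount": [1000.0, 2500.0, np.nan, 1000.0],
    "grade": ["A", "B", "B", None],
})

# One row per variable: data type, number of unique values, % missing.
summary = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "n_unique": df.nunique(),                # missing values are not counted
    "pct_missing": df.isna().mean() * 100,   # share of missing cells per column
})
print(summary)
```

Doing this by hand for every column is exactly the repetitive work that a profiling report takes over.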

With a large amount of data, the question becomes how to do exploratory data analysis (EDA) that is simple, fast, and still powerful. This is where the Pandas Profiling package comes in: it produces a broad analysis of a data set from a single call, so we can quickly see what the data has to offer.

EDA plays a vital role in understanding a data set. Whether you are going to build a machine learning model or want to draw insights from the data, EDA is the first task to perform.

Steps to perform EDA with Python Pandas Profiling:

How to install pandas-profiling

Option 1: Using pip

You can install the package with the pip package manager by running:

pip install pandas-profiling

Option 2: Using conda

conda install -c conda-forge pandas-profiling

How to Use pandas-profiling

Once the package is installed, we need to import it into our environment:

import pandas_profiling

Keep in Mind:

Pandas-profiling generates profile reports (.html or other formats) from a pandas DataFrame. As we know, the pandas df.describe() function is great but a little basic for serious exploratory data analysis. pandas_profiling extends the pandas DataFrame with df.profile_report() for quick data analysis. We get a handy visual report that serves as a dossier on our data set.

pandas_profiling Snippet

Explanation: To run the analysis, we import pandas_profiling and pandas. Since we are analyzing ‘loansdata.csv’, we read this file into a DataFrame and generate the profile report, which is written as an .html file.

Note: The display() function shows the same output inline instead of generating a file; it is used when working in Jupyter.
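
A minimal sketch of the steps described above, reconstructed from the explanation rather than copied from the author's snippet. The function name `build_profile_report` is made up for illustration; the file names come from the article, and the exact `ProfileReport` options available depend on the installed pandas-profiling version.

```python
import pandas as pd


def build_profile_report(csv_path, output_path="DataAnalysisProfile.html"):
    """Read a CSV file and write a pandas-profiling HTML report for it."""
    # Imported inside the function so that defining it does not fail when
    # the package is missing; calling it requires pandas-profiling.
    import pandas_profiling

    df = pd.read_csv(csv_path)                    # load the data set
    profile = pandas_profiling.ProfileReport(df)  # run the analysis
    profile.to_file(output_path)                  # write the .html report
    return profile


# Usage (assumes 'loansdata.csv' exists in the working directory):
# build_profile_report("loansdata.csv")
```

Wrapping the steps in a function is a design choice for reuse; the same three lines work equally well typed directly into a notebook cell.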

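For large datasets, the analysis can run out of memory. In that case, it is useful to disable the correlation analysis as shown below. The check_correlation argument comes from the 1.x releases of pandas-profiling; later 2.x releases offer a minimal=True option instead, which switches off the most expensive computations, correlations included.

profile = pandas_profiling.ProfileReport(df, check_correlation=False)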
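
Another way to keep memory use down, not mentioned in the original walkthrough, is to profile a random sample of the rows. The sketch below uses only pandas; the frame `big_df` and the 10,000-row cap are arbitrary assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical large frame; in practice this would come from pd.read_csv.
big_df = pd.DataFrame({"value": np.arange(100_000)})

# Keep at most 10,000 randomly chosen rows (an arbitrary cap).
max_rows = 10_000
sample_df = big_df.sample(n=min(max_rows, len(big_df)), random_state=0)

print(len(sample_df))
# sample_df can then be passed to ProfileReport in place of big_df.
```

A sample gives approximate statistics, so rare values and exact duplicate counts may not match the full data set.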

We have generated a profile summary report (“DataAnalysisProfile.html”) for the data set. From an EDA perspective, this report captures most of what we need to understand the data before exploring it further.

The profile report has five sections (all clickable in the report), explained below.

Overview: The overview section provides information about the data set as a whole. It has two sub-sections, ‘Dataset info’ and ‘Variables types’.

The Dataset info sub-section displays the number of variables (columns), the number of observations (rows), missing cells, duplicate rows, total size, etc.

The Variables types sub-section shows how many features there are of each type: numeric, categorical, boolean, date, URL, text (unique), rejected, and unsupported. It also displays ‘Warnings’, which flag features that are highly correlated with others and features with a high percentage of zeros.
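
To make the Overview numbers concrete, here is a rough pandas equivalent. It is a sketch on made-up data, not the package's own implementation, and the dictionary keys are illustrative names.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the fourth row repeats the first.
df = pd.DataFrame({
    "loan_amount": [1000.0, 2500.0, np.nan, 1000.0],
    "grade": ["A", "B", "B", "A"],
})

overview = {
    "n_variables": df.shape[1],                      # columns
    "n_observations": df.shape[0],                   # rows
    "missing_cells": int(df.isna().sum().sum()),
    "duplicate_rows": int(df.duplicated().sum()),
    "memory_bytes": int(df.memory_usage(deep=True).sum()),
}

# How many columns there are of each pandas dtype.
type_counts = df.dtypes.astype(str).value_counts()

print(overview)
print(type_counts)
```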

pandas_profiling Overview

Variables: The Variables section provides information on every feature individually, unlike the Overview section, which describes the whole data set. For each feature it shows the number of unique values and their percentage, and the number of missing values and their percentage. On the right side, it also gives the minimum and maximum values and the percentage of zeros in that feature.
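
The per-feature figures can be approximated for a single numeric column as follows. The column `balance` is hypothetical, and percentages here are taken over all rows; the report may define its denominators differently.

```python
import pandas as pd

# Hypothetical numeric column with two zeros and one missing value.
s = pd.Series([0, 5, 0, 12, None, 5], name="balance")

stats = {
    "distinct": s.nunique(),                      # missing values excluded
    "distinct_pct": s.nunique() / len(s) * 100,
    "missing": int(s.isna().sum()),
    "missing_pct": s.isna().mean() * 100,
    "min": s.min(),
    "max": s.max(),
    "zeros_pct": (s == 0).mean() * 100,
}
print(stats)
```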

pandas_profiling Variables

If we click on the Toggle details option shown in the above image, a new section with further details shows up.

Correlations: The Correlations section visualizes how features are correlated with each other using seaborn’s heatmap, which makes the relationships between features easy to read. As highlighted in the image of the Correlations section, we can toggle between different correlation measures such as Pearson, Spearman, Kendall, and phik.
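
The matrices behind such heatmaps can be computed directly with pandas. This sketch covers Pearson and Spearman on made-up columns; pandas also accepts method="kendall" (which needs SciPy installed), while phik is not part of pandas.

```python
import pandas as pd

# Hypothetical numeric data: y is exactly 2 * x, z tends to fall as x rises.
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],
    "z": [5, 3, 4, 1, 2],
})

pearson = df.corr(method="pearson")     # linear correlation
spearman = df.corr(method="spearman")   # rank-based correlation

print(pearson.round(2))
print(spearman.round(2))
```

Passing either matrix to seaborn.heatmap would draw a plot similar in spirit to the one in the report.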

pandas_profiling Correlations

Missing Values: This section provides different graphs, such as ‘Matrix’, ‘Count’, and ‘Heatmap’.

In the Matrix graph, we can visualize missing values. From the left graph, we can conclude that there are no missing values.

In the Count graph, we can visualize the count of data points in each feature. From the left graph, we can conclude that all the features have the same count of data points.
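
The data behind the Matrix and Count graphs can be inspected without plotting. A small sketch on a made-up frame with one missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing loan_amount.
df = pd.DataFrame({
    "loan_amount": [1000.0, np.nan, 1500.0],
    "term": [36, 60, 36],
})

nullity = df.isna()   # True where a cell is missing (what a matrix view shows)
counts = df.count()   # non-missing values per column (what a count view shows)

print(nullity)
print(counts)
```

When every column has the same count and the nullity frame is all False, the data set has no missing values.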

pandas_profiling Missing Values

Sample: This section displays the first 10 data points (head of 10) and the last 10 data points (tail of 10).
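
The same two views are available directly from pandas; the 25-row frame below is a stand-in for real data.

```python
import pandas as pd

df = pd.DataFrame({"row_id": range(25)})  # hypothetical 25-row data set

first_rows = df.head(10)  # first 10 rows
last_rows = df.tail(10)   # last 10 rows

print(first_rows)
print(last_rows)
```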

pandas_profiling Sample First 10 Rows
pandas_profiling Sample Last 10 Rows

Keep in Mind:

Running all of these checks by hand is a tedious part of EDA, but pandas-profiling applies them in one pass and hands you a platter of data analysis. Remember that everything is applied by a set of rules (plot a boxplot and histogram for a continuous variable, measure missing values, calculate frequencies for a categorical variable), which is what gives us the opportunity to automate things. That is the basis of the Python module pandas_profiling, which helps us automate the first level of EDA on a data set.
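
To illustrate the idea of rule-based EDA, here is a small sketch that chooses a summary according to column type. It is not how pandas_profiling is implemented; the function `summarize_column` and the data are made up for this example.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def summarize_column(series):
    """Pick a summary by column type, in the spirit of rule-based EDA."""
    result = {"missing": int(series.isna().sum())}
    if is_numeric_dtype(series):
        # Continuous variable: quartiles, the numbers a boxplot is built on.
        result["kind"] = "numeric"
        result["quartiles"] = series.quantile([0.25, 0.5, 0.75]).tolist()
    else:
        # Categorical variable: frequency of each value.
        result["kind"] = "categorical"
        result["frequencies"] = series.value_counts().to_dict()
    return result


# Hypothetical data with one missing value per column.
df = pd.DataFrame({
    "amount": [10.0, 20.0, 30.0, 40.0, None],
    "grade": ["A", "B", "A", "A", None],
})
report = {name: summarize_column(df[name]) for name in df.columns}
print(report)
```

Each rule is simple on its own; the value of automation comes from applying all of them to every column without writing the loop each time.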

If you find it useful, do clap and share it among your enthusiastic peers.

More to come! Stay tuned, and thanks for reading :)

Happy Learning !!
