Python Data Analysis: Exploratory Data Analysis (EDA) in Pandas
Pandas Profiling: Get Your Hands Clean with Dirty Data
Know Your Data’s Power using Pandas Profiling
When we get a new data set, the first thing we do is try to understand it. We start with basic analysis using the Pandas or NumPy libraries, working step by step to understand the patterns in the data before moving on to more elaborate analyses such as customized EDA or modeling. We determine the number of unique values, the data type, and the percentage of missing values for each variable.
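As a minimal sketch of this first look using plain pandas, the snippet below builds a small, made-up DataFrame (the column names and values are purely illustrative and do not come from any real data set) and reports each column's data type, number of unique values, and percentage of missing values.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
df = pd.DataFrame({
    "loan_amount": [1000.0, 2500.0, None, 4000.0],
    "grade": ["A", "B", "B", None],
    "term_months": [12, 36, 36, 60],
})

# One row per column: data type, distinct non-missing values, % missing.
summary = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "n_unique": df.nunique(),
    "pct_missing": df.isna().mean() * 100,
})
print(summary)
```

Doing this by hand for every column is exactly the kind of repetitive work that a profiling report takes over.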
With a huge amount of data, we ask how to do simple, fast, and yet powerful exploratory data analysis (EDA). This is where the Pandas Profiling package comes in: it gives us a simple, fast, and powerful analysis of our data in very little time.
Exploratory Data Analysis (EDA) plays a vital role in understanding a data set. Whether you are going to build a machine learning model or want to draw insights from the data, EDA is the first task to perform.
Steps to perform EDA with Python Pandas Profiling:
How to install pandas-profiling
Option 1: Using pip
You can install it using the pip package manager by running:
pip install pandas-profiling
Option 2: Using conda
conda install -c conda-forge pandas-profiling
How to Use pandas-profiling
Once the package is installed, we need to import it into the environment:
import pandas_profiling
Keep in Mind:
Pandas-profiling generates profile reports (.html or other formats) from a pandas DataFrame. As we know, the pandas df.describe() function is great but a little basic for serious exploratory data analysis. pandas_profiling extends the pandas DataFrame with df.profile_report() for quick data analysis. We get a handy visual report that serves as a dossier on our data set.
Explanation: To run the analysis, we import the pandas_profiling and pandas libraries. Since we are analyzing ‘loansdata.csv’, we read this file and generate the profile report as an .html file.
Note: The display() function shows the same output inline in the window instead of generating a file. It is used when we work in Jupyter.
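The steps described above can be sketched as follows. This is a minimal sketch and not the original listing: the helper name build_profile_report is hypothetical, the file names are the ones mentioned in this article, and it assumes pandas-profiling is installed in the environment.

```python
import pandas as pd


def build_profile_report(csv_path, html_path):
    """Read a CSV file and write a pandas-profiling HTML report for it."""
    # Imported lazily so this sketch can be loaded even where the
    # package is not installed yet.
    import pandas_profiling

    df = pd.read_csv(csv_path)
    profile = pandas_profiling.ProfileReport(df)
    # Write the report to disk as an HTML file.
    profile.to_file(html_path)
    return profile


# Example call, using the file names mentioned in this article:
# profile = build_profile_report("loansdata.csv", "DataAnalysisProfile.html")
```

Returning the report object keeps it available for inline display in a notebook after the file has been written.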
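For large datasets, the analysis can run out of memory. In that case, it is useful to disable the correlation analysis as below (the keyword for this depends on the installed pandas-profiling version; check_correlation is the one used here):
profile = pandas_profiling.ProfileReport(df, check_correlation=False)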
We have generated a profile summary report (“DataAnalysisProfile.html”) for the data set. From an EDA perspective, this report captures most of what we need to understand the data before further exploration.
The profile report has five sections (all clickable in the report), explained below.
Overview: The overview section provides overall data set information. This section has 2 sub-sections, namely ‘Dataset info’ and ‘Variables types’.
The Dataset info sub-section displays the number of variables (columns), the number of observations (rows), missing cells, duplicate rows, total size, etc.
The Variables types sub-section displays the types of features: how many are numeric, categorical, boolean, date, URL, text (unique), rejected, or unsupported. It also displays ‘Warnings’, which list the features that are highly correlated with others and those that have a high percentage of zeros.
Variables: The Variables section provides information on every feature individually, unlike the Overview section, which describes the data set as a whole. For each feature it shows the number and percentage of unique values and the number and percentage of missing values. On the right side, it also gives the minimum and maximum values and the percentage of zeros in that feature.
If we click on the Toggle details option as shown in the above image, a new section shows up.
Correlations: The Correlations section visualizes how features are correlated with each other using seaborn’s heatmap, which gives a clear and easy picture of the relationships between features. Referring to the highlight in the above image (Correlation section), we can easily toggle between different correlation measures such as Pearson, Spearman, Kendall, and phik.
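To get a feel for the numbers behind such a heatmap, here is a minimal sketch that computes correlation matrices with plain pandas rather than with the report itself. The data is made up for illustration. Only Pearson and Spearman are shown; pandas also accepts method="kendall", which, as far as I know, relies on SciPy being installed.

```python
import pandas as pd

# Hypothetical numeric data, used only for illustration.
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],  # exactly 2 * x
    "z": [5, 3, 4, 1, 2],
})

# Pairwise correlation matrix for each method.
correlations = {
    method: df.corr(method=method)
    for method in ("pearson", "spearman")
}
print(correlations["pearson"].round(2))
print(correlations["spearman"].round(2))
```

Each matrix is symmetric with ones on the diagonal, which is the same grid the report colors in its heatmap.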
Missing Values: This section provides different graphs, such as ‘Matrix’, ‘Count’, and ‘Heatmap’.
In the Matrix graph, we can visualize missing values. From the left graph, we can conclude that there are no missing values.
In the Count graph, we can visualize the count of data points in each feature. From the left graph, we can conclude that all the features have the same count of data points.
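The quantities behind the Count and Matrix graphs can be approximated in plain pandas. The sketch below uses a made-up DataFrame that, unlike the data set discussed above, does contain missing values, so that the effect is visible.

```python
import pandas as pd

# Hypothetical toy data with a few missing cells.
df = pd.DataFrame({
    "a": [1.0, None, 3.0],
    "b": ["x", "y", None],
    "c": [10, 20, 30],
})

# 'Count' view: number of non-missing values per column.
non_missing_counts = df.count()

# 'Matrix' view: True where a cell is missing, one row per observation.
missing_mask = df.isna()

print(non_missing_counts)
print(missing_mask)
```

When every column has the same count as the number of rows, the data set has no missing values.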
Sample: This section displays the first 10 data points (head of 10) and the last 10 data points (tail of 10).
Keep in Mind:
Doing all of these checks by hand is a tedious part of EDA, but pandas profiling applies them within seconds and hands you a platter of data analysis. Remember that it all works by a set of rules: plot a boxplot and histogram for a continuous variable, measure missing values, and calculate frequencies for a categorical variable. This gives us the opportunity to automate things. That is the basis of the Python module pandas_profiling, which helps us automate the first level of EDA on a data set.
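The idea of rule-based profiling can be illustrated with a small sketch. This is not how pandas_profiling is implemented. The function summarize_column and the toy data are hypothetical, and the rules are simplified to one check per variable type.

```python
import pandas as pd


def summarize_column(series):
    """Apply a simple type-based rule to summarize one column."""
    summary = {"pct_missing": float(series.isna().mean() * 100)}
    if pd.api.types.is_numeric_dtype(series):
        # Continuous variable: the five-number summary behind a boxplot.
        quantiles = series.quantile([0, 0.25, 0.5, 0.75, 1])
        summary["kind"] = "numeric"
        summary["five_number_summary"] = quantiles.tolist()
    else:
        # Categorical variable: frequency of each value.
        summary["kind"] = "categorical"
        summary["frequencies"] = series.value_counts().to_dict()
    return summary


# Hypothetical toy data, used only for illustration.
df = pd.DataFrame({
    "amount": [10.0, 20.0, 30.0, 40.0, None],
    "grade": ["A", "B", "A", None, "A"],
})
report = {name: summarize_column(df[name]) for name in df.columns}
print(report)
```

Because the rule is chosen from the column's data type, the same function runs unchanged on any DataFrame, which is what makes this first level of EDA easy to automate.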