Life is short, let Python automate your EDA
EDA (Exploratory Data Analysis) is one of the first steps performed on a given dataset. It helps us understand our data and gives us an idea of the manipulations and cleaning we might have to do. EDA can take anywhere from a few lines of code to a few hundred. In this tutorial, we will look at libraries that help us perform EDA in a few lines.
Dataset
We will use the Titanic dataset provided by Kaggle. Using pandas' describe() method, we get the output below.
As you can see, the Age column has missing values: its count is lower than that of the other columns. The libraries below are basically describe() on steroids.
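To make the point about describe() concrete, here is a minimal sketch. It uses a tiny made-up frame with Titanic-style column names (an assumption for illustration, not the real Kaggle data) and shows how the count row hints at missing values.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample with Titanic-style columns (not the real Kaggle data).
sample = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05],
})

summary = sample.describe()

# The "count" row only counts non-null values, so a count smaller than
# the number of rows signals missing data in that column.
missing_from_count = len(sample) - summary.loc["count"]

# isna().sum() reports the missing values per column directly.
missing_direct = sample.isna().sum()

print(summary)
print(missing_direct)
```

On the real dataset you would load the file with pd.read_csv('train.csv') and run the same two calls.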
1. Pandas-Profiling
Install and Usage
First, we will install the library
pip install pandas-profiling
Next, we will import the library and generate the report
import pandas as pd
import pandas_profiling
df = pd.read_csv('train.csv')  # Titanic training data from Kaggle
prof_report = pandas_profiling.ProfileReport(df, title='Titanic Report')
To display it inside the notebook
prof_report.to_widgets()
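To save it as an HTML file (to_html() only returns the HTML as a string; the file name here is just an example)
prof_report.to_file('titanic_report.html')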
Key Features in the Report
A brief overview of your data: the number of missing cells, duplicate rows, and how many features are categorical, numerical, etc.
Warnings based on the distribution of the data, the number of missing values, zero values, etc.
Data Distributions and Distinct, Missing values for each column
Interactions and Correlation between the various features
A count of the missing values for each Feature
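To get a feel for what the overview items in the list above summarise, here is a rough sketch in plain pandas. It is not how pandas-profiling computes its report; the toy rows and the overview dictionary are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows with Titanic-style columns; the last row duplicates the first.
toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, np.nan, 35.0, 22.0],
})

# Hypothetical summary dictionary, loosely mirroring the report's overview.
overview = {
    "rows": len(toy),
    "columns": toy.shape[1],
    "missing_cells": int(toy.isna().sum().sum()),
    "duplicate_rows": int(toy.duplicated().sum()),
    "numeric_columns": toy.select_dtypes(include="number").shape[1],
    "non_numeric_columns": toy.select_dtypes(exclude="number").shape[1],
}

# Missing values per feature, similar in spirit to the report's missing-values section.
missing_per_column = toy.isna().sum()

print(overview)
print(missing_per_column)
```

The report adds plots, warnings and interactions on top of numbers like these, which is where the few lines of setup pay off.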
2. SweetViz
Install and Usage
First, we will install the library
pip install sweetviz
Next, we will import the library and generate the report
import sweetviz
import pandas as pd
df = pd.read_csv('train.csv')
report = sweetviz.analyze(df)
report.show_html()
You can also pass a file name to show_html()
report.show_html("Titanic.html")
By default, it’s named ‘SWEETVIZ_REPORT.html’
Key Features
An overview of the data frame is provided. It displays the number of duplicate rows and the number of types of features.
The associations between the different features, shown as a really intuitive heatmap. As you can see, the box relating Fare and Pclass is very prominent, which makes sense since a first-class passenger would pay more than a third-class passenger.
For each categorical feature, the following relevant information is shown:
- Data Distribution
- Features which give information on it
- Features it can give information about
- Its correlation with other features
For numerical features, it shows the numerical and categorical associations and distributions
It also flags the missing values of each feature, with the highlighting depending on the percentage that is missing.
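The Fare and Pclass relationship mentioned above can be sanity-checked by hand. The sketch below uses made-up fares and a plain Pearson correlation as a simpler stand-in; SweetViz computes its own association measures, so treat this only as an illustration of the idea.

```python
import pandas as pd

# Made-up fares per passenger class (illustrative values, not the Kaggle data).
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Fare": [80.0, 60.0, 25.0, 20.0, 8.0, 7.0],
})

# Average fare per class: first class should come out highest.
mean_fare_by_class = toy.groupby("Pclass")["Fare"].mean()

# Pearson correlation; negative because a larger class number means a cheaper ticket.
fare_class_corr = toy["Fare"].corr(toy["Pclass"])

print(mean_fare_by_class)
print(fare_class_corr)
```

A strongly negative value here corresponds to the kind of prominent box the heatmap draws for a strong association.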
3. Autoviz
Install and Usage
First, we will install the library
pip install autoviz
Next, we will import the library and generate the report
from autoviz.AutoViz_Class import AutoViz_Class
AV = AutoViz_Class()
df = AV.AutoViz('train.csv')
Key Features
It provides a scatter plot between continuous variables
Distribution of the data for the various features
A heatmap and bar plot to show the relationship between continuous features.
Other EDA Libraries
In a future article, I will discuss some of the libraries mentioned below, but in the meantime I recommend checking out the resources listed.