Pandas EDA Libraries you need in 2020 (Part 1) - Software Development for Everyone

Life is short, let Python automate your EDA

EDA (Exploratory Data Analysis) is one of the first steps performed on a given dataset. It helps us understand our data and gives us an idea of the manipulation and cleaning we might have to do. EDA can take anywhere from a few lines to a few hundred lines of code. In this tutorial, we will look at libraries that help us perform EDA in just a few lines.

Dataset

We will use the Titanic dataset provided by Kaggle. Using pandas' describe() method, we get the output below.

As you can see, the Age column has missing values: its count is lower than that of the other columns. The libraries below are basically describe() on steroids.
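For reference, here is a minimal sketch of how describe() reveals missing values. The small DataFrame below is made up for illustration (it only borrows Titanic-style column names); with the real data you would load train.csv instead.

```python
import pandas as pd

# Tiny made-up sample with Titanic-style columns (NOT the real Kaggle data).
df = pd.DataFrame({
    "Pclass": [1, 3, 2, 3, 1],
    "Age": [38.0, None, 27.0, None, 54.0],
    "Fare": [71.28, 7.92, 13.0, 8.05, 51.86],
})

# describe() reports count, mean, std, min, quartiles and max per numeric column.
summary = df.describe()

# The "count" row only counts non-missing values, so a lower count hints at gaps.
print(summary.loc["count"])

# isna().sum() gives the number of missing values per column directly.
missing = df.isna().sum()
print(missing)
```

Comparing the count row against the number of rows is the quickest way to spot columns such as Age that need cleaning.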

1. Pandas-Profiling

Screencast of EDA Report Generated by Pandas Profiling

Install and Usage

First, we will install the library

pip install pandas-profiling

Next, we will import the library and generate the report

import pandas as pd
import pandas_profiling

df = pd.read_csv('train.csv')  # Titanic training data from Kaggle
prof_report = pandas_profiling.ProfileReport(df, title='Titanic Report')

To display it inside the notebook

prof_report.to_widgets()

To save it as an HTML file (to_html() only returns the HTML as a string, while to_file() writes it to disk)

prof_report.to_file('titanic_report.html')

Key Features in the Report

Pandas Profiling report screenshot by Author

A brief overview of your data: the number of missing cells, the number of duplicate rows, and how many variables are categorical, numerical, etc.

Warnings based on the distribution of the data, the number of missing values, zero values, etc.

The data distribution, distinct values and missing values for each column.

Interactions and correlations between the various features.

A count of the missing values for each feature.
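To make the overview numbers concrete, the sketch below computes similar figures by hand with plain pandas. It is only an illustration on made-up data, not how pandas-profiling computes them internally, and the dictionary keys are names chosen for this example.

```python
import pandas as pd

# Made-up sample with Titanic-style columns; rows 0 and 4 are identical on purpose.
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male"],
    "Age": [22.0, 38.0, None, 35.0, 22.0],
    "Fare": [7.25, 71.28, 7.92, 8.05, 7.25],
})

overview = {
    "rows": len(df),
    "columns": df.shape[1],
    # Total number of empty cells across the whole frame.
    "missing_cells": int(df.isna().sum().sum()),
    # Rows that repeat an earlier row exactly.
    "duplicate_rows": int(df.duplicated().sum()),
    # Rough split of columns into numeric and non-numeric types.
    "numeric_columns": len(df.select_dtypes(include="number").columns),
    "categorical_columns": len(df.select_dtypes(exclude="number").columns),
}

for name, value in overview.items():
    print(f"{name}: {value}")
```

The report gathers these figures (and many more) automatically, which is where the few-lines promise comes from.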

2. SweetViz

Screencast of EDA Report Generated by Sweetviz

Install and Usage

First, we will install the library

pip install sweetviz

Next, we will import the library and generate the report

import sweetviz
import pandas as pd

df = pd.read_csv('train.csv')
report = sweetviz.analyze(df)
report.show_html()

You can also pass a file name to show_html()

report.show_html("Titanic.html")

By default, it’s named ‘SWEETVIZ_REPORT.html’

Key Features

An overview of the data frame. It displays the number of duplicate rows and the number of features of each type.

The associations between the different features, shown as a really intuitive heatmap. As you can see, the box relating Fare and Pclass is very prominent, which makes sense since a first-class passenger would pay more than a third-class passenger.

For each categorical feature, the following relevant information is shown:

Data Distribution
Features which give information on it
Features it can give information about
Its correlation with other features

For numerical features, it shows the distribution along with the numerical and categorical associations.

It also highlights missing values, with the emphasis depending on the percentage of values that are missing.
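The Fare and Pclass relationship mentioned above can be checked by hand. The sketch below uses made-up values and a plain Pearson correlation plus a group mean; Sweetviz uses its own association measures, so treat this only as a rough sanity check, not a reproduction of its heatmap.

```python
import pandas as pd

# Made-up fares for illustration only (NOT taken from the Kaggle data).
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Fare": [80.0, 60.0, 20.0, 14.0, 8.0, 7.0],
})

# Average fare per passenger class.
mean_fare = df.groupby("Pclass")["Fare"].mean()
print(mean_fare)

# Pearson correlation between class number and fare.
# Class 1 is the most expensive, so we expect a negative value.
corr = df["Pclass"].corr(df["Fare"])
print(f"Pclass/Fare correlation: {corr:.2f}")
```

On the real dataset, replacing the toy frame with pd.read_csv('train.csv') should show the same direction of effect.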

3. Autoviz

Screencast of EDA Report Generated by AutoViz

Install and Usage

First, we will install the library

pip install autoviz

Next, we will import the library and generate the report

from autoviz.AutoViz_Class import AutoViz_Class
AV = AutoViz_Class()
df = AV.AutoViz('train.csv')

Key Features

Scatter plots between pairs of continuous variables.

The distribution of the data for the various features.

A heatmap and bar plots to show the relationships between continuous features.

Other EDA Libraries

In a future article, I will discuss some of the libraries listed below. In the meantime, I recommend checking them out.

Pandas GUI

Dataprep

D-tale

Dora

Bamboolib