Exploratory Data Analysis (EDA) — a 5-Step Quick Guide

Anuroobika K
5 min read · May 26, 2022


In this blog, I’ll be discussing descriptive statistics, the set of techniques used to describe data. I’ll cover describing data through numbers, measures of location, measures of dispersion, measures of correlation, and how to summarize the data.

While EDA may not be the most complex part of data science, it is very time-consuming, so many people overlook this step and jump straight to solving the main problem. However, it is important to understand the dataset’s variables and the relationships among them before proceeding to any complex analysis. Based on what I’ve learnt from various lectures, courses and past work experience, I’ll walk through a comprehensive analysis of the data. This can act as a quick guide for your own EDA as well.

Below are the five steps we can use to understand the data better through EDA:

1. Understand the problem:
Unless we understand the problem we’re trying to solve with the data, we will not be able to deal with the data effectively. Understanding the problem gives you the lens through which you see the data; if you do not know what problem you are trying to solve, it is easy to get lost in the data.

For example, based on your understanding of the problem, you might have to decide whether to work with the data at the base-unit level or roll it up to a higher level, such as transaction level vs. client level.

After analysing the problem statement, work out the meaning of each variable and its importance for the particular problem.
Following the steps below can be helpful:
a) Name, description and type — look at the name, description and type of each variable. The type could be numerical or categorical.
b) Segment — check whether you can group the variables, based on their descriptions, under broad categories such as product/service details, personal info, performance metrics, etc.
c) Expectation — note down your initial expectation, from intuition or domain knowledge, of each variable’s importance to the problem statement.
d) Conclusion — after the initial analysis, revise the level of importance of each variable and note down your conclusion on the same scale used for ‘Expectation’.
e) Comments — any comments you want to record about the variables.

Create a table and note down your observations for all variables.
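As a minimal sketch of such a table, you could keep it directly in a pandas DataFrame. The variable names, segments and ratings below are purely hypothetical, for illustration only:

```python
import pandas as pd

# A hypothetical variable-audit table for, say, a customer churn problem;
# the variables, segments and ratings are illustrative, not from a real dataset.
audit = pd.DataFrame({
    "variable":    ["monthly_revenue", "signup_channel", "support_tickets"],
    "type":        ["numerical", "categorical", "numerical"],
    "segment":     ["performance metrics", "personal info", "service details"],
    "expectation": ["high", "medium", "high"],  # intuition before the analysis
    "conclusion":  [None, None, None],          # revised after the analysis
    "comments":    ["", "", ""],
})
print(audit)
```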

2. Univariate study:
Focus on the individual variables, especially the dependent variable, and try to understand each one a little better.

Some options for a univariate study are:
a) distribution study using percentiles
b) histogram
c) measurement of skewness and kurtosis
Python offers functions for these studies, such as pandas’ describe(), skew() and kurt(), and seaborn’s distplot() (deprecated in favour of histplot() in recent versions).
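A minimal sketch of these univariate checks, using a hypothetical skewed price column standing in for the dependent variable:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical data: a right-skewed 'price' column as the dependent variable.
rng = np.random.default_rng(42)
df = pd.DataFrame({"price": rng.lognormal(mean=10, sigma=0.5, size=1000)})

print(df["price"].describe())         # count, mean, std, min, percentiles, max
print("skewness:", df["price"].skew())
print("kurtosis:", df["price"].kurt())

sns.histplot(df["price"], kde=True)   # histplot replaces the deprecated distplot
plt.show()
```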

3. Multivariate study:
The next step is to understand how the dependent and independent variables relate to each other. It is helpful to establish, for instance, whether a relationship is linear, exponential or negative, or whether there is no relationship at all.

Initially, pick a few variables that you believe are most meaningful for the problem statement and study their relationships. The relationship between two numerical variables can be studied with a scatter plot, and the relationship between a numerical and a categorical variable with a box plot. Though you will not be able to gauge the degree of the relationship, this is a good starting point for identifying the type of relationship between variables, as the sketch below shows.
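Here is a sketch of both plot types on a hypothetical dataset with a numerical target (sale_price), one numerical predictor (living_area) and one categorical predictor (quality); all names and values are made up for illustration:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical dataset: numerical target, one numerical and one categorical predictor.
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "living_area": rng.normal(150, 40, n),
    "quality":     rng.choice(["low", "medium", "high"], n),
})
df["sale_price"] = 1000 * df["living_area"] + rng.normal(0, 20000, n)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
# Numerical vs numerical: a scatter plot reveals the relationship type.
sns.scatterplot(data=df, x="living_area", y="sale_price", ax=axes[0])
# Categorical vs numerical: a box plot compares distributions across groups.
sns.boxplot(data=df, x="quality", y="sale_price", ax=axes[1])
plt.tight_layout()
plt.show()
```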

So far, the analysis in this section has been subjective. The following steps add some objective analysis:
a) Correlation matrix with all variables (heatmaps are great for detecting multicollinearity)
b) Zoomed correlation matrix with the set of variables of interest, based on the results of the step above
c) Scatter plots between the most correlated variables (seaborn scatter plots show the big picture and give a reasonable idea of the relationships between variables)
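A sketch of steps a) and b), assuming df is the hypothetical DataFrame from the previous sketch, with sale_price as the target:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# 'df' is assumed to be the hypothetical DataFrame built above;
# numeric_only skips the categorical 'quality' column.
corr = df.corr(numeric_only=True)

# Full heatmap: highly correlated pairs (multicollinearity) stand out as bright cells.
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
plt.show()

# Zoomed matrix: restrict to the variables most correlated with the target.
top = corr["sale_price"].abs().sort_values(ascending=False).head(5).index
sns.heatmap(df[top].corr(), annot=True, fmt=".2f")
plt.show()
```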

4. Basic cleaning:
In this step, we clean the data by handling missing values, outliers and categorical variables.
When it comes to missing values, we need to check:
a) How widespread is the missing data?
b) What is the cause of the missing data? Is it random, or does it follow a pattern?
The answers to these questions help us decide how to handle the gaps. Missing values reduce the effective sample size and can bias the analysis. If a variable with missing values isn’t important to the problem statement, it can be dropped, especially when more than 15% of its values are missing. Sometimes a missing value means zero: for example, if a client’s revenue is missing for a particular month, it might be replaced by zero. At other times, if fewer than 5% of the observations have missing values, we can simply delete those observations.
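A minimal sketch of this missing-value workflow, assuming df is your raw DataFrame; the thresholds simply mirror the rules of thumb above:

```python
# 'df' is assumed to be your raw pandas DataFrame. The thresholds mirror the
# rules of thumb above: drop columns that are >15% missing, and drop the
# remaining affected rows when only a small share of observations have gaps.
missing = df.isna().mean().sort_values(ascending=False)
print(missing[missing > 0])  # share of missing values per column

df = df.drop(columns=missing[missing > 0.15].index)  # widely missing: drop the column
# Domain knowledge: a missing monthly revenue may simply mean zero, e.g.
# df["revenue"] = df["revenue"].fillna(0)
df = df.dropna()  # few scattered gaps left: drop those rows
```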

Outliers:
Outliers can be a source of valuable information about specific behaviours, but they can also hurt your model if not taken care of.
Along with the distribution study using percentiles, we can standardize a variable, bringing its mean to zero and its standard deviation to one, and then inspect the extreme values. This helps identify outliers in individual variables. Outliers can also arise when two or more variables interact: in a scatter plot between two variables, observations that fall far outside the general trend can be treated as outliers.
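A sketch of the standardization check, again assuming the hypothetical sale_price column from earlier; the 3-standard-deviation cutoff is a common convention, not a hard rule:

```python
# Standardize the dependent variable (mean 0, standard deviation 1) and
# inspect the tails; values beyond ~3 standard deviations are candidate
# univariate outliers. 'df' and 'sale_price' are the hypothetical data above.
z = (df["sale_price"] - df["sale_price"].mean()) / df["sale_price"].std()
print("low tail:\n", z.sort_values().head(5))
print("high tail:\n", z.sort_values().tail(5))
print("candidate outliers beyond 3 std devs:", (z.abs() > 3).sum())
```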

Finally, we can convert the categorical variables into dummy variables (i.e. values of 1 or 0). This is easily done with the pandas get_dummies() function.
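A one-line sketch, assuming the hypothetical categorical column quality from earlier:

```python
import pandas as pd

# One-hot encode the hypothetical 'quality' column into 0/1 indicator columns;
# drop_first avoids the dummy-variable trap in linear models.
df_encoded = pd.get_dummies(df, columns=["quality"], drop_first=True)
print(df_encoded.head())
```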

5. Test assumptions:
Lastly, we should check some basic statistical assumptions if we plan to use multivariate techniques in further analysis. Below are the four assumptions to test, per the standard Multivariate Data Analysis procedure:
a) Normality: many statistical tests rely on normality, so it is important to check whether the data look normally distributed. With a large sample (i.e. >200 observations) this is not a serious issue, but addressing it helps avoid related problems such as heteroscedasticity.
b) Homoscedasticity: an equal level of variance across the range of the dependent variable. In other words, we want the variance of the error term to be constant for all values of the independent variables.
c) Linearity: this can be checked using scatter plots. If the patterns are not linear, we can try data transformations.
d) Absence of correlated errors: this occurs when one error is related to another, i.e. errors occur systematically. It can be fixed by identifying the variable that explains the effect and adding it to the model.
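A sketch of a basic normality check (assumption a), using the hypothetical sale_price column from earlier; the Q-Q plot and Shapiro-Wilk test come from scipy:

```python
from scipy import stats
import matplotlib.pyplot as plt

# Normality check on the hypothetical target: a histogram next to a Q-Q plot.
# Points hugging the diagonal suggest the variable is roughly normal; a log
# transform often fixes positive skew (and with it, some heteroscedasticity).
y = df["sale_price"]

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(y, bins=30)
stats.probplot(y, plot=axes[1])  # Q-Q plot against the normal distribution
plt.tight_layout()
plt.show()

# Shapiro-Wilk test: a small p-value rejects normality (interpret with care
# on large samples, where tiny deviations become "significant").
stat, p = stats.shapiro(y)
print(f"Shapiro-Wilk: statistic={stat:.3f}, p-value={p:.3f}")
```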

Conclusion:
We have walked through a five-step process for a detailed exploratory data analysis; I hope you found it helpful. You can conclude the exercise by summarizing your insights about the key variables in the table you created in step 1 (understanding the problem). Happy analyzing!


Written by Anuroobika K

Writes about data science topics in simple words and also enjoys writing about life skills. Connect on https://www.linkedin.com/in/anuroobika-k-905b8823/
