# Exploratory Data Analysis in Python

Contributor:
QuantInsti


## What is Exploratory Data Analysis?

John Tukey is said to have introduced exploratory data analysis (EDA) and made it a crucial step in the data science process. When asked what it means, he simply said, “Exploratory data analysis is an attitude, a state of flexibility, a willingness to look for those things that we believe are not there, as well as those we believe to be there.”

The main aims of exploratory data analysis are to:

• Gain insight into the available data
• Find relationships between the different variables
• Find anything out of the ordinary, i.e. an outlier or anomaly
• Test any assumptions or instincts
• Find any optimum parameters or variables that will help us solve the problem statement faster

That is the gist of it. You can see where EDA sits in the overall data science process below.

The main component in exploratory data analysis is the visualisation of data. Let’s see how we perform Exploratory Data Analysis in the next section.

## What are the types of EDA methods?

Existing literature tells us that there are four types of exploratory data analysis. Let’s look at them below:

### Univariate non-graphical method

Breaking down the name, univariate implies that there is just one variable, and non-graphical means that the method has no visual element.

Examples abound, from the heights of the NBA players on a team to the opening price of Tesla Inc. over 2019. One univariate non-graphical method is the five-number summary of a variable.

Take Tesla’s closing prices over 11 trading days as an example. Tabulating only the closing prices would look something like this:

The five-number summary consists of the minimum value, 1st quartile, median, 3rd quartile, and maximum value.
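As a quick illustration of the definition, the sketch below computes the five numbers for a small made-up list using only Python's standard library. The values and the helper name `five_number_summary` are hypothetical. `method='inclusive'` is an assumption chosen because it interpolates linearly between data points, which should agree with the default behaviour of NumPy's `percentile`.

```python
from statistics import quantiles

# Hypothetical values for illustration only (not real prices)
values = [7, 1, 3, 9, 5, 11, 13, 15, 17, 19, 21]


def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    # n=4 splits the data into quartiles and returns the three cut points
    q1, median, q3 = quantiles(data, n=4, method='inclusive')
    return min(data), q1, median, q3, max(data)


summary = five_number_summary(values)
print(summary)  # (1, 6.0, 11.0, 16.0, 21)
```

With 11 sorted values, the median is the 6th value, and each quartile falls halfway between two neighbouring values. The same idea carries over to real price data.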

Let’s calculate this in Python:

```python
# Import 11 days of Tesla data
import yfinance as yf
from numpy import percentile

tesla = yf.download('TSLA', '2020-01-27', '2020-02-11')

# Keep only the closing prices. squeeze() gives a Series even if
# yfinance returns a one-column DataFrame for a single ticker.
close = tesla['Close'].squeeze()

# Calculate the quartiles
All_quartiles = percentile(close, [25, 50, 75])

# Calculate min/max
Minimum, Maximum = close.min(), close.max()

# Print the five-number summary
print(Minimum)
print(All_quartiles[0])
print(All_quartiles[1])
print(All_quartiles[2])
print(Maximum)
```

You will get the following output:

Of course, apart from the five-number summary, you can always check the number of values, the mean, and so on.

By the way, you can also use a one-line command that gives you most of this information in a simple format.

The Python code is:

`tesla['Close'].describe()`
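To see what this returns without downloading anything, here is a minimal sketch on a made-up pandas Series. The values are hypothetical and are not Tesla prices. The variable names `close`, `stats` and `five_numbers` are illustrative only.

```python
import pandas as pd

# Hypothetical closing prices for illustration only (not real Tesla data)
close = pd.Series(
    [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21],
    name="Close",
    dtype="float64",
)

# describe() returns count, mean, std, min, 25%, 50%, 75% and max
stats = close.describe()
print(stats)

# The five-number summary is a subset of those statistics
five_numbers = stats[["min", "25%", "50%", "75%", "max"]]
print(five_numbers.tolist())  # [1.0, 6.0, 11.0, 16.0, 21.0]
```

So `describe()` covers the five-number summary and adds the count, mean, and standard deviation in the same call.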