# Introduction to Statistical Thinking – Part I

Contributor: QuantInsti

Statistical thinking is an approach to processing information through the lens of probability and statistics in order to make informed decisions.

This blog series begins by introducing statistical thinking, makes a brief stopover to understand Bayesian statistics, and then dwells on its applications in financial markets using Python.

“Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write!”
H.G. Wells (1866-1946), the father of science fiction

Making choices is a part of our daily lives, be it personal or professional. If you apply statistical thinking wherever possible, you can make better choices.

In this article, we’ll go step by step in deconstructing the decision-making process under limited information. We’ll look at some examples, the jargon and the importance of statistics in the process.

• What is statistics?
• What is a statistical question?
• Why do we need statistics?
• Descriptive statistics vs Inferential statistics
• Should we use descriptive statistics or inferential statistics?
• Jargon in statistics
• Population
• Sample
• Observation
• Statistic
• Parameter
• Hypothesis
• Hypothesis testing
• Estimate
• Why should we spend time on statistical inference?

## What is statistics?

There are two ways to define statistics. Formally, “the science of statistics deals with the collection, analysis, interpretation, and presentation of data.”

Intuitively, “statistics is the science of making decisions under uncertainty.”

That is, statistics is a tool that helps you make decisions when you don’t have complete information.

## What is a statistical question?

[Image: four cats. Source: https://www.istockphoto.com/photos/four-cats]

Looking at the above image, let’s address some questions!

How many cats are in the picture above?
4, right?

Do we have all the information to answer this question?
Yes.

Do all healthy cats have four legs?
Yes.

Do we have all the information to answer this question?
No, because the picture shows only 4 of all the cats in the world!

But can we still answer it with certainty?
Yes.

So, is it a statistical question?
No.

Why?
Because if you have all the information to answer the question or if you can answer this question with certainty, it’s not a statistical question.

For a question to be a statistical question,

• The question has to go beyond the available information, and
• The question shouldn’t be answerable with certainty.

This idea will come up repeatedly in this article: statistics is the science of decision-making under uncertainty.

## Why do we need statistics?

We'll work with a toy example throughout this post to answer this question.

Suppose we decide to design a Quantra course on Julia programming.

• How do we decide if we should put time and effort into building this course?
• What if our designed course fails and doesn’t get many interested users?

These are important business decisions that require substantial resources. Therefore, we decide to run a survey to gauge whether such a course would sell.

Now, that raises the following questions:

• Who would our potential paid users be?
• Who should we approach? Programmers? Data scientists? Researchers? College graduates? Quantitative Analysts?
• Ideally, all of them, right?

However, we can't reach all of them.

• So, what should we do?
• Should we drop the idea of designing the new course?

That doesn’t sound right.

If we had access to all of these people, the process would be simple: if the majority said they would buy such a course, we would create it; if not, we would drop it.

However, since we can't do that, we do the next best thing: we ask as many people as we can reach and, based on their responses, estimate the likelihood of the course being successful.

To calculate this estimate, we need statistics.
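As a rough illustration of the survey idea above, the sketch below estimates the share of people who would buy the course, together with an approximate 95% confidence interval. The responses are made up (1 means "would buy", 0 means "would not"), the function name `estimate_proportion` is hypothetical, and the interval uses the simple normal approximation, which is only one of several possible choices.

```python
import math
from statistics import NormalDist


def estimate_proportion(responses, confidence=0.95):
    """Estimate a proportion and an approximate (normal-based) confidence interval."""
    n = len(responses)
    p_hat = sum(responses) / n  # share of "yes" answers in the sample
    # Two-sided critical value of the standard normal distribution
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, max(0.0, p_hat - margin), min(1.0, p_hat + margin)


# Hypothetical survey: 130 of the 200 people we could reach say they would buy
responses = [1] * 130 + [0] * 70

p_hat, low, high = estimate_proportion(responses)
print(f"Estimated share of buyers: {p_hat:.2f}")
print(f"Approximate 95% interval: {low:.2f} to {high:.2f}")
```

The point estimate alone says little about how far off we might be; the interval gives a sense of the uncertainty that comes from asking only a subset of people.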

To generalize this idea, in real-world scenarios, we rarely have complete information related to the decision we want to make, whether for individuals or businesses.

Hence, we need a tool that can help us decide with limited information. Statistics is one such tool, and making these decisions within a statistical framework is called statistical thinking.

Statistical thinking is not just about using formulas to calculate p-values and z-scores; it’s a way to think about the world. Once you internalize this idea, it will change how you see the world. You’ll start thinking in terms of probabilities instead of certainties, which will help you make better decisions in your professional and personal life.

## Descriptive statistics vs Inferential statistics

Descriptive statistics is the process of describing the features of the data using measures of central tendency (mean, median and mode), measures of dispersion (standard deviation, interquartile range), etc.
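The measures named above can be computed with Python's built-in `statistics` module. This is a minimal sketch on a made-up list of ratings; the data and variable names are assumptions for illustration only.

```python
import statistics

# Hypothetical ratings on a 1-to-10 scale, invented for this example
ratings = [7, 6, 8, 5, 7, 9, 6, 7, 4, 8]

# Measures of central tendency
mean = statistics.mean(ratings)
median = statistics.median(ratings)
mode = statistics.mode(ratings)

# Measures of dispersion
std_dev = statistics.stdev(ratings)  # sample standard deviation
q1, _, q3 = statistics.quantiles(ratings, n=4)  # quartile cut points
iqr = q3 - q1  # interquartile range

print(f"mean={mean}, median={median}, mode={mode}")
print(f"standard deviation={std_dev:.2f}, interquartile range={iqr:.2f}")
```

Everything here describes only the ten ratings we hold; nothing is claimed about anyone outside that list.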

Inferential statistics, by contrast, is about working with limited data and using it to infer something about a larger question that we pose beforehand. This question cannot be answered with certainty.

This article focuses on the latter, i.e. inferential statistics.

## Should we use descriptive or inferential statistics?

It depends on the question you’re asking and the available data. A simple question to ask yourself while deciding which one to use is:

• Do we want to describe the existing data? OR
• Do we want to draw inferences from the existing data (sample) to extrapolate about the population?

We go with descriptive statistics for the former and inferential statistics for the latter.

## Jargon in statistics

Let’s look at some of the key terms used in statistics that will help you in understanding the concepts better.

### Population

The universe of items we’re interested in. Going back to our Quantra course example, the population would be every person in this world who would be interested in the Julia course.

### Sample

A subset of the population, i.e. the part of it we can actually collect information from. This could be the Quantra or EPAT user base we have. We could frame our question as: How likely are you to buy a course on Julia (on a scale of 1 to 10)?

### Statistic

A summary measure of the data available, i.e. from the sample. Here, it could be an average score of, say, 7 obtained from Quantra and EPAT users for the above question.

### Parameter

A parameter is a summary measure of the population. Here, it could be an average score of, say, 6 obtained from the population (as defined above).

A statistic is a summary measure of the existing data (sample), whereas a parameter is the same for the population.
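To make the distinction concrete, here is a small simulation. It assumes a made-up population of ratings that we could never observe in practice, draws a sample from it, and compares the two averages. The population size, sample size and uniform rating distribution are all arbitrary assumptions.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: ratings (1 to 10) from everyone who might be interested.
# In reality this is unobservable; it is simulated here only to show the contrast.
population = [random.randint(1, 10) for _ in range(100_000)]

# The sample: the subset of people we can actually survey
sample = random.sample(population, 200)

parameter = statistics.mean(population)  # summary measure of the population
statistic = statistics.mean(sample)      # summary measure of the sample

print(f"Parameter (population mean): {parameter:.2f}")
print(f"Statistic (sample mean):     {statistic:.2f}")
```

The two numbers are close but not identical, which is why we treat the statistic as an approximation of the parameter rather than the parameter itself.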

### Hypothesis

A description of how we think the world works. We hypothesize that EPAT and Quantra users are unlikely to buy a course on Julia (a mean rating of 5). This is the assumption we start with, and we call it the null hypothesis.

### Null Hypothesis

It’s crucial to state a null hypothesis before starting any statistical analysis. The null hypothesis usually represents the status quo. The alternative hypothesis is the theory that you think could be true and for which you are looking for supporting evidence.

So to clarify, our null hypothesis $H_0$ and alternative hypothesis $H_1$ here are:

$H_0$: EPAT and Quantra users are unlikely to buy a course on Julia (mean rating $= 5$)

$H_1$: EPAT and Quantra users are likely to buy the course (mean rating $> 5$)

### Hypothesis testing

Hypothesis testing is a method for drawing conclusions about the population from the sample data, i.e. for checking whether the data are consistent with a hypothesis or give us reason to reject it.
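As a rough illustration, the sketch below tests the null hypothesis of a mean rating of 5 against the one-sided alternative of a higher mean. The ratings are made up, the function name `one_sided_z_test` is hypothetical, the p-value relies on a normal (large-sample) approximation rather than an exact t-test, and the 0.05 threshold is a common convention rather than something fixed by the article.

```python
import math
import statistics


def one_sided_z_test(ratings, null_mean):
    """Approximate test of H0: mean == null_mean against H1: mean > null_mean."""
    n = len(ratings)
    sample_mean = statistics.mean(ratings)
    std_error = statistics.stdev(ratings) / math.sqrt(n)
    z = (sample_mean - null_mean) / std_error
    # Probability of a z at least this large if H0 were true
    # (standard normal used as a large-sample approximation)
    p_value = 1 - statistics.NormalDist().cdf(z)
    return sample_mean, z, p_value


# Hypothetical survey ratings (1 to 10), invented for illustration
ratings = [7, 6, 8, 5, 7, 9, 6, 7, 4, 8, 7, 6, 8, 7, 5,
           9, 6, 7, 8, 6, 7, 5, 8, 7, 6, 9, 7, 6, 8, 7]

sample_mean, z, p_value = one_sided_z_test(ratings, null_mean=5)
print(f"sample mean={sample_mean:.2f}, z={z:.2f}, p-value={p_value:.4f}")

if p_value < 0.05:
    print("The data give us reason to reject H0 in favour of H1.")
else:
    print("The data do not give us enough reason to reject H0.")
```

A small p-value means ratings like these would be very unlikely if the true mean were 5. It does not prove the alternative; it only tells us how surprising the sample is under the null hypothesis.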

### Estimate

An estimate is a value, computed from the sample, that serves as our best guess of the actual value of the parameter.

Stay tuned for the next installment in which the authors will answer why we should spend time on statistical inference.