Data Preprocessing: Python, Machine Learning, Examples and more

QuantInsti

Contributor:
QuantInsti

Data preprocessing is a basic requirement of any good machine learning model. Preprocessing means putting the data into a form that the machine learning model can read easily. In this article, we will discuss the basics of data preprocessing and how to make data suitable for machine learning models.

This article covers:

  • What is data preprocessing?
  • Why is data preprocessing required?
  • Examples of data preprocessing for different data set types with Python
      • Missing values
          • Dropping
          • Numerical imputation
          • Categorical imputation
      • Outliers
      • Overfitting
      • Data with no numerical values
      • Different date formats
  • Where can you learn more about data preprocessing?

What is data preprocessing?

Data preprocessing is the process of preparing raw data and making it suitable for machine learning models. It includes data cleaning, which gets the data ready to be given to the machine learning model.

Our comprehensive blog on data cleaning covers data cleaning as a part of preprocessing, from the basics to performance and more.

After data cleaning, data preprocessing requires the data to be transformed into a format that is understandable to the machine learning model.


Why is data preprocessing required?

Data preprocessing is mainly required for the following:

  • Accurate data: To be readable by the machine learning model, the data needs to be accurate, with no missing, redundant or duplicate values.
  • Trusted data: The updated data should be as accurate and trustworthy as possible.
  • Understandable data: The updated data needs to be interpreted correctly.
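
A quick way to check the first point is to count missing and duplicated entries before any modelling. The sketch below assumes pandas; the DataFrame `data` and its columns are hypothetical and exist only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one missing price and one duplicated row.
data = pd.DataFrame({
    "symbol": ["AAA", "BBB", "BBB", "CCC"],
    "price": [101.5, 250.0, 250.0, np.nan],
})

# Number of missing values in each column.
missing_per_column = data.isnull().sum()

# Number of rows that exactly repeat an earlier row.
duplicate_rows = int(data.duplicated().sum())

print(missing_per_column)
print("Duplicate rows:", duplicate_rows)
```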

All in all, data preprocessing matters because the machine learning model must learn from correct data in order to arrive at the right predictions/outcomes.


Examples of data preprocessing for different data set types with Python

Since data comes in various formats, let us discuss how different data types can be converted into a format that the ML model can read accurately. Let us see how to feed correct features from datasets with:

  • Missing values
  • Outliers
  • Overfitting
  • Data with no numerical values
  • Different date formats

Missing values

Missing values are a common problem when dealing with data! Values can go missing for various reasons, such as human errors, mechanical errors, etc.

Data cleansing is an important step before you even begin the algorithmic trading process, which begins with historical data analysis for making the prediction model as accurate as possible.

Based on this prediction model, you create the trading strategy. Hence, leaving missing values in the data set can wreak havoc by giving faulty predictive results, which lead to erroneous strategy creation and, to state the obvious, poor results.

There are three techniques for solving the missing values problem in order to find the most accurate features:

  • Dropping
  • Numerical imputation
  • Categorical imputation

Dropping

Dropping is the most common method of taking care of missing values. The rows or entire columns with missing values are dropped in order to avoid errors in data analysis.

Some programs automatically drop the rows or columns that include missing values, which reduces the training size. Hence, dropping can lead to a decrease in model performance.

A simple solution to the problem of a reduced training size is to use imputation instead, and we will discuss the imputation methods further on. If you do drop, you can define a threshold.

The threshold can be anything: 50%, 60% or 70% of the data. Let us take 60% in our example. Data with up to 60% missing values is still accepted by the model/algorithm as training data, but features with more than 60% missing values are dropped.

The following Python code is used for dropping the values:
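
A minimal sketch of threshold-based dropping, assuming pandas. The DataFrame `data`, its column names and its values are hypothetical, and the 60% threshold is the one from the example above.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "a" is 25% missing, "b" is 50% missing, "c" is 75% missing.
data = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [1.0, np.nan, np.nan, 4.0],
    "c": [np.nan, np.nan, np.nan, 4.0],
})

threshold = 0.6

# Keep only the columns whose share of missing values is at most the threshold.
cleaned = data.loc[:, data.isnull().mean() <= threshold]

# Then keep only the rows whose share of missing values is at most the threshold.
cleaned = cleaned.loc[cleaned.isnull().mean(axis=1) <= threshold]

print(cleaned)
```

Here column "c" is dropped first, and the row that is then entirely empty is dropped afterwards.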

With the above Python code, the missing values are dropped and the machine learning model learns from the rest of the data.

Numerical imputation

Imputation means replacing the missing values with a value that makes sense. Numerical imputation is applied to numeric data.

For instance, in a tabular data set whose columns are the number of stocks, commodities and derivatives traded in a month, it is better to replace a missing value with “0” than to leave it as it is.

With numerical imputation, the data size is preserved and hence predictive models like linear regression can predict more accurately.

A linear regression model cannot work with missing values in the data set, and it is biased by whatever replaces them because it treats the filled-in values as “good estimates”. The missing values can also be replaced with the median of the column, since the median, unlike the mean, is not sensitive to outliers.

Let us see the Python code for numerical imputation, which is as follows:
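
A minimal sketch of both options, filling with 0 and filling with the column median, assuming pandas. The DataFrame `data` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly trade counts with one missing value per column.
data = pd.DataFrame({
    "stocks": [10.0, np.nan, 30.0, 1000.0],
    "commodities": [5.0, 7.0, np.nan, 9.0],
})

# Option 1: replace every missing value with 0.
filled_with_zero = data.fillna(0)

# Option 2: replace each missing value with the median of its column.
filled_with_median = data.fillna(data.median())

print(filled_with_zero)
print(filled_with_median)
```

In the "stocks" column the value 1000 pulls the mean up sharply, while the median of the observed values stays at 30.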

Categorical imputation

This technique of imputation replaces the missing values with the value that occurs most often in the column. If no value occurs frequently enough to dominate the others, it is best to fill the missing entries with “NaN”.

The following Python code can be used for categorical imputation:
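
A minimal sketch of most-frequent-value imputation, assuming pandas. The column name `sector` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with two missing entries.
data = pd.DataFrame({"sector": ["Tech", "Energy", np.nan, "Tech", np.nan]})

# Most frequent value in the column (value_counts ignores missing values by default).
most_frequent = data["sector"].value_counts().idxmax()

# Replace the missing entries with that value.
data["sector"] = data["sector"].fillna(most_frequent)

print(data)
```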

Outliers

An outlier differs significantly from the other values and lies far from their mean. Such values are usually due to systematic errors or flaws.

Let us see the following Python code for identifying and removing outliers with the standard deviation:
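
A minimal sketch of the standard deviation method, assuming NumPy. The sample values and the cut-off of three standard deviations are assumptions chosen for illustration; the names `lower` and `upper` are the limits used to flag outliers.

```python
import numpy as np

# Hypothetical sample: twenty values near 100 and one extreme value.
data = [98, 99, 100, 101, 102] * 4 + [500]

# Mean and standard deviation of the sample.
data_mean, data_std = np.mean(data), np.std(data)

# Assumed cut-off: three standard deviations on either side of the mean.
cut_off = 3 * data_std
lower, upper = data_mean - cut_off, data_mean + cut_off

# Identify the outliers.
outliers = [x for x in data if x < lower or x > upper]

# Remove the outliers.
outliers_removed = [x for x in data if lower <= x <= upper]

print(outliers)
```

With very small samples a three-standard-deviation cut-off may flag nothing, because the extreme value itself inflates the standard deviation.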

In the code above, “lower” and “upper” signify the lower and upper limits of the data set.

Overfitting

In both machine learning and statistics, overfitting occurs when the model fits the training data too closely or, simply put, when the model is too complex.

An overfitted model learns the detail and noise in the training data to such an extent that it negatively impacts the performance of the model on new data/test data.
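
The effect can be illustrated with a small sketch, assuming NumPy. The data is synthetic (a noisy straight line), and the polynomial degrees 1 and 9 are arbitrary choices standing in for a simple and an overly complex model. The complex fit reaches a near-zero training error because it also reproduces the noise, and it will typically do worse than the simple fit on the held-out points.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical noisy samples of the straight line y = 2x + 1.
x_train = np.linspace(0.0, 1.0, 10)
y_train = 2 * x_train + 1 + rng.normal(scale=0.2, size=x_train.size)
x_test = np.linspace(0.05, 0.95, 10)
y_test = 2 * x_test + 1 + rng.normal(scale=0.2, size=x_test.size)


def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on the points (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))


train_error, test_error = {}, {}
for degree in (1, 9):
    # Fit a polynomial of the given degree to the training points only.
    coeffs = np.polyfit(x_train, y_train, degree)
    train_error[degree] = mse(coeffs, x_train, y_train)
    test_error[degree] = mse(coeffs, x_test, y_test)
    print(degree, train_error[degree], test_error[degree])
```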

The overfitting problem can be solved by decreasing the number of features/inputs or by increasing the number of training examples, which makes the machine learning algorithm generalise better.

The most common solution to overfitting is regularisation. Binning is a technique that helps regularise the data, although you lose some information every time you apply it.

For instance, in the case of numerical binning, the data can be as follows:

Stock value    Bin
100-250        Low
251-400        Mid
401-500        High

Here is the Python code for binning:
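
A minimal sketch of numerical binning, assuming pandas and its `cut` function. The column names `Value` and `Bin` and the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical stock values to be binned.
data = pd.DataFrame({"Value": [102, 300, 107, 470]})

# Bin edges for the ranges 100-250, 251-400 and 401-500;
# include_lowest=True makes the first bin include the value 100 itself.
data["Bin"] = pd.cut(
    data["Value"],
    bins=[100, 250, 400, 500],
    labels=["Low", "Mid", "High"],
    include_lowest=True,
)

print(data)
```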

Your output should look something like this:

     Value    Bin
0     102     Low
1     300     Mid
2     107     Low
3     470     High

Stay tuned for the next installment in which Chainika Thakar will discuss data with no numerical values.

Visit QuantInsti to read the full article: https://blog.quantinsti.com/data-preprocessing/.

Disclosure: Interactive Brokers

Information posted on IBKR Traders’ Insight that is provided by third-parties and not by Interactive Brokers does NOT constitute a recommendation by Interactive Brokers that you should contract for the services of that third party. Third-party participants who contribute to IBKR Traders’ Insight are independent of Interactive Brokers and Interactive Brokers does not make any representations or warranties concerning the services offered, their past or future performance, or the accuracy of the information provided by the third party. Past performance is no guarantee of future results.

This material is from QuantInsti and is being posted with permission from QuantInsti. The views expressed in this material are solely those of the author and/or QuantInsti and IBKR is not endorsing or recommending any investment or trading discussed in the material. This material is not and should not be construed as an offer to sell or the solicitation of an offer to buy any security. To the extent that this material discusses general market activity, industry or sector trends or other broad based economic or political conditions, it should not be construed as research or investment advice. To the extent that it includes references to specific securities, commodities, currencies, or other instruments, those references do not constitute a recommendation to buy, sell or hold such security. This material does not and is not intended to take into account the particular financial conditions, investment objectives or requirements of individual customers. Before acting on this material, you should consider whether it is suitable for your particular circumstances and, as necessary, seek professional advice.

In accordance with EU regulation: The statements in this document shall not be considered as an objective or independent explanation of the matters. Please note that this document (a) has not been prepared in accordance with legal requirements designed to promote the independence of investment research, and (b) is not subject to any prohibition on dealing ahead of the dissemination or publication of investment research.

Any trading symbols displayed are for illustrative purposes only and are not intended to portray recommendations.