
Faster Data Exploration with DataExplorer

Contributor: TheAutomatic.net

By: Blogger, TheAutomatic.net, and Senior Data Scientist

Data exploration is an important part of the modeling process, and it can take up a fair amount of time. The awesome DataExplorer package in R aims to make this process easier. To get started with DataExplorer, install it from CRAN:

install.packages("DataExplorer")

Let’s use DataExplorer to explore a dataset on diabetes.

# load DataExplorer
library(DataExplorer)
 
# read in dataset
diabetes_data <- read.csv("https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv", header = FALSE)
 
# fix column names
names(diabetes_data) <- c("number_of_times_pregnant", "plasma_glucose_conc", "diastolic_bp", "triceps_skinfold_thickness", "two_hr_serum_insulin", "bmi", "diabetes_pedigree_function", "age", "label")
 
# create report
create_report(diabetes_data)

Running the create_report line of code above will generate an HTML report file containing a collection of useful information about the data. This includes:

  • Basic statistics, such as number of rows and columns, number of columns with missing data, count of continuous variables vs. discrete, and the total memory allocation
  • Data type for each field
  • Missing data percentages for each column
  • Univariate distribution for each column
  • QQ plots
  • Correlation analysis
  • PCA



That’s right – a single line of code can generate all of the above for a given dataset! It’s also possible to get each of these pieces individually. For example, in a single line of code, we can generate histograms for all the numeric variables in the dataset.

plot_histogram(diabetes_data)
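The summary statistics at the top of the report can be pulled out the same way. Here is a minimal sketch using introduce and plot_missing, assuming a small made-up data frame (toy, with illustrative values only, not the real diabetes data) so that it runs without downloading anything:

```r
library(DataExplorer)

# Made-up stand-in for diabetes_data; the values are illustrative only
toy <- data.frame(
  plasma_glucose_conc = c(148, 85, 183, 89, 137, 116),
  bmi                 = c(33.6, 26.6, 23.3, 28.1, NA, 25.6),
  age                 = c(50, 31, 32, 21, 33, 30),
  label               = c(1, 0, 1, 0, 1, 0)
)

# One-row summary: row and column counts, missing values, memory usage
basic_stats <- introduce(toy)
print(basic_stats)

# Plot the percentage of missing values per column
plot_missing(toy)
```

Since introduce returns a regular one-row table, the figures can be reused in your own checks rather than only read off the report.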

Similarly, we can get bar plots for all categorical variables in the dataset:

plot_bar(diabetes_data)
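Every column in this dataset is stored as a number, including the 0/1 label. Whether plot_bar picks up such a column as categorical may depend on the DataExplorer version, so converting it to a factor first removes the ambiguity. A small sketch, again assuming a made-up data frame (toy) rather than the real data:

```r
library(DataExplorer)

# Made-up stand-in for diabetes_data; the values are illustrative only
toy <- data.frame(
  bmi   = c(33.6, 26.6, 23.3, 28.1, 43.1, 25.6),
  age   = c(50, 31, 32, 21, 33, 30),
  label = c(1, 0, 1, 0, 1, 0)
)

# Treat the 0/1 outcome as a category rather than a number
toy$label <- factor(toy$label)

# Draws one bar chart per categorical column (here, just label)
plot_bar(toy)
```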

Here’s an example getting the correlation plot:

plot_correlation(diabetes_data)
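If you want the numbers behind the heatmap, base R's cor returns the same kind of matrix. The sketch below assumes a small made-up data frame (toy) with one missing value, and uses the cor_args argument of plot_correlation to pass options through to cor:

```r
library(DataExplorer)

# Made-up stand-in for diabetes_data; the values are illustrative only
toy <- data.frame(
  plasma_glucose_conc = c(148, 85, 183, 89, 137, 116),
  bmi                 = c(33.6, 26.6, 23.3, 28.1, NA, 25.6),
  age                 = c(50, 31, 32, 21, 33, 30)
)

# Correlation matrix, using only the rows available for each pair of columns
cor_matrix <- cor(toy, use = "pairwise.complete.obs")
print(round(cor_matrix, 2))

# Same idea as a heatmap, restricted to the continuous columns
plot_correlation(
  toy,
  type     = "continuous",
  cor_args = list(use = "pairwise.complete.obs")
)
```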

Configuring the report

It’s also possible to make adjustments to the output generated by create_report. For example, if you don’t want the QQ plots, you could set add_plot_qq = FALSE in configure_report and pass the result to create_report along with the data:

config <- configure_report(add_plot_qq = FALSE)

# pass the data frame along with the config
create_report(diabetes_data, config = config)
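Several sections can be switched off at once, and create_report also accepts arguments for where the file is written. A hedged sketch: airquality (a dataset that ships with R) stands in for diabetes_data so the example runs offline, the file name and title are made up, and the comment about names(config) is an assumption about how the returned list is structured.

```r
library(DataExplorer)

# Drop two sections of the report: QQ plots and PCA
config <- configure_report(
  add_plot_qq     = FALSE,
  add_plot_prcomp = FALSE
)

# Assumption: the names of the returned list are the sections that remain
print(names(config))

# Rendering needs rmarkdown and pandoc, so skip it if they are unavailable
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  create_report(
    airquality,                               # stand-in for diabetes_data
    config       = config,
    output_file  = "airquality_report.html",  # hypothetical file name
    output_dir   = tempdir(),
    report_title = "Air quality profiling report"
  )
}
```

Writing to tempdir() keeps the working directory clean while experimenting; point output_dir somewhere permanent once the report looks right.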

One-hot encoding

DataExplorer also comes with a function to perform one-hot encoding. You can one-hot encode all the categorical variables in a dataset by passing the data frame to the dummify function. In this case, we don’t have any categorical variables to encode, so the function will generate a warning.

dummify(diabetes_data)
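To see dummify actually encode something, the data needs at least one categorical column. A minimal sketch, assuming a made-up data frame (toy) in which label has been converted to a factor:

```r
library(DataExplorer)

# Made-up stand-in for diabetes_data; the values are illustrative only
toy <- data.frame(
  bmi   = c(33.6, 26.6, 23.3, 28.1),
  label = factor(c(1, 0, 1, 0))
)

# Replace each categorical column with one 0/1 column per level;
# numeric columns such as bmi are carried over unchanged
encoded <- dummify(toy)
print(names(encoded))
print(encoded)
```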

Visit TheAutomatic.net blog for additional insight on this topic and to find DataExplorer scripts and documentation.

This material is from TheAutomatic.net and is being posted with permission from TheAutomatic.net.