If you’ve done any type of data analysis in Python, chances are you’ve probably used pandas. Though widely used in the data world, if you’ve run into space or computational issues with it, you’re not alone. This post discusses several faster alternatives to pandas.
R’s data table in Python
If you’ve used R, you’re probably familiar with the data.table package. A port of this library is also available in Python. In this example, we show how you can read in a CSV file faster than using standard pandas. For our purposes, we’ll be using an open source dataset from the UCI repository.
import datatable start = time.time() os_scan_data = datatable.fread("OS Scan_dataset.csv", header = None) end = time.time() print(end - start)
Using datatable, we can read in the CSV file in ~20 seconds. Reading the same file using pandas takes almost 76 seconds!
Next, we can also sort faster with datatable.
start = time.time() os_scan_data.sort() end = time.time() print(end - start)
In datatable, this takes ~0.002 seconds, but takes ~0.934 seconds in pandas.
In a later article, we’ll go into more detail with datatable. You can check out its documentation by clicking here.
The modin package
modin is another pandas alternative to speed up functions while keeping the syntax largely the same. modin works by utilizing the multiple cores available on a machine (like your laptop, for instance) to run pandas operations in parallel. Since most laptops have between four and eight cores, this means you can still have a performance boost even without using a more powerful server.
First, let’s install modin using pip. For this step, we’re going to install all the dependencies, which includes dask and ray. These will not be installed if you leave out the “[all]” piece of the installation command.
pip install modin[all]
Next, we can get started by importing the package like below. We’ll also import the time package to compare runtimes.
import modin.pandas as pd import time
For this example, we’ll be using the dataset found here.
os_scan_data = pd.read_csv("OS Scan_dataset.csv", header = None)
Also, we’re going to increase the size of the dataset artificially by simply duplicating the rows multiple times:
combined_data = pd.concat([os_scan_data, os_scan_data, os_scan_data])
This gives us a dataset with over 5 millions rows and and 115 columns.
Next, let’s create a new column using the map function. Using modin, we’ll able to generate the new field in around 0.03 seconds.
start = time.time() combined_data["test"] = combined_data.map(lambda val: "above" if val > 3 else "below") end = time.time() print(end - start)
If we were to use normal pandas, we get the following result at ~1.34 seconds.
Check out more about modin by clicking here.
The PandaPy library
PandaPy is another alternative to pandas. According to its documentation page, PandaPy is recommended as a potential faster alternative to pandas when the data you’re dealing with has less than 50,000 rows, but possibly as high as 500,000 rows, depending on the data. Another benefit of this package is that it often reduces the amount of memory needed to store datasets when you have mixed data types.
PandaPy can be download via pip:
pip install pandapy
For this example, we’ll use a credit card dataset from Kaggle. Now, we can read in the data. In PandaPy, the dataset is read in as a structured array.
import pandapy as pp # read in dataset credit_data = pp.read("creditcard.csv") # get descriptive stats pp.describe(credit_data)
General column operations are similar – for example, we can divide two columns just like in pandas:
credit_data["V1"] / credit_data["V2"]
Similarly, we can get the mean of a column just like pandas:
See documentation for PandaPy here.
Several pandas functions can be implemented more efficiently using numpy. For example, if you want to calculate quantiles, like the 90% or 95%, etc., you can use either pandas or numpy. However, numpy will generally be faster.
# pandas start = time.time() q = np.arange(0.05, 1, 0.05) quantiles = [email_data.W.quantile(val) for val in q] end = time.time() print(end - start) # numpy start = time.time() q = np.arange(0.05, 1, 0.05) quantiles = [np.quantile(email_data.W, val) for val in q] end = time.time() print(end - start)
That’s all for this post! These are just a few of the alternatives to pandas. If you’d like to see tutorials on other alternatives, feel free to let me know. Also, if you enjoyed reading this article, make sure to share it with others! Check out my other Python posts by clicking here.
Visit TheAutomatic.net for additional insight on this article: http://theautomatic.net/2021/10/09/faster-alternatives-to-pandas/.
Disclosure: Interactive Brokers
Information posted on IBKR Traders’ Insight that is provided by third-parties and not by Interactive Brokers does NOT constitute a recommendation by Interactive Brokers that you should contract for the services of that third party. Third-party participants who contribute to IBKR Traders’ Insight are independent of Interactive Brokers and Interactive Brokers does not make any representations or warranties concerning the services offered, their past or future performance, or the accuracy of the information provided by the third party. Past performance is no guarantee of future results.
This material is from TheAutomatic.net and is being posted with permission from TheAutomatic.net. The views expressed in this material are solely those of the author and/or TheAutomatic.net and IBKR is not endorsing or recommending any investment or trading discussed in the material. This material is not and should not be construed as an offer to sell or the solicitation of an offer to buy any security. To the extent that this material discusses general market activity, industry or sector trends or other broad based economic or political conditions, it should not be construed as research or investment advice. To the extent that it includes references to specific securities, commodities, currencies, or other instruments, those references do not constitute a recommendation to buy, sell or hold such security. This material does not and is not intended to take into account the particular financial conditions, investment objectives or requirements of individual customers. Before acting on this material, you should consider whether it is suitable for your particular circumstances and, as necessary, seek professional advice.
In accordance with EU regulation: The statements in this document shall not be considered as an objective or independent explanation of the matters. Please note that this document (a) has not been prepared in accordance with legal requirements designed to promote the independence of investment research, and (b) is not subject to any prohibition on dealing ahead of the dissemination or publication of investment research.