The article “Text-Based Factor Investing” first appeared on Alpha Architect Blog.
Part 1: The End of Accounting
- This is the first part of a series of guest posts by Kai Wu, the CIO & Founder of Sparkline Capital.
The Factor Zoo
As readers of Alpha Architect’s blog, you’re certainly familiar with factor investing. Factors are quantifiable firm characteristics that explain cross-sectional stock returns. While some factors merely explain risk (e.g., industry), others are also associated with positive expected returns (e.g., value, momentum).
Since the dawn of academic finance, researchers have identified hundreds of factors. In the past decade, the number of published factors has proliferated exponentially.
Source: Harvey and Liu (2019)
However, all these factors have one thing in common. They are based on two types of data:
- Market data: price, volume
- Accounting data: sales, earnings, book value, cash flow
In The End of Accounting, Baruch Lev and Feng Gu, two accounting professors, lament that accounting has largely remained unchanged for the past century. They point to the fact that US Steel’s 1902 annual report has essentially the same financial information as its 2012 report (but with far fewer stock photos).
Source: Chronicling America, Sparkline
So while the universe of accounting data has remained static for a century, academics are somehow still finding new signals in this haystack. Hmm…
Lev and Gu also make the point that, for lack of reform, accounting is still optimized for the industrial era of the early 1900s. However, most value today is derived from intangible assets (e.g., intellectual property and human capital). This is a second reason why mining accounting data may prove to be a fruitless endeavor.
The Rise of Unstructured Data
Accounting data may be the only form of data that hasn’t grown over the past century. Every few years, more data is created than has existed in all of human history. However, 80% of this data is unstructured. This means the data does not live in an Excel spreadsheet or SQL database. Instead, it takes the form of text, images, video, and other jumbled messes.
Source: IDC, Sparkline
Most of the information in company annual reports is unstructured. Glossy photos aside, there is actual useful data in, for instance, the management discussion and analysis (MD&A) section. However, this section is unstructured text, which quants cannot readily analyze with traditional econometrics.
The general problem is that unstructured data is high dimensional. A single 10-K can contain a vocabulary of tens of thousands of unique words. Common statistical techniques such as linear regression struggle when the number of features is this large relative to the number of observations. Fortunately, we have a new weapon in our arsenal: natural language processing (NLP).
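To make the dimensionality point concrete, here is a minimal sketch of a bag-of-words representation, in which every unique word becomes its own column. The three sentences are invented stand-ins for filings (they are not from any real 10-K), and the `tokenize` helper is a simplification assumed for illustration only.

```python
import re
from collections import Counter

# Toy stand-ins for filings (invented text); a real 10-K is far longer.
docs = [
    "Revenue increased due to strong demand for our cloud platform.",
    "We face risks from competition and changes in regulation.",
    "Our platform strategy relies on attracting skilled engineers.",
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


# Vocabulary: every unique word across all documents.
vocab = sorted({tok for doc in docs for tok in tokenize(doc)})

# Document-term matrix: one row per document, one column per vocabulary word.
counts = [Counter(tokenize(doc)) for doc in docs]
matrix = [[c[word] for word in vocab] for c in counts]

n_docs, n_features = len(matrix), len(vocab)
print(f"{n_docs} documents, {n_features} word features")
```

Even in this tiny example there are many more word columns than documents, so an ordinary least-squares fit on the raw counts would have more unknowns than observations. With real filings the gap is far wider.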
The field of NLP has exploded over the past decade. You may have heard of the recently released OpenAI GPT-3 model, which has been used to generate essays, poetry, and artwork. Like Moore’s Law, the power of NLP has been increasing at an exponential rate (note the log Y-axis).
However, we don’t even need these cutting-edge models to derive meaning from financial text (such as MD&A). Even much simpler techniques can produce powerful and robust results. We’ll provide an example later.
A Brave New World
Let’s now return to our initial problem. We believe accounting data is “tapped out,” implying that despite herculean effort, the academy’s well-intentioned quest to find more accounting factors may be fruitless, or worse, data mining. But once you cross the Rubicon into the world of unstructured data, suddenly the fruit is hanging much lower.
Once we’ve cast these shackles off, there are a nearly infinite number of dimensions to explore. We can now start defining factors that would be more familiar to a fundamental analyst. Which companies are implementing disruptive technology? Which firms are employing a platform business model? Which companies are winning the war for talent?
These text-based factors are similar to traditional factors. In a statistical sense, they are quantifiable company characteristics that explain cross-sectional variance and may also have positive expected returns. In a broader sense, they capture important missing dimensions of business and markets.
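As a rough illustration of what “quantifiable company characteristic” can mean for text, the sketch below scores each firm by the share of its words that match a keyword list, then standardizes the scores across firms. This is only an assumed toy construction, not the method used in this series: the firm names, filing snippets, and `KEYWORDS` list are all hypothetical.

```python
import re
import statistics

# Hypothetical snippets standing in for each firm's filing text.
filings = {
    "FIRM_A": "We invest in machine learning and cloud automation across our platform.",
    "FIRM_B": "Our stores sell groceries and household goods in three states.",
    "FIRM_C": "Cloud revenue grew as customers adopted our machine learning tools.",
    "FIRM_D": "We operate pipelines and storage terminals.",
}

# Hypothetical keyword list for a "technology" theme (an assumption).
KEYWORDS = {"machine", "learning", "cloud", "automation"}


def keyword_share(text):
    """Fraction of a document's word tokens that appear in KEYWORDS."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(tok in KEYWORDS for tok in tokens)
    return hits / len(tokens)


# Raw characteristic: one number per firm.
raw = {firm: keyword_share(text) for firm, text in filings.items()}

# Cross-sectional z-score: compare each firm with its peers at one point in time.
mean = statistics.mean(raw.values())
stdev = statistics.pstdev(raw.values())
scores = {firm: (value - mean) / stdev for firm, value in raw.items()}

for firm, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{firm}: {score:+.2f}")
```

The standardized scores can then be treated like any other factor exposure, for example by sorting firms into portfolios on them. Whether such a score carries a positive expected return is an empirical question this sketch does not address.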
We’ll provide just a single example of a text-based factor in this post. However, in future posts we will cover additional examples. Hopefully these together will provide a good sense of why we believe text-based factors can be a valuable addition to the factor investor’s arsenal.