
Bag of Words: Approach, Python Code, Limitations

Contributor: QuantInsti

In this blog, we will study the Bag of Words method for creating vectorized representations of text data. These representations can then be used to perform Natural Language Processing tasks such as Sentiment Analysis. We’ll go over the relevant terms and the limitations of the method, and highlight its advantages. The topics covered are:

  • Bag of Words Approach
  • Limitations of Bag of Words
  • Bag of Words vs Word2Vec
  • Advantages of Bag of Words

Bag of Words is a simplified feature extraction method for text data that is easy to implement. It involves maintaining a vocabulary and calculating the frequency of words, ignoring various abstractions of natural language such as grammar and word sequence.
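As a minimal sketch of what ignoring word sequence means, the snippet below counts the words in two sentences that differ only in word order. The helper name bag_of_words and the two example sentences are illustrative assumptions, not part of the original post.

```python
from collections import Counter


def bag_of_words(text):
    """Return a word -> count mapping; word order is discarded."""
    return Counter(text.split())


# Two hypothetical sentences with the same words in a different order
a = bag_of_words("the dog chased the cat")
b = bag_of_words("the cat chased the dog")

print(a["the"])  # frequency of "the" in the first sentence
print(a == b)    # both sentences reduce to the same bag
```

Both sentences produce identical counts, so a model that sees only these counts cannot tell which animal did the chasing.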

Bag of Words Approach

The Bag of Words approach takes a document as input and breaks it into words. These words are also known as tokens, and the process is called tokenization.
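A minimal sketch of tokenization, assuming that splitting on whitespace is good enough for the text at hand. Other tokenizers treat punctuation and letter case differently, so this is only one possible choice.

```python
doc = "This is the first sentence in our corpus"

# str.split() with no arguments splits on any run of whitespace
tokens = doc.split()

print(tokens)
print(len(tokens))

# Whitespace splitting is case-sensitive: these are two different tokens
print("Few few".split())
```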

The unique tokens collected from all processed documents then form an ordered vocabulary. Finally, a vector whose length equals the size of the vocabulary is created for each document, with each value representing the frequency of the corresponding token in that document.
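The two steps just described can be sketched on a toy corpus. The documents and the variable names docs, vocabulary and vectors are made up for illustration; sorting alphabetically is just one way to fix the order of the vocabulary.

```python
# Hypothetical two-document corpus
docs = ["red apple red", "green apple"]

# Ordered vocabulary: sorted unique tokens across all documents
vocabulary = sorted({token for doc in docs for token in doc.split()})

# One vector per document, one position per vocabulary word
vectors = [[doc.split().count(word) for word in vocabulary] for doc in docs]

print(vocabulary)  # ['apple', 'green', 'red']
print(vectors)     # [[1, 0, 2], [1, 1, 0]]
```

Each position in a vector lines up with the word at the same position in the vocabulary, so the first document has one "apple", no "green" and two "red".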

Note that we ignore the order in which these words appear in the document, hence the name ‘Bag of Words’, which signifies an unordered collection of items in a bag. We can easily implement this approach in Python. Below is an example demonstrating the same.


# corpus is a collection of documents, here sentences
corpus = ['This is the first sentence in our corpus followed by one more sentence to demonstrate Bag of words',
          'This is the second sentence in our corpus with a FEW UPPER CASE WORDS and Few Title Case Words']

vocab = []  # empty list for vocabulary
total_words = 0  # to count total words in corpus

for doc in corpus:  # iterating through documents in corpus
    token_temp = doc.split()  # create tokens by splitting on whitespace
    total_words = total_words + len(token_temp)
    for token in token_temp:
        if token not in vocab:  # to check if word is already in vocab
            vocab.append(token)

vocab.sort()

print(vocab)  # Print all the words in vocabulary
print('There are {} words in vocabulary.'.format(len(vocab)))
print('A total of {} words is used in documents.'.format(total_words))


Note the difference between the total number of words and the length of the vocabulary. We’ll now calculate the frequencies of the words appearing in each document, store them in a dictionary, and use them to build the vector for each document.

from collections import Counter

import pandas as pd

# Assumption: l_ is not defined in this excerpt; it is taken here to be
# a list holding one word -> frequency dictionary per document
l_ = [Counter(doc.split()) for doc in corpus]

bow_vec = []  # list to store bag of words vectors

for i in range(len(corpus)):
    doc_ = corpus[i].split()
    doc_vec = []  # empty vector for each doc

    for word in vocab:  # iterate over vocab
        if word in doc_:
            doc_vec.append(l_[i][word])  # append freq if present
        else:
            doc_vec.append(0)  # else append zero
    bow_vec.append(doc_vec)

pd.set_option("display.max_columns", None)
df = pd.DataFrame(bow_vec, columns=vocab)
df  # bag of words vectorized representation
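A quick way to sanity-check vectors of this kind is to confirm that each one has one entry per vocabulary word and that its entries add up to the number of tokens in its document. The sketch below is self-contained and uses a shortened, hypothetical corpus rather than the one above. It builds the vectors with collections.Counter, which returns 0 for words that are absent.

```python
from collections import Counter

# Shortened, hypothetical corpus used only for this check
corpus = [
    "This is the first sentence in our corpus",
    "This is the second sentence in our corpus with a Few few words",
]
vocab = sorted({token for doc in corpus for token in doc.split()})

bow_vec = []
for doc in corpus:
    counts = Counter(doc.split())  # word -> frequency for this document
    bow_vec.append([counts[word] for word in vocab])

for doc, vec in zip(corpus, bow_vec):
    assert len(vec) == len(vocab)        # one slot per vocabulary word
    assert sum(vec) == len(doc.split())  # counts add up to the token count

row_sums = [sum(vec) for vec in bow_vec]
print(row_sums)  # [8, 13]
```

If either assertion fails, the vocabulary and the vectors were probably built with different tokenization rules.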


Stay tuned for the next installment in this series, in which the author will discuss Limitations of Bag of Words.

To download the complete Python code, visit QuantInsti: https://blog.quantinsti.com/bag-of-words/

Disclosure: Interactive Brokers

Information posted on IBKR Traders’ Insight that is provided by third-parties and not by Interactive Brokers does NOT constitute a recommendation by Interactive Brokers that you should contract for the services of that third party. Third-party participants who contribute to IBKR Traders’ Insight are independent of Interactive Brokers and Interactive Brokers does not make any representations or warranties concerning the services offered, their past or future performance, or the accuracy of the information provided by the third party. Past performance is no guarantee of future results.

This material is from QuantInsti and is being posted with permission from QuantInsti. The views expressed in this material are solely those of the author and/or QuantInsti and IBKR is not endorsing or recommending any investment or trading discussed in the material. This material is not and should not be construed as an offer to sell or the solicitation of an offer to buy any security. To the extent that this material discusses general market activity, industry or sector trends or other broad based economic or political conditions, it should not be construed as research or investment advice. To the extent that it includes references to specific securities, commodities, currencies, or other instruments, those references do not constitute a recommendation to buy, sell or hold such security. This material does not and is not intended to take into account the particular financial conditions, investment objectives or requirements of individual customers. Before acting on this material, you should consider whether it is suitable for your particular circumstances and, as necessary, seek professional advice.
