I hope you are enjoying the “Advanced Analytics Introduction” blog post series; here is a link to the previous segments (Step One, Step Two, Step Three, Step Four) to provide some helpful background. In the previous installment, I continued to review the practice of “shallow parsing” of natural language content. In this post, I will examine word association, mining, and analysis.
There are two kinds of relationships words can have within a sentence.
Words with a high context similarity have a paradigmatic relationship. Words with this kind of relationship can be substituted for each other and still result in a valid sentence. In the diagram below, the words “cat” and “dog” can be substituted for one another and the sentence still makes sense, because both words appear in similar contexts. By contrast, if we were to substitute “dog” with “computer,” the sentence would no longer be valid.
Figure 1. Example of a Paradigmatic Relationship (Cheng Xiang Zhai et al., 2016)
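To make “context similarity” a little more concrete, here is a minimal sketch in Python. The toy sentences, the helper names `context_words` and `jaccard`, and the choice of Jaccard overlap as the similarity score are all assumptions made for illustration; they are not taken from the cited source.

```python
# Toy sentences assumed for illustration only.
sentences = [
    "my cat eats fish on saturday",
    "his cat eats turkey on tuesday",
    "my dog eats meat on sunday",
    "his dog eats turkey on tuesday",
    "my computer runs code on monday",
]


def context_words(target, sentences):
    """Collect every word that shares a sentence with `target`."""
    context = set()
    for sentence in sentences:
        tokens = sentence.split()
        if target in tokens:
            context.update(t for t in tokens if t != target)
    return context


def jaccard(a, b):
    """Overlap between two sets, from 0 (disjoint) to 1 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


cat_dog = jaccard(context_words("cat", sentences), context_words("dog", sentences))
cat_computer = jaccard(context_words("cat", sentences), context_words("computer", sentences))

print(f"cat vs dog:      {cat_dog:.2f}")
print(f"cat vs computer: {cat_computer:.2f}")
```

In this toy data, “cat” and “dog” share far more context words than “cat” and “computer,” which is the signal a paradigmatic relationship relies on.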
Words that co-occur frequently with each other, yet are relatively infrequent on their own, have a syntagmatic relationship. Unlike paradigmatic relationships, here we count how many times two words occur together in a context and compare that count to how often each word occurs individually. In the diagram below we are focused on the word “eats” and, more specifically, on which other words its presence helps us predict:
Figure 2. Example of a Syntagmatic Relationship. (Cheng Xiang Zhai et al., 2016)
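One common way to compare co-occurrence against individual occurrence is pointwise mutual information (PMI). The sketch below is only an illustration: the cited source does not name PMI here, and the toy sentences and the `probability` and `pmi` helpers are assumptions of mine.

```python
import math

# Toy sentences assumed for illustration only.
sentences = [
    "my cat eats fish on saturday",
    "his cat eats turkey on tuesday",
    "my dog eats meat on sunday",
    "his dog eats turkey on tuesday",
    "my computer runs code on monday",
    "his computer runs code on tuesday",
]
tokenized = [set(s.split()) for s in sentences]
n_sentences = len(tokenized)


def probability(*words):
    """Fraction of sentences that contain all of the given words."""
    return sum(all(w in s for w in words) for s in tokenized) / n_sentences


def pmi(a, b):
    """log2 of how much more often a and b co-occur than chance predicts."""
    joint = probability(a, b)
    if joint == 0:
        return float("-inf")  # the two words never appear together
    return math.log2(joint / (probability(a) * probability(b)))


for word in ("turkey", "on", "code"):
    print(word, pmi("eats", word))
```

A word such as “on” appears in every sentence, so seeing “eats” tells us nothing new about it and its PMI is zero. “turkey” only ever appears alongside “eats” in this toy data, so it scores higher.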
Now that we understand what kind of relationships words can have within a body of text, which we will refer to as a “document” from this point onwards, let’s discuss the techniques to discover the similarity between documents.
Many of the approaches to finding similarities between documents are based on the principle of distributional semantics. The basic idea of distributional semantics is that documents with similar word distributions have similar meanings.
So, how can we represent text in a way to understand word distribution?
One of the first tasks is to turn text into numbers, or features, which can then be processed to uncover patterns and relationships. The most commonly used preprocessing step is called a “Bag-of-Words Model” (BoW). The BoW Model counts the frequency of words within a document. It is called a “bag of words” because the order of the words is discarded when the word distribution is created. The model only looks at word occurrence within a document.
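Here is a very simple example from an article on this subject by Jason Brownlee, Ph.D. He takes an excerpt from the book “A Tale of Two Cities” by Charles Dickens. Each line below consists of six words, and we are going to treat each of the four lines as a “document”. The entire “corpus” consists of 24 words, of which ten are unique. Those ten unique words form the vocabulary (it, was, the, best, of, times, worst, age, wisdom, foolishness), so each document is scored as a vector with ten positions: 1 if the word is present, 0 if it is not.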
“It was the best of times” = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
“it was the worst of times” = [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]
“it was the age of wisdom” = [1, 1, 1, 0, 1, 0, 0, 1, 1, 0]
“it was the age of foolishness” = [1, 1, 1, 0, 1, 0, 0, 1, 0, 1]
Once we score all the words in each document we can begin to see similarities and differences. Each document is now represented as a “Binary Vector” (more on this later when discussing the Vector Space Model).
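The scoring above can be reproduced with a few lines of Python. This is a minimal sketch that assumes lowercase text, whitespace tokenization, and a vocabulary ordered by first appearance; the variable and function names are my own.

```python
documents = [
    "it was the best of times",
    "it was the worst of times",
    "it was the age of wisdom",
    "it was the age of foolishness",
]

# Build the vocabulary in order of first appearance across the corpus.
vocabulary = []
for doc in documents:
    for word in doc.split():
        if word not in vocabulary:
            vocabulary.append(word)


def binary_vector(doc):
    """Mark each vocabulary word as present (1) or absent (0) in the document."""
    present = set(doc.split())
    return [1 if word in present else 0 for word in vocabulary]


vectors = [binary_vector(doc) for doc in documents]

print(vocabulary)
for doc, vector in zip(documents, vectors):
    print(f"{doc!r} = {vector}")
```

Note that word order plays no role in `binary_vector`; shuffling the words in a document would produce the same vector.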
Some of the initial tasks you can perform with the BoW model involve simply counting the number of times a word appears in a document. You can also calculate the word frequency within a document, that is, the word’s count divided by the total number of words in the document. This relative frequency can be read as the probability of picking that word if you drew one word at random from the document.
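As a rough illustration of counting and frequency, the sketch below treats the four Dickens lines as a single 24-word document (an assumption made purely so that the counts differ from word to word) and uses Python’s standard `collections.Counter`.

```python
from collections import Counter

# The four excerpt lines joined into one document, for illustration only.
document = (
    "it was the best of times "
    "it was the worst of times "
    "it was the age of wisdom "
    "it was the age of foolishness"
)

tokens = document.split()
counts = Counter(tokens)

# Relative frequency: count divided by the total number of words.
frequencies = {word: count / len(tokens) for word, count in counts.items()}

for word, count in counts.most_common():
    print(f"{word:12} count={count} frequency={frequencies[word]:.3f}")
```

The frequencies sum to 1, which is why they can be treated as a probability distribution over the words in the document.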
Raw word frequency becomes problematic in the BoW model because low-value words (e.g., “it”, “the”, and “of”) tend to have the highest counts. This causes the important words within the document to become lost in what is essentially word noise.
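This is where we use a different weighting technique called “Term Frequency-Inverse Document Frequency (TF-IDF),” which is a simple function that puts a higher emphasis on words that appear often within a document but in few documents overall. The term frequency (TF) part measures how often a word occurs in a document, and the inverse document frequency (IDF) part tells you how rare that word is across all documents.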
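Here is a minimal sketch of TF-IDF applied to the four Dickens “documents” from earlier. TF-IDF comes in several variants; this one assumes term frequency is the word count divided by document length and inverse document frequency is the natural log of (number of documents / number of documents containing the word). Libraries often use smoothed versions, so their numbers will differ.

```python
import math
from collections import Counter

documents = [
    "it was the best of times",
    "it was the worst of times",
    "it was the age of wisdom",
    "it was the age of foolishness",
]
tokenized = [doc.split() for doc in documents]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
document_frequency = Counter()
for tokens in tokenized:
    document_frequency.update(set(tokens))


def tf_idf(tokens):
    """Score each word in one document: term frequency times inverse document frequency."""
    counts = Counter(tokens)
    return {
        word: (count / len(tokens)) * math.log(n_docs / document_frequency[word])
        for word, count in counts.items()
    }


scores = [tf_idf(tokens) for tokens in tokenized]

for doc, doc_scores in zip(documents, scores):
    ranked = sorted(doc_scores.items(), key=lambda item: item[1], reverse=True)
    print(doc, "->", [(word, round(score, 3)) for word, score in ranked])
```

Under this variant, words that appear in every document (“it”, “was”, “the”, “of”) score exactly zero, while a word unique to one document, such as “best”, gets the highest weight in that document.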
The diagram below from datameetsmedia.com illustrates the use of TF-IDF. It considers the following three example documents:
“I love dogs”
“I hate dogs and knitting”
“Knitting is my hobby and my passion”