I hope you are enjoying the “Advanced Analytics Introduction” blog post series; the previous segments (Step One and Step Two) provide some helpful background. In the previous installment, I provided a detailed definition of text analytics and mining concepts. In this blog post, I review the practice of “shallow parsing” of natural language content, which is one of the first steps involved in text mining and analysis activities.
Natural language content analysis involves identifying language types and characteristics, in order to break apart a collection of text and documents into individual words (Cheng Xiang Zhai et al., 2016). In other words, we are teaching a machine to recognize words in a particular language in order to process them for further analysis.
In the 1968 sci-fi thriller “2001: A Space Odyssey” (Kubrick et al., 1968), there is a famous dialog between the astronaut Dave Bowman and the computer HAL:
DAVE: Open the pod bay doors, Hal.
HAL: I’m sorry, Dave. I’m afraid I can’t do that.
DAVE: What’s the problem?
HAL: I think you know what the problem is just as well as I do.
Unlike in 1968, today we take for granted the ability to issue commands to computers through assistants like Apple Siri and Amazon Alexa, because they can (for the most part) understand and process what we say. Natural language processing (NLP) can represent text in many different ways, such as a string of characters or a sequence of words.
In NLP, a collection of text is called a “corpus” (a “body” of text). A single item from a corpus is sometimes referred to as a “document” and can vary in size, e.g., a tweet versus a web page versus an academic research paper. Treating that text as a raw string of characters isn’t helpful for text analytics because individual words are not yet recognized.
We can start performing text analytics once we have a sequence of words, although this is more difficult in languages written without spaces between words, such as Chinese.
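As a minimal sketch of these representations, the Python snippet below stores a tiny corpus as a list of documents and views one document first as characters and then as words. The variable names and the Chinese sample sentence are illustrative assumptions, and splitting on whitespace is only a naive stand-in for real word segmentation.

```python
# A corpus is a collection of documents; here, two lines of the film dialog.
corpus = [
    "Open the pod bay doors, Hal.",
    "I'm sorry, Dave. I'm afraid I can't do that.",
]

document = corpus[0]

# View 1: a string of characters (no notion of words yet).
as_characters = list(document)

# View 2: a sequence of words, naively split on whitespace.
as_words = document.split()
print(as_words)

# Whitespace splitting finds no word boundaries in text written without spaces,
# so the whole sentence comes back as a single item.
chinese_document = "我喜欢咖啡"  # illustrative sample meaning "I like coffee"
print(chinese_document.split())
```

In the English result, punctuation stays attached to the neighboring words (“doors,” and “Hal.”), which is one reason a dedicated tokenization step is needed.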
So how do we take a text corpus and break it apart into a sequence of words? This process is known as text / word normalization and involves several steps.
The first step, tokenization, is the process of identifying where words and sentences end. In English, we look for blank spaces between words and periods at the end of sentences. The actual process handles many more complications, such as splitting contractions like “it’s” into separate tokens. Emojis also need to be translated into some text equivalent (datacamp.com, 2019).
Let’s take the following sample text that I put into an online demo:
“Tokenizers split contractions like “it’s” into separate words.”
Notice this also separated punctuation. We have now “tokenized” our sentence!
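For readers who want to try this locally, here is a rough sketch of a tokenizer built on a single regular expression. It is not the online demo’s algorithm; the function name `simple_tokenize` and the pattern are assumptions made for illustration, and production tokenizers handle many more cases (abbreviations, URLs, emojis, quoted text).

```python
import re
from typing import List

# Order matters: try an apostrophe-led clitic such as "'s" first, then a run
# of word characters, then any single non-space punctuation mark.
TOKEN_PATTERN = re.compile(r"['’]\w+|\w+|[^\w\s]")


def simple_tokenize(text: str) -> List[str]:
    """Split text into word, clitic, and punctuation tokens (toy version)."""
    return TOKEN_PATTERN.findall(text)


sample = "Tokenizers split contractions like it's into separate words."
print(simple_tokenize(sample))
```

With this pattern, the contraction “it's” comes back as two tokens (“it” and “'s”) and the final period becomes its own token.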
The second step, stemming, is “the process of reducing inflection in words to their root forms such as mapping a group of words to the same stem even if the stem itself is not a valid word in a language” (datacamp.com, 2019).
Let’s use the following test sentence for the stemming demonstration:
“Stemming gets to the root of all words in a sentence.”
The caveman-like output is strange, but remember, this step is for the benefit of machine learning (ML), not human understanding.
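To make the idea concrete, the sketch below implements a deliberately tiny suffix-stripping stemmer. It is a toy written for this post, not the Porter algorithm or the stemmer behind any particular demo; the suffix list, the minimum stem length of three, and the name `toy_stem` are all assumptions.

```python
import re
from typing import List

# Suffixes are checked in this order; the first match wins.
SUFFIXES = ("ing", "ed", "es", "s", "e")


def toy_stem(word: str) -> str:
    """Strip one common English suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling after "ing"/"ed": "stemm" -> "stem".
            if suffix in ("ing", "ed") and len(stem) >= 2 and stem[-1] == stem[-2]:
                stem = stem[:-1]
            return stem
    return word


def stem_sentence(sentence: str) -> List[str]:
    """Lowercase the sentence, drop punctuation, and stem every word."""
    return [toy_stem(word) for word in re.findall(r"\w+", sentence)]


print(stem_sentence("Stemming gets to the root of all words in a sentence."))
```

Even this crude rule set turns “sentence” into “sentenc”, which shows how a stem can be useful for grouping related words without being a valid word itself.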
The third step, lemmatizing, “reduces the inflected words properly ensuring that the root word belongs to the language. In lemmatization, a root word is called Lemma. A lemma (plural lemmas or lemmata) is the canonical form, dictionary form, or citation form of a set of words” (datacamp.com, 2019).
The following sample text demonstrates what happens in this process:
“He was drinking coffee and running at the same time.”
Notice that all the words in the lemmatization output are actual words in the English dictionary, unlike in the stemming example. Another interesting feature is how this transforms the state of being of the word “was” into the root word “be.”
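The contrast with stemming can be sketched with a lookup-based lemmatizer. The dictionary below is a hand-built, hypothetical stand-in covering only the sample sentence; real lemmatizers rely on a full lexicon and usually on part-of-speech information, so treat this as an illustration of the idea rather than a working tool.

```python
import re
from typing import List

# Hypothetical mini-lexicon mapping inflected forms to their dictionary form.
LEMMA_LOOKUP = {
    "was": "be",
    "drinking": "drink",
    "running": "run",
}


def toy_lemmatize(sentence: str) -> List[str]:
    """Lowercase the sentence and replace each known word with its lemma."""
    words = re.findall(r"\w+", sentence.lower())
    # Words missing from the lexicon are returned unchanged.
    return [LEMMA_LOOKUP.get(word, word) for word in words]


print(toy_lemmatize("He was drinking coffee and running at the same time."))
```

Because every replacement comes from a lexicon, each item in the result is a real English word, and “was” maps to “be” even though the two share no common suffix rule.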