I hope you’re enjoying the “Advanced Analytics Introduction” blog post series; the previous segments (Step One, Step Two, and Step Three) provide some helpful background. In the previous installment, I reviewed the practice of “shallow parsing” of natural language content, which is one of the first steps in text mining and analysis. In this post, I will explore the concept of topic mining, or “deep parsing,” of natural language content.
As a reminder, up to this point in the series we have focused on getting text ready for processing using various Natural Language Processing (NLP) techniques, such as tokenization, part-of-speech tagging, and stemming. We also reviewed word association concepts such as paradigmatic and syntagmatic relationships, and discussed techniques for finding similarities between bodies of text, such as the Vector Space Model and cosine similarity.
Topic mining is where the fun begins with text analytics, because we start to get deeper analytic value from text. Topic mining, also known as topic modeling, involves extracting prominent themes from a text corpus.
One obvious use case for topic mining is to extract key themes from large volumes of social media data. For example, you could use an Application Programming Interface (API) to collect Twitter posts that mention either your company name or your industry, to get a better understanding of the drivers of demand generation.
The main premise of topic mining is that we are working with an unknown set of text and are attempting to extract the hidden patterns and prominent themes that best represent the underlying meaning of that text.
Unlike supervised text classification, topic mining does not require any prior annotation or labeling of the documents. Since there is no upstream data classification done by humans, topic mining is considered to be an unsupervised Machine Learning process.
A topic model is a statistical algorithm that looks at the frequency and distribution of words and phrases within each document in a text corpus and automatically creates clusters of words called “topics.” Topics should contain the words that best characterize the overall contents of each document.
Figure 1. Illustration of the General Idea of Topic Modeling (analyticsvidhya.com, 2018)
A single document can contain multiple topics. Topics will have different distributions within and between documents, and usually, one dominant topic will emerge. Topics are not assigned names by the process. As humans, we would review the topics that were identified and make additional inferences. For instance, we can say the “yellow” topic below could be called “non-verbal communication.”
Figure 2. Illustration of Identifying Topics in a Document (analyticsvidhya.com, 2018)
There are two prominent models in the field of topic mining. The first is called “Latent Semantic Analysis” (LSA). LSA is a foundational technique in topic modeling.
Dictionary.com defines the word “latent” as follows: “present but not visible, apparent, or actualized; existing as potential.” The hidden semantics we are trying to find include the following:
Synonymy: Many ways to refer to the same thing (“car,” “automobile”)
Polysemy: Many meanings for the same word (“I lost the key to the house”, “He is singing in the wrong key”)
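The critical point to understand is that LSA re-expresses words and documents in terms of a smaller number of hidden (latent) dimensions, so that words used in similar contexts end up close together. This lets us measure the similarity of words and documents more reliably, in spite of the semantic issues mentioned above.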
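The LSA process begins with creating a Document-Term (Word) Matrix. Each cell of this matrix records whether a word occurs in a given document and assigns it a weight using a method called Term Frequency-Inverse Document Frequency (TF-IDF), which scores a word higher when it is frequent in that document but rare across the rest of the corpus.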
Figure 3. Illustration of a Document-Term Matrix (analyticsvidhya.com, 2018)
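To make the Document-Term Matrix step more concrete, here is a minimal sketch in Python. The three-sentence corpus, the whitespace tokenizer, and the function name build_tfidf_matrix are all assumptions made up for illustration. The weighting shown (term count divided by document length, multiplied by the log of the number of documents over the number of documents containing the term) is only one common TF-IDF variant; libraries typically apply smoothed or normalized versions.

```python
import math
from collections import Counter

# Toy corpus, invented for this illustration (not from the original post).
docs = [
    "the car is parked near the house",
    "the automobile is parked in the garage",
    "he is singing in the wrong key",
]


def tokenize(text):
    # Assumption: lowercase + whitespace split is enough for this toy example.
    return text.lower().split()


def build_tfidf_matrix(documents):
    """Return (vocabulary, matrix) where matrix[i][j] is the TF-IDF
    weight of vocabulary[j] in documents[i]."""
    tokenized = [tokenize(d) for d in documents]
    vocab = sorted({word for doc in tokenized for word in doc})
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(word for doc in tokenized for word in set(doc))

    matrix = []
    for doc in tokenized:
        counts = Counter(doc)
        row = []
        for term in vocab:
            tf = counts[term] / len(doc)              # term frequency
            idf = math.log(n_docs / doc_freq[term])   # inverse document frequency
            row.append(tf * idf)
        matrix.append(row)
    return vocab, matrix


vocab, matrix = build_tfidf_matrix(docs)

# Show the non-zero weights for each document.
for i, row in enumerate(matrix):
    weights = {t: round(w, 3) for t, w in zip(vocab, row) if w > 0}
    print(f"doc {i}: {weights}")
```

In this sketch, words such as “is” and “the” appear in every document, so their inverse document frequency is log(1) = 0 and they drop out of the matrix, while a word such as “car” that appears in only one document keeps a positive weight there.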
Next, Singular Value Decomposition (SVD) is applied to the Document-Term Matrix. SVD factors the matrix into smaller matrices that describe, from different perspectives, how strongly each document and each word relates to each latent topic.
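The SVD step can be sketched with NumPy as follows. This is an illustration under stated assumptions rather than a full LSA pipeline: the small count matrix, the term list, and the choice of k = 2 latent topics are all hypothetical, and raw counts are used instead of TF-IDF weights to keep the numbers easy to follow. The only library call relied on is numpy.linalg.svd.

```python
import numpy as np

# Hypothetical document-term matrix: rows are documents, columns are terms.
# Raw counts are used here for readability (an assumption of this sketch).
terms = ["car", "automobile", "engine", "song", "melody"]
A = np.array([
    [2.0, 0.0, 1.0, 0.0, 0.0],  # doc 0: says "car", never "automobile"
    [0.0, 2.0, 1.0, 0.0, 0.0],  # doc 1: says "automobile", never "car"
    [1.0, 1.0, 2.0, 0.0, 0.0],  # doc 2: uses both vehicle words
    [0.0, 0.0, 0.0, 2.0, 1.0],  # doc 3: about music
    [0.0, 0.0, 0.0, 1.0, 2.0],  # doc 4: about music
])


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Factor A into U (documents x topics), s (topic strengths, sorted
# from largest to smallest), and Vt (topics x terms).
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k strongest latent dimensions ("topics").
k = 2
doc_topic = U[:, :k] * s[:k]       # each row: one document in topic space
term_topic = Vt[:k, :].T * s[:k]   # each row: one term in topic space

print("raw similarity, doc 0 vs doc 1:   ", round(cosine(A[0], A[1]), 3))
print("latent similarity, doc 0 vs doc 1:", round(cosine(doc_topic[0], doc_topic[1]), 3))
print("latent similarity, doc 0 vs doc 3:", round(cosine(doc_topic[0], doc_topic[3]), 3))
```

Documents 0 and 1 share only the word “engine,” so their similarity on the raw matrix is low. After keeping the two strongest latent dimensions, both documents load on the same “vehicle” topic and become nearly identical, while remaining unrelated to the music documents. This is the synonymy effect (“car” versus “automobile”) described earlier.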