Deep Parsing of Natural Language Content: Step Four in Advanced Analytics Introduction

I hope you’re enjoying the “Advanced Analytics Introduction” blog post series; here is a link to the previous segments (Step One, Step Two, and Step Three) to provide some helpful background. In the previous installment, I reviewed the practice of “shallow parsing” of natural language content, which is one of the first steps involved in text mining and analysis activities. In this post, I will explore the concept of topic mining or “deep parsing” of natural language content.

Just as a reminder, up to this point in the series, we have been working with getting text ready for processing using various Natural Language Processing (NLP) techniques, such as tokenization, part-of-speech tagging, stemming, and others.  We also reviewed word association concepts such as paradigmatic and syntagmatic relationships and discussed techniques to find similarities between bodies of text such as the Vector Space Model and Cosine Similarity Model.

Topic mining is where the fun begins with text analytics, because we start to get a deeper analytic value from text.  Topic mining, also known as topic modeling, involves extracting prominent themes from a text corpus.

One obvious use case for topic mining is to extract key themes from large volumes of social media data. For example, you could use an Application Programming Interface (API) to collect Twitter posts that mention your company name or industry, and then mine them to better understand what drives demand generation.

The main premise of topic mining is that we are working with an unknown set of text and are attempting to extract hidden patterns and prominent themes that best represent the underlying meaning of text.

Unlike supervised text classification, topic mining does not require any prior annotations or labeling of the documents. Since there is no upstream data classification done by humans, topic mining is considered to be an unsupervised Machine Learning process.

A topic model is a statistical algorithm that looks at the frequency and distribution of words and phrases within each document in a text corpus and automatically creates clusters of words called “topics.” Topics should contain words that best characterize the overall contents of each document.

Figure 1. Illustration of the General Idea of Topic Modeling (, 2018)

A single document can contain multiple topics. Topics will have different distributions within and between documents, and usually, one dominant topic will emerge. Topics are not assigned names by the process. As humans, we would review the topics that were identified and make additional inferences. For instance, we can say the “yellow” topic below could be called “non-verbal communication.”

Figure 2. Illustration of Identifying Topics in a Document (, 2018)

There are two prominent models in the field of topic mining. The first, “Latent Semantic Analysis” (LSA), is a foundational technique in topic modeling. The word “latent” is defined as follows: “present but not visible, apparent, or actualized; existing as potential.” The hidden semantics we are trying to find include the following:

  • Synonymy: Many ways to refer to the same thing (“car,” “automobile”)

  • Polysemy: Many meanings for the same word (“I lost the key to the house”, “He is singing in the wrong key”)

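The critical point to understand is that LSA re-expresses words and documents in terms of a small number of hidden concepts, so that we can measure the similarity of words by the contexts they share rather than by their surface form. This helps with the semantic issues mentioned above.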

The LSA process begins with creating a Document-Term (Word) Matrix. This identifies the existence of a word within a given document and assigns it a weight using a method called Term Frequency-Inverse Document Frequency (TF-IDF).

Figure 3. Illustration of a Document-Term Matrix (, 2018)
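
To make the Document-Term Matrix a little more concrete, here is a minimal Python sketch of my own, not taken from any particular library. The three sample sentences, the `tfidf_matrix` helper, and the specific TF-IDF variant (term count divided by document length, multiplied by the natural log of the number of documents over the number of documents containing the term) are all assumptions for illustration; real toolkits use several slightly different weighting formulas.

```python
import math
from collections import Counter

# Hypothetical mini-corpus, invented for illustration only.
docs = [
    "rain delayed the football game",
    "the forecast calls for rain and snow",
    "a touchdown won the football game",
]

def tfidf_matrix(documents):
    """Return (vocabulary, matrix): one row per document, one column per term."""
    tokenized = [doc.lower().split() for doc in documents]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        row = []
        for word in vocab:
            tf = counts[word] / len(tokens)           # term frequency
            idf = math.log(n_docs / doc_freq[word])   # inverse document frequency
            row.append(tf * idf)
        matrix.append(row)
    return vocab, matrix

vocab, matrix = tfidf_matrix(docs)
for doc, row in zip(docs, matrix):
    best = max(range(len(vocab)), key=lambda j: row[j])
    print(f"{doc!r}: highest-weighted term is {vocab[best]!r}")
```

With this variant, a word that appears in every document (such as “the” in the sample sentences) receives a weight of zero, which is the down-weighting of very common words that TF-IDF is meant to provide.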

Next, Singular Value Decomposition (SVD) is applied to the Document-Term Matrix. SVD factors the matrix into three smaller matrices: one relating documents to topics, one holding the relative strength of each topic, and one relating topics to words.

Figure 4. Illustration of Singular Value Decomposition Matrices (, 2018)
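
As a rough illustration of the decomposition step, the sketch below approximates the two strongest singular triplets of a tiny document-term matrix. It is written in pure Python using power iteration with deflation, which stands in for a library SVD routine (in practice you would more likely call something like NumPy's `numpy.linalg.svd`). The count matrix, the term list, and the helper names `truncated_svd` and `top_terms` are hypothetical, and raw counts are used instead of TF-IDF weights to keep the numbers easy to read.

```python
import math
import random

terms = ["football", "touchdown", "rain", "forecast", "snow"]
# Hypothetical document-term counts: rows are documents, columns follow `terms`.
counts = [
    [2, 1, 0, 0, 0],
    [1, 2, 1, 0, 0],
    [0, 0, 1, 2, 1],
    [0, 0, 2, 1, 2],
]

def transpose(matrix):
    return [list(col) for col in zip(*matrix)]

def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def normalize(vector):
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector], norm

def truncated_svd(matrix, k, iterations=500, seed=0):
    """Approximate the k largest singular triplets (sigma, u, v) of matrix."""
    rng = random.Random(seed)
    residual = [[float(x) for x in row] for row in matrix]
    components = []
    for _ in range(k):
        residual_t = transpose(residual)
        v, _norm = normalize([rng.random() for _ in residual[0]])
        # Power iteration on (A^T A) converges to the top right singular vector.
        for _ in range(iterations):
            v, _norm = normalize(matvec(residual_t, matvec(residual, v)))
        u, sigma = normalize(matvec(residual, v))
        components.append((sigma, u, v))
        # Deflation: subtract the component just found before the next pass.
        residual = [[a - sigma * ui * vj for a, vj in zip(row, v)]
                    for row, ui in zip(residual, u)]
    return components

def top_terms(term_vector, n):
    """Terms with the largest absolute loading in a topic-term vector."""
    order = sorted(range(len(terms)), key=lambda j: abs(term_vector[j]), reverse=True)
    return [terms[j] for j in order[:n]]

components = truncated_svd(counts, k=2)
for index, (sigma, u, v) in enumerate(components, start=1):
    print(f"Topic {index}: strength {sigma:.2f}, top terms {top_terms(v, 3)}")
```

Each triplet corresponds to one “topic”: `sigma` is its strength, `u` says how strongly each document expresses it, and `v` says how strongly each term loads on it.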

The output of LSA in its simplest form is a list of topics and the words associated with them. For instance, if we were to do topic mining using data from the Associated Press, we would likely get topics like this:

Topic 1 (sports) – (“football”, “baseball”, “home run”, “touchdown”, “rain”)

Topic 2 (weather) – (“rainy”, “forecast”, “snow”, “rain”)

The diagram below demonstrates the use of a simple Heat Map to display the output of LSA, based on data from the Associated Press. Every column corresponds to a document and every row corresponds to a word. Each cell stores the frequency of a word in a document; darker cells indicate higher word frequencies. In this diagram, topics appear as darker patterns that are grouped tightly together (i.e., air, pollution, power, environmental). An animated version by the original author is also available here.

Figure 5. Illustration of the use of Heat Map visualization LSA (, 2018)

The final topic model we will discuss is called Latent Dirichlet Allocation (LDA). LDA is one of the most popular topic modeling techniques used today and is widely seen as a significant improvement over LSA. It addresses the problem of word ambiguity between two closely related topics by modeling each document as a mixture of topics and each topic as a probability distribution over words, so an ambiguous word is assigned according to the other words and topics around it.

Let’s review our topic example from LSA.  I purposefully put the word “rain” in both topics.  LSA doesn’t do a particularly good job with these types of situations.

Topic 1 (sports) – (“football”, “baseball”, “home run”, “touchdown”, “rain”)

Topic 2 (weather) – (“rainy”, “forecast”, “snow”, “rain”)

LDA uses an iterative process based on statistical probability to determine which topic is the best “fit” for a word. The following is a very high-level summary describing the process of LDA:

  • Step 1 – Specify the number of topics

  • Step 2 – Randomly assign topics to each word in each document

  • Step 3 – Count how many times each word is assigned to each topic

  • Step 4 – Count how many words in each document are assigned to each topic

The algorithm goes through each word iteratively and reassigns the word to a topic based on:

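  • The probability that the word belongs to the topic, based on how often the word is already assigned to that topic

  • The probability that the topic appears in the document, based on how many of the document’s other words are already assigned to that topic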
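
The steps above can be sketched in a few dozen lines of Python. This is a simplified illustration of one common way to fit LDA (collapsed Gibbs sampling), not a production implementation and not necessarily what any given library does internally. The sample documents, the `lda_gibbs` function name, and the smoothing values `alpha` and `beta` are my own assumptions; the smoothing terms simply keep any probability from being exactly zero.

```python
import random
from collections import Counter

# Hypothetical, already-tokenized documents invented for illustration.
docs = [
    "football touchdown football game rain".split(),
    "baseball game touchdown football baseball".split(),
    "forecast rain snow rainy forecast".split(),
    "snow forecast rain rainy snow".split(),
]

def lda_gibbs(docs, n_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Step 1 is the n_topics argument; alpha and beta are assumed smoothing values."""
    rng = random.Random(seed)
    vocab_size = len({word for doc in docs for word in doc})

    # Step 2: randomly assign a topic to every word occurrence.
    assignments = [[rng.randrange(n_topics) for _ in doc] for doc in docs]

    # Steps 3 and 4: count word-per-topic and topic-per-document assignments.
    word_topic = [Counter() for _ in range(n_topics)]
    doc_topic = [[0] * n_topics for _ in docs]
    topic_total = [0] * n_topics
    for d, doc in enumerate(docs):
        for word, k in zip(doc, assignments[d]):
            word_topic[k][word] += 1
            doc_topic[d][k] += 1
            topic_total[k] += 1

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Take the word out of its current topic before re-scoring it.
                k = assignments[d][i]
                word_topic[k][word] -= 1
                doc_topic[d][k] -= 1
                topic_total[k] -= 1
                # Score = (topic's share of this document) x (word's share of the topic).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (word_topic[t][word] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                # Reassign by sampling a topic in proportion to its score.
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                word_topic[k][word] += 1
                doc_topic[d][k] += 1
                topic_total[k] += 1
    return assignments, word_topic, doc_topic

assignments, word_topic, doc_topic = lda_gibbs(docs, n_topics=2)
for k, counter in enumerate(word_topic, start=1):
    print(f"Topic {k}:", [word for word, count in counter.most_common(3) if count > 0])
```

Because the reassignment is random, results on a corpus this small can vary with the seed; on realistic volumes of text the topics tend to stabilize after enough passes.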

The following diagram depicts a possible visualization of topic outputs from LDA. The bubble chart depicts the topics, and the bar chart displays the words within a topic and their frequency.

Figure 6. Illustration of LDA Topic Mining Outcomes – Showing Python Library for Interactive Topic Model Visualization (pyLDAvis) (, 2018)

Further hands-on examples of LDA can be found at these sites:

Additional blog posts on text and advanced analytics concepts will follow; please get in touch if you have any questions or need further help!
