Analyzing large amounts of unstructured text is a capability at the heart of most activities in the data science and advanced analytics fields today. In this first installment of a blog series, I introduce the topic of text analytics and describe how it relates to this evolving marketplace.
When did data science become a “hot topic?”
Searches for the term “data science” on Google are currently at an all-time high, according to Google Trends. Meanwhile, searches for the terms “business intelligence” and “BI” seem to be on a slow and steady decline. If we look closely at the graph below from Google Trends, we can see a crossover of these two terms in 2015, followed by a sharp increase in searches for “data science” thereafter.
What made the year 2015 so important for the data science field? Four key things happened that year:
1. Speech recognition experienced a major scientific breakthrough
According to the article “A Brief History of Data Science” at dataversity.com, “In 2015, using Deep Learning techniques, Google’s speech recognition, Google Voice, experienced a dramatic performance jump of 49 percent.”
2. Increasing investment in, and democratization of, artificial intelligence (AI) and machine learning (ML) provided wider access to data science capabilities
At the end of 2015, Bloomberg author Jack Clark wrote that it had been a landmark year for artificial intelligence (AI). For example, “Within Google, the total of software projects using AI increased from ‘sporadic’ to more than 2,700 projects over the year.”
3. Dramatically less expensive solutions were released
In the same article, Clark points out that “There are…more plentiful datasets and free or inexpensive software development tools for researchers to work with. Thanks to this, a crucial class of learning technology, known as neural networks, has gone from being prohibitively expensive to relatively cheap.”
4. Data science software (finally) produced highly accurate results
Lastly, in the same article, Clark points out that, “In tests, error rates are down to about five percent, roughly on par with a human being’s performance.”
“Advanced analytics” in the data science context
The activities involved in data science today are most often described using the term “advanced analytics.” Performance Architects recommends using the generally accepted definition of advanced analytics from Gartner:
“The autonomous or semi-autonomous examination of data or content using sophisticated techniques and tools, to discover deep insights, make predictions, or generate recommendations.” Techniques in this category include:
Network and cluster analysis
Complex event processing
In summary, the concept of advanced analytics refers to a set of ideas and advanced techniques to gain deeper insight into data, often using data that is difficult to process.
The following Euler diagram from datascience.com provides a great representation of the overlap in the spectrum of topics between traditional BI and advanced analytics. The diagram shows that advanced analytics focuses more on answering questions about future outcomes and behaviors, whereas traditional BI offers historical data analysis.
Most advanced analytics efforts focus on understanding data that does not conform to traditional relational database management system (RDBMS) structures and that requires more processing to extract meaningful information.
The diagram below from researchgate.com demonstrates the types of data involved in enterprise analytics today, and adds the dimensions of volume, variety, and velocity. As we move away from the darker shades, traditional RDBMSs can no longer store or process the data effectively.
Call center logs, user surveys, Twitter feeds, and blogs are all strong candidates for advanced analytics. What most of the items circled in the diagram above have in common is that they consist largely of text data.
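To give a feel for what “extracting meaningful information” from unstructured text can look like at its simplest, here is a minimal Python sketch that turns free-form call center log lines into structured term frequencies. The log snippets and the stopword list are hypothetical, illustrative examples only, not data from any real system:

```python
from collections import Counter
import re

# Hypothetical call center log snippets -- illustrative sample data only
logs = [
    "Customer reported billing error on latest invoice",
    "Billing dispute escalated; customer requested refund",
    "Password reset requested; customer locked out of account",
]

def top_terms(texts, n=3, stopwords=frozenset({"on", "of", "out"})):
    """Lowercase, tokenize, drop stopwords, and count word frequencies."""
    tokens = []
    for text in texts:
        tokens.extend(t for t in re.findall(r"[a-z]+", text.lower())
                      if t not in stopwords)
    return Counter(tokens).most_common(n)

print(top_terms(logs))  # -> [('customer', 3), ('billing', 2), ('requested', 2)]
```

Real text analytics pipelines go far beyond word counts (stemming, entity extraction, sentiment scoring, topic modeling), but the underlying pattern is the same: transform unstructured text into structured, countable features that downstream analysis can work with.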
Additional blog posts on text and advanced analytics concepts will follow; please contact email@example.com if you have any questions or need further help!