Identifying Clusters in Data Sets Using Oracle Data Visualization
Oracle Data Visualization (DV) offers several built-in advanced analytics options that help analysts who are not data scientists explore patterns and trends in data sets. One such feature is the ability to add clusters to any visualization to find segments of the data that are closely related.
Clustering is typically considered an “Unsupervised Learning” analysis method. The differences between unsupervised and supervised learning analysis methods include:
Unsupervised Learning: Find the structure or relationships between different input data values without any pre-determined classification, using statistical methods related to distance or hierarchies within the data.
Supervised Learning: Use existing labeled data to train a model that can then classify other, similar data sets. The algorithm seeks to develop a function that maps the inputs to their respective targets. If the target is one of a set of categorical values, it is called a “classification problem.” If the target space is a continuous range, it is called a “regression problem.”
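The contrast between the two approaches can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the spend figures and labels are made up, and the "widest gap" split is a deliberately simple distance-based stand-in for a real clustering method, not anything Oracle DV is documented to use.

```python
from statistics import mean

# Hypothetical toy data: yearly spend per customer (made-up numbers).
spend = [120, 150, 135, 900, 950, 870]

# Supervised: labels already exist, so we learn a rule mapping spend -> label.
known_labels = ["low", "low", "low", "high", "high", "high"]
low_mean = mean(s for s, y in zip(spend, known_labels) if y == "low")
high_mean = mean(s for s, y in zip(spend, known_labels) if y == "high")
threshold = (low_mean + high_mean) / 2  # midpoint between the class means


def classify(value):
    """Predict a label for a new customer using the learned threshold."""
    return "high" if value > threshold else "low"


# Unsupervised: no labels are given; split the sorted values at the widest gap.
ordered = sorted(spend)
gaps = [b - a for a, b in zip(ordered, ordered[1:])]
cut = gaps.index(max(gaps)) + 1
groups = [ordered[:cut], ordered[cut:]]

print(classify(200), classify(800))
print(groups)
```

The supervised half needs `known_labels` to learn anything, while the unsupervised half finds the two groups from the distances between values alone.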
Some use cases for clustering and segmentation include:
Segment by purchase history
Segment by interaction with applications or website
Define common “personas” based on interests for targeted marketing
Group inventory by sales activity
Group inventory by manufacturing key performance indicators (KPIs)
Interpret sensor measurements
Detect activity types in motion sensors
Group images (facial recognition)
Identify groups via medical instrument monitoring
Oracle DV offers two options for clustering: “K-Means” and “Hierarchical.” The differences between these two options are highlighted below:
The following diagram describes the process used by the K-Means option to group similar data points into clusters:
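For readers who prefer code to a diagram, the usual textbook form of K-Means (often called Lloyd's algorithm) can be sketched as below. This is a minimal sketch, not Oracle DV's internal implementation, which the article does not describe. The function name `k_means`, the sample points, and the choice to start from the first k points are all assumptions made for illustration; real implementations typically use random or smarter seeding.

```python
import math


def k_means(points, k, max_iter=100):
    """Group points into k clusters; returns (labels, centroids)."""
    # Step 1: choose starting centroids. Assumption: use the first k points
    # so the example is deterministic.
    centroids = list(points[:k])
    labels = None
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Stop once the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return labels, centroids


# Hypothetical sample: two well-separated groups of 2-D points.
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (8.2, 7.9), (0.9, 1.1), (7.9, 8.1)]
labels, centroids = k_means(points, k=2)
print(labels)
print(centroids)
```

Steps 2 and 3 repeat until no point changes cluster. Because the result depends on the starting centroids and on the chosen number of clusters, trying several settings, as DV allows, is a reasonable habit.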
Here’s an example from an Oracle DV canvas where clusters have been added by clicking the visualization options in the upper right corner and then selecting “Add Clusters.” The default is five clusters with the K-Means method, but both settings can be changed in the panel on the lower left of the DV interface.
This visualization shows campaign donations by amount and by average per person, for each zip code. The clusters of zip codes could be tied back to voting data to see whether there are other patterns relating donations to candidate vote totals:
It is a worthwhile exercise to test different clustering options on the fly with Oracle DV to determine the best way to segment the data set. Clustering then becomes the starting point for further in-depth analysis to help drive business decisions around targeting customers, delivering tailored marketing campaigns, or any other activity that benefits from finding similar cohorts within a data set.