The Fundamentals of Machine Learning Using Oracle Analytics Cloud (OAC)
Oracle Analytics Cloud (OAC) is a complete platform offering that spans the analytical requirements of an enterprise from IT-governed data models to self-service exploration capabilities connected to a wide range of data sources.
One of the key components of OAC is its comprehensive, yet user-friendly, ability to use machine learning (ML) techniques to train and apply models to datasets. This allows us to gain insight and predictive capabilities that go beyond regular business intelligence analysis.
The purpose of this blog post is to provide background information on machine learning in general, as well as to describe how to utilize those capabilities within OAC.
Machine Learning Background
Machine learning is a field of computer science that focuses on developing and evaluating algorithms that identify meaningful patterns in data. The algorithm reads a dataset, applies statistical functions to it, and returns a software model that stores the patterns. That model can then be applied to another dataset to predict outcomes as a number within a range of values, a binary true/false identifier, or a classification result within a group of defined attribute values.
Oracle offers three different methods of supervised machine learning out-of-the-box in the Data Visualization (DV) component of OAC. Supervised learning trains a model on a representative dataset made up of a set of predefined attributes and a target column. The three options are Numeric Prediction, Multi-Classifier, and Binary Classifier.
Numeric Prediction, typically referred to as regression, predicts a value within a continuous range. Examples include:
Predicting store sales based on location, surrounding demographics, and nearest competition
Estimating what price a home will sell for based on recent sales, square footage, and age
Determining how many days until a customer will return to a store based on most recent purchases or demographics obtained from credit card account
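To make the idea of numeric prediction concrete, here is a minimal Python sketch of one-feature linear regression fitted by ordinary least squares. It is not how OAC implements its scripts. The function names `fit_line` and `predict` and the home-price figures are made up for illustration.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def predict(model, x):
    """Apply the fitted line to a new feature value."""
    slope, intercept = model
    return slope * x + intercept

# Made-up training data: square footage -> sale price (in thousands).
square_feet = [1000, 1500, 2000, 2500]
price_k = [200, 275, 350, 425]

model = fit_line(square_feet, price_k)
estimate = predict(model, 1800)  # a value on the continuous price scale, about 320
```

The prediction is a number on a continuous scale rather than a label, which is what separates this technique from the two classification options.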
Common algorithms include:
Binary Prediction determines values that can only have one of two states, typically “true” or “false.” Examples include:
Identifying employees who are most likely to leave a company
Predicting if a subscription holder will renew or not
Deciding if a web page belongs in a search result
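As an illustration of binary prediction, the sketch below fits a one-feature logistic regression with plain gradient descent and thresholds the predicted probability at 0.5. This is an assumption-laden toy, not OAC's implementation. The names `train_logistic` and `predict_renewal`, the learning rate, the epoch count, and the subscription data are all hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, ys, learning_rate=0.1, epochs=2000):
    """Fit p(y=1 | x) = sigmoid(w*x + b) by batch gradient descent on log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Difference between predicted probability and the 0/1 label.
        residuals = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= learning_rate * sum(r * x for r, x in zip(residuals, xs)) / n
        b -= learning_rate * sum(residuals) / n
    return w, b

def predict_renewal(model, x, threshold=0.5):
    """Return True when the predicted probability reaches the threshold."""
    w, b = model
    return sigmoid(w * x + b) >= threshold

# Made-up data: weekly usage hours -> did the subscriber renew (1) or not (0)?
weekly_hours = [1, 2, 3, 4, 6, 7, 8, 9]
renewed = [0, 0, 0, 0, 1, 1, 1, 1]

model = train_logistic(weekly_hours, renewed)
light_user = predict_renewal(model, 2)
heavy_user = predict_renewal(model, 8)
```

The model produces a probability internally, and the threshold turns it into one of exactly two states.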
Common algorithms include:
Multi-Class Prediction predicts values that belong to a limited, predefined set of permissible values. Examples include:
Predicting susceptibility levels for certain diseases
Image recognition to classify objects in a photo
Predicting which component of a machine will most likely fail first
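For multi-class prediction, one of the simplest possible techniques is a nearest-centroid classifier: average the training rows of each class, then assign a new row to the class with the closest average. The sketch below uses it purely as an illustration and makes no claim about which algorithms OAC ships. The function names, sensor readings, and component labels are invented.

```python
from collections import defaultdict
from math import dist

def train_centroids(rows, labels):
    """Average the feature vectors of each class into one centroid per class."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    return {
        label: tuple(sum(col) / len(col) for col in zip(*members))
        for label, members in grouped.items()
    }

def classify(centroids, row):
    """Return the class whose centroid is closest (Euclidean distance)."""
    return min(centroids, key=lambda label: dist(centroids[label], row))

# Made-up readings on a 0-10 scale: (vibration, heat) -> component that failed.
readings = [(8, 2), (9, 3), (2, 9), (3, 8), (2, 2), (1, 3)]
failed = ["bearing", "bearing", "motor", "motor", "belt", "belt"]

centroids = train_centroids(readings, failed)
first_guess = classify(centroids, (7, 3))
second_guess = classify(centroids, (3, 7))
```

The result is always one of the labels seen during training, which is the "limited, predefined set" the definition refers to. With real data the features would usually need to be scaled to comparable ranges before distances are meaningful.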
Common algorithms include:
The Machine Learning Process
The typical process for creating and applying a machine learning model includes the following steps:
Formulate a question
Acquire the data
Cleanse the data
Train the model
Apply the model
Analyze the results
This process is represented in the following diagram, known as the CRISP-DM process (Cross-Industry Standard Process for Data Mining).
The first step is to formulate a question to be answered by machine learning.
Begin with the end in mind – what is the objective?
What information are you seeking that will improve your organization in a meaningful way?
Ask a specific question with a measurable answer (that is, avoid questions with vague answers)
Decide whether your question fits into a numeric, binary, or multi-classification model
If possible, reformulate a question from binary to numeric (change the question from “true or false” to a numeric range to improve confidence in the answer)
Step two is to acquire the data required to answer the question.
The most critical component of any machine learning process is the dataset used to train the model. The time spent on data acquisition, profiling, cleansing, and attribute selection will likely exceed the time spent on all other phases of the machine learning process. Choosing the right attributes can greatly improve the accuracy and performance of a model. Simplicity can lead to better accuracy and improved ability to explain the model to others.
Next, in step three, cleanse the data so that the machine learning model can be trained on it more efficiently.
Data set cleansing and profiling can take a variety of forms. Oracle DV offers a simple way to do this by importing data directly from a spreadsheet into a local environment. Using its own embedded machine learning models to examine the data, DV can suggest enhancements that combine columns, deconstruct them, or apply formulas to them. The example below shows how OAC can add useful information to a dataset that has a “City” column, such as population, latitude, longitude, and other data elements.
In addition, column formulas can be applied on the fly to the dataset with a simple right-click and selection of any of the functions below:
With any new data set, a significant amount of time should be spent understanding the contents of each data column. Review all values for consistency of data types, missing values, and outliers. If there are outliers, use the following guidelines to determine how to handle them:
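Before any guideline for handling outliers can be applied, the missing values and outliers have to be found. The following is a minimal Python sketch of such a profiling pass for one numeric column, not an OAC feature. The function name `profile_column`, the sample values, and the 1.5 × IQR cutoff (a common convention, not something prescribed here) are assumptions for illustration.

```python
from statistics import quantiles

def profile_column(values):
    """Count missing entries and flag outliers with the 1.5 * IQR rule."""
    present = [v for v in values if v is not None]
    q1, _, q3 = quantiles(present, n=4)  # quartile cut points
    spread = q3 - q1                     # interquartile range
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return {
        "missing": len(values) - len(present),
        "outliers": [v for v in present if v < low or v > high],
    }

# Made-up "Sales" column with two gaps and one suspicious value.
sales = [120, 130, None, 125, 128, 122, 900, 127, None, 124]
report = profile_column(sales)
```

Whether a flagged value such as 900 is a data-entry error or a genuine extreme is a judgment call that still requires knowledge of the data.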
Once the data set is cleansed, step four is to train the machine learning model. The purpose of training a machine learning model is to take a sample dataset with labeled data and process it through a script to generate a model that can be applied to other data sets for prediction and scoring purposes. In Oracle DV, this process can be accomplished using built-in data flows and pre-defined machine learning models.
The steps are very simple:
Create a new DV data flow
Select the data set for training the model
Add a new “Train Model” step to the data flow and select the type (Binary, Numeric, or Multi Classification), then choose the method within the model type
Select a target data column and set the parameters for the model accordingly
Save the model with an appropriate name and execute the data flow
In this example, the “Train Numeric Prediction” step is added to a data flow.
Then, the “Linear Regression” script is chosen with the “Sales” metric as the numeric “Target” column to predict.
After saving and executing the data flow, a new machine learning model is created and available for the fifth step in the process: applying the model. Once a model has been trained, make sure to apply that model to a different data set in order to predict a target column value.
In Oracle DV, this process can be accomplished using built-in data flows and applying a model.
The steps are very simple:
Create a new data flow
Select the data set to apply the model to
Add a new “Apply Model” step to the data flow and select the model from the list of available models
Assign a target data column and set the parameters for the model
Save the data flow with an appropriate name and execute it
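The separation between a training flow that saves a named model and an apply flow that loads it can be sketched outside OAC as well. The Python below is only an analogy under stated assumptions: `train_flow`, `apply_flow`, the JSON file format, the `Footfall` column, and the numbers are all hypothetical, and none of it reflects OAC's internal storage or API.

```python
import json
import os
import tempfile
from statistics import mean

def train_flow(rows, target, feature, model_path):
    """'Train' flow: fit a one-feature linear model and save it under a name."""
    xs = [r[feature] for r in rows]
    ys = [r[target] for r in rows]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    model = {"feature": feature, "target": target,
             "slope": slope, "intercept": y_bar - slope * x_bar}
    with open(model_path, "w") as fh:
        json.dump(model, fh)

def apply_flow(rows, model_path):
    """'Apply' flow: load the saved model and add a prediction column."""
    with open(model_path) as fh:
        model = json.load(fh)
    column = "Predicted " + model["target"]
    return [
        {**r, column: model["slope"] * r[model["feature"]] + model["intercept"]}
        for r in rows
    ]

# Made-up data: the training set has the target column, the new set does not.
training = [{"Footfall": 100, "Sales": 1000}, {"Footfall": 200, "Sales": 1900},
            {"Footfall": 300, "Sales": 3100}]
new_stores = [{"Footfall": 250}]

path = os.path.join(tempfile.mkdtemp(), "sales_model.json")
train_flow(training, target="Sales", feature="Footfall", model_path=path)
scored = apply_flow(new_stores, path)
```

The point of the split is that the data set being scored never needs the target column, only the attributes the model was trained on.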
After applying the model to a data set, the sixth step is to analyze the results. There is no magic bullet with machine learning. There will be no “perfect” model that guarantees 100% prediction accuracy. You should expect to train and execute models multiple times using different parameters and different model script types to assess which combination produces the optimum model.
When training a model, a subset of the training data is held out and not used to fit the model. This held-out data is then used to score the model and determine its accuracy. Metrics like “Accuracy” and “Precision” are calculated by comparing the actual values of the target column with the values predicted by the model. In DV, there is an option to inspect an ML model to assess its accuracy. The example below shows various metrics for the model (in this case, a numeric prediction model).
The “Mean Absolute Error” of your model refers to the mean of the absolute values of each prediction error on all instances of the test data set
The prediction error is the difference between the actual value and the predicted value for that instance
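The scoring ideas above (holding data back, then comparing actual and predicted values) can be written down in a few lines. This is a minimal sketch with assumed names and made-up numbers. The 25% holdout fraction and the fixed seed are arbitrary choices, and OAC's own split and metric calculations may differ in detail.

```python
import random

def split_holdout(rows, holdout_fraction=0.25, seed=42):
    """Shuffle a copy of the rows and set aside a fraction for scoring."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - holdout_fraction))
    return shuffled[:cut], shuffled[cut:]

def mean_absolute_error(actual, predicted):
    """Mean of |actual - predicted| over all scored instances."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def accuracy(actual, predicted):
    """Share of predictions that match the actual value."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def precision(actual, predicted, positive=True):
    """Share of positive predictions that were actually positive."""
    flagged = [a for a, p in zip(actual, predicted) if p == positive]
    return sum(a == positive for a in flagged) / len(flagged) if flagged else 0.0

train_rows, test_rows = split_holdout(list(range(20)))

# Numeric model: errors are 10, 10, 5, and 15, so the mean is 10.0.
mae = mean_absolute_error([100, 150, 200, 250], [110, 140, 195, 265])

# Binary model: made-up actual and predicted labels for eight held-out rows.
actual = [True, True, False, False, True, False, False, True]
predicted = [True, False, False, True, True, False, False, True]
acc = accuracy(actual, predicted)
prec = precision(actual, predicted)
```

Mean absolute error applies to numeric prediction models, while accuracy and precision apply to the binary and multi-class cases.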
As stated before, no machine learning model is perfect. If one were, it would likely be possible to formulate the same prediction without any machine learning. You would not use machine learning to predict whether a ball would rise or fall if dropped from a building. For that reason, use machine learning when the complexity of the question goes beyond simple analysis techniques.
What can affect machine learning outcomes?
Training sample size
Correlation versus causation
The human element
The last bullet is the one that may cause the most difficulty in creating a useful machine learning model. There are two different ways humans can affect the machine learning process:
First, the quality of the model will likely be a function of the subject matter experience and data analysis skills of the person creating it. Second, if the data is related to human experiences, there are numerous factors that may make it difficult to find a representative sample data set. Genetics, social structures and influences, diet, regional differences, and so on can all influence sample data. Be sure to try to account for these factors when selecting data and creating models.
Machine learning can provide useful insight to your organization. If you see a possible opportunity to utilize OAC, the solution offers a very easy way to get started quickly with the process. Oracle also offers a very useful training course on Udemy, entitled “Udemy Oracle OAC Machine Learning Course,” that provides hands-on experience with ML in the OAC DV environment.
Interested in learning more about our experience with machine learning in an OAC environment? Contact firstname.lastname@example.org and we’d be happy to share our knowledge with you.