Make the Most of Your Data with Anomaly Detection and Machine Learning

A while ago, my husband received a call from his credit card company asking him if he had made a gas purchase in Georgia.

He hadn’t. He’s never been to Georgia.

Fortunately, the credit card company was able to reverse the charge. My husband isn’t the first person to receive a call like this. Credit card companies routinely monitor purchasing habits and flag purchases that are out of character.

Making the most of your data

With computers and smart devices everywhere, nearly everyone captures data, whether personal or business-specific. For businesses, capturing and using that data is critical.

Like credit card companies, your company can use its data to detect changes in normal patterns of customer behavior, such as a gas purchase in Georgia by someone who doesn’t live there and has never traveled there.

This is anomaly detection.

Anomaly detection can be used to improve application performance, product quality, user experience, and more.

Anomaly detection

According to the Oxford Dictionary, an anomaly is “something that deviates from what is standard, normal, or expected.”

Within a dataset are patterns that represent normal behavior. Anomaly detection identifies unexpected changes or events that don’t conform to the expected data pattern. In business terms, an anomaly is a deviation from business as usual.

Anomalies aren’t categorically good or bad. They’re simply deviations from the expected value for a metric at a given point in time.

Traditionally, anomaly detection was a manual process completed by experienced data scientists. However, as datasets continue to increase in size and complexity, machine learning has quickly become a more viable option.

Machine learning

According to IBM, “machine learning is a branch of artificial intelligence (AI) focused on building applications that learn from data and improve their accuracy over time without being programmed to do so.”

There are three main machine learning methods (or styles): supervised learning, unsupervised learning, and semi-supervised learning.

Supervised learning

Most practical machine learning uses supervised learning.

Supervised machine learning uses a training data set of predefined input variables and matching, expected outcomes. The algorithm uses the training data set to learn how to classify data or predict outcomes accurately.

A supervised learning algorithm, in its most basic form, learns a mapping function from input variables (x) to an output variable (y).

y = f(x)

The goal is to approximate the mapping function so well that when you have new input data (x), you can predict the output variable (y) for that data.
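To make the idea of approximating a mapping function concrete, here is a minimal sketch in Python. It assumes a straight-line relationship and fits it with ordinary least squares, which is only one of many possible choices. The data (gallons of gas purchased and dollars charged) and the names fit_line, gallons, and dollars are made up for illustration.

```python
from statistics import mean

def fit_line(xs, ys):
    """Approximate y = f(x) with a straight line fitted by least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (
        sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
        / sum((x - x_bar) ** 2 for x in xs)
    )
    intercept = y_bar - slope * x_bar
    # The returned function is the learned approximation of f
    return lambda x: slope * x + intercept

# Hypothetical training data: inputs (x) and matching expected outcomes (y)
gallons = [5.0, 8.0, 10.0, 12.0, 15.0]
dollars = [17.5, 28.0, 35.0, 42.0, 52.5]

f = fit_line(gallons, dollars)

# Predict the output for an input that was not in the training data
prediction = f(11.0)
print(prediction)
```

Real training data is rarely this clean, so the fitted function would normally be checked against examples it has not seen before being trusted.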

It’s called supervised learning because the process of an algorithm learning from the training dataset can be thought of as a teacher supervising the learning process. The algorithm iteratively makes predictions on the training data and is corrected by the teacher. Learning stops when the algorithm achieves an acceptable level of performance.

Supervised learning problems are grouped into regression and classification problems.

  • Classification: Assigns the output variable to a category, such as “groceries” or “travel,” or “disease” or “no disease.”
  • Regression: Models the relationship between dependent and independent variables. The output variable is a real value, such as an amount in dollars or a weight.
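As a rough illustration of the classification case, the sketch below labels a purchase as “groceries” or “travel” by copying the label of the most similar training example (a 1-nearest-neighbor classifier). The technique, the two features (amount and distance from home), and all of the data are assumptions chosen for the example, not a recommendation.

```python
import math

# Hypothetical labeled training data:
# (amount in dollars, miles from home) -> category
training = [
    ((45.0, 2.0), "groceries"),
    ((60.0, 3.0), "groceries"),
    ((38.0, 1.0), "groceries"),
    ((320.0, 450.0), "travel"),
    ((280.0, 600.0), "travel"),
    ((410.0, 900.0), "travel"),
]

def classify(point):
    """Return the label of the closest training example (1-nearest neighbor)."""
    _, label = min(training, key=lambda pair: math.dist(pair[0], point))
    return label

print(classify((52.0, 2.5)))      # groceries
print(classify((300.0, 500.0)))   # travel
```

A regression version would look similar, except that the output would be a number rather than a category.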

Unsupervised learning

Unlike supervised learning, unsupervised learning uses input data (x) only and no corresponding output variables.

The goal of unsupervised learning is to analyze and cluster unlabeled datasets. As the name implies, it doesn’t require user supervision. Working on its own, the algorithm learns what is normal and then applies a statistical test to determine whether a specific data point is an anomaly.
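One simple way to picture “learn what is normal, then apply a statistical test” is a z-score check, sketched below. It assumes the unlabeled history is mostly normal and roughly bell-shaped, and the three-standard-deviation threshold is a common convention rather than a rule. The names learn_baseline and is_anomaly and the purchase amounts are hypothetical.

```python
from statistics import mean, stdev

def learn_baseline(history):
    """Summarize unlabeled history as its mean and standard deviation."""
    return mean(history), stdev(history)

def is_anomaly(value, baseline, threshold=3.0):
    """Flag a value that sits more than `threshold` standard deviations from the mean."""
    mu, sigma = baseline
    return sigma > 0 and abs(value - mu) / sigma > threshold

# Hypothetical unlabeled history of gas purchases, in dollars
history = [42.0, 38.0, 45.0, 40.0, 41.0, 39.0, 44.0, 43.0, 40.0, 42.0, 41.0]
baseline = learn_baseline(history)

print(is_anomaly(43.0, baseline))    # False
print(is_anomaly(400.0, baseline))   # True
```

No one labeled any purchase as normal or fraudulent here. The baseline comes entirely from the data itself.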

Unsupervised learning problems are grouped into clustering and association problems.

  • Clustering: Discovers the inherent groupings in the data, such as grouping customers by purchasing behavior.
  • Association: Discovers rules that describe large portions of your data, such as people who buy X also tend to buy Y.
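The clustering case can be sketched with a bare-bones k-means loop over a single feature. This is an illustrative assumption about how the grouping might be done, not a production method. The function name kmeans_1d, the fixed iteration count, and the monthly spend figures are all made up, and the sketch assumes k is at least 2.

```python
from statistics import mean

def kmeans_1d(values, k=2, iterations=10):
    """Group 1-D values into k clusters (k >= 2) with a basic k-means loop."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centers across the sorted values
    centers = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each value in the cluster with the nearest center
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center)
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return clusters

# Hypothetical monthly spend per customer, in dollars, with no labels attached
monthly_spend = [20, 25, 22, 30, 210, 190, 205, 27]
low_spenders, high_spenders = kmeans_1d(monthly_spend)
print(low_spenders)
print(high_spenders)
```

The algorithm is never told which customers are “low” or “high” spenders. The two groups emerge from the data.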

Semi-supervised learning

Semi-supervised learning is, as you can probably guess, a mix of supervised and unsupervised learning. It trains a supervised learning algorithm on a small set of labeled examples together with a larger set of unlabeled ones, so you can train your model without labeling every training example.

 

Anomaly detection: Semi-supervised learning

When it comes to anomaly detection, semi-supervised learning is often the preferred option.

Most data classification should be done in an unsupervised manner (without human interaction). However, you should still have the option to feed the algorithms datasets that help establish baselines of business-as-usual behavior. A hybrid approach lets you scale anomaly detection while keeping the flexibility to define manual rules for specific anomalies.
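A hybrid setup along these lines might look like the sketch below. It assumes a small set of purchases that a person has confirmed as business as usual, uses that set to build a statistical baseline, and then layers a manual rule on top. The HOME_STATES rule, the field names, and the data are hypothetical.

```python
from statistics import mean, stdev

# Hypothetical purchases a person has confirmed as business as usual
confirmed_normal = [42.0, 38.0, 45.0, 40.0, 41.0, 39.0, 44.0]
mu, sigma = mean(confirmed_normal), stdev(confirmed_normal)

# Hypothetical manual rule: states where this customer normally shops
HOME_STATES = {"OH", "KY"}

def flag(purchase, threshold=3.0):
    """Return the reasons a purchase looks anomalous (empty list if none)."""
    reasons = []
    # Learned check: amount far from the confirmed-normal baseline
    if abs(purchase["amount"] - mu) / sigma > threshold:
        reasons.append("amount outside learned baseline")
    # Manual check: rule written by a person for a specific anomaly
    if purchase["state"] not in HOME_STATES:
        reasons.append("purchase outside home states")
    return reasons

print(flag({"amount": 41.0, "state": "OH"}))   # []
print(flag({"amount": 43.0, "state": "GA"}))   # ['purchase outside home states']
```

The learned baseline handles the bulk of the data without anyone reviewing each purchase, while the manual rule catches a case like the Georgia gas purchase, where the amount itself looks ordinary.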

Summary

Anomaly detection, when combined with machine learning that blends supervised and unsupervised methods, is one of several ways to make the most of your data.