If you want to be an anomaly, you have to start acting like one…

Introduction

In this blog, we will look at what an anomaly is and walk through some basic anomaly detection techniques.

Let’s start by defining what an anomaly is.

An anomaly is something that deviates from what is standard, normal, or expected. We often have to check for anomalies while doing data analysis.

The main goal of anomaly detection is to identify observations that do not adhere to the general patterns considered normal behavior.

Anomaly detection is often applied to unlabeled data, which is known as unsupervised anomaly detection.

Assumptions of Anomaly detection

The whole concept of Anomaly detection is based on two basic assumptions:

  • Anomalies only occur very rarely in the data.
  • Their features differ from the normal instances significantly.

Directions of anomaly detection

In data analysis, there are two ways to search for anomalies:

  • Outlier detection: The training data contains outliers, that is, observations that differ from the other data points, and the goal is to identify them within the training dataset itself.
  • Novelty detection: The training data is not polluted by outliers, and we are interested in detecting whether a new, unseen observation is an outlier.
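To make the difference concrete, here is a minimal sketch of both directions. It assumes scikit-learn's LocalOutlierFactor (an algorithm covered later in this post) and uses synthetic data invented purely for illustration; the variable names and the planted points are hypothetical.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

# Synthetic "normal" data: one dense 2-D cluster (an assumption for this sketch).
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))

# Outlier detection: the training data itself is polluted with two far-away points.
polluted = np.vstack([normal, [[8.0, 8.0], [-9.0, 7.0]]])
outlier_model = LocalOutlierFactor(n_neighbors=20)
train_labels = outlier_model.fit_predict(polluted)  # 1 = inlier, -1 = outlier

# Novelty detection: fit on clean data only, then judge unseen observations.
novelty_model = LocalOutlierFactor(n_neighbors=20, novelty=True)
novelty_model.fit(normal)
new_points = np.array([[0.1, -0.2], [10.0, 10.0]])
new_labels = novelty_model.predict(new_points)  # 1 = looks normal, -1 = novelty

print("labels of the two planted training outliers:", train_labels[-2:])
print("labels of the new observations:", new_labels)
```

In the first case the model labels the same data it was fitted on; in the second it is fitted once on clean data and then applied to observations it has never seen.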

Reasons for the Outliers

The most common reasons for outliers are:

  • data errors (measurement inaccuracies, rounding, incorrect writing, etc.)
  • noise data points
  • hidden patterns in the dataset (fraud or attack requests)

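So outlier processing depends on the nature of the data and the domain. Noise data points should be filtered (noise removal), and data errors should be corrected. Hidden patterns, such as fraud or attack requests, are usually the very observations we want to find and investigate.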

Business use cases of Anomaly Detection

There are various business use cases where anomaly detection is useful.

  • Intrusion Detection Systems (IDS)
  • Credit Card Fraud Detection Systems (CCFDS)
  • Fault Detection
  • Event Detection in Sensor Networks
  • System Health Monitoring

Univariate vs Multivariate Anomaly Detection

  • Univariate anomaly detection looks at a single variable at a time.
  • Multivariate anomaly detection considers several variables together.

Most of the analysis that we end up doing is multivariate, because real-world problems rarely depend on a single variable. In multivariate anomaly detection, an outlier is an observation with an unusual combination of values on at least two variables.
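A small sketch can show why the multivariate view matters. The example below is an illustration under stated assumptions: the data is synthetic, the test point is hypothetical, and the Mahalanobis distance is used only as one simple way to score a combination of variables (it is not one of the algorithms discussed in this post).

```python
import numpy as np

rng = np.random.default_rng(42)

# Two strongly correlated synthetic variables (think height and weight).
x = rng.normal(size=500)
y = 0.9 * x + rng.normal(scale=0.3, size=500)
data = np.column_stack([x, y])

# Hypothetical point: each value is ordinary on its own, the combination is not.
point = np.array([1.5, -1.5])

mean = data.mean(axis=0)
std = data.std(axis=0)

# Univariate view: z-score of each variable considered separately.
z_scores = np.abs((point - mean) / std)

# Multivariate view: the Mahalanobis distance takes the correlation into account.
cov_inv = np.linalg.inv(np.cov(data, rowvar=False))
diff = point - mean
mahalanobis = float(np.sqrt(diff @ cov_inv @ diff))

print("per-variable z-scores:", z_scores)
print("Mahalanobis distance:", mahalanobis)
```

Each z-score stays well inside the usual range, so a univariate check on either variable would pass this point, while the joint distance is large because the point breaks the relationship between the two variables.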

ML Algorithms for Anomaly Detection

Isolation Forest

One efficient way of performing outlier detection in high-dimensional datasets is to use random forests. In scikit-learn, ensemble.IsolationForest ‘isolates’ observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature.

  • The Isolation Forest method builds an ensemble of randomly constructed Decision Trees and combines their results.
  • Each Decision Tree is grown until every point in its training sample is isolated in its own leaf (implementations typically also cap the tree depth).
  • A random feature and a random split value are selected to build each new branch in the Decision Tree.
  • The algorithm separates normal points from outliers by the mean depth, across all trees, of the leaf in which each point ends up.
  • Random partitioning produces noticeably shorter paths for anomalies. Hence, when a forest of random trees collectively produces shorter path lengths for particular samples, they are highly likely to be anomalies.
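The points above can be tried out with a few lines of scikit-learn. This is a minimal sketch, not a tuned setup: the dataset is synthetic, the three planted outliers are hypothetical, and the parameter values are simply reasonable defaults.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): a dense cluster plus 3 distant points.
inliers = rng.normal(loc=0.0, scale=1.0, size=(300, 2))
planted_outliers = np.array([[6.0, 6.0], [-7.0, 5.0], [6.5, -6.0]])
X = np.vstack([inliers, planted_outliers])

# 100 random trees; random_state fixes the randomness so runs are repeatable.
forest = IsolationForest(n_estimators=100, random_state=0)
labels = forest.fit_predict(X)    # 1 = inlier, -1 = outlier
scores = forest.score_samples(X)  # lower score = shorter average path = more anomalous

print("labels of the planted outliers:", labels[-3:])
print("mean score of the cluster:", scores[:300].mean())
print("mean score of the planted outliers:", scores[-3:].mean())
```

The planted points need only a few random splits to be isolated, so their scores are lower than those of points inside the cluster. With the default contamination setting, some points on the edge of the cluster may be flagged as well.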

Local Outlier Factor

Another efficient way to perform outlier detection on moderately high-dimensional datasets is to use the Local Outlier Factor (LOF) algorithm.

What is the Local Outlier Factor (LOF)?

  • LOF is an unsupervised machine learning algorithm (it can also be used in a semi-supervised way for novelty detection) that uses the local density of data points as the key factor to detect outliers.
  • LOF compares the density of any given data point to the density of its neighbors.
  • This algorithm computes a score (called the local outlier factor) which reflects the degree of abnormality of each observation.
  • The score is the ratio of the average local density of a point’s neighbors to the point’s own local density. Since outliers lie in low-density areas, this ratio is higher for anomalous data points.
  • As a rule of thumb, a normal data point has an LOF between 1 and 1.5, whereas anomalous observations have a much higher LOF. The right cut-off still depends on the dataset.
  • The higher the LOF, the more likely the point is an outlier.
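To see these scores in practice, here is a short sketch that assumes scikit-learn's LocalOutlierFactor. The data is synthetic and the planted outlier is hypothetical. Note that scikit-learn exposes the negative of the LOF, so the sign is flipped to get the score described above.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(1)

# Synthetic data (an assumption for illustration): one cluster plus one distant point.
cluster = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
X = np.vstack([cluster, [[7.0, 7.0]]])  # the last row is the planted outlier

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)  # 1 = inlier, -1 = outlier

# scikit-learn stores the negative LOF; flip the sign to recover the usual score.
lof_scores = -lof.negative_outlier_factor_

print("median LOF inside the cluster:", np.median(lof_scores[:-1]))
print("LOF of the planted outlier:", lof_scores[-1])
```

Points inside the cluster have a density similar to that of their neighbors, so their LOF stays close to 1, while the planted point sits in an empty region and gets a much higher score.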

That’s all for this article. Give it a clap and share your thoughts in the comments if you enjoyed it.

Aspiring Data Scientist | Machine Learning | NLP | Time Series | Python, Tableau & SQL Expert | Storyteller | Blogger | Data Science Trainee at AlmaBetter