Anomaly Detection


In this blog, we are going to understand what an anomaly is and cover some basic anomaly detection techniques.

Assumptions of Anomaly detection

The whole concept of anomaly detection is based on two basic assumptions:

  • Anomalies occur only rarely in the data.
  • Their features differ significantly from those of normal instances.

Directions of anomaly detection

In data analysis we have two directions to search for anomalies:

  • Outlier detection: The training data already contains outliers, and we want to identify the observations that deviate substantially from the rest.
  • Novelty detection: The training data is not polluted by outliers, and we are interested in detecting whether a new observation is an outlier.

Reasons for the Outliers

The most common reasons for outliers are:

  • noise data points
  • hidden patterns in the dataset (fraud or attack requests)

Business use cases of Anomaly Detection

There are various business use cases where anomaly detection is useful.

  • The Credit Card Fraud Detection Systems (CCFDS)
  • Fault Detection
  • Event Detection in Sensor Networks
  • System Health Monitoring

Univariate vs Multivariate Anomaly Detection

  • In univariate anomaly detection, we measure a single indicator (one feature) at a time.
  • Multivariate anomaly detection considers several factors together.

Most of the analysis we end up doing is multivariate, owing to the complexity of the world we live in. In multivariate anomaly detection, an outlier is a combination of unusual values on at least two variables — each value may look normal on its own.
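As a minimal univariate sketch (using made-up numbers and a simple z-score rule of thumb, not a method from any particular library):

```python
import numpy as np

# Hypothetical sensor readings: one value is clearly unusual.
data = np.array([10.0, 11.0, 9.5, 10.2, 10.8, 35.0, 9.9, 10.4])

# Flag points more than 2 standard deviations from the mean.
z_scores = (data - data.mean()) / data.std()
outliers = data[np.abs(z_scores) > 2]
print(outliers)  # flags the single unusual reading, 35.0
```

A multivariate method would instead look at several such indicators jointly, since a point can be anomalous as a combination even when each individual value looks ordinary.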

ML Algorithms for Anomaly Detection

Isolation Forest

One efficient way of performing outlier detection in high-dimensional datasets is to use random forests. The ensemble.IsolationForest ‘isolates’ observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature.

  • Each tree is built by recursively partitioning the data until every point is isolated in its own leaf (or a maximum tree depth is reached).
  • At each split, a random feature and a random split value are selected to build the new branch of the tree.
  • The algorithm separates normal points from outliers using the depth at which each point is isolated, averaged over all trees.
  • Random partitioning produces noticeably shorter paths for anomalies. Hence, when a forest of random trees collectively produces shorter path lengths for particular samples, those samples are highly likely to be anomalies.
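A short sketch of the above with scikit-learn's IsolationForest, on hypothetical synthetic data (a tight cluster of inliers plus a few scattered points — the data and parameter values are illustrative, not from this post):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
# Hypothetical data: 100 points in a tight cluster, 10 scattered outliers.
X_inliers = 0.3 * rng.randn(100, 2)
X_outliers = rng.uniform(low=-4, high=4, size=(10, 2))
X = np.vstack([X_inliers, X_outliers])

# contamination is our guess at the fraction of outliers in the data.
clf = IsolationForest(n_estimators=100, contamination=0.1, random_state=42)
labels = clf.fit_predict(X)  # +1 for inliers, -1 for outliers
```

Points that are isolated in few splits (short average path length) receive the -1 label.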

Local Outlier Factor

Another efficient way to perform outlier detection on moderately high-dimensional datasets is to use the Local Outlier Factor (LOF) algorithm.

  • LOF compares the density of any given data point to the density of its neighbors.
  • This algorithm computes a score (called local outlier factor) which reflects the degree of abnormality of the observations.
  • Since outliers come from low-density areas, the ratio of the neighbors' density to the point's own density will be higher for anomalous data points.
  • As a rule of thumb, a normal data point has a LOF between 1 and 1.5 whereas anomalous observations will have a much higher LOF.
  • The higher the LOF the more likely it is an outlier.
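A sketch of LOF with scikit-learn's LocalOutlierFactor, again on hypothetical synthetic data (the data and parameter choices are illustrative):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
# Hypothetical data: a dense cluster plus a few low-density points.
X_inliers = 0.3 * rng.randn(100, 2)
X_outliers = rng.uniform(low=-4, high=4, size=(5, 2))
X = np.vstack([X_inliers, X_outliers])

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
labels = lof.fit_predict(X)            # +1 for inliers, -1 for outliers
scores = -lof.negative_outlier_factor_  # LOF values; higher = more abnormal
```

The flagged points are exactly those whose local density is much lower than that of their neighbors, i.e. those with the highest LOF scores.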



Bhanu Shahi


Data Analyst at Decimal Tech | Machine Learning | NLP | Time Series | Python, Tableau & SQL Expert | Storyteller | Blogger