If you want to be an anomaly, you have to start acting like one…


So, let’s start by understanding what an anomaly is.

An anomaly is something that deviates from what is standard, normal, or expected. We often have to check for anomalies while doing data analysis.

The main goal of anomaly detection is to identify observations that do not conform to the general patterns considered normal behavior.

Anomaly detection is often applied to unlabeled data, in which case it is known as unsupervised anomaly detection.

Assumptions of Anomaly detection

  • Anomalies occur only very rarely in the data.
  • Their features differ significantly from those of normal instances.

Directions of anomaly detection

  • Outlier detection: the training data contains outliers, and we aim to identify the observations that differ substantially from the rest of the dataset.
  • Novelty detection: the training data is not polluted by outliers, and we are interested in detecting whether a new observation is an outlier (a novelty).
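To make the novelty-detection case concrete, here is a minimal sketch using scikit-learn’s OneClassSVM (one of several possible detectors; the synthetic data and parameters are illustrative assumptions):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
# Training data is assumed clean (no outliers), as novelty detection requires.
X_train = 0.3 * rng.randn(200, 2)

# nu bounds the fraction of training points treated as errors (an assumption).
detector = OneClassSVM(nu=0.05, gamma="scale").fit(X_train)

# Score unseen observations: +1 = inlier, -1 = novelty.
X_new = np.array([[0.0, 0.1],    # near the training cloud
                  [5.0, 5.0]])   # far from anything seen in training
print(detector.predict(X_new))
```

The detector is fitted only on clean data, so any new point that falls outside the learned boundary is reported as a novelty.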

Reasons for the Outliers

  • data errors (measurement inaccuracies, rounding, transcription mistakes, etc.)
  • noisy data points
  • hidden patterns in the dataset (e.g., fraudulent transactions or attack requests)

How outliers are handled therefore depends on the nature of the data and the domain: noisy data points should be filtered out (noise removal), and data errors should be corrected.

Business use cases of Anomaly Detection

  • Intrusion Detection Systems (IDS)
  • Credit Card Fraud Detection Systems (CCFDS)
  • Fault Detection
  • Event Detection in Sensor Networks
  • System Health Monitoring

Univariate vs Multivariate Anomaly Detection

  • Univariate anomaly detection looks for unusual values in a single variable.
  • Multivariate anomaly detection considers several variables jointly.

Most of the analysis we end up doing is multivariate, due to the complexity of the world we live in. In multivariate anomaly detection, an outlier is an observation whose combination of values across two or more variables is unusual, even if each value looks normal on its own.
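For contrast, a univariate check can be as simple as a z-score rule (the toy readings and the 2-standard-deviation threshold below are illustrative assumptions):

```python
import numpy as np

# Hypothetical sensor readings with one obviously deviant value.
data = np.array([10.1, 9.8, 10.3, 10.0, 9.9, 25.0])

# z-score: how many standard deviations each point lies from the mean.
z = (data - data.mean()) / data.std()

# Flag points more than 2 standard deviations away (a common rule of thumb).
outliers = data[np.abs(z) > 2]
print(outliers)  # the 25.0 reading is flagged
```

A multivariate method would instead score the joint distribution of several such variables at once.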

ML Algorithms for Anomaly Detection

Isolation Forest

  • The Isolation Forest method builds an ensemble of randomized decision trees and combines their results.
  • Each tree is grown by recursively partitioning a random subsample of the data until every point is isolated (or a maximum depth is reached).
  • At each split, a random feature and a random split value are chosen to build the new branch.
  • The algorithm separates normal points from outliers using the average depth at which a point is isolated across the trees.
  • Random partitioning produces noticeably shorter paths for anomalies. Hence, when a forest of random trees collectively produces shorter path lengths for particular samples, they are highly likely to be anomalies.
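A minimal sketch of Isolation Forest with scikit-learn (the synthetic data and the contamination value are assumptions for illustration):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
# 100 normal points in a tight cluster plus 10 scattered outliers.
normal = 0.3 * rng.randn(100, 2)
scattered = rng.uniform(low=-4, high=4, size=(10, 2))
X = np.vstack([normal, scattered])

# contamination encodes our prior belief about the fraction of anomalies.
forest = IsolationForest(n_estimators=100, contamination=0.1, random_state=42)
labels = forest.fit_predict(X)  # +1 = inlier, -1 = outlier
print((labels == -1).sum())     # number of points flagged as anomalous
```

The scattered points are isolated in far fewer random splits than the clustered ones, which is exactly the short-path-length signal described above.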

Local Outlier Factor

What is the Local Outlier Factor (LOF)?

  • LOF is an unsupervised machine learning algorithm (it can also be used semi-supervised for novelty detection) that uses the local density of data points as the key factor for detecting outliers.
  • LOF compares the density of any given data point to the density of its neighbors.
  • This algorithm computes a score (called local outlier factor) which reflects the degree of abnormality of the observations.
  • Since outliers lie in low-density regions, this density ratio is higher for anomalous data points.
  • As a rule of thumb, a normal data point has a LOF between 1 and 1.5 whereas anomalous observations will have a much higher LOF.
  • The higher the LOF the more likely it is an outlier.
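With scikit-learn’s LocalOutlierFactor, both the labels and the LOF scores can be inspected directly (the toy data below is an illustrative assumption):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
# A dense cluster of normal points plus three far-away anomalies.
normal = 0.3 * rng.randn(100, 2)
anomalies = np.array([[4.0, 4.0], [-4.0, 3.5], [3.5, -4.0]])
X = np.vstack([normal, anomalies])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)             # +1 = inlier, -1 = outlier
scores = -lof.negative_outlier_factor_  # LOF score; higher = more anomalous

print(labels[-3:])  # the three planted anomalies
print(scores[-3:])  # their LOF scores sit far above 1
```

Note that scikit-learn stores the score negated (in `negative_outlier_factor_`), so flipping the sign recovers the LOF values discussed above.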

That’s all for this article. Give it a clap and share your thoughts in the comments if you enjoyed it.

Aspiring Data Scientist | Machine Learning | NLP | Time Series | Python, Tableau & SQL Expert | Storyteller | Blogger | Data Science Trainee at AlmaBetter