Anomaly detection

In data analysis, anomaly detection (also referred to as outlier detection and sometimes as novelty detection) is generally understood to be the identification of rare items, events or observations which deviate significantly from the majority of the data and do not conform to a well-defined notion of normal behavior.^[1] Such examples may arouse suspicions of being generated by a different mechanism,^[2] or appear inconsistent with the remainder of that set of data.^[3]

Anomaly detection finds application in many domains, including cybersecurity, medicine, machine vision, statistics, neuroscience, law enforcement and financial fraud, to name only a few. Anomalies were initially sought so that they could be rejected or omitted from the data to aid statistical analysis, for example when computing the mean or standard deviation. They were also removed to improve predictions from models such as linear regression, and more recently their removal has aided the performance of machine learning algorithms. However, in many applications the anomalies themselves are of interest and are the most desirable observations in the entire data set; they need to be identified and separated from noise or irrelevant outliers.

Three broad categories of anomaly detection techniques exist: supervised, semi-supervised and unsupervised. Semi-supervised techniques typically construct a model representing normal behavior from a given normal training data set, and then test the likelihood that a test instance was generated by the model. Unsupervised anomaly detection techniques assume the data is unlabelled and are by far the most commonly used due to their wider applicability.
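As an illustration of building a model of normal behavior from a normal training set and then testing the likelihood of a test instance, the following minimal sketch fits a univariate Gaussian to readings assumed to be normal and scores new points by their log-likelihood. The data values, function names and cutoff are hypothetical and chosen only for illustration; practical systems use richer models.

```python
import math
import statistics

def fit_gaussian(normal_samples):
    """Estimate mean and standard deviation from data assumed to be normal."""
    mu = statistics.fmean(normal_samples)
    sigma = statistics.stdev(normal_samples)
    return mu, sigma

def log_likelihood(x, mu, sigma):
    """Log-density of x under a Gaussian with the fitted parameters."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

# Hypothetical training set containing only normal behaviour.
train = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3]
mu, sigma = fit_gaussian(train)

# Hypothetical cutoff: flag test instances whose log-likelihood falls below it.
CUTOFF = -5.0
test_points = [10.05, 12.5]
flags = [log_likelihood(x, mu, sigma) < CUTOFF for x in test_points]
```

Here the first test point lies close to the training data and is not flagged, while the second lies many standard deviations away and is.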

Definition

Many attempts have been made in the statistical and computer science communities to define an anomaly. The most prevalent ones include the following, and can be categorised into three groups: those that are ambiguous, those that are specific to a method with pre-defined thresholds usually chosen empirically, and those that are formally defined:

Ill defined

An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism.^[2]
Anomalies are instances or collections of data that occur very rarely in the data set and whose features differ significantly from most of the data.
An outlier is an observation (or subset of observations) which appears to be inconsistent with the remainder of that set of data.^[3]
An anomaly is a point or collection of points that is relatively distant from other points in multi-dimensional space of features.
Anomalies are patterns in data that do not conform to a well-defined notion of normal behaviour.^[1]

Specific

Let T be observations from a univariate Gaussian distribution and O a point from T. Then the z-score for O is greater than a pre-selected threshold if and only if O is an outlier.
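The z-score rule above can be sketched as follows. The threshold of 3 and the sample values are assumptions made for illustration; the definition itself only requires that some threshold be selected in advance.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical sample: twenty readings near 50 and one extreme value.
sample = [50 + 0.1 * (i % 5 - 2) for i in range(20)] + [58.0]
outliers = zscore_outliers(sample, threshold=3.0)
```

Because the mean and standard deviation are themselves computed from data that include the extreme value, a large outlier inflates both, which is one reason the threshold is usually chosen empirically.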

Definition of anomalies in high-dimensional context

In this big data era, the focus is increasingly on methodologies capable of handling the complexity and scale of data, going beyond traditional approaches to define and detect anomalies in a way that is both effective and efficient for today's data-driven decision-making processes.^[4]

Anomalies in high-dimensional spaces are more challenging to identify due to the sparsity of the data and the relative distance between points becoming less meaningful.^[4]
Traditional threshold-based methods become less effective as dimensionality increases, often requiring more sophisticated, multidimensional analysis techniques.^[4]
High-dimensional anomaly detection often requires careful feature selection to reduce dimensionality and enhance sensitivity to true anomalies.^[4]
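The loss of meaning in relative distances can be observed numerically. The sketch below, which assumes uniformly random points in a unit hypercube and uses a hypothetical helper named relative_contrast, compares the spread of distances from one query point in a low-dimensional and a high-dimensional space.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(max - min) / min of distances from one query point to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        p = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, p))
    return (max(dists) - min(dists)) / min(dists)

contrast_low = relative_contrast(dim=2)
contrast_high = relative_contrast(dim=500)
```

In two dimensions the nearest and farthest points differ greatly in distance, whereas in 500 dimensions all distances cluster around a similar value, so a fixed distance threshold separates points far less clearly.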

History

Intrusion detection

The concept of intrusion detection, a critical component of anomaly detection, has evolved significantly over time. Initially, it was a manual process where system administrators would monitor for unusual activities, such as a vacationing user's account being accessed or unexpected printer activity. This approach was not scalable and was soon superseded by the analysis of audit logs and system logs for signs of malicious behavior.^[5]

By the late 1970s and early 1980s, the analysis of these logs was primarily used retrospectively to investigate incidents, as the volume of data made it impractical for real-time monitoring. The affordability of digital storage eventually led to audit logs being analyzed online, with specialized programs being developed to sift through the data. These programs, however, were typically run during off-peak hours due to their computational intensity.^[5]

The 1990s brought the advent of real-time intrusion detection systems capable of analyzing audit data as it was generated, allowing for immediate detection of and response to attacks. This marked a significant shift towards proactive intrusion detection.^[5]

As the field has continued to develop, the focus has shifted to creating solutions that can be efficiently implemented across large and complex network environments, adapting to the ever-growing variety of security threats and the dynamic nature of modern computing infrastructures.^[5]

Applications

Anomaly detection is applicable in a very large number and variety of domains, and is an important subarea of unsupervised machine learning. As such it has applications in cyber-security, fraud detection, fault detection, system health monitoring, event detection in sensor networks, detecting ecosystem disturbances, defect detection in images using machine vision, medical diagnosis and law enforcement.^[6]

Intrusion detection

Anomaly detection was proposed for intrusion detection systems (IDS). The counterpart of anomaly detection in intrusion detection is misuse detection.

Fintech fraud detection

Anomaly detection is vital in fintech for fraud prevention.^[10]^[11]

Preprocessing

Preprocessing data to remove anomalies can be an important step in data analysis, and is done for a number of reasons. Statistics such as the mean and standard deviation are more accurate after the removal of anomalies, and the visualisation of data can also be improved. In supervised learning, removing the anomalous data from the dataset often results in a statistically significant increase in accuracy.^[12]^[13]
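The effect of a single anomalous value on the mean and standard deviation can be shown with a small example. The measurements below are hypothetical, and the screening rule (keeping values within three scaled median absolute deviations of the median) is one common choice assumed here for illustration, not a method prescribed by the sources cited above.

```python
import statistics

# Hypothetical measurements with one gross recording error (250.0).
raw = [10.2, 9.9, 10.1, 10.0, 9.8, 250.0, 10.3, 9.7]

# Assumed screening rule: keep values within 3 scaled MADs of the median.
median = statistics.median(raw)
mad = statistics.median([abs(v - median) for v in raw])
cleaned = [v for v in raw if abs(v - median) <= 3 * 1.4826 * mad]

mean_raw, mean_cleaned = statistics.fmean(raw), statistics.fmean(cleaned)
std_raw, std_cleaned = statistics.stdev(raw), statistics.stdev(cleaned)
```

With the erroneous value included the mean is pulled far from the bulk of the data, and after its removal both statistics describe the remaining readings much more closely.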

Video surveillance

Anomaly detection has become increasingly vital in video surveillance to enhance security and safety.^[14]^[15] With the advent of deep learning technologies, methods using Convolutional Neural Networks (CNNs) and Simple Recurrent Units (SRUs) have shown significant promise in identifying unusual activities or behaviors in video data.^[14] These models can process and analyze extensive video feeds in real time, recognizing patterns that deviate from the norm, which may indicate potential security threats or safety violations.^[14]

IT infrastructure

In IT infrastructure management, anomaly detection is crucial for ensuring the smooth operation and reliability of services.^[16] Frameworks such as the IT Infrastructure Library (ITIL) and monitoring tools are employed to track and manage system performance and user experience.^[16] Detecting anomalies can help identify and pre-empt potential performance degradations or system failures, thus maintaining productivity and business process effectiveness.^[16]

IoT systems

Anomaly detection is critical for the security and efficiency of Internet of Things (IoT) systems.^[17] It helps in identifying system failures and security breaches in complex networks of IoT devices.^[17] The methods must manage real-time data, diverse device types, and scale effectively. Garg et al.^[18] have introduced a multi-stage anomaly detection framework that improves upon traditional methods by incorporating spatial clustering, density-based clustering, and locality-sensitive hashing. This tailored approach is designed to better handle the vast and varied nature of IoT data, thereby enhancing security and operational reliability in smart infrastructure and industrial IoT systems.^[18]

Petroleum industry

Anomaly detection is crucial in the petroleum industry for monitoring critical machinery.^[19] Martí et al. used a novel segmentation algorithm to analyze sensor data for real-time anomaly detection.^[19] This approach helps promptly identify and address any irregularities in sensor readings, ensuring the reliability and safety of petroleum operations.^[19]

Oil and gas pipeline monitoring

In the oil and gas sector, anomaly detection is not just crucial for maintenance and safety, but also for environmental protection.^[20] Aljameel et al. propose an advanced machine learning-based model for detecting minor leaks in oil and gas pipelines, a task traditional methods may miss.^[20]

Methods

Many anomaly detection techniques have been proposed in the literature.^[1]^[21] The performance of a method usually depends on the data set. For example, some methods are suited to detecting local outliers and others to global outliers, and no method shows much systematic advantage over the others when compared across many data sets.^[22]^[23] Almost all algorithms also require the setting of non-intuitive parameters that are critical for performance and usually unknown before application. Some of the popular techniques are mentioned below, broken down into categories:

Statistical

Parameter-free

Parametric-based

Density

Density-based techniques (k-nearest neighbor,^[24]^[25]^[26] local outlier factor,^[27] isolation forests,^[28]^[29] and many more variations of this concept^[30])
Subspace-,^[31] correlation-based^[32] and tensor-based^[33] outlier detection for high-dimensional data^[34]
One-class support vector machines^[35]
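The k-nearest-neighbor idea listed above can be sketched in a few lines: each point is scored by the distance to its k-th nearest neighbor, so points in sparse regions receive high scores. This is a simplified brute-force illustration with hypothetical data and a hypothetical function name; it is not the exact formulation of any of the cited papers, and library implementations use index structures for efficiency.

```python
import math

def knn_outlier_scores(points, k=3):
    """Score each point by the distance to its k-th nearest neighbour."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

# Hypothetical 2-D data: a tight cluster plus one isolated point.
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05), (3.0, 3.0)]
scores = knn_outlier_scores(data, k=3)
most_anomalous = data[scores.index(max(scores))]
```

The choice of k is one of the non-intuitive parameters such methods depend on: too small a value lets small groups of anomalies hide one another, while too large a value blurs local structure.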

Neural networks

Replicator neural networks,^[36] autoencoders, variational autoencoders,^[37] long short-term memory neural networks^[38]
Bayesian networks^[36]
Hidden Markov models (HMMs)^[36]
Minimum Covariance Determinant^[39]^[40]
Deep Learning^[14]
- Convolutional Neural Networks (CNNs): CNNs have shown exceptional performance in the unsupervised learning domain for anomaly detection, especially in image and video data analysis.^[14] Their ability to automatically and hierarchically learn spatial hierarchies of features from low to high-level patterns makes them particularly suited for detecting visual anomalies. For instance, CNNs can be trained on image datasets to identify atypical patterns indicative of defects or out-of-norm conditions in industrial quality control scenarios.^[41]
- Simple Recurrent Units (SRUs): In time-series data, SRUs, a type of recurrent neural network, have been effectively used for anomaly detection by capturing temporal dependencies and sequence anomalies.^[14] Unlike traditional RNNs, SRUs are designed to be faster and more parallelizable, offering a better fit for real-time anomaly detection in complex systems such as dynamic financial markets or predictive maintenance in machinery, where identifying temporal irregularities promptly is crucial.^[42]

Cluster-based

Clustering: Cluster analysis-based outlier detection^[43]^[44]
Deviations from association rules and frequent itemsets
Fuzzy logic-based outlier detection
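One simple form of cluster analysis-based outlier detection scores each point by its distance to the nearest cluster centroid. The sketch below assumes plain k-means (Lloyd iterations) with hand-picked starting centroids and hypothetical data; it illustrates the general idea rather than the specific algorithms in the cited works.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain Lloyd iterations starting from the given centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Recompute each centroid; keep the old one if its cluster is empty.
        centroids = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids

def cluster_scores(points, centroids):
    """Outlier score = distance to the closest cluster centroid."""
    return [min(math.dist(p, c) for c in centroids) for p in points]

# Hypothetical data: two groups and one point belonging to neither.
data = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2), (3.0, 9.0)]
centroids = kmeans(data, centroids=[(1.0, 1.0), (5.0, 5.0)])
scores = cluster_scores(data, centroids)
farthest = data[scores.index(max(scores))]
```

Note that the stray point is still assigned to a cluster and pulls that centroid towards itself, which is a known weakness of using clustering algorithms that were not designed with outliers in mind.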

Ensembles

Ensemble techniques, using feature bagging,^[45]^[46] score normalization^[47]^[48] and different sources of diversity^[49]^[50]
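A minimal sketch of an ensemble combining feature bagging with score normalization is given below. It assumes a k-nearest-neighbor distance as the base detector, min-max scaling as the normalization and simple averaging as the combination rule; these choices, the subset-size rule and all names and data are illustrative assumptions rather than the procedures of the cited papers.

```python
import math
import random

def knn_scores(points, features, k=2):
    """k-th nearest-neighbour distance using only the selected features."""
    proj = [[p[f] for f in features] for p in points]
    scores = []
    for i, p in enumerate(proj):
        dists = sorted(math.dist(p, q) for j, q in enumerate(proj) if j != i)
        scores.append(dists[k - 1])
    return scores

def min_max(scores):
    """Rescale scores to [0, 1] so that detectors can be combined."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) if hi > lo else 0.0 for s in scores]

def feature_bagging(points, n_detectors=10, seed=0):
    """Average normalized scores of detectors run on random feature subsets."""
    rng = random.Random(seed)
    dim = len(points[0])
    total = [0.0] * len(points)
    for _ in range(n_detectors):
        # Each detector sees a random subset of at least half the features.
        size = rng.randint(max(1, dim // 2), dim)
        features = rng.sample(range(dim), size)
        for i, s in enumerate(min_max(knn_scores(points, features))):
            total[i] += s
    return [t / n_detectors for t in total]

# Hypothetical 4-D data: five similar points and one that deviates everywhere.
data = [
    (1.0, 2.0, 3.0, 4.0), (1.1, 2.1, 2.9, 4.0), (0.9, 1.9, 3.1, 4.1),
    (1.0, 2.1, 3.0, 3.9), (1.1, 1.9, 3.0, 4.1), (5.0, 6.0, 7.0, 8.0),
]
ensemble = feature_bagging(data)
```

Normalizing before averaging matters because raw scores from different feature subsets live on different scales, so an unnormalized average would be dominated by the detectors with the largest distances.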

Others

Anomaly detection in dynamic networks

Dynamic networks, such as those representing financial systems, social media interactions, and transportation infrastructure, are subject to constant change, making anomaly detection within them a complex task. Unlike static graphs, dynamic networks reflect evolving relationships and states, requiring adaptive techniques for anomaly detection.

Types of anomalies in dynamic networks

Community anomalies
Compression anomalies
Decomposition anomalies
Distance anomalies
Probabilistic model anomalies

Explainable anomaly detection

Many of the methods discussed above yield only an anomaly score, which can often be explained to users as the point lying in a region of low data density (or of relatively low density compared to its neighbors' densities). In explainable artificial intelligence, users demand methods with higher explainability. Some methods allow for more detailed explanations:

The Subspace Outlier Degree (SOD)^[31] identifies attributes where a sample is normal, and attributes in which the sample deviates from the expected.
Correlation Outlier Probabilities (COP)^[32] compute an error vector of how a sample point deviates from an expected location, which can be interpreted as a counterfactual explanation: the sample would be normal if it were moved to that location.
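The kind of attribute-level explanation described above can be imitated with a much simpler device. The sketch below is not the SOD or COP algorithm; it merely assumes a given reference set of normal samples and reports, per attribute, whether a flagged sample lies within a z-score threshold. The attribute names, data and threshold are hypothetical.

```python
import statistics

def deviating_attributes(sample, reference, names, threshold=3.0):
    """Split attribute names into (normal, deviating) for one sample."""
    normal, deviating = [], []
    for idx, name in enumerate(names):
        column = [row[idx] for row in reference]
        mu = statistics.fmean(column)
        sigma = statistics.pstdev(column)
        if sigma > 0:
            z = abs(sample[idx] - mu) / sigma
        else:
            # Constant attribute: any difference counts as a deviation.
            z = 0.0 if sample[idx] == mu else float("inf")
        (deviating if z > threshold else normal).append(name)
    return normal, deviating

names = ["temperature", "pressure", "vibration"]
reference = [(70.0, 1.00, 0.20), (71.0, 1.02, 0.21), (69.0, 0.98, 0.19), (70.5, 1.01, 0.20)]
sample = (70.2, 1.00, 0.90)
normal, deviating = deviating_attributes(sample, reference, names)
```

An explanation of this form tells the user which measurements to inspect, which a single anomaly score cannot do.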

Software

ELKI is an open-source Java data mining toolkit that contains several anomaly detection algorithms, as well as index acceleration for them.
PyOD is an open-source Python library developed specifically for anomaly detection.^[51]
scikit-learn is an open-source Python library that contains some algorithms for unsupervised anomaly detection.
Wolfram Mathematica provides functionality for unsupervised anomaly detection across multiple data types.^[52]

Datasets

Anomaly detection benchmark data repository with carefully chosen data sets of the Ludwig-Maximilians-Universität München; mirror at University of São Paulo (archived 2022-03-31 at the Wayback Machine).
ODDS: A large collection of publicly available outlier detection datasets with ground truth in different domains.
Unsupervised Anomaly Detection Benchmark at Harvard Dataverse: Datasets for Unsupervised Anomaly Detection with ground truth.
KMASH Data Repository at Research Data Australia having more than 12,000 anomaly detection datasets with ground truth.

References

[1] S2CID 207172599.
[2] OCLC 6912274.
[3] OCLC 1150938591.
[4] ISSN 2196-1115.
[5] ISSN 0018-9162.
[6] ISBN 978-3319475776.
[7] S2CID 10028835. Archived (PDF) from the original on June 22, 2015.
[8] S2CID 35632142.
[9] Jones, Anita K.; Sielken, Robert S. (2000). "Computer System Intrusion Detection: A Survey". Computer Science Technical Report. Department of Computer Science, University of Virginia: 1–25.
[10] PMID 33668773.
[11] S2CID 204982937.
[12] doi:10.1109/TSMC.1976.4309523.
[13] S2CID 5809822.
[14] S2CID 257728239.
[15] S2CID 12310150.
[16] S2CID 3883483. Retrieved 2023-11-08.
[17] S2CID 250644468.
[18] S2CID 204077191.
[19] PMID 25633599.
[20] ISSN 2079-3197.
[21] S2CID 53305944. Archived from the original (PDF) on 2021-11-14. Retrieved 2019-12-09.
[22] S2CID 1952214.
[23] Ludwig-Maximilians-Universität München; mirror at University of São Paulo (archived 2022-03-31 at the Wayback Machine).
[24] S2CID 11707259.
[25] ISBN 1-58113-217-4.
[26] ISBN 978-3-540-44037-6.
[27] ISBN 1-58113-217-4.
[28] S2CID 6505449.
[29] S2CID 207193045.
[30] S2CID 19036098.
[31] ISBN 978-3-642-01306-5.
[32] ISBN 978-1-4673-4649-8.
[33] S2CID 16368060.
[34] S2CID 6724536.
[35] S2CID 2110475.
[36] S2CID 6436930.
[37] An, J.; Cho, S. (2015). "Variational autoencoder based anomaly detection using reconstruction probability" (PDF). Special Lecture on IE. 2 (1): 1–18. SNUDM-TR-2015-03.
[38] ISBN 978-2-87587-015-5.
[39] S2CID 67227041.
[40] S2CID 123086172.
[41] PMID 33816053.
[42] PMID 36905048.
[43] doi:10.1016/S0167-8655(03)00003-5.
[44] S2CID 2887636.
[45] S2CID 2054204.
[46] ISBN 978-3-642-12025-1.
[47] ISBN 978-0-89871-992-5.
[48] ISBN 978-1-61197-232-0.
[49] S2CID 8065347.
[50] ISBN 978-1-4503-2722-0.
[51] arXiv:1901.01588.
[52] "FindAnomalies". Mathematica documentation.


