How effective are big data technologies for Intrusion Detection?

A survey on Data Mining and Machine Learning Methods for Cyber Security Intrusion Detection

Cybersecurity is a set of technologies and processes designed to protect computers, networks, programs and data from adverse cyber incidents. To this end, Intrusion Detection Systems (IDS) identify unauthorized use, duplication, alteration and destruction of information systems. These systems employ machine learning (ML) or data mining (DM) methods to recognize known signatures of malicious activities, as well as to identify deviations from normal behavior. This paper provides a deeper understanding of ML and DM techniques by surveying some popular and emerging cybersecurity methods.
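The two detection styles mentioned above, matching known signatures and flagging deviations from normal behavior, can be illustrated with a minimal Python sketch. It is not taken from the surveyed paper: the signature strings, the request-rate baseline and the three-standard-deviation threshold are all illustrative assumptions.

```python
from statistics import mean, stdev

# Hypothetical fragments of known-malicious payloads (illustrative only).
KNOWN_SIGNATURES = {"' OR 1=1 --", "../../etc/passwd"}


def matches_signature(payload):
    """Signature-based detection: flag payloads containing a known pattern."""
    return any(signature in payload for signature in KNOWN_SIGNATURES)


def is_anomalous(value, baseline, threshold=3.0):
    """Anomaly-based detection: flag values far from the baseline mean.

    The distance is measured in sample standard deviations (a z-score).
    """
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold


# Assumed "normal" behavior: requests per minute seen from one host.
normal_requests_per_minute = [52, 48, 50, 47, 53, 49, 51, 50]

print(matches_signature("GET /index.php?id=' OR 1=1 --"))  # known pattern
print(is_anomalous(400, normal_requests_per_minute))       # far from baseline
```

The signature check can only catch attacks that are already known, whereas the anomaly check can flag novel behavior at the cost of false alarms on unusual but benign traffic.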

Buczak and Guven (2015) review the literature on machine learning (ML) and data mining (DM) methods for cyber analytics in support of intrusion detection in both wired and wireless networks. The authors focus on highly cited papers and on recent papers presenting emerging methods.

ML focuses on classification and prediction, based on known properties previously learned from the training data. DM is the application of specific algorithms for extracting patterns from the data. ML needs a specific goal, whereas DM focuses on the discovery of previously unknown properties in the data. As both approaches employ similar statistical techniques, the authors refer to them jointly as ML/DM methods. The methods discussed in the article are described in the table below.

The authors conclude that the most effective ML/DM methods for the cyber domain have not yet been established. Given the richness and complexity of the methods, it is not possible to recommend a single method for each type of attack a system is supposed to detect. Several criteria need to be taken into account when comparing methods: accuracy, complexity, the time needed to classify a threat and the understandability of the outcomes of each ML/DM method. The authors also emphasize that streaming capabilities are essential in the cyber domain, as cyberattacks require real-time analysis of online data.
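Two of these criteria, accuracy and the time needed to classify a threat, can be measured directly. The Python sketch below shows one way to do so. The `evaluate` helper, the threshold classifier and the tiny labelled sample are hypothetical and serve only to illustrate the measurement, not any method from the paper.

```python
import time


def evaluate(classifier, samples, labels):
    """Return (accuracy, mean seconds per classification) for a classifier."""
    start = time.perf_counter()
    predictions = [classifier(sample) for sample in samples]
    elapsed = time.perf_counter() - start
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels), elapsed / len(samples)


def threshold_classifier(failed_logins):
    """Hypothetical classifier: flag a connection with many failed logins."""
    return "attack" if failed_logins > 5 else "normal"


# Assumed toy data: failed-login counts and their true labels.
samples = [0, 1, 12, 3, 9, 6, 2, 4]
labels = ["normal", "normal", "attack", "normal",
          "attack", "normal", "normal", "normal"]

accuracy, seconds_per_sample = evaluate(threshold_classifier, samples, labels)
print(f"accuracy={accuracy:.3f}, time per sample={seconds_per_sample:.2e}s")
```

On realistic data the same measurement would be run on a held-out test set, and complexity and understandability would still need to be judged separately.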

The authors find that the cyber domain has peculiarities that make these ML/DM methods harder to use. Models need to be retrained frequently as new intrusions emerge. Yet cybersecurity research on retraining ML/DM models is limited, due to the scarcity of good datasets. The best dataset available so far for testing ML/DM methods is the corrected 1999 Knowledge Discovery in Databases (KDD) dataset. Investing in the collection of up-to-date, representative data, for example by placing sensors on networks, could foster the creation of more efficient tools for detecting cyber threats in information systems.

ML/DM methods used in the cyber domain, especially for developing tools for IDSs, are still evolving. Further research is needed to develop fast incremental learning in ML/DM methods that could be used for daily updates of IDSs. Investment in accurate, well-labelled datasets is the first step toward this goal.
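The idea of incremental learning, updating a model with each new observation instead of retraining from scratch, can be sketched with a simple statistical baseline. The Python example below uses Welford's online algorithm for the running mean and variance. It is a stand-in chosen for brevity, not an ML/DM method from the paper, and the class name, traffic values and threshold are assumptions.

```python
import math


class RunningBaseline:
    """Tracks mean and variance incrementally (Welford's online algorithm)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        """Fold one new observation into the baseline in constant time."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    def stdev(self):
        """Sample standard deviation of the observations seen so far."""
        if self.count < 2:
            return 0.0
        return math.sqrt(self._m2 / (self.count - 1))

    def is_anomalous(self, value, threshold=3.0):
        """Flag values more than `threshold` standard deviations away."""
        sigma = self.stdev()
        if sigma == 0:
            return False
        return abs(value - self.mean) / sigma > threshold


# Assumed daily traffic figures; each day updates the model without retraining.
baseline = RunningBaseline()
for day_traffic in [50, 52, 48, 51, 49]:
    baseline.update(day_traffic)

print(baseline.is_anomalous(80))  # far above the learned baseline
```

Because each update touches only three numbers, the baseline can follow a data stream without storing past observations, which is the property a daily-updated IDS would need from its learning method.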

ML/DM method and how it can classify malicious network activity

Artificial Neural Networks

Inspired by the brain, this method arranges artificial neurons in layers; each neuron computes on its inputs, and the final layer produces the classification.

(Fuzzy) Association

Discovers previously unknown relationships among different data attributes and expresses them as association rules.

Bayesian Network

Maps the variables and the relationships between them onto a probabilistic graphical model.


Clustering

Finds patterns (similarities) in unlabelled data.

Decision Trees

A tree-like structure whose leaves represent classifications and whose branches represent the conjunctions of features that lead to those classifications.

Ensemble Learning

Supervised learning searches the hypothesis space for a hypothesis that makes good predictions for a given problem. Ensemble learning combines multiple hypotheses, aiming for better predictions than the best single hypothesis alone.

Evolutionary Computation

Finds effective computational methods by the principle of survival of the fittest: useful computational elements survive, while less useful ones are discarded.
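Two of the methods listed above, decision trees and ensemble learning, can be sketched together in Python. The example is a hand-written illustration, not a trained model: the feature names (`failed_logins`, `duration_s`, `bytes_sent`), the thresholds and the three member classifiers are all hypothetical.

```python
from collections import Counter


def tree_classifier(conn):
    """Hand-written decision tree over a connection record.

    Each path from the root to a return statement is a conjunction of
    feature tests, and each return value is a classification.
    """
    if conn["failed_logins"] > 5:
        if conn["duration_s"] < 2:
            return "attack"
        return "normal"
    if conn["bytes_sent"] > 1_000_000:
        return "attack"
    return "normal"


def login_rule(conn):
    """Single-test classifier on failed logins."""
    return "attack" if conn["failed_logins"] > 3 else "normal"


def volume_rule(conn):
    """Single-test classifier on traffic volume."""
    return "attack" if conn["bytes_sent"] > 500_000 else "normal"


def ensemble_classifier(conn, members=(tree_classifier, login_rule, volume_rule)):
    """Ensemble by majority vote over the member classifiers."""
    votes = Counter(member(conn) for member in members)
    return votes.most_common(1)[0][0]


# Assumed example records.
brute_force = {"failed_logins": 8, "duration_s": 1, "bytes_sent": 200}
slow_retry = {"failed_logins": 6, "duration_s": 10, "bytes_sent": 100}

print(ensemble_classifier(brute_force))
print(login_rule(slow_retry), ensemble_classifier(slow_retry))
```

For the second record the login rule alone raises an alarm, but the other two members outvote it. This is the sense in which an ensemble can correct the errors of an individual hypothesis, provided its members do not all fail in the same way.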

Machine Learning and Data Mining technologies offer promise for system misuse and intrusion detection but are currently limited by a lack of training datasets.