Is machine learning useful for finding malicious URLs?

Detecting Malicious URLs Using Lexical Analysis

A Uniform Resource Locator (URL), or web address, identifies pages and content on the web. Attackers can use malicious URLs as part of online attacks to harm visitors. A common technique for filtering out malicious URLs is blacklisting known harmful webpages. However, attackers can subtly change web addresses, rendering blacklists useless. Companies often use expensive and complicated technologies to identify malicious addresses but do not provide the resulting lists freely. Heuristic-based techniques are an alternative that can identify newly created malicious websites in real time. However, they are not infallible and could be anticipated by attackers. Because detection techniques are time- and resource-intensive, they are generally limited to the classification of URLs or to a specific attack. For these reasons, there is a need for approaches that can better detect and categorize malicious URLs.
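The weakness of exact-match blacklisting can be illustrated with a minimal sketch. The blacklist contents, the `is_blacklisted` helper and the example addresses are all hypothetical and are not taken from the study; real blacklist services are more elaborate than a simple set lookup.

```python
# Hypothetical blacklist holding one known harmful address (illustrative only).
BLACKLIST = {"http://malicious.example/login"}


def is_blacklisted(url: str) -> bool:
    """Return True only if the URL matches a blacklist entry exactly."""
    return url in BLACKLIST


known = "http://malicious.example/login"
# The attacker appends a query string; the page served may be the same.
variant = "http://malicious.example/login?session=1"

print(is_blacklisted(known))    # the listed address is caught
print(is_blacklisted(variant))  # the slightly altered address is missed
```

Under this assumption of exact matching, any small change to the address is enough to evade the list, which is why approaches that generalize from the form of the URL are attractive.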

Machine learning techniques offer a potential solution. They can already classify malicious websites by their URL, content and network activity. Mamun et al. investigated the use of these techniques for identifying bad URLs. The researchers collected about 110 000 URLs known for spam, phishing, malware distribution and defacement, as well as benign URLs. They then created a program that could identify 79 features of the words used in the URLs and taught it to detect the implications of these lexical features. The lexical features included elements such as the length of the URL and the characters used to compose it. The program found five sets of lexical features that help to identify bad URLs. The researchers also looked at six techniques used maliciously to mask or ‘obfuscate’ harmful URLs. Finally, they created and tested malicious URL classifiers using the selected features and obfuscation techniques.
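To make the idea of lexical features concrete, the following is a minimal sketch of extracting a handful of them from a URL string. The feature names and the `lexical_features` function are illustrative assumptions; they are not the 79 features used by Mamun et al., only simple examples based on the length and character composition mentioned above.

```python
from urllib.parse import parse_qsl, urlsplit


def lexical_features(url: str) -> dict:
    """Compute a few illustrative lexical features from the URL text alone.

    No network request is made: only the characters of the URL are examined.
    The feature set is hypothetical and much smaller than the study's.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "path_length": len(parts.path),
        "digit_count": sum(c.isdigit() for c in url),
        "letter_count": sum(c.isalpha() for c in url),
        # Anything that is neither a letter nor a digit, e.g. ':', '/', '?'.
        "special_char_count": sum(not c.isalnum() for c in url),
        "dot_count_in_host": host.count("."),
        "query_param_count": len(parse_qsl(parts.query)),
    }


features = lexical_features("http://example.com/a1?x=2")
for name, value in features.items():
    print(f"{name}: {value}")
```

A dictionary like this, computed for each labelled URL, could serve as one row of input to a classifier; which features are actually informative is what the feature selection step in the study set out to determine.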

The selected features and classifiers appeared to be highly accurate. The classifiers detected 98% of spam and malware URLs and 99% of defacement URLs. These methods provide a measure of confidence as to whether a URL is malicious.

The methods developed achieved very high detection rates. Security professionals can use machine learning to detect malicious URLs, including defacement URLs. They can use the identified classifiers to augment blacklist techniques and increase protection against malicious URLs.

Machine learning can provide an additional tool for identifying bad web addresses.