The most common way to mitigate phishing attacks is to warn end users when they navigate to a phishing website, which requires the browser to implement a blacklist system. Such a system matches the URL being visited against a constantly updated list of blocked URLs, otherwise known as a blacklist. Although blacklists theoretically approach perfect accuracy over time for the URLs they know, unfamiliar URLs are another story. By design, this problem cannot be solved easily: a malicious URL must become prevalent and be seen by the system before it is added. As such, blacklist performance should not be measured purely on detection accuracy, but rather as the percentage of phishing URLs detected over time.
We quantified blacklist performance in terms of detection latency, which we believe is a truer metric for performance and real-world usability. We conducted our experiments on one of the most popular URL blacklists available. To test how fast a blacklist adds new threats, we defined start and stop points: the clock starts when a URL first appears in a highly qualified source of phishing URLs and stops when that URL first appears on the blacklist. Since neither the source nor the blacklist is meant to be queried in real time, we polled both systems every five minutes. At each poll, the latest phishing URLs from the source are stored, and any previously seen URL that appears on the blacklist for the first time is time-stamped. Two weeks of data collection produced the following results:
- 22,500 phishing URLs were stored.
- 9,600 of these phishing URLs appeared in the blacklist at some point.
- 7,000 of those 9,600 were already in the blacklist when they first appeared in the source.
Below is a graph plotting the accumulated percentage of URLs detected by the blacklist over the course of time, starting with their first appearance in the source.
After two weeks, the blacklist still did not include roughly 60 percent (approximately 13,000) of the URLs seen in the phishing source. Most of the URLs the blacklist did detect were added within 30 minutes of first being seen in the source. Notably, the detection percentage barely changes after those first 30 minutes, which leaves a persistent gap. Essentially, if a URL is not caught right away, it is likely never to be caught. The gap could largely be explained by the blacklist never seeing those URLs, which in turn could be explained by the sheer number of phishing campaigns that are launched.
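The measurement described above can be sketched in a few lines of Python. This is a minimal illustration, not the tooling we used: the class name `LatencyTracker` and its methods are hypothetical, and it assumes the caller supplies, at each five-minute poll, the URLs currently in the source and the subset of tracked URLs the blacklist currently flags.

```python
from datetime import datetime, timedelta


class LatencyTracker:
    """Records when each URL is first seen in the source and in the blacklist."""

    def __init__(self):
        self.first_in_source = {}     # url -> time of first sighting in the source
        self.first_in_blacklist = {}  # url -> time of first sighting in the blacklist

    def record_poll(self, now, source_urls, blacklisted_urls):
        # Remember the first poll at which each URL shows up in the source.
        for url in source_urls:
            self.first_in_source.setdefault(url, now)
        # Time-stamp tracked URLs the first time the blacklist reports them.
        for url in blacklisted_urls:
            if url in self.first_in_source:
                self.first_in_blacklist.setdefault(url, now)

    def detected_fraction(self, max_latency):
        """Fraction of source URLs blacklisted within max_latency of first sighting."""
        if not self.first_in_source:
            return 0.0
        hits = sum(
            1
            for url, listed in self.first_in_blacklist.items()
            if listed - self.first_in_source[url] <= max_latency
        )
        return hits / len(self.first_in_source)
```

Evaluating `detected_fraction` at increasing latencies (for example `timedelta(minutes=30)`, then `timedelta(days=14)`) gives the points of an accumulated-detection curve. A URL that is already on the blacklist at the poll where it first appears in the source gets a latency of zero.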
A Pervasive Problem
We are not the only ones who have highlighted the weaknesses of blacklists. The graph below was taken from the Website Hacked Trend Report Q2 2016 by Sucuri. They analyzed four major blacklists and found similar results:
Out of the 9,800 URLs they identified as threats, only 18 percent were found in the blacklists. The best performing blacklist, Google Safe Browsing, accounted for 52 percent of that 18 percent, or nine percent of the total threats. Blacklists are simply not cutting it. Arguably, adding more blacklists could shrink the gap, but a very large gap would remain.
Attacking at the Root
To solve this issue, we need to attack the problem at its core. Specifically, how do we remove a blacklist's dependence on the passage of time and the prevalence of phishing URLs while maintaining excellent accuracy? This is where predictive systems can help.
One can implement a predictive system that scores each URL according to the likelihood that it is malicious, based on the structure and syntax of the URL itself. Such a system allows real-time detection of phishing URLs: it can score URLs that a blacklist has never seen, which in turn enables it to outperform blacklists on the incompleteness problem.
The tradeoff is that these predictive systems provide scores that are not absolute truths (e.g., http://www.phishingsite.com is BAD, while http://www.benignsite.com is GOOD). Instead, predictive tools generally assign each URL a probabilistic score between 0 and 100. (We discussed how probability scores must be used differently than binary scoring systems in the first article in this series.)
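To make the idea of scoring a URL from its structure and syntax concrete, here is a toy sketch. It is not how Swordphish or any production system works: the features, the `WEIGHTS` and `BIAS` values, and the function names are all assumptions chosen by hand for illustration, whereas a real system would learn its parameters from labeled data.

```python
import math
import re
from urllib.parse import urlparse

# Hypothetical, hand-picked parameters; a real model would learn these.
WEIGHTS = {
    "length": 1.0,           # longer URLs are treated as more suspicious
    "dots_in_host": 0.6,     # many subdomain levels
    "hyphens_in_host": 0.5,  # hyphen-heavy hostnames
    "has_ip_host": 2.5,      # raw IPv4 address instead of a domain name
    "has_at": 2.0,           # '@' can hide the real host from the reader
}
BIAS = -3.0


def extract_features(url):
    """Turn a URL into a few numeric lexical features (illustrative only)."""
    host = urlparse(url).hostname or ""
    is_ip = re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}", host) is not None
    return {
        "length": len(url) / 100,
        "dots_in_host": host.count("."),
        "hyphens_in_host": host.count("-"),
        "has_ip_host": 1.0 if is_ip else 0.0,
        "has_at": 1.0 if "@" in url else 0.0,
    }


def score(url):
    """Return a score between 0 and 100 using a logistic function."""
    features = extract_features(url)
    z = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return 100 / (1 + math.exp(-z))
```

Because the score depends only on the URL string, `score(...)` can be called on a URL nobody has reported yet, which is precisely what a blacklist lookup cannot do.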
Leveraging the Power of Predictive Systems
A concrete example that attests to the feasibility and superior performance of a predictive system such as the one described is Swordphish. This technology shows that a predictive system can detect phishing URLs based on URL syntactic features while achieving high accuracy and completeness. In internal testing with two million URLs (half phishing and half benign), Swordphish achieved a precision of 93 percent and a completeness of 91 percent at a 60-percent threshold. The chosen threshold is the point at which the precision and completeness curves intersect. Compare that to our internal blacklist test, which had 98 percent or greater precision but a completeness of only 40 percent.
Another benefit is that URLs queried through Swordphish don’t need to be “known” or stored, as they do with a blacklist. It can therefore accurately predict the maliciousness of URLs that have never been seen before, with no need to wait for a blacklist update to receive the correct result. Swordphish returns a response in 150 milliseconds on average for a single URL, a per-URL figure that decreases when URLs are tested in large batches (up to 1,000 per request) by taking advantage of parallelism. Below are two graphs showing Swordphish’s performance.
The first graph shows that Swordphish can maintain high precision with high completeness at varying thresholds. It also shows the tradeoff inherent in predictive models. A higher threshold means you are more sure of the positive predictions you make, but at the expense of completeness. Conversely, a lower threshold flags URLs more readily but produces more false positives (lower precision).
The Receiver Operating Characteristic (ROC) curve, as a measure of model performance, shows the true positive rate the model can achieve at a given false positive rate. Moving along the curve corresponds to varying the threshold mentioned earlier. A perfect model has a 100-percent true positive rate with a zero-percent false positive rate, meaning the area under the curve is 1.0. Here, we see that the model for Swordphish achieves an area of 0.98.
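For readers who want to reproduce this kind of measurement on their own data, the following sketch computes precision and completeness at a threshold and searches for the point where the two are closest. It assumes scores between 0 and 100, boolean labels (True for phishing), and that a URL is flagged when its score is at or above the threshold; the function names are ours, and completeness here corresponds to what is usually called recall.

```python
def precision_and_completeness(scores, labels, threshold):
    """Precision and completeness when flagging every URL with score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    # Convention: with nothing flagged, precision is reported as 1.0.
    precision = tp / (tp + fp) if tp + fp else 1.0
    completeness = tp / (tp + fn) if tp + fn else 0.0
    return precision, completeness


def crossover_threshold(scores, labels, candidates=range(0, 101)):
    """Candidate threshold at which precision and completeness are closest."""
    def gap(threshold):
        p, c = precision_and_completeness(scores, labels, threshold)
        return abs(p - c)
    return min(candidates, key=gap)
```

On a small made-up sample such as scores `[95, 80, 70, 65, 40, 30, 20, 10]` with labels `[True, True, False, True, True, False, False, False]`, a threshold of 60 flags four URLs, three of them phishing, so precision and completeness both equal 0.75.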
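Batching of this kind might look like the sketch below on the client side. The 1,000-URL limit comes from the text above; everything else is an assumption. In particular, `score_batch` is a hypothetical caller-supplied function that takes a list of URLs and returns a list of scores, since the actual Swordphish API is not described here.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_BATCH = 1000  # per-request limit mentioned in the text


def chunk(urls, size=MAX_BATCH):
    """Yield consecutive slices of at most `size` URLs."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def score_all(urls, score_batch, max_workers=4):
    """Score every URL by sending batches in parallel.

    score_batch is a hypothetical callable: list of URLs -> list of scores.
    Results are returned in the same order as the input URLs.
    """
    batches = list(chunk(urls))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves batch order even when batches finish out of order.
        results = list(pool.map(score_batch, batches))
    return [score for batch_scores in results for score in batch_scores]
```

Threads are a reasonable fit if each batch is a network request, because the work is dominated by waiting on I/O rather than by computation.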
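As a rough illustration of how an area-under-the-curve figure is obtained, the sketch below builds ROC points by sweeping the threshold over every distinct score and integrates them with the trapezoid rule. The function names are ours, the input must contain at least one phishing and one benign example, and this is a generic textbook computation rather than the evaluation code behind the numbers above.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing is flagged
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoid-rule area under a curve given as points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area
```

A model that ranks every phishing URL above every benign one yields an area of 1.0, while a model that scores at random tends toward 0.5.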
Blacklists have an incompleteness problem due to the nature of their inner workings. They require the passage of time and the prevalence of phishing URLs to be included in their system. Swordphish, a predictive system, makes use of probability and previously learned URL phishing patterns to determine whether to block or allow URLs in real time. Because of this, Swordphish helps solve the incompleteness problem, and is a valuable tool in augmenting an organization’s anti-phishing efforts.