This post is part of our continuing series on exploring and integrating new probabilistic tools for fraud prevention.
Phishing is the act of defrauding an online user by posing as a trustworthy institution or entity and tricking them into clicking on a malicious link in order to obtain personal information. Because the phishing page imitates the real one, users have a hard time differentiating between a legitimate and a malicious site. One might think that this mimicry also makes it harder to identify phishing pages automatically, but the opposite can be true: in the effort to conceal a forgery, patterns and behaviors emerge. Below, we explore how we found those patterns and used them to discriminate between a trustworthy institution and someone merely posing as one.
The first thing a phishing site tries to do is get you to its website, typically by having you click on a link in a deceptive email that appears to come from a legitimate financial institution. Once you’ve clicked, the attackers try to keep up the charade through obfuscation or misleading information. For example, they may overcomplicate the URL syntax to suggest a complicated process, change a letter in the trustworthy site’s URL, or embed the trustworthy site’s domain in the phishing URL. Based on these characteristics, the research team started to look for measurable patterns in the data that could help discriminate between legitimate and phishing URLs. We followed a classification process in three steps: first we extracted features from the URLs, then we trained machine learning classifiers on those features, and finally we evaluated the performance of the model using well-known statistical measures.
Classification based on URLs offers a defense that applies to every phishing attack, because every attack has a URL. We analyzed approximately 40 features based on the structure of the URL. One example is the Kullback-Leibler divergence between the normalized character frequencies of the English language and those of the URL, which measures how closely the character distribution of the URL resembles that of English text. Other features include the number of “@” and “-” symbols, the number of top-level domains in the URL, whether the URL is an IP address, the URL length and the number of suspicious words in the URL. For example, we evaluate the following features:
- Length ratio
- Symbol count
- TLD count
- IP address
- Suspicious word count
- Character frequency
- Euclidean distance
- Kolmogorov–Smirnov statistic
- Kullback-Leibler Divergence
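The features listed above can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used in the research: the letter-frequency table holds approximate values, the `SUSPICIOUS_WORDS` and `TLDS` lists are hypothetical placeholders, and the post does not define "length ratio", so the host-to-URL length ratio used here is an assumption.

```python
import ipaddress
import math
import re
from collections import Counter
from urllib.parse import urlparse

# Approximate relative letter frequencies for English text (illustrative values).
ENGLISH_FREQ = {
    "a": 0.0817, "b": 0.0149, "c": 0.0278, "d": 0.0425, "e": 0.1270,
    "f": 0.0223, "g": 0.0202, "h": 0.0609, "i": 0.0697, "j": 0.0015,
    "k": 0.0077, "l": 0.0403, "m": 0.0241, "n": 0.0675, "o": 0.0751,
    "p": 0.0193, "q": 0.0010, "r": 0.0599, "s": 0.0633, "t": 0.0906,
    "u": 0.0276, "v": 0.0098, "w": 0.0236, "x": 0.0015, "y": 0.0197,
    "z": 0.0007,
}
LETTERS = sorted(ENGLISH_FREQ)
_TOTAL = sum(ENGLISH_FREQ.values())
ENGLISH_DIST = [ENGLISH_FREQ[c] / _TOTAL for c in LETTERS]  # sums to 1

# Hypothetical lists; the post does not say which words or TLDs were used.
SUSPICIOUS_WORDS = ("login", "secure", "account", "verify", "update", "confirm")
TLDS = {"com", "net", "org", "info", "biz"}


def letter_distribution(text):
    """Normalized a-z frequencies; uniform if there are no letters (an assumption)."""
    counts = Counter(ch for ch in text.lower() if ch in ENGLISH_FREQ)
    total = sum(counts.values())
    if total == 0:
        return [1.0 / len(LETTERS)] * len(LETTERS)
    return [counts[c] / total for c in LETTERS]


def kl_divergence(p, q):
    """D(P || Q) = sum p_i * log(p_i / q_i); terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def ks_statistic(p, q):
    """Largest absolute gap between the two cumulative distributions."""
    cum_p = cum_q = largest = 0.0
    for pi, qi in zip(p, q):
        cum_p += pi
        cum_q += qi
        largest = max(largest, abs(cum_p - cum_q))
    return largest


def euclidean_distance(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


def is_ip_host(host):
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def extract_features(url):
    """Map one URL string to a dictionary of numeric features."""
    parsed = urlparse(url if "//" in url else "//" + url)
    host = parsed.hostname or ""
    tokens = re.split(r"[^a-z0-9]+", url.lower())
    dist = letter_distribution(url)
    return {
        "url_length": len(url),
        "host_length_ratio": len(host) / len(url) if url else 0.0,  # assumed definition
        "at_count": url.count("@"),
        "dash_count": url.count("-"),
        "tld_count": sum(1 for tok in tokens if tok in TLDS),
        "is_ip": int(is_ip_host(host)),
        "suspicious_word_count": sum(url.lower().count(w) for w in SUSPICIOUS_WORDS),
        "kl_divergence": kl_divergence(dist, ENGLISH_DIST),
        "ks_statistic": ks_statistic(dist, ENGLISH_DIST),
        "euclidean_distance": euclidean_distance(dist, ENGLISH_DIST),
    }
```

Calling `extract_features("http://192.168.0.1/secure-login@paypal.com")` returns one such dictionary per URL, and a list of these dictionaries is what a classifier would be trained on. The three distance features compare the same pair of character distributions in different ways, which is why they can share a single `letter_distribution` helper.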
Then, we extracted these features from millions of phishing URLs taken from common phish repositories and from millions of legitimate (ham) URLs taken from the CommonCrawl corpus, and used them to train different machine learning classifiers. In particular, we evaluated algorithms such as random forests, support vector machines and extreme gradient boosting, and used cross-validation to select the model and parameters that maximize predictive power, as measured by F1-Score.
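To make the selection step concrete, here is a small, dependency-free sketch of choosing between candidate models by mean cross-validated F1-Score. It is only an illustration of the procedure: the `Stump` classifier is a toy stand-in for the random forests, support vector machines and gradient boosting models mentioned above, and `make_toy_data` produces synthetic data rather than real URL features. All names in the block are hypothetical.

```python
import random


def f1_score(y_true, y_pred):
    """F1 = harmonic mean of precision and recall for the positive class (1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


class Stump:
    """Toy classifier: predicts 1 when one chosen feature reaches a learned threshold."""

    def __init__(self, feature):
        self.feature = feature
        self.threshold = 0.0

    def fit(self, X, y):
        best = -1.0
        for t in sorted({row[self.feature] for row in X}):
            preds = [1 if row[self.feature] >= t else 0 for row in X]
            score = f1_score(y, preds)
            if score > best:
                best, self.threshold = score, t
        return self

    def predict(self, X):
        return [1 if row[self.feature] >= self.threshold else 0 for row in X]


def cross_val_f1(make_model, X, y, k=5):
    """Mean F1 over k folds; each fold is held out once for testing."""
    folds = k_fold_indices(len(X), k)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        model = make_model().fit([X[j] for j in train_idx], [y[j] for j in train_idx])
        preds = model.predict([X[j] for j in test_idx])
        scores.append(f1_score([y[j] for j in test_idx], preds))
    return sum(scores) / len(scores)


def select_best(candidates, X, y, k=5):
    """Return the candidate name with the highest mean cross-validated F1."""
    scores = {name: cross_val_f1(factory, X, y, k) for name, factory in candidates.items()}
    return max(scores, key=scores.get), scores


def make_toy_data(n=200, seed=42):
    """Synthetic data: feature 0 separates the classes, feature 1 is pure noise."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        X.append([rng.gauss(2.0 if label else 0.0, 0.5), rng.gauss(0.0, 1.0)])
        y.append(label)
    return X, y


X, y = make_toy_data()
candidates = {
    "stump_on_feature_0": lambda: Stump(0),
    "stump_on_feature_1": lambda: Stump(1),
}
best_name, cv_scores = select_best(candidates, X, y)
```

In practice the candidates would be real classifiers with different hyperparameter settings, and the folds would be drawn from the labeled phishing and ham URLs. The comparison logic stays the same: every candidate is scored on the same held-out folds, and the one with the best average F1 is kept.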
Analyzing and Applying Results
The resulting model had an accuracy of more than 95 percent and showed great stability across the cross-validation sets. These results suggest that a simple, URL-only defense such as this one can deliver strong and consistent results on standard statistical measures of performance.
Existing phishing detection solutions are falling short. Companies need a reliable means to discover phishing sites efficiently and easily. The classification process described above is highly accurate and can be deployed to help prioritize or filter the thousands of potential phishing cases that threaten organizations on a daily basis. By using this improved technology, analysis workload can be reduced by 70 to 80 percent, allowing analysts to focus on the URLs with a higher probability of being malicious.