Easy Solutions data scientists, including the author of this article, will present extensive research on phishing patterns and correlations between attacks. They will share their expertise at the prestigious eCrime 2016 Symposium on Electronic Fraud Research in Toronto, Canada, at the beginning of June. The article below details some of the findings our experts will discuss at the symposium.
In a recent report we showed how cluster analysis gives us a better understanding of phishing attacks and attackers. This post lays out in greater detail how to create those clusters by examining the features and methods used.
For the study, we used the data collected over the course of more than a year in tracking and taking down phishing cases on behalf of a major U.S. financial institution. This data contains everything related to the management of each case, including the Whois details of the domain where the attack was hosted, as well as related RRSet records and the HTML code of the criminal’s phishkit. In total, we collected 3,030 phishing attacks that took place between September and December 2015.
Using the data, we extracted features from the following four categories:
First, we examined the similarities between the phishing site and the target site, evaluating the structural similarity by comparing the HTML source code of both websites. In particular, we counted the number of matching and mismatching tags in order to create an attack index for comparison purposes. Additionally, we measured how similar the text was by calculating the cosine similarity between the text of the two sites.
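The two comparisons described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: the function names are hypothetical, tags are compared as simple counts of opening tags, and the text is compared as bag-of-words term frequencies. How the counts are combined into the attack index is not specified here, so the sketch stops at the raw counts.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts every opening tag seen in an HTML document."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1


def tag_counts(html):
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.tags


def tag_match_stats(phish_html, target_html):
    """Return (matching, mismatching) tag counts between two pages."""
    a, b = tag_counts(phish_html), tag_counts(target_html)
    matching = sum((a & b).values())                 # tags found in both pages
    mismatching = sum(((a - b) + (b - a)).values())  # tags found in only one page
    return matching, mismatching


def cosine_similarity(text_a, text_b):
    """Cosine similarity between bag-of-words term-frequency vectors."""
    va = Counter(re.findall(r"\w+", text_a.lower()))
    vb = Counter(re.findall(r"\w+", text_b.lower()))
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0
```

A score near 1 means the two pages use almost the same vocabulary in the same proportions, while a score of 0 means they share no words at all.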
Then, we analyzed the phishing site’s structure. We used several features extracted from the HTML code, namely: the number of forms, the type of post action, whether the page contained a logon form, whether the target’s name appeared in the path, whether the site ran on WordPress, and whether the path belonged to a per-user web directory.
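As a rough illustration of how such structural features might be pulled from a page, the sketch below uses only the Python standard library. Every heuristic in it is an assumption made for the example rather than the study's definition: a logon form is taken to be any page with a password field, the post action is reduced to whether a form submits to an absolute URL on another host, WordPress is detected from `wp-` path segments, and a per-user web directory is taken to be a path beginning with `/~name/`.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse


class FormScanner(HTMLParser):
    """Records form actions and whether any password input is present."""

    def __init__(self):
        super().__init__()
        self.form_actions = []
        self.has_password_field = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.form_actions.append(attrs.get("action") or "")
        elif tag == "input" and (attrs.get("type") or "").lower() == "password":
            self.has_password_field = True


def structure_features(url, html, target_name):
    """Build a feature dict for one phishing page (all heuristics assumed)."""
    scanner = FormScanner()
    scanner.feed(html)
    scanner.close()
    path = urlparse(url).path.lower()
    return {
        "n_forms": len(scanner.form_actions),
        # Assumed encoding of "type of post action": does any form
        # submit to an absolute URL that names a host?
        "posts_to_absolute_url": any(urlparse(a).netloc for a in scanner.form_actions),
        "is_logon_form": scanner.has_password_field,
        "target_in_path": target_name.lower() in path,
        "is_wordpress": bool(re.search(r"/wp-(content|includes|admin)/", path)),
        "per_user_directory": bool(re.match(r"/~[^/]+", path)),
    }
```

Each page then becomes one row of booleans and counts that can be fed to the clustering step alongside the other feature groups.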
Phishing Visitors Tracking
Using information from a service provided by the company that deploys tools to track phishing sites’ browsing activity, we built an estimator to break out the phishing site’s first visitors and hits. Then, we created the following list of features: first hit’s country, first hit’s region, second hit’s country, second hit’s region.
The last group is related to features extracted from the phishing site domain registration. In particular, using the Whois records, we were able to determine the time elapsed between the domain registration and the phishing event.
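The registration feature comes down to a timestamp difference. A minimal sketch, assuming both the Whois creation date and the detection time are already available as ISO 8601 strings with explicit UTC offsets (the function name and the input format are illustrative, not taken from the study):

```python
from datetime import datetime


def domain_age_days(registered_at, detected_at):
    """Days elapsed between domain registration and the phishing event."""
    registered = datetime.fromisoformat(registered_at)
    detected = datetime.fromisoformat(detected_at)
    return (detected - registered).total_seconds() / 86400.0
```

For example, `domain_age_days("2015-09-01T00:00:00+00:00", "2015-09-03T12:00:00+00:00")` gives 2.5 days.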
After compiling the data listed above, we used the Expectation-Maximization algorithm to create phishing attack clusters that allowed us to group similar phishing sites. Expectation-Maximization is a two-step approach which works as follows:
- Step 1: Guess some cluster centers.
- Step 2: Repeat until converged. In each pass, first assign every point to its nearest cluster center, then move each cluster center to the mean of the points assigned to it.
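The two steps above can be sketched in plain Python. With hard assignments to the nearest center, the loop is the k-means form of Expectation-Maximization, and that is what this example implements. It is an illustration under simplifying assumptions (Euclidean distance, a fixed number of clusters `k`, numeric feature tuples), not the implementation used in the study; the function names are hypothetical.

```python
import math
import random


def nearest(point, centers):
    """Index of the center closest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def cluster(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)      # Step 1: guess some cluster centers
    labels = []
    for _ in range(max_iter):            # Step 2: repeat until converged
        # Assign every point to its nearest center.
        labels = [nearest(p, centers) for p in points]
        # Move each center to the mean of its assigned points;
        # a center with no points keeps its previous position.
        new_centers = []
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            new_centers.append(mean(members) if members else centers[i])
        if new_centers == centers:       # centers stopped moving
            break
        centers = new_centers
    return centers, labels
```

Calling `cluster(points, k=2)` on a list of feature tuples returns the final centers and one cluster label per point. The result depends on the random initial guess, so in practice it is common to run the loop from several seeds and compare.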
Let's quickly visualize this process:
First, we plotted the number of references each phishing site makes to the target site against the text similarity of the two sites. Then, we initialized the clustering algorithm by randomly guessing some cluster centers (large circles). The algorithm then assigned each phishing site to the nearest cluster center, or centroid.
In the next iteration, the algorithm updated each centroid by setting it to the mean of its cluster. The algorithm also checked whether the number of clusters was optimal and, based on its findings, eliminated a cluster centroid and continued the process with the remaining centroids (see Iter 2).
Finally, using this algorithm with the complete feature set, we estimated cluster sets. We found that, taking the page structure into account, phishing sites were easy to cluster. As the next figure demonstrates, cluster2 and cluster3 are very similar in that both are characterized by very high layout and text similarities with the targeted site. However, cluster3 is further characterized by a very high number of references to the original site. Lastly, cluster1 represents attacks that are not intended to visually replicate the target site.
The purpose of this post is to demonstrate the importance of analyzing phishing attacks for patterns and correlations. By classifying phishing cases according to the attacker’s strategy, we identified three main clusters of phishing sites based on the HTML structure and similarity with the target site.