A recent report showed how we can gain a better understanding of phishing attacks and attackers by using cluster analysis. Subsequently, in a recent post, we showed how the expectation-maximization algorithm used to cluster these attacks works. But to truly understand cluster analysis, you need to know more than just how clusters are built: it’s equally important to understand which features are significant and how they impact clustering. This post lays out our methodology for estimating the clusters’ critical features as part of an overall phishing attack analysis.
First, let’s quickly review the clusters we built to understand phishing attacks. Using data we collected over the course of a year spent tracking and taking down phishing cases for a major U.S. financial institution, we extracted features from four categories: similarity analysis, structure analysis, phishing visitor tracking and domain registration. Then, using the expectation-maximization clustering algorithm, we identified three groups of attacks, as shown in the following figure.
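As a rough illustration of this clustering step, the sketch below fits a Gaussian mixture with the expectation-maximization algorithm using scikit-learn. It is not the code behind the original analysis: the data is synthetic, the three feature columns (layout similarity, text similarity, number of references) and their values are made up for the example, and the choice of `GaussianMixture` with standardized features is an assumption.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the attack features (hypothetical columns):
# layout similarity, text similarity, number of references to the target site.
centers = np.array([
    [0.1, 0.1, 2.0],    # pages that do not imitate the target
    [0.9, 0.9, 2.0],    # close imitations with few references
    [0.9, 0.9, 30.0],   # close imitations with many references
])
noise = np.array([0.03, 0.03, 0.5])
X = np.vstack([rng.normal(c, noise, size=(100, 3)) for c in centers])

# Put all features on a comparable scale before fitting the mixture.
X_scaled = StandardScaler().fit_transform(X)

# GaussianMixture is fitted with expectation-maximization.
gmm = GaussianMixture(n_components=3, n_init=5, random_state=0)
cluster_labels = gmm.fit_predict(X_scaled)

# Soft assignments: probability of each attack belonging to each cluster.
responsibilities = gmm.predict_proba(X_scaled)
print(np.bincount(cluster_labels))
```

The hard labels in `cluster_labels` are what a later classifier would be trained to predict; the soft assignments are shown only because they are a natural by-product of the mixture model.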
We found that the phishing sites were easy to cluster when taking into account the page structure. As you can see, cluster2 and cluster3 are very similar in that both are characterized by very high layout and text similarity with the targeted site. However, when looking at the number of references, cluster3 stands out for its very high number of references to the original site. Lastly, cluster1 represents attacks that are not intended to visually replicate the target site.
But which features are more important for clustering the different attacks?
There are several methods you can use to pre-select the best features for a given clustering algorithm (see Roth & Lange, 2004). However, as the purpose of this study is to explore clustering models as a tool to describe the nature of the attacks, we are more interested in understanding the importance of the features after the clusters are built.
To estimate the features’ importance, we used a variation of the method presented in Ceccarelli & Maratea (2007) to build a machine-learning classifier. In our case, we selected extremely randomized trees. Using the different features as inputs, we trained the classifier to predict the cluster to which each attack belongs. Afterwards, using the mean decrease impurity, we estimated the feature importance to better understand the features’ underlying behavior in the clustering procedure. The mean decrease impurity of a feature is defined as the total decrease in node impurity over the nodes that split on that feature, weighted by the probability of reaching each node (which is in turn approximated by the proportion of samples reaching that node) and averaged over all trees of the ensemble (see Louppe et al., 2013).
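The procedure above can be sketched with scikit-learn, whose `ExtraTreesClassifier` exposes impurity-based importances through the `feature_importances_` attribute. This is a minimal sketch under stated assumptions, not the original implementation: the data is synthetic, the feature names are hypothetical, and the cluster labels are generated directly rather than coming from a real clustering run.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.default_rng(0)
n = 300

# Hypothetical feature names; the last one is unrelated to the clusters.
feature_names = ["layout_similarity", "text_similarity",
                 "n_references", "domain_age_days"]

# Stand-in for the cluster assignment produced by the clustering step.
cluster_labels = np.repeat([0, 1, 2], n // 3)

layout = np.where(cluster_labels == 0, 0.1, 0.9) + rng.normal(0, 0.03, n)
text = np.where(cluster_labels == 0, 0.1, 0.9) + rng.normal(0, 0.03, n)
refs = np.where(cluster_labels == 2, 30.0, 2.0) + rng.normal(0, 0.5, n)
domain_age = rng.uniform(0, 365, n)  # pure noise by construction
X = np.column_stack([layout, text, refs, domain_age])

# Train extremely randomized trees to predict cluster membership.
forest = ExtraTreesClassifier(n_estimators=200, random_state=0)
forest.fit(X, cluster_labels)

# Impurity-based importances, averaged over the trees and normalized to sum to 1.
importances = forest.feature_importances_

ranking = sorted(zip(feature_names, importances),
                 key=lambda pair: pair[1], reverse=True)
for name, score in ranking:
    print(f"{name:>20s}  {score:.3f}")
```

In this toy setup the noise feature should receive the smallest importance, which is the kind of signal used to decide which features actually drive the separation between clusters.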
Going back to our example, we estimated the feature importance for the phishing attack clusters, as shown in the following figure.
In our example, the features that most differentiate the clusters are the number of references to the attacked domain and the structure and text similarity. These indicators tell us that, when segmenting phishing attacks, web page structure matters more than domain registration or the country of the attack.
In this post, we explained how we extract feature importance from phishing attack clusters. The method consists of building an extremely randomized trees classifier to predict cluster membership, and then estimating feature importance by calculating the mean decrease impurity of each feature. With this method, we can understand the underlying behavior of the features in the clustering procedure.