In our series, Machine Learning Algorithms Explained, our goal is to give you a good sense of how the algorithms behind machine learning work, as well as the strengths and weaknesses of different methods. Each post in this series briefly explains a different algorithm – today, we’re going to talk about K-Means clustering.
Previously in this series, we have looked at decision trees and random forests, two types of supervised learning algorithms. Supervised algorithms are trained with data that provides input vectors as well as their corresponding target vectors, or the output that is expected after the data is processed.
Unsupervised models, on the other hand, are trained using data that consists only of input vectors, with no specific target output in mind. Instead of telling an unsupervised algorithm what it should be looking for in the data, the algorithm does the work itself, in a sense independently finding structure within the data.
K-Means clustering is an unsupervised learning algorithm that, as the name hints, finds a fixed number (k) of clusters in a set of data. A cluster is a group of data points that are grouped together due to similarities in their features. When using a K-Means algorithm, a cluster is defined by a centroid, which is a point (either imaginary or real) at the center of a cluster. Every point in a data set is part of the cluster whose centroid is most closely located. To put it simply, K-Means finds k number of centroids, and then assigns all data points to the closest cluster, with the aim of keeping the centroids small.
K-Means starts by randomly defining k centroids. From there, it works in iterative (repetitive) steps to perform two tasks:
- Assign each data point to the closest corresponding centroid, using the standard Euclidean distance. In layman’s terms: the straight-line distance between the data point and the centroid.
- For each centroid, calculate the mean of the values of all the points belonging to it. The mean value becomes the new value of the centroid.
Once step 2 is complete, all of the centroids have new values that correspond to the means of all of their corresponding points. These new points are put through steps one and two producing yet another set of centroid values. This process is repeated over and over until there is no change in the centroid values, meaning that they have been accurately grouped. Or, the process can be stopped when a previously determined maximum number of steps has been met.
Let see how this algorithm works using the well-known iris flower dataset that we used in our Decision Trees post. This dataset contains four measurements of three different iris flowers. The measurements are: sepal length, sepal width, petal length, and petal width. The three types of iris are Setosa, Versicolour, and Virginica, shown below in that order.
Let's first plot the values of the petals’ lengths and widths against each other.
With just a quick glance, it is clear that there are at least two groups of flowers shown on the chart. Let’s see how we can use a K-means algorithm to find clusters in the data.
Applying the K-Means Algorithm
Iteration 1: First, we create two randomly generated centroids and assign each data point to the cluster of the closest centroid. In this case, because we are using two centroids, our k value is 2.
Iteration 2: As you can see above, the centroids are not evenly distributed. In the second iteration of the algorithm, the average values of each of the two clusters are found and become the new centroid values.
Iterations 3-5: We repeat the process until there is no further change in the value of the centroids.
Finally, after iteration 5, there is no further change in the clusters.
The algorithm explained above finds clusters for the number k that we chose. So, how do we decide on that number?
To find the best k we need to measure the quality of the clusters. The most traditional and straightforward method is to start with a random k, create centroids, and run the algorithm as we explained above. A sum is given based on the distances between each point and its closest centroid. As an increase in clusters correlates with smaller groupings and distances, this sum will always decrease when k increases; as an extreme example, if we choose a k value that is equal to the number of data points that we have, the sum will be zero.
The goal with this process is to find the point at which increasing k will cause a very small decrease in the error sum, while decreasing k will sharply increase the error sum. This sweet spot is called the “elbow point.” In the image below, it is clear that the “elbow” point is at k-3.
Now, let's take a look at the visual differences between using two, three, four, or five clusters.
The different graphs also show that three is the most appropriate k value, which makes sense when taking into account that the data contains three types of iris flowers.
The K-Means algorithm is a great example of a simple, yet powerful algorithm. For example, it can be used to cluster phishing attacks with the objective of discovering common patters and even discovering new phishing kits. It can also be used to understand fraudulent patterns in financial transactions.
Advantages of K-Means:
- Widely used method for cluster analysis
- Easy to understand
- Trains quickly
Disadvantages of K-Means:
- Euclidean distance is not ideal in many applications
- Performance is (generally) not competitive with the best clustering methods
- Small variations in the data can result in a completely different clusters (high variance)
- Clusters are assumed to have a spherical shape and be evenly sized
This post is part of the series Machine Learning Algorithms Explained. Click here to take a look at the other posts in the series.