In our series, Machine Learning Algorithms Explained, our goal is to give you a good sense of how the algorithms behind machine learning work, as well as the strengths and weaknesses of different methods. Each post in this series briefly explains a different algorithm – today, we’re going to talk about Random Forests.
Random Forests are supervised ensemble-learning models used for classification and regression. Ensemble learning models aggregate multiple machine learning models, which usually gives better overall performance than any single one of them. The logic is that each model may be unreliable when used on its own, but the group is strong when its members are combined in an ensemble. In the case of Random Forests, a large number of Decision Trees act as the “weak” learners, and their outputs are aggregated, with the result representing the “strong” ensemble.
In any machine learning model, there are two main sources of error: bias and variance. To illustrate these two concepts, imagine that we have created a machine learning model and we train it several times, each time on a different part of the same data. Each trained version produces its own predictions, and we also have the actual values from the data. To determine the bias and variance, we compare the predictions with the actual values. The bias measures how far the predictions are, on average, from the actual value, and the variance measures how spread out the predictions are from one trained version to the next.
Bias is an error that occurs when an algorithm makes too many simplifying assumptions, causing it to predict values that differ from the actual values.
Variance is an error that results from an algorithm’s sensitivity to small changes in the training dataset; a higher variance means that an algorithm will be more strongly influenced by the specifics of the data.
Ideally, both bias and variance will be low, meaning that the model predicts values that are very close to the correct values no matter which sample of the dataset it was trained on. When this occurs, the model has learned the underlying patterns in the dataset rather than the quirks of one particular sample.
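To make the two definitions concrete, here is a minimal sketch that measures bias and variance for a single data point. It assumes we already have the predictions made by the same model after training it on several different samples; the function name `bias_and_variance` and the numbers are made up for illustration.

```python
import statistics


def bias_and_variance(predictions, actual):
    """Summarize predictions for one data point from several trained models.

    predictions: one predicted value per trained model (hypothetical inputs)
    actual: the true value for that data point
    """
    # Bias: how far the average prediction is from the actual value.
    bias = statistics.mean(predictions) - actual
    # Variance: how spread out the predictions are around their own average.
    variance = statistics.pvariance(predictions)
    return bias, variance


# Three versions of the same model, each trained on a different sample.
bias, variance = bias_and_variance([2.0, 4.0, 6.0], actual=3.0)
print(bias)      # 1.0
print(variance)  # about 2.67
```

Here the predictions are off by 1.0 on average (bias) and also disagree with each other quite a bit (variance).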
Random Forests as a Method to Reduce Variance
Decision Trees are known for showing high variance and low bias. This is mainly due to their capacity to model complex relationships, even to the point of overfitting the noise in the data (that is, fitting the training data so closely that the model does not generalize to new data). Put simply: Decision Trees are usually accurate on average, but their predictions often vary considerably between models trained on different samples taken from the same dataset.
Random Forests reduce the variance that can cause errors in Decision Trees by aggregating the outputs of the individual Decision Trees. For classification, the forest uses majority voting and returns the class predicted by most of the Trees; for regression, it averages their outputs. Either way, aggregation smooths out the variance, so the model is less prone to producing results far from the real values.
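The aggregation step described above can be sketched in a few lines. This is only an illustration: the function names are hypothetical, and the inputs stand in for the outputs that already-trained Trees would return for one data point.

```python
from collections import Counter
import statistics


def aggregate_classification(tree_outputs):
    """Majority vote: return the label predicted by the most trees."""
    return Counter(tree_outputs).most_common(1)[0][0]


def aggregate_regression(tree_outputs):
    """Average the numeric predictions of the trees."""
    return statistics.mean(tree_outputs)


print(aggregate_classification(["like", "skip", "like"]))  # like
print(aggregate_regression([3.0, 4.0, 5.0]))               # 4.0
```

In the first call one Tree disagrees with the other two, and its vote is simply outnumbered.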
The idea behind Random Forests is to take a set of high-variance, low-bias Decision Trees and transform them into a new model that has both low variance and low bias.
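A small simulation can show why averaging lowers variance without adding bias. This sketch rests on a strong simplifying assumption: each "tree" is modeled as the true value plus independent random noise, which real Decision Trees trained on overlapping data do not fully satisfy. The function `simulate` and all of its parameters are invented for illustration.

```python
import random
import statistics


def simulate(n_trees, n_trials=2000, noise_sd=1.0, truth=10.0, seed=0):
    """Compare the spread of one noisy tree with the spread of a forest average.

    Assumption: every tree predicts truth + independent Gaussian noise,
    i.e. low bias but high variance.
    """
    rng = random.Random(seed)
    single, averaged = [], []
    for _ in range(n_trials):
        preds = [truth + rng.gauss(0.0, noise_sd) for _ in range(n_trees)]
        single.append(preds[0])                   # one tree on its own
        averaged.append(statistics.mean(preds))   # the whole forest
    return statistics.pvariance(single), statistics.pvariance(averaged)


var_single, var_forest = simulate(n_trees=25)
print(var_single, var_forest)
```

Under this assumption the forest's predictions cluster much more tightly around the true value than a single tree's do. When the trees' errors are correlated the gain is smaller, which is why the randomness discussed in the next section matters.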
Why Are Random Forests Random?
The “random” in Random Forest comes from two sources. First, the algorithm trains each individual Decision Tree on a different random sample of the training data (typically drawn with replacement, a technique known as bootstrapping). Second, at each node of each tree, the split is chosen from a randomly selected subset of the attributes rather than from all of them. By introducing this randomness, the algorithm creates trees that are less correlated with each other. Their errors therefore tend to fall in different places, so many of them are canceled out by the majority voting strategy of the Random Forest.
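The two kinds of randomness can be sketched with the standard library alone. The helper names, the feature names, and the choice of two candidate features per split are assumptions made for this example, not settings taken from any particular library.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, so some rows repeat and others are left out."""
    return [rng.choice(rows) for _ in rows]


def candidate_features(feature_names, k, rng):
    """Pick k distinct features to consider when splitting one node."""
    return rng.sample(feature_names, k)


rng = random.Random(42)  # fixed seed so the sketch is repeatable
rows = list(range(10))   # stand-ins for ten training examples
features = ["tempo", "year", "genre", "length"]  # hypothetical attributes

for tree_id in range(3):
    sample = bootstrap_sample(rows, rng)
    subset = candidate_features(features, 2, rng)
    print(tree_id, sorted(sample), subset)
```

Each tree ends up seeing a slightly different dataset and a different set of candidate attributes, which is what keeps the trees from all making the same mistakes.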
How Would a Random Forest Work in the Real World?
Imagine that you’re bored of hearing the same techno music over and over again. You desperately want to find some new music that you might like, so you go online to find recommendations. You find a website that lets real people give you music suggestions based on your preferences.
So how does it work? First, in order to avoid suggestions that are simply random, you fill out a questionnaire about your basic music preferences, providing a baseline for the type of music you might like. Using that information, people from the website begin to analyze songs using the criteria (features) that you provided. Each individual person is essentially working as a decision tree.
Individually, the people making suggestions are likely to generalize poorly from your music preferences. For example, one person may conclude that you do not like any songs from before the 1980s, and will therefore not include any in your recommendations. However, this could be an inaccurate assumption, and would cause you to miss suggestions for music you are likely to enjoy.
Why is this mistake happening? Each of the people giving recommendations only has limited information about your preferences, and they are also biased by their own, individual taste in music. To fix this, we combine the suggestions from many individuals (each acting as a Decision Tree) and use majority voting on their suggestions (essentially creating a Random Forest).
However, there is still one more problem – because each of the people is using the same data from the same questionnaire, their suggestions will be highly correlated and will tend to share the same mistakes. In order to expand the range of suggestions, each of the recommenders is given a random subset of your answers instead of all of them, meaning that they have fewer criteria with which to make their recommendations. In the end, the extreme outliers are eliminated through majority voting, and you are left with an accurate and varied list of recommended songs.
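The music example can be turned into a toy forest. This is a deliberately simplified sketch, not a full Random Forest: each "recommender" is a one-question tree (a decision stump) that looks at a single randomly chosen feature, and the songs, features, and labels are invented for illustration.

```python
import random
from collections import Counter


def majority(labels):
    """Return the most common label."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(rows, labels, feature):
    """Find the threshold on one feature that misclassifies the fewest rows."""
    best = None
    for threshold in sorted({row[feature] for row in rows}):
        left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
        right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
        left_label = majority(left)
        right_label = majority(right) if right else left_label
        errors = (sum(y != left_label for y in left)
                  + sum(y != right_label for y in right))
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    return (feature,) + best[1:]


def predict_stump(stump, row):
    feature, threshold, left_label, right_label = stump
    return left_label if row[feature] <= threshold else right_label


def fit_forest(rows, labels, n_trees, rng):
    """Train each stump on a bootstrap sample, using one random feature."""
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        feature = rng.randrange(len(rows[0]))           # random questionnaire answer
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx], feature))
    return forest


def predict_forest(forest, row):
    """Majority vote over all recommenders."""
    return majority(predict_stump(stump, row) for stump in forest)


# Made-up songs: (tempo in BPM, release year) and whether the listener liked them.
songs = [(170, 2015), (165, 2010), (160, 2018), (175, 2012),
         (90, 1975), (85, 1980), (100, 1970), (95, 1985)]
liked = ["like"] * 4 + ["skip"] * 4

forest = fit_forest(songs, liked, n_trees=25, rng=random.Random(0))
print(predict_forest(forest, (180, 2020)))
print(predict_forest(forest, (80, 1965)))
```

An individual stump can be badly wrong if its bootstrap sample happens to be unrepresentative, but the vote across 25 of them is much harder to throw off.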
Advantages of Random Forests:
- No need for feature normalization
- Parallelizable: individual Decision Trees can be trained in parallel
- Widely used
- Reduces overfitting compared with a single Decision Tree
Disadvantages of Random Forests:
- Not easily interpretable
- Not a state-of-the-art method