In our new series, Machine Learning Algorithms Explained, our goal is to give you a good sense of how the algorithms behind machine learning work, as well as the strengths and weaknesses of different methods. Each post in this series will briefly explain a different algorithm.
But First, What Is Machine Learning?
Machine Learning is about building programs with adaptable parameters (typically an array of floating point values) that are adjusted automatically based on the data they receive. This allows them to improve their behavior by independently adapting to the previously seen data.
Machine Learning is a subfield of Artificial Intelligence: The algorithms that make up Machine Learning are like building blocks that make computers learn to behave more intelligently. They do this by generalizing data, meaning that they are able to accurately perform functions on previously unseen datasets, whereas a non-intelligent database system just stores and retrieves data items.
Machine learning algorithms fall into two categories--supervised and unsupervised. Supervised algorithms use training data that is comprised of input vectors as well as their corresponding target vectors, or expected output. Unsupervised algorithms use training data that only consist of input vectors; the algorithm creates groups and subgroups within the data, a process known as clustering.
A real-life example would be a child learning about different types of fruits. You can tell the child, “This is an apple, this is a cherry, and this is a banana,” which would be the equivalent of supervised learning. On the other hand, you can give a child the three fruits, but not give them names. The child will then classify them – banana is the yellow fruit, and cherries and apples are red. To distinguish between the fruits classified as “red” the child will use more data: cherries are red and small, and apples are red and big. This is the equivalent of unsupervised learning.
Decision Trees (DT)
A Decision Tree is a supervised predictive model that can learn to predict discrete or continuous outputs by answering a set of simple questions based on the values of the input features it receives.
To get a better understanding of how DT works, we will use a real-world dataset to better illustrate the concept.
This dataset contains four measurements of three different iris flowers. The measurements are: the sepal length, sepal width, petal length, and petal width. The three types of iris are Setosa, Versicolour, and Virginica, shown below in that order.
Our task is to create a Decision Tree model that can predict iris type based on these four measurements. Let’s begin.
Below is a random sample of 10 flowers from the dataset.
Each of the four measurements is a two-digit number in centimeters, and the target value represents the type of iris by assigning a 0 (Setosa), 1 (Versicolour), or 2 (Virginica).
Decision Trees Overview
Decision Trees use different attributes to repeatedly split the data it receives into subsets, and repeats the process until the subsets are pure. Pure means that they all share the same target value. In the iris dataset, the DT may choose to split the data into two by asking which flowers have a petal length less than 2.45 cm:
Data that has petal length < 2.45 cm
Data that has petal length > 2.45 cm
In this case, the first dataset, representing all flowers with a petal length less than 2.45 cm or less, is pure: all of the target values in the subset are 0. The other subset is not pure, however, as it contains data points with target values of both 2 and 1. To further split the data, the DT can ask which flowers have a sepal length of less than 6 cm. This results in splitting the data into two pure subsets:
Data that has sepal length < 6 cm
Data that has sepal length > 6 cm
Thus, in order to predict the iris type of a new sample, we can follow the splits and evaluate each decision using our new sample data.
For example, if we had the following unknown iris measurements:
We would classify the flower as Versicolor (target value 1), as the petal length is more than 2.45 cm and the sepal length is more than 6 cm.
Decision Trees Algorithm
When choosing attributes to use as criteria for splits, it is possible to choose any attribute. How does the DT algorithm decide which algorithm and condition to split on?
To put it simply, the DT algorithm chooses the attribute and condition based on how pure the resulting subsets will be. The pureness of a subset is determined by the certainty of the target values within a subset. For example, in our iris dataset, we split the data using the criteria of sepal length < 6 cm (condition 1) and came up with the following distribution of target values in the subsets.
- First subset
- Three examples: all with target value 1
- Second subset
- Four examples: all with target value 2
In this case, we are completely certain about which target values can be found in each subset: the first subset contains 100% target value 1, and the second subset contains 100% target value 2.
Now, let’s image that we have split the same data using an imaginary condition (condition 2) that resulted in the following distribution:
- First subset
- Five examples: Three with target value 1 and two with target value 2
- Second subset
- Two examples: One with target value 1 and one with target value 2
In the first subset, we do not have complete certainty about the target values. There are three examples with target value 1 and two with target value 2. In the second subset, we have complete uncertainty: the chance of the target value being either 1 or 2 is exactly 50%.
In this case, the algorithm will prefer the first split attribute and condition as it allows for more certainty of the target values.
Quantifying Pureness of a Subset
To measure the pureness or uncertainty of a subset, we need to provide a quantitative measure so that the DT algorithm can be objective when choosing the best attribute and condition to split on. There are different ways to measure the uncertainty in a set of values, but for the purposes of this example, we will use Entropy (represented by “H”).
Where X is the resulting split, n is the number of different target values in the subset, and pi is the proportion of the ith target value in the subset.
For example, the entropy for conditions 1 and 2 will be the following. The log is base 2.
- H(Condition1 first subset) = - ( 1 * log(1) ) = 0 Pure subset
- H(Condition1 second subset) = - ( 1 * log(1) ) = 0 Pure subset
- H(Condition2 first subset) = - ( 3/5 * log(3/5) + 2/5 * log(2/5) ) = 0.97 Impure subset
- H(Condition2 second subset) = - ( 1/2 * log(1/2) + 1/2 * log(1/2) ) = 1 Impure subset
Quantifying the Purity of All Subsets
The final step is to determine how to measure the impurity of the resulting subsets once an attribute and a condition have been chosen. In order to do this, we need a way to aggregate the impurity of all the subsets into one measure. This is done using the Information Gain Measure, which measures the expected decrease in impurity after a split. This formula for information gain is:
S represents the subset before any splitting, and D represents the possible outcomes of splitting using a given attribute and condition. V assumes that all the values D can be measured one by one. The parallel lines on V and S denote the size of the set.
In layman’s terms: The gain calculates the difference in entropy before and after the split.
Advantages of decision trees:
- Can be used for regression or classification
- Can be displayed graphically
- Highly interpretable
- Can be specified as a series of rules, and more closely approximate human decision-making than other models
- Prediction is fast
- Features don't need scaling
- Automatically learns feature interactions
- Tends to ignore irrelevant features
- Non-parametric (will outperform linear models if relationship between features and response is highly non-linear)
Disadvantages of decision trees:
- Performance is (generally) not competitive with the best supervised learning methods
- Can easily overfit the training data (tuning is required)
- Small variations in the data can result in a completely different tree (high variance)
- Recursive binary splitting makes "locally optimal" decisions that may not result in a globally optimal tree
- Doesn't tend to work well if the classes are highly unbalanced
- Doesn't tend to work well with very small datasets