In our series **Machine Learning Algorithms Explained**, our goal is to give you a good sense of how the algorithms behind machine learning work, as well as the strengths and weaknesses of different methods. Each post in this series briefly explains a different algorithm – today, we’re going to talk about Naive Bayes Classifiers.

A Naive Bayes Classifier is a supervised machine-learning algorithm that applies Bayes’ Theorem under the *naive* assumption that the input features are statistically independent of each other, i.e. knowing the value of one feature tells you nothing about the values of the others. Despite this simplifying assumption, Naive Bayes has proven itself to be a classifier that produces good results in practice.

**What Is Bayes’ Theorem?**

Naive Bayes Classifiers rely on Bayes’ Theorem, which is based on conditional probability: in simple terms, the likelihood that an event (A) will happen *given that* another event (B) has already happened. Essentially, the theorem allows a hypothesis to be updated each time new evidence is introduced. The equation below expresses Bayes’ Theorem in the language of probability:

P(A | B) = ( P(B | A) * P(A) ) / P(B)

Let’s explain what each of these terms means.

- “P” is the symbol to denote probability.
- P(A | B) = The probability of event A (hypothesis) occurring given that B (evidence) has occurred.
- P(B | A) = The probability of the event B (evidence) occurring given that A (hypothesis) has occurred.
- P(A) = The probability of event A (hypothesis) occurring.
- P(B) = The probability of event B (evidence) occurring.

**Drug Testing**

To better illustrate the usefulness of Bayes’ Theorem, let’s walk through an example. Suppose that a drug-testing method has the following statistical results:

- 0.5 percent of the population uses drugs.
- The test is positive for drug users 99 percent of the time (and therefore, it misses 1 percent of drug users).
- The test is negative for a non-drug user 99 percent of the time (and positive for a non-drug user 1 percent of the time).

| | Drug User (0.5%) | Non-Drug User (99.5%) |
| --- | --- | --- |
| Positive result | 99% | 1% |
| Negative result | 1% | 99% |

We would like to know the odds that a randomly selected individual who had a positive test result is actually a drug user, i.e. that the results are a *true positive*. In the language of probability, that looks something like this:

- P( User | Positive ) = The probability that an individual that had a positive test result is actually a drug user.

Following Bayes’ Theorem, we know that the probability is calculated as follows:

P( User | Positive ) = ( P( Positive | User ) * P( User) ) / P(Positive)

Now, let’s calculate each of the terms on the right side of the equation.

- P( Positive | User ) = 0.99. This is the chance of a *true positive* test result if you are a drug user.
- P( User ) = 0.005. The percentage of the population that are drug users.
- P( Positive ) = P( Positive | User ) * P( User ) + P( Positive | Non-user ) * P( Non-user ) = .99 * .005 + .01 * .995 = 0.0149. This is the breakdown of the two ways a positive result can occur: a user testing positive (*true positive*) and a non-user testing positive (*false positive*).

Applying these figures to the original formula gives you:

P( User | Positive ) = (.99 * .005 ) / .0149 ≈ 33%

What a surprising result! At first, the test seems highly accurate. However, the calculation shows that even after testing positive, a person is in fact far more likely not to be a drug user.
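The calculation above can be reproduced in a few lines of Python; here is a quick sketch using the figures from the example:

```python
# Drug-test figures from the example above.
p_user = 0.005              # P(User): prior, 0.5% of the population
p_pos_given_user = 0.99     # P(Positive | User): true-positive rate
p_pos_given_nonuser = 0.01  # P(Positive | Non-user): false-positive rate

# Total probability of a positive test, P(Positive).
p_pos = p_pos_given_user * p_user + p_pos_given_nonuser * (1 - p_user)

# Bayes' Theorem: P(User | Positive).
p_user_given_pos = p_pos_given_user * p_user / p_pos
print(round(p_user_given_pos, 3))  # → 0.332
```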

**Naive Bayes Classifier**

Given a vector **x** of features, Naive Bayes calculates the probability that the vector belongs to each class.

P( C_{k} | x_{1}, x_{2}, … x_{n} )

Thanks to Bayes’ Theorem, we know that it is equal to:

P( C_{k} | **x** ) = ( P(**x** | C_{k}) * P(C_{k}) ) / P(**x**)

We know P(C_{k}) because of the class distribution in our data.

The numerator P(**x** | C_{k}) * P(C_{k}) is equivalent to the joint probability P(C_{k}, x_{1}, x_{2}, …, x_{n}). By the chain rule of probability we can expand it to:

P(C_{k}, x_{1}, x_{2}, …, x_{n}) = P(x_{1} | x_{2}, …, x_{n}, C_{k}) * P(x_{2}, …, x_{n}, C_{k})

= P(x_{1} | x_{2}, …, x_{n}, C_{k}) * P(x_{2} | x_{3}, …, x_{n}, C_{k}) * P(x_{3}, …, x_{n}, C_{k})

= …

= P(x_{1} | x_{2}, …, x_{n}, C_{k}) * P(x_{2} | x_{3}, …, x_{n}, C_{k}) * … * P(x_{n-1} | x_{n}, C_{k}) * P(x_{n} | C_{k}) * P(C_{k})

We now make a strong assumption about these conditional probabilities. We assume that they are conditionally independent. In other words, knowing that { x_{i+1}, …, x_{n} } occurred doesn’t affect the probability of x_{i} occurring. Put more formally:

P(x_{i} | x_{i+1}, …, x_{n}, C_{k}) = P(x_{i} | C_{k})

This means that:

P(C_{k}, x_{1}, x_{2}, …, x_{n}) = P(x_{1} | C_{k}) * P(x_{2} | C_{k}) * … * P(x_{n} | C_{k}) * P(C_{k})

To calculate each of the conditional class probabilities—P(x_{i} | C_{k} )—we use a likelihood function to model the probability distribution. One of the most common is the Gaussian or normal distribution.

We need not calculate P(**x**) since it’s constant for each input. But you can calculate it in terms of the conditional class probabilities as illustrated below:

P(**x**) = Σ_{k} P(C_{k}) * P(x_{1} | C_{k}) * P(x_{2} | C_{k}) * … * P(x_{n} | C_{k})

To make a prediction, we choose the class that has the highest score.
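Putting the pieces together, a minimal Gaussian Naive Bayes prediction routine might look like this (a sketch, assuming the priors and the per-class feature means and standard deviations have already been estimated from training data):

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of the normal distribution N(x | mean, std)."""
    coeff = 1.0 / (std * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mean) ** 2) / (2.0 * std ** 2))

def predict(x, priors, means, stds):
    """Return the class k maximizing P(C_k) * prod_i P(x_i | C_k).

    priors, means, stds map each class label to its prior probability
    and the per-feature Gaussian parameters for that class.
    """
    scores = {}
    for k in priors:
        score = priors[k]
        for xi, mu, sigma in zip(x, means[k], stds[k]):
            score *= gaussian_pdf(xi, mu, sigma)
        scores[k] = score
    return max(scores, key=scores.get)
```

Because P(**x**) is omitted, the scores are not probabilities, but their ranking, and therefore the prediction, is unchanged. In practice the product of many small densities can underflow, so real implementations usually sum log-probabilities instead.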

**Naive Bayes Examples**

We are going to use the Iris dataset from the previous posts in this series to illustrate how Naive Bayes works. Suppose we measure an iris and find the following measurements:

- Sepal length = 7 cm
- Sepal width = 3 cm
- Petal length = 5 cm
- Petal width = 2 cm

From our data we know that each class—Iris versicolor, Iris virginica, and Iris setosa—represents one-third of the data. Following Naive Bayes, we need to calculate the following conditional probabilities and categorize our measured flower with the class that has the highest probability. To do so, we are going to use the Gaussian measure of likelihood:

- P( Setosa | 7,3,5,2)
- P( Versicolor | 7,3,5,2)
- P( Virginica | 7,3,5,2)

Let’s do one calculation with Versicolor:

- P( Versicolor | 7,3,5,2) = (P( 7,3,5,2 | Versicolor ) * P( Versicolor ) )/ P( 7,3,5,2)

Assuming independence and using the Gaussian distribution of conditional class probabilities, we can calculate the following:

- P( 7,3,5,2 | Versicolor ) = P(7 | Versicolor ) * P(3 | Versicolor ) * P(5 | Versicolor ) * P(2 | Versicolor )

To calculate each of the conditional class probabilities, we find the average and standard deviation of each feature over the Versicolor training samples. The average and standard deviation are calculated as follows.

Average = ( x_{1} + x_{2} + … + x_{n} ) / n

Standard deviation = sqrt( Σ_{i} ( x_{i} − average )² / ( n − 1 ) )

| | Sepal Length | Sepal Width | Petal Length | Petal Width |
| --- | --- | --- | --- | --- |
| Average | 5.936 | 2.77 | 4.26 | 1.32 |
| Standard Deviation | 0.51 | 0.31 | 0.46 | 0.19 |
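As a sketch, the two statistics can be computed per feature with a small helper (here using the n − 1 sample formula for the standard deviation; dividing by n instead is also common):

```python
import math

def mean_and_std(values):
    """Sample mean and standard deviation (n - 1 denominator)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(variance)

# Hypothetical usage: `versicolor_sepal_lengths` would hold the 50
# sepal-length measurements of the Versicolor training samples.
# avg, std = mean_and_std(versicolor_sepal_lengths)
```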

Plugging those values into a Gaussian distribution, we can calculate the following (N stands for the normal density):

P( 7,3,5,2 | Versicolor ) = P(7 | Versicolor ) * P(3 | Versicolor ) * P(5 | Versicolor ) * P(2 | Versicolor )

P( 7,3,5,2 | Versicolor ) = N(7 | 5.936, 0.51) * N(3 | 2.77, 0.31) * N(5 | 4.26, 0.46) * N(2 | 1.32, 0.19)

P( 7,3,5,2 | Versicolor ) = 0.089 * 0.97 * 0.24 * 0.0035

P( 7,3,5,2 | Versicolor ) ≈ 0.000072

P( Versicolor ) = 50/150 ≈ 0.33

At last we can calculate the score for Versicolor: P( 7,3,5,2 | Versicolor ) * P( Versicolor ) = 0.000072 * 0.33 ≈ 0.000024
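This likelihood calculation can be checked with a few lines of Python (a sketch; the means and standard deviations are the Versicolor values from the table above):

```python
import math

def normal_density(x, mean, std):
    """Gaussian probability density N(x | mean, std)."""
    coeff = 1.0 / (std * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mean) ** 2) / (2.0 * std ** 2))

# Versicolor statistics from the table: sepal length, sepal width,
# petal length, petal width.
means = [5.936, 2.77, 4.26, 1.32]
stds = [0.51, 0.31, 0.46, 0.19]
flower = [7.0, 3.0, 5.0, 2.0]  # the measured iris

# Naive independence: multiply the per-feature densities.
likelihood = 1.0
for x, mu, sigma in zip(flower, means, stds):
    likelihood *= normal_density(x, mu, sigma)

# Multiply by the class prior P(Versicolor) to get the score.
score = likelihood * (50 / 150)
```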

From here, we would repeat the same process for Setosa and Virginica to determine which class the original flower most likely belongs to.
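In practice you would rarely do these calculations by hand. As a sketch (assuming scikit-learn is installed), its GaussianNB estimator performs the same per-class mean and standard-deviation fitting and prediction:

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

iris = load_iris()
model = GaussianNB().fit(iris.data, iris.target)

# The measured flower: sepal length, sepal width,
# petal length, petal width (in cm).
flower = [[7.0, 3.0, 5.0, 2.0]]
prediction = iris.target_names[model.predict(flower)[0]]
print(prediction)
```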

**Summary**

Advantages of Naive Bayes Classifiers

- Simple model
- Fast
- Scalable
- Requires little data

Disadvantages of Naive Bayes Classifiers

- Assumes feature independence
- Must choose the likelihood function

This post is part of the series *Machine Learning Algorithms Explained*. Click here to take a look at the other posts in the series.