Naive Bayes in Machine Learning
Table of contents:
- Assumptions of Naive Bayes
- Bayes Theorem
- Example of Naive Bayes
- Problem with Naive Bayes
- Laplace or Additive Smoothing
- Bias Variance Tradeoff in Naive Bayes using hyper-parameter α
- Example of Laplace smoothing
- Applications of Naive Bayes
- Pros of using Naive Bayes
- Cons of using Naive Bayes
- How to build a basic model using Naive Bayes in Python
- Tips to improve the power of the Naive Bayes Model
Naive Bayes is a supervised machine learning classifier based on probabilities. It is called "naive" because it assumes that the input features are conditionally independent given the class, i.e. once the class is known, the value of one feature tells us nothing about the values of the other features.
The Naive Bayes model is simple, yet powerful and easy to implement. One of the advantages of this model is that it does not involve complex mathematical computations and thus, it can be used with a large amount of data.
Naive Bayes is a family of powerful and easy-to-train classifiers that determine the probability of an outcome, given a set of conditions, using Bayes' theorem. In other words, the conditional probabilities are inverted so that the query can be expressed as a function of measurable quantities.
2. Assumptions of Naive Bayes:
The fundamental Naive Bayes assumption is that each feature makes an independent and equal contribution to the outcome. This means that for the given outcome variable (y) we assume that the given input features (X) are independent of each other and they possess equal weightage.
The assumptions made by Naive Bayes do not generally hold in real-world situations. In fact, the independence assumption is almost never correct, yet the classifier often works well in practice.
3. Bayes Theorem:
The Naive Bayes classifier is based on Bayes' theorem with the above-mentioned assumptions. It assumes that the effect of the value of a feature/predictor (x) on a given outcome variable/class (c) is independent of the values of the other features. This assumption is called conditional independence. Bayes' theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x), and P(x|c):
$$P(c \mid x) = \frac{P(x \mid c)\, P(c)}{P(x)}$$
In this formula:
- P(c|x) is the posterior probability of class (outcome) given predictor (feature).
- P(c) is the prior probability of class.
- P(x|c) is the likelihood which is the probability of the predictor given class.
- P(x) is the prior probability of the predictor (also called the evidence).
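Now, for a dataset with inputs (X, y), we can apply Bayes' theorem in the following way:
$$P(y \mid X) = \frac{P(X \mid y)\, P(y)}{P(X)}$$
where y is the outcome variable and X is the input feature vector X = (x1, x2, …, xn) of size n. Using the conditional independence assumption, we can also write this in terms of the component features of X:
$$P(y \mid x_1, \dots, x_n) = \frac{P(y) \prod_{i=1}^{n} P(x_i \mid y)}{P(x_1, \dots, x_n)}$$
Note that for a given input, the denominator is the same for all outcomes y, so we can ignore it and write:
$$P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)$$
The predicted class is the value of y that maximizes this expression.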
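To see why the denominator can be dropped, here is a minimal sketch with made-up numbers (the priors and likelihoods below are hypothetical and do not come from any dataset). It computes the unnormalized score P(y) · ∏ P(xi | y) for two classes and checks that dividing by P(X) rescales the scores without changing which class wins.

```python
import math

# Hypothetical priors P(y) and per-feature likelihoods P(x_i | y) for one input.
priors = {"class_a": 0.6, "class_b": 0.4}
likelihoods = {
    "class_a": [0.2, 0.5, 0.1],
    "class_b": [0.3, 0.4, 0.6],
}

# Unnormalized score: P(y) * product of P(x_i | y).
scores = {y: priors[y] * math.prod(likelihoods[y]) for y in priors}

# P(X) is the same for every class: it equals the sum of the unnormalized scores.
evidence = sum(scores.values())
posteriors = {y: s / evidence for y, s in scores.items()}

best_unnormalized = max(scores, key=scores.get)
best_posterior = max(posteriors, key=posteriors.get)

print(scores)
print(posteriors)
print(best_unnormalized == best_posterior)
```

The normalized posteriors are only needed when actual probabilities are required; for picking a class, the unnormalized scores are enough.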
4. Example of Naive Bayes:
To decide whether a given email is spam or not, let's consider a small dataset of 12 emails, 8 non-spam and 4 spam, containing the words below:
In the above image, we calculated the likelihood of seeing each word given the class, P(word | Non-spam) and P(word | spam), together with the class probabilities P(spam) and P(Non-spam).
For a new email containing 'Dear Friend', we want to determine whether it is spam or non-spam. Using the Naive Bayes formula, we compute a score for each class as P(class) × P(Dear | class) × P(Friend | class) and pick the class with the higher score.
From the above calculation, it's clear that the email with the phrase 'Dear Friend' is identified as Non-spam, because its Non-spam score (0.09) is greater than its spam score (0.01).
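The arithmetic can be reproduced in a few lines of Python. The per-word counts below are assumptions chosen for illustration, since the original table is not reproduced in this text. They are consistent with the 8/4 class split and with the 0.09 and 0.01 scores quoted above, but they should not be read as the article's actual data.

```python
# Assumed word counts per class (hypothetical; picked to match the quoted scores).
word_counts = {
    "non_spam": {"Dear": 8, "Friend": 5, "Lunch": 3, "Money": 1},
    "spam": {"Dear": 2, "Friend": 1, "Lunch": 0, "Money": 4},
}
email_counts = {"non_spam": 8, "spam": 4}  # from the example: 12 emails in total


def score(words, label):
    """Return P(label) * product of P(word | label), without normalizing."""
    total_emails = sum(email_counts.values())
    total_words = sum(word_counts[label].values())
    result = email_counts[label] / total_emails  # class prior
    for word in words:
        result *= word_counts[label][word] / total_words  # word likelihood
    return result


message = ["Dear", "Friend"]
scores = {label: score(message, label) for label in word_counts}
prediction = max(scores, key=scores.get)

print({label: round(s, 2) for label, s in scores.items()})
print(prediction)
```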
Naive Bayes ignores all grammar rules and word order, because keeping track of every single reasonable phrase in a language would be impossible. It treats language as if it were just a bag full of words.
In machine learning terms, we'd say that by ignoring relationships among words, Naive Bayes has high bias. But because it works well in practice and its predictions change little when the training dataset changes, Naive Bayes has low variance.
5. Problem with Naive Bayes:
The problem with the Naive Bayes classifier is that if a word never appears in the Spam class in the training data, its likelihood P(word | spam) is 0. Any email containing that word then gets a spam score of 0 and is always classified as normal (Non-spam), even if the email is spam. For example, text containing the word 'Lunch' will always be classified as Non-spam because the word 'Lunch' is not present in any spam email. This should not be the case, and the solution is to use Laplace (additive) smoothing.
The idea is to add 1 (or another positive number) to every word count, so that no likelihood is 0 and the product of probabilities never collapses to 0.
6. Laplace or Additive Smoothing:
At the end of the training, all the likelihoods and priors are computed.
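At test time, say we encounter a word W' whose likelihood for some class is 0 because it never occurred with that class in the training data. With Laplace smoothing, we add a smoothing value to the numerator and denominator of the likelihood of that word:
$$P(W' \mid y) = \frac{\text{count}(W', y) + \alpha}{N_y + \alpha K}$$
where count(W', y) is the number of times W' occurs in class y, N_y is the total number of words in class y, K is the number of distinct words (features), and α is the smoothing parameter.
Laplace smoothing is applied to all words in the training data and also to new words that occur in the test data.
This is additive smoothing in general. The most common choice is α = 1, which is called Laplace (add-one) smoothing. In simple language, the Laplace smoothing formula is: P(word | class) = (number of times the word appears in the class + 1) / (total number of words in the class + number of distinct words).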
7. Bias Variance Tradeoff in Naive Bayes using hyper-parameter α:
In Naive Bayes, the hyper-parameter α in Laplace smoothing controls the bias-variance tradeoff.
Bias is the inability of a machine learning method to capture the true relationship. Example: fitting a line where a curve is needed, which causes underfitting.
Variance is the difference in fits between data sets (train vs. test performance). High variance causes overfitting.
Case 1: When there is no Laplace smoothing (α = 0). Consider a word that occurs only 2 times out of 2000 total words. The model assigns a probability to this rare word based on just those 2 occurrences, so it follows the training data very closely and, in a way, overfits.
A small change in the training data (for example, removing those 2 occurrences) results in a large relative change in the estimated probability (from 2/2000 to 0). This is high variance, and it leads to overfitting.
Case 2: When α is very large, the added α dominates the actual counts, every word gets nearly the same likelihood in every class, and the model can no longer distinguish between classes. This results in underfitting, i.e. high bias.
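The two extremes can be checked numerically. This sketch assumes the usual additive-smoothing estimate (count + α) / (total + α · K), where K is the number of distinct words. The counts and K = 4 are hypothetical values chosen only to make the trend visible.

```python
def smoothed_likelihood(count, total, alpha, vocab_size):
    """Additive-smoothing estimate of P(word | class)."""
    return (count + alpha) / (total + alpha * vocab_size)


total_words = 2000  # words seen in the class (hypothetical)
vocab_size = 4      # number of distinct words K (hypothetical)

for alpha in (0, 1, 100, 10_000, 1_000_000):
    rare = smoothed_likelihood(2, total_words, alpha, vocab_size)
    common = smoothed_likelihood(1500, total_words, alpha, vocab_size)
    print(f"alpha={alpha:>9}: P(rare)={rare:.4f}  P(common)={common:.4f}")
```

As α grows, both estimates drift toward 1/K, which is the sense in which a very large α erases the information in the counts.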
The best α can be identified with simple (hold-out) cross-validation or k-fold cross-validation.
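As a rough sketch of this search with scikit-learn, the snippet below runs 5-fold cross-validation over a handful of α values. The dataset (scikit-learn's built-in digits), the candidate grid, and the number of folds are all arbitrary choices made for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Small built-in dataset with non-negative integer features (pixel intensities).
X, y = load_digits(return_X_y=True)

# Candidate smoothing values; this grid is an arbitrary choice for illustration.
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}

# 5-fold cross-validated accuracy for each candidate alpha.
search = GridSearchCV(MultinomialNB(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

best_alpha = search.best_params_["alpha"]
print(best_alpha, round(search.best_score_, 3))
```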
8. Example of Laplace smoothing:
After adding 1 (α = 1) to each word count, we get the probabilities below for each observed word, none of which is 0:
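A short sketch of this computation follows. The word counts are hypothetical (the original table is not reproduced in this text), so the printed values are illustrative only. With α = 1, the word 'Lunch', which has a count of 0 in spam here, receives a small non-zero likelihood.

```python
ALPHA = 1  # Laplace (add-one) smoothing

# Assumed word counts per class (hypothetical, for illustration only).
word_counts = {
    "non_spam": {"Dear": 8, "Friend": 5, "Lunch": 3, "Money": 1},
    "spam": {"Dear": 2, "Friend": 1, "Lunch": 0, "Money": 4},
}


def smoothed_probabilities(counts, alpha):
    """Return P(word | class) with alpha added to every word count."""
    total = sum(counts.values()) + alpha * len(counts)
    return {word: (n + alpha) / total for word, n in counts.items()}


smoothed = {
    label: smoothed_probabilities(counts, ALPHA)
    for label, counts in word_counts.items()
}

for label, probs in smoothed.items():
    print(label, {word: round(p, 3) for word, p in probs.items()})
```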
9. Applications of Naive Bayes:
- Real-time prediction: Naive Bayes is an eager learning classifier and is quite fast in its execution. Thus, it could be used for making predictions in real-time.
- Multi-class prediction: The Naive Bayes algorithm is also well-known for multi-class prediction, or classifying instances into one of several classes.
- Text classification, spam filtering, and sentiment analysis: When used to classify text, Naive Bayes classifiers often perform competitively with more complex algorithms, because they cope well with multi-class problems despite the independence assumption. As a result, Naive Bayes is widely used in spam filtering (identifying spam email) and sentiment analysis (e.g. in social media, to identify positive and negative customer sentiments).
- Recommendation Systems: A Naive Bayes Classifier can be used together with Collaborative Filtering to build a Recommendation System that could filter through new information and predict whether a user would like a given resource or not.
10. Pros of using Naive Bayes:
- It is easy and fast to predict the class of a test data set. It also performs well in multi-class prediction.
- When the assumption of independence holds, a Naive Bayes classifier can perform better than other models such as logistic regression, and it needs less training data.
- It performs well in the case of categorical input variables compared to the numerical variables. For numerical variables, the normal distribution is assumed (bell curve, which is a strong assumption).
11. Cons of using Naive Bayes:
- If a categorical variable has a category in the test data set that was not observed in the training data set, the model will assign it a probability of 0 and will be unable to make a meaningful prediction. This is often known as "Zero Frequency". To solve this, we can use a smoothing technique such as Laplace estimation.
- On the other hand, Naive Bayes is known to be a poor probability estimator, so the probability outputs from predict_proba should not be taken too seriously, even when the predicted class is correct.
- Another limitation of Naive Bayes is the assumption of independent predictors. In real life, it is almost impossible that we get a set of predictors which are completely independent.
12. How to build a basic model using Naive Bayes in Python:
Using the Python library scikit-learn, we can build a Naive Bayes model. Three commonly used types of Naive Bayes model in scikit-learn are:
- Gaussian: It is used in classification and assumes that the features follow a normal distribution within each class. With the iris dataset (whose continuous features are reasonably close to Gaussian), a GaussianNB model is as follows:
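The original listing is not reproduced in this text, so the following is a minimal sketch of what such a model could look like. The 70/30 split and the random seed are arbitrary choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# Hold out 30% of the samples for testing; split size and seed are arbitrary.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# GaussianNB estimates a per-class mean and variance for every feature.
model = GaussianNB()
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"GaussianNB test accuracy: {accuracy:.3f}")
```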
- Multinomial: The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. Below is an example of Multinomial Naive Bayes with scikit-learn's handwritten digits dataset (a small MNIST-like dataset). Each sample (belonging to one of 10 classes) is an 8×8 image whose pixels are encoded as integers from 0 to 16:
If we compare Gaussian and Multinomial NB on this dataset, we get a higher accuracy score from the MultinomialNB model.
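A sketch of this comparison, assuming scikit-learn's built-in digits dataset and 10-fold cross-validation (the fold count is an arbitrary choice), could look like this. Running it lets you check the accuracy gap on your own installation rather than relying on a quoted figure.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB, MultinomialNB

# 8x8 digit images flattened to 64 non-negative integer features.
X, y = load_digits(return_X_y=True)

# 10-fold cross-validated accuracy for each model (fold count is arbitrary).
gaussian_accuracy = cross_val_score(GaussianNB(), X, y, cv=10).mean()
multinomial_accuracy = cross_val_score(MultinomialNB(), X, y, cv=10).mean()

print(f"GaussianNB:    {gaussian_accuracy:.3f}")
print(f"MultinomialNB: {multinomial_accuracy:.3f}")
```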
- Bernoulli: If X is a Bernoulli-distributed random variable, it can assume only two values (for example, 0 and 1), with probabilities:
$$P(X) = \begin{cases} p & \text{if } X = 1 \\ q & \text{if } X = 0 \end{cases} \qquad \text{where } q = 1 - p \text{ and } 0 < p < 1$$
Like MultinomialNB, the BernoulliNB classifier is suitable for discrete data. The difference is that while MultinomialNB works with occurrence counts, BernoulliNB is designed for binary/boolean features.
Let's look at an example on dummy two-class data generated with scikit-learn's make_classification:
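The original listing is not reproduced in this text, so this is a minimal sketch under assumed settings: the sample count, feature count, split, and seeds are arbitrary. Because make_classification produces continuous features, the `binarize` threshold is used to turn them into the 0/1 features that BernoulliNB expects.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

# Two-class dummy data with two continuous features (all settings are arbitrary).
X, y = make_classification(
    n_samples=300, n_features=2, n_informative=2, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# binarize=0.0 maps each feature to 1 if it is greater than 0, else 0.
model = BernoulliNB(binarize=0.0)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"BernoulliNB test accuracy: {accuracy:.3f}")
```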
13. Tips to improve the power of the Naive Bayes Model:
- If continuous features do not have a normal distribution, we should use a log transformation or another method to bring them closer to a normal distribution.
- Remove correlated features, as highly correlated features are effectively counted twice in the model, which can overinflate their importance.
- Naive Bayes classifiers have limited options for parameter tuning, such as the smoothing parameter α (alpha=1 by default in scikit-learn) and fit_prior=[True|False] to learn class prior probabilities or not, plus a few other options. It is therefore recommended to focus on data pre-processing and feature selection.
- You might think of applying classifier combination techniques like ensembling, bagging, and boosting, but these methods are unlikely to help much. They mainly aim to reduce variance, and Naive Bayes already has very little variance to reduce.
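The first tip above (transforming skewed continuous features before fitting a Gaussian model) can be sketched as follows. The data here is synthetic log-normal data generated only for illustration, and `np.log1p` is one possible transform among several. Whether the transform actually helps depends on the data, so it is worth comparing both scores.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

rng = np.random.default_rng(0)

# Synthetic, right-skewed (log-normal) features for two classes; purely illustrative.
n = 500
X = np.vstack([
    rng.lognormal(mean=0.0, sigma=1.0, size=(n, 2)),
    rng.lognormal(mean=1.0, sigma=1.0, size=(n, 2)),
])
y = np.array([0] * n + [1] * n)

raw_model = GaussianNB()
# log1p computes log(1 + x); safe here because all features are non-negative.
log_model = make_pipeline(FunctionTransformer(np.log1p), GaussianNB())

# 5-fold cross-validated accuracy with and without the transform.
raw_accuracy = cross_val_score(raw_model, X, y, cv=5).mean()
log_accuracy = cross_val_score(log_model, X, y, cv=5).mean()

print(f"raw features:    {raw_accuracy:.3f}")
print(f"log-transformed: {log_accuracy:.3f}")
```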