Naive Bayes for Text Analysis


Table of Contents:

i. Introduction

ii. Key points

iii. Bayes Theorem

iv. Advantages

v. Disadvantages

vi. Practical approach

vii. Conclusion

viii. Source code

ix. References

Introduction:

In machine learning there are mainly three types of learning algorithms:

a) Supervised learning

b) Unsupervised learning

c) Reinforcement learning

In this article we will go through the Naïve Bayes algorithm, a supervised learning method.

Key points of the algorithm:

> Naïve Bayes is a supervised learning algorithm.

> It is based on Bayes’ theorem and is used to solve classification problems.

> It predicts results on the basis of the probability of an object.

> It works well even with small, high-dimensional training sets.

> It is mainly used for text classification.

> It is easy and fast to predict the class of a test data set using Naïve Bayes.

> It performs well in multi-class prediction.

> It is one of the simplest classification algorithms and helps in building machine learning models that make quick predictions.

> Examples: spam filtering, text classification, sentiment analysis and recommender systems.

Naïve Bayes:

* Naïve: the occurrence of one feature is assumed to be independent of the occurrence of every other feature, i.e. the features are not correlated with each other. Each feature contributes individually to the prediction, so the joint likelihood factorizes into a product of per-feature likelihoods.

* Bayes: it applies the principle of Bayes’ theorem.

Bayes’ Theorem:

P(a|x) = P(x|a) · P(a) / P(x)

Where,

> P(a|x) = posterior probability of class a given predictor x, i.e. the probability of a after the event x has been observed.

> P(a) = prior probability of class a (its probability before the event x occurs).

> P(x|a) = likelihood, i.e. the probability of the event x given that the hypothesis a is true.

> P(x) = prior probability of the event x (the evidence).

For a detailed understanding of the theorem we will take the popular “Weather condition” data set with the corresponding target “Play”. Using this data set we need to decide whether we can play on a particular day based on that day’s weather condition.

Table Representation:

Weather   | Play = Yes | Play = No | Total
Rainy     |     2      |     2     |   4
Sunny     |     3      |     2     |   5
Overcast  |     5      |     0     |   5
Total     |    10      |     4     |  14

From the table above we can compute the following probabilities.

Probabilities:

P(Rainy) = 4/14 (4 days out of 14) ≈ 0.29

P(Sunny) = 5/14 (5 days out of 14) ≈ 0.36

P(Overcast) = 5/14 (5 days out of 14) ≈ 0.36

P(Rainy|Yes) = 2/10 = 0.2

P(Rainy|No) = 2/4 = 0.5

P(Sunny|Yes) = 3/10 = 0.3

P(Sunny|No) = 2/4 = 0.5

P(Overcast|Yes) = 5/10 = 0.5

P(Overcast|No) = 0/4 = 0

P(Yes) = 10/14 ≈ 0.71

P(No) = 4/14 ≈ 0.29
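Using these numbers, Bayes’ theorem tells us whether to play on, say, a sunny day. A quick check in Python, working with the exact fractions rather than the rounded values:

```python
# Probabilities read off the frequency table above
p_sunny_given_yes = 3 / 10
p_sunny_given_no = 2 / 4
p_yes, p_no = 10 / 14, 4 / 14
p_sunny = 5 / 14

# Bayes' theorem: P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)
p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
p_no_given_sunny = p_sunny_given_no * p_no / p_sunny

print(p_yes_given_sunny)  # ≈ 0.6
print(p_no_given_sunny)   # ≈ 0.4
```

Since P(Yes|Sunny) > P(No|Sunny), the classifier predicts that we can play on a sunny day.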

Advantages:

1. It is an easy and fast method to predict the class of a test data set.

2. It performs well in multi-class prediction.

3. When the features really are independent of each other, Naïve Bayes performs better than logistic regression while needing less training data.

4. It is mostly used for text classification, i.e. it performs better on categorical inputs than on numerical ones.

Disadvantages:

1. If a category of a categorical variable is not observed in the training data set, the model assigns it zero probability and cannot make a prediction (the “zero-frequency” problem). This is usually handled with a smoothing technique such as Laplace smoothing.

2. It assumes independent predictors, but in practice it is almost impossible to get a set of completely independent predictors.

3. It cannot learn relationships between features, because it treats the features as independent.

Practical Approach:

Now we will use an Anaconda Jupyter notebook to solve one practical problem. In this case we are using a data set shipped with sklearn, the fetch_20newsgroups data set.

The fetch_20newsgroups data set comprises around 18,000 newsgroup posts on 20 topics. It is split into two parts: a training data set (posts made before a specific date) and a testing data set (posts made after that date). For solving any machine learning problem there are a few steps to follow:

1. Import data

2. Data cleaning

3. Data Wrangling

4. Data visualization

5. Train test split

6. Applying a suitable machine learning algorithm

Step 1: Importing Data

Let’s import the data set and save it.

The sklearn.datasets.fetch_20newsgroups function fetches the data: it downloads the archive from the original 20 Newsgroups website and caches it locally.
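A minimal sketch of this step (assuming scikit-learn is installed; the first call downloads the archive, later calls use the local cache):

```python
from sklearn.datasets import fetch_20newsgroups

# Download (and cache) the training split of the 20 Newsgroups archive
train_data = fetch_20newsgroups(subset='train')
print(len(train_data.data))  # number of training posts
```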

Step 2: Data Cleaning Or Data Wrangling

Our next target is to get more information about our data. We can check which topics are included in the data set, and from that decide which topics to select.

For this we have used the train_data variable to save the data, and then used train_data.target_names to get detailed information about it.

We have selected 5 categories from the above data set.

We have stored the news data in two different data frames, a train and a test data set, with the chosen categories.
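A sketch of this step follows. The article does not name the five categories it chose, so the ones below are illustrative; any five entries from target_names work the same way:

```python
from sklearn.datasets import fetch_20newsgroups

# An illustrative choice of five categories (hypothetical -- substitute your own)
categories = ['comp.graphics', 'rec.sport.baseball', 'sci.med',
              'sci.space', 'talk.politics.misc']

# Fetch only the chosen categories, for both splits
train = fetch_20newsgroups(subset='train', categories=categories)
test = fetch_20newsgroups(subset='test', categories=categories)
print(train.target_names)
```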

Step 3:Train and Test Data

Let’s print the train data.

In the same way we can print the test data. In a Jupyter notebook there is no need to call print explicitly; simply writing a variable’s name displays its value.

Step 4: Importing required functions for Naïve Bayes theorem

To use the Naïve Bayes algorithm, we first need to convert the text into vectors of numerical values suitable for statistical analysis.

This can be done by importing TfidfVectorizer, and we use MultinomialNB to apply the Naïve Bayes theorem.
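To make the conversion concrete, here is the same idea on a tiny made-up corpus (the hypothetical labels 0 = sport and 1 = space are for illustration; the article applies this to the newsgroup posts):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# A tiny illustrative corpus with hypothetical labels: 0 = sport, 1 = space
docs = ["the team won the match",
        "a great game of football",
        "the rocket reached orbit",
        "nasa launched a new satellite"]
labels = [0, 0, 1, 1]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)    # documents -> TF-IDF vectors
clf = MultinomialNB().fit(X, labels)  # Naive Bayes on the vectors

print(clf.predict(vectorizer.transform(["the game was great"])))
```

Each document becomes a sparse row of TF-IDF weights, which is exactly the numerical input MultinomialNB expects.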

Step 5: Checking Accuracy of model

Let’s fit the data and check the accuracy of our model.

Finally, we get the confusion matrix for the chosen data set.
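Putting the steps together, a self-contained sketch of the fit-and-evaluate step might look like this (the five categories are again an illustrative choice, not the article’s):

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score, confusion_matrix

# Illustrative category choice; the article's five are not named
categories = ['comp.graphics', 'rec.sport.baseball', 'sci.med',
              'sci.space', 'talk.politics.misc']
train = fetch_20newsgroups(subset='train', categories=categories)
test = fetch_20newsgroups(subset='test', categories=categories)

# TF-IDF vectorization chained with a multinomial Naive Bayes classifier
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train.data, train.target)
predicted = model.predict(test.data)

print("accuracy:", accuracy_score(test.target, predicted))
print(confusion_matrix(test.target, predicted))
```

The confusion matrix has one row per true category and one column per predicted category, so a strong model shows large values on the diagonal.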

Conclusion:

1. Naïve Bayes is very efficient on text data sets.

2. We can build many machine learning models using the Naïve Bayes theorem. Different data sets are available on www.kaggle.com.

Source code:

> Python

> Scikit-learn

> Pandas

> NumPy

References:

1. www.wikipedia.org

2. https://towardsdatascience.com

3. https://scikit-learn.org

4. www.kaggle.com
