Text Classification with Bag of Words and Naive Bayes




Author: Bindeshwar Singh Kushwaha | PostNetwork Academy


Understanding Text with Machine Learning

  • Processing and understanding text allows extraction of meaningful information from raw data.
  • Text data can be structured into features that machine learning algorithms can analyze.
  • Machine learning approaches include supervised, unsupervised, and deep learning techniques.
  • AI models combine data and algorithms to identify patterns and generate actionable insights.

Text Classification and Categorization

  • Text classification organizes documents into predefined categories automatically.
  • Each document is represented by features derived from its words or phrases.
  • Applications include spam filtering, sentiment analysis, and news/topic categorization.
  • Effective classification relies on sufficient labeled data to train predictive models.

Supervised vs. Unsupervised

  • Supervised: documents have labels (e.g., spam, non-spam).
  • Unsupervised: documents grouped by similarity without labels.
  • Document classification is a general problem applicable to many use cases.

Practical Example

  1. Preprocessing text
  2. Extracting features
  3. Training a classification model
  4. Evaluating performance

Concept of Text Classification

Represent documents as features (words, phrases, embeddings).
Process: Preprocessing → Feature Extraction → Classification.

Simple Flow of Text Classification


  1. Input Documents
  2. Preprocessing (Cleaning, Tokenization)
  3. Feature Extraction (BoW, TF-IDF, Embeddings)
  4. Classification Model (SVM, Naive Bayes, Neural Net)
  5. Predicted Category (Spam/Not Spam, etc.)
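The five steps above can be sketched end to end with scikit-learn (a library choice this article does not prescribe, so treat it as an assumption): `CountVectorizer` covers preprocessing and BoW feature extraction, and `MultinomialNB` is the classification model.

```python
# Minimal end-to-end sketch of the five-step flow (assumes scikit-learn is installed).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Step 1: input documents with labels (the example reviews from this article)
docs = [
    "I absolutely loved this movie. I loved the story and the characters.",
    "The acting was amazing and the story was touching.",
    "What a great experience. The visuals were memorable.",
    "This movie was thrilling and full of suspense.",
    "The cinematography and direction were excellent.",
    "I hated this film, it was boring and too long.",
    "The plot was weak and the acting was terrible.",
    "Such a disappointing movie, I regret watching it.",
    "This was the worst movie I have seen.",
    "Poor storyline and bad performances throughout.",
]
labels = ["Positive"] * 5 + ["Negative"] * 5

# Steps 2-4: CountVectorizer handles cleaning/tokenization and BoW features;
# MultinomialNB is the classification model.
clf = Pipeline([("bow", CountVectorizer(lowercase=True)),
                ("nb", MultinomialNB())])
clf.fit(docs, labels)

# Step 5: predicted category for a new document
print(clf.predict(["I loved the story"])[0])  # -> Positive
```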

Mathematical Formulation

Suppose there are $n$ distinct words across all documents.
Each document $D$ can be represented as:

\[
D = (w_{1D}, w_{2D}, \dots, w_{nD})
\]

where $w_{iD}$ is the weight of word $i$ in document $D$ (raw frequency, TF-IDF weight, etc.).
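A minimal sketch of this representation in plain Python, using raw frequencies as the weights $w_{iD}$ (the three-word vocabulary and the regex tokenizer are illustrative assumptions):

```python
import re
from collections import Counter

def bow_vector(document, vocabulary):
    """Represent a document D as (w_1D, ..., w_nD) using raw word frequencies."""
    tokens = re.findall(r"[a-z']+", document.lower())  # naive tokenizer
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

vocab = ["loved", "movie", "story"]
doc = "I absolutely loved this movie. I loved the story and the characters."
print(bow_vector(doc, vocab))  # -> [2, 1, 1]
```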

Common Feature Extraction Models

  • Bag of Words (BoW)
  • TF-IDF (Term Frequency–Inverse Document Frequency)
  • Word2Vec, GloVe, BERT embeddings
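A small sketch of TF-IDF weighting using the common form $w_{iD} = \mathrm{tf}_{iD} \cdot \ln(N/\mathrm{df}_i)$ (library implementations such as scikit-learn's `TfidfVectorizer` use smoothed and normalized variants, so exact numbers differ):

```python
import math
from collections import Counter

def tf_idf(documents):
    """Compute tf-idf weights w = tf * ln(N / df) for a pre-tokenized corpus."""
    n_docs = len(documents)
    df = Counter()                       # document frequency of each word
    for doc in documents:
        df.update(set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)                # term frequency within this document
        weights.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return weights

corpus = [["great", "movie"], ["boring", "movie"]]
w = tf_idf(corpus)
# "movie" occurs in every document, so its idf is ln(2/2) = 0
print(w[0]["movie"], round(w[0]["great"], 3))  # -> 0.0 0.693
```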

Example Movie Reviews

Positive:

  • I absolutely loved this movie. I loved the story and the characters.
  • The acting was amazing and the story was touching.
  • What a great experience. The visuals were memorable.
  • This movie was thrilling and full of suspense.
  • The cinematography and direction were excellent.

Negative:

  • I hated this film, it was boring and too long.
  • The plot was weak and the acting was terrible.
  • Such a disappointing movie, I regret watching it.
  • This was the worst movie I have seen.
  • Poor storyline and bad performances throughout.

BoW Representation

Review  loved  movie  fantastic  boring  terrible  great  excellent  worst  acting  story  Class
        (x1)   (x2)   (x3)       (x4)    (x5)      (x6)   (x7)       (x8)   (x9)    (x10)
1       2      1      0          0       0         0      0          0      0       1      Positive
6       0      1      0          1       0         0      0          0      0       0      Negative
10      0      0      0          0       1         0      0          0      0       0      Negative
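Row 1 of the table can be reproduced by counting the vocabulary words in the first review (lowercasing plus a simple regex tokenizer is an assumption about the preprocessing):

```python
import re
from collections import Counter

VOCAB = ["loved", "movie", "fantastic", "boring", "terrible",
         "great", "excellent", "worst", "acting", "story"]

def bow_row(review):
    """Count occurrences of each vocabulary word in one review."""
    counts = Counter(re.findall(r"[a-z]+", review.lower()))
    return [counts[w] for w in VOCAB]

review_1 = "I absolutely loved this movie. I loved the story and the characters."
print(bow_row(review_1))  # -> [2, 1, 0, 0, 0, 0, 0, 0, 0, 1]
```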

Naive Bayes Classification

Naive Bayes uses Bayes’ Theorem:

\[
P(y \mid x) = \frac{P(y)\, P(x \mid y)}{P(x)}
\]

Since $P(x)$ is constant across classes, it can be dropped; assuming the word counts are conditionally independent given the class (the "naive" multinomial assumption):

\[
\hat{y} = \arg\max_y P(y) \prod_{i=1}^{n} P(x_i \mid y)^{x_i}
\]
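In practice the product is evaluated in log space to avoid floating-point underflow when many words are involved. A minimal sketch of the arg-max, with illustrative probabilities that are assumed rather than trained:

```python
import math

def predict(x, priors, likelihoods):
    """argmax_y of log P(y) + sum_i x_i * log P(x_i | y) (multinomial Naive Bayes)."""
    best, best_score = None, float("-inf")
    for y in priors:
        score = math.log(priors[y]) + sum(
            xi * math.log(p) for xi, p in zip(x, likelihoods[y]) if xi > 0)
        if score > best_score:
            best, best_score = y, score
    return best

# Illustrative numbers: per-class word probabilities (assumed, not from the article's data)
priors = {"Positive": 0.5, "Negative": 0.5}
likelihoods = {"Positive": [0.6, 0.3, 0.1], "Negative": [0.1, 0.2, 0.7]}
print(predict([2, 1, 0], priors, likelihoods))  # -> Positive
```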

Worked Example (Review 1)

Review: “I absolutely loved this movie. I loved the story and the characters.”

BoW vector: $x = [2,1,0,0,0,0,0,0,0,1]$

Positive class score (unnormalized), using the class-conditional word probabilities estimated from the positive reviews:

\[
P(Positive \mid x) \propto 0.5 \cdot (0.182^2 \cdot 0.273 \cdot 0.182) \approx 0.00082
\]

Negative class score (zero because words such as "loved" never occur in the negative training reviews and no smoothing is applied):

\[
P(Negative \mid x) = 0
\]

Conclusion: Classified as Positive.
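The unnormalized Positive score can be checked numerically from the quoted probabilities (0.182 for "loved" and "story", 0.273 for "movie", prior 0.5):

```python
# Reproduce the worked example: prior 0.5, with BoW exponents
# (loved: 2, movie: 1, story: 1) applied to the quoted word probabilities.
score = 0.5 * (0.182 ** 2) * 0.273 * 0.182
print(round(score, 5))  # -> 0.00082
```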


Reach PostNetwork Academy

  • Website: www.postnetwork.co
  • YouTube: www.youtube.com/@postnetworkacademy
  • Facebook: www.facebook.com/postnetworkacademy
  • LinkedIn: www.linkedin.com/company/postnetworkacademy
  • GitHub: www.github.com/postnetworkacademy

Thank You!

© PostNetwork Academy. All rights reserved.