Text Classification with Bag of Words and Naive Bayes
Author: Bindeshwar Singh Kushwaha | PostNetwork Academy
Understanding Text with Machine Learning
- Processing and understanding text enables the extraction of meaningful information from raw data.
- Text data can be structured into features that machine learning algorithms can analyze.
- Machine learning approaches include supervised, unsupervised, and deep learning techniques.
- AI models combine data and algorithms to identify patterns and generate actionable insights.
Text Classification and Categorization
- Text classification organizes documents into predefined categories automatically.
- Each document is represented by features derived from its words or phrases.
- Applications include spam filtering, sentiment analysis, and news/topic categorization.
- Effective classification relies on sufficient labeled data to train predictive models.
Supervised vs. Unsupervised
- Supervised: documents have labels (e.g., spam, non-spam).
- Unsupervised: documents grouped by similarity without labels.
- Document classification is a general problem applicable to many use cases.
Practical Example
- Preprocessing text
- Extracting features
- Training a classification model
- Evaluating performance
Concept of Text Classification
Represent documents as features (words, phrases, embeddings).
Process: Preprocessing → Feature Extraction → Classification.
Simple Flow of Text Classification
- Input Documents
- Preprocessing (Cleaning, Tokenization)
- Feature Extraction (BoW, TF-IDF, Embeddings)
- Classification Model (SVM, Naive Bayes, Neural Net)
- Predicted Category (Spam/Not Spam, etc.)
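The flow above can be sketched end to end with scikit-learn; the four-document spam/ham corpus below is a hypothetical toy example, not data from this article.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: four labeled documents
docs = ["free prize click now", "meeting at noon tomorrow",
        "win a free prize today", "lunch meeting with the team"]
labels = ["Spam", "Not Spam", "Spam", "Not Spam"]

# Feature extraction (BoW via CountVectorizer) + classifier in one pipeline;
# CountVectorizer also handles basic preprocessing (lowercasing, tokenization)
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(docs, labels)

# Predicted category for a new document
print(model.predict(["claim your free prize"])[0])  # Spam
```

Swapping `CountVectorizer` for `TfidfVectorizer`, or `MultinomialNB` for an SVM, changes only one pipeline stage, which is the point of the modular flow.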
Mathematical Formulation
Suppose there are $n$ distinct words across all documents.
Each document $D$ can be represented as:
\[
D = (w_{1D}, w_{2D}, \dots, w_{nD})
\]
where $w_{iD}$ is the weight of word $i$ in document $D$ (raw frequency, TF-IDF weight, etc.).
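A minimal sketch of this representation with raw frequencies as weights, using a hypothetical three-word vocabulary:

```python
from collections import Counter

# Toy vocabulary of n = 3 distinct words (hypothetical)
vocab = ["loved", "movie", "story"]

doc = "I loved this movie and loved the story"
counts = Counter(doc.lower().split())

# D = (w_1D, ..., w_nD) with raw term frequency as the weight w_iD
D = [counts[w] for w in vocab]
print(D)  # [2, 1, 1]
```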
Common Feature Extraction Models
- Bag of Words (BoW)
- TF-IDF (Term Frequency–Inverse Document Frequency)
- Word2Vec, GloVe, BERT embeddings
Example Movie Reviews
Positive:
- I absolutely loved this movie. I loved the story and the characters.
- The acting was amazing and the story was touching.
- What a great experience. The visuals were memorable.
- This movie was thrilling and full of suspense.
- The cinematography and direction were excellent.
Negative:
- I hated this film, it was boring and too long.
- The plot was weak and the acting was terrible.
- Such a disappointing movie, I regret watching it.
- This was the worst movie I have seen.
- Poor storyline and bad performances throughout.
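The ten reviews above are enough to train a small classifier. The sketch below uses scikit-learn's `CountVectorizer` and `MultinomialNB`; the test sentence at the end is a made-up example, not one of the reviews.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

positive = [
    "I absolutely loved this movie. I loved the story and the characters.",
    "The acting was amazing and the story was touching.",
    "What a great experience. The visuals were memorable.",
    "This movie was thrilling and full of suspense.",
    "The cinematography and direction were excellent.",
]
negative = [
    "I hated this film, it was boring and too long.",
    "The plot was weak and the acting was terrible.",
    "Such a disappointing movie, I regret watching it.",
    "This was the worst movie I have seen.",
    "Poor storyline and bad performances throughout.",
]

# Build BoW features over the full corpus, then fit multinomial Naive Bayes
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(positive + negative)
y = ["Positive"] * 5 + ["Negative"] * 5
clf = MultinomialNB().fit(X, y)

# Classify an unseen (hypothetical) review
print(clf.predict(vectorizer.transform(["I loved the amazing story"]))[0])
```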
BoW Representation (rows for Reviews 1, 6, and 10 over a 10-word vocabulary)
| Review | loved (x1) | movie (x2) | fantastic (x3) | boring (x4) | terrible (x5) | great (x6) | excellent (x7) | worst (x8) | acting (x9) | story (x10) | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | Positive |
| 6 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | Negative |
| 10 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | Negative |
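The first row of the table can be reproduced by fixing `CountVectorizer` to the same 10-word vocabulary:

```python
from sklearn.feature_extraction.text import CountVectorizer

# The 10-word vocabulary from the table (x1 .. x10)
vocab = ["loved", "movie", "fantastic", "boring", "terrible",
         "great", "excellent", "worst", "acting", "story"]
vectorizer = CountVectorizer(vocabulary=vocab)

review_1 = "I absolutely loved this movie. I loved the story and the characters."
row = vectorizer.transform([review_1]).toarray()[0].tolist()
print(row)  # [2, 1, 0, 0, 0, 0, 0, 0, 0, 1]
```

Words outside the fixed vocabulary ("absolutely", "characters", ...) are simply dropped, which is why the row contains mostly zeros.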
Naive Bayes Classification
Naive Bayes uses Bayes’ Theorem:
\[
P(y \mid x) = \frac{P(y)\, P(x \mid y)}{P(x)}
\]
Since $P(x)$ is the same for every class, it can be dropped; with the naive independence assumption, the multinomial decision rule becomes:
\[
\hat{y} = \arg\max_y P(y) \prod_{i=1}^{n} P(x_i \mid y)^{x_i}
\]
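In practice the product is computed in log space to avoid underflow. A sketch of the argmax with hypothetical smoothed likelihoods $P(x_i \mid y)$ over a three-word vocabulary:

```python
import math

# Hypothetical priors P(y) and smoothed likelihoods P(x_i | y)
priors = {"Positive": 0.5, "Negative": 0.5}
likelihoods = {
    "Positive": [0.5, 0.3, 0.2],
    "Negative": [0.1, 0.3, 0.6],
}
x = [2, 1, 0]  # word counts x_i for the document

def log_score(y):
    # log P(y) + sum_i x_i * log P(x_i | y): the log of the argmax expression
    return math.log(priors[y]) + sum(
        c * math.log(p) for c, p in zip(x, likelihoods[y])
    )

y_hat = max(priors, key=log_score)
print(y_hat)
```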
Worked Example (Review 1)
Review: “I absolutely loved this movie. I loved the story and the characters.”
BoW vector: $x = [2,1,0,0,0,0,0,0,0,1]$
Positive class probability (approx):
\[
P(Positive \mid x) \propto 0.5 \cdot (0.182^2 \cdot 0.273 \cdot 0.182) \approx 0.00083
\]
Negative class probability: the word "loved" never occurs in a negative review, so its unsmoothed likelihood, and hence the whole product, is zero:
\[
P(Negative \mid x) = 0
\]
Conclusion: Classified as Positive.
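The arithmetic for the positive score can be checked directly (using the rounded likelihood values quoted above):

```python
# Hand computation for Review 1 with the rounded values from the text
p_positive = 0.5 * (0.182 ** 2) * 0.273 * 0.182  # roughly 8.2e-4
p_negative = 0.0  # "loved" never occurs in a negative review (no smoothing)

# The larger (unnormalized) score decides the class
prediction = "Positive" if p_positive > p_negative else "Negative"
print(prediction)
```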
Reach PostNetwork Academy
- Website: www.postnetwork.co
- YouTube: www.youtube.com/@postnetworkacademy
- Facebook: www.facebook.com/postnetworkacademy
- LinkedIn: www.linkedin.com/company/postnetworkacademy
- GitHub: www.github.com/postnetworkacademy