Jan 2024 - May 2024
Sentiment Analysis using Naive Bayes
NLP classification pipeline for movie review sentiment.
This project builds a text classification pipeline that predicts review sentiment using preprocessing, tokenization, word frequency distributions, Laplace smoothing, and log-probability scoring.
Conceptual Visual
NLP Classification Pipeline
Raw Review
Preprocessing
Tokenization
Naive Bayes
Positive / Negative
Highlights
NLP Pipeline
Naive Bayes Model
Laplace Smoothing
CLI Workflow
Problem Statement
Raw text needs structured preprocessing and robust probability scoring before it can be classified reliably. This project demonstrates a foundational NLP workflow from data processing to prediction.
What I Built
Text preprocessing
Tokenization
Laplace smoothing
Configurable datasets
CLI execution
How It Works
A conceptual workflow showing how the project moves from input to processing and output.
Step 1: Dataset
Step 2: Cleaning
Step 3: Tokenization
Step 4: Word Frequency Training
Step 5: Log Probability Scoring
Step 6: Sentiment Prediction
Architecture / System Design
A simplified system view of the major project components and how responsibilities connect.
Step 1: Text Input
Step 2: Preprocessor
Step 3: Feature Extractor
Step 4: Naive Bayes Classifier
Step 5: Prediction Output
Technical Implementation
Preprocessing
- Lowercasing
- Punctuation removal
- Tokenization
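The three preprocessing steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the project's actual code: the function name `preprocess` is hypothetical, punctuation is assumed to mean ASCII punctuation only, and tokenization is assumed to be a plain whitespace split.

```python
import string

# Translation table that deletes every ASCII punctuation character.
# Assumption: non-ASCII punctuation (curly quotes, etc.) is left untouched.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(text):
    """Lowercase the text, remove punctuation, and split into tokens."""
    lowered = text.lower()
    no_punct = lowered.translate(_PUNCT_TABLE)
    # str.split() with no argument collapses runs of whitespace.
    return no_punct.split()


tokens = preprocess("The movie was surprisingly emotional and well acted.")
print(tokens)
```

One consequence of deleting punctuation outright is that hyphenated words such as "well-acted" collapse into a single token ("wellacted"). Replacing punctuation with a space instead would keep the parts separate; which behavior is preferable depends on the dataset.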
Model
- Word frequency distributions
- Laplace smoothing
- Log-probability scoring
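To make the model components above concrete, here is a small sketch of a multinomial Naive Bayes classifier built from per-class word counts, additive smoothing, and summed log-probabilities. The function names (`train`, `log_scores`, `predict`), the dictionary layout, and the two-review toy dataset are all invented for illustration. The sketch also assumes that a word absent from the training vocabulary is still scored with the smoothed estimate rather than skipped; the project may handle that case differently.

```python
import math
from collections import Counter


def train(examples):
    """Build count tables from an iterable of (tokens, label) pairs."""
    word_counts = {}        # label -> Counter of word frequencies
    doc_counts = Counter()  # label -> number of training documents
    vocab = set()
    for tokens, label in examples:
        word_counts.setdefault(label, Counter()).update(tokens)
        doc_counts[label] += 1
        vocab.update(tokens)
    return {"word_counts": word_counts, "doc_counts": doc_counts, "vocab": vocab}


def log_scores(model, tokens, alpha=1.0):
    """Return label -> log prior + sum of smoothed log likelihoods.

    alpha=1.0 corresponds to Laplace (add-one) smoothing.
    """
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    scores = {}
    for label, counts in model["word_counts"].items():
        score = math.log(model["doc_counts"][label] / total_docs)
        denom = sum(counts.values()) + alpha * vocab_size
        for tok in tokens:
            # Counter returns 0 for missing words, so unseen words get alpha / denom.
            score += math.log((counts[tok] + alpha) / denom)
        scores[label] = score
    return scores


def predict(model, tokens):
    """Pick the label with the highest log score."""
    scores = log_scores(model, tokens)
    return max(scores, key=scores.get)


# Toy data, invented for this example only.
toy_data = [
    (["great", "acting"], "positive"),
    (["boring", "plot"], "negative"),
]
toy_model = train(toy_data)
print(predict(toy_model, ["great"]))
```

Summing logarithms instead of multiplying raw probabilities keeps the score in a safe numeric range even for long reviews, where a product of many small probabilities would underflow to zero.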
Workflow
- Configurable datasets
- CLI execution
- Positive/negative classification
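The source does not show the command-line interface itself, so the following is only a guess at its general shape using `argparse` from the standard library. Every flag name here (`--train`, `--text`, `--alpha`) and the program name `sentiment` are hypothetical, and `main` stops short of loading data or running the classifier.

```python
import argparse


def build_parser():
    """Create the argument parser. All flag names are hypothetical."""
    parser = argparse.ArgumentParser(
        prog="sentiment",
        description="Classify a movie review as positive or negative.",
    )
    parser.add_argument("--train", required=True,
                        help="Path to the labeled training dataset.")
    parser.add_argument("--text", required=True,
                        help="Review text to classify.")
    parser.add_argument("--alpha", type=float, default=1.0,
                        help="Smoothing constant; 1.0 is Laplace smoothing.")
    return parser


def main(argv=None):
    """Parse arguments and echo the resolved configuration."""
    args = build_parser().parse_args(argv)
    # A real run would load args.train, fit the model, and score args.text here.
    print(f"dataset={args.train} alpha={args.alpha} text={args.text!r}")
    return 0


main(["--train", "reviews.tsv", "--text", "Surprisingly emotional and well acted."])
```

Passing the dataset path as an argument rather than hard-coding it is one straightforward way to get the "configurable datasets" behavior listed above. Calling `main()` with no arguments makes `argparse` read from `sys.argv`, which is what a script entry point would typically do.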
Tools
- Python
- NLP fundamentals
- Probabilistic modeling
Visual Showcase
Conceptual preview panels for the project experience. These are intentional placeholders, not fake screenshots.
NLP Pipeline Diagram
Conceptual flow from raw text to sentiment prediction.
Tokenization Preview
Placeholder panel showing cleaned tokens prepared for modeling.
Probability Score Panel
Visual concept for comparing class-level log scores.
Classification Output Card
Clean result card for positive or negative prediction output.
Classification Preview
Input:
"The movie was surprisingly emotional and well acted."
Prediction:
Positive Review
Challenges & Solutions
Challenge
Raw text is noisy and cannot be modeled directly.
Solution
Built a preprocessing pipeline for lowercasing, punctuation removal, and tokenization.
Challenge
A word that never appeared in a class during training gets an estimated probability of zero, which zeroes out the entire product for that class.
Solution
Used Laplace (add-one) smoothing so unseen words keep a small nonzero probability, and summed log-probabilities instead of multiplying raw probabilities to avoid numeric underflow.
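For reference, the standard textbook form of these two ideas is written out below. The notation is an assumption introduced here, not taken from the project: $c$ is a class (positive or negative), $V$ is the training vocabulary, $\operatorname{count}(w, c)$ is how often word $w$ occurs in training reviews of class $c$, and $w_1, \dots, w_n$ are the tokens of the review being classified.

```latex
\hat{P}(w \mid c) = \frac{\operatorname{count}(w, c) + 1}{\sum_{w' \in V} \operatorname{count}(w', c) + |V|}
\qquad
\hat{c} = \arg\max_{c} \left[ \log P(c) + \sum_{i=1}^{n} \log \hat{P}(w_i \mid c) \right]
```

Because the numerator is at least 1, every logarithm in the sum is finite, so a single unfamiliar word lowers a class score without eliminating the class.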
Results / Impact
Demonstrates practical software engineering through modular structure, readable workflows, and clear technical documentation.
Shows ability to convert course and research concepts into working systems with real implementation constraints.