Soojal Kumar
Jan 2024 - May 2024

Sentiment Analysis using Naive Bayes

NLP classification pipeline for movie review sentiment.

This project builds a text classification pipeline that predicts review sentiment using preprocessing, tokenization, word frequency distributions, Laplace smoothing, and log-probability scoring.

Python, NLP, Machine Learning, Text Processing, Naive Bayes

Conceptual Visual

NLP Classification Pipeline

Raw Review → Preprocessing → Tokenization → Naive Bayes → Positive / Negative

Highlights

  • NLP Pipeline
  • Naive Bayes Model
  • Laplace Smoothing
  • CLI Workflow

Problem Statement

Raw text needs structured preprocessing and robust probability scoring before it can be classified reliably. This project demonstrates a foundational NLP workflow from data processing to prediction.

What I Built

  • Text preprocessing
  • Tokenization
  • Laplace smoothing
  • Configurable datasets
  • CLI execution

How It Works

A conceptual workflow showing how the project moves from input to processing and output.

Step 1: Dataset
Step 2: Cleaning
Step 3: Tokenization
Step 4: Word Frequency Training
Step 5: Log Probability Scoring
Step 6: Sentiment Prediction
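
The training, scoring, and prediction steps can be written compactly. The page does not state the exact equations, so the block below is the standard multinomial Naive Bayes formulation with add-one (Laplace) smoothing, offered as an assumption about what steps 4 through 6 compute. Here count(w, c) is how often word w appears in training reviews of class c, |V| is the vocabulary size, and w_1 ... w_n are the tokens of the review being classified.

```latex
% Step 4: smoothed word likelihood from per-class word frequencies
\hat{P}(w \mid c) = \frac{\mathrm{count}(w, c) + 1}{\sum_{w' \in V} \mathrm{count}(w', c) + |V|}

% Step 5: log-probability score of a review for class c
\mathrm{score}(c) = \log P(c) + \sum_{i=1}^{n} \log \hat{P}(w_i \mid c)

% Step 6: predicted sentiment
\hat{c} = \operatorname*{arg\,max}_{c \in \{\text{positive},\, \text{negative}\}} \mathrm{score}(c)
```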

Architecture / System Design

A simplified system view of the major project components and how responsibilities connect.

Step 1: Text Input
Step 2: Preprocessor
Step 3: Feature Extractor
Step 4: Naive Bayes Classifier
Step 5: Prediction Output
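
One way the components above could be wired together is sketched below. The class name `SentimentPipeline` and the two stand-in stages are hypothetical and only show how the responsibilities connect. The keyword rule in particular is a placeholder for the Naive Bayes classifier, not the project's model.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable


@dataclass
class SentimentPipeline:
    """Hypothetical wiring: text -> tokens -> word counts -> label."""
    preprocess: Callable[[str], list[str]]
    extract: Callable[[list[str]], Counter]
    classify: Callable[[Counter], str]

    def predict(self, text: str) -> str:
        tokens = self.preprocess(text)      # Preprocessor
        features = self.extract(tokens)     # Feature Extractor
        return self.classify(features)      # Classifier -> Prediction Output


def simple_preprocess(text: str) -> list[str]:
    # Stand-in preprocessor: lowercase and split on whitespace.
    return text.lower().split()


def keyword_classifier(features: Counter) -> str:
    # Placeholder rule so the sketch runs end to end; NOT Naive Bayes.
    return "positive" if features["good"] >= features["bad"] else "negative"


# Counter itself serves as a bag-of-words feature extractor.
pipeline = SentimentPipeline(simple_preprocess, Counter, keyword_classifier)
print(pipeline.predict("Good cast and a good script"))
```

Keeping each stage behind a plain callable means any one of them can be swapped or tested without touching the others.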

Technical Implementation

Preprocessing

  • Lowercasing
  • Punctuation removal
  • Tokenization
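
A minimal sketch of these three preprocessing steps, assuming ASCII punctuation and whitespace tokenization. The function name `preprocess` is illustrative, and the project's actual rules may differ.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(text: str) -> list[str]:
    """Lowercase, remove ASCII punctuation, then split on whitespace."""
    lowered = text.lower()
    no_punct = lowered.translate(_PUNCT_TABLE)
    return no_punct.split()


tokens = preprocess("The movie was surprisingly emotional and well acted.")
print(tokens)
# ['the', 'movie', 'was', 'surprisingly', 'emotional', 'and', 'well', 'acted']
```

Note that this simple approach also strips apostrophes, so "don't" becomes "dont".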

Model

  • Word frequency distributions
  • Laplace smoothing
  • Log-probability scoring
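
The three model pieces could fit together roughly as follows. This is a sketch of a standard multinomial Naive Bayes classifier, not the project's code. The class name `NaiveBayesSentiment`, the `alpha` parameter, and the four toy reviews are assumptions made for illustration.

```python
import math
from collections import Counter


class NaiveBayesSentiment:
    """Multinomial Naive Bayes with Laplace (add-alpha) smoothing."""

    def __init__(self, alpha: float = 1.0):
        self.alpha = alpha
        self.word_counts = {}        # label -> Counter of word frequencies
        self.doc_counts = Counter()  # label -> number of training reviews
        self.vocab = set()

    def fit(self, documents, labels):
        for tokens, label in zip(documents, labels):
            self.doc_counts[label] += 1
            self.word_counts.setdefault(label, Counter()).update(tokens)
            self.vocab.update(tokens)
        return self

    def log_scores(self, tokens):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        scores = {}
        for label, counts in self.word_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + self.alpha * vocab_size
            for tok in tokens:
                # Counter returns 0 for unseen words, so smoothing applies.
                score += math.log((counts[tok] + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)


# Toy training data, invented for this example.
train_docs = [
    ["great", "acting", "and", "moving", "story"],
    ["loved", "the", "emotional", "ending"],
    ["boring", "plot", "and", "weak", "acting"],
    ["terrible", "pacing", "and", "dull", "ending"],
]
train_labels = ["positive", "positive", "negative", "negative"]

model = NaiveBayesSentiment().fit(train_docs, train_labels)
print(model.predict(["emotional", "and", "moving"]))  # positive
```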

Workflow

  • Configurable datasets
  • CLI execution
  • Positive/negative classification
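
A command-line entry point with a configurable dataset might look like the sketch below, built on the standard-library `argparse` module. The flag names `--train` and `--alpha`, the positional `review` argument, and the file name `reviews.tsv` are all hypothetical, since the page does not document the real interface.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Train a Naive Bayes sentiment classifier and classify one review."
    )
    # Hypothetical flags; the project's real options are not documented here.
    parser.add_argument("--train", required=True,
                        help="Path to the labeled training dataset.")
    parser.add_argument("--alpha", type=float, default=1.0,
                        help="Laplace smoothing constant (default: 1.0).")
    parser.add_argument("review", help="Review text to classify.")
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    # A full script would load args.train, fit the model, and print a label.
    print(f"dataset={args.train} alpha={args.alpha} review={args.review!r}")
    return args


# Passing argv explicitly keeps the example runnable outside a shell.
main(["--train", "reviews.tsv", "A moving and well acted film."])
```

In a real script, `main()` would be called with no arguments under an `if __name__ == "__main__":` guard so that it reads from `sys.argv`.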

Tools

  • Python
  • NLP fundamentals
  • Probabilistic modeling

Visual Showcase

Conceptual preview panels for the project experience. These are intentional placeholders, not fake screenshots.

NLP Pipeline Diagram

Conceptual flow from raw text to sentiment prediction.

Tokenization Preview

Placeholder panel showing cleaned tokens prepared for modeling.

Probability Score Panel

Visual concept for comparing class-level log scores.

Classification Output Card

Clean result card for positive or negative prediction output.

Classification Preview

Input:
"The movie was surprisingly emotional and well acted."

Prediction:
Positive Review

Challenges & Solutions

Challenge

Raw text is noisy and cannot be modeled directly.

Solution

Built a preprocessing pipeline for lowercasing, punctuation removal, and tokenization.

Challenge

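A word that never appears in a class's training data gets a probability estimate of zero, which zeroes out the entire product for that class.

Solution

Used Laplace smoothing so unseen words keep a small nonzero probability, and summed log-probabilities instead of multiplying raw probabilities to avoid floating-point underflow on longer reviews.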
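
Both failure modes are easy to reproduce. The counts and probabilities below are made-up numbers chosen only to show the effect.

```python
import math

# Hypothetical counts: the word never occurs in this class.
count, total_words, vocab_size = 0, 1000, 5000

unsmoothed = count / total_words                      # exactly 0.0
smoothed = (count + 1) / (total_words + vocab_size)   # small but nonzero

# Multiplying many small probabilities underflows to 0.0 in floating point,
# while the sum of their logs stays finite and comparable across classes.
probs = [1e-5] * 100
product = math.prod(probs)
log_sum = sum(math.log(p) for p in probs)

print(unsmoothed, smoothed)
print(product, log_sum)
```

Here `product` prints as `0.0` while `log_sum` is a finite negative number, which is why scores are compared in log space.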

Results / Impact

Demonstrates practical software engineering through a modular structure (separate preprocessing, feature extraction, and classification stages), a readable workflow, and clear technical documentation.

Shows the ability to turn course and research concepts into a working system under real implementation constraints.