Sentiment Analysis with SVM + Light Stemming + Chi‑Square

Four‑class tweet classifier using TF‑IDF, Chi‑Square feature selection, and SVM.

Tags: Machine Learning · Python · scikit‑learn · NLTK · Pandas · SVM · TF‑IDF · Chi‑Square

Project Overview

Classifies tweets into Negative, Irrelevant, Neutral, and Positive. Pipeline: light stemming + stopword removal → TF‑IDF n‑grams → Chi‑Square feature selection → linear SVM. Saved artifacts (model, vectorizer, selector) enable reproducible predictions.

Project Structure

Datasets (training/validation), scripts to train and predict, and saved artifacts (svm_model.pkl, vectorizer.pkl, chi2_selector.pkl) for reuse.

sentiment-analysis/
├─ data/
│  ├─ twitter_training.csv
│  └─ twitter_validation.csv
├─ models/
│  ├─ svm_model.pkl
│  ├─ vectorizer.pkl
│  └─ chi2_selector.pkl
├─ scripts/
│  ├─ train_model.py
│  └─ predict_sentiment.py
└─ README.md

Light Preprocessing & Stemming

Normalize text by removing numbers, tokenizing, lowercasing, and removing stopwords, then apply light Porter stemming. The same function is reused at inference.

import re
from nltk.tokenize import RegexpTokenizer
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
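# NLTK's stopword corpus must be available; download it once if needed:
# import nltk; nltk.download('stopwords')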

stop_words = set(stopwords.words('english'))
tokenizer = RegexpTokenizer(r'\w+')
stemmer = PorterStemmer()

def preprocess_text(text: str) -> str:
    text = re.sub(r'\d+', '', text)                # strip numbers
    tokens = tokenizer.tokenize(text.lower())       # tokenize + lowercase
    filtered = [w for w in tokens if w not in stop_words]     # drop stopwords
    stemmed  = [stemmer.stem(w) for w in filtered]  # light stemming
    return ' '.join(stemmed)
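
A quick sanity check on a made-up tweet (exact stems may vary slightly across NLTK versions):

print(preprocess_text("Loving these 2 new features!!!"))   # -> 'love new featur'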

TF‑IDF + Chi‑Square Feature Selection

Convert text to numerical features using TF‑IDF with uni/bi‑grams, then keep only the k best features by Chi‑Square to reduce noise and improve generalization.
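
The code below assumes train_df and valid_df are DataFrames with text and sentiment columns. A minimal loading sketch (the header-less four-column layout is an assumption about the CSVs):

import pandas as pd

cols = ['id', 'entity', 'sentiment', 'text']              # assumed column layout
train_df = pd.read_csv('data/twitter_training.csv', names=cols)
valid_df = pd.read_csv('data/twitter_validation.csv', names=cols)

# Apply the preprocessing defined above so vectorization sees normalized text.
train_df['text'] = train_df['text'].astype(str).map(preprocess_text)
valid_df['text'] = valid_df['text'].astype(str).map(preprocess_text)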

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X_train_tfidf = vectorizer.fit_transform(train_df['text'])
X_valid_tfidf = vectorizer.transform(valid_df['text'])

chi2_selector = SelectKBest(chi2, k=3000)
X_train = chi2_selector.fit_transform(X_train_tfidf, train_df['sentiment'])
X_valid = chi2_selector.transform(X_valid_tfidf)
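
To sanity-check the selection, the surviving n-grams can be listed (a sketch; get_feature_names_out needs scikit-learn >= 1.0):

feature_names = vectorizer.get_feature_names_out()
kept = feature_names[chi2_selector.get_support()]         # boolean mask over TF-IDF columns
print(len(kept), 'features kept; sample:', kept[:10])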

Model Training & Persistence (SVM)

Train a linear SVM and save the model, vectorizer, and selector for reproducible inference.

from sklearn.svm import SVC
import pickle, os

svm_model = SVC(kernel='linear')
svm_model.fit(X_train, train_df['sentiment'])

os.makedirs('models', exist_ok=True)
for name, obj in [('svm_model', svm_model), ('vectorizer', vectorizer), ('chi2_selector', chi2_selector)]:
    with open(f'models/{name}.pkl', 'wb') as f:           # context manager flushes and closes the file
        pickle.dump(obj, f)

Evaluation

Report accuracy and per‑class metrics on the validation set.

from sklearn.metrics import classification_report, accuracy_score

y_pred = svm_model.predict(X_valid)
print(classification_report(valid_df['sentiment'], y_pred))
print('Accuracy:', accuracy_score(valid_df['sentiment'], y_pred))  # ~0.78 in our run
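
A confusion matrix complements the report by showing which classes get confused with each other (a small addition beyond the original script):

from sklearn.metrics import confusion_matrix

label_order = sorted(valid_df['sentiment'].unique())      # fix a row/column order
print(label_order)
print(confusion_matrix(valid_df['sentiment'], y_pred, labels=label_order))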

Prediction API (Script)

Load artifacts, run the same preprocessing, transform with TF‑IDF and Chi‑Square, then predict; the model returns the class label directly.

import pickle
from pathlib import Path
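
# Reuse the exact training-time preprocessing to avoid train/inference skew,
# e.g. (assuming the scripts/ layout above): from train_model import preprocess_text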

def load_artifacts(base='models'):
    base = Path(base)
    model = pickle.load(open(base/'svm_model.pkl', 'rb'))
    vec   = pickle.load(open(base/'vectorizer.pkl', 'rb'))
    sel   = pickle.load(open(base/'chi2_selector.pkl', 'rb'))
    return model, vec, sel

def predict_sentiment(text: str) -> str:
    model, vec, sel = load_artifacts()    # reloads artifacts on every call; cache them for bulk use
    processed = preprocess_text(text)
    feats = sel.transform(vec.transform([processed]))
    # The SVM was fit on the string labels in train_df['sentiment'], so predict()
    # already returns 'Negative', 'Irrelevant', 'Neutral', or 'Positive'.
    return model.predict(feats)[0]

print(predict_sentiment("I love this product! Highly recommended."))
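
A thin wrapper turns this into a small CLI (a sketch; the argument handling is an assumption, not part of the original script):

import sys

if __name__ == '__main__':
    text = ' '.join(sys.argv[1:]) or "I love this product! Highly recommended."
    print(predict_sentiment(text))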

Challenges

Tweets are noisy, with mixed casing, emojis, and an imbalance between the four classes. The project needed compact features that generalize, plus a deterministic pipeline that could be serialized and reused.

Solutions

Applied light normalization (lowercasing, number removal, tokenization, stopword removal) followed by Porter stemming. Extracted uni-/bi-gram TF‑IDF features, then reduced dimensionality with Chi‑Square selection (k=3000). Trained a linear SVM and persisted the model, vectorizer, and selector as .pkl files. The same preprocessing function is applied at inference to avoid train/inference skew.
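
For the class imbalance mentioned above, one optional lever (not necessarily what this run used) is SVC's built-in class weighting, which scales the per-class penalty inversely to class frequency:

from sklearn.svm import SVC

# Hypothetical variant: re-weight classes instead of relying on raw frequencies.
svm_balanced = SVC(kernel='linear', class_weight='balanced')
svm_balanced.fit(X_train, train_df['sentiment'])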

Results

On a validation set of 1,000 tweets, the SVM reached ~78% accuracy with balanced precision and recall across the four classes. The saved artifacts enable quick deployment as an API or a CLI.

Technologies Used

Python · scikit‑learn · NLTK · Pandas · SVM · TF‑IDF · Chi‑Square