daniel-j77/Machine-Learning-Based-Spam-Ham-Detection

📩 Machine Learning Based Spam & Ham Detection

An end-to-end Machine Learning and Natural Language Processing (NLP) project that automatically classifies SMS messages as Spam or Ham (Legitimate Message) using Logistic Regression and TF-IDF Vectorization.

The application is deployed using Streamlit, allowing users to test SMS messages in real time.


🚀 Live Demo

https://machine-learning-based-spam-ham-detection-yuewtcszp3runl7wssgn.streamlit.app/


📌 Problem Statement

Mobile users frequently receive unwanted SMS messages, commonly known as spam. These messages may contain advertisements, scams, phishing attempts, or fraudulent content.

The objective of this project is to develop a machine learning model capable of automatically classifying SMS messages as:

  • Spam
  • Ham (Legitimate Message)

based on the text content of the message.


🎯 Project Objectives

  • Detect spam SMS messages automatically
  • Reduce unwanted message exposure
  • Apply NLP techniques to text data
  • Build and evaluate a machine learning classification model
  • Deploy the model for real-time predictions

📊 Dataset

SMS Spam Collection Dataset

Dataset contains:

  • Text → SMS Content
  • Target → Spam / Ham Label

🔍 Exploratory Data Analysis (EDA)

Performed exploratory analysis to understand:

  • Message distribution
  • Character count
  • Word count
  • Sentence count
  • Spam vs Ham patterns

Feature Engineering

Created additional features:

  • Number of Characters
  • Number of Words
  • Number of Sentences
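
The three length features above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: the function name `message_features` and the regex-based sentence split are assumptions, and a dedicated tokenizer (such as the one in NLTK) would handle abbreviations and edge cases better.

```python
import re


def message_features(text):
    """Return simple length-based features for one SMS message."""
    words = text.split()
    # Naive sentence split on ., ! or ? (an assumption made for this sketch).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "num_characters": len(text),
        "num_words": len(words),
        "num_sentences": len(sentences),
    }


sample = "Congratulations! You won a prize. Call now."
features = message_features(sample)
print(features)
```

Spam messages in this kind of dataset are often longer than ham messages, which is why such counts can be informative during EDA.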

⚙️ Data Preprocessing

Text Cleaning

  • Removed special characters
  • Removed numbers
  • Removed extra spaces

Tokenization

  • Converted sentences into individual words

Stopword Removal

  • Removed common words with little semantic value
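
The cleaning, tokenization, and stopword steps above might look roughly like the sketch below. Several details are assumptions made for illustration: the function name `preprocess`, the lowercasing step, whitespace tokenization, and the tiny hand-picked `STOPWORDS` set, which stands in for a full stopword list such as the one shipped with NLTK.

```python
import re

# Small illustrative stopword set (assumption); a real run would use a full list.
STOPWORDS = {"a", "an", "the", "is", "are", "to", "you", "your",
             "for", "of", "and", "in", "on"}


def preprocess(text):
    """Clean a message and return its tokens without stopwords."""
    text = text.lower()                       # normalise case (assumed step)
    text = re.sub(r"[^a-z\s]", " ", text)     # drop special characters and numbers
    text = re.sub(r"\s+", " ", text).strip()  # collapse extra spaces
    tokens = text.split()                     # whitespace tokenization
    return [t for t in tokens if t not in STOPWORDS]


tokens = preprocess("WIN a FREE prize!!! Call 08001234 to claim your reward.")
print(tokens)
```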

Lemmatization

  • Reduced words to their base (dictionary) form

Example:

Running → Run

Playing → Play


📈 Outlier Detection

Applied:

  • Box Plot Analysis
  • Interquartile Range (IQR) Method

to identify and handle abnormal message lengths.
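
As a rough sketch of the IQR method applied to message lengths: values outside 1.5 × IQR from the quartiles are flagged. The sample lengths, the function name `iqr_bounds`, and the choice of the "inclusive" quantile method are assumptions for illustration; how the project handled the flagged messages (removal or capping) is not stated here.

```python
import statistics


def iqr_bounds(values, factor=1.5):
    """Return (lower, upper) fences using the interquartile range."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - factor * iqr, q3 + factor * iqr


# Hypothetical message lengths in characters.
lengths = [20, 25, 30, 35, 40, 45, 50, 400]
lower, upper = iqr_bounds(lengths)
outliers = [n for n in lengths if n < lower or n > upper]
print(lower, upper, outliers)
```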


🔤 Text Vectorization

Implemented:

TF-IDF Vectorization

Converted text data into numerical feature vectors suitable for machine learning algorithms.
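
A minimal sketch of this step with scikit-learn's `TfidfVectorizer`, using default settings and a made-up three-message corpus. The vectorizer parameters the project actually used are not documented here, so the defaults are an assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny hypothetical corpus of already-cleaned messages.
corpus = [
    "free prize call now",
    "are we meeting today",
    "call me when free",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)  # sparse matrix: messages x terms

print(tfidf_matrix.shape)
print(sorted(vectorizer.vocabulary_))
```

Each row is one message and each column one term; terms that appear in many messages receive lower weights than terms concentrated in a few.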


⚖️ Handling Class Imbalance

Applied:

SMOTE (Synthetic Minority Oversampling Technique)

to balance Spam and Ham classes before model training.
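
To show the idea behind SMOTE, the sketch below creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. It is a simplified, pure-Python illustration and not the imbalanced-learn implementation; the function name `smote_like_oversample` and the toy 2-D points are assumptions. In practice oversampling is usually applied to the training split only, so that the test set contains no synthetic samples.

```python
import random


def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Hypothetical minority-class feature vectors (e.g. spam messages).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
synthetic = smote_like_oversample(minority, n_new=6)
print(synthetic)
```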


🤖 Model Building

Algorithm Used:

Logistic Regression

Steps:

  1. Train-Test Split
  2. Model Training
  3. Prediction
  4. Hyperparameter Tuning using GridSearchCV
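
The four steps above could be wired together roughly as follows. This is a sketch under stated assumptions: the toy messages, the `Pipeline` layout, the `C` grid, `cv=2`, and the `f1_macro` scoring are illustrative choices, not the project's recorded settings, and the SMOTE step is left out for brevity.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical toy data; the real project trains on spam.csv.
messages = [
    "win a free prize now", "claim your free cash reward",
    "urgent call to win money", "free entry text now to win",
    "you have won a cash prize", "call now for a free gift",
    "are we meeting for lunch", "see you at home tonight",
    "can you call me later", "thanks for the help today",
    "i will be late for dinner", "let us talk tomorrow morning",
]
labels = ["spam"] * 6 + ["ham"] * 6

# 1. Train-test split (stratified so both classes appear in each split).
X_train, X_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.25, stratify=labels, random_state=42
)

# 2. Model training inside a pipeline, 4. tuned with GridSearchCV.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
param_grid = {"clf__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="f1_macro")
search.fit(X_train, y_train)

# 3. Prediction with the best estimator found by the search.
predictions = search.predict(X_test)
print(search.best_params_, list(predictions))
```

Putting the vectorizer inside the pipeline means it is refit on each cross-validation fold, which avoids leaking vocabulary statistics from the validation fold into training.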

📉 Model Evaluation

Evaluation Metrics:

  • Accuracy
  • Precision
  • Recall
  • F1 Score
  • Confusion Matrix

These metrics were used to assess the effectiveness of spam detection.
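
For reference, these metrics can be computed with `sklearn.metrics` as sketched below. The label lists are made up purely to show the calls and are not results from this project; treating "spam" as the positive class is an assumption.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical true and predicted labels, for illustration only.
y_true = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham"]
y_pred = ["spam", "spam", "ham", "ham", "ham", "ham", "ham", "spam"]

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred, pos_label="spam")
recall = recall_score(y_true, y_pred, pos_label="spam")
f1 = f1_score(y_true, y_pred, pos_label="spam")
# Rows are true labels, columns are predicted labels, in the order given.
matrix = confusion_matrix(y_true, y_pred, labels=["ham", "spam"])

print(accuracy, precision, recall, f1)
print(matrix)
```

With imbalanced classes, precision and recall on the spam class are generally more informative than accuracy alone.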


🛠️ Technologies Used

  • Python
  • Pandas
  • NumPy
  • NLTK
  • Scikit-Learn
  • TF-IDF Vectorizer
  • Logistic Regression
  • SMOTE
  • Joblib
  • Streamlit

📁 Project Structure

Machine-Learning-Based-Spam-Ham-Detection/
│
├── app.py
├── logistic_regression_sms_spam_model.pkl
├── tfidf_vectorizer.pkl
├── label_encoder.pkl
├── requirements.txt
├── spam.csv
└── README.md
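
The three `.pkl` files listed above are presumably written with Joblib after training and loaded again by `app.py` at prediction time. The sketch below shows that save/load round trip under assumptions: the toy training data, the helper name `classify`, and the temporary directory (used so the example runs without the real artifacts) are all illustrative, and the Streamlit UI code is omitted.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy training data standing in for spam.csv.
texts = ["win a free prize now", "free cash reward call now",
         "see you at lunch", "call me when you are home"]
labels = ["spam", "spam", "ham", "ham"]

encoder = LabelEncoder()
y = encoder.fit_transform(labels)
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
model = LogisticRegression(max_iter=1000).fit(X, y)


def classify(message, model, vectorizer, encoder):
    """Vectorize one message and return its decoded label."""
    features = vectorizer.transform([message])
    encoded = model.predict(features)
    return str(encoder.inverse_transform(encoded)[0])


with tempfile.TemporaryDirectory() as tmp_dir:
    tmp = Path(tmp_dir)
    # Save the three artifacts, mirroring the file names in the project tree.
    joblib.dump(model, tmp / "logistic_regression_sms_spam_model.pkl")
    joblib.dump(vectorizer, tmp / "tfidf_vectorizer.pkl")
    joblib.dump(encoder, tmp / "label_encoder.pkl")
    # Load them back, as a deployed app would at start-up.
    loaded_model = joblib.load(tmp / "logistic_regression_sms_spam_model.pkl")
    loaded_vectorizer = joblib.load(tmp / "tfidf_vectorizer.pkl")
    loaded_encoder = joblib.load(tmp / "label_encoder.pkl")

message = "free prize call now"
label_before = classify(message, model, vectorizer, encoder)
label_after = classify(message, loaded_model, loaded_vectorizer, loaded_encoder)
print(label_before, label_after)
```

The same fitted vectorizer must be used for training and prediction; fitting a new one at prediction time would produce features the model was never trained on.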

💡 Key Learnings

  • Natural Language Processing workflow
  • Text preprocessing techniques
  • TF-IDF feature extraction
  • Handling imbalanced datasets using SMOTE
  • Logistic Regression for text classification
  • Hyperparameter tuning with GridSearchCV
  • Streamlit deployment
  • End-to-end machine learning project lifecycle

👨‍💻 Author

Daniel J

LinkedIn: https://www.linkedin.com/in/daniel-j77

GitHub: https://github.com/daniel-j77


⭐ Future Improvements

  • Deep Learning based spam detection
  • LSTM / RNN implementation
  • Transformer-based models (BERT)
  • Multilingual spam detection
  • Real-time SMS filtering API
