Basics of Natural Language Processing with scikit-learn and nltk

Natural language processing (NLP) is a field concerned with the interaction between computers and natural languages such as English. A machine learns features of the language and uses that knowledge to perform tasks such as spam classification, fake news detection, and sentiment analysis.

For this demonstration, I used the SMS Spam Collection dataset from the UCI Machine Learning Repository to build a simple SMS spam classifier and tested several machine learning models on it.

Approach for the problem:

Flowchart for data preprocessing

Step 1: Download the dataset from Kaggle and read it into Python
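The loading step can be sketched roughly as follows. This is a minimal sketch, not the notebook's code: it assumes a tab-separated file with one `label<TAB>message` row per line, and the function name `load_sms` and the column names are made up for illustration. The copy you download may be a CSV with different column names or encoding, so adjust the arguments accordingly. A small in-memory sample stands in for the real file.

```python
import csv
import io

import pandas as pd


def load_sms(source):
    """Read `label<TAB>message` rows into a DataFrame (hypothetical helper)."""
    return pd.read_csv(
        source,
        sep="\t",
        header=None,                    # assumed: the file has no header row
        names=["label", "message"],     # assumed column names
        quoting=csv.QUOTE_NONE,         # keep quote characters inside messages
    )


# Toy stand-in for the downloaded file; pass a file path instead in practice.
sample = io.StringIO(
    "ham\tAre we still meeting for lunch today?\n"
    "spam\tWINNER!! Claim your free prize now, call 0800 000 000\n"
    "ham\tOk, see you at 5\n"
)
df = load_sms(sample)
print(df.head())
```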

Exploratory Data Analysis

Number of words in spam and ham messages

From this graph, we can see that spam messages usually contain more words than ham messages.
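A comparison like this can be computed with a word count per message. The snippet below is only a sketch on four invented messages, so the numbers it prints say nothing about the real dataset; the column names `label`, `message`, and `n_words` are assumptions.

```python
import pandas as pd

# Toy data, invented for illustration only.
df = pd.DataFrame({
    "label": ["ham", "spam", "ham", "spam"],
    "message": [
        "See you at 5",
        "Congratulations you have won a free prize call now to claim",
        "Ok thanks",
        "Urgent your account needs verification reply now to avoid suspension",
    ],
})

# Split each message on whitespace and count the resulting tokens.
df["n_words"] = df["message"].str.split().str.len()

# Average message length per class.
mean_words = df.groupby("label")["n_words"].mean()
print(mean_words)
```

On the real data, the same `n_words` column could be passed to a histogram to reproduce a plot of this kind.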

Proportions of spam and ham messages

From this graph, we can see that most of the messages are ham.
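The class balance can be checked in one line with pandas. This sketch uses five invented labels, so the resulting proportions are illustrative and not those of the actual dataset.

```python
import pandas as pd

# Toy labels, invented for illustration only.
labels = pd.Series(["ham", "ham", "ham", "spam", "ham"], name="label")

# normalize=True returns fractions instead of raw counts.
proportions = labels.value_counts(normalize=True)
print(proportions)
```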

Word clouds for both spam and ham messages

From these word clouds, we can see the most common words in both spam and ham messages.
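A word cloud sizes each word by how often it occurs, so the underlying quantity is a plain frequency count. The sketch below computes that count with the standard library only; the helper name `top_words`, the tokenizing regex, and the two sample messages are assumptions for illustration.

```python
import re
from collections import Counter


def top_words(messages, n=3):
    """Return the n most frequent lowercase words across all messages."""
    counts = Counter()
    for text in messages:
        # Keep runs of letters and apostrophes; drop digits and punctuation.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts.most_common(n)


# Toy spam messages, invented for illustration only.
spam = ["Free prize call now", "Call now to claim your free prize"]
print(top_words(spam, n=4))
```

Running the same function separately on the spam and ham subsets gives the two word lists that the clouds visualize.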

Step 2: Preprocessing

For preprocessing, I built a reusable class that cleans the text. It is easy to use and follows a scikit-learn style of syntax.

Python code for preprocessing:
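The original class is not reproduced here, so what follows is only a sketch of what such a class might look like, not the author's implementation. The class name `TextPreprocessor`, the `max_features` parameter, and the cleaning steps (lowercasing, keeping letters only, stop-word removal, stemming, TF-IDF) are all assumptions. To avoid downloading NLTK corpora, it swaps in scikit-learn's built-in English stop-word list; NLTK is used only for `PorterStemmer`, which needs no extra data.

```python
import re

from nltk.stem import PorterStemmer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer


class TextPreprocessor(BaseEstimator, TransformerMixin):
    """Hypothetical text cleaner and vectorizer with a scikit-learn style API."""

    def __init__(self, max_features=3000):
        self.max_features = max_features  # assumed cap on vocabulary size

    def preprocess(self, texts):
        """Lowercase, keep letters only, drop stop words, and stem each token."""
        stemmer = PorterStemmer()
        cleaned = []
        for text in texts:
            tokens = re.findall(r"[a-z]+", text.lower())
            tokens = [stemmer.stem(t) for t in tokens if t not in ENGLISH_STOP_WORDS]
            cleaned.append(" ".join(tokens))
        return cleaned

    def fit(self, texts, y=None):
        # Learn the TF-IDF vocabulary from the cleaned training texts.
        self.vectorizer_ = TfidfVectorizer(max_features=self.max_features)
        self.vectorizer_.fit(self.preprocess(texts))
        return self

    def transform(self, texts):
        # Reuse the fitted vocabulary; unseen words are ignored.
        return self.vectorizer_.transform(self.preprocess(texts))


# Toy messages, invented for illustration only.
texts = ["Free prize! Call now", "Are you coming home for dinner?"]
prep = TextPreprocessor()
X = prep.fit_transform(texts)  # fit_transform comes from TransformerMixin
print(prep.preprocess(texts))
print(X.shape)
```

Inheriting from `BaseEstimator` and `TransformerMixin` is one way to get the familiar `fit`, `transform`, and `fit_transform` methods, and it lets the class be placed inside a scikit-learn `Pipeline`.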

Model Building

For model building, I tried several models, among them Logistic Regression and XGBoost.

Python code for training the models
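A training loop over several candidate models might look like the sketch below. It is not the notebook's code: the model choice (Logistic Regression and Multinomial Naive Bayes), the hyperparameters, and the eight toy messages are assumptions for illustration. XGBoost is left out to keep the example dependent on scikit-learn only; if the `xgboost` package is installed, its `XGBClassifier` follows the same estimator interface and could be added to the dictionary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data, invented for illustration only (1 = spam, 0 = ham).
messages = [
    "Win a free prize now",
    "Claim your free cash reward today",
    "Urgent call this number to win",
    "Free entry in a weekly prize draw",
    "Are you coming home for dinner",
    "See you at the meeting tomorrow",
    "Thanks for the lift this morning",
    "Can you send me the notes",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Assumed candidate models; swap in or add others as needed.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "naive_bayes": MultinomialNB(),
}

fitted = {}
for name, model in candidates.items():
    # Each pipeline vectorizes the raw text and then fits the classifier.
    pipeline = make_pipeline(TfidfVectorizer(), model)
    pipeline.fit(messages, labels)
    fitted[name] = pipeline

print(sorted(fitted))
```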

Evaluating the performance of the models:

Since the dataset is imbalanced, I used the F1 score as the evaluation metric. From the graph, we can see that Logistic Regression and XGBoost perform similarly, so we will use the XGBoost model for prediction.

Model Performance
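To see why accuracy can mislead on imbalanced data while the F1 score does not, consider the sketch below. The labels and predictions are invented (8 ham, 2 spam), so the numbers illustrate the metric only and are not results from the models above.

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy ground truth: 8 ham (0) and 2 spam (1), invented for illustration.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# A "model" that always predicts ham.
y_majority = [0] * 10

# A model that catches both spam messages at the cost of one false alarm.
y_better = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]

# F1 = 2 * precision * recall / (precision + recall), computed for the spam class.
for name, y_pred in [("always ham", y_majority), ("better", y_better)]:
    acc = accuracy_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred, pos_label=1, zero_division=0)
    print(f"{name}: accuracy={acc:.2f}, f1={f1:.2f}")
```

The always-ham predictor still reaches 80% accuracy because ham dominates, yet its F1 score for the spam class is zero, which is the behaviour that makes F1 the more informative metric here.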

Predicting a new message

To predict a new message, we need to preprocess and vectorize it and then pass it through the chosen model. Because our class follows a scikit-learn style of syntax, we can simply call its preprocess and transform functions and feed the result to the model.
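The key point is that a new message must go through the vectorizer that was already fitted on the training data, using `transform` rather than `fit_transform`. The sketch below shows the idea with a plain `TfidfVectorizer` and Logistic Regression standing in for the article's class and the XGBoost model; the helper `predict_message` and the toy training data are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training data, invented for illustration only.
train_messages = [
    "Win a free prize now",
    "Claim your free cash reward today",
    "Urgent call this number to win",
    "Are you coming home for dinner",
    "See you at the meeting tomorrow",
    "Thanks for the lift this morning",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# Fit the vectorizer and the stand-in model once, on the training data.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_messages)
model = LogisticRegression(max_iter=1000).fit(X_train, train_labels)


def predict_message(text):
    """Vectorize one message with the fitted vocabulary and classify it."""
    features = vectorizer.transform([text])  # transform only, never refit
    return model.predict(features)[0]


print(predict_message("Claim your free prize now"))
```

Refitting the vectorizer on the new message would produce a different vocabulary and feature layout, which the trained model could not interpret.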

The Jupyter notebook is available at the following places:

Kaggle

A data science enthusiast currently pursuing a bachelor's degree in data science

Aayushmaan Jain