Basics of Natural Language Processing with scikit-learn and nltk

Aayushmaan Jain
3 min read · Jan 28, 2022

Natural language processing (NLP) is a relatively new field concerned with the interaction between computers and natural languages such as English. A machine can learn the features of a language and use that knowledge to perform tasks such as spam classification, fake news detection, and sentiment analysis.

For this demonstration, I have used the SMS Spam Collection dataset from the UCI Machine Learning Repository to build a simple SMS spam classifier and tested various machine learning models on it.

Approach for the problem:

Flowchart for data preprocessing

Step 1 — Download the dataset from Kaggle and read it into Python
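The notebook itself is not reproduced here; a minimal loading sketch, assuming the common Kaggle export layout (label and text in columns named `v1` and `v2`, latin-1 encoded) and using an inline excerpt in place of the downloaded file:

```python
import io

import pandas as pd

# Inline excerpt mimicking the Kaggle "spam.csv" layout; with the real
# file you would call pd.read_csv("spam.csv", encoding="latin-1") instead.
raw = io.StringIO(
    "v1,v2\n"
    "ham,Ok lar... Joking wif u oni...\n"
    "spam,WINNER!! You have won a prize\n"
)

# Keep only the label and text columns and give them readable names.
df = pd.read_csv(raw)[["v1", "v2"]]
df.columns = ["label", "text"]
print(df.head())
```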

Exploratory Data Analysis

Number of words in spam and ham messages

From this graph, we can see that spam messages usually contain more words than ham messages.

Proportions of spam and ham messages

From this graph, we can see that the vast majority of messages are ham.

Word clouds for spam and ham messages

From these word clouds, we can see the most common words in spam and ham messages.
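The counts behind the first two plots can be reproduced with pandas; a minimal sketch on a toy frame standing in for the loaded dataset:

```python
import pandas as pd

# Toy frame standing in for the loaded SMS dataset.
df = pd.DataFrame({
    "label": ["ham", "ham", "spam"],
    "text": [
        "Ok lar wif u",
        "See you at home",
        "WINNER call now to claim your free prize today",
    ],
})

# Words per message, averaged per class (spam tends to be longer).
df["n_words"] = df["text"].str.split().str.len()
avg_words = df.groupby("label")["n_words"].mean()
print(avg_words)

# Class proportions: the dataset is heavily skewed toward ham.
props = df["label"].value_counts(normalize=True)
print(props)
```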

Step 2 — Preprocessing

To preprocess the data, I have built a reusable class that is easy to use and follows a scikit-learn style syntax.

  1. For normalization, I converted the text to lower case and removed all links and mentions (relevant for Twitter data) using regular expressions (regex).
  2. For tokenization, I used the built-in split function.
  3. For stop-word removal, I removed all the English stopwords available in the nltk corpus, along with the punctuation symbols available in the string library.
  4. For stemming/lemmatization, I chose the WordNet lemmatizer because it converts words to their base form without relying on suffix stripping, which yields a higher accuracy and F1 score on this dataset. You may read more on stemming and lemmatization here.
  5. For vectorization, I used tf-idf, which better preserves the semantic signal by also considering how many documents contain a given word: if a word is common in almost every document, it is assigned a lower score, which leads to higher accuracy. You may read more on tf-idf vectorization here.

Python code for preprocessing:

Python code for preprocessing the data

Model Building

For model building, I have tried the following models:

  1. Naive Bayes Classifier
  2. Decision Tree Classifier
  3. Random Forest Classifier
  4. XGBoost Classifier
Python code for training the models
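The embedded gist is not reproduced here; a self-contained sketch of the training loop, with a toy corpus standing in for the real tf-idf features (XGBoost is left as a comment since it requires the external xgboost package):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

# Toy corpus standing in for the preprocessed SMS data (1 = spam, 0 = ham).
texts = [
    "win a free prize now", "urgent claim your free cash prize",
    "free entry win cash now", "winner claim prize urgent",
    "see you at home tonight", "are we meeting for lunch",
    "call me when you get home", "ok see you at lunch then",
] * 5
labels = ([1] * 4 + [0] * 4) * 5

X = TfidfVectorizer().fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=42, stratify=labels
)

models = {
    "naive_bayes": MultinomialNB(),
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "random_forest": RandomForestClassifier(random_state=42),
    # "xgboost": XGBClassifier(),  # from the xgboost package, if installed
}

# Fit each model and score it with F1 on the held-out split.
scores = {name: f1_score(y_test, m.fit(X_train, y_train).predict(X_test))
          for name, m in models.items()}
print(scores)
```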

Evaluating the performance of the models:

Since the dataset is imbalanced, I have used the F1 score as the evaluation metric. From the graph, we can see that Logistic Regression and XG Boost yield similar performance, so for prediction we will use the XG Boost model.

Model Performance
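A quick illustration of why F1 is the right metric here: on an imbalanced 90/10 split, a model that always predicts the majority class still scores 90% accuracy, while its F1 score exposes the failure.

```python
from sklearn.metrics import accuracy_score, f1_score

# Imbalanced toy labels: 90% ham (0), 10% spam (1).
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100  # a useless model that always predicts "ham"

acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, zero_division=0)
print(acc, f1)  # high accuracy, zero F1
```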

Predicting new message

To predict a new message, we need to preprocess and vectorize it, then pass it through the chosen model. Since our class follows a scikit-learn style syntax, we can simply call its preprocess and transform functions and then predict the output with the model.
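A minimal sketch of that end-to-end flow using a scikit-learn Pipeline (toy training data standing in for the real dataset; the article's own class plus XG Boost would slot in the same way):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data standing in for the full preprocessed SMS dataset.
texts = [
    "win a free prize now", "urgent free cash claim your prize",
    "see you at home tonight", "are we meeting for lunch",
]
labels = ["spam", "spam", "ham", "ham"]

# The pipeline bundles vectorization and the classifier, so a raw
# message string can be passed straight to predict().
clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(texts, labels)

print(clf.predict(["free prize waiting claim now"])[0])
```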

The Jupyter notebook is uploaded at the following places:




Aayushmaan Jain

A data science enthusiast currently pursuing a bachelor's degree in data science