Basics of Natural Language Processing with scikit-learn and nltk

Flowchart for data preprocessing
Number of words in spam and ham emails
Proportions of spam and ham emails
Wordclouds for both spam and ham emails
  1. For normalization, I converted the text to lowercase and removed all links and mentions (relevant for Twitter data) using regular expressions (regex).
  2. For tokenization, I used Python's built-in split function, which splits the text on whitespace.
  3. For stop word removal, I removed all the English stop words available in the nltk corpus, along with the punctuation symbols listed in Python's string library.
  4. For stemming/lemmatization, I chose the WordNet lemmatizer because it maps words to their dictionary base form instead of relying on suffix stripping, which yielded a higher accuracy and F1 score on this dataset. You may read more on stemming and lemmatization here
  5. For vectorization, I used tf-idf because it weights each word not only by how often it appears in a document but also by how many documents contain it. A word that is common in almost every document receives a lower score, which led to a higher accuracy on this dataset. You may read more on tf-idf vectorization here.
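To make the idea in step 5 concrete, here is a minimal hand-computed sketch. It assumes the smoothed formula that scikit-learn's TfidfVectorizer documents as its default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and it skips the L2 normalization that the vectorizer applies afterwards. The three toy documents and the helper names idf and tfidf are made up for illustration.

```python
import math
from collections import Counter

# Hypothetical, already-tokenized toy documents (not from the article's dataset).
docs = [
    ["free", "prize", "call", "now"],
    ["call", "me", "later"],
    ["free", "call", "offer"],
]

def idf(term, docs):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    n = len(docs)
    df = sum(1 for doc in docs if term in doc)
    return math.log((1 + n) / (1 + df)) + 1

def tfidf(doc, docs):
    """Raw term count times smoothed idf, without length normalization."""
    counts = Counter(doc)
    return {term: count * idf(term, docs) for term, count in counts.items()}

# Weights for the first document, printed from lowest to highest.
weights = tfidf(docs[0], docs)
for term, weight in sorted(weights.items(), key=lambda kv: kv[1]):
    print(f"{term}: {weight:.3f}")
```

In this toy corpus "call" occurs in every document, so it gets the lowest weight, while "prize" and "now" occur in only one document and get the highest.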
Python code for preprocessing the data
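The sketch below is one way steps 1 to 4 could look; it is not the author's code. To stay runnable without downloading NLTK data, it takes the stop word set and the lemmatizer as parameters. The names preprocess and DEMO_STOP_WORDS, the two regular expressions, and the choice to strip punctuation from token edges are assumptions made for this example.

```python
import re
import string

# Small hypothetical stop word list so the sketch runs without downloads.
# With NLTK data installed, the article's setup would correspond to
# set(nltk.corpus.stopwords.words("english")).
DEMO_STOP_WORDS = {"a", "an", "the", "is", "to", "you", "your", "at", "and"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")

def preprocess(text, stop_words=DEMO_STOP_WORDS, lemmatize=lambda w: w):
    """Normalize, tokenize, drop stop words and punctuation, then lemmatize.

    `lemmatize` defaults to the identity function; passing
    nltk.stem.WordNetLemmatizer().lemmatize would apply WordNet lemmatization.
    """
    # 1. Normalization: lowercase, then strip links and @mentions.
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    # 2. Tokenization with the built-in whitespace split.
    tokens = text.split()
    # 3. Strip punctuation from token edges, then drop stop words and empties.
    tokens = [tok.strip(string.punctuation) for tok in tokens]
    tokens = [tok for tok in tokens if tok and tok not in stop_words]
    # 4. Lemmatization (identity unless a lemmatizer is supplied).
    return [lemmatize(tok) for tok in tokens]

print(preprocess("Hey @bob, claim your FREE prize at https://example.com now!"))
# → ['hey', 'claim', 'free', 'prize', 'now']
```

Passing the stop words and lemmatizer in as arguments also makes it easy to compare a stemmer against the WordNet lemmatizer without touching the rest of the function.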
  1. Naive Bayes Classifier
  2. Decision Tree Classifier
  3. Random Forest Classifier
  4. XGBoost Classifier
Python code for training the models
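As a rough illustration of how the first three classifiers could be trained on tf-idf features, here is a minimal scikit-learn sketch. It is not the author's code: the six toy messages, their labels, and the hyperparameters are invented for the example, and XGBoost is left out to avoid the extra xgboost dependency.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data; the article's real dataset is not reproduced here.
texts = [
    "win a free prize now",
    "free cash offer call now",
    "claim your free reward today",
    "are we still meeting for lunch",
    "please send me the report",
    "see you at the office tomorrow",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

models = {
    "Naive Bayes": MultinomialNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

predictions = {}
for name, model in models.items():
    # Each pipeline vectorizes the raw text with tf-idf, then fits the classifier.
    pipeline = make_pipeline(TfidfVectorizer(), model)
    pipeline.fit(texts, labels)
    predictions[name] = pipeline.predict(["free prize call now"])[0]
    print(name, "->", predictions[name])
```

Wrapping the vectorizer and the classifier in one pipeline keeps the tf-idf vocabulary fitted on the training texts only, which matters once the data is split into training and test sets.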
Model Performance
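For reference, the two metrics mentioned earlier, accuracy and F1 score, can be computed by hand as in the sketch below. The labels are hypothetical and are not the article's results; scikit-learn's accuracy_score and f1_score in sklearn.metrics compute the same quantities.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def f1(y_true, y_pred, positive="spam"):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical labels for eight messages (illustration only).
y_true = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham"]
y_pred = ["spam", "spam", "ham", "ham", "ham", "ham", "ham", "spam"]

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")  # 6 of 8 correct
print(f"f1       = {f1(y_true, y_pred):.3f}")
```

Here two spam messages are caught, one is missed, and one ham message is flagged by mistake, so precision and recall are both 2/3 and the F1 score is 2/3 as well, while accuracy is 0.75. F1 is the more informative number when spam is much rarer than ham.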

A data science enthusiast currently pursuing a bachelor's degree in data science

Aayushmaan Jain
