Basics of Natural Language Processing with scikit-learn and nltk

Flowchart for data preprocessing
Number of words in spam and ham emails
Proportions of spam and ham emails
Wordclouds for both spam and ham emails
  1. For normalization, I converted the text to lower case and removed all links and mentions (in the case of Twitter data) using regular expressions (regex)
  2. For tokenization, I used the built-in split function
  3. For stop-word removal, I removed all the English stopwords available in the nltk corpus, along with the punctuation symbols available in the string library
  4. For stemming/lemmatization, I chose the WordNet lemmatizer because it converts words to their base form rather than relying on suffix stripping, which yields a higher accuracy and F1 score on the dataset. You may read more on stemming and lemmatization here
  5. For vectorization, I used the TF-IDF technique because it also considers how many documents contain each word: a word that appears in almost every document gets a lower score, which helps preserve semantic meaning and leads to higher accuracy. You may read more on TF-IDF vectorization here. A sketch of the full preprocessing pipeline is shown below.
Python code for preprocessing the data
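The original post embeds this code as an image, so here is a minimal sketch of an equivalent pipeline. The DataFrame `df`, the column names `text` and `label`, and the file name `spam.csv` are assumptions for illustration, not the author's exact code.

```python
import re
import string

import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Download the required nltk resources (only needed once)
nltk.download("stopwords")
nltk.download("wordnet")

STOP_WORDS = set(stopwords.words("english")) | set(string.punctuation)
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    # 1. Normalization: lower case, strip links and mentions with regex
    text = text.lower()
    text = re.sub(r"http\S+|www\.\S+", " ", text)  # remove links
    text = re.sub(r"@\w+", " ", text)              # remove mentions (Twitter data)
    # 2. Tokenization: built-in split on whitespace
    tokens = text.split()
    # 3. Remove nltk English stopwords and punctuation from the string library
    tokens = [tok for tok in tokens if tok not in STOP_WORDS]
    # 4. Lemmatize each token to its base form with the WordNet lemmatizer
    tokens = [lemmatizer.lemmatize(tok) for tok in tokens]
    return " ".join(tokens)

# Assumed input: a CSV with 'text' and 'label' columns (hypothetical file name)
df = pd.read_csv("spam.csv")
df["clean_text"] = df["text"].apply(preprocess)

# 5. TF-IDF vectorization of the cleaned text
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df["clean_text"])
y = df["label"]
```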
  1. Naive Bayes Classifier
  2. Decision Tree Classifier
  3. Random Forest Classifier
  4. XGBoost Classifier
Python code for training the models
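The training code is likewise embedded as an image in the original post; the sketch below shows one way to fit the four classifiers, assuming `X` and `y` from the preprocessing sketch above. XGBoost lives in the separate xgboost package, and the labels are encoded to 0/1 because XGBClassifier expects numeric targets.

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# Encode 'ham'/'spam' labels to 0/1 and hold out a test set
y_enc = LabelEncoder().fit_transform(y)
X_train, X_test, y_train, y_test = train_test_split(
    X, y_enc, test_size=0.2, random_state=42, stratify=y_enc
)

models = {
    "Naive Bayes": MultinomialNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "XGBoost": XGBClassifier(eval_metric="logloss"),
}

# Fit each model on the TF-IDF features
for name, model in models.items():
    model.fit(X_train, y_train)
```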
Model Performance
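One simple way to compare the models is to score each fitted classifier on the held-out test set with scikit-learn's metrics; the `models` dictionary and the train/test split are carried over from the sketch above.

```python
from sklearn.metrics import accuracy_score, f1_score

for name, model in models.items():
    y_pred = model.predict(X_test)
    print(f"{name}: accuracy={accuracy_score(y_test, y_pred):.3f}, "
          f"f1={f1_score(y_test, y_pred):.3f}")
```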
