End to End Machine Learning Project

The best way to sharpen a skill is to practice it on a real-world problem. To do that, I set out to build a web application that estimates rent prices in a given locality of a given city from a few user inputs, using machine learning models trained separately for each city.

By and large, I noticed that little work has been done on applying machine learning to real estate in the Indian context. Existing websites such as magicbricks.com and makaan.com are too granular: they ask for many inputs that someone planning to move to a new city may not know.

The main motivation behind the project was to create a web app that uses machine learning to give a good estimate of rent prices from the inputs provided. The focus was on a simple user interface combined with accurate results.

For this project, I used a dataset from Kaggle. It contains rental housing prices for 8 cities in India:

  1. Mumbai
  2. Delhi
  3. Kolkata
  4. Bangalore
  5. Hyderabad
  6. Chennai
  7. Ahmedabad
  8. Pune

The data was web scraped from makaan.com; I obtained it from a dataset uploaded on Kaggle titled house rent prices of metropolitan cities in India.

For this project, I used the $300 free credit from Google Cloud and created 2 resources in a project:

  1. Compute engine for maintaining a server
  2. SQL Database for maintaining a dynamic database on the cloud

You may want to take care of the following points while creating the resources:

  1. Allow HTTP and HTTPS traffic on the Compute Engine instance, along with any other ports you need (once the app is deployed, users will reach the website over HTTPS). On the server, I set the firewall rules with ufw:
# Allowing SSH rule on our server
ufw allow OpenSSH
# Allowing http traffic on our server
ufw allow http
# Allowing https traffic on our server
ufw allow https
# Enabling the firewall on our server
ufw enable

  2. Authorize the public IP addresses of your personal machine and of the server you created on Google Cloud in the SQL database, so that you can connect from either one.

You may also want to create a separate security group and authorize only specific devices, to protect your database from unwanted connections.

For this project, I used the _All_Cities_Cleaned.csv file from the Kaggle dataset.

Although this file was already cleaned, it still required further preprocessing.

After cleaning and preprocessing the file, I created 2 SQL files containing insert queries, so that the data can be read dynamically from the database and the models can be updated accordingly.

Initially, I ran the SQL files in MySQL Workbench to load the data. After that, inputs from users were added to the table with insert queries, so that the model is retrained on the updated data.
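As a rough illustration of inserting a user's input into the table, here is a minimal sketch. It is not the project's actual code: the table name `mumbai_rent`, the column names, and the sample listing are all hypothetical, and Python's built-in sqlite3 module stands in for the cloud SQL database so that the snippet runs anywhere.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the cloud SQL
# database. The table and column names below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mumbai_rent ("
    "locality TEXT, bedrooms INTEGER, bathrooms INTEGER, area REAL, price REAL)"
)


def insert_listing(conn, listing):
    """Insert one user-submitted listing with a parameterized query."""
    conn.execute(
        "INSERT INTO mumbai_rent (locality, bedrooms, bathrooms, area, price) "
        "VALUES (?, ?, ?, ?, ?)",
        (
            listing["locality"],
            listing["bedrooms"],
            listing["bathrooms"],
            listing["area"],
            listing["price"],
        ),
    )
    conn.commit()


# Made-up values, for illustration only.
insert_listing(
    conn,
    {"locality": "Andheri West", "bedrooms": 2, "bathrooms": 2,
     "area": 850.0, "price": 55000.0},
)
row_count = conn.execute("SELECT COUNT(*) FROM mumbai_rent").fetchone()[0]
stored = conn.execute("SELECT locality, price FROM mumbai_rent").fetchone()
```

Passing the values as query parameters, rather than building the SQL string by hand, keeps user input from being interpreted as SQL.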

Data Preprocessing Pipeline

Python code for preprocessing

For the exploratory data analysis (EDA), I loaded the cleaned and preprocessed data from SQL.

I then added two columns to each DataFrame: a city column, which records the city the data came from, and an Affordability column (price divided by area), which measures how affordable the houses in each city are.

Next, I analyzed the number of houses for rent in each city and found that most were in Mumbai, Delhi and Pune. This may be because Mumbai is the financial capital of India, Delhi is the political capital, and Pune is known as the Oxford of the East for its educational institutes.
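The two derived columns can be sketched with pandas as follows. This is only an illustration: the column names `price`, `area`, `city` and `affordability`, the helper `add_city_and_affordability`, and the sample rows are assumptions, not the project's actual schema or data.

```python
import pandas as pd

# Made-up rows for illustration; the column names are assumptions.
frames = {
    "Mumbai": pd.DataFrame({"price": [50000.0, 30000.0], "area": [1000.0, 500.0]}),
    "Pune": pd.DataFrame({"price": [20000.0], "area": [800.0]}),
}


def add_city_and_affordability(df, city):
    """Return a copy of df with a city label and price per unit area."""
    out = df.copy()
    out["city"] = city
    # A lower price per square foot means a more affordable house.
    out["affordability"] = out["price"] / out["area"]
    return out


# Stack the per-city frames into one DataFrame for cross-city comparison.
combined = pd.concat(
    [add_city_and_affordability(df, city) for city, df in frames.items()],
    ignore_index=True,
)
```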

Number of houses rented in each city

I then plotted the average rent in each city to see which cities had the most expensive houses, and found that Delhi and Mumbai did.

Average price of rent houses in each city

Next, I plotted the average area of houses in each city to check whether houses are priced in line with their area. The houses in Delhi, Ahmedabad, and Hyderabad turned out to be the most spacious.

Average area of houses in each city

After plotting the prices and areas, I plotted the affordability of houses in each city to find the most affordable cities in the dataset. The lower the price per square foot, the more affordable the houses in that city are. By this measure, Ahmedabad, Kolkata and Hyderabad are the most affordable cities in the dataset.

Affordability of houses in each city

To analyze the data at a deeper level, I plotted the categorical columns ['SELLER TYPE', 'LAYOUT TYPE', 'PROPERTY TYPE', 'FURNISH TYPE'] as pie charts for each city, arranged in a 2x2 grid with text annotations on the side, to show the proportion of each category in each column.

Pie charts for Mumbai city

I then plotted the numerical columns for each city as a 2x2 grid: the top row shows the distributions of price and area, and the bottom row shows histograms of the number of bedrooms and the number of bathrooms.

Numerical analysis for Mumbai city

Next, I plotted the 10 most affordable and the 10 least affordable localities in each city side by side. The criterion was the mean of the Affordability column for that city's data, grouped by locality.

Affordability of houses in Mumbai city
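The locality ranking behind this plot can be sketched with a pandas groupby. The helper `rank_localities`, the column names `locality` and `affordability`, and the sample rows are assumptions for illustration, not the project's actual code or data.

```python
import pandas as pd


def rank_localities(df, n=10):
    """Return the n most and n least affordable localities by mean price per area."""
    per_locality = df.groupby("locality")["affordability"].mean().sort_values()
    most_affordable = per_locality.head(n)              # lowest price per area first
    least_affordable = per_locality.tail(n).iloc[::-1]  # highest price per area first
    return most_affordable, least_affordable


# Made-up rows for illustration only.
sample = pd.DataFrame({
    "locality": ["A", "A", "B", "C"],
    "affordability": [40.0, 60.0, 20.0, 90.0],
})
most, least = rank_localities(sample, n=2)
```

The same pattern applies to the spaciousness ranking by grouping on the area column instead.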

Similarly, I plotted the 10 most spacious and the 10 least spacious localities in each city side by side. The criterion was the mean of the area column for that city's data, grouped by locality.

Spaciousness of houses in Mumbai city

Python code for Exploratory Data Analysis

Now that we have preprocessed and analyzed the data, we can move on to the main element of the project: building the machine learning model that powers the web app's backend.

Since this is a regression problem, I evaluated the models using the R2 score and the Mean Absolute Error (MAE).
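For reference, both metrics can be written out in a few lines of plain Python. This is a sketch of the standard definitions, not the project's code (scikit-learn provides equivalent functions with the same names in `sklearn.metrics`), and the sample values are made up.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up values for illustration only.
actual = [10.0, 20.0, 30.0]
predicted = [12.0, 18.0, 33.0]
mae = mean_absolute_error(actual, predicted)
r2 = r2_score(actual, predicted)
```

A lower MAE and an R2 score closer to 1 both indicate a better fit; MAE has the advantage of being in the same units as the rent itself.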

I tried the following models for this project:

  1. Linear Regression
  2. Decision Tree Regression
  3. Random Forest Regression
  4. Adaboost Regression
  5. Gradient Boost Regression
  6. XGBoost Regression

Of these models, the XGBoost Regressor had the lowest Mean Absolute Error and the highest R2 score on both the train and test sets.

R2 score of all models on both train and test sets for each city
MAE score of all models on both train and test sets for each city

Since XGBoost was the best model, I tried hyperparameter tuning on the XGBoost Regressor.

The tuned model did not show much improvement over the original, so I kept the original XGBoost model.

Comparison of models before and after hyperparameter tuning

Python code for model building

Now that the models are built, the next step is a web app with several endpoints that show the analysis and information about each city to end users, and that puts a simple user interface in front of the machine learning models.

Python code for creating the web app using Flask

Training the model once is not enough: it needs to be retrained on new data every month. For that, I wrote a Python script that retrains the model and overwrites the graphs with updated ones.

Python code for retraining the model

To deploy the model, I created a server on Linode, deployed the app using nginx and gunicorn, and linked it to a domain bought from Namecheap.

To set up a domain, you first buy one from any domain provider and configure its nameservers for the server provider you are using. You then configure the DNS records to point the domain to your server. You need records for 2 hosts, www and the blank host, so that a user who enters either www.YOUR_DOMAIN_NAME.com or YOUR_DOMAIN_NAME.com is directed to the IP address of your server.

For the SSL certificates, I used Let’s Encrypt, a free certificate provider run by a non-profit. To retrain the model every month, I used the crontab utility available in Ubuntu.

Here are some commands that I used

Web app

Source Code

A data science enthusiast currently pursuing a bachelor's degree in data science