Pandas VS SQL for Data Analysis
Often there is a debate about which one is better for Data Analysis, here we will generate random data and analyze it with both pandas as well as SQL to see which one is better for us.
In this case we shall be comparing both of them for ourselves by an example. We shall be analyzing the data provided to us on Pandas first and then we shall be analyzing the same on SQL.
First, let us see the schema of the data provided to us

Here we see that this is the data for a restaurant which has 3 tables
People table — This table contains information about the people who visit the restaurant. This table has 4 columns
- ID — This column stores the person ID for the person visiting the restaurant. This column is the primary key in the table which uniquely identifies each person
- AGE — The age of the person who visits the restaurant
- GENDER — The gender of the person who visits the restaurant
- CITY — The city where the person lives in
Preferences table — This table contains the information about the food preference of the people. This table has 2 columns
- ID — The person ID of the person visiting the restaurant. This column is the primary key in the preferences table which uniquely identifies each customer but this column is also the foreign key in the preference table which references the ID column in the people table. This is done because there cannot be an entry in the preferences table unless there is an entry in the people table. This is because only the customers who have visited the restaurant can answer the survey about their preferred food types
- FOOD_TYPE — This column stores the preferred food type of each customer
Job table — This table contains information about the profession of each customer. This table also has 2 columns
- ID — The person ID of the person visiting the restaurant. This column is the primary key in the preferences table which uniquely identifies each customer but this column is also the foreign key in the preference table which references the ID column in the people table. This is done because there cannot be an entry in the job table unless there is an entry in the people table. This is because only the customers who have visited the restaurant can answer the survey about their job
- JOB — This column stores the profession of each customer
Reading the data into python so that we can compare both SQL and python
Importing the necessary libraries and frameworks
Establishing a connection to MySQL database and reading the data
Manipulating the data for better analysis
We can combine the data from these dataframes so that we can have all the information at one place
From the schema we can see that we can divide the age column into various age groups and we can divide the city column into various states which will aid us in our analysis and then we also create separate dataframes for men and women which also aids us in the analysis
Now we save the combined dataframe to a csv file and we create a table in the SQL database which has all the information from the csv file so that we get the newly created columns also in the new table
We create the new table in the database by using this code
Beginning the analysis
Firstly we analyze the number of men and women in the entire population and their age groups and their population according to their age groups in each state
Python
SQL
Then we perform state wise and city wise analysis
Python
SQL
Then we perform job wise analysis
Python
SQL
Final Comparison
SQL is more efficient in querying data but it has less functions whereas in pandas, there might be lag for large volumes of data but it has more functions which enable us to manipulate data in an effective way. Python also has various data visualization libraries like matplotlib, seaborn, plotly, Altair etc which helps us to visualize hence helping us to analyze data more effectively.
SQL works on a relational model which makes linking tables via keys much easier and regulates the entry of data in other tables in case of a foreign key constraint but Pandas makes data manipulation much easier.
Pandas plotting also makes it easier for us to analyze data with it’s plotting features which provide a quick plot for us to get better insights of the data. The transform function of Pandas also helps us to apply a variety on functions. Pandas join function also helps us to join the data on the index and merge function works like SQL which enables us to join on a particular column present in both the dataframes.
Pandas is better if you intend to manipulate the data or plot it as it does it all in one place whereas in SQL we have to use Tableau for data visualization
Github repository of the code: repository
My website : website
Written by:
Aayushmaan Anvesh Jain
Thank you for reading the blog