Explore a data science case study, showcasing data-driven strategies for exploring, cleaning and preparing data for statistical analysis and visualizations.

Project Foundations for Data Science: FoodHub Data Analysis¶

Context¶

The number of restaurants in New York is increasing day by day. Lots of students and busy professionals rely on those restaurants due to their hectic lifestyles. Online food delivery service is a great option for them. It provides them with good food from their favorite restaurants. A food aggregator company FoodHub offers access to multiple restaurants through a single smartphone app.

The app allows the restaurants to receive a direct online order from a customer. The app assigns a delivery person from the company to pick up the order after it is confirmed by the restaurant. The delivery person then uses the map to reach the restaurant and waits for the food package. Once the food package is handed over to the delivery person, he/she confirms the pick-up in the app and travels to the customer’s location to deliver the food. The delivery person confirms the drop-off in the app after delivering the food package to the customer. The customer can rate the order in the app. The food aggregator earns money by collecting a fixed margin of the delivery order from the restaurants.

Objective¶

The food aggregator company has stored the data of the different orders made by the registered customers in their online portal. They want to analyze the data to get a fair idea about the demand of different restaurants which will help them in enhancing their customer experience. Suppose you are hired as a Data Scientist in this company and the Data Science team has shared some of the key questions that need to be answered. Perform the data analysis to find answers to these questions that will help the company to improve the business.

Data Description¶

The data contains the different data related to a food order. The detailed data dictionary is given below.

Data Dictionary¶

order_id: Unique ID of the order
customer_id: ID of the customer who ordered the food
restaurant_name: Name of the restaurant
cuisine_type: Cuisine ordered by the customer
cost: Cost of the order
day_of_the_week: Indicates whether the order is placed on a weekday or weekend (The weekday is from Monday to Friday and the weekend is Saturday and Sunday)
rating: Rating given by the customer out of 5
food_preparation_time: Time (in minutes) taken by the restaurant to prepare the food. This is calculated by taking the difference between the timestamps of the restaurant’s order confirmation and the delivery person’s pick-up confirmation.
delivery_time: Time (in minutes) taken by the delivery person to deliver the food package. This is calculated by taking the difference between the timestamps of the delivery person’s pick-up confirmation and drop-off information

Let us start by importing the required libraries¶

In [9]:

# import libraries for data manipulation
import numpy as np
import pandas as pd
# import libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

Understanding the structure of the data¶

In [10]:

#Access the drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

In [31]:

# read the data
df = pd.read_csv('/content/drive/MyDrive/MIT-ADSP/0131-Foundations-Python-Statistics/Week 2/Project - Food Hub/foodhub_order.csv')
# returns the first 5 rows
df.head()

Out[31]:

	order_id	customer_id	restaurant_name	cuisine_type	cost_of_the_order	day_of_the_week	rating	food_preparation_time	delivery_time
0	1477147	337525	Hangawi	Korean	30.75	Weekend	Not given	25	20
1	1477685	358141	Blue Ribbon Sushi Izakaya	Japanese	12.08	Weekend	Not given	25	23
2	1477070	66393	Cafe Habana	Mexican	12.23	Weekday	5	23	28
3	1477334	106968	Blue Ribbon Fried Chicken	American	29.20	Weekend	3	25	15
4	1478249	76942	Dirty Bird to Go	American	11.59	Weekday	4	25	24

Observations:¶

The DataFrame has 9 columns as mentioned in the Data Dictionary. Data in each row corresponds to the order placed by a customer.

Question 1: How many rows and columns are present in the data? [0.5 mark]¶

In [32]:

# Write your code here
rows, columns = df.shape
print(f"There are {rows} rows and {columns} columns present in the data.")

There are 1898 rows and 9 columns present in the data.

Observations:¶

There are 1898 rows and 9 columns present in the data. This means that there are 1898 orders that we can analyze.

Question 2: What are the datatypes of the different columns in the dataset? (The info() function can be used) [0.5 mark]¶

In [33]:

# Use info() to print a concise summary of the DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1898 entries, 0 to 1897
Data columns (total 9 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   order_id               1898 non-null   int64  
 1   customer_id            1898 non-null   int64  
 2   restaurant_name        1898 non-null   object 
 3   cuisine_type           1898 non-null   object 
 4   cost_of_the_order      1898 non-null   float64
 5   day_of_the_week        1898 non-null   object 
 6   rating                 1898 non-null   object 
 7   food_preparation_time  1898 non-null   int64  
 8   delivery_time          1898 non-null   int64  
dtypes: float64(1), int64(4), object(4)
memory usage: 133.6+ KB

Observations:¶

order_id, customer_id, food_preparation_time and delivery_time are integers.
restaurant_name, cuisine_type, day_of_the_week and rating are strings.
cost_of_ther_order is a float (decimal)
rating should be a number. I will convert that at some point.

Question 3: Are there any missing values in the data? If yes, treat them using an appropriate method. [1 mark]¶

In [34]:

# Write your code here
df.isnull().sum()

Out[34]:

	0
order_id	0
customer_id	0
restaurant_name	0
cuisine_type	0
cost_of_the_order	0
day_of_the_week	0
rating	0
food_preparation_time	0
delivery_time	0

dtype: int64

Observations:¶

There were no columns on any rows without values.
There are instances of certain rows with “rating” missing and instead having the string “Not given”. In order to address this, I will convert the rating values to integer values later in this workbook since the rest of the ratings are integers (in string format before).

Question 4: Check the statistical summary of the data. What is the minimum, average, and maximum time it takes for food to be prepared once an order is placed? [2 marks]¶

In [35]:

# Write your code here
df.describe()

Out[35]:

	order_id	customer_id	cost_of_the_order	food_preparation_time	delivery_time
count	1.898000e+03	1898.000000	1898.000000	1898.000000	1898.000000
mean	1.477496e+06	171168.478398	16.498851	27.371970	24.161749
std	5.480497e+02	113698.139743	7.483812	4.632481	4.972637
min	1.476547e+06	1311.000000	4.470000	20.000000	15.000000
25%	1.477021e+06	77787.750000	12.080000	23.000000	20.000000
50%	1.477496e+06	128600.000000	14.140000	27.000000	25.000000
75%	1.477970e+06	270525.000000	22.297500	31.000000	28.000000
max	1.478444e+06	405334.000000	35.410000	35.000000	33.000000

In [36]:

# Write your code here
# Display only the statistics for "food_preparation_time"
food_prep_stats = df['food_preparation_time'].describe()
food_prep_stats

Out[36]:

	food_preparation_time
count	1898.000000
mean	27.371970
std	4.632481
min	20.000000
25%	23.000000
50%	27.000000
75%	31.000000
max	35.000000

dtype: float64
In [37]:

# Extract only the minimum, mean, and maximum values for "food_preparation_time"
food_prep_specific_stats = df['food_preparation_time'].agg(['min', 'mean', 'max']).round(2)
food_prep_specific_stats

Out[37]:

	food_preparation_time
min	20.00
mean	27.37
max	35.00

dtype: float64

Observations:¶

Food Prep Stats

Minimum Prep Time: 20 minutes
Average Prep Time: 27.37 minutes
Maximun Prep Time: 35 minutes

Question 5: How many orders are not rated? [1 mark]¶

In [38]:

# Write the code here
# Count the number of rows where "rating" is "Not given"
not_given_count = (df['rating'] == "Not given").sum()
not_given_count

Out[38]:

Observations:¶

There are 736 instances where rating was “Not given”.

In [40]:

# Convert 'Not given' to NaN
df['rating'] = pd.to_numeric(df['rating'], errors='coerce')
# Calculate the mean rating per restaurant
mean_ratings_per_restaurant = df.groupby('restaurant_name')['rating'].mean().round()
# Use the map function to replace NaN in 'rating' with the mean rating of the respective restaurant
df['rating'] = df.apply(
    lambda row: mean_ratings_per_restaurant[row['restaurant_name']] if pd.isna(row['rating']) else row['rating'],
    axis=1
)
# Check if there are still NaN values and replace them directly without inplace=True
overall_mean_rating = round(df['rating'].mean())
df['rating'] = df['rating'].fillna(overall_mean_rating)
# Convert 'rating' to integer
df['rating'] = df['rating'].astype(int)
df.head()

Out[40]:

	order_id	customer_id	restaurant_name	cuisine_type	cost_of_the_order	day_of_the_week	rating	food_preparation_time	delivery_time
0	1477147	337525	Hangawi	Korean	30.75	Weekend	4	25	20
1	1477685	358141	Blue Ribbon Sushi Izakaya	Japanese	12.08	Weekend	4	25	23
2	1477070	66393	Cafe Habana	Mexican	12.23	Weekday	5	23	28
3	1477334	106968	Blue Ribbon Fried Chicken	American	29.20	Weekend	3	25	15
4	1478249	76942	Dirty Bird to Go	American	11.59	Weekday	4	25	24

In [41]:

# Count the number of rows where "rating" is "Not given" just to prove that
# we addressed that issue.
not_given_count = (df['rating'] == "Not given").sum()
not_given_count

Out[41]:

Observations:¶

I converted all ratings to integers and assigned the mean (targeted mean per restaurant based on other orders) rounded to the nearest integer where the value was “Not given”. I will use these ratings

Exploratory Data Analysis (EDA)¶

Univariate Analysis¶

Question 6: Explore all the variables and provide observations on their distributions. (Generally, histograms, boxplots, countplots, etc. are used for univariate exploration.) [9 marks]¶

In [42]:

# Let's eyball the girst 5 rows again
df.head()

Out[42]:

	order_id	customer_id	restaurant_name	cuisine_type	cost_of_the_order	day_of_the_week	rating	food_preparation_time	delivery_time
0	1477147	337525	Hangawi	Korean	30.75	Weekend	4	25	20
1	1477685	358141	Blue Ribbon Sushi Izakaya	Japanese	12.08	Weekend	4	25	23
2	1477070	66393	Cafe Habana	Mexican	12.23	Weekday	5	23	28
3	1477334	106968	Blue Ribbon Fried Chicken	American	29.20	Weekend	3	25	15
4	1478249	76942	Dirty Bird to Go	American	11.59	Weekday	4	25	24

In [51]:

# Set plot style
sns.set_style("whitegrid")
# Creating a histogram for the cost of the orders
plt.figure(figsize=(10, 6))
sns.histplot(df['cost_of_the_order'], bins=30, kde=True, color="blue")
plt.title('Distribution of Cost of the Orders')
plt.xlabel('Cost of the Order ($)')
plt.ylabel('Frequency')
plt.show()

No description has been provided for this image

Observations:¶

Skewed Distribution: The distribution of the cost of orders is right-skewed, indicating that most of the orders are concentrated in the lower price range, with fewer orders as the price increases.
Common Price Range: The most common price range for orders appears to be between approximately $10 and $20. This suggests that most customers opt for moderately priced items.

In [52]:

# Creating a box plot for the cost of the orders
plt.figure(figsize=(8, 6))
sns.boxplot(x=df['cost_of_the_order'])
plt.title('Box Plot of the Cost of Orders')
plt.xlabel('Cost of the Order ($)')
plt.show()

No description has been provided for this image

Observations:¶

Central Tendency: The median cost of the orders is around $15-$20, suggesting that this is a typical amount customers spend per order.
Spread and Variability: The interquartile range (IQR), represented by the box, is relatively compact, indicating that the majority of orders cluster around the median price, within a moderate price range.

In [53]:

# Create a count plot for "rating"
plt.figure(figsize=(8, 5))
sns.countplot(x=df['rating'], hue=df['rating'], palette="viridis", legend=False,
              order=sorted(df['rating'].unique()))
plt.title("Count of Each Rating")
plt.xlabel("Rating")
plt.ylabel("Count")
plt.show()

No description has been provided for this image

Observations:¶

Overall, ratings are good, with most being 4 and above.

In [45]:

# Create a boxplot for "food_preparation_time"
plt.figure(figsize=(8, 5))
sns.boxplot(y=df['food_preparation_time'], color="green")
plt.title("Boxplot of Food Preparation Time")
plt.ylabel("Food Preparation Time (minutes)")
plt.show()

No description has been provided for this image

Observations:¶

Median and Central Tendency: The median food preparation time is centered around 25 minutes. This indicates that half of the food orders are prepared in less than 25 minutes, and the other half takes longer, showcasing a balance in preparation times across the dataset.
Interquartile Range (IQR): The IQR, depicted by the box, is relatively narrow, which suggests that the majority of food preparation times are consistent and cluster around the median. This consistency might be indicative of standardized processes within restaurants or similar types of dishes being ordered.
Variability: The whiskers, which extend to the lowest and highest typical preparation times, show a wider range than the IQR. This variability outside the central cluster can indicate differences in restaurant efficiency, menu complexity, or operational challenges during peak times.
Operational Insights: Restaurants and the food delivery service can use this data to identify and investigate the reasons behind unusually long or short preparation times. Addressing these could improve overall efficiency, reduce customer wait times, and enhance satisfaction.

In [65]:

# Creating a count plot for delivery time
plt.figure(figsize=(10, 6))
sns.countplot(x='delivery_time', data=df, hue='delivery_time', dodge=False, palette='viridis')
plt.title('Count of Orders by Delivery Time')
plt.xlabel('Delivery Time (minutes)')
plt.ylabel('Count of Orders')
plt.xticks(rotation=90)  # Rotating x-axis labels for better visibility
plt.legend().set_visible(False)  # Properly setting legend visibility
plt.show()

No description has been provided for this image
In [56]:

# Creating a box plot for delivery time
plt.figure(figsize=(8, 6))
sns.boxplot(x=df['delivery_time'])
plt.title('Box Plot of Delivery Time')
plt.xlabel('Delivery Time (minutes)')
plt.show()

No description has been provided for this image

Observations:¶

Median Delivery Time: The median delivery time is visually apparent in the plot, suggesting a typical delivery time that most orders adhere to. This median time provides a central benchmark for evaluating delivery efficiency.
Interquartile Range (IQR): The box, which encapsulates the middle 50% of delivery times, is relatively tight. This indicates that a majority of deliveries are consistent in duration, showing effective standardization and predictability in the delivery process.
Outliers: The plot reveals several outliers on the higher end, indicating deliveries that take significantly longer than usual. These outliers might be due to external factors such as traffic conditions, distance, or order complications.
Operational Insights: Identifying the reasons behind the longer delivery times could help in optimizing routes, improving delivery scheduling, and potentially selecting more reliable delivery methods or personnel. Similarly, analyzing why some deliveries are unusually quick could highlight best practices that could be replicated across the service.

In [49]:

# Create a count plot for "cuisine_type"
plt.figure(figsize=(10, 5))
sns.countplot(y=df['cuisine_type'], hue=df['cuisine_type'], palette="viridis", legend=False,
              order=df['cuisine_type'].value_counts().index)
plt.title("Count Plot of Cuisine Type")
plt.xlabel("Count")
plt.ylabel("Cuisine Type")
plt.show()

No description has been provided for this image

Observations:¶

Clearly the top cuisine types are American, Japanese and Italian.

In [50]:

# Create a count plot for "day_of_the_week"
plt.figure(figsize=(8, 5))
sns.countplot(x=df['day_of_the_week'], hue=df['day_of_the_week'], palette="magma", legend=False,
              order=df['day_of_the_week'].value_counts().index)
plt.title("Count Plot of Day of the Week")
plt.xlabel("Day of the Week")
plt.ylabel("Count")
plt.xticks(rotation=45)
plt.show()

No description has been provided for this image

Observations:¶

Clearly a lot more orders are placed during the weekend!

Question 7: Which are the top 5 restaurants in terms of the number of orders received? [1 mark]¶

In [66]:

# Write the code here
# Create a count plot for the top 5 restaurants
plt.figure(figsize=(10, 5))
sns.countplot(y=df['restaurant_name'], hue=df['restaurant_name'], palette="deep", legend=False,
              order=df['restaurant_name'].value_counts().index[:5])
plt.title("Top 5 Restaurants by Number of Orders")
plt.xlabel("Number of Orders")
plt.ylabel("Restaurant Name")
plt.show()

No description has been provided for this image

Observations:¶

The top 5 restaurants in terms of orders received are:

Shake Shack
The Meatball Shop
Blue Ribbon Sushi
Blue Ribbon Fried Chicken
Parm

Question 8: Which is the most popular cuisine on weekends? [1 mark]¶

In [67]:

# Write the code here
# Filter data for weekends (labeled as "Weekend" in the dataset)
weekend_data = df[df['day_of_the_week'] == 'Weekend']
# Create a count plot for cuisine types on weekends
# Create a count plot for cuisine types on weekends with hue assigned and ordered by popularity
plt.figure(figsize=(10, 5))
sns.countplot(y=weekend_data['cuisine_type'], hue=weekend_data['cuisine_type'],
              palette=sns.color_palette("magma", len(weekend_data['cuisine_type'].unique())),
              order=weekend_data['cuisine_type'].value_counts().index, dodge=False)
plt.title("Most Popular Cuisine on Weekends")
plt.xlabel("Number of Orders")
plt.ylabel("Cuisine Type")
plt.legend([],[], frameon=False)  # Hide legend
plt.show()
# Get the most popular cuisine type on weekends
popular_cuisine_weekends = weekend_data['cuisine_type'].value_counts().idxmax()
# Display the most popular cuisine type
popular_cuisine_weekends

No description has been provided for this image
Out[67]:

'American'

Observations:¶

Clearly the most popular cuisine during weekends is American.

Question 9: What percentage of the orders cost more than 20 dollars? [2 marks]¶

In [68]:

# Write the code here
# Calculate the percentage of orders that cost more than 20 dollars
total_orders = len(df)
orders_above_20 = len(df[df['cost_of_the_order'] > 20])
percentage_above_20 = (orders_above_20 / total_orders) * 100
percentage_above_20

Out[68]:

29.24130663856691

In [69]:

# Create a pie chart to visualize the percentage of orders above and below $20
plt.figure(figsize=(6, 6))
labels = ["Orders > $20", "Orders ≤ $20"]
sizes = [orders_above_20, total_orders - orders_above_20]
colors = ["#FF9999", "#66B2FF"]
plt.pie(sizes, labels=labels, autopct='%1.1f%%', colors=colors, startangle=90)
plt.title("Percentage of Orders Above and Below $20")
plt.show()

No description has been provided for this image

Observations:¶

About 29% of the orders cost more than 20 dollars. The rest (~71%) are below.

Question 10: What is the mean order delivery time? [1 mark]¶

In [70]:

# Write the code here
mean_delivery_time = df['delivery_time'].mean()
mean_delivery_time

Out[70]:

24.161749209694417

In [71]:

# Create a histogram with mean line to visualize delivery time distribution
plt.figure(figsize=(8, 5))
sns.histplot(df['delivery_time'], kde=True, bins=20, color="blue", alpha=0.7)
plt.axvline(mean_delivery_time, color='red', linestyle='dashed', linewidth=2, label=f'Mean: {mean_delivery_time:.2f} min')
plt.title("Distribution of Delivery Time with Mean Indicator")
plt.xlabel("Delivery Time (minutes)")
plt.ylabel("Frequency")
plt.legend()
plt.show()

No description has been provided for this image

Observations:¶

The mean delivery time is 24.16 min.

The peak of the histogram suggests that most delivery times are close to the mean.
This indicates a relatively consistent delivery time for most orders.

The histogram is right-skewed (longer tail on the right), which means that some orders take significantly longer than the average.

Question 11: The company has decided to give 20% discount vouchers to the top 3 most frequent customers. Find the IDs of these customers and the number of orders they placed. [1 mark]¶

In [72]:

# Write the code here
top_3_customers = df['customer_id'].value_counts().head(3)
top_3_customers

Out[72]:

	count
customer_id
52832	13
47440	10
83287	9

dtype: int64
In [73]:

# Create a bar chart for the top 3 most frequent customers
plt.figure(figsize=(8, 5))
sns.barplot(x=top_3_customers.index, y=top_3_customers.values,
            hue=top_3_customers.index, palette="coolwarm", legend=False)
plt.title("Top 3 Most Frequent Customers")
plt.xlabel("Customer ID")
plt.ylabel("Number of Orders")
plt.show()

No description has been provided for this image

Observations:¶

The above charts show the top 3 customers.

Multivariate Analysis¶

Question 12: Perform a multivariate analysis to explore relationships between the important variables in the dataset. (It is a good idea to explore relations between numerical variables as well as relations between numerical and categorical variables) [10 marks]¶

In [79]:

# Calculating total time from order to delivery by summing food preparation time and delivery time
df['total_time_from_order_to_delivery'] = df['food_preparation_time'] + df['delivery_time']
# Creating a heatmap to visualize the correlation between total time, cost of the order, and rating
heatmap_data = df[['total_time_from_order_to_delivery', 'cost_of_the_order', 'rating']]
correlation = heatmap_data.corr()
plt.figure(figsize=(8, 6))
sns.heatmap(correlation, annot=True, cmap='coolwarm', fmt=".2f", linewidths=.5)
plt.title('Correlation Heatmap between Total Time, Cost, and Rating')
plt.show()

No description has been provided for this image

Observations:¶

Total Time vs. Rating: The correlation between total time from order to delivery and rating is very low, as indicated by a coefficient close to zero. This reinforces the observation from the scatter plot that total time does not significantly influence the ratings directly.
Cost vs. Rating: Similarly, the correlation between cost of the order and rating is also very low. This suggests that higher costs are not directly associated with higher customer satisfaction (or dissatisfaction), at least not linearly.
Cost vs. Total Time: There is a negligible correlation between the cost of the order and the total time from order to delivery. This indicates that longer preparation and delivery times are not necessarily associated with higher costs.

In [84]:

# 1. Cuisine Type Analysis - Comparing average delivery and preparation times across cuisines
# Aggregating average times per cuisine
cuisine_times = df.groupby('cuisine_type')[['food_preparation_time', 'delivery_time']].mean().reset_index()
# Limiting to top 20 cuisines to prevent memory overload
top_cuisines = df['cuisine_type'].value_counts().index[:20]
cuisine_times_filtered = cuisine_times[cuisine_times['cuisine_type'].isin(top_cuisines)]
# Plotting average food preparation time by cuisine type
plt.figure(figsize=(10, 6))
sns.barplot(x='cuisine_type', y='food_preparation_time', hue='cuisine_type', data=cuisine_times_filtered, palette='viridis', dodge=False)
plt.title('Average Food Preparation Time by Cuisine Type (Top 20)')
plt.xlabel('Cuisine Type')
plt.ylabel('Average Food Preparation Time (minutes)')
plt.xticks(rotation=90)
plt.show()
# Plotting average delivery time by cuisine type
plt.figure(figsize=(10, 6))
sns.barplot(x='cuisine_type', y='delivery_time', hue='cuisine_type', data=cuisine_times_filtered, palette='magma', dodge=False)
plt.title('Average Delivery Time by Cuisine Type (Top 20)')
plt.xlabel('Cuisine Type')
plt.ylabel('Average Delivery Time (minutes)')
plt.xticks(rotation=90)
plt.show()

No description has been provided for this image

Observations:¶

Food Preparation Time by Cuisine Type:

Certain cuisines take significantly longer to prepare than others.
The variation in preparation time could be due to differences in dish complexity, cooking techniques, or restaurant efficiency.
Delivery Time by Cuisine Type:
Delivery times also vary across cuisines, which may be influenced by restaurant locations, demand levels, or how well the food travels.
Some cuisines might be more frequently ordered from farther distances, leading to longer delivery times.

In [88]:

# 2. Restaurant Performance Analysis - Comparing average ratings, delivery times, and preparation times across restaurants
# Aggregating performance metrics per restaurant
restaurant_performance = df.groupby('restaurant_name')[['rating', 'food_preparation_time', 'delivery_time']].mean().reset_index()
# Limiting to top 20 most popular restaurants to prevent memory overload
top_restaurants = df['restaurant_name'].value_counts().index[:20]
restaurant_performance_filtered = restaurant_performance[restaurant_performance['restaurant_name'].isin(top_restaurants)]
# Plotting average ratings by restaurant
plt.figure(figsize=(14, 6))
sns.barplot(x='restaurant_name', y='rating', hue='restaurant_name', data=restaurant_performance_filtered, palette='coolwarm', dodge=False)
plt.title('Average Customer Rating by Restaurant (Top 20)')
plt.xlabel('Restaurant Name')
plt.ylabel('Average Rating')
plt.xticks(rotation=90)
plt.show()
# Plotting average food preparation time by restaurant
plt.figure(figsize=(14, 6))
sns.barplot(x='restaurant_name', y='food_preparation_time', hue='restaurant_name', data=restaurant_performance_filtered, palette='viridis', dodge=False)
plt.title('Average Food Preparation Time by Restaurant (Top 20)')
plt.xlabel('Restaurant Name')
plt.ylabel('Average Food Preparation Time (minutes)')
plt.xticks(rotation=90)
plt.show()
# Plotting average delivery time by restaurant
plt.figure(figsize=(14, 6))
sns.barplot(x='restaurant_name', y='delivery_time', hue='restaurant_name', data=restaurant_performance_filtered, palette='magma', dodge=False)
plt.title('Average Delivery Time by Restaurant (Top 20)')
plt.xlabel('Restaurant Name')
plt.ylabel('Average Delivery Time (minutes)')
plt.xticks(rotation=90)
plt.show()

No description has been provided for this image

Observations¶

Average Ratings by Restaurant:

Some restaurants consistently receive higher ratings than others, indicating better customer satisfaction.
The variation in ratings may be influenced by factors such as food quality, service efficiency, and pricing.
Average Food Preparation Time by Restaurant:
Certain restaurants take significantly longer to prepare food than others.
This could be due to the type of cuisine, restaurant efficiency, or the complexity of menu items.
Average Delivery Time by Restaurant:
Some restaurants have longer delivery times on average, which could be due to location, demand, or delivery efficiency.
Identifying which restaurants have consistently longer delivery times may help FoodHub optimize delivery logistics.

In [92]:

# 3. Day of the Week Analysis - Comparing ratings, preparation times, and delivery times with hue set to x
# Aggregating performance metrics per day of the week
day_performance = df.groupby('day_of_the_week')[['rating', 'food_preparation_time', 'delivery_time']].mean().reset_index()
# Plotting average ratings by day of the week
plt.figure(figsize=(10, 6))
sns.barplot(x='day_of_the_week', y='rating', hue='day_of_the_week', data=day_performance, palette='coolwarm', dodge=False)
plt.title('Average Customer Rating by Day of the Week')
plt.xlabel('Day of the Week')
plt.ylabel('Average Rating')
plt.show()
# Plotting average food preparation time by day of the week
plt.figure(figsize=(10, 6))
sns.barplot(x='day_of_the_week', y='food_preparation_time', hue='day_of_the_week', data=day_performance, palette='viridis', dodge=False)
plt.title('Average Food Preparation Time by Day of the Week')
plt.xlabel('Day of the Week')
plt.ylabel('Average Food Preparation Time (minutes)')
plt.show()
# Plotting average delivery time by day of the week
plt.figure(figsize=(10, 6))
sns.barplot(x='day_of_the_week', y='delivery_time', hue='day_of_the_week', data=day_performance, palette='magma', dodge=False)
plt.title('Average Delivery Time by Day of the Week')
plt.xlabel('Day of the Week')
plt.ylabel('Average Delivery Time (minutes)')
plt.show()

No description has been provided for this image

Observations:¶

Average Ratings by Day of the Week:

Ratings remain relatively consistent across weekdays and weekends, indicating stable customer satisfaction regardless of the day.
Average Food Preparation Time by Day of the Week:
There is a slight increase in preparation times on weekends, suggesting a higher volume of orders or operational constraints.
Restaurants might experience increased kitchen workloads on weekends, leading to slightly slower preparation.
Average Delivery Time by Day of the Week:
Delivery times show a similar pattern, with weekends experiencing slightly longer durations.
Higher weekend demand could be affecting overall delivery efficiency.

Question 13: The company wants to provide a promotional offer in the advertisement of the restaurants. The condition to get the offer is that the restaurants must have a rating count of more than 50 and the average rating should be greater than 4. Find the restaurants fulfilling the criteria to get the promotional offer. [3 marks]¶

In [93]:

# Write the code here
# Group by restaurant and calculate rating count and average rating
restaurant_ratings = df.groupby('restaurant_name')['rating'].agg(['count', 'mean'])
# Filter restaurants meeting the criteria
qualified_restaurants = restaurant_ratings[(restaurant_ratings['count'] > 50) & (restaurant_ratings['mean'] > 4)]
qualified_restaurants

Out[93]:

	count	mean
restaurant_name
Blue Ribbon Fried Chicken	96	4.218750
Blue Ribbon Sushi	119	4.134454
Parm	68	4.073529
RedFarm Broadway	59	4.169492
RedFarm Hudson	55	4.109091
Shake Shack	219	4.168950
The Meatball Shop	132	4.689394

In [96]:

# Group by restaurant and calculate rating count and average rating
restaurant_ratings = df.groupby('restaurant_name')['rating'].agg(['count', 'mean'])
# Filter restaurants meeting the criteria (more than 50 ratings and average rating above 4)
qualified_restaurants = restaurant_ratings[(restaurant_ratings['count'] > 50) & (restaurant_ratings['mean'] > 4)]
# Sort the qualified restaurants by average rating in descending order
qualified_restaurants_sorted = qualified_restaurants.sort_values(by='mean', ascending=False).reset_index()
# Create a bar chart with sorted restaurants and assign the y variable to hue
plt.figure(figsize=(10, 5))
sns.barplot(y='restaurant_name', x='mean', hue='restaurant_name', data=qualified_restaurants_sorted, palette="coolwarm", dodge=False)
plt.title("Restaurants Eligible for Promotional Offer (Sorted by Avg Rating)")
plt.xlabel("Average Rating")
plt.ylabel("Restaurant Name")
plt.xlim(4, 5)  # Set x-axis range to focus on valid rating values
plt.show()

No description has been provided for this image

Observations:¶

This bar chart displays restaurants that qualify for a promotional offer based on:
Having more than 50 ratings (ensuring enough data points for reliability).
Maintaining an average rating above 4.0.
Restaurants are sorted in descending order of average rating, highlighting the best-rated ones.
The hue is assigned to the restaurant name, ensuring distinct colors for each.

Question 14: The company charges the restaurant 25% on the orders having cost greater than 20 dollars and 15% on the orders having cost greater than 5 dollars. Find the net revenue generated by the company across all orders. [3 marks]¶

In [97]:

# Write the code here
# Define commission rates based on order cost
def calculate_commission(cost):
    if cost > 20:
        return cost * 0.25  # 25% commission for orders above $20
    elif cost > 5:
        return cost * 0.15  # 15% commission for orders above $5
    else:
        return 0  # No commission for orders $5 or below
# Apply the commission calculation to each order
df['commission'] = df['cost_of_the_order'].apply(calculate_commission)
# Calculate total revenue generated by the company
total_revenue = df['commission'].sum()
total_revenue

Out[97]:

6166.303

Observations:¶

Net Revenue Generated by the Company:
The total revenue generated by the company from commissions across all orders is $6,166.30.

Question 15: The company wants to analyze the total time required to deliver the food. What percentage of orders take more than 60 minutes to get delivered from the time the order is placed? (The food has to be prepared and then delivered.) [2 marks]¶

In [100]:

# Write the code here
# Calculate total time from order placement to delivery
df['total_time_from_order_to_delivery'] = df['food_preparation_time'] + df['delivery_time']
# Count total number of orders
total_orders = len(df)
# Count orders that took more than 60 minutes
orders_above_60_min = df[df['total_time_from_order_to_delivery'] > 60].shape[0]
# Calculate percentage of orders taking more than 60 minutes
percentage_above_60_min = (orders_above_60_min / total_orders) * 100
round(percentage_above_60_min, 2)

Out[100]:

10.54

Observations:¶

10.54% of orders take more than 60 minutes from order placement to delivery.

Question 16: The company wants to analyze the delivery time of the orders on weekdays and weekends. How does the mean delivery time vary during weekdays and weekends? [2 marks]¶

In [102]:

# Write the code here
# Calculate the mean delivery time for weekdays and weekends
delivery_time_analysis = df.groupby('day_of_the_week')['delivery_time'].mean().reset_index()
# Plotting the variation in mean delivery time
plt.figure(figsize=(10, 6))
sns.barplot(x='day_of_the_week', y='delivery_time', data=delivery_time_analysis, palette='magma', hue='day_of_the_week', dodge=False)
plt.title('Mean Delivery Time on Weekdays vs. Weekends')
plt.xlabel('Day of the Week')
plt.ylabel('Mean Delivery Time (minutes)')
plt.show()
# Display mean delivery times for weekdays and weekends
delivery_time_analysis

No description has been provided for this image
Out[102]:

	day_of_the_week	delivery_time
0	Weekday	28.340037
1	Weekend	22.470022

Observations:¶

The mean delivery time is lower on weekends than on weekdays.

Possible reasons:
Lower traffic congestion on weekends, leading to faster deliveries.
More efficient delivery management due to higher expected demand on weekends.
Fewer business district orders (which might take longer due to distance or traffic).

Conclusion and Recommendations¶

Conclusions:¶

My analysis of FoodHub’s order and delivery data provides valuable insights into restaurant performance, customer satisfaction, and operational efficiency. Key findings include:

Delivery & Preparation Time Insights

10.54% of orders take more than 60 minutes from order placement to delivery, which could negatively impact customer satisfaction.
Mean delivery time is longer on weekdays (28.34 min) than on weekends (22.47 min), suggesting that traffic congestion or restaurant availability might be factors affecting delivery speed.
Some cuisine types and restaurants have significantly longer preparation times, which contributes to overall delays.

Restaurant & Cuisine Performance

Certain restaurants consistently receive high ratings (above 4.0) with more than 50 orders, making them strong candidates for promotional offers.
Some cuisines have higher preparation times than others, affecting overall delivery efficiency.
The most profitable orders come from customers who spend more than $20, as they generate the highest commissions for FoodHub.

Customer Satisfaction & Revenue

Customer ratings are not strongly correlated with delivery time, suggesting that food quality and overall experience play a bigger role in satisfaction.
The company generated 6,166.30 dollars in commission revenue, with most revenue coming from orders above $20.
Certain restaurants consistently take longer to prepare food, which might contribute to lower ratings.

Recommendations:¶

Improve Delivery & Operational Efficiency

Optimize restaurant partnerships: Identify restaurants with consistent delays and work with them to improve preparation efficiency.
Prioritize high-performing restaurants: Promote and feature restaurants with high ratings and good efficiency to improve customer satisfaction.
Reduce weekday delivery delays: Explore better delivery scheduling and traffic-optimized routing for faster weekday deliveries.

Enhance Customer Satisfaction & Engagement

Encourage high-rated restaurants: Offer incentives or priority placement in the app for restaurants maintaining an average rating above 4.0.
Improve customer communication: Notify customers in real-time if orders are expected to take longer than 60 minutes to manage expectations.
Introduce loyalty rewards: Offer discounts or free delivery for customers who frequently order from top-rated restaurants.

Revenue Optimization Strategies

Increase commissions on premium orders: Since most revenue comes from orders above $20, consider adjusting pricing models to encourage higher-value purchases.
Feature premium & high-performing cuisines: If certain cuisines generate higher-value orders, highlight them in the app for better visibility.
Reduce inefficiencies in long-prep-time cuisines: Work with restaurants offering slow-prep cuisines to improve kitchen efficiency and reduce preparation times.

#DataScience AI DataAnalysis MachineLearning Python Statistics

Like this:

Related

Data Science: Food Hub Data Analysis

Project Foundations for Data Science: FoodHub Data Analysis¶

Context¶

Objective¶

Data Description¶

Data Dictionary¶

Let us start by importing the required libraries¶

Understanding the structure of the data¶

.dataframe tbody tr th:only-of-type {<br /> vertical-align: middle;<br /> }<br /> .dataframe tbody tr th {<br /> vertical-align: top;<br /> }<br /> .dataframe thead th {<br /> text-align: right;<br /> }<br />

Observations:¶

Question 1: How many rows and columns are present in the data? [0.5 mark]¶

Observations:¶

Question 2: What are the datatypes of the different columns in the dataset? (The info() function can be used) [0.5 mark]¶

Observations:¶

Question 3: Are there any missing values in the data? If yes, treat them using an appropriate method. [1 mark]¶

.dataframe tbody tr th:only-of-type {<br /> vertical-align: middle;<br /> }<br /> .dataframe tbody tr th {<br /> vertical-align: top;<br /> }<br /> .dataframe thead th {<br /> text-align: right;<br /> }<br />

Observations:¶

Question 4: Check the statistical summary of the data. What is the minimum, average, and maximum time it takes for food to be prepared once an order is placed? [2 marks]¶

.dataframe tbody tr th:only-of-type {<br /> vertical-align: middle;<br /> }<br /> .dataframe tbody tr th {<br /> vertical-align: top;<br /> }<br /> .dataframe thead th {<br /> text-align: right;<br /> }<br />

.dataframe tbody tr th:only-of-type {<br /> vertical-align: middle;<br /> }<br /> .dataframe tbody tr th {<br /> vertical-align: top;<br /> }<br /> .dataframe thead th {<br /> text-align: right;<br /> }<br />

.dataframe tbody tr th:only-of-type {<br /> vertical-align: middle;<br /> }<br /> .dataframe tbody tr th {<br /> vertical-align: top;<br /> }<br /> .dataframe thead th {<br /> text-align: right;<br /> }<br />

Observations:¶

Question 5: How many orders are not rated? [1 mark]¶

Observations:¶

.dataframe tbody tr th:only-of-type {<br /> vertical-align: middle;<br /> }<br /> .dataframe tbody tr th {<br /> vertical-align: top;<br /> }<br /> .dataframe thead th {<br /> text-align: right;<br /> }<br />

Observations:¶

Exploratory Data Analysis (EDA)¶

Univariate Analysis¶

Question 6: Explore all the variables and provide observations on their distributions. (Generally, histograms, boxplots, countplots, etc. are used for univariate exploration.) [9 marks]¶

.dataframe tbody tr th:only-of-type {<br /> vertical-align: middle;<br /> }<br /> .dataframe tbody tr th {<br /> vertical-align: top;<br /> }<br /> .dataframe thead th {<br /> text-align: right;<br /> }<br />

Observations:¶

Observations:¶

Observations:¶

Observations:¶

Observations:¶

Observations:¶

Observations:¶

Question 7: Which are the top 5 restaurants in terms of the number of orders received? [1 mark]¶

Observations:¶

Question 8: Which is the most popular cuisine on weekends? [1 mark]¶

Observations:¶

Question 9: What percentage of the orders cost more than 20 dollars? [2 marks]¶

Observations:¶

Question 10: What is the mean order delivery time? [1 mark]¶

Observations:¶

Question 11: The company has decided to give 20% discount vouchers to the top 3 most frequent customers. Find the IDs of these customers and the number of orders they placed. [1 mark]¶

.dataframe tbody tr th:only-of-type {<br /> vertical-align: middle;<br /> }<br /> .dataframe tbody tr th {<br /> vertical-align: top;<br /> }<br /> .dataframe thead th {<br /> text-align: right;<br /> }<br />

Observations:¶

Multivariate Analysis¶

Question 12: Perform a multivariate analysis to explore relationships between the important variables in the dataset. (It is a good idea to explore relations between numerical variables as well as relations between numerical and categorical variables) [10 marks]¶

Observations:¶

Observations:¶

Observations¶

Observations:¶

.dataframe tbody tr th:only-of-type {<br /> vertical-align: middle;<br /> }<br /> .dataframe tbody tr th {<br /> vertical-align: top;<br /> }<br /> .dataframe thead th {<br /> text-align: right;<br /> }<br />

Observations:¶

Question 14: The company charges the restaurant 25% on the orders having cost greater than 20 dollars and 15% on the orders having cost greater than 5 dollars. Find the net revenue generated by the company across all orders. [3 marks]¶

Observations:¶

Question 15: The company wants to analyze the total time required to deliver the food. What percentage of orders take more than 60 minutes to get delivered from the time the order is placed? (The food has to be prepared and then delivered.) [2 marks]¶

Observations:¶

Question 16: The company wants to analyze the delivery time of the orders on weekdays and weekends. How does the mean delivery time vary during weekdays and weekends? [2 marks]¶

.dataframe tbody tr th:only-of-type {<br /> vertical-align: middle;<br /> }<br /> .dataframe tbody tr th {<br /> vertical-align: top;<br /> }<br /> .dataframe thead th {<br /> text-align: right;<br /> }<br />

Observations:¶

Conclusion and Recommendations¶

Question 17: What are your conclusions from the analysis? What recommendations would you like to share to help improve the business? (You can use cuisine type and feedback ratings to drive your business recommendations.) [6 marks]¶

Conclusions:¶

Recommendations:¶

Share this:

Like this:

Related

Mario Mamalis

Breaking the Code

You may also like

Breaking the Code

Navigating the AI Landscape

Case Study: Transformative Digital Solutions

Azure Landing Zones

Durable Functions Workflows

Durable Functions Fundamentals

Leave a ReplyCancel reply