Data Science: Food Hub Data Analysis

by Mario Mamalis
Data Science Analysis

Explore a data science case study, showcasing data-driven strategies for exploring, cleaning and preparing data for statistical analysis and visualizations.

Project Foundations for Data Science: FoodHub Data Analysis

Context

The number of restaurants in New York is increasing day by day. Lots of students and busy professionals rely on those restaurants due to their hectic lifestyles. Online food delivery service is a great option for them. It provides them with good food from their favorite restaurants. A food aggregator company FoodHub offers access to multiple restaurants through a single smartphone app.

The app allows the restaurants to receive a direct online order from a customer. The app assigns a delivery person from the company to pick up the order after it is confirmed by the restaurant. The delivery person then uses the map to reach the restaurant and waits for the food package. Once the food package is handed over to the delivery person, he/she confirms the pick-up in the app and travels to the customer’s location to deliver the food. The delivery person confirms the drop-off in the app after delivering the food package to the customer. The customer can rate the order in the app. The food aggregator earns money by collecting a fixed margin of the delivery order from the restaurants.

Objective

The food aggregator company has stored the data of the different orders made by the registered customers in their online portal. They want to analyze the data to get a fair idea about the demand of different restaurants which will help them in enhancing their customer experience. Suppose you are hired as a Data Scientist in this company and the Data Science team has shared some of the key questions that need to be answered. Perform the data analysis to find answers to these questions that will help the company to improve the business.

Data Description

The data contains the different data related to a food order. The detailed data dictionary is given below.

Data Dictionary

  • order_id: Unique ID of the order
  • customer_id: ID of the customer who ordered the food
  • restaurant_name: Name of the restaurant
  • cuisine_type: Cuisine ordered by the customer
  • cost: Cost of the order
  • day_of_the_week: Indicates whether the order is placed on a weekday or weekend (The weekday is from Monday to Friday and the weekend is Saturday and Sunday)
  • rating: Rating given by the customer out of 5
  • food_preparation_time: Time (in minutes) taken by the restaurant to prepare the food. This is calculated by taking the difference between the timestamps of the restaurant’s order confirmation and the delivery person’s pick-up confirmation.
  • delivery_time: Time (in minutes) taken by the delivery person to deliver the food package. This is calculated by taking the difference between the timestamps of the delivery person’s pick-up confirmation and drop-off information

Let us start by importing the required libraries

In [9]:

# import libraries for data manipulation
import numpy as np
import pandas as pd
# import libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

Understanding the structure of the data

In [10]:

#Access the drive
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

In [31]:

# read the data
df = pd.read_csv('/content/drive/MyDrive/MIT-ADSP/0131-Foundations-Python-Statistics/Week 2/Project - Food Hub/foodhub_order.csv')
# returns the first 5 rows
df.head()

Out[31]:

order_id customer_id restaurant_name cuisine_type cost_of_the_order day_of_the_week rating food_preparation_time delivery_time
0 1477147 337525 Hangawi Korean 30.75 Weekend Not given 25 20
1 1477685 358141 Blue Ribbon Sushi Izakaya Japanese 12.08 Weekend Not given 25 23
2 1477070 66393 Cafe Habana Mexican 12.23 Weekday 5 23 28
3 1477334 106968 Blue Ribbon Fried Chicken American 29.20 Weekend 3 25 15
4 1478249 76942 Dirty Bird to Go American 11.59 Weekday 4 25 24

Observations:

The DataFrame has 9 columns as mentioned in the Data Dictionary. Data in each row corresponds to the order placed by a customer.

Question 1: How many rows and columns are present in the data? [0.5 mark]

In [32]:

# Write your code here
rows, columns = df.shape
print(f"There are {rows} rows and {columns} columns present in the data.")
There are 1898 rows and 9 columns present in the data.

Observations:

There are 1898 rows and 9 columns present in the data. This means that there are 1898 orders that we can analyze.

Question 2: What are the datatypes of the different columns in the dataset? (The info() function can be used) [0.5 mark]

In [33]:

# Use info() to print a concise summary of the DataFrame
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1898 entries, 0 to 1897
Data columns (total 9 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   order_id               1898 non-null   int64  
 1   customer_id            1898 non-null   int64  
 2   restaurant_name        1898 non-null   object 
 3   cuisine_type           1898 non-null   object 
 4   cost_of_the_order      1898 non-null   float64
 5   day_of_the_week        1898 non-null   object 
 6   rating                 1898 non-null   object 
 7   food_preparation_time  1898 non-null   int64  
 8   delivery_time          1898 non-null   int64  
dtypes: float64(1), int64(4), object(4)
memory usage: 133.6+ KB

Observations:

  • order_id, customer_id, food_preparation_time and delivery_time are integers.
  • restaurant_name, cuisine_type, day_of_the_week and rating are strings.
  • cost_of_ther_order is a float (decimal)
  • rating should be a number. I will convert that at some point.

Question 3: Are there any missing values in the data? If yes, treat them using an appropriate method. [1 mark]

In [34]:

# Write your code here
df.isnull().sum()

Out[34]:

0
order_id 0
customer_id 0
restaurant_name 0
cuisine_type 0
cost_of_the_order 0
day_of_the_week 0
rating 0
food_preparation_time 0
delivery_time 0

Observations:

  1. There were no columns on any rows without values.
  2. There are instances of certain rows with “rating” missing and instead having the string “Not given”. In order to address this, I will convert the rating values to integer values later in this workbook since the rest of the ratings are integers (in string format before).

Question 4: Check the statistical summary of the data. What is the minimum, average, and maximum time it takes for food to be prepared once an order is placed? [2 marks]

In [35]:

# Write your code here
df.describe()

Out[35]:

order_id customer_id cost_of_the_order food_preparation_time delivery_time
count 1.898000e+03 1898.000000 1898.000000 1898.000000 1898.000000
mean 1.477496e+06 171168.478398 16.498851 27.371970 24.161749
std 5.480497e+02 113698.139743 7.483812 4.632481 4.972637
min 1.476547e+06 1311.000000 4.470000 20.000000 15.000000
25% 1.477021e+06 77787.750000 12.080000 23.000000 20.000000
50% 1.477496e+06 128600.000000 14.140000 27.000000 25.000000
75% 1.477970e+06 270525.000000 22.297500 31.000000 28.000000
max 1.478444e+06 405334.000000 35.410000 35.000000 33.000000

In [36]:

# Write your code here
# Display only the statistics for "food_preparation_time"
food_prep_stats = df['food_preparation_time'].describe()
food_prep_stats

Out[36]:

food_preparation_time
count 1898.000000
mean 27.371970
std 4.632481
min 20.000000
25% 23.000000
50% 27.000000
75% 31.000000
max 35.000000


In [37]:

# Extract only the minimum, mean, and maximum values for "food_preparation_time"
food_prep_specific_stats = df['food_preparation_time'].agg(['min', 'mean', 'max']).round(2)
food_prep_specific_stats

Out[37]:

food_preparation_time
min 20.00
mean 27.37
max 35.00

Observations:

Food Prep Stats

  1. Minimum Prep Time: 20 minutes
  2. Average Prep Time: 27.37 minutes
  3. Maximun Prep Time: 35 minutes

Question 5: How many orders are not rated? [1 mark]

In [38]:

# Write the code here
# Count the number of rows where "rating" is "Not given"
not_given_count = (df['rating'] == "Not given").sum()
not_given_count

Out[38]:

736

Observations:

There are 736 instances where rating was “Not given”.

In [40]:

# Convert 'Not given' to NaN
df['rating'] = pd.to_numeric(df['rating'], errors='coerce')
# Calculate the mean rating per restaurant
mean_ratings_per_restaurant = df.groupby('restaurant_name')['rating'].mean().round()
# Use the map function to replace NaN in 'rating' with the mean rating of the respective restaurant
df['rating'] = df.apply(
    lambda row: mean_ratings_per_restaurant[row['restaurant_name']] if pd.isna(row['rating']) else row['rating'],
    axis=1
)
# Check if there are still NaN values and replace them directly without inplace=True
overall_mean_rating = round(df['rating'].mean())
df['rating'] = df['rating'].fillna(overall_mean_rating)
# Convert 'rating' to integer
df['rating'] = df['rating'].astype(int)
df.head()

Out[40]:

order_id customer_id restaurant_name cuisine_type cost_of_the_order day_of_the_week rating food_preparation_time delivery_time
0 1477147 337525 Hangawi Korean 30.75 Weekend 4 25 20
1 1477685 358141 Blue Ribbon Sushi Izakaya Japanese 12.08 Weekend 4 25 23
2 1477070 66393 Cafe Habana Mexican 12.23 Weekday 5 23 28
3 1477334 106968 Blue Ribbon Fried Chicken American 29.20 Weekend 3 25 15
4 1478249 76942 Dirty Bird to Go American 11.59 Weekday 4 25 24

In [41]:

# Count the number of rows where "rating" is "Not given" just to prove that
# we addressed that issue.
not_given_count = (df['rating'] == "Not given").sum()
not_given_count

Out[41]:

0

Observations:

I converted all ratings to integers and assigned the mean (targeted mean per restaurant based on other orders) rounded to the nearest integer where the value was “Not given”. I will use these ratings

Exploratory Data Analysis (EDA)

Univariate Analysis

Question 6: Explore all the variables and provide observations on their distributions. (Generally, histograms, boxplots, countplots, etc. are used for univariate exploration.) [9 marks]

In [42]:

# Let's eyball the girst 5 rows again
df.head()

Out[42]:

order_id customer_id restaurant_name cuisine_type cost_of_the_order day_of_the_week rating food_preparation_time delivery_time
0 1477147 337525 Hangawi Korean 30.75 Weekend 4 25 20
1 1477685 358141 Blue Ribbon Sushi Izakaya Japanese 12.08 Weekend 4 25 23
2 1477070 66393 Cafe Habana Mexican 12.23 Weekday 5 23 28
3 1477334 106968 Blue Ribbon Fried Chicken American 29.20 Weekend 3 25 15
4 1478249 76942 Dirty Bird to Go American 11.59 Weekday 4 25 24

In [51]:

# Set plot style
sns.set_style("whitegrid")
# Creating a histogram for the cost of the orders
plt.figure(figsize=(10, 6))
sns.histplot(df['cost_of_the_order'], bins=30, kde=True, color="blue")
plt.title('Distribution of Cost of the Orders')
plt.xlabel('Cost of the Order ($)')
plt.ylabel('Frequency')
plt.show()

No description has been provided for this image

Observations:

  1. Skewed Distribution: The distribution of the cost of orders is right-skewed, indicating that most of the orders are concentrated in the lower price range, with fewer orders as the price increases.
  2. Common Price Range: The most common price range for orders appears to be between approximately $10 and $20. This suggests that most customers opt for moderately priced items.

In [52]:

# Creating a box plot for the cost of the orders
plt.figure(figsize=(8, 6))
sns.boxplot(x=df['cost_of_the_order'])
plt.title('Box Plot of the Cost of Orders')
plt.xlabel('Cost of the Order ($)')
plt.show()

No description has been provided for this image

Observations:

  1. Central Tendency: The median cost of the orders is around $15-$20, suggesting that this is a typical amount customers spend per order.
  2. Spread and Variability: The interquartile range (IQR), represented by the box, is relatively compact, indicating that the majority of orders cluster around the median price, within a moderate price range.

In [53]:

# Create a count plot for "rating"
plt.figure(figsize=(8, 5))
sns.countplot(x=df['rating'], hue=df['rating'], palette="viridis", legend=False,
              order=sorted(df['rating'].unique()))
plt.title("Count of Each Rating")
plt.xlabel("Rating")
plt.ylabel("Count")
plt.show()

No description has been provided for this image

Observations:

  1. Overall, ratings are good, with most being 4 and above.

In [45]:

# Create a boxplot for "food_preparation_time"
plt.figure(figsize=(8, 5))
sns.boxplot(y=df['food_preparation_time'], color="green")
plt.title("Boxplot of Food Preparation Time")
plt.ylabel("Food Preparation Time (minutes)")
plt.show()

No description has been provided for this image

Observations:

  1. Median and Central Tendency: The median food preparation time is centered around 25 minutes. This indicates that half of the food orders are prepared in less than 25 minutes, and the other half takes longer, showcasing a balance in preparation times across the dataset.
  2. Interquartile Range (IQR): The IQR, depicted by the box, is relatively narrow, which suggests that the majority of food preparation times are consistent and cluster around the median. This consistency might be indicative of standardized processes within restaurants or similar types of dishes being ordered.
  3. Variability: The whiskers, which extend to the lowest and highest typical preparation times, show a wider range than the IQR. This variability outside the central cluster can indicate differences in restaurant efficiency, menu complexity, or operational challenges during peak times.
  4. Operational Insights: Restaurants and the food delivery service can use this data to identify and investigate the reasons behind unusually long or short preparation times. Addressing these could improve overall efficiency, reduce customer wait times, and enhance satisfaction.

In [65]:

# Creating a count plot for delivery time
plt.figure(figsize=(10, 6))
sns.countplot(x='delivery_time', data=df, hue='delivery_time', dodge=False, palette='viridis')
plt.title('Count of Orders by Delivery Time')
plt.xlabel('Delivery Time (minutes)')
plt.ylabel('Count of Orders')
plt.xticks(rotation=90)  # Rotating x-axis labels for better visibility
plt.legend().set_visible(False)  # Properly setting legend visibility
plt.show()

No description has been provided for this image
In [56]:

# Creating a box plot for delivery time
plt.figure(figsize=(8, 6))
sns.boxplot(x=df['delivery_time'])
plt.title('Box Plot of Delivery Time')
plt.xlabel('Delivery Time (minutes)')
plt.show()

No description has been provided for this image

Observations:

  1. Median Delivery Time: The median delivery time is visually apparent in the plot, suggesting a typical delivery time that most orders adhere to. This median time provides a central benchmark for evaluating delivery efficiency.
  2. Interquartile Range (IQR): The box, which encapsulates the middle 50% of delivery times, is relatively tight. This indicates that a majority of deliveries are consistent in duration, showing effective standardization and predictability in the delivery process.
  3. Outliers: The plot reveals several outliers on the higher end, indicating deliveries that take significantly longer than usual. These outliers might be due to external factors such as traffic conditions, distance, or order complications.
  4. Operational Insights: Identifying the reasons behind the longer delivery times could help in optimizing routes, improving delivery scheduling, and potentially selecting more reliable delivery methods or personnel. Similarly, analyzing why some deliveries are unusually quick could highlight best practices that could be replicated across the service.

In [49]:

# Create a count plot for "cuisine_type"
plt.figure(figsize=(10, 5))
sns.countplot(y=df['cuisine_type'], hue=df['cuisine_type'], palette="viridis", legend=False,
              order=df['cuisine_type'].value_counts().index)
plt.title("Count Plot of Cuisine Type")
plt.xlabel("Count")
plt.ylabel("Cuisine Type")
plt.show()

No description has been provided for this image

Observations:

Clearly the top cuisine types are American, Japanese and Italian.

In [50]:

# Create a count plot for "day_of_the_week"
plt.figure(figsize=(8, 5))
sns.countplot(x=df['day_of_the_week'], hue=df['day_of_the_week'], palette="magma", legend=False,
              order=df['day_of_the_week'].value_counts().index)
plt.title("Count Plot of Day of the Week")
plt.xlabel("Day of the Week")
plt.ylabel("Count")
plt.xticks(rotation=45)
plt.show()

No description has been provided for this image

Observations:

Clearly a lot more orders are placed during the weekend!

Question 7: Which are the top 5 restaurants in terms of the number of orders received? [1 mark]

In [66]:

# Write the code here
# Create a count plot for the top 5 restaurants
plt.figure(figsize=(10, 5))
sns.countplot(y=df['restaurant_name'], hue=df['restaurant_name'], palette="deep", legend=False,
              order=df['restaurant_name'].value_counts().index[:5])
plt.title("Top 5 Restaurants by Number of Orders")
plt.xlabel("Number of Orders")
plt.ylabel("Restaurant Name")
plt.show()

No description has been provided for this image

Observations:

The top 5 restaurants in terms of orders received are:

  1. Shake Shack
  2. The Meatball Shop
  3. Blue Ribbon Sushi
  4. Blue Ribbon Fried Chicken
  5. Parm

In [67]:

# Write the code here
# Filter data for weekends (labeled as "Weekend" in the dataset)
weekend_data = df[df['day_of_the_week'] == 'Weekend']
# Create a count plot for cuisine types on weekends
# Create a count plot for cuisine types on weekends with hue assigned and ordered by popularity
plt.figure(figsize=(10, 5))
sns.countplot(y=weekend_data['cuisine_type'], hue=weekend_data['cuisine_type'],
              palette=sns.color_palette("magma", len(weekend_data['cuisine_type'].unique())),
              order=weekend_data['cuisine_type'].value_counts().index, dodge=False)
plt.title("Most Popular Cuisine on Weekends")
plt.xlabel("Number of Orders")
plt.ylabel("Cuisine Type")
plt.legend([],[], frameon=False)  # Hide legend
plt.show()
# Get the most popular cuisine type on weekends
popular_cuisine_weekends = weekend_data['cuisine_type'].value_counts().idxmax()
# Display the most popular cuisine type
popular_cuisine_weekends

No description has been provided for this image
Out[67]:

'American'

Observations:

Clearly the most popular cuisine during weekends is American.

Question 9: What percentage of the orders cost more than 20 dollars? [2 marks]

In [68]:

# Write the code here
# Calculate the percentage of orders that cost more than 20 dollars
total_orders = len(df)
orders_above_20 = len(df[df['cost_of_the_order'] > 20])
percentage_above_20 = (orders_above_20 / total_orders) * 100
percentage_above_20

Out[68]:

29.24130663856691

In [69]:

# Create a pie chart to visualize the percentage of orders above and below $20
plt.figure(figsize=(6, 6))
labels = ["Orders > $20", "Orders ≤ $20"]
sizes = [orders_above_20, total_orders - orders_above_20]
colors = ["#FF9999", "#66B2FF"]
plt.pie(sizes, labels=labels, autopct='%1.1f%%', colors=colors, startangle=90)
plt.title("Percentage of Orders Above and Below $20")
plt.show()

No description has been provided for this image

Observations:

About 29% of the orders cost more than 20 dollars. The rest (~71%) are below.

Question 10: What is the mean order delivery time? [1 mark]

In [70]:

# Write the code here
mean_delivery_time = df['delivery_time'].mean()
mean_delivery_time

Out[70]:

24.161749209694417

In [71]:

# Create a histogram with mean line to visualize delivery time distribution
plt.figure(figsize=(8, 5))
sns.histplot(df['delivery_time'], kde=True, bins=20, color="blue", alpha=0.7)
plt.axvline(mean_delivery_time, color='red', linestyle='dashed', linewidth=2, label=f'Mean: {mean_delivery_time:.2f} min')
plt.title("Distribution of Delivery Time with Mean Indicator")
plt.xlabel("Delivery Time (minutes)")
plt.ylabel("Frequency")
plt.legend()
plt.show()

No description has been provided for this image

Observations:

The mean delivery time is 24.16 min.

The peak of the histogram suggests that most delivery times are close to the mean.
This indicates a relatively consistent delivery time for most orders.

The histogram is right-skewed (longer tail on the right), which means that some orders take significantly longer than the average.

Question 11: The company has decided to give 20% discount vouchers to the top 3 most frequent customers. Find the IDs of these customers and the number of orders they placed. [1 mark]

In [72]:

# Write the code here
top_3_customers = df['customer_id'].value_counts().head(3)
top_3_customers

Out[72]:

count
customer_id
52832 13
47440 10
83287 9


In [73]:

# Create a bar chart for the top 3 most frequent customers
plt.figure(figsize=(8, 5))
sns.barplot(x=top_3_customers.index, y=top_3_customers.values,
            hue=top_3_customers.index, palette="coolwarm", legend=False)
plt.title("Top 3 Most Frequent Customers")
plt.xlabel("Customer ID")
plt.ylabel("Number of Orders")
plt.show()

No description has been provided for this image

Observations:

The above charts show the top 3 customers.

Multivariate Analysis

Question 12: Perform a multivariate analysis to explore relationships between the important variables in the dataset. (It is a good idea to explore relations between numerical variables as well as relations between numerical and categorical variables) [10 marks]

In [79]:

# Calculating total time from order to delivery by summing food preparation time and delivery time
df['total_time_from_order_to_delivery'] = df['food_preparation_time'] + df['delivery_time']
# Creating a heatmap to visualize the correlation between total time, cost of the order, and rating
heatmap_data = df[['total_time_from_order_to_delivery', 'cost_of_the_order', 'rating']]
correlation = heatmap_data.corr()
plt.figure(figsize=(8, 6))
sns.heatmap(correlation, annot=True, cmap='coolwarm', fmt=".2f", linewidths=.5)
plt.title('Correlation Heatmap between Total Time, Cost, and Rating')
plt.show()

No description has been provided for this image

Observations:

  1. Total Time vs. Rating: The correlation between total time from order to delivery and rating is very low, as indicated by a coefficient close to zero. This reinforces the observation from the scatter plot that total time does not significantly influence the ratings directly.
  2. Cost vs. Rating: Similarly, the correlation between cost of the order and rating is also very low. This suggests that higher costs are not directly associated with higher customer satisfaction (or dissatisfaction), at least not linearly.
  3. Cost vs. Total Time: There is a negligible correlation between the cost of the order and the total time from order to delivery. This indicates that longer preparation and delivery times are not necessarily associated with higher costs.

In [84]:

# 1. Cuisine Type Analysis - Comparing average delivery and preparation times across cuisines
# Aggregating average times per cuisine
cuisine_times = df.groupby('cuisine_type')[['food_preparation_time', 'delivery_time']].mean().reset_index()
# Limiting to top 20 cuisines to prevent memory overload
top_cuisines = df['cuisine_type'].value_counts().index[:20]
cuisine_times_filtered = cuisine_times[cuisine_times['cuisine_type'].isin(top_cuisines)]
# Plotting average food preparation time by cuisine type
plt.figure(figsize=(10, 6))
sns.barplot(x='cuisine_type', y='food_preparation_time', hue='cuisine_type', data=cuisine_times_filtered, palette='viridis', dodge=False)
plt.title('Average Food Preparation Time by Cuisine Type (Top 20)')
plt.xlabel('Cuisine Type')
plt.ylabel('Average Food Preparation Time (minutes)')
plt.xticks(rotation=90)
plt.show()
# Plotting average delivery time by cuisine type
plt.figure(figsize=(10, 6))
sns.barplot(x='cuisine_type', y='delivery_time', hue='cuisine_type', data=cuisine_times_filtered, palette='magma', dodge=False)
plt.title('Average Delivery Time by Cuisine Type (Top 20)')
plt.xlabel('Cuisine Type')
plt.ylabel('Average Delivery Time (minutes)')
plt.xticks(rotation=90)
plt.show()

No description has been provided for this image
No description has been provided for this image

Observations:

Food Preparation Time by Cuisine Type:

  1. Certain cuisines take significantly longer to prepare than others.
    The variation in preparation time could be due to differences in dish complexity, cooking techniques, or restaurant efficiency.
    Delivery Time by Cuisine Type:
  2. Delivery times also vary across cuisines, which may be influenced by restaurant locations, demand levels, or how well the food travels.
    Some cuisines might be more frequently ordered from farther distances, leading to longer delivery times.

In [88]:

# 2. Restaurant Performance Analysis - Comparing average ratings, delivery times, and preparation times across restaurants
# Aggregating performance metrics per restaurant
restaurant_performance = df.groupby('restaurant_name')[['rating', 'food_preparation_time', 'delivery_time']].mean().reset_index()
# Limiting to top 20 most popular restaurants to prevent memory overload
top_restaurants = df['restaurant_name'].value_counts().index[:20]
restaurant_performance_filtered = restaurant_performance[restaurant_performance['restaurant_name'].isin(top_restaurants)]
# Plotting average ratings by restaurant
plt.figure(figsize=(14, 6))
sns.barplot(x='restaurant_name', y='rating', hue='restaurant_name', data=restaurant_performance_filtered, palette='coolwarm', dodge=False)
plt.title('Average Customer Rating by Restaurant (Top 20)')
plt.xlabel('Restaurant Name')
plt.ylabel('Average Rating')
plt.xticks(rotation=90)
plt.show()
# Plotting average food preparation time by restaurant
plt.figure(figsize=(14, 6))
sns.barplot(x='restaurant_name', y='food_preparation_time', hue='restaurant_name', data=restaurant_performance_filtered, palette='viridis', dodge=False)
plt.title('Average Food Preparation Time by Restaurant (Top 20)')
plt.xlabel('Restaurant Name')
plt.ylabel('Average Food Preparation Time (minutes)')
plt.xticks(rotation=90)
plt.show()
# Plotting average delivery time by restaurant
plt.figure(figsize=(14, 6))
sns.barplot(x='restaurant_name', y='delivery_time', hue='restaurant_name', data=restaurant_performance_filtered, palette='magma', dodge=False)
plt.title('Average Delivery Time by Restaurant (Top 20)')
plt.xlabel('Restaurant Name')
plt.ylabel('Average Delivery Time (minutes)')
plt.xticks(rotation=90)
plt.show()

No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

Observations

Average Ratings by Restaurant:

  1. Some restaurants consistently receive higher ratings than others, indicating better customer satisfaction.
    The variation in ratings may be influenced by factors such as food quality, service efficiency, and pricing.
    Average Food Preparation Time by Restaurant:
  2. Certain restaurants take significantly longer to prepare food than others.
    This could be due to the type of cuisine, restaurant efficiency, or the complexity of menu items.
    Average Delivery Time by Restaurant:
  3. Some restaurants have longer delivery times on average, which could be due to location, demand, or delivery efficiency.
    Identifying which restaurants have consistently longer delivery times may help FoodHub optimize delivery logistics.

In [92]:

# 3. Day of the Week Analysis - Comparing ratings, preparation times, and delivery times with hue set to x
# Aggregating performance metrics per day of the week
day_performance = df.groupby('day_of_the_week')[['rating', 'food_preparation_time', 'delivery_time']].mean().reset_index()
# Plotting average ratings by day of the week
plt.figure(figsize=(10, 6))
sns.barplot(x='day_of_the_week', y='rating', hue='day_of_the_week', data=day_performance, palette='coolwarm', dodge=False)
plt.title('Average Customer Rating by Day of the Week')
plt.xlabel('Day of the Week')
plt.ylabel('Average Rating')
plt.show()
# Plotting average food preparation time by day of the week
plt.figure(figsize=(10, 6))
sns.barplot(x='day_of_the_week', y='food_preparation_time', hue='day_of_the_week', data=day_performance, palette='viridis', dodge=False)
plt.title('Average Food Preparation Time by Day of the Week')
plt.xlabel('Day of the Week')
plt.ylabel('Average Food Preparation Time (minutes)')
plt.show()
# Plotting average delivery time by day of the week
plt.figure(figsize=(10, 6))
sns.barplot(x='day_of_the_week', y='delivery_time', hue='day_of_the_week', data=day_performance, palette='magma', dodge=False)
plt.title('Average Delivery Time by Day of the Week')
plt.xlabel('Day of the Week')
plt.ylabel('Average Delivery Time (minutes)')
plt.show()

No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

Observations:

Average Ratings by Day of the Week:

  1. Ratings remain relatively consistent across weekdays and weekends, indicating stable customer satisfaction regardless of the day.
    Average Food Preparation Time by Day of the Week:
  2. There is a slight increase in preparation times on weekends, suggesting a higher volume of orders or operational constraints.
    Restaurants might experience increased kitchen workloads on weekends, leading to slightly slower preparation.
    Average Delivery Time by Day of the Week:
  3. Delivery times show a similar pattern, with weekends experiencing slightly longer durations.
    Higher weekend demand could be affecting overall delivery efficiency.

Question 13: The company wants to provide a promotional offer in the advertisement of the restaurants. The condition to get the offer is that the restaurants must have a rating count of more than 50 and the average rating should be greater than 4. Find the restaurants fulfilling the criteria to get the promotional offer. [3 marks]

In [93]:

# Write the code here
# Group by restaurant and calculate rating count and average rating
restaurant_ratings = df.groupby('restaurant_name')['rating'].agg(['count', 'mean'])
# Filter restaurants meeting the criteria
qualified_restaurants = restaurant_ratings[(restaurant_ratings['count'] > 50) & (restaurant_ratings['mean'] > 4)]
qualified_restaurants

Out[93]:

count mean
restaurant_name
Blue Ribbon Fried Chicken 96 4.218750
Blue Ribbon Sushi 119 4.134454
Parm 68 4.073529
RedFarm Broadway 59 4.169492
RedFarm Hudson 55 4.109091
Shake Shack 219 4.168950
The Meatball Shop 132 4.689394


In [96]:

# Group by restaurant and calculate rating count and average rating
restaurant_ratings = df.groupby('restaurant_name')['rating'].agg(['count', 'mean'])
# Filter restaurants meeting the criteria (more than 50 ratings and average rating above 4)
qualified_restaurants = restaurant_ratings[(restaurant_ratings['count'] > 50) & (restaurant_ratings['mean'] > 4)]
# Sort the qualified restaurants by average rating in descending order
qualified_restaurants_sorted = qualified_restaurants.sort_values(by='mean', ascending=False).reset_index()
# Create a bar chart with sorted restaurants and assign the y variable to hue
plt.figure(figsize=(10, 5))
sns.barplot(y='restaurant_name', x='mean', hue='restaurant_name', data=qualified_restaurants_sorted, palette="coolwarm", dodge=False)
plt.title("Restaurants Eligible for Promotional Offer (Sorted by Avg Rating)")
plt.xlabel("Average Rating")
plt.ylabel("Restaurant Name")
plt.xlim(4, 5)  # Set x-axis range to focus on valid rating values
plt.show()

No description has been provided for this image

Observations:

This bar chart displays restaurants that qualify for a promotional offer based on:
Having more than 50 ratings (ensuring enough data points for reliability).
Maintaining an average rating above 4.0.
Restaurants are sorted in descending order of average rating, highlighting the best-rated ones.
The hue is assigned to the restaurant name, ensuring distinct colors for each.

Question 14: The company charges the restaurant 25% on the orders having cost greater than 20 dollars and 15% on the orders having cost greater than 5 dollars. Find the net revenue generated by the company across all orders. [3 marks]

In [97]:

# Write the code here
# Define commission rates based on order cost
def calculate_commission(cost):
    if cost > 20:
        return cost * 0.25  # 25% commission for orders above $20
    elif cost > 5:
        return cost * 0.15  # 15% commission for orders above $5
    else:
        return 0  # No commission for orders $5 or below
# Apply the commission calculation to each order
df['commission'] = df['cost_of_the_order'].apply(calculate_commission)
# Calculate total revenue generated by the company
total_revenue = df['commission'].sum()
total_revenue

Out[97]:

6166.303

Observations:

Net Revenue Generated by the Company:
The total revenue generated by the company from commissions across all orders is $6,166.30.

Question 15: The company wants to analyze the total time required to deliver the food. What percentage of orders take more than 60 minutes to get delivered from the time the order is placed? (The food has to be prepared and then delivered.) [2 marks]

In [100]:

# Write the code here
# Calculate total time from order placement to delivery
df['total_time_from_order_to_delivery'] = df['food_preparation_time'] + df['delivery_time']
# Count total number of orders
total_orders = len(df)
# Count orders that took more than 60 minutes
orders_above_60_min = df[df['total_time_from_order_to_delivery'] > 60].shape[0]
# Calculate percentage of orders taking more than 60 minutes
percentage_above_60_min = (orders_above_60_min / total_orders) * 100
round(percentage_above_60_min, 2)

Out[100]:

10.54

Observations:

10.54% of orders take more than 60 minutes from order placement to delivery.

Question 16: The company wants to analyze the delivery time of the orders on weekdays and weekends. How does the mean delivery time vary during weekdays and weekends? [2 marks]

In [102]:

# Write the code here
# Calculate the mean delivery time for weekdays and weekends
delivery_time_analysis = df.groupby('day_of_the_week')['delivery_time'].mean().reset_index()
# Plotting the variation in mean delivery time
plt.figure(figsize=(10, 6))
sns.barplot(x='day_of_the_week', y='delivery_time', data=delivery_time_analysis, palette='magma', hue='day_of_the_week', dodge=False)
plt.title('Mean Delivery Time on Weekdays vs. Weekends')
plt.xlabel('Day of the Week')
plt.ylabel('Mean Delivery Time (minutes)')
plt.show()
# Display mean delivery times for weekdays and weekends
delivery_time_analysis

No description has been provided for this image
Out[102]:

day_of_the_week delivery_time
0 Weekday 28.340037
1 Weekend 22.470022

Observations:

The mean delivery time is lower on weekends than on weekdays.

Possible reasons:
Lower traffic congestion on weekends, leading to faster deliveries.
More efficient delivery management due to higher expected demand on weekends.
Fewer business district orders (which might take longer due to distance or traffic).

Conclusion and Recommendations

Question 17: What are your conclusions from the analysis? What recommendations would you like to share to help improve the business? (You can use cuisine type and feedback ratings to drive your business recommendations.) [6 marks]

Conclusions:

My analysis of FoodHub’s order and delivery data provides valuable insights into restaurant performance, customer satisfaction, and operational efficiency. Key findings include:

  1. Delivery & Preparation Time Insights
  • 10.54% of orders take more than 60 minutes from order placement to delivery, which could negatively impact customer satisfaction.
  • Mean delivery time is longer on weekdays (28.34 min) than on weekends (22.47 min), suggesting that traffic congestion or restaurant availability might be factors affecting delivery speed.
  • Some cuisine types and restaurants have significantly longer preparation times, which contributes to overall delays.
  1. Restaurant & Cuisine Performance
  • Certain restaurants consistently receive high ratings (above 4.0) with more than 50 orders, making them strong candidates for promotional offers.
  • Some cuisines have higher preparation times than others, affecting overall delivery efficiency.
  • The most profitable orders come from customers who spend more than $20, as they generate the highest commissions for FoodHub.
  1. Customer Satisfaction & Revenue
  • Customer ratings are not strongly correlated with delivery time, suggesting that food quality and overall experience play a bigger role in satisfaction.
  • The company generated 6,166.30 dollars in commission revenue, with most revenue coming from orders above $20.
  • Certain restaurants consistently take longer to prepare food, which might contribute to lower ratings.

Recommendations:

  1. Improve Delivery & Operational Efficiency
  • Optimize restaurant partnerships: Identify restaurants with consistent delays and work with them to improve preparation efficiency.
  • Prioritize high-performing restaurants: Promote and feature restaurants with high ratings and good efficiency to improve customer satisfaction.
  • Reduce weekday delivery delays: Explore better delivery scheduling and traffic-optimized routing for faster weekday deliveries.
  1. Enhance Customer Satisfaction & Engagement
  • Encourage high-rated restaurants: Offer incentives or priority placement in the app for restaurants maintaining an average rating above 4.0.
  • Improve customer communication: Notify customers in real-time if orders are expected to take longer than 60 minutes to manage expectations.
  • Introduce loyalty rewards: Offer discounts or free delivery for customers who frequently order from top-rated restaurants.
  1. Revenue Optimization Strategies
  • Increase commissions on premium orders: Since most revenue comes from orders above $20, consider adjusting pricing models to encourage higher-value purchases.
  • Feature premium & high-performing cuisines: If certain cuisines generate higher-value orders, highlight them in the app for better visibility.
  • Reduce inefficiencies in long-prep-time cuisines: Work with restaurants offering slow-prep cuisines to improve kitchen efficiency and reduce preparation times.

You may also like

Leave a Reply

Discover more from Mario Mamalis

Subscribe now to keep reading and get access to the full archive.

Continue reading