Explore a data science case study, showcasing data-driven strategies for exploring, cleaning and preparing data for statistical analysis and visualizations.
Project Foundations for Data Science: FoodHub Data Analysis¶
Context¶
The number of restaurants in New York is increasing day by day. Lots of students and busy professionals rely on those restaurants due to their hectic lifestyles. Online food delivery service is a great option for them. It provides them with good food from their favorite restaurants. A food aggregator company FoodHub offers access to multiple restaurants through a single smartphone app.
The app allows the restaurants to receive a direct online order from a customer. The app assigns a delivery person from the company to pick up the order after it is confirmed by the restaurant. The delivery person then uses the map to reach the restaurant and waits for the food package. Once the food package is handed over to the delivery person, he/she confirms the pick-up in the app and travels to the customer’s location to deliver the food. The delivery person confirms the drop-off in the app after delivering the food package to the customer. The customer can rate the order in the app. The food aggregator earns money by collecting a fixed margin of the delivery order from the restaurants.
Objective¶
The food aggregator company has stored the data of the different orders made by the registered customers in their online portal. They want to analyze the data to get a fair idea about the demand of different restaurants which will help them in enhancing their customer experience. Suppose you are hired as a Data Scientist in this company and the Data Science team has shared some of the key questions that need to be answered. Perform the data analysis to find answers to these questions that will help the company to improve the business.
Data Description¶
The data contains the different data related to a food order. The detailed data dictionary is given below.
Data Dictionary¶
- order_id: Unique ID of the order
- customer_id: ID of the customer who ordered the food
- restaurant_name: Name of the restaurant
- cuisine_type: Cuisine ordered by the customer
- cost: Cost of the order
- day_of_the_week: Indicates whether the order is placed on a weekday or weekend (The weekday is from Monday to Friday and the weekend is Saturday and Sunday)
- rating: Rating given by the customer out of 5
- food_preparation_time: Time (in minutes) taken by the restaurant to prepare the food. This is calculated by taking the difference between the timestamps of the restaurant’s order confirmation and the delivery person’s pick-up confirmation.
- delivery_time: Time (in minutes) taken by the delivery person to deliver the food package. This is calculated by taking the difference between the timestamps of the delivery person’s pick-up confirmation and drop-off information
Let us start by importing the required libraries¶
In [9]:
# import libraries for data manipulation import numpy as np import pandas as pd # import libraries for data visualization import matplotlib.pyplot as plt import seaborn as sns
Understanding the structure of the data¶
In [10]:
#Access the drive
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
In [31]:
# read the data
df = pd.read_csv('/content/drive/MyDrive/MIT-ADSP/0131-Foundations-Python-Statistics/Week 2/Project - Food Hub/foodhub_order.csv')
# returns the first 5 rows
df.head()
Out[31]:
| order_id | customer_id | restaurant_name | cuisine_type | cost_of_the_order | day_of_the_week | rating | food_preparation_time | delivery_time | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1477147 | 337525 | Hangawi | Korean | 30.75 | Weekend | Not given | 25 | 20 |
| 1 | 1477685 | 358141 | Blue Ribbon Sushi Izakaya | Japanese | 12.08 | Weekend | Not given | 25 | 23 |
| 2 | 1477070 | 66393 | Cafe Habana | Mexican | 12.23 | Weekday | 5 | 23 | 28 |
| 3 | 1477334 | 106968 | Blue Ribbon Fried Chicken | American | 29.20 | Weekend | 3 | 25 | 15 |
| 4 | 1478249 | 76942 | Dirty Bird to Go | American | 11.59 | Weekday | 4 | 25 | 24 |
Observations:¶
The DataFrame has 9 columns as mentioned in the Data Dictionary. Data in each row corresponds to the order placed by a customer.
Question 1: How many rows and columns are present in the data? [0.5 mark]¶
In [32]:
# Write your code here
rows, columns = df.shape
print(f"There are {rows} rows and {columns} columns present in the data.")
There are 1898 rows and 9 columns present in the data.
Observations:¶
There are 1898 rows and 9 columns present in the data. This means that there are 1898 orders that we can analyze.
Question 2: What are the datatypes of the different columns in the dataset? (The info() function can be used) [0.5 mark]¶
In [33]:
# Use info() to print a concise summary of the DataFrame df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1898 entries, 0 to 1897 Data columns (total 9 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 order_id 1898 non-null int64 1 customer_id 1898 non-null int64 2 restaurant_name 1898 non-null object 3 cuisine_type 1898 non-null object 4 cost_of_the_order 1898 non-null float64 5 day_of_the_week 1898 non-null object 6 rating 1898 non-null object 7 food_preparation_time 1898 non-null int64 8 delivery_time 1898 non-null int64 dtypes: float64(1), int64(4), object(4) memory usage: 133.6+ KB
Observations:¶
- order_id, customer_id, food_preparation_time and delivery_time are integers.
- restaurant_name, cuisine_type, day_of_the_week and rating are strings.
- cost_of_ther_order is a float (decimal)
- rating should be a number. I will convert that at some point.
Question 3: Are there any missing values in the data? If yes, treat them using an appropriate method. [1 mark]¶
In [34]:
# Write your code here df.isnull().sum()
Out[34]:
| 0 | |
|---|---|
| order_id | 0 |
| customer_id | 0 |
| restaurant_name | 0 |
| cuisine_type | 0 |
| cost_of_the_order | 0 |
| day_of_the_week | 0 |
| rating | 0 |
| food_preparation_time | 0 |
| delivery_time | 0 |
Observations:¶
- There were no columns on any rows without values.
- There are instances of certain rows with “rating” missing and instead having the string “Not given”. In order to address this, I will convert the rating values to integer values later in this workbook since the rest of the ratings are integers (in string format before).
Question 4: Check the statistical summary of the data. What is the minimum, average, and maximum time it takes for food to be prepared once an order is placed? [2 marks]¶
In [35]:
# Write your code here df.describe()
Out[35]:
| order_id | customer_id | cost_of_the_order | food_preparation_time | delivery_time | |
|---|---|---|---|---|---|
| count | 1.898000e+03 | 1898.000000 | 1898.000000 | 1898.000000 | 1898.000000 |
| mean | 1.477496e+06 | 171168.478398 | 16.498851 | 27.371970 | 24.161749 |
| std | 5.480497e+02 | 113698.139743 | 7.483812 | 4.632481 | 4.972637 |
| min | 1.476547e+06 | 1311.000000 | 4.470000 | 20.000000 | 15.000000 |
| 25% | 1.477021e+06 | 77787.750000 | 12.080000 | 23.000000 | 20.000000 |
| 50% | 1.477496e+06 | 128600.000000 | 14.140000 | 27.000000 | 25.000000 |
| 75% | 1.477970e+06 | 270525.000000 | 22.297500 | 31.000000 | 28.000000 |
| max | 1.478444e+06 | 405334.000000 | 35.410000 | 35.000000 | 33.000000 |
In [36]:
# Write your code here # Display only the statistics for "food_preparation_time" food_prep_stats = df['food_preparation_time'].describe() food_prep_stats
Out[36]:
| food_preparation_time | |
|---|---|
| count | 1898.000000 |
| mean | 27.371970 |
| std | 4.632481 |
| min | 20.000000 |
| 25% | 23.000000 |
| 50% | 27.000000 |
| 75% | 31.000000 |
| max | 35.000000 |
In [37]:
# Extract only the minimum, mean, and maximum values for "food_preparation_time" food_prep_specific_stats = df['food_preparation_time'].agg(['min', 'mean', 'max']).round(2) food_prep_specific_stats
Out[37]:
| food_preparation_time | |
|---|---|
| min | 20.00 |
| mean | 27.37 |
| max | 35.00 |
Observations:¶
Food Prep Stats
- Minimum Prep Time: 20 minutes
- Average Prep Time: 27.37 minutes
- Maximun Prep Time: 35 minutes
Question 5: How many orders are not rated? [1 mark]¶
In [38]:
# Write the code here # Count the number of rows where "rating" is "Not given" not_given_count = (df['rating'] == "Not given").sum() not_given_count
Out[38]:
736
Observations:¶
There are 736 instances where rating was “Not given”.
In [40]:
# Convert 'Not given' to NaN
df['rating'] = pd.to_numeric(df['rating'], errors='coerce')
# Calculate the mean rating per restaurant
mean_ratings_per_restaurant = df.groupby('restaurant_name')['rating'].mean().round()
# Use the map function to replace NaN in 'rating' with the mean rating of the respective restaurant
df['rating'] = df.apply(
lambda row: mean_ratings_per_restaurant[row['restaurant_name']] if pd.isna(row['rating']) else row['rating'],
axis=1
)
# Check if there are still NaN values and replace them directly without inplace=True
overall_mean_rating = round(df['rating'].mean())
df['rating'] = df['rating'].fillna(overall_mean_rating)
# Convert 'rating' to integer
df['rating'] = df['rating'].astype(int)
df.head()
Out[40]:
| order_id | customer_id | restaurant_name | cuisine_type | cost_of_the_order | day_of_the_week | rating | food_preparation_time | delivery_time | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1477147 | 337525 | Hangawi | Korean | 30.75 | Weekend | 4 | 25 | 20 |
| 1 | 1477685 | 358141 | Blue Ribbon Sushi Izakaya | Japanese | 12.08 | Weekend | 4 | 25 | 23 |
| 2 | 1477070 | 66393 | Cafe Habana | Mexican | 12.23 | Weekday | 5 | 23 | 28 |
| 3 | 1477334 | 106968 | Blue Ribbon Fried Chicken | American | 29.20 | Weekend | 3 | 25 | 15 |
| 4 | 1478249 | 76942 | Dirty Bird to Go | American | 11.59 | Weekday | 4 | 25 | 24 |
In [41]:
# Count the number of rows where "rating" is "Not given" just to prove that # we addressed that issue. not_given_count = (df['rating'] == "Not given").sum() not_given_count
Out[41]:
0
Observations:¶
I converted all ratings to integers and assigned the mean (targeted mean per restaurant based on other orders) rounded to the nearest integer where the value was “Not given”. I will use these ratings
Exploratory Data Analysis (EDA)¶
Univariate Analysis¶
Question 6: Explore all the variables and provide observations on their distributions. (Generally, histograms, boxplots, countplots, etc. are used for univariate exploration.) [9 marks]¶
In [42]:
# Let's eyball the girst 5 rows again df.head()
Out[42]:
| order_id | customer_id | restaurant_name | cuisine_type | cost_of_the_order | day_of_the_week | rating | food_preparation_time | delivery_time | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1477147 | 337525 | Hangawi | Korean | 30.75 | Weekend | 4 | 25 | 20 |
| 1 | 1477685 | 358141 | Blue Ribbon Sushi Izakaya | Japanese | 12.08 | Weekend | 4 | 25 | 23 |
| 2 | 1477070 | 66393 | Cafe Habana | Mexican | 12.23 | Weekday | 5 | 23 | 28 |
| 3 | 1477334 | 106968 | Blue Ribbon Fried Chicken | American | 29.20 | Weekend | 3 | 25 | 15 |
| 4 | 1478249 | 76942 | Dirty Bird to Go | American | 11.59 | Weekday | 4 | 25 | 24 |
In [51]:
# Set plot style
sns.set_style("whitegrid")
# Creating a histogram for the cost of the orders
plt.figure(figsize=(10, 6))
sns.histplot(df['cost_of_the_order'], bins=30, kde=True, color="blue")
plt.title('Distribution of Cost of the Orders')
plt.xlabel('Cost of the Order ($)')
plt.ylabel('Frequency')
plt.show()
Observations:¶
- Skewed Distribution: The distribution of the cost of orders is right-skewed, indicating that most of the orders are concentrated in the lower price range, with fewer orders as the price increases.
- Common Price Range: The most common price range for orders appears to be between approximately $10 and $20. This suggests that most customers opt for moderately priced items.
In [52]:
# Creating a box plot for the cost of the orders
plt.figure(figsize=(8, 6))
sns.boxplot(x=df['cost_of_the_order'])
plt.title('Box Plot of the Cost of Orders')
plt.xlabel('Cost of the Order ($)')
plt.show()
Observations:¶
- Central Tendency: The median cost of the orders is around $15-$20, suggesting that this is a typical amount customers spend per order.
- Spread and Variability: The interquartile range (IQR), represented by the box, is relatively compact, indicating that the majority of orders cluster around the median price, within a moderate price range.
In [53]:
# Create a count plot for "rating"
plt.figure(figsize=(8, 5))
sns.countplot(x=df['rating'], hue=df['rating'], palette="viridis", legend=False,
order=sorted(df['rating'].unique()))
plt.title("Count of Each Rating")
plt.xlabel("Rating")
plt.ylabel("Count")
plt.show()
Observations:¶
- Overall, ratings are good, with most being 4 and above.
In [45]:
# Create a boxplot for "food_preparation_time"
plt.figure(figsize=(8, 5))
sns.boxplot(y=df['food_preparation_time'], color="green")
plt.title("Boxplot of Food Preparation Time")
plt.ylabel("Food Preparation Time (minutes)")
plt.show()
Observations:¶
- Median and Central Tendency: The median food preparation time is centered around 25 minutes. This indicates that half of the food orders are prepared in less than 25 minutes, and the other half takes longer, showcasing a balance in preparation times across the dataset.
- Interquartile Range (IQR): The IQR, depicted by the box, is relatively narrow, which suggests that the majority of food preparation times are consistent and cluster around the median. This consistency might be indicative of standardized processes within restaurants or similar types of dishes being ordered.
- Variability: The whiskers, which extend to the lowest and highest typical preparation times, show a wider range than the IQR. This variability outside the central cluster can indicate differences in restaurant efficiency, menu complexity, or operational challenges during peak times.
- Operational Insights: Restaurants and the food delivery service can use this data to identify and investigate the reasons behind unusually long or short preparation times. Addressing these could improve overall efficiency, reduce customer wait times, and enhance satisfaction.
In [65]:
# Creating a count plot for delivery time
plt.figure(figsize=(10, 6))
sns.countplot(x='delivery_time', data=df, hue='delivery_time', dodge=False, palette='viridis')
plt.title('Count of Orders by Delivery Time')
plt.xlabel('Delivery Time (minutes)')
plt.ylabel('Count of Orders')
plt.xticks(rotation=90) # Rotating x-axis labels for better visibility
plt.legend().set_visible(False) # Properly setting legend visibility
plt.show()
In [56]:
# Creating a box plot for delivery time
plt.figure(figsize=(8, 6))
sns.boxplot(x=df['delivery_time'])
plt.title('Box Plot of Delivery Time')
plt.xlabel('Delivery Time (minutes)')
plt.show()
Observations:¶
- Median Delivery Time: The median delivery time is visually apparent in the plot, suggesting a typical delivery time that most orders adhere to. This median time provides a central benchmark for evaluating delivery efficiency.
- Interquartile Range (IQR): The box, which encapsulates the middle 50% of delivery times, is relatively tight. This indicates that a majority of deliveries are consistent in duration, showing effective standardization and predictability in the delivery process.
- Outliers: The plot reveals several outliers on the higher end, indicating deliveries that take significantly longer than usual. These outliers might be due to external factors such as traffic conditions, distance, or order complications.
- Operational Insights: Identifying the reasons behind the longer delivery times could help in optimizing routes, improving delivery scheduling, and potentially selecting more reliable delivery methods or personnel. Similarly, analyzing why some deliveries are unusually quick could highlight best practices that could be replicated across the service.
In [49]:
# Create a count plot for "cuisine_type"
plt.figure(figsize=(10, 5))
sns.countplot(y=df['cuisine_type'], hue=df['cuisine_type'], palette="viridis", legend=False,
order=df['cuisine_type'].value_counts().index)
plt.title("Count Plot of Cuisine Type")
plt.xlabel("Count")
plt.ylabel("Cuisine Type")
plt.show()
Observations:¶
Clearly the top cuisine types are American, Japanese and Italian.
In [50]:
# Create a count plot for "day_of_the_week"
plt.figure(figsize=(8, 5))
sns.countplot(x=df['day_of_the_week'], hue=df['day_of_the_week'], palette="magma", legend=False,
order=df['day_of_the_week'].value_counts().index)
plt.title("Count Plot of Day of the Week")
plt.xlabel("Day of the Week")
plt.ylabel("Count")
plt.xticks(rotation=45)
plt.show()
Observations:¶
Clearly a lot more orders are placed during the weekend!
Question 7: Which are the top 5 restaurants in terms of the number of orders received? [1 mark]¶
In [66]:
# Write the code here
# Create a count plot for the top 5 restaurants
plt.figure(figsize=(10, 5))
sns.countplot(y=df['restaurant_name'], hue=df['restaurant_name'], palette="deep", legend=False,
order=df['restaurant_name'].value_counts().index[:5])
plt.title("Top 5 Restaurants by Number of Orders")
plt.xlabel("Number of Orders")
plt.ylabel("Restaurant Name")
plt.show()
Observations:¶
The top 5 restaurants in terms of orders received are:
- Shake Shack
- The Meatball Shop
- Blue Ribbon Sushi
- Blue Ribbon Fried Chicken
- Parm
Question 8: Which is the most popular cuisine on weekends? [1 mark]¶
In [67]:
# Write the code here
# Filter data for weekends (labeled as "Weekend" in the dataset)
weekend_data = df[df['day_of_the_week'] == 'Weekend']
# Create a count plot for cuisine types on weekends
# Create a count plot for cuisine types on weekends with hue assigned and ordered by popularity
plt.figure(figsize=(10, 5))
sns.countplot(y=weekend_data['cuisine_type'], hue=weekend_data['cuisine_type'],
palette=sns.color_palette("magma", len(weekend_data['cuisine_type'].unique())),
order=weekend_data['cuisine_type'].value_counts().index, dodge=False)
plt.title("Most Popular Cuisine on Weekends")
plt.xlabel("Number of Orders")
plt.ylabel("Cuisine Type")
plt.legend([],[], frameon=False) # Hide legend
plt.show()
# Get the most popular cuisine type on weekends
popular_cuisine_weekends = weekend_data['cuisine_type'].value_counts().idxmax()
# Display the most popular cuisine type
popular_cuisine_weekends
Out[67]:
'American'
Observations:¶
Clearly the most popular cuisine during weekends is American.
Question 9: What percentage of the orders cost more than 20 dollars? [2 marks]¶
In [68]:
# Write the code here # Calculate the percentage of orders that cost more than 20 dollars total_orders = len(df) orders_above_20 = len(df[df['cost_of_the_order'] > 20]) percentage_above_20 = (orders_above_20 / total_orders) * 100 percentage_above_20
Out[68]:
29.24130663856691
In [69]:
# Create a pie chart to visualize the percentage of orders above and below $20
plt.figure(figsize=(6, 6))
labels = ["Orders > $20", "Orders ≤ $20"]
sizes = [orders_above_20, total_orders - orders_above_20]
colors = ["#FF9999", "#66B2FF"]
plt.pie(sizes, labels=labels, autopct='%1.1f%%', colors=colors, startangle=90)
plt.title("Percentage of Orders Above and Below $20")
plt.show()
Observations:¶
About 29% of the orders cost more than 20 dollars. The rest (~71%) are below.
Question 10: What is the mean order delivery time? [1 mark]¶
In [70]:
# Write the code here mean_delivery_time = df['delivery_time'].mean() mean_delivery_time
Out[70]:
24.161749209694417
In [71]:
# Create a histogram with mean line to visualize delivery time distribution
plt.figure(figsize=(8, 5))
sns.histplot(df['delivery_time'], kde=True, bins=20, color="blue", alpha=0.7)
plt.axvline(mean_delivery_time, color='red', linestyle='dashed', linewidth=2, label=f'Mean: {mean_delivery_time:.2f} min')
plt.title("Distribution of Delivery Time with Mean Indicator")
plt.xlabel("Delivery Time (minutes)")
plt.ylabel("Frequency")
plt.legend()
plt.show()
Observations:¶
The mean delivery time is 24.16 min.
The peak of the histogram suggests that most delivery times are close to the mean.
This indicates a relatively consistent delivery time for most orders.
The histogram is right-skewed (longer tail on the right), which means that some orders take significantly longer than the average.
Question 11: The company has decided to give 20% discount vouchers to the top 3 most frequent customers. Find the IDs of these customers and the number of orders they placed. [1 mark]¶
In [72]:
# Write the code here top_3_customers = df['customer_id'].value_counts().head(3) top_3_customers
Out[72]:
| count | |
|---|---|
| customer_id | |
| 52832 | 13 |
| 47440 | 10 |
| 83287 | 9 |
In [73]:
# Create a bar chart for the top 3 most frequent customers
plt.figure(figsize=(8, 5))
sns.barplot(x=top_3_customers.index, y=top_3_customers.values,
hue=top_3_customers.index, palette="coolwarm", legend=False)
plt.title("Top 3 Most Frequent Customers")
plt.xlabel("Customer ID")
plt.ylabel("Number of Orders")
plt.show()
Observations:¶
The above charts show the top 3 customers.
Multivariate Analysis¶
Question 12: Perform a multivariate analysis to explore relationships between the important variables in the dataset. (It is a good idea to explore relations between numerical variables as well as relations between numerical and categorical variables) [10 marks]¶
In [79]:
# Calculating total time from order to delivery by summing food preparation time and delivery time
df['total_time_from_order_to_delivery'] = df['food_preparation_time'] + df['delivery_time']
# Creating a heatmap to visualize the correlation between total time, cost of the order, and rating
heatmap_data = df[['total_time_from_order_to_delivery', 'cost_of_the_order', 'rating']]
correlation = heatmap_data.corr()
plt.figure(figsize=(8, 6))
sns.heatmap(correlation, annot=True, cmap='coolwarm', fmt=".2f", linewidths=.5)
plt.title('Correlation Heatmap between Total Time, Cost, and Rating')
plt.show()
Observations:¶
- Total Time vs. Rating: The correlation between total time from order to delivery and rating is very low, as indicated by a coefficient close to zero. This reinforces the observation from the scatter plot that total time does not significantly influence the ratings directly.
- Cost vs. Rating: Similarly, the correlation between cost of the order and rating is also very low. This suggests that higher costs are not directly associated with higher customer satisfaction (or dissatisfaction), at least not linearly.
- Cost vs. Total Time: There is a negligible correlation between the cost of the order and the total time from order to delivery. This indicates that longer preparation and delivery times are not necessarily associated with higher costs.
In [84]:
# 1. Cuisine Type Analysis - Comparing average delivery and preparation times across cuisines
# Aggregating average times per cuisine
cuisine_times = df.groupby('cuisine_type')[['food_preparation_time', 'delivery_time']].mean().reset_index()
# Limiting to top 20 cuisines to prevent memory overload
top_cuisines = df['cuisine_type'].value_counts().index[:20]
cuisine_times_filtered = cuisine_times[cuisine_times['cuisine_type'].isin(top_cuisines)]
# Plotting average food preparation time by cuisine type
plt.figure(figsize=(10, 6))
sns.barplot(x='cuisine_type', y='food_preparation_time', hue='cuisine_type', data=cuisine_times_filtered, palette='viridis', dodge=False)
plt.title('Average Food Preparation Time by Cuisine Type (Top 20)')
plt.xlabel('Cuisine Type')
plt.ylabel('Average Food Preparation Time (minutes)')
plt.xticks(rotation=90)
plt.show()
# Plotting average delivery time by cuisine type
plt.figure(figsize=(10, 6))
sns.barplot(x='cuisine_type', y='delivery_time', hue='cuisine_type', data=cuisine_times_filtered, palette='magma', dodge=False)
plt.title('Average Delivery Time by Cuisine Type (Top 20)')
plt.xlabel('Cuisine Type')
plt.ylabel('Average Delivery Time (minutes)')
plt.xticks(rotation=90)
plt.show()
Observations:¶
Food Preparation Time by Cuisine Type:
- Certain cuisines take significantly longer to prepare than others.
The variation in preparation time could be due to differences in dish complexity, cooking techniques, or restaurant efficiency.
Delivery Time by Cuisine Type: - Delivery times also vary across cuisines, which may be influenced by restaurant locations, demand levels, or how well the food travels.
Some cuisines might be more frequently ordered from farther distances, leading to longer delivery times.
In [88]:
# 2. Restaurant Performance Analysis - Comparing average ratings, delivery times, and preparation times across restaurants
# Aggregating performance metrics per restaurant
restaurant_performance = df.groupby('restaurant_name')[['rating', 'food_preparation_time', 'delivery_time']].mean().reset_index()
# Limiting to top 20 most popular restaurants to prevent memory overload
top_restaurants = df['restaurant_name'].value_counts().index[:20]
restaurant_performance_filtered = restaurant_performance[restaurant_performance['restaurant_name'].isin(top_restaurants)]
# Plotting average ratings by restaurant
plt.figure(figsize=(14, 6))
sns.barplot(x='restaurant_name', y='rating', hue='restaurant_name', data=restaurant_performance_filtered, palette='coolwarm', dodge=False)
plt.title('Average Customer Rating by Restaurant (Top 20)')
plt.xlabel('Restaurant Name')
plt.ylabel('Average Rating')
plt.xticks(rotation=90)
plt.show()
# Plotting average food preparation time by restaurant
plt.figure(figsize=(14, 6))
sns.barplot(x='restaurant_name', y='food_preparation_time', hue='restaurant_name', data=restaurant_performance_filtered, palette='viridis', dodge=False)
plt.title('Average Food Preparation Time by Restaurant (Top 20)')
plt.xlabel('Restaurant Name')
plt.ylabel('Average Food Preparation Time (minutes)')
plt.xticks(rotation=90)
plt.show()
# Plotting average delivery time by restaurant
plt.figure(figsize=(14, 6))
sns.barplot(x='restaurant_name', y='delivery_time', hue='restaurant_name', data=restaurant_performance_filtered, palette='magma', dodge=False)
plt.title('Average Delivery Time by Restaurant (Top 20)')
plt.xlabel('Restaurant Name')
plt.ylabel('Average Delivery Time (minutes)')
plt.xticks(rotation=90)
plt.show()
Observations¶
Average Ratings by Restaurant:
- Some restaurants consistently receive higher ratings than others, indicating better customer satisfaction.
The variation in ratings may be influenced by factors such as food quality, service efficiency, and pricing.
Average Food Preparation Time by Restaurant: - Certain restaurants take significantly longer to prepare food than others.
This could be due to the type of cuisine, restaurant efficiency, or the complexity of menu items.
Average Delivery Time by Restaurant: - Some restaurants have longer delivery times on average, which could be due to location, demand, or delivery efficiency.
Identifying which restaurants have consistently longer delivery times may help FoodHub optimize delivery logistics.
In [92]:
# 3. Day of the Week Analysis - Comparing ratings, preparation times, and delivery times with hue set to x
# Aggregating performance metrics per day of the week
day_performance = df.groupby('day_of_the_week')[['rating', 'food_preparation_time', 'delivery_time']].mean().reset_index()
# Plotting average ratings by day of the week
plt.figure(figsize=(10, 6))
sns.barplot(x='day_of_the_week', y='rating', hue='day_of_the_week', data=day_performance, palette='coolwarm', dodge=False)
plt.title('Average Customer Rating by Day of the Week')
plt.xlabel('Day of the Week')
plt.ylabel('Average Rating')
plt.show()
# Plotting average food preparation time by day of the week
plt.figure(figsize=(10, 6))
sns.barplot(x='day_of_the_week', y='food_preparation_time', hue='day_of_the_week', data=day_performance, palette='viridis', dodge=False)
plt.title('Average Food Preparation Time by Day of the Week')
plt.xlabel('Day of the Week')
plt.ylabel('Average Food Preparation Time (minutes)')
plt.show()
# Plotting average delivery time by day of the week
plt.figure(figsize=(10, 6))
sns.barplot(x='day_of_the_week', y='delivery_time', hue='day_of_the_week', data=day_performance, palette='magma', dodge=False)
plt.title('Average Delivery Time by Day of the Week')
plt.xlabel('Day of the Week')
plt.ylabel('Average Delivery Time (minutes)')
plt.show()
Observations:¶
Average Ratings by Day of the Week:
- Ratings remain relatively consistent across weekdays and weekends, indicating stable customer satisfaction regardless of the day.
Average Food Preparation Time by Day of the Week: - There is a slight increase in preparation times on weekends, suggesting a higher volume of orders or operational constraints.
Restaurants might experience increased kitchen workloads on weekends, leading to slightly slower preparation.
Average Delivery Time by Day of the Week: - Delivery times show a similar pattern, with weekends experiencing slightly longer durations.
Higher weekend demand could be affecting overall delivery efficiency.
Question 13: The company wants to provide a promotional offer in the advertisement of the restaurants. The condition to get the offer is that the restaurants must have a rating count of more than 50 and the average rating should be greater than 4. Find the restaurants fulfilling the criteria to get the promotional offer. [3 marks]¶
In [93]:
# Write the code here
# Group by restaurant and calculate rating count and average rating
restaurant_ratings = df.groupby('restaurant_name')['rating'].agg(['count', 'mean'])
# Filter restaurants meeting the criteria
qualified_restaurants = restaurant_ratings[(restaurant_ratings['count'] > 50) & (restaurant_ratings['mean'] > 4)]
qualified_restaurants
Out[93]:
| count | mean | |
|---|---|---|
| restaurant_name | ||
| Blue Ribbon Fried Chicken | 96 | 4.218750 |
| Blue Ribbon Sushi | 119 | 4.134454 |
| Parm | 68 | 4.073529 |
| RedFarm Broadway | 59 | 4.169492 |
| RedFarm Hudson | 55 | 4.109091 |
| Shake Shack | 219 | 4.168950 |
| The Meatball Shop | 132 | 4.689394 |
In [96]:
# Group by restaurant and calculate rating count and average rating
restaurant_ratings = df.groupby('restaurant_name')['rating'].agg(['count', 'mean'])
# Filter restaurants meeting the criteria (more than 50 ratings and average rating above 4)
qualified_restaurants = restaurant_ratings[(restaurant_ratings['count'] > 50) & (restaurant_ratings['mean'] > 4)]
# Sort the qualified restaurants by average rating in descending order
qualified_restaurants_sorted = qualified_restaurants.sort_values(by='mean', ascending=False).reset_index()
# Create a bar chart with sorted restaurants and assign the y variable to hue
plt.figure(figsize=(10, 5))
sns.barplot(y='restaurant_name', x='mean', hue='restaurant_name', data=qualified_restaurants_sorted, palette="coolwarm", dodge=False)
plt.title("Restaurants Eligible for Promotional Offer (Sorted by Avg Rating)")
plt.xlabel("Average Rating")
plt.ylabel("Restaurant Name")
plt.xlim(4, 5) # Set x-axis range to focus on valid rating values
plt.show()
Observations:¶
This bar chart displays restaurants that qualify for a promotional offer based on:
Having more than 50 ratings (ensuring enough data points for reliability).
Maintaining an average rating above 4.0.
Restaurants are sorted in descending order of average rating, highlighting the best-rated ones.
The hue is assigned to the restaurant name, ensuring distinct colors for each.
Question 14: The company charges the restaurant 25% on the orders having cost greater than 20 dollars and 15% on the orders having cost greater than 5 dollars. Find the net revenue generated by the company across all orders. [3 marks]¶
In [97]:
# Write the code here
# Define commission rates based on order cost
def calculate_commission(cost):
if cost > 20:
return cost * 0.25 # 25% commission for orders above $20
elif cost > 5:
return cost * 0.15 # 15% commission for orders above $5
else:
return 0 # No commission for orders $5 or below
# Apply the commission calculation to each order
df['commission'] = df['cost_of_the_order'].apply(calculate_commission)
# Calculate total revenue generated by the company
total_revenue = df['commission'].sum()
total_revenue
Out[97]:
6166.303
Observations:¶
Net Revenue Generated by the Company:
The total revenue generated by the company from commissions across all orders is $6,166.30.
Question 15: The company wants to analyze the total time required to deliver the food. What percentage of orders take more than 60 minutes to get delivered from the time the order is placed? (The food has to be prepared and then delivered.) [2 marks]¶
In [100]:
# Write the code here # Calculate total time from order placement to delivery df['total_time_from_order_to_delivery'] = df['food_preparation_time'] + df['delivery_time'] # Count total number of orders total_orders = len(df) # Count orders that took more than 60 minutes orders_above_60_min = df[df['total_time_from_order_to_delivery'] > 60].shape[0] # Calculate percentage of orders taking more than 60 minutes percentage_above_60_min = (orders_above_60_min / total_orders) * 100 round(percentage_above_60_min, 2)
Out[100]:
10.54
Observations:¶
10.54% of orders take more than 60 minutes from order placement to delivery.
Question 16: The company wants to analyze the delivery time of the orders on weekdays and weekends. How does the mean delivery time vary during weekdays and weekends? [2 marks]¶
In [102]:
# Write the code here
# Calculate the mean delivery time for weekdays and weekends
delivery_time_analysis = df.groupby('day_of_the_week')['delivery_time'].mean().reset_index()
# Plotting the variation in mean delivery time
plt.figure(figsize=(10, 6))
sns.barplot(x='day_of_the_week', y='delivery_time', data=delivery_time_analysis, palette='magma', hue='day_of_the_week', dodge=False)
plt.title('Mean Delivery Time on Weekdays vs. Weekends')
plt.xlabel('Day of the Week')
plt.ylabel('Mean Delivery Time (minutes)')
plt.show()
# Display mean delivery times for weekdays and weekends
delivery_time_analysis
Out[102]:
| day_of_the_week | delivery_time | |
|---|---|---|
| 0 | Weekday | 28.340037 |
| 1 | Weekend | 22.470022 |
Observations:¶
The mean delivery time is lower on weekends than on weekdays.
Possible reasons:
Lower traffic congestion on weekends, leading to faster deliveries.
More efficient delivery management due to higher expected demand on weekends.
Fewer business district orders (which might take longer due to distance or traffic).
Conclusion and Recommendations¶
Question 17: What are your conclusions from the analysis? What recommendations would you like to share to help improve the business? (You can use cuisine type and feedback ratings to drive your business recommendations.) [6 marks]¶
Conclusions:¶
My analysis of FoodHub’s order and delivery data provides valuable insights into restaurant performance, customer satisfaction, and operational efficiency. Key findings include:
- Delivery & Preparation Time Insights
- 10.54% of orders take more than 60 minutes from order placement to delivery, which could negatively impact customer satisfaction.
- Mean delivery time is longer on weekdays (28.34 min) than on weekends (22.47 min), suggesting that traffic congestion or restaurant availability might be factors affecting delivery speed.
- Some cuisine types and restaurants have significantly longer preparation times, which contributes to overall delays.
- Restaurant & Cuisine Performance
- Certain restaurants consistently receive high ratings (above 4.0) with more than 50 orders, making them strong candidates for promotional offers.
- Some cuisines have higher preparation times than others, affecting overall delivery efficiency.
- The most profitable orders come from customers who spend more than $20, as they generate the highest commissions for FoodHub.
- Customer Satisfaction & Revenue
- Customer ratings are not strongly correlated with delivery time, suggesting that food quality and overall experience play a bigger role in satisfaction.
- The company generated 6,166.30 dollars in commission revenue, with most revenue coming from orders above $20.
- Certain restaurants consistently take longer to prepare food, which might contribute to lower ratings.
Recommendations:¶
- Improve Delivery & Operational Efficiency
- Optimize restaurant partnerships: Identify restaurants with consistent delays and work with them to improve preparation efficiency.
- Prioritize high-performing restaurants: Promote and feature restaurants with high ratings and good efficiency to improve customer satisfaction.
- Reduce weekday delivery delays: Explore better delivery scheduling and traffic-optimized routing for faster weekday deliveries.
- Enhance Customer Satisfaction & Engagement
- Encourage high-rated restaurants: Offer incentives or priority placement in the app for restaurants maintaining an average rating above 4.0.
- Improve customer communication: Notify customers in real-time if orders are expected to take longer than 60 minutes to manage expectations.
- Introduce loyalty rewards: Offer discounts or free delivery for customers who frequently order from top-rated restaurants.
- Revenue Optimization Strategies
- Increase commissions on premium orders: Since most revenue comes from orders above $20, consider adjusting pricing models to encourage higher-value purchases.
- Feature premium & high-performing cuisines: If certain cuisines generate higher-value orders, highlight them in the app for better visibility.
- Reduce inefficiencies in long-prep-time cuisines: Work with restaurants offering slow-prep cuisines to improve kitchen efficiency and reduce preparation times.
