Predicting the Future of Product Sales and Crafting the Optimal Inventory Strategy

Author: Simon Tagbor 📅 6th February, 2024

Introduction

Navigating the ever-shifting landscape of customer demand is a core challenge for supply chain professionals. They often combine expert intuition with basic statistical techniques to infer product demand from historical data. This process is relatively straightforward when dealing with smaller product categories and stable demand patterns. However, as businesses grow and encounter larger product categories with constantly shifting consumer demand, this traditional demand forecasting approach may not scale.

In this notebook, we will explore a more modern and scalable alternative – a data-driven, programmatic approach to demand forecasting. We will build a demand forecasting model with Python. We will also use this model for inventory optimization, covering concepts like reorder points, safety stock, and economic order quantity (EOQ).

Throughout this project, we will address two key business questions:

  1. What is the demand forecast for the top-selling product in the next 24 months?
  2. What is the optimal inventory level for the product?

Join me on this exploration as we seek answers, and feel free to share your thoughts and feedback on my approach.

You can find the source code for this project on my GitHub page. You can also find the Jupyter notebook and the dataset on my Kaggle page.

Project Outline

For this project, I completed the tasks outlined in the sections that follow.

Problem Statement

Large product categories and constantly shifting consumer demand patterns introduce a scaling challenge for traditional demand forecasting techniques. There is a need for an approach that reduces guesswork and avoids the costly outcomes of poor inventory decisions.

Prerequisites

To follow along, you need a basic understanding of the following:

If you’re already familiar with these concepts, you’re good to go!

gif

Project Dependencies

Show the code
# import project libraries
import pandas as pd                # dataframes
import numpy as np                 # numerical operations
import math                        # math operations (e.g. EOQ)

# handling files
import os
import sys

# data preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# model building and fitting
from sklearn.ensemble import RandomForestRegressor
from prophet import Prophet

# visualization libraries
import seaborn as sns
import matplotlib.pyplot as plt
import squarify                    # for tree maps

Exploratory Data Analysis

The working dataset contains entries of customer demand information. The data has 53 features (columns).

To understand the data, let’s perform some exploratory data analysis. We will use the following techniques:

Visual Inspection of Data

df = pd.read_csv("data/DataCoSupplyChainDataset.csv", encoding="ISO-8859-1")
df.head(2)
Type Days for shipping (real) Days for shipment (scheduled) Benefit per order Sales per customer Delivery Status Late_delivery_risk Category Id Category Name Customer City ... Order Zipcode Product Card Id Product Category Id Product Description Product Image Product Name Product Price Product Status shipping date (DateOrders) Shipping Mode
0 DEBIT 3 4 91.250000 314.640015 Advance shipping 0 73 Sporting Goods Caguas ... NaN 1360 73 NaN http://images.acmesports.sports/Smart+watch Smart watch 327.75 0 2/3/2018 22:56 Standard Class
1 TRANSFER 5 4 -249.089996 311.359985 Late delivery 1 73 Sporting Goods Caguas ... NaN 1360 73 NaN http://images.acmesports.sports/Smart+watch Smart watch 327.75 0 1/18/2018 12:27 Standard Class

2 rows × 53 columns

To explore the spread of the data, we will use the describe() method to get the summary statistics of the data.

# summary statistics for the numeric columns
df.describe()
Days for shipping (real) Days for shipment (scheduled) Benefit per order Sales per customer Late_delivery_risk Category Id Customer Id Customer Zipcode Department Id Latitude ... Order Item Quantity Sales Order Item Total Order Profit Per Order Order Zipcode Product Card Id Product Category Id Product Description Product Price Product Status
count 180519.000000 180519.000000 180519.000000 180519.000000 180519.000000 180519.000000 180519.000000 180516.000000 180519.000000 180519.000000 ... 180519.000000 180519.000000 180519.000000 180519.000000 24840.000000 180519.000000 180519.000000 0.0 180519.000000 180519.0
mean 3.497654 2.931847 21.974989 183.107609 0.548291 31.851451 6691.379495 35921.126914 5.443460 29.719955 ... 2.127638 203.772096 183.107609 21.974989 55426.132327 692.509764 31.851451 NaN 141.232550 0.0
std 1.623722 1.374449 104.433526 120.043670 0.497664 15.640064 4162.918106 37542.461122 1.629246 9.813646 ... 1.453451 132.273077 120.043670 104.433526 31919.279101 336.446807 15.640064 NaN 139.732492 0.0
min 0.000000 0.000000 -4274.979980 7.490000 0.000000 2.000000 1.000000 603.000000 2.000000 -33.937553 ... 1.000000 9.990000 7.490000 -4274.979980 1040.000000 19.000000 2.000000 NaN 9.990000 0.0
25% 2.000000 2.000000 7.000000 104.379997 0.000000 18.000000 3258.500000 725.000000 4.000000 18.265432 ... 1.000000 119.980003 104.379997 7.000000 23464.000000 403.000000 18.000000 NaN 50.000000 0.0
50% 3.000000 4.000000 31.520000 163.990005 1.000000 29.000000 6457.000000 19380.000000 5.000000 33.144863 ... 1.000000 199.919998 163.990005 31.520000 59405.000000 627.000000 29.000000 NaN 59.990002 0.0
75% 5.000000 4.000000 64.800003 247.399994 1.000000 45.000000 9779.000000 78207.000000 7.000000 39.279617 ... 3.000000 299.950012 247.399994 64.800003 90008.000000 1004.000000 45.000000 NaN 199.990005 0.0
max 6.000000 4.000000 911.799988 1939.989990 1.000000 76.000000 20757.000000 99205.000000 12.000000 48.781933 ... 5.000000 1999.989990 1939.989990 911.799988 99301.000000 1363.000000 76.000000 NaN 1999.989990 0.0

8 rows × 29 columns

Click to see some Notable Observations of the Data
  1. Approximately 55% of orders carried a late delivery risk.

  2. Approximately 75% of products cost $199.99 or less (the 75th percentile of Product Price).

  3. All the products are available (Product Status is 0 for every row).

  4. 75% of customers bought goods worth at most $247.40 (the 75th percentile of Sales per customer).
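
Each of these observations can be double-checked directly against the dataframe. Here is a minimal sketch (interpreting a Product Status of 0 as “available” is an assumption carried over from the observation above):

# Quick sanity checks for the observations above
print(df['Late_delivery_risk'].mean())          # share of orders with late delivery risk (~0.55)
print(df['Product Price'].quantile(0.75))       # 75th percentile of product price (~199.99)
print((df['Product Status'] == 0).mean())       # share of rows with Product Status 0, i.e. assumed available
print(df['Sales per customer'].quantile(0.75))  # 75th percentile of sales per customer (~247.40)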

Further inspection of the data will help us understand the data better.

Data Preprocessing

We will focus on historical sales data and product attributes like stock level and product category. We will also analyze the impact of other variables that contribute to demand patterns, including geographic factors, customer segments, and lead time.

Preprocessing Tasks

Based on the above, we will drop the columns that are not relevant for forecasting demand and extract new features from the existing columns.

Drop Irrelevant Columns

Show the code
# drop irrelevant columns
def drop_columns(df, columns_to_drop):
    try:
        df = df.drop(columns=columns_to_drop)
        print(f"{len(columns_to_drop)} columns dropped successfully. Number of columns remaining: {len(df.columns)}")
        return df
    except KeyError as e:
        print(f"""Column(s): {e} not found in dataframe.
              
            No columns dropped.
            Please Check that the column names are correct.""")
        return df

# Specify the columns to keep
columns_to_keep = ['Days for shipping (real)', 
                  'Days for shipment (scheduled)',
                  'Customer Country',
                  'Sales per customer',
                  'Delivery Status', 
                  'Late_delivery_risk', 
                  'Customer City',
                  'Customer Segment',
                  'Sales','Shipping Mode',
                  'Type', 'Product Card Id',
                  'Customer Zipcode', 
                  'Product Category Id', 
                  'Product Name',                    
                  'Product Price',
                  'Market', 
                  'Product Status',
                  'shipping date (DateOrders)',]

# Specify the columns to drop
columns_to_drop = [col for col in df.columns if col not in columns_to_keep]

df = drop_columns(df, columns_to_drop)
34 columns dropped successfully. Number of columns remaining: 19

Drop the Column with Missing Values

# drop Customer Zipcode, which contains a few missing values
df = df.drop(columns=['Customer Zipcode'])

A Quick Spot Check for Missing Values

### Check for Missing values
def check_null_values(df):
    null_values = df.isnull().sum()
    if null_values.sum() == 0:
        print("No null values found ✅")
    else:
        print("⚠️ Null values found in the following columns:")
        for column, null_count in null_values.items():
            if null_count > 0:
                print(f"{column}: {null_count}")

# Use the function
check_null_values(df)
No null values found ✅

In the code above, df.isnull().sum() returns a Series where the index is the column names and the values are the count of null values in each column. If the sum of these counts is 0, it means there are no null values in the DataFrame, so it prints “No null values found”. Otherwise, it iterates over the Series and prints the column names and counts of null values.

Create New Features

The dataset contains a shipping date column, a DateTime object from which we can extract Month, Year, Day, and Day of Week values that can be useful in our analysis.

We also need to create a new Lead Time column, which is the difference between the Days for shipping (real) and the Days for shipment (scheduled). This will help us understand the impact of lead time on demand.

Show the code
# Create month, Year, Day, and Weekday columns from Shipping Date
def extract_date_parts(df, date_column, prefix):
    try:
        df[date_column] = pd.to_datetime(df[date_column])
        df[f'{prefix} Year'] = df[date_column].dt.year
        df[f'{prefix} Month'] = df[date_column].dt.month
        df[f'{prefix} Day'] = df[date_column].dt.day
        df[f'{prefix} Weekday'] = df[date_column].dt.weekday
        # verify and notify that the columns have been created
        if f'{prefix} Year' in df.columns and f'{prefix} Month' in df.columns and f'{prefix} Day' in df.columns and f'{prefix} Weekday' in df.columns:
            print(f"✅ Success! Columns Created: {prefix} Year, {prefix} Month, {prefix} Day, and {prefix} Weekday")
            return df
        else:
            print("Error creating columns. Please check that the date column name is correct.")
            return df
    except Exception as e:
        print(f"Error creating columns: {e}")
        return df
# Add Lead Time Feature from Days for shipping (real) and Days for shipment (scheduled)
df['Lead Time'] = df['Days for shipping (real)'] - df['Days for shipment (scheduled)']

# Use the function to extract date parts
df = extract_date_parts(df, 'shipping date (DateOrders)', 'Shipping')
✅ Success! Columns Created: Shipping Year, Shipping Month, Shipping Day, and Shipping Weekday
# display the shape of the data frame
df.shape
(180519, 23)

Now we have 23 columns and 180519 entries (rows) in the dataset.

Data Encoding

Categorical data in its raw form is unsuitable for the analysis ahead. For instance, machine learning models can’t work with categorical values for customer origins like UK, USA, France, etc. We will convert these categorical values to numerical values using the LabelEncoder from the sklearn library.

I will also perform a one-hot encoding technique on categorical features for future machine learning modeling tasks.

I wrote a prepare_data() function that returns two preprocessed dataframes: one that is encoded using a label encoder function and the other encoded using one hot encoding technique.

You can learn about encoding techniques for categorical variables here
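
As a quick illustration of the difference between the two techniques, here is a toy sketch on a made-up column (not part of the pipeline below):

# Toy example: label encoding vs one-hot encoding of a small categorical column
toy = pd.DataFrame({'Market': ['Europe', 'LATAM', 'Europe', 'Pacific Asia']})

# Label encoding: each category is mapped to a single integer code
toy['Market_label'] = LabelEncoder().fit_transform(toy['Market'])
print(toy)

# One-hot encoding: each category becomes its own binary column
print(pd.get_dummies(toy[['Market']], columns=['Market']))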

Show the code
# Select top selling product
top_product = df['Product Card Id'].value_counts().index[0]
# get top product ID
print(f"Filtering and Encoding Dataset for Top Product ID: {top_product}")

from sklearn.preprocessing import LabelEncoder

def prepare_data(df, product_card_id, categorical_cols, columns_to_drop):
    """
    Prepare a DataFrame for bivariate analysis and machine learning
    by applying label encoding and one-hot encoding to categorical
    columns and dropping specified columns.

    Parameters:
    df (pandas.DataFrame): The original DataFrame.
    product_card_id (int): The product card ID to filter the DataFrame on.
    categorical_cols (list of str): The names of the categorical columns to apply encoding to.
    columns_to_drop (list of str): The names of the columns to drop from the DataFrame.

    Returns:
    pandas.DataFrame: The one-hot encoded DataFrame for machine learning.
    pandas.DataFrame: The label encoded DataFrame for bivariate analysis.
    """
    try:
        df_copy = df[df['Product Card Id'] == product_card_id].copy()  # create a copy

        # label encoding
        label_encoder = LabelEncoder()
        df_label_encoded = df_copy.copy()

        # Apply label encoding to categorical variables in place
        for col in categorical_cols:
            df_label_encoded[col] = label_encoder.fit_transform(df_label_encoded[col])

        # Drop specified columns
        df_label_encoded = df_label_encoded.drop(columns=columns_to_drop)

        # one-hot encoding
        df_one_hot_encoded = pd.get_dummies(df_copy, columns=categorical_cols)

        # Drop specified columns
        df_one_hot_encoded = df_one_hot_encoded.drop(columns=columns_to_drop)
        print("Data Encoding successful. ✅")
        return  df_one_hot_encoded, df_label_encoded
    except Exception as e:
        print(f"Error preparing data: {e}")
        return None, None

# Use the function to prepare the data for bivariate analysis
categorical_cols = ['Type', 'Customer Segment', 
                    'Delivery Status', 
                    'Customer City', 
                    'Market',
                    'Shipping Mode']

columns_to_drop = ['Product Name',
                   'Days for shipment (scheduled)', 
                   'Sales per customer', 
                   'Days for shipping (real)',
                   'Customer Country', 
                   'shipping date (DateOrders)', 
                   'Product Card Id', 
                   'Product Category Id', 
                   'Product Status', 
                   'Product Price']

# drop columns and encode data for the correlation matrix and machine learning
onehot_encode_df, label_encode_df = prepare_data(df, top_product, categorical_cols, columns_to_drop)

# rename Type column to Payment Type
label_encode_df = label_encode_df.rename(columns={'Type': 'Payment Type'})
onehot_encode_df = onehot_encode_df.rename(columns={'Type': 'Payment Type'})
Filtering and Encoding Dataset for Top Product ID: 365
Data Encoding successful. ✅

Confirm Encoding of Dataset

label_encode_df.dtypes
Payment Type            int64
Delivery Status         int64
Late_delivery_risk      int64
Customer City           int64
Customer Segment        int64
Market                  int64
Sales                 float64
Shipping Mode           int64
Lead Time               int64
Shipping Year           int32
Shipping Month          int32
Shipping Day            int32
Shipping Weekday        int32
dtype: object
# validate the one-hot encoding
onehot_encode_df.dtypes
Late_delivery_risk                int64
Sales                           float64
Lead Time                         int64
Shipping Year                     int32
Shipping Month                    int32
                                 ...   
Market_USCA                        bool
Shipping Mode_First Class          bool
Shipping Mode_Same Day             bool
Shipping Mode_Second Class         bool
Shipping Mode_Standard Class       bool
Length: 589, dtype: object

Finally…Data Preprocessing Completed!!

gif

The dataset is now ready for further analysis and modeling. We can now proceed to exploratory data visualizations to better understand the distribution of the data.

Exploratory Data Visualizations

To highlight the distributions of the individual variables as well as the relationship between the variables and the target variables, I used the following techniques:

Univariate Analysis

Univariate analysis is the simplest form of data analysis where the data being analyzed contains only one variable. Since it’s a single variable, it doesn’t deal with causes or relationships. The main purpose of univariate analysis is to describe the data and find patterns that exist within it.

Visualizing the Distribution of the Dataset

Show the code
fig, axes = plt.subplots(nrows=3, ncols=3, figsize=(15, 15))
fig.suptitle('Distribution Plots for Selected Variables', 
             fontsize=16)
# Create a copy of the DataFrame
df_copy = df.copy()


# Plotting  the top ten products per Product Card Id
sns.countplot(data=df_copy, x='Product Card Id',
                color='blue', ax=axes[0, 0], 
                order=df_copy['Product Card Id'].value_counts().iloc[:10].index)
axes[0, 0].set_title('Distribution of Top Ten Product Id')
axes[0, 0].set_xlabel('Product Card Id')
axes[0, 0].set_ylabel('Count')


# Plotting Value of sales in  dollars
sns.histplot(data=df_copy, x='Sales', 
             kde=True, color='salmon', 
             bins=30, linewidth=2,
             ax=axes[0, 1])
axes[0, 1].set_title('Distribution of Sales')
axes[0, 1].set_xlabel('Sales value in Dollars')
axes[0, 1].set_ylabel('Frequency')


# Plotting Sales Value per customer
sns.histplot(data=df_copy, x='Sales per customer',
             bins=30, kde=True, linewidth=2,
             color='lightblue', ax=axes[0, 2])
axes[0, 2].set_title('Distribution of Sales per Customer')
axes[0, 2].set_xlabel('Sales per Customer')
axes[0, 2].set_ylabel('Frequency')

# Plotting the distribution of Product Price
sns.histplot(data=df_copy, x='Product Price', bins=30, kde=True, 
             color='lightgreen', linewidth=2, ax=axes[1, 0])

axes[1, 0].set_title('Distribution of Product Price')
axes[1, 0].set_xlabel('Product Price')

# Plotting a tree map for Customer Segment
squarify.plot(sizes=df_copy['Customer Segment'].value_counts(), 
              label=df_copy['Customer Segment'].value_counts().index, 
              color=sns.color_palette("Set3"), ax=axes[1, 1])
axes[1, 1].set_title('Distribution of Customer Segment - Treemap')

# Plotting a tree map for Top Ten Product Category Id
squarify.plot(sizes=df_copy['Product Category Id'].value_counts().iloc[:10],
                label=df_copy['Product Category Id'].value_counts().iloc[:10].index,
                color=sns.color_palette("Set2"), ax=axes[1, 2])
axes[1, 2].set_title('Distribution of Top Ten Product Category Id - Treemap')

# Plotting the distribution of Delivery Status
sns.countplot(data=df_copy, x='Delivery Status',
                color='pink', ax=axes[2, 0])
axes[2, 0].set_title('Distribution of Delivery Status')
axes[2, 0].set_xlabel('Delivery Status')
axes[2, 0].set_ylabel('Count')


# Plotting the distribution of Payment Type with a bar chart
df_copy.groupby(['Type'])['Type'].count().plot(kind='bar', 
                                               stacked=True,
                                               ax=axes[2, 1])

axes[2, 1].set_title('Distribution of Payment Type')
axes[2, 1].set_xlabel('Payment Type')
axes[2, 1].set_ylabel('Count')

# Plotting the Distribution of top ten Customer Country
sns.countplot(data=df_copy, x='Customer Country',
                color='orange', ax=axes[2, 2], 
                order=df_copy['Customer Country'].value_counts().iloc[:10].index)
axes[2, 2].set_title('Distribution of Customer Country')
axes[2, 2].set_xlabel('Customer Country')
axes[2, 2].set_ylabel('Count')



# Adjust layout
plt.tight_layout(rect=[0, 0.03, 1, 0.95])
# Show the plots
plt.show()

Click to Read My Observations!

The Top Selling Product ID is 365, which corresponds to the product name Perfect Fitness Perfect Rip Deck. This indicates a fast-moving product, so I will focus the demand forecasting process on this product going forward.

The distributions of Sales and Sales per Customer are both positively skewed with a long tail. This indicates that the majority of sales are for low-value products, which is an interesting insight because it may suggest that the majority of customers are price-sensitive.

The distribution of Product Price is also positively skewed with a long tail. This means that the majority of products are low-value products.

The distribution of Customer Segment indicates that the majority of customers are from the consumer segment.

NOTE: Based on the insights from the univariate analysis, the rest of the analysis and forecasting will focus on the top-selling Product Card Id (365, ‘Perfect Fitness Perfect Rip Deck’).

Exploratory Time Series Visualisation

To understand the demand patterns of the top-selling product, let’s create a time series heatmap to visualize the demand patterns of the top-selling product over time.

Time Series HeatMap of The Demand

Show the code
import seaborn as sns
import matplotlib.pyplot as plt
import calendar

# Extract shipping date (DateOrders) and Sales columns
df_heatmap = df[['shipping date (DateOrders)', 'Sales']]
# Assuming 'df' is your original dataframe

df_heatmap.set_index('shipping date (DateOrders)', inplace=True)
resampled_df = df_heatmap.resample('M').sum()  # Resample to yearly frequency
# Set x-axis ticks to represent months and years
month_labels = [calendar.month_abbr[m.month] + '-' + str(m.year) for m in resampled_df.index]
# Plot the heatmap
plt.figure(figsize=(20, 10))
sns.heatmap(resampled_df.T, cmap='YlGnBu', cbar_kws={'label': 'Sales'})
plt.xticks(ticks=range(len(month_labels)), labels=month_labels, rotation=80, ha='right')

plt.title('Time Series Heatmap of Sales (Aggregated by Month)')
plt.xlabel('Month and Year')


plt.show()

Judging from the consistency in the shades of the heatmap, we can see that the demand for the top-selling product is fairly stable over time. It is interesting to note, however, that while the number of sales recorded in the first quarters of 2015, 2016, and 2017 remained consistent, first-quarter sales in 2018 dipped significantly. This is an insight we could explore further.

Next, let’s use the Prophet library to model the demand for the top-selling product. This will help us understand the cyclical patterns in its demand.

Forecasting Demand with Prophet

Prophet is a forecasting tool developed by Facebook. It is designed for analyzing time series data that display patterns on different time scales, such as yearly, weekly, and daily. It also has advanced capabilities for modeling the effects of holidays on a time series and implementing custom seasonalities. See the documentation here.

Show the code
# import prophet
from prophet import Prophet

prophet_df = df.copy()
prophet_df = prophet_df.rename(columns={'shipping date (DateOrders)': 'ds', 'Sales': 'y'})
# Add custom Puerto Rico holidays
# Read the CSV file
holidays_df = pd.read_csv('data/puertorican_holidays.csv')

# Rename the 'Date' column to 'ds' and the 'Name' column to 'holiday'
holidays_df = holidays_df.rename(columns={'Date': 'ds', 'Name': 'holiday'})

# Drop the 'Type' column as it's not needed
holidays_df = holidays_df.drop(columns=['Type'])

# Add 'lower_window' and 'upper_window' columns
holidays_df['lower_window'] = 0
holidays_df['upper_window'] = 1

# Convert 'ds' to DateTime
holidays_df['ds'] = pd.to_datetime(holidays_df['ds'])

# Create a Prophet instance and provide the holidays DataFrame
prophet = Prophet(holidays=holidays_df)

prophet.fit(prophet_df)

# Create a DataFrame with future dates for forecasting
future = prophet.make_future_dataframe(periods=365, freq='D')

# Generate forecasts
forecast = prophet.predict(future) 
16:55:01 - cmdstanpy - INFO - Chain [1] start processing
17:49:07 - cmdstanpy - INFO - Chain [1] done processing

The code above uses the Prophet library to model the demand for the top-selling product. The model is trained on the Sales and Shipping Date columns. The model is then used to forecast the demand for the top-selling product over the next 365 days.

The code also included Puerto Rican holidays to account for the impact of holidays on the demand for the top-selling product. This is important because holidays can have a significant impact on demand patterns.

You might wonder why Puerto Rican holidays were included in the model. From the univariate analysis conducted earlier, we discovered that most of the orders were coming from Puerto Rico. The forecast variable now contains the forecasted values for the top-selling product. We will work with this variable later, but for now, let’s evaluate the accuracy of our Prophet model.

Evaluating the Accuracy of the Time Series Forecast

To determine the accuracy of the Prophet model, we will use the cross_validation() function provided by Prophet.

from prophet.diagnostics import cross_validation, performance_metrics
# Perform cross-validation
df_cv = cross_validation(model=prophet, initial='730 days', period='365 days', horizon='365 days')
18:01:01 - cmdstanpy - INFO - Chain [1] start processing
18:07:08 - cmdstanpy - INFO - Chain [1] done processing

The cross_validation() function performs cross-validation on the model: it trains the model on a subset of the data and then evaluates it on the remaining data, which is a good way to assess the model’s accuracy. The initial parameter specifies the size of the first training window, the period parameter specifies the spacing between cutoff dates, and the horizon parameter specifies how far ahead each forecast extends.
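
Before plotting, we can also inspect the cross-validation errors numerically with the performance_metrics() helper imported above; a short sketch:

# Summarize cross-validation errors by forecast horizon
df_metrics = performance_metrics(df_cv)
print(df_metrics.head())  # columns typically include horizon, rmse, mae, mape and coverage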

Let’s visualize the performance of the model

# Plot MAPE
from prophet.plot import plot_cross_validation_metric
#  set fig size
plt.figure(figsize=(9, 6))
fig = plot_cross_validation_metric(df_cv, metric='mape')
fig.get_axes()[0].get_lines()[0].set_markerfacecolor('#ADD8E7')  # Change color of the dots
fig.get_axes()[0].get_lines()[0].set_markeredgecolor('#ADD8E7')  # Change color of the dot edges
<Figure size 900x600 with 0 Axes>

The forecast has lower MAPE (Mean Absolute Percentage Error) values for horizons within the 200-day range; however, the accuracy drops for horizons beyond 250 days. This suggests that the model makes more errors at horizons beyond 250 days.

The model will be most useful to stakeholders if it can forecast demand beyond 250 days with a lower percentage of error. Exposing the model to more historical data may help lower the MAPE significantly. Nonetheless, let’s explore whether there are opportunities to improve accuracy by finding the best combination of hyperparameters for the Prophet model. I will use a hyperparameter tuning technique to try to optimize the model’s performance.

Finding the Best Hyperparameter Combination for Lower MAPE

Show the code
from sklearn.model_selection import ParameterGrid

# Assuming prophet_df is your DataFrame with 'ds' and 'y' columns
prophet_df = df.copy()
prophet_df = prophet_df.rename(columns={'shipping date (DateOrders)': 'ds', 'Sales': 'y'})

# Specify hyperparameter values to try
param_grid = {
    'seasonality_mode': ["additive", 'multiplicative'],
    'seasonality_prior_scale': [1, 5, 10, 20],
    'holidays_prior_scale': [5, 10, 20, 25],
    'changepoint_prior_scale': [0.005, 0.01, 0.05, 0.1]
}

# Generate all combinations of hyperparameters using ParameterGrid
param_combinations = ParameterGrid(param_grid)

The code above uses the ParameterGrid function from the sklearn library to create a grid of hyperparameters. The grid contains different combinations of hyperparameters for the prophet model.

On the other hand, the code below uses the cross_validation() function to evaluate the accuracy of the model for each combination of hyperparameters. The code then selects the combination of hyperparameters that results in the lowest MAPE.

Show the code
from itertools import product
# Store results in a dictionary
results = {}
print(f"trying all {len(param_combinations)} hyperparameter combinations")
# Generate all combinations of hyperparameters
param_combinations = list(product(*param_grid.values()))

for params in param_combinations:
    # Create a Prophet instance with current hyperparameter values
    prophet = Prophet(**dict(zip(param_grid.keys(), params)))

    # Fit the model
    prophet.fit(prophet_df)

    # Perform cross-validation
    df_cv = cross_validation(model=prophet, initial='730 days', period='365 days', horizon='365 days')


    # Calculate performance metrics
    df_metrics = performance_metrics(df_cv, rolling_window=0)

    # Store metrics in the results dictionary
    results[params] = df_metrics['mape'].mean()

NOTE: The code took a very long time to complete. It tried 128 different combinations of hyperparameters and the best model was the one with the lowest MAPE value.

The results are in! The best model had the following hyperparameters:

# Find the hyperparameters with the lowest MAPE
best_hyperparams = min(results, key=results.get)
print(f"Best Hyperparameters: {dict(zip(param_grid.keys(), best_hyperparams))}")
Best Hyperparameters: {'seasonality_mode': 'additive', 'seasonality_prior_scale': 1, 'holidays_prior_scale': 5, 'changepoint_prior_scale': 0.005}

Now let’s rebuild the model with the best hyperparameters and evaluate the model’s performance.

tuned_prophet = Prophet(holidays=holidays_df, 
                        seasonality_mode='additive', 
                        seasonality_prior_scale=1, 
                        holidays_prior_scale=5, 
                        changepoint_prior_scale=0.005)
# fit the model
tuned_prophet.fit(prophet_df)
# Create a DataFrame with future dates for forecasting
future = tuned_prophet.make_future_dataframe(periods=365, freq='D')

# Generate forecasts
new_forecast = tuned_prophet.predict(future)
19:27:47 - cmdstanpy - INFO - Chain [1] start processing
19:53:04 - cmdstanpy - INFO - Chain [1] done processing
Cross Validation of the Best Model
from prophet.diagnostics import cross_validation, performance_metrics
# Perform cross-validation
tuned_df_cv = cross_validation(model=tuned_prophet, initial='730 days', period='365 days', horizon='365 days')
20:34:36 - cmdstanpy - INFO - Chain [1] start processing
20:40:55 - cmdstanpy - INFO - Chain [1] done processing

Let’s compare the accuracy of the model before and after hyperparameter tuning.

Show the code
fig, axs = plt.subplots(1, 2, figsize=(15, 9))

# Plot the first cross-validation metric
fig1 = plot_cross_validation_metric(df_cv, metric='mape', ax=axs[0])
fig1.get_axes()[0].get_lines()[0].set_markerfacecolor('#ADD8E7')  # Change color of the dots
fig1.get_axes()[0].get_lines()[0].set_markeredgecolor('#ADD8E7')  # Change color of the dot edges
# add title
axs[0].set_title('Initial Cross-Validation score MAPE')

# Plot the second cross-validation metric
fig2 = plot_cross_validation_metric(tuned_df_cv, metric='mape', ax=axs[1])
fig2.get_axes()[0].get_lines()[0].set_markerfacecolor('#ADD8E7')  # Change color of the dots
fig2.get_axes()[0].get_lines()[0].set_markeredgecolor('#ADD8E9')  # Change color of the dot edges
# add title
axs[1].set_title('Tuned Cross-Validation score MAPE')

plt.tight_layout()
plt.show()

Not exactly the outcome I was expecting, but the tuned model’s performance remains consistent with the previous model. This may suggest that the model is not sensitive to these hyperparameters. Nonetheless, the model is still useful for forecasting demand for the top-selling product.

Forecast Results

As indicated earlier, the forecast variable contains the forecasted values of our Sales time series. Based on this forecast we will calculate the optimal inventory policy for this specific product.

The forecast variable is a dataframe that contains the following columns:

forecast.head(2)
ds trend yhat_lower yhat_upper trend_lower trend_upper American Citizenship Day American Citizenship Day_lower American Citizenship Day_upper Christmas Day ... weekly weekly_lower weekly_upper yearly yearly_lower yearly_upper multiplicative_terms multiplicative_terms_lower multiplicative_terms_upper yhat
0 2015-01-03 00:00:00 189.503452 66.035120 371.245609 189.503452 189.503452 0.0 0.0 0.0 0.0 ... -1.288207 -1.288207 -1.288207 33.801001 33.801001 33.801001 0.0 0.0 0.0 220.974183
1 2015-01-03 03:30:00 189.646216 58.800553 399.717455 189.646216 189.646216 0.0 0.0 0.0 0.0 ... -1.373629 -1.373629 -1.373629 33.618634 33.618634 33.618634 0.0 0.0 0.0 221.460580

2 rows × 73 columns

Before calculating the optimal inventory policy, let’s visualize the forecasted sales data to get a feel for the seasonalities and cycles in it.

Visualizing Forecasted Sales

import warnings

# Ignore the specific FutureWarning
warnings.filterwarnings("ignore", category=FutureWarning)

# Plot the forecast
tuned_prophet.plot_components(new_forecast)

Business Question #1

gif

What is the demand forecast for the top selling product in the next 24 months?

The sales trend between 2015 and 2017 marks a cycle where sales for the product remained relatively stable during the second and third quarters of each year and then dipped slightly in October with a sharp increase between November and December.

The forecasted sales for the next 24 months, on the other hand, indicate a very stable demand pattern.

The zero variance observed in the product price may account for the relatively stable sales pattern forecasted for 2018 and 2019. It might also be worth investigating the factors that account for the cyclical dips between 2015 and 2017.

We can also observe the impact of the Puerto Rican holidays on the forecasted sales.

Finding Optimal Inventory Policy Based on Forecasted Demand

Now that we have forecasted the demand for the top-selling product, we can use the forecasted demand to calculate the optimal inventory policy for the product.

Finding the optimal inventory policy will help us determine the Reorder Point, safety stock, and Economic Order Quantity(EOQ) for the product. These markers will help us ensure that we have the right amount of inventory on hand to meet customer demand while minimizing inventory costs.

Reorder Point

The reorder point (ROP) is the inventory level at which we should reorder more stock. It is calculated as the product of the average sales per day and the lead time (also referred to as the Lead Time Demand) plus the safety stock.

\[ \text{Reorder Point} = \text{Lead Time Demand} + \text{Safety Stock} \]

Let’s find the Lead Time Demand first.

\[ \text{Lead Time Demand} = \text{Average Sales Per Day} \times \text{Lead Time} \]

NOTE: It is important to state the difference between the downstream and upstream lead time.

The Lead Time with regards to the reordering point is the time it takes for a product to be delivered by upstream supply chain manufacturers to the store’s warehouse. It is the time between the placement of an order by the store and the receipt of the product.

The current dataset only provides the Days for shipment (scheduled) and the Days for shipping (real), which only help us determine downstream lead times. The downstream lead time is the time it takes for a product to be delivered to the customer after it has been ordered. This is not the lead time we need to calculate the reorder point.

For the purposes of our analysis, we will assume an average upstream lead time of 7 days.


# Extract average forecasted sales per day
average_forecasted_sales = new_forecast['yhat'].mean()


# Extract the average lead time
average_lead_time = 7  # 7 days
print(f"Average Lead Time: {average_lead_time}")

lead_time_demand = average_forecasted_sales * average_lead_time
print(f"Lead Time Demand: {lead_time_demand}")
Average Lead Time: 7
Lead Time Demand: 1469.1331793554164

One final piece to the Reorder Point puzzle is the Safety Stock. The safety stock is the extra stock that is kept on hand to mitigate the risk of stockouts due to uncertainties in demand and lead time.

\[ \text{Safety Stock} = (\text{Maximum Daily Sales} \times \text{Maximum Lead Time}) - \text{Lead Time Demand}\]

Let’s also assume that there have been delays from the manufacturer in the past and that the maximum lead time is 10 days, that is, three days longer than the average order fulfillment timeline.

# find maximum daily forecasted sales
max_daily_forecasted_sales = new_forecast['yhat'].max()
print(f"Maximum Daily Forecasted Sales: {max_daily_forecasted_sales}")

# find maximum lead time
max_lead_time = average_lead_time + 3  # 3 days delays in delivery than the average
print(f"Maximum Lead Time: {max_lead_time}")

# calculate safety stock
safety_stock = (max_daily_forecasted_sales * max_lead_time) - lead_time_demand
print(f"Safety Stock: {safety_stock}")
Maximum Daily Forecasted Sales: 375.38211369017046
Maximum Lead Time: 10
Safety Stock: 2284.687957546288

Finally, we can calculate the reorder point for the top-selling product.

Putting It All Together

# calculate reorder point
reorder_point = lead_time_demand + safety_stock
print(f"The Optimal Reorder Point for the Top-selling Product is: {reorder_point}")
The Optimal Reorder Point for the Top-selling Product is: 3753.8211369017044

As indicated by the result, the reorder point for the top-selling product is 3753 units, which means that we should reorder more stock when the inventory level reaches 3753 units. This will help us ensure that we have enough stock on hand to meet customer demand while minimizing inventory costs.

Economic Order Quantity (EOQ)

Alternatively, we can use the Economic Order Quantity (EOQ) model to calculate the optimal order quantity for the top-selling product. The EOQ model helps us determine the optimal order quantity that minimizes the total inventory costs.

Unlike the reorder point, which is concerned with determining the inventory level at which a new order should be placed to avoid stockouts, EOQ takes into account the costs of ordering (e.g., setup costs) and the costs of holding inventory (e.g., storage costs, opportunity costs).

The Economic Order Quantity (EOQ) formula is given by:

\[ EOQ = \sqrt{\frac{2DS}{H}} \]

Where:

  • D is the demand rate,
  • S is the ordering cost per order, and
  • H is the holding cost per unit.

This formula helps determine the optimal order quantity that minimizes total inventory costs.

NOTE: We can figure out the demand rate (D) from the existing data. However, the ordering cost (S) and holding cost (H) are not provided in the dataset. For our analysis, we will assume that the holding and ordering costs are 10% and 30% of the product price, respectively.

Estimating Holding Cost and Ordering Cost

Holding Cost

Holding costs typically include expenses related to storing inventory, such as warehousing, insurance, and security. It also includes the opportunity cost of tying up capital in inventory.

Ordering Cost

Ordering costs are the expenses incurred in the process of ordering and receiving inventory. This includes the cost of preparing and processing purchase orders as well as receiving and inspecting the goods.

# extract the product price of top selling product(product card id:365)
product_price = df[df['Product Card Id'] == 365]['Product Price'].iloc[0]
print(f"The Product Price is: {product_price}")
# calculate holding cost
H = 0.10 * product_price
# calculate ordering cost
S = 0.30 * product_price

# calculate forecasted demand rate
D = new_forecast['yhat'].mean()

print(f"The Demand Rate is: {D}")
print(f"The Holding Cost is: {H}")
print(f"The Ordering Cost is: {S}")
The Product Price is: 59.99000168
The Demand Rate is: 209.8761684793452
The Holding Cost is: 5.999000168
The Ordering Cost is: 17.997000504

Putting It All Together

EOQ = math.sqrt((2 * D * S) / H)
print(f"The Economic Order Quantity is: {EOQ}")
The Economic Order Quantity is: 35.48601148165388

Based on the EOQ model, the optimal order quantity for the top-selling product is 35 units. This means that we should order 35 units of the product at a time to minimize the total inventory costs. This will help us ensure that we have the right amount of inventory on hand to meet customer demand while minimizing inventory costs.
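
As a sanity check, the EOQ is the quantity that minimizes the classic total cost curve: the per-period ordering cost D/Q × S plus the holding cost Q/2 × H. The sketch below reuses the D, S, H, and EOQ values computed above and simply confirms that the curve bottoms out near 35 units:

# Total inventory cost per period as a function of order quantity Q
Q = np.arange(5, 200)
total_cost = (D / Q) * S + (Q / 2) * H  # ordering cost + holding cost

plt.figure(figsize=(8, 5))
plt.plot(Q, total_cost, label='Total inventory cost')
plt.axvline(EOQ, color='red', linestyle='--', label=f'EOQ ≈ {EOQ:.0f}')
plt.xlabel('Order Quantity (Q)')
plt.ylabel('Cost')
plt.title('Total Inventory Cost vs Order Quantity')
plt.legend()
plt.show()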

Business Question #2

gif

See Angry Jack (from SpongeBob) being angry about his massive inventory buildup. Soon we will see how he could use our new inventory optimizations.

What is the optimal inventory level for the top selling product?

The optimal inventory policy for the top-selling product is as follows:

  • Reorder Point: 3753 units
  • Economic Order Quantity (EOQ): 35 units
  • Safety Stock: 2284 units

Here’s what Angry Jack should do:

When the stock level of the top-selling product hits 3753 units, he needs to place an order of 35 units with his suppliers.

Inventory management decisions based on these markers will help the company ensure that there is the right amount of inventory of the top-selling product on hand to meet customer demand while minimizing inventory costs.

Bonus: Investigating the Top Predictors of Demand Outcomes

Now that we have forecasted the demand for the top-selling product and calculated the optimal inventory policy, we can investigate the top predictors of demand outcomes. This will help us understand the factors that contribute to demand patterns and identify opportunities to improve sales outcomes.

The top-selling product’s price has remained fixed throughout the dataset, so we can safely rule out product price as a predictor of demand outcomes for the period under review.

Wait….I have a Hypothesis

I have a hunch that Product Lead Time, Customer Segment and some Geographic factors will be top predictors of demand outcomes.

I will test this hypothesis by conducting a feature importance analysis using the Random Forest algorithm.

Let’s Investigate…..

gif

Correlation Analysis

Before we proceed with the feature importance analysis, let’s conduct a correlation analysis to identify the top predictors of demand outcomes.

We will create the correlation matrix using the corr() method and then use the heatmap() function from the seaborn library to visualize it.

We will use the label_encode_df data frame created during the data preprocessing stage.

# correlation analysis
correlation_matrix = label_encode_df.corr()

# plot the correlation matrix
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', cbar_kws={'label': 'Correlation Coefficient'})
plt.title('Correlation Matrix of Selected Variables')
plt.show()

Observation

The Heatmap shows there are no strong correlations between the features and the target variable. This suggests that the demand for the top-selling product is not strongly influenced by any single feature.
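
To back this up numerically, we can rank each feature’s correlation with the Sales target by absolute value; a short sketch using the matrix computed above:

# Correlation of each feature with Sales, strongest (in absolute value) first
sales_corr = correlation_matrix['Sales'].drop('Sales')
print(sales_corr.reindex(sales_corr.abs().sort_values(ascending=False).index))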

Feature Importance Analysis Using Random Forest Regressor Model

The correlation matrix could not identify the top predictors of demand outcomes, so we will use the Random Forest algorithm instead. Random Forest is an ensemble learning method that uses multiple decision trees to make predictions, and it is well suited to feature importance analysis.

We will train a Random Forest Regressor model on the onehot_encode_df data frame and then use the feature_importances_ attribute of the model to identify the top predictors of demand outcomes.

Split the Data for Model Training
Show the code
from sklearn.model_selection import train_test_split

# prepare features, excluding the Sales target
X_features = onehot_encode_df.drop(columns=['Sales'])
#  Drop Shipping year
X_features = X_features.drop(columns=['Shipping Year'])
# prepare target variable
y_target = onehot_encode_df['Sales']

# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_features, y_target, test_size=0.2, random_state=42)
Train the Random Forest Regressor Model
Show the code
from sklearn.ensemble import RandomForestRegressor

# create a random forest regressor model
rf_model = RandomForestRegressor(n_estimators=150, max_depth=10,  min_samples_split=2)

# fit the rf model to the training data
rf_model.fit(X_train, y_train)
RandomForestRegressor(max_depth=10, n_estimators=150)

Feature Importance Analysis

Show the code
import matplotlib.cm as cm
# retrieve the feature importances
feature_importances = rf_model.feature_importances_

# get the top 5 feature importances
top_5_feature_importances = feature_importances.argsort()[-5:]

# visualize the top 5 features of importance
plt.figure(figsize=(10, 6))
colors = cm.viridis(np.linspace(0, 1, 
                                len(top_5_feature_importances)))
plt.barh(X_train.columns[top_5_feature_importances], 
         feature_importances[top_5_feature_importances],
          color=colors)
plt.xlabel('Feature Importance')

plt.title('Top 5 Feature Importances')
plt.show()

Discussion of Results

The feature importance analysis confirms most of my hypothesis.

We can see that Product Lead Time and some geographic factors were among the top predictors of demand outcomes.

Customer Segment, however, was not among the top predictors of demand outcomes. This brings us to the end of our analysis.

Congratulations on making it This Far!

gif

Key Takeaways

If you made it this far, I hope you have enjoyed the journey. Here are the key takeaways from this project:

How does this method offer enhanced scalability compared to conventional demand forecasting techniques?

  1. The existing data processing steps can easily be adapted to accommodate larger product categories.
  2. Improved speed and accuracy.
  3. Minimized guesswork and enhanced reproducibility for validating forecasts.

Feel free to share your thoughts and feedback on my approach. I am open to learning and improving my skills.

Up Next: