Online Retail Dataset: RFM Modeling with K-Means Clustering¶

This is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retailer. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.

Data Description:¶

InvoiceNo: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with letter 'c', it indicates a cancellation.

StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.

Description: Product (item) name. Nominal.

Quantity: The quantities of each product (item) per transaction. Numeric.

InvoiceDate: Invoice Date and time. Numeric, the day and time when each transaction was generated.

UnitPrice: Unit price. Numeric, Product price per unit in sterling.

CustomerID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.

Country: Country name. Nominal, the name of the country where each customer resides.
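The 'C' prefix on InvoiceNo noted above can be used to flag cancellations directly. A minimal sketch, assuming the data has been loaded into a DataFrame `df` as done further below (the `is_cancelled` series is illustrative only; this notebook later removes cancellations by filtering on `Quantity > 0` instead):

```python
# Flag cancelled invoices from the leading 'C' in InvoiceNo (illustrative only).
is_cancelled = df['InvoiceNo'].astype(str).str.startswith('C')
print(is_cancelled.mean())  # share of line items belonging to cancellations
```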

Problem Statement:¶

It is a business critical requirement to understand the value derived from a customer. RFM is a method used for analyzing customer value.

Perform customer segmentation using RFM analysis. The resulting segments can be ordered from most valuable (most recent purchase, highest frequency, and highest monetary value) to least valuable. Note that identifying the most valuable RFM segments can capitalize on chance relationships in the data used for this analysis.

Import Python libraries from the PyData stack.

In [1]:
# for Extracting, Transforming, and Loading Data, as well as Processing, Analyzing, and Visualizing Data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt

# for Statistical Analysis
from scipy import stats

# for Machine Learning
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

import feature_engine
from feature_engine.outliers import Winsorizer


# for Sane and Clear Outputs
import warnings
warnings.filterwarnings("ignore")

Approach:¶

The following cell is commented out, as it only needs to be run once to create CSV versions of the given data. Reading the Excel files with pandas on every run is slower and uses more memory than reading CSVs, so the conversion is done once up front.

In [2]:
# excel = pd.read_excel('Online Retail.xlsx')
# excel.to_csv('retail.csv')

# excel = pd.read_excel('train.xlsx')
# excel.to_csv('train.csv')

# excel = pd.read_excel('test.xlsx')
# excel.to_csv('test.csv')

The 'ISO-8859-1' encoding serves a similar resource-saving purpose. pandas' default is 'UTF-8', a variable-width encoding that may use several bytes per character; ISO-8859-1 is a single-byte encoding, so every byte in the CSV maps to exactly one character.

In [3]:
df = pd.read_csv('retail.csv', encoding='ISO-8859-1',
                 index_col=0, parse_dates=['InvoiceDate'])
df.head()
Out[3]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010-12-01 08:26:00 2.55 17850.0 United Kingdom
1 536365 71053 WHITE METAL LANTERN 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 2010-12-01 08:26:00 2.75 17850.0 United Kingdom
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom
In [4]:
df.tail()
Out[4]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
541904 581587 22613 PACK OF 20 SPACEBOY NAPKINS 12 2011-12-09 12:50:00 0.85 12680.0 France
541905 581587 22899 CHILDREN'S APRON DOLLY GIRL 6 2011-12-09 12:50:00 2.10 12680.0 France
541906 581587 23254 CHILDRENS CUTLERY DOLLY GIRL 4 2011-12-09 12:50:00 4.15 12680.0 France
541907 581587 23255 CHILDRENS CUTLERY CIRCUS PARADE 4 2011-12-09 12:50:00 4.15 12680.0 France
541908 581587 22138 BAKING SET 9 PIECE RETROSPOT 3 2011-12-09 12:50:00 4.95 12680.0 France
  1. Perform a preliminary data inspection and data cleaning

    a. Check for missing data and formulate an apt strategy to treat them.

    b. Are there any duplicate data records? Remove them if present.

    c. Perform descriptive analytics on the given data.

Preprocessing¶

The following function is a personal preference I've used on past projects/assignments, as I find it more informative than `df.info()`. The `min` and `max` columns were a recent addition, after initially overlooking the negative values in this dataset.

In [5]:
def summary(df):
    """Per-column dtype, non-null count, unique count, null count, min, and max."""
    types = df.dtypes
    counts = df.apply(lambda x: x.count())
    uniques = df.apply(lambda x: x.unique().shape[0])
    nulls = df.apply(lambda x: x.isnull().sum())
    minimum = df.min()
    maximum = df.max()
    print('Data shape:', df.shape)

    cols = ['types', 'counts', 'uniques', 'nulls', 'min', 'max']
    summary = pd.concat([types, counts, uniques, nulls,
                         minimum, maximum], axis=1, sort=True)

    summary.columns = cols
    print('___________________________\nData types:')
    print(summary.types.value_counts())
    print('___________________________')
    return summary
In [6]:
details = summary(df)
display(details.sort_values(by='nulls', ascending=False))
Data shape: (541909, 8)
___________________________
Data types:
object            4
float64           2
datetime64[ns]    1
int64             1
Name: types, dtype: int64
___________________________
types counts uniques nulls min max
CustomerID float64 406829 4373 135080 12346.0 18287.0
Description object 540455 4224 1454 NaN NaN
Country object 541909 38 0 Australia Unspecified
InvoiceDate datetime64[ns] 541909 23260 0 2010-12-01 08:26:00 2011-12-09 12:50:00
InvoiceNo object 541909 25900 0 536365 C581569
Quantity int64 541909 722 0 -80995 80995
StockCode object 541909 4070 0 10002 m
UnitPrice float64 541909 1630 0 -11062.06 38970.0
In [7]:
df.describe()
Out[7]:
Quantity UnitPrice CustomerID
count 541909.000000 541909.000000 406829.000000
mean 9.552250 4.611114 15287.690570
std 218.081158 96.759853 1713.600303
min -80995.000000 -11062.060000 12346.000000
25% 1.000000 1.250000 13953.000000
50% 3.000000 2.080000 15152.000000
75% 10.000000 4.130000 16791.000000
max 80995.000000 38970.000000 18287.000000

Missing values

Almost 25% of our data lacks a **CustomerID** and must be dropped as it cannot be included in our RFM model.

In [8]:
df.isnull().mean()
Out[8]:
InvoiceNo      0.000000
StockCode      0.000000
Description    0.002683
Quantity       0.000000
InvoiceDate    0.000000
UnitPrice      0.000000
CustomerID     0.249267
Country        0.000000
dtype: float64

But as we'll see below, those undocumented customers come from our primary market, the UK, and could reflect cash payments. So no one needs to be fired... yet.

In [9]:
df[df.isnull().any(axis=1)].Country.value_counts(normalize=True)
Out[9]:
United Kingdom    0.989044
EIRE              0.005264
Hong Kong         0.002132
Unspecified       0.001495
Switzerland       0.000925
France            0.000489
Israel            0.000348
Portugal          0.000289
Bahrain           0.000015
Name: Country, dtype: float64

Duplicate records

Almost 1% of our data consists of duplicate or redundant records.

In [10]:
df.duplicated().mean()
Out[10]:
0.009721189350979592

Credited Accounts

In [11]:
df.loc[(df.UnitPrice < 0)]
Out[11]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
299983 A563186 B Adjust bad debt 1 2011-08-12 14:51:00 -11062.06 NaN United Kingdom
299984 A563187 B Adjust bad debt 1 2011-08-12 14:52:00 -11062.06 NaN United Kingdom

Cancelled Orders

In [12]:
df.loc[(df.Quantity < 0)]
Out[12]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
141 C536379 D Discount -1 2010-12-01 09:41:00 27.50 14527.0 United Kingdom
154 C536383 35004C SET OF 3 COLOURED FLYING DUCKS -1 2010-12-01 09:49:00 4.65 15311.0 United Kingdom
235 C536391 22556 PLASTERS IN TIN CIRCUS PARADE -12 2010-12-01 10:24:00 1.65 17548.0 United Kingdom
236 C536391 21984 PACK OF 12 PINK PAISLEY TISSUES -24 2010-12-01 10:24:00 0.29 17548.0 United Kingdom
237 C536391 21983 PACK OF 12 BLUE PAISLEY TISSUES -24 2010-12-01 10:24:00 0.29 17548.0 United Kingdom
... ... ... ... ... ... ... ... ...
540449 C581490 23144 ZINC T-LIGHT HOLDER STARS SMALL -11 2011-12-09 09:57:00 0.83 14397.0 United Kingdom
541541 C581499 M Manual -1 2011-12-09 10:28:00 224.69 15498.0 United Kingdom
541715 C581568 21258 VICTORIAN SEWING BOX LARGE -5 2011-12-09 11:57:00 10.95 15311.0 United Kingdom
541716 C581569 84978 HANGING HEART JAR T-LIGHT HOLDER -1 2011-12-09 11:58:00 1.25 17315.0 United Kingdom
541717 C581569 20979 36 PENCILS TUBE RED RETROSPOT -5 2011-12-09 11:58:00 1.25 17315.0 United Kingdom

10624 rows × 8 columns

91% of our transactions are from the UK, so we may want to consider isolating that market and analyzing it separately (a minimal split is sketched after the country breakdown below).

In [13]:
df.Country.value_counts(normalize=True)
Out[13]:
United Kingdom          0.914320
Germany                 0.017521
France                  0.015790
EIRE                    0.015124
Spain                   0.004674
Netherlands             0.004375
Belgium                 0.003818
Switzerland             0.003694
Portugal                0.002803
Australia               0.002323
Norway                  0.002004
Italy                   0.001482
Channel Islands         0.001399
Finland                 0.001283
Cyprus                  0.001148
Sweden                  0.000853
Unspecified             0.000823
Austria                 0.000740
Denmark                 0.000718
Japan                   0.000661
Poland                  0.000629
Israel                  0.000548
USA                     0.000537
Hong Kong               0.000531
Singapore               0.000423
Iceland                 0.000336
Canada                  0.000279
Greece                  0.000269
Malta                   0.000234
United Arab Emirates    0.000125
European Community      0.000113
RSA                     0.000107
Lebanon                 0.000083
Lithuania               0.000065
Brazil                  0.000059
Czech Republic          0.000055
Bahrain                 0.000035
Saudi Arabia            0.000018
Name: Country, dtype: float64
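If we did want to analyse the UK market on its own, a minimal split might look like the sketch below; the `df_uk` and `df_intl` names are illustrative and are not used elsewhere in this notebook.

```python
# Separate the UK market from international orders (illustrative only).
df_uk = df[df['Country'] == 'United Kingdom']
df_intl = df[df['Country'] != 'United Kingdom']
print(len(df_uk), len(df_intl))
```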

Drop duplicates

In [14]:
df = df.drop_duplicates()

Ignore Irrelevant Data

In [15]:
df = df[df.Quantity > 0]
df = df[df.UnitPrice > 0]

Drop null **CustomerID**'s / Keep null **Description**'s

In [16]:
df = df[pd.notnull(df['CustomerID'])]
df['CustomerID'] = df['CustomerID'].astype('int')

Add a simplified **Date** (year-month) column to group by

In [17]:
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])  # already datetime via parse_dates; harmless
df['Date'] = df['InvoiceDate'].dt.strftime('%Y-%m')    # year-month label for grouping

*Sanity Check*

In [18]:
df.duplicated().mean()
Out[18]:
0.0
In [19]:
df.isnull().mean()
Out[19]:
InvoiceNo      0.0
StockCode      0.0
Description    0.0
Quantity       0.0
InvoiceDate    0.0
UnitPrice      0.0
CustomerID     0.0
Country        0.0
Date           0.0
dtype: float64

Add **Revenue** column

In [20]:
df['Revenue'] = df['Quantity']*df['UnitPrice']
df.to_csv('eda.csv')

Exploratory Data Analysis¶

In [21]:
invoices_per_month = df.Date.value_counts().sort_index()  # chronological order
plt.figure(figsize=(12, 12))
sns.barplot(x=invoices_per_month.index, y=invoices_per_month.values, palette='viridis')
plt.title("Invoices per Month")
plt.show()
In [22]:
top_products = df['Description'].value_counts()[:25]
plt.figure(figsize=(12, 12))
sns.barplot(y=top_products.index,
            x=top_products.values,
            palette='viridis_r')
plt.title("Top selling products")
plt.show()

Cohort Analysis¶

  1. Cohort Analysis: A cohort is a group of subjects who share a defining characteristic. We can observe how a cohort behaves across time and compare it to other cohorts.

    a. Create month cohorts and analyse active customers for each cohort.

    b. Also Analyse the retention rate of customers. Comment.

Types of cohorts:

  • Time cohorts are customers who signed up for a product or service during a particular time frame. Analyzing these cohorts shows the customers' behavior depending on when they started using the company's products or services. The time frame may be monthly, quarterly, or even daily.
  • Behavior cohorts are customers who purchased a product or subscribed to a service in the past, grouped by the type of product or service they signed up for. Customers who signed up for basic-level services might have different needs than those who signed up for advanced services. Understanding the needs of the various cohorts can help a company design custom-made services or products for particular segments.
  • Size cohorts refer to the various sizes of customers who purchase a company's products or services. This categorization can be based on the amount of spending in some period after acquisition, or on the product type on which the customer spent most of their order amount in some period of time.

For cohort analysis, there are a few labels that we have to create:

  • Invoice period - A string representation of the year and month of a single transaction/invoice.
  • Cohort group - A string representation of the year and month of a customer's first purchase. This label is common across all invoices for a particular customer.
  • Cohort period/index - An integer representation of a customer's stage in its "lifetime": the number of months that have passed since the first purchase.
In [23]:
# Create a copy of dataframe for cohort analysis
cohort = df.copy()

Assign monthly acquisition cohort

Defining a cohort is the first step of cohort analysis. We will now create monthly cohorts based on the month in which each customer made their first transaction.

In [24]:
# Define a function that will parse the date
def get_month(x):
    return dt.datetime(x.year, x.month, 1)


# Create InvoiceMonth column
cohort['InvoiceMonth'] = cohort['InvoiceDate'].apply(get_month)

# Group by CustomerID and select the InvoiceMonth value
grouping = cohort.groupby('CustomerID')['InvoiceMonth']

# Assign a minimum InvoiceMonth value to the dataset
cohort['CohortMonth'] = grouping.transform('min')

Calculate time offset in months

Calculating time offset for each transaction allows you to report the metrics for each cohort in a comparable fashion.

First, we will create some variables that capture the integer value of years and months for Invoice and Cohort Date using the get_date_int() function

In [25]:
def get_date_int(df, column):
    year = df[column].dt.year
    month = df[column].dt.month
    return year, month


# Get the integers for date parts from the `InvoiceMonth` column
invoice_year, invoice_month = get_date_int(cohort, 'InvoiceMonth')

# Get the integers for date parts from the `CohortMonth` column
cohort_year, cohort_month = get_date_int(cohort, 'CohortMonth')

# Calculate difference in years
years_diff = invoice_year - cohort_year

# Calculate difference in months
months_diff = invoice_month - cohort_month

# Total month offset since the cohort month (+1 so the first month is CohortIndex 1)
cohort['CohortIndex'] = years_diff * 12 + months_diff + 1

cohort.head()
Out[25]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country Date Revenue InvoiceMonth CohortMonth CohortIndex
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010-12-01 08:26:00 2.55 17850 United Kingdom 2010-12 15.30 2010-12-01 2010-12-01 1
1 536365 71053 WHITE METAL LANTERN 6 2010-12-01 08:26:00 3.39 17850 United Kingdom 2010-12 20.34 2010-12-01 2010-12-01 1
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 2010-12-01 08:26:00 2.75 17850 United Kingdom 2010-12 22.00 2010-12-01 2010-12-01 1
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 2010-12-01 08:26:00 3.39 17850 United Kingdom 2010-12 20.34 2010-12-01 2010-12-01 1
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 2010-12-01 08:26:00 3.39 17850 United Kingdom 2010-12 20.34 2010-12-01 2010-12-01 1
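For reference, the same `CohortMonth` and `CohortIndex` labels can be derived more compactly with pandas Period arithmetic. This is only a sketch of an alternative (the `alt` frame is illustrative), not the approach used in this notebook:

```python
# Alternative sketch: monthly cohorts via Period arithmetic (illustrative only).
alt = df.copy()
alt['InvoiceMonth'] = alt['InvoiceDate'].dt.to_period('M')
alt['CohortMonth'] = alt.groupby('CustomerID')['InvoiceMonth'].transform('min')
# Subtracting two monthly Periods gives a month-offset object; .n extracts the integer.
alt['CohortIndex'] = (alt['InvoiceMonth'] - alt['CohortMonth']).apply(lambda off: off.n) + 1
```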

Calculate retention rate

Customer retention is a very useful metric for understanding how many customers are still active. It gives the percentage of active customers in each cohort relative to the cohort's initial size.

In [26]:
grouping = cohort.groupby(['CohortMonth', 'CohortIndex'])

# Count the number of unique values per customer ID
cohort_data = grouping['CustomerID'].apply(pd.Series.nunique).reset_index()

# Create a pivot
cohort_counts = cohort_data.pivot(
    index='CohortMonth', columns='CohortIndex', values='CustomerID')

# Select the first column and store it to cohort_sizes
cohort_sizes = cohort_counts.iloc[:, 0]

# Divide the cohort count by cohort sizes along the rows
retention = cohort_counts.divide(cohort_sizes, axis=0)*100

month_list = ["Dec '10", "Jan '11", "Feb '11", "Mar '11", "Apr '11",
              "May '11", "Jun '11", "Jul '11", "Aug '11", "Sep '11",
              "Oct '11", "Nov '11", "Dec '11"]

plt.figure(figsize=(12, 12))
sns.heatmap(data=retention,
            annot=True,
            vmin=0.0,
            cmap='viridis',
            vmax=list(retention.max().sort_values(ascending=False))[1]+3,
            fmt='.1f',
            linewidth=0.3,
            yticklabels=month_list)
plt.show()

Calculate average price per cohort

Now we will calculate the average price metric and analyze if there are any differences in shopping patterns across time and across cohorts.

In [27]:
# Create a groupby object and pass the monthly cohort and cohort index as a list
grouping = cohort.groupby(['CohortMonth', 'CohortIndex'])

# Calculate the average of the unit price column
cohort_data = grouping['UnitPrice'].mean()

# Reset the index of cohort_data
cohort_data = cohort_data.reset_index()

# Create a pivot
average_price = cohort_data.pivot(
    index='CohortMonth', columns='CohortIndex', values='UnitPrice')
average_price = average_price.round(1)
average_price.index = average_price.index.date

plt.figure(figsize=(12, 12))
sns.heatmap(data=average_price,
            annot=True,
            vmin=0.0,
            cmap='viridis',
            vmax=list(average_price.max().sort_values(ascending=False))[1]+3,
            fmt='.1f',
            linewidth=0.3,
            yticklabels=month_list)
plt.show()

Calculate average quantity per cohort

Now we will calculate the average quantity metric and analyze if there are any differences in shopping patterns across time and across cohorts.

In [28]:
# Create a groupby object and pass the monthly cohort and cohort index as a list
grouping = cohort.groupby(['CohortMonth', 'CohortIndex'])

# Calculate the average of the Quantity column
cohort_data = grouping['Quantity'].mean()

# Reset the index of cohort_data
cohort_data = cohort_data.reset_index()

# Create a pivot
average_quantity = cohort_data.pivot(
    index='CohortMonth', columns='CohortIndex', values='Quantity')

plt.figure(figsize=(12, 12))
sns.heatmap(data=average_quantity,
            annot=True,
            vmin=0.0,
            cmap='viridis',
            vmax=list(average_quantity.max().sort_values(
                ascending=False))[1]+3,
            fmt='.1f',
            linewidth=0.3,
            yticklabels=month_list)
plt.show()
In [29]:
dummies = pd.get_dummies(df['Date'])
df = df.join(dummies)
In [30]:
all_months = df.groupby('Date').agg({'CustomerID': 'nunique', 'Quantity': 'sum', 'Revenue': 'sum', '2010-12': 'sum',
                                     '2011-01': 'sum', '2011-02': 'sum', '2011-03': 'sum', '2011-04': 'sum', '2011-05': 'sum', '2011-06': 'sum',
                                     '2011-07': 'sum', '2011-08': 'sum', '2011-09': 'sum', '2011-10': 'sum', '2011-11': 'sum', '2011-12': 'sum'})

all_customers = df.groupby('CustomerID').agg({'Date': ['min', 'max', 'nunique'], 'Quantity': 'sum', 'Revenue': 'sum', '2010-12': 'sum',
                                              '2011-01': 'sum', '2011-02': 'sum', '2011-03': 'sum', '2011-04': 'sum', '2011-05': 'sum', '2011-06': 'sum',
                                              '2011-07': 'sum', '2011-08': 'sum', '2011-09': 'sum', '2011-10': 'sum', '2011-11': 'sum', '2011-12': 'sum'})

RFM Model¶

  1. Build an RFM model: Recency, Frequency, and Monetary value, based on customer behaviour. (A compact one-cell sketch of the full RFM table follows this list; the notebook then builds it step by step.)

    Recency is the time since a customer's last order, i.e. the number of days since the last purchase. For a website or an app, this could be interpreted as the last visit day or the last login time.

    Frequency is the number of purchases in a given period (3 months, 6 months, 1 year, and so on), i.e. how often a customer used the company's products. The bigger the value, the more engaged the customer. Does that alone make them a VIP? Not necessarily, because we also have to consider how much they actually paid for each purchase, which is the monetary value.

    Monetary value is the total amount of money a customer spent in that given period. It is what differentiates big spenders (MVPs or VIPs) from other customers.

    a. Calculate RFM metrics.

     i. Recency as the time in number of days since the last transaction  
     ii. Frequency as the count of purchases made  
     iii. Monetary value as the total amount spent  
    

    b. Build RFM Segments.

     i. Give Recency, Frequency, and Monetary scores individually by
     dividing them into quartiles.  
    
     Note: Rate "Recency" higher for customers who have been active more
     recently than for less recent customers, because each company wants its
     customers to be recent.  
    
     Rate "Frequency" and "Monetary Value" higher for larger values, because we
     want customers to spend more money and visit more often.  
    
     ii. Combine the three ratings to get an RFM segment (as a string)  
     iii. Get the RFM score by adding up the three ratings.  
    

    c. Analyse the RFM Segments by summarizing them and comment on the findings.
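As promised above, here is a compact sketch of the whole RFM table computed in a single groupby. The `snapshot_date` and `rfm_sketch` names are illustrative; `snapshot_date` is taken as one day after the last invoice, whereas the step-by-step build below uses the last invoice date itself as "today".

```python
# Compact RFM sketch: one aggregation per customer (illustrative only).
snapshot_date = df['InvoiceDate'].max() + pd.Timedelta(days=1)
rfm_sketch = df.groupby('CustomerID').agg(
    Recency=('InvoiceDate', lambda x: (snapshot_date - x.max()).days),
    Frequency=('InvoiceNo', 'nunique'),
    Monetary=('Revenue', 'sum'),
)
```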

Benefits of RFM analysis

  • Increased customer retention
  • Increased response rate
  • Increased conversion rate
  • Increased revenue

To perform RFM analysis, we divide customers into four equal groups according to the distribution of values for recency, frequency, and monetary value. Four equal groups across three variables create 64 (4x4x4) different customer segments, which is a manageable number.

For example, let’s look at a customer who:

  • is within the group who purchased most recently (R=4),
  • is within the group who purchased most frequently (F=4),
  • is within the group who spent the most (M=4)

This customer belongs to RFM segment 4-4-4 (Best Customers), (R=4, F=4, M=4)

Below is a table with key RFM segments:

| Segment | RFM | Description | Marketing |
| --- | --- | --- | --- |
| Best Customers | 444 | Customers who bought most recently, most often, and spent the most. | New products and loyalty programs |
| Loyal Customers | 344 | Customers who bought most recently. | Use R and M to further segment |
| Big Spenders | 334 | Customers who spent the most. | Market most expensive products |
| Almost Lost | 244 | Haven't purchased recently, but did make frequent purchases at high cost. | Aggressive price incentives |
| Lost Customers | 144 | Haven't purchased in some time, but used to make frequent purchases at high cost. | Aggressive price incentives |
| Lost Cheap Customers | 122 | Haven't purchased in some time; did not purchase often or spend very much. | Don't spend too much to reacquire |

Recency

Recency is the time since a customer's last order, i.e. the number of days since the last purchase. For a website or an app, this could be interpreted as the last visit day or the last login time.

In [32]:
df['InvoiceDate'].max()
Out[32]:
Timestamp('2011-12-09 12:50:00')
In [33]:
# Let's treat this date as 'today' for the rest of the analysis
current_date = dt.date(2011, 12, 9)

# Lets create a date column for date values only
df['Purchase_Date'] = df.InvoiceDate.dt.date

recency = df.groupby('CustomerID')['Purchase_Date'].max().reset_index()

# Create a separate column for this date.
recency = recency.assign(Current_Date=current_date)

# Compute the number of days since last purchase
recency['Recency'] = recency.Purchase_Date.apply(
    lambda x: (current_date - x).days)

recency.head()
Out[33]:
CustomerID Purchase_Date Current_Date Recency
0 12346 2011-01-18 2011-12-09 325
1 12347 2011-12-07 2011-12-09 2
2 12348 2011-09-25 2011-12-09 75
3 12349 2011-11-21 2011-12-09 18
4 12350 2011-02-02 2011-12-09 310
In [34]:
# Drop the irrelevant Date columns
recency.drop(['Purchase_Date', 'Current_Date'], axis=1, inplace=True)

Frequency

Frequency is the number of purchases in a given period (3 months, 6 months, 1 year, and so on), i.e. how often a customer used the company's products. The bigger the value, the more engaged the customer. Does that alone make them a VIP? Not necessarily, because we also have to consider how much they actually paid for each purchase, which is the monetary value.

In [35]:
frequency = df.groupby('CustomerID').InvoiceNo.nunique(
).reset_index().rename(columns={'InvoiceNo': 'Frequency'})

frequency.head()
Out[35]:
CustomerID Frequency
0 12346 1
1 12347 7
2 12348 4
3 12349 1
4 12350 1

Monetary

Monetary value is the total amount of money a customer spent in that given period. It is what differentiates big spenders (MVPs or VIPs) from other customers.

In [36]:
# Create a separate column for Total Cost of Unit purchased
df['Total_cost'] = df.Quantity * df.UnitPrice

monetary = df.groupby('CustomerID').Total_cost.sum(
).reset_index().rename(columns={'Total_cost': 'Monetary'})

monetary.head()
Out[36]:
CustomerID Monetary
0 12346 77183.60
1 12347 4310.00
2 12348 1797.24
3 12349 1757.55
4 12350 334.40

Now combine all three to form an aggregated RFM table

In [37]:
temp = recency.merge(frequency, on='CustomerID')
rfm_table = temp.merge(monetary, on='CustomerID')
rfm_table.head()
Out[37]:
CustomerID Recency Frequency Monetary
0 12346 325 1 77183.60
1 12347 2 7 4310.00
2 12348 75 4 1797.24
3 12349 18 1 1757.55
4 12350 310 1 334.40

Customer segments with RFM Model

The simplest way to create customer segments from an RFM model is to use quartiles. We assign a score from 1 to 4 to Recency, Frequency, and Monetary: four is the best/highest value, and one is the lowest/worst value. A final RFM score is calculated simply by combining the individual R, F, and M scores.
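An alternative to the hand-written scoring functions below is `pd.qcut`, which assigns quartile labels directly. This is only a sketch of that option (the `*_labels` names are illustrative); Frequency has many tied values, so it is ranked first to avoid duplicate bin edges.

```python
# Sketch: quartile scores via pd.qcut instead of custom functions (illustrative only).
r_labels = pd.qcut(rfm_table['Recency'], q=4, labels=[4, 3, 2, 1])  # more recent -> higher score
f_labels = pd.qcut(rfm_table['Frequency'].rank(method='first'), q=4, labels=[1, 2, 3, 4])
m_labels = pd.qcut(rfm_table['Monetary'], q=4, labels=[1, 2, 3, 4])
```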

In [38]:
# RFM Quantiles
quantiles = rfm_table.quantile(q=[0.25, 0.5, 0.75])

# Let's convert quartile information into a dictionary so that cutoffs can be picked up.
quantiles = quantiles.to_dict()

quantiles
Out[38]:
{'CustomerID': {0.25: 13813.25, 0.5: 15299.5, 0.75: 16778.75},
 'Recency': {0.25: 17.0, 0.5: 50.0, 0.75: 141.75},
 'Frequency': {0.25: 1.0, 0.5: 2.0, 0.75: 5.0},
 'Monetary': {0.25: 306.48249999999996, 0.5: 668.57, 0.75: 1660.5974999999999}}

Creation of RFM Segments

We will create two scoring functions, since a high recency value (many days since the last purchase) is bad, while high frequency and monetary values are good.

In [39]:
# Arguments: x = value, p = column name ('Recency', 'Frequency' or 'Monetary'), d = quantiles dict
def RScore(x, p, d):
    if x <= d[p][0.25]:
        return 4
    elif x <= d[p][0.50]:
        return 3
    elif x <= d[p][0.75]:
        return 2
    else:
        return 1
In [40]:
# Arguments: x = value, p = column name ('Recency', 'Frequency' or 'Monetary'), d = quantiles dict
def FMScore(x, p, d):
    if x <= d[p][0.25]:
        return 1
    elif x <= d[p][0.50]:
        return 2
    elif x <= d[p][0.75]:
        return 3
    else:
        return 4
In [41]:
rfm_segment = rfm_table.copy()
rfm_segment['R_Quartile'] = rfm_segment['Recency'].apply(
    RScore, args=('Recency', quantiles,))
rfm_segment['F_Quartile'] = rfm_segment['Frequency'].apply(
    FMScore, args=('Frequency', quantiles,))
rfm_segment['M_Quartile'] = rfm_segment['Monetary'].apply(
    FMScore, args=('Monetary', quantiles,))
rfm_segment.head()
Out[41]:
CustomerID Recency Frequency Monetary R_Quartile F_Quartile M_Quartile
0 12346 325 1 77183.60 1 1 4
1 12347 2 7 4310.00 4 4 4
2 12348 75 4 1797.24 2 3 4
3 12349 18 1 1757.55 3 1 4
4 12350 310 1 334.40 1 1 2
In [42]:
rfm_segment['RFM'] = rfm_segment.R_Quartile.map(
    str) + rfm_segment.F_Quartile.map(str) + rfm_segment.M_Quartile.map(str)
rfm_segment['Score'] = rfm_segment[[
    'R_Quartile', 'F_Quartile', 'M_Quartile']].sum(axis=1)
rfm_segment.head()
Out[42]:
CustomerID Recency Frequency Monetary R_Quartile F_Quartile M_Quartile RFM Score
0 12346 325 1 77183.60 1 1 4 114 6
1 12347 2 7 4310.00 4 4 4 444 12
2 12348 75 4 1797.24 2 3 4 234 9
3 12349 18 1 1757.55 3 1 4 314 8
4 12350 310 1 334.40 1 1 2 112 4
  • Best Recency score = 4 (most recently purchase)
  • Best Frequency score = 4 (most frequently purchase)
  • Best Monetary score = 4 (who spent the most)

RFM Segment allocation

Let's define customer segments, to the best of our knowledge, based on the RFM score, and assign one to each customer.

In [43]:
# Create a dictionary for each segment to map them against each customer
segment_dict = {
    # Highest frequency and monetary value with the lowest recency
    'Best Customers': '444',
    # High frequency and monetary value with good recency
    'Loyal Customers': '344',
    # High monetary value with good recency and frequency values
    'Big Spenders': '334',
    # Customers shopping less often now who used to shop a lot
    'Almost Lost': '244',
    # Customers who shopped long ago but used to shop a lot
    'Lost Customers': '144',
    # Customers who recently started shopping a lot but with less monetary value
    'Recent Customers': '443',
    # Customers who shopped long ago, with low frequency and monetary value
    'Lost Cheap Customers': '122'
}

# Swap the key and value of dictionary
dict_segment = dict(zip(segment_dict.values(), segment_dict.keys()))

# Allocate segments to each customer as per the RFM score mapping
rfm_segment['Segment'] = rfm_segment.RFM.map(lambda x: dict_segment.get(x))

# Allocate all remaining customers to others segment category
rfm_segment.Segment.fillna('others', inplace=True)

rfm_segment.sample(10)
Out[43]:
CustomerID Recency Frequency Monetary R_Quartile F_Quartile M_Quartile RFM Score Segment
2643 15948 8 2 955.24 4 2 3 423 9 others
3856 17628 81 1 75.75 2 1 1 211 4 others
3572 17232 2 2 417.77 4 2 2 422 8 others
1083 13812 42 3 539.00 3 3 2 332 8 others
3867 17643 373 1 101.55 1 1 1 111 3 others
12 12359 57 4 6310.03 2 3 4 234 9 others
2138 15258 168 2 623.16 1 2 2 122 5 Lost Cheap Customers
259 12664 8 9 4881.88 4 4 4 444 12 Best Customers
699 13273 113 1 138.98 2 1 1 211 4 others
292 12707 291 1 603.42 1 1 2 112 4 others

Let's look at records from the different customer segments to answer these questions for the retail business.

  • Who are my best customers?
  • Who are the biggest spenders?
  • Which customers are at the verge of churning?
  • Who are lost customers that you don’t need to pay much attention to?
  • Who are your loyal customers?
  • Which customers you must retain?
  • Who has the potential to be converted into more profitable customers?
  • Which group of customers is most likely to respond to your current campaign?
In [44]:
# Best Customers, whose recency, frequency, and monetary scores are all the highest.
rfm_segment[rfm_segment.RFM == '444'].sort_values(
    'Monetary', ascending=False).head()
Out[44]:
CustomerID Recency Frequency Monetary R_Quartile F_Quartile M_Quartile RFM Score Segment
1689 14646 1 73 280206.02 4 4 4 444 12 Best Customers
4201 18102 0 60 259657.30 4 4 4 444 12 Best Customers
3728 17450 8 46 194390.79 4 4 4 444 12 Best Customers
1879 14911 1 201 143711.17 4 4 4 444 12 Best Customers
1333 14156 9 55 117210.08 4 4 4 444 12 Best Customers
In [45]:
# Biggest spenders
rfm_segment[rfm_segment.RFM == '334'].sort_values(
    'Monetary', ascending=False).head()
Out[45]:
CustomerID Recency Frequency Monetary R_Quartile F_Quartile M_Quartile RFM Score Segment
152 12536 43 3 12601.83 3 3 4 334 10 Big Spenders
2773 16126 29 4 6287.77 3 3 4 334 10 Big Spenders
729 13316 37 4 5732.93 3 3 4 334 10 Big Spenders
154 12539 22 4 5568.35 3 3 4 334 10 Big Spenders
2899 16303 25 3 5360.63 3 3 4 334 10 Big Spenders
In [46]:
# Almost Lost, i.e. whose recency score has slipped despite high frequency and monetary scores
rfm_segment[rfm_segment.RFM == '244'].sort_values(
    'Monetary', ascending=False).head()
Out[46]:
CustomerID Recency Frequency Monetary R_Quartile F_Quartile M_Quartile RFM Score Segment
324 12744 51 7 21279.29 2 4 4 244 10 Almost Lost
459 12939 64 8 11581.80 2 4 4 244 10 Almost Lost
2814 16180 100 8 10254.18 2 4 4 244 10 Almost Lost
1903 14952 59 11 8099.49 2 4 4 244 10 Almost Lost
3222 16745 86 17 7180.70 2 4 4 244 10 Almost Lost
In [47]:
# Lost cheap customers that don't need much attention, whose recency, frequency, and monetary scores are all low
rfm_segment[rfm_segment.RFM == '122'].sort_values(
    'Monetary', ascending=False).head()
Out[47]:
CustomerID Recency Frequency Monetary R_Quartile F_Quartile M_Quartile RFM Score Segment
3487 17105 159 2 665.78 1 2 2 122 5 Lost Cheap Customers
4216 18121 149 2 653.30 1 2 2 122 5 Lost Cheap Customers
1621 14548 150 2 652.80 1 2 2 122 5 Lost Cheap Customers
2229 15384 169 2 642.41 1 2 2 122 5 Lost Cheap Customers
4286 18218 211 2 641.92 1 2 2 122 5 Lost Cheap Customers
In [48]:
# Loyal customers, whose purchase frequency is high
rfm_segment[rfm_segment.RFM == '344'].sort_values(
    'Monetary', ascending=False).head()
Out[48]:
CustomerID Recency Frequency Monetary R_Quartile F_Quartile M_Quartile RFM Score Segment
55 12415 24 21 124914.53 3 4 4 344 11 Loyal Customers
2702 16029 38 63 80850.84 3 4 4 344 11 Loyal Customers
453 12931 21 15 42055.96 3 4 4 344 11 Loyal Customers
1713 14680 25 16 28754.11 3 4 4 344 11 Loyal Customers
330 12753 22 6 21429.39 3 4 4 344 11 Loyal Customers
In [49]:
# Customers to retain: those whose monetary value and frequency were high but who haven't purchased as recently
rfm_segment[rfm_segment.RFM == '244'].sort_values(
    'Monetary', ascending=False).head()
Out[49]:
CustomerID Recency Frequency Monetary R_Quartile F_Quartile M_Quartile RFM Score Segment
324 12744 51 7 21279.29 2 4 4 244 10 Almost Lost
459 12939 64 8 11581.80 2 4 4 244 10 Almost Lost
2814 16180 100 8 10254.18 2 4 4 244 10 Almost Lost
1903 14952 59 11 8099.49 2 4 4 244 10 Almost Lost
3222 16745 86 17 7180.70 2 4 4 244 10 Almost Lost

K-Means Clustering¶

  1. Create clusters using k means clustering algorithm.

    a. Prepare the data for the algorithm.

         i. If the data is asymmetrically distributed, manage the skewness with
         appropriate transformation.  
    
         ii. Standardize / scale the data.  
    
    

    b. Decide the optimum number of clusters to be formed

    c. Analyse these clusters and comment on the results.

In [50]:
# Function to Check Skewness
def check_skew(df_skew, column, color):
    skew = stats.skew(df_skew[column])
    skewtest = stats.skewtest(df_skew[column])
    plt.title('Distribution of ' + column)
    sns.distplot(df_skew[column], color=color)
    print("{}'s: Skew: {}, : {}".format(column, skew, skewtest))
    return
In [51]:
# Check Skewness
plt.figure(figsize=(12, 12))
plt.subplot(3, 1, 1)
check_skew(rfm_table, 'Recency', "dodgerblue")
plt.subplot(3, 1, 2)
check_skew(rfm_table, 'Frequency', "deeppink")
plt.subplot(3, 1, 3)
check_skew(rfm_table, 'Monetary', "gold")
plt.tight_layout()
Recency's: Skew: 1.2453948317057284, : SkewtestResult(statistic=26.60351236555474, pvalue=6.181721752536432e-156)
Frequency's: Skew: 12.062857869870964, : SkewtestResult(statistic=74.62743613377035, pvalue=0.0)
Monetary's: Skew: 19.332680144099353, : SkewtestResult(statistic=85.01187149828888, pvalue=0.0)
In [52]:
# Removing Skewness
table_scaled = rfm_table.copy()
RFM_log = np.log(table_scaled+1)
plt.figure(figsize=(12, 12))
plt.subplot(3, 1, 1)
check_skew(RFM_log, 'Recency', "dodgerblue")
plt.subplot(3, 1, 2)
check_skew(RFM_log, 'Frequency', "deeppink")
plt.subplot(3, 1, 3)
check_skew(RFM_log, 'Monetary', "gold")
plt.tight_layout()
Recency's: Skew: -0.4670749364137121, : SkewtestResult(statistic=-11.982131984493975, pvalue=4.408387035293672e-33)
Frequency's: Skew: 1.2082335351584435, : SkewtestResult(statistic=26.04793003945421, pvalue=1.419991761586644e-149)
Monetary's: Skew: 0.3964614244871878, : SkewtestResult(statistic=10.299963600725635, pvalue=7.048796791830502e-25)
In [53]:
rfm_table.describe()
Out[53]:
CustomerID Recency Frequency Monetary
count 4338.000000 4338.000000 4338.000000 4338.000000
mean 15300.408022 92.059474 4.272015 2048.688081
std 1721.808492 100.012264 7.697998 8985.230220
min 12346.000000 0.000000 1.000000 3.750000
25% 13813.250000 17.000000 1.000000 306.482500
50% 15299.500000 50.000000 2.000000 668.570000
75% 16778.750000 141.750000 5.000000 1660.597500
max 18287.000000 373.000000 209.000000 280206.020000

Winsorization

Winsorizing replaces the smallest and largest values with less extreme observations, i.e. extreme values are capped rather than dropped. This is done to limit the effect of abnormal extreme values (outliers) on subsequent calculations.
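feature_engine's `Winsorizer` uses a Gaussian capping rule by default, so `fold=2` (as in the next cell) caps each variable at roughly its mean ± 2 standard deviations. A minimal sketch of that capping for a single column, assuming the default `capping_method='gaussian'`:

```python
# What Gaussian winsorizing with fold=2 does to one column (illustrative sketch).
col = RFM_log['Monetary']
lower, upper = col.mean() - 2 * col.std(), col.mean() + 2 * col.std()
capped = col.clip(lower=lower, upper=upper)  # values beyond the limits are capped at them
```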

In [54]:
windsoriser = Winsorizer(tail='both',  # cap left, right or both tails
                         fold=2,
                         variables=['Recency', 'Frequency', 'Monetary']
                         )
windsoriser.fit(RFM_log)
Out[54]:
Winsorizer(fold=2, tail='both', variables=['Recency', 'Frequency', 'Monetary'])
In [55]:
RFM_log = windsoriser.transform(RFM_log)
In [56]:
scaler = StandardScaler()
scaler.fit(RFM_log)
RFM_scaled = scaler.transform(RFM_log)
RFM_scaled = pd.DataFrame(RFM_scaled, columns=RFM_log.columns)
RFM_scaled.head()
Out[56]:
CustomerID Recency Frequency Monetary
0 -1.833336 1.463293 -1.016221 2.146328
1 -1.832623 -2.020149 1.218827 1.522186
2 -1.831909 0.381349 0.461066 0.781226
3 -1.831196 -0.648681 -1.016221 0.762313
4 -1.830483 1.428294 -1.016221 -0.641811

Optimization

The number of clusters (k) is the most important hyperparameter in K-Means clustering. If we already know how many clusters to group the data into, there is no need to tune k.

If we have no idea about the optimal value of k, there are various methods to find a good value. Here we will explore two of them:

  • Elbow Method
  • Silhouette Method

The Elbow Method is an empirical way to find a good number of clusters for a dataset. We pick a range of candidate values of k and apply K-Means clustering for each of them. For each k we compute the average distance of each point to its assigned centroid (the distortion) and plot it against k. We pick the value of k at the "elbow", where the curve stops dropping sharply.

The Silhouette Method is another way to choose the number of clusters and to interpret and validate the consistency of the clustering. It computes a silhouette coefficient for each point, measuring how similar that point is to its own cluster compared to other clusters, and provides a succinct summary of how well each object has been classified.

The silhouette value measures how similar an object is to its own cluster (cohesion) compared to other clusters (separation). It ranges from -1 to 1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. If most objects have a high value, the clustering configuration is appropriate; if many points have a low or negative value, the configuration may have too many or too few clusters.
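Concretely, for a sample $i$, let $a(i)$ be its mean distance to the other points in its own cluster and $b(i)$ its mean distance to the points of the nearest other cluster. The silhouette coefficient is then

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}},$$

and the score reported below is the mean of $s(i)$ over all samples.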

In [57]:
from scipy.spatial.distance import cdist
distortions = []
inertias = []
mapping1 = {}
mapping2 = {}
K = range(2, 12)

for k in K:
    # Build and fit the model
    kmeanModel = KMeans(n_clusters=k).fit(RFM_scaled)

    # Distortion: mean distance of each point to its nearest centroid
    distortion = sum(np.min(cdist(RFM_scaled, kmeanModel.cluster_centers_,
                                  'euclidean'), axis=1)) / RFM_scaled.shape[0]
    distortions.append(distortion)
    inertias.append(kmeanModel.inertia_)

    mapping1[k] = distortion
    mapping2[k] = kmeanModel.inertia_
In [58]:
plt.figure(figsize=(12, 12))
plt.plot(K, distortions, 'bx-')
plt.xlabel('Values of K')
plt.ylabel('Distortion')
plt.title('The Elbow Method using Distortion')
plt.show()
In [59]:
plt.figure(figsize=(12, 12))
plt.plot(K, inertias, 'bx-')
plt.xlabel('Values of K')
plt.ylabel('Inertia')
plt.title('The Elbow Method with Inertia')
plt.show()
In [60]:
from sklearn.metrics import silhouette_score
wcss_silhouette = []
for k in K:
    km = KMeans(n_clusters=k, random_state=1).fit(RFM_scaled)
    preds = km.predict(RFM_scaled)
    silhouette = silhouette_score(RFM_scaled, preds)
    wcss_silhouette.append(silhouette)
    print("Silhouette score for number of cluster(s) {}: {}".format(k, silhouette))

plt.figure(figsize=(12, 12))
plt.title("The silhouette coefficient method \nfor determining number of clusters\n", fontsize=16)
plt.scatter(x=[i for i in range(2, 12)],
            y=wcss_silhouette, s=150, edgecolor='k')
plt.grid(True)
plt.xlabel("Number of clusters", fontsize=14)
plt.ylabel("Silhouette score", fontsize=15)
plt.xticks([i for i in range(2, 12)], fontsize=14)
plt.yticks(fontsize=15)
plt.show()
Silhouette score for number of cluster(s) 2: 0.34305376691087064
Silhouette score for number of cluster(s) 3: 0.28105352172658254
Silhouette score for number of cluster(s) 4: 0.25648293481928397
Silhouette score for number of cluster(s) 5: 0.25293679864881496
Silhouette score for number of cluster(s) 6: 0.2569378664274715
Silhouette score for number of cluster(s) 7: 0.24420527452086455
Silhouette score for number of cluster(s) 8: 0.24235651303445258
Silhouette score for number of cluster(s) 9: 0.2373907626700457
Silhouette score for number of cluster(s) 10: 0.23511719757718036
Silhouette score for number of cluster(s) 11: 0.22863163520649685
In [61]:
def kmeans(normalised_df_rfm, clusters_number, original_df_rfm):

    kmeans = KMeans(n_clusters=clusters_number, random_state=1)
    kmeans.fit(normalised_df_rfm)

    # Extract cluster labels
    cluster_labels = kmeans.labels_

    # Create a cluster label column in original dataset
    df_new = original_df_rfm.assign(Cluster=cluster_labels)

    # Initialise TSNE
    model = TSNE(random_state=1)
    transformed = model.fit_transform(df_new)

    # Plot t-SNE
    plt.title('Flattened Graph of {} Clusters'.format(clusters_number))
    sns.scatterplot(x=transformed[:, 0], y=transformed[:, 1],
                    hue=cluster_labels, style=cluster_labels, palette="Set1")

    return df_new


plt.figure(figsize=(12, 12))
plt.subplot(3, 1, 1)
df_rfm_k3 = kmeans(RFM_scaled, 3, rfm_table)
plt.subplot(3, 1, 2)
df_rfm_k4 = kmeans(RFM_scaled, 4, rfm_table)
plt.subplot(3, 1, 3)
df_rfm_k5 = kmeans(RFM_scaled, 5, rfm_table)
plt.tight_layout()

Snake plots

  • Market research technique to compare different segments
  • Visual representation of each segment's attributes
  • Plot each cluster's average normalized values of each attribute

To plot this, we need the normalized data with all the attributes stacked into a single column. We will use pandas' melt on the normalized RFM dataframe to achieve that.

In [62]:
def snake_plot(normalised_df_rfm, df_rfm_kmeans, df_rfm_original):
    normalised_df_rfm = pd.DataFrame(normalised_df_rfm,
                                     index=df_rfm_original.index,
                                     columns=df_rfm_original.columns)
    normalised_df_rfm['Cluster'] = df_rfm_kmeans['Cluster']

    # Melt data into long format
    df_melt = pd.melt(normalised_df_rfm.reset_index(),
                      id_vars=['CustomerID', 'Cluster'],
                      value_vars=['Recency', 'Frequency', 'Monetary'],
                      var_name='Metric',
                      value_name='Value')
    plt.xlabel('Metric')
    plt.ylabel('Value')
    sns.pointplot(data=df_melt, x='Metric', y='Value',
                  hue='Cluster', palette="Set1")

    return
In [63]:
plt.figure(figsize=(12, 12))
plt.subplot(3, 1, 1)
snake_3 = snake_plot(RFM_scaled, df_rfm_k3, rfm_table)
plt.subplot(3, 1, 2)
snake_4 = snake_plot(RFM_scaled, df_rfm_k4, rfm_table)
plt.subplot(3, 1, 3)
snake_5 = snake_plot(RFM_scaled, df_rfm_k5, rfm_table)
plt.tight_layout()
In [64]:
kmeans = KMeans(n_clusters=4, random_state=1)
kmeans.fit(RFM_scaled)
cluster_labels = kmeans.labels_
kmeans
Out[64]:
KMeans(n_clusters=4, random_state=1)
In [65]:
# Assign the clusters as column to each customer
Cluster_table = rfm_segment.assign(Cluster=cluster_labels)
# Save RFM Clusters to csv
Cluster_table.to_csv('clusters.csv')
# Check counts of records assigned to different clusters
Cluster_table.Cluster.value_counts()
Out[65]:
0    1234
2    1153
3    1033
1     918
Name: Cluster, dtype: int64

Scatter plots

These will help us visualize the division of customers into different segments based on the RFM attributes.

In [66]:
# Plot two-dimensional scatter plots for each pair of attributes.
X = RFM_scaled.iloc[:, 0:3].values
count = X.shape[1]
for i in range(0, count):
    for j in range(i+1, count):
        plt.figure(figsize=(12, 12))
        plt.scatter(X[cluster_labels == 0, i], X[cluster_labels ==
                    0, j], s=10, c='red', label='Cluster0')
        plt.scatter(X[cluster_labels == 1, i], X[cluster_labels ==
                    1, j], s=10, c='blue', label='Cluster1')
        plt.scatter(X[cluster_labels == 2, i], X[cluster_labels ==
                    2, j], s=10, c='green', label='Cluster2')
        plt.scatter(X[cluster_labels == 3, i], X[cluster_labels ==
                    3, j], s=10, c='purple', label='Cluster3')
        plt.scatter(kmeans.cluster_centers_[:, i], kmeans.cluster_centers_[
                    :, j], s=50, c='black', label='Centroids')
        plt.xlabel(RFM_scaled.columns[i])
        plt.ylabel(RFM_scaled.columns[j])
        plt.legend()
        plt.show()

Most of the customers belong to clusters 0 and 2, although the distribution is fairly even.

In [67]:
Cluster_table.sample(10)
Out[67]:
CustomerID Recency Frequency Monetary R_Quartile F_Quartile M_Quartile RFM Score Segment Cluster
2101 15214 1 8 1661.44 4 4 4 444 12 Best Customers 1
2967 16392 269 1 217.89 1 1 1 111 3 others 2
374 12822 70 2 948.88 2 2 3 223 7 others 0
728 13314 1 3 775.94 4 3 3 433 10 others 1
2934 16350 21 4 1116.47 3 3 3 333 9 others 3
3731 17454 192 4 517.53 1 3 2 132 6 others 2
1133 13880 21 13 3219.77 3 4 4 344 11 Loyal Customers 1
2720 16054 145 1 783.90 1 1 3 113 5 others 2
3312 16858 368 1 375.69 1 1 2 112 4 others 2
1250 14044 26 4 646.42 3 3 2 332 8 others 0

The RFM score is low to mid for customers in clusters 0 and 2. Comparatively, customers in clusters 1 and 3 have high RFM scores, along with above-average recency and frequency scores.

Let's check out the customers in each cluster more closely...

In [68]:
Cluster_table[Cluster_table.Cluster == 3].sample(5)
Out[68]:
CustomerID Recency Frequency Monetary R_Quartile F_Quartile M_Quartile RFM Score Segment Cluster
3669 17373 40 4 646.92 3 3 2 332 8 others 3
4314 18251 87 1 4314.72 2 1 4 214 7 others 3
2547 15809 36 5 515.77 3 3 2 332 8 others 3
3475 17086 7 6 2050.08 4 4 4 444 12 Best Customers 3
3966 17770 198 5 1143.27 1 3 3 133 7 others 3

The RFM scores for cluster 3 customers are roughly average or above. These are more or less the customers who shop occasionally and may respond to targeted campaigns.

In [69]:
Cluster_table[Cluster_table.Cluster == 2].sample(5)
Out[69]:
CustomerID Recency Frequency Monetary R_Quartile F_Quartile M_Quartile RFM Score Segment Cluster
4076 17926 133 2 397.29 2 2 2 222 6 others 2
3494 17117 288 1 116.20 1 1 1 111 3 others 2
2510 15758 24 1 205.25 3 1 1 311 5 others 2
3684 17391 163 2 508.80 1 2 2 122 5 Lost Cheap Customers 2
2859 16239 56 2 414.20 2 2 2 222 6 others 2

Cluster 2 contains customers with relatively low recency, frequency, and monetary scores. Further study and segmentation are needed for this cluster.

In [70]:
Cluster_table[Cluster_table.Cluster == 1].sample(5)
Out[70]:
CustomerID Recency Frequency Monetary R_Quartile F_Quartile M_Quartile RFM Score Segment Cluster
1583 14502 19 11 2429.83 3 4 4 344 11 Loyal Customers 1
714 13297 7 4 2089.85 4 3 4 434 11 others 1
600 13138 22 6 962.39 3 4 3 343 10 others 1
3819 17576 8 19 3564.83 4 4 4 444 12 Best Customers 1
1734 14708 2 3 1126.37 4 3 3 433 10 others 1

Cluster 1 holds the customer segment with very high monetary value along with good frequency and recency scores. These are the most valuable customers to the firm. They should be checked in on periodically to address their concerns.

In [71]:
Cluster_table[Cluster_table.Cluster == 0].sample(5)
Out[71]:
CustomerID Recency Frequency Monetary R_Quartile F_Quartile M_Quartile RFM Score Segment Cluster
1397 14242 234 2 280.55 1 2 1 121 4 others 0
995 13692 24 3 1488.32 3 3 3 333 9 others 0
135 12516 77 2 1312.06 2 2 3 223 7 others 0
1556 14470 11 2 461.19 4 2 2 422 8 others 0
734 13321 72 2 567.36 2 2 2 222 6 others 0

The last cluster, 0, contains a large number of customers who account for low to mid-range value for the firm. Most of them are in the lost segment or on the verge of churning. More than anything else, we need to work out what differentiates clusters 0 and 2 and refine a marketing strategy accordingly.

So the behaviour of customers in the different clusters is evident from the plots above:

  • Customers in cluster 1 have the highest frequency and monetary values along with the lowest recency. These are the most valuable customers to the firm.
  • Customers in cluster 3 show a shopping pattern similar to cluster 1, but with lower values on average for each attribute. These are the loyal customers.
  • Customers in cluster 0 show lower frequency and monetary values, and higher recency, compared with cluster 2. Some of them are occasional customers who shop during shopping festivals. They are more likely to respond to a campaign and have the potential to turn profitable.
  • Customers in cluster 2 show the opposite behaviour to cluster 1: these are the highly unprofitable or lost customers who don't need much attention, although targeted offers may bring some of them back to shopping.

Heat Map

We will use a heat map to visualize the relative importance of each attribute in the four customer segments, i.e. clusters. The importance score divides each cluster's attribute average by the population average and subtracts 1 (which ensures 0 is returned when the cluster average equals the population average).

The farther a ratio is from 0, the more important that attribute is for a segment relative to the total population.

In [72]:
# Assign Cluster labels to RFM table
rfm_table_cluster = rfm_table.assign(Cluster=cluster_labels)

# Average attributes for each cluster
cluster_avg = rfm_table_cluster.groupby(['Cluster']).mean()

# Calculate the population average
population_avg = rfm_table.mean()

# Relative importance of attributes: cluster average / population average - 1
relative_imp = cluster_avg / population_avg - 1

plt.figure(figsize=(12, 12))
plt.title('Relative importance of attributes')
sns.heatmap(data=relative_imp, annot=True, fmt='.2f', cmap='RdYlGn')
plt.show()

Dashboard¶

  1. Create a dashboard in Tableau by choosing appropriate chart types and metrics useful for the business. The dashboard must include the following:

    a) Country-wise analysis to demonstrate average spend. Use a bar chart to show monthly figures.

    b) Bar graph of the top 15 most-ordered products, showing the number of units sold.

    c) Bar graph showing the count of orders vs. hour of the day. What are the peak hours according to your chart?

    d) Plot the distribution of RFM values using histograms and frequency charts.

    e) Plot error (cost) vs. number of clusters selected.

    f) Compare the RFM values of the clusters using a heat map.