Customer Segmentation

The problem:

In this project, we will analyze a dataset containing various customers’ annual spending amounts (reported in monetary units) across diverse product categories, with the aim of uncovering internal structure. One goal of this project is to best describe the variation in the different types of customers that a wholesale distributor interacts with. Doing so would equip the distributor with insight into how to best structure their delivery service to meet the needs of each customer.

The dataset for this project can be found on the UCI Machine Learning Repository. For the purposes of this project, the features 'Channel' and 'Region' will be excluded in the analysis — with focus instead on the six product categories recorded for customers.

Let’s set up our environment and load the dataset.


import numpy as np
import pandas as pd
import seaborn as sns

# Import supplementary visualizations code visuals.py
import visuals as vs

# Pretty display for notebooks
from matplotlib import pyplot as plt

# Load the wholesale customers dataset
try:
    data = pd.read_csv("customers.csv")
    data.drop(['Region', 'Channel'], axis=1, inplace=True)
    print("Wholesale customers dataset has {} samples with {} features each.".format(*data.shape))
except Exception:
    print("Dataset could not be loaded. Is the dataset missing?")

Wholesale customers dataset has 440 samples with 6 features each.

Let’s pick three customers at random and explore their data in more detail.


# Select three indices of your choice you wish to sample from the dataset
indices = [8, 17, 371]

# Create a DataFrame of the chosen samples
samples = pd.DataFrame(data.loc[indices], columns=data.keys()).reset_index(drop=True)
print("Chosen samples of wholesale customers dataset:")
print(samples)

print("Difference from the mean of the dataset:")
print(samples - data.mean().round())

print("Difference from the median of the dataset:")
print(samples - data.median().round())

print("Percentile visualization:")

# Heatmap of each sample's percentile rank within each feature
perc = data.rank(pct=True)
perc = 100 * perc.round(decimals=3)
perc = perc.iloc[indices]
sns.heatmap(perc, vmin=1, vmax=99, annot=True)

# Bar plot of the chosen samples alongside the dataset mean
samples_bar = pd.concat([samples, data.describe().loc[['mean']]])
samples_bar.index = indices + ['mean']
_ = samples_bar.plot(kind='bar', figsize=(14, 6))

 

 

[Figure: percentile heatmap and bar plot for customers 8, 17, and 371]

Customer 8 and Customer 17 seem to represent similar types of establishments, so I’m going to try a different set of indices. I chose [5, 31, 301].

[Figure: percentile heatmap and bar plot for customers 5, 31, and 301]

Customer 5 appears to be some kind of corner shop stocking fast-moving supplies for the neighborhood. Customer 31 appears to be a smaller store in a smaller neighborhood. Customer 301 appears to be a supermarket stocking large amounts of Grocery, Milk, and Detergents_Paper products.

One interesting thought to consider is whether one (or more) of the six product categories is actually relevant for understanding customer purchasing. That is to say, is it possible to determine whether customers purchasing some amount of one category of products will necessarily purchase some proportional amount of another category of products? We can make this determination quite easily by training a supervised regression learner on a subset of the data with one feature removed, and then scoring how well that model can predict the removed feature.

Here’s how I will proceed:

  • Assign new_data a copy of the data by removing a feature of choice using the DataFrame.drop function.
  • Use sklearn.model_selection.train_test_split to split the dataset into training and testing sets.
    • Use the removed feature as your target label. Set a test_size of 0.25 and set a random_state.
  • Import a decision tree regressor, set a random_state, and fit the learner to the training data.
  • Report the prediction score of the testing set using the regressor’s score function.

# For each feature, drop it from the data and try to predict it from the remaining features
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

features = data.columns

for feature in features:
    # Make a copy of the DataFrame, using the 'drop' function to drop the given feature
    new_data = data.drop(feature, axis=1)
    target = data[feature]

    # Split the data into training and testing sets using the given feature as the target
    X_train, X_test, y_train, y_test = train_test_split(
        new_data, target, test_size=0.25, random_state=0)

    # Create a decision tree regressor and fit it to the training set
    regressor = DecisionTreeRegressor(random_state=0)
    regressor.fit(X_train, y_train)

    # Report the R^2 score of the prediction using the testing set
    score = regressor.score(X_test, y_test)
    print(feature, score)

Here’s the output:

Fresh               -0.252469807688
Milk                 0.365725292736
Grocery              0.602801978878
Frozen               0.253973446697
Detergents_Paper     0.728655181254
Delicatessen       -11.6636871594

To get a better understanding of the dataset, we can construct a scatter matrix of each of the six product features present in the data. If you found that the feature you attempted to predict above is relevant for identifying a specific customer, then the scatter matrix below may not show any correlation between that feature and the others. Conversely, if you believe that feature is not relevant for identifying a specific customer, the scatter matrix might show a correlation between that feature and another feature in the data.


# Produce a scatter matrix for each pair of features in the data
pd.plotting.scatter_matrix(data, alpha=0.3, figsize=(14, 8), diagonal='kde');

 

 

[Figure: scatter matrix of the six product features]

Here’s a second way to visualize feature distributions:

# Correlation matrix of the six features
corr = data.corr()
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
with sns.axes_style("white"):
    ax = sns.heatmap(corr, mask=mask, square=True, annot=True, cmap='RdBu')

 

 

[Figure: correlation heatmap of the six product features]

There are three pairs with a high correlation: Grocery–Detergents_Paper, Grocery–Milk, and Milk–Detergents_Paper.

The highest correlation is between Grocery and Detergents_Paper.

The data is not normally distributed. It is skewed to the right.
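A quick numeric check of the skew backs this up (this snippet is my own addition, not part of the original analysis):

# Skewness of each feature; large positive values indicate a long right tail
# (a quick check added here, not in the original write-up)
print(data.skew().round(2))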

If the data is not normally distributed, especially if the mean and median vary significantly (indicating a large skew), it is most often appropriate to apply a non-linear scaling, particularly for financial data. One way to achieve this scaling is the Box-Cox transformation, which calculates the best power transformation of the data that reduces skewness. A simpler approach that works well in most cases is to apply the natural logarithm.
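As an aside, here is a minimal sketch of the Box-Cox alternative using SciPy; it is not used in the rest of this analysis, which relies on the natural logarithm below.

# Hypothetical alternative: Box-Cox power transform of each feature with SciPy
# (not used elsewhere in this project; shown only for comparison with np.log)
from scipy import stats

boxcox_data = data.copy()
for feature in data.columns:
    # boxcox returns the transformed values and the fitted lambda parameter;
    # it requires strictly positive inputs, which holds for these spending amounts
    boxcox_data[feature], fitted_lambda = stats.boxcox(data[feature])
    print(feature, "lambda:", round(fitted_lambda, 3))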


# Scale the data using the natural logarithm
log_data = np.log(data)

# Scale the sample data using the natural logarithm
log_samples = np.log(samples)

# Produce a scatter matrix for each pair of newly-transformed features
pd.plotting.scatter_matrix(log_data, alpha=0.3, figsize=(14, 8), diagonal='kde')

[Figure: scatter matrix of the log-transformed features]

After applying a natural logarithm scaling to the data, the distribution of each feature should appear much more normal. For any pairs of features we identified earlier as being correlated, we can observe here whether that correlation is still present (and whether it is now stronger or weaker than before).
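To check this numerically rather than only visually, the strongest correlation can be compared before and after the transform (again, a quick addition of mine rather than part of the original write-up):

# Compare the Grocery/Detergents_Paper correlation before and after the log transform
# (a quick check added here, not in the original write-up)
print("Raw:", data.corr().loc['Grocery', 'Detergents_Paper'])
print("Log:", log_data.corr().loc['Grocery', 'Detergents_Paper'])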


# Display the log-transformed sample data
print(log_samples)

I’m going to use Tukey’s method to identify outliers in this dataset: a data point is flagged as an outlier for a feature if it lies more than 1.5 times the interquartile range (IQR) beyond that feature’s quartiles.


# Needed for rich DataFrame output in a notebook
from IPython.display import display

# For each feature find the data points with extreme high or low values
for feature in log_data.keys():

    # Calculate Q1 (25th percentile of the data) for the given feature
    Q1 = np.percentile(log_data[feature], 25)

    # Calculate Q3 (75th percentile of the data) for the given feature
    Q3 = np.percentile(log_data[feature], 75)

    # Use the interquartile range to calculate an outlier step (1.5 times the interquartile range)
    step = 1.5 * (Q3 - Q1)

    # Display the outliers
    print("Data points considered outliers for the feature '{}':".format(feature))
    display(log_data[~((log_data[feature] >= Q1 - step) & (log_data[feature] <= Q3 + step))])

# Select the indices for data points you wish to remove
outliers = [75]

# Remove the outliers, if any were specified
good_data = log_data.drop(log_data.index[outliers]).reset_index(drop=True)

Now that the data has been scaled to a more normal distribution and has had any necessary outliers removed, we can apply PCA to the good_data to discover which dimensions of the data best maximize the variance of the features involved. In addition to finding these dimensions, PCA will also report the explained variance ratio of each dimension, i.e. how much variance within the data is explained by that dimension alone.

# Apply PCA to the good data with the same number of dimensions as features
from sklearn.decomposition import PCA

pca = PCA(n_components=len(good_data.columns))
pca.fit(good_data)

# Apply a PCA transformation to the sample log-data
pca_samples = pca.transform(log_samples)

# Generate PCA results plot
pca_results = vs.pca_results(good_data, pca)

[Figure: explained variance and feature weights for each principal component]

When using principal component analysis, one of the main goals is to reduce the dimensionality of the data — in effect, reducing the complexity of the problem. Dimensionality reduction comes at a cost: Fewer dimensions used implies less of the total variance in the data is being explained. Because of this, the cumulative explained variance ratio is extremely important for knowing how many dimensions are necessary for the problem. Additionally, if a significant amount of variance is explained by only two or three dimensions, the reduced data can be visualized afterwards.
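Before committing to two dimensions, it is worth checking how much variance the first few components capture. The sketch below reuses the six-component PCA fitted above (my addition, not part of the original write-up):

# Cumulative explained variance of the full six-component PCA fitted above
# (a quick check added here, not in the original write-up)
cumulative = np.cumsum(pca.explained_variance_ratio_)
for i, ratio in enumerate(cumulative, start=1):
    print("First {} component(s): {:.1%} of the variance".format(i, ratio))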


# Fit PCA to the good data using only two dimensions
pca = PCA(n_components=2)
pca.fit(good_data)

# Apply a PCA transformation to the good data
reduced_data = pca.transform(good_data)

# Apply a PCA transformation to the sample log-data
pca_samples = pca.transform(log_samples)

# Create a DataFrame for the reduced data
reduced_data = pd.DataFrame(reduced_data, columns = ['Dimension 1', 'Dimension 2'])

A biplot can help us interpret the reduced dimensions of the data, and discover relationships between the principal components and original features.


# Create a biplot
vs.biplot(good_data, reduced_data, pca)

[Figure: biplot of the reduced data with the original feature axes projected onto the first two dimensions]

From the biplot, Fresh, Frozen, and Delicatessen are most strongly associated with Dimension 1, while Detergents_Paper, Grocery, and Milk are most strongly associated with Dimension 2.
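One way to double-check this reading of the biplot is to inspect the component loadings directly (a quick sketch of mine, not part of the original write-up):

# Loadings of the two retained components on the original features
# (a quick check added here, not in the original write-up)
loadings = pd.DataFrame(pca.components_, columns=good_data.columns,
                        index=['Dimension 1', 'Dimension 2'])
print(loadings.round(3))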

We can choose between two algorithms for our customer clustering model: K-Means clustering and Gaussian Mixture Model clustering.

I will start with K-Means, since I don’t yet have a complete understanding of the dataset and K-Means is usually a good first algorithm to try for clustering.


# Apply your clustering algorithm of choice to the reduced data
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

clusters = range(2, 11)
best = (0, 0.0)

for each in clusters:
    clusterer = KMeans(n_clusters=each, random_state=0).fit(reduced_data)

    # Predict the cluster for each data point
    preds = clusterer.predict(reduced_data)

    # Find the cluster centers
    centers = clusterer.cluster_centers_

    # Predict the cluster for each transformed sample data point
    sample_preds = clusterer.predict(pca_samples)

    # Calculate the mean silhouette coefficient for the number of clusters chosen
    score = silhouette_score(reduced_data, preds)
    print("Clusters:", each, "score:", score)
    if score > best[1]:
        best = (each, score)

# Refit using the best-scoring number of clusters
clusterer = KMeans(n_clusters=best[0], random_state=0).fit(reduced_data)

# Predict the cluster for each data point
preds = clusterer.predict(reduced_data)

# Find the cluster centers
centers = clusterer.cluster_centers_

# Predict the cluster for each transformed sample data point
sample_preds = clusterer.predict(pca_samples)

# Calculate the mean silhouette coefficient for the number of clusters chosen
score = silhouette_score(reduced_data, preds)

print("The best n of Clusters:", best[0], "\nScore:", best[1])

Clusters: 2 score: 0.420795773671
Clusters: 3 score: 0.396034911432
Clusters: 4 score: 0.331704488262
Clusters: 5 score: 0.349383709753
Clusters: 6 score: 0.361735087656
Clusters: 7 score: 0.363059697196
Clusters: 8 score: 0.360593881403
Clusters: 9 score: 0.354722206188
Clusters: 10 score: 0.349422838857
The best n of Clusters: 2

Score: 0.420795773671

[Figure: cluster visualization of the reduced data with the two cluster centers]
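For completeness, the Gaussian Mixture Model alternative mentioned earlier could be scored in the same way. Here is a minimal sketch, assuming scikit-learn’s GaussianMixture (this comparison is my addition and is not part of the analysis above):

# Hypothetical comparison with the GMM alternative mentioned earlier
# (added here for illustration; not part of the original analysis)
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score

gmm = GaussianMixture(n_components=2, random_state=0).fit(reduced_data)
gmm_preds = gmm.predict(reduced_data)
print("GMM silhouette score:", silhouette_score(reduced_data, gmm_preds))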

Each cluster has a central point. These centers (or means) are not specifically data points from the data, but averages of all the data points predicted in the respective clusters. For the problem of creating customer segments, a cluster’s center point corresponds to the average customer of that segment. Since the data is currently reduced in dimension and scaled by a logarithm, we can recover the representative customer spending from these data points by applying the inverse transformations.


# Inverse transform the centers back into the log-data space
log_centers = pca.inverse_transform(centers)

# Exponentiate the centers to undo the log scaling
true_centers = np.exp(log_centers)

# Display the true centers
segments = ['Segment {}'.format(i) for i in range(0, len(centers))]
true_centers = pd.DataFrame(np.round(true_centers), columns=data.keys())
true_centers.index = segments
print(true_centers)
print(data.describe())
print(samples)
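A quick way to characterize each segment is to compare its center to the overall medians (a small addition of mine, not in the original write-up):

# Compare each segment's center to the dataset medians to characterize the segments
# (a quick check added here, not part of the original write-up)
print(true_centers - data.median().round())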

 

 
