The problem:

In this project, you will analyze a dataset containing various customers’ annual spending amounts (reported in monetary units) across diverse product categories, with the aim of uncovering internal structure. One goal of this project is to best describe the variation in the different types of customers that a wholesale distributor interacts with. Doing so would equip the distributor with insight into how to best structure their delivery service to meet the needs of each customer. The dataset for this project can be found on the UCI Machine Learning Repository. For the purposes of this project, the features `'Channel'` and `'Region'` will be excluded from the analysis, with the focus instead on the six product categories recorded for customers.

Let’s set up our environment and load our dataset.

import numpy as np
import pandas as pd
import seaborn as sns
# Import supplementary visualizations code visuals.py
import visuals as vs
# Pretty display for notebooks
from matplotlib import pyplot as plt

# Load the wholesale customers dataset
try:
    data = pd.read_csv("customers.csv")
    data.drop(['Region', 'Channel'], axis = 1, inplace = True)
    print "Wholesale customers dataset has {} samples with {} features each.".format(*data.shape)
except:
    print "Dataset could not be loaded. Is the dataset missing?"

Wholesale customers dataset has 440 samples with 6 features each.

Let’s pick three customers at random and explore their data in more detail.

# Select three indices of your choice you wish to sample from the dataset
indices = [8, 17, 371]

# Create a DataFrame of the chosen samples
samples = pd.DataFrame(data.loc[indices], columns = data.keys()).reset_index(drop = True)
print "Chosen samples of wholesale customers dataset:"
print(samples)

print "Diff with the mean of the dataset"
print(samples - data.mean().round())

print "Diff with the median of the dataset"
print(samples - data.median().round())

print "Quartile Visualization"
# Seaborn heatmap of each sample's percentile rank per feature
perc = data.rank(pct=True)
perc = 100 * perc.round(decimals=3)
perc = perc.iloc[indices]
sns.heatmap(perc, vmin=1, vmax=99, annot=True)

# Bar chart comparing the samples against the dataset mean
samples_bar = samples.append(data.describe().loc['mean'])
samples_bar.index = indices + ['mean']
_ = samples_bar.plot(kind='bar', figsize=(14,6))

Customer 8 and Customer 17 seem to represent similar types of establishments, so I’m going to try a different set of indices. I chose [5, 31, 301].

Customer 5 appears to be some kind of corner shop stocking fast-moving supplies for the neighborhood. Customer 31 appears to be a smaller store in a smaller neighborhood. Customer 301 appears to be a supermarket stocking large amounts of Grocery, Milk and Detergents_Paper supplies.

One interesting thought to consider is if one (or more) of the six product categories is actually relevant for understanding customer purchasing. That is to say, is it possible to determine whether customers purchasing some amount of one category of products will necessarily purchase some proportional amount of another category of products? We can make this determination quite easily by training a supervised regression learner on a subset of the data with one feature removed, and then score how well that model can predict the removed feature.

Here’s how I will proceed:

- Assign `new_data` a copy of the data with a feature of your choice removed, using the `DataFrame.drop` function.
- Use `sklearn.cross_validation.train_test_split` to split the dataset into training and testing sets, using the removed feature as the target label. Set a `test_size` of `0.25` and set a `random_state`.
- Import a decision tree regressor, set a `random_state`, and fit the learner to the training data.
- Report the prediction score of the testing set using the regressor’s `score` function.

from sklearn import cross_validation
from sklearn.tree import DecisionTreeRegressor

# Make a copy of the DataFrame, using the 'drop' function to drop the given feature
features = data.columns
for feature in features:
    new_data = data.drop(feature, axis = 1)
    target = data[feature]

    # Split the data into training and testing sets using the given feature as the target
    X_train, X_test, y_train, y_test = cross_validation.train_test_split(
        new_data, target, test_size = 0.25, random_state = 0)

    # Create a decision tree regressor and fit it to the training set
    regressor = DecisionTreeRegressor(random_state = 0)
    regressor.fit(X_train, y_train)

    # Report the score of the prediction using the testing set
    score = regressor.score(X_test, y_test)
    print feature, score

Here’s the output:

Fresh -0.252469807688

Milk 0.365725292736

Grocery 0.602801978878

Frozen 0.253973446697

Detergents_Paper 0.728655181254

Delicatessen -11.6636871594

To get a better understanding of the dataset, we can construct a scatter matrix of each of the six product features present in the data. If you found that the feature you attempted to predict above is relevant for identifying a specific customer, then the scatter matrix below may not show any correlation between that feature and the others. Conversely, if you believe that feature is not relevant for identifying a specific customer, the scatter matrix might show a correlation between that feature and another feature in the data.

# Produce a scatter matrix for each pair of features in the data
pd.scatter_matrix(data, alpha = 0.3, figsize = (14,8), diagonal = 'kde');

Here’s a second way to visualize the relationships between the features:

# Correlations between segments
corr = data.corr()
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True

with sns.axes_style("white"):
    ax = sns.heatmap(corr, mask=mask, square=True, annot=True, cmap='RdBu')

There are three pairs with a high correlation: Grocery-Milk, Grocery-Detergents_Paper, and Milk-Detergents_Paper.

The highest correlation is between Grocery and Detergents_Paper.

The data is not normally distributed. It is skewed to the right.
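
As a quick numeric check of that claim, here is a small sketch using the `data` DataFrame loaded above: pandas reports positive skewness for right-skewed features, and the means sit well above the medians.

```python
# Positive skew() values indicate right-skewed distributions;
# a mean well above the median points the same way.
print(data.skew())
print(data.mean() - data.median())
```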

If data is not normally distributed, especially if the mean and median vary significantly (indicating a large skew), it is most often appropriate to apply a non-linear scaling, particularly for financial data. One way to achieve this scaling is by using a Box-Cox transformation, which calculates the best power transformation of the data for reducing skewness. A simpler approach, which works in most cases, is to apply the natural logarithm.
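
For illustration only, here is a minimal sketch of the Box-Cox alternative using `scipy.stats.boxcox`; it is not part of the analysis below (which sticks with the natural logarithm) and assumes the `data` DataFrame from above with strictly positive spending values.

```python
from scipy import stats

# Box-Cox finds, per feature, the power transform (lambda) that best reduces skewness.
# It requires strictly positive values, which holds for these spending amounts.
boxcox_data = pd.DataFrame(index=data.index)
for feature in data.columns:
    transformed, lmbda = stats.boxcox(data[feature])
    boxcox_data[feature] = transformed
    print("{}: lambda = {:.3f}".format(feature, lmbda))
```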

# Scale the data using the natural logarithm
log_data = np.log(data)

# Scale the sample data using the natural logarithm
log_samples = np.log(samples)

# Produce a scatter matrix for each pair of newly-transformed features
pd.scatter_matrix(log_data, alpha = 0.3, figsize = (14,8), diagonal = 'kde')

After applying a natural logarithm scaling to the data, the distribution of each feature should appear much more normal. For any pairs of features we identified earlier as being correlated, we can observe here whether that correlation is still present (and whether it is now stronger or weaker than before).
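
As a quick check, here is a small sketch comparing the strongest pair identified earlier, Grocery and Detergents_Paper, before and after the log transform (it assumes `data` and `log_data` from above):

```python
# Correlation of the most strongly correlated pair, raw vs. log-scaled.
print("Raw: {:.3f}".format(data['Grocery'].corr(data['Detergents_Paper'])))
print("Log: {:.3f}".format(log_data['Grocery'].corr(log_data['Detergents_Paper'])))
```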

# Display the log-transformed sample data
print(log_samples)

I’m going to use Tukey’s Method to identify outliers in this dataset.

# For each feature find the data points with extreme high or low values
for feature in log_data.keys():
    # Calculate Q1 (25th percentile of the data) for the given feature
    Q1 = np.percentile(log_data[feature], 25)
    # Calculate Q3 (75th percentile of the data) for the given feature
    Q3 = np.percentile(log_data[feature], 75)
    # Use the interquartile range to calculate an outlier step (1.5 times the interquartile range)
    step = 1.5 * (Q3 - Q1)

    # Display the outliers
    print "Data points considered outliers for the feature '{}':".format(feature)
    display(log_data[~((log_data[feature] >= Q1 - step) & (log_data[feature] <= Q3 + step))])

# Select the indices for data points you wish to remove
outliers = [75]

# Remove the outliers, if any were specified
good_data = log_data.drop(log_data.index[outliers]).reset_index(drop = True)
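
Only index 75 is removed above. As an aside, here is a small sketch (reusing the same Tukey computation) for tallying points flagged as outliers in more than one feature, which is one common criterion for deciding what to drop:

```python
from collections import Counter

# Count, for each data point, how many features flag it as a Tukey outlier.
outlier_counts = Counter()
for feature in log_data.keys():
    Q1 = np.percentile(log_data[feature], 25)
    Q3 = np.percentile(log_data[feature], 75)
    step = 1.5 * (Q3 - Q1)
    flagged = log_data[~((log_data[feature] >= Q1 - step) &
                         (log_data[feature] <= Q3 + step))]
    outlier_counts.update(flagged.index)

# Indices flagged by two or more features
print(sorted(idx for idx, n in outlier_counts.items() if n > 1))
```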

Now that the data has been scaled to a more normal distribution and has had any necessary outliers removed, we can apply PCA to `good_data` to discover which dimensions of the data best maximize the variance of the features involved. In addition to finding these dimensions, PCA will also report the *explained variance ratio* of each dimension, that is, how much of the variance within the data is explained by that dimension alone.

# Apply PCA to the good data with the same number of dimensions as features
from sklearn.decomposition import PCA
pca = PCA(n_components=len(good_data.columns))
pca.fit(good_data)

# Apply a PCA transformation to the sample log-data
pca_samples = pca.transform(log_samples)

# Generate PCA results plot
pca_results = vs.pca_results(good_data, pca)

When using principal component analysis, one of the main goals is to reduce the dimensionality of the data — in effect, reducing the complexity of the problem. Dimensionality reduction comes at a cost: Fewer dimensions used implies less of the total variance in the data is being explained. Because of this, the *cumulative explained variance ratio* is extremely important for knowing how many dimensions are necessary for the problem. Additionally, if a significant amount of variance is explained by only two or three dimensions, the reduced data can be visualized afterwards.
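
As a quick look at that trade-off, here is a sketch using the full six-component `pca` fitted above; the cumulative explained variance ratio can be read directly from the fitted model.

```python
# Cumulative explained variance of the full 6-dimensional PCA fit above;
# the second entry shows how much variance a 2-D reduction would retain.
print(np.cumsum(pca.explained_variance_ratio_))
```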

# Fit PCA to the good data using only two dimensions
pca = PCA(n_components=2)
pca.fit(good_data)

# Apply a PCA transformation to the good data
reduced_data = pca.transform(good_data)

# Apply a PCA transformation to the sample log-data
pca_samples = pca.transform(log_samples)

# Create a DataFrame for the reduced data
reduced_data = pd.DataFrame(reduced_data, columns = ['Dimension 1', 'Dimension 2'])

A biplot can help us interpret the reduced dimensions of the data, and discover relationships between the principal components and original features.

# Create a biplot
vs.biplot(good_data, reduced_data, pca)

From the biplot, Detergents_Paper, Grocery and Milk are most strongly associated with Dimension 1, while Fresh, Frozen and Delicatessen are most strongly associated with Dimension 2.

We can choose between two algorithms for our customer cluster model: **K-Means clustering and Gaussian Mixture Model clustering**.

I will start with K-Means, as I don’t have a complete understanding of the dataset and K-Means is usually a good first algorithm to try for clustering.

# Apply your clustering algorithm of choice to the reduced data
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

clusters = range(2, 11)
best = (0, 0.0)

for each in clusters:
    clusterer = KMeans(n_clusters=each, random_state=0).fit(reduced_data)

    # Predict the cluster for each data point
    preds = clusterer.predict(reduced_data)

    # Find the cluster centers
    centers = clusterer.cluster_centers_

    # Predict the cluster for each transformed sample data point
    sample_preds = clusterer.predict(pca_samples)

    # Calculate the mean silhouette coefficient for the number of clusters chosen
    score = silhouette_score(reduced_data, preds)
    print "Clusters:", each, "score:", score

    if score > best[1]:
        best = (each, score)

# Refit using the best-scoring number of clusters
clusterer = KMeans(n_clusters=best[0], random_state=0).fit(reduced_data)

# Predict the cluster for each data point
preds = clusterer.predict(reduced_data)

# Find the cluster centers
centers = clusterer.cluster_centers_

# Predict the cluster for each transformed sample data point
sample_preds = clusterer.predict(pca_samples)

# Calculate the mean silhouette coefficient for the number of clusters chosen
score = silhouette_score(reduced_data, preds)

print "The best n of Clusters:", best[0], "\nScore:", best[1]

Clusters: 2 score: 0.420795773671

Clusters: 3 score: 0.396034911432

Clusters: 4 score: 0.331704488262

Clusters: 5 score: 0.349383709753

Clusters: 6 score: 0.361735087656

Clusters: 7 score: 0.363059697196

Clusters: 8 score: 0.360593881403

Clusters: 9 score: 0.354722206188

Clusters: 10 score: 0.349422838857

The best n of Clusters: 2

Score: 0.420795773671
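
For comparison, the Gaussian Mixture Model alternative can be scored the same way. Here is a minimal sketch using the modern `sklearn.mixture.GaussianMixture` API; it assumes `reduced_data` from above and was not used for the final segmentation.

```python
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score

# Score a Gaussian Mixture Model over the same range of component counts.
for n in range(2, 11):
    gmm = GaussianMixture(n_components=n, random_state=0).fit(reduced_data)
    gmm_preds = gmm.predict(reduced_data)
    print("Components: {} score: {:.4f}".format(
        n, silhouette_score(reduced_data, gmm_preds)))
```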

Each cluster has a central point. These centers (or means) are not specifically data points from the data, but *averages* of all the data points predicted in the respective clusters. For the problem of creating customer segments, a cluster’s center point corresponds to *the average customer of that segment*. Since the data is currently reduced in dimension and scaled by a logarithm, we can recover the representative customer spending from these data points by applying the inverse transformations.

# Inverse transform the centers
log_centers = pca.inverse_transform(centers)

# Exponentiate the centers
true_centers = np.exp(log_centers)

# Display the true centers
segments = ['Segment {}'.format(i) for i in range(0, len(centers))]
true_centers = pd.DataFrame(np.round(true_centers), columns = data.keys())
true_centers.index = segments
print(true_centers)
print(data.describe())
print samples