Survey of Consumer Finances: Unsupervised Clustering

Part 3: Clustering with Multiple Features

In [1]:
__author__ = "Donald Ghazi"
__email__ = "donald@donaldghazi.com"
__website__ = "donaldghazi.com"

In the Clustering with Two Features project, I built a K-Means model to create clusters of respondents to the Survey of Consumer Finances. I made my clusters by looking at two features only, but there are hundreds of features in the dataset that I didn't take into account and that could contain valuable information. In this project, I'll examine all the features, selecting five to create clusters with. After I build my model and choose an appropriate number of clusters, I'll visualize multi-dimensional clusters in a 2D scatter plot using principal component analysis (PCA).

In [2]:
import pandas as pd
import plotly.express as px

from scipy.stats.mstats import trimmed_var
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

GOALS

  • Determine highest variance features in dataset.
  • Build unsupervised model to divide households into groups.
  • Examine mean characteristics of each group.
  • Visualize clusters using principal component analysis (PCA).

Machine Learning Workflow

  • Prepare Data
    • Import
    • Explore: Variance and trimmed variance
    • Split: No target vector
  • Build Model
    • Iterate
      • Standard scaler, k-means model, number of clusters.
        • Standardization
  • Communicate Results
    • Side-by-side bar chart
    • PCA scatter plot
      • Dimensionality reduction

Prepare Data

Import

I spent some time in the Clustering with Two Features project zooming in on a useful subset of the SCF, and this time, I'm going to zoom in even further. One of the persistent issues I've had with this dataset is that it includes some outliers in the form of ultra-wealthy households. This didn't make much of a difference for my last analysis, but it could pose a problem in this project, so I'm going to focus on families with net worth under \$2 million.

Wrangle Data

I'll rewrite my wrangle function from the Clustering with Two Features project so that it returns a DataFrame of households whose net worth is less than $2 million and that have been turned down for credit or feared being denied credit in the past 5 years ("TURNFEAR").

In [3]:
def wrangle(filepath):
    # Read file into DataFrame
    df = pd.read_csv(filepath)
    # Keep households that were turned down for credit or feared being denied ("TURNFEAR" == 1)
    # and whose net worth is under $2 million
    mask = (df["TURNFEAR"] == 1) & (df["NETWORTH"] < 2e6)
    df = df[mask]
    return df
In [4]:
df = wrangle("data/SCFP2019.csv.gz")
print(df.shape)
df.head()
(4418, 351)
Out[4]:
YY1 Y1 WGT HHSEX AGE AGECL EDUC EDCL MARRIED KIDS ... NWCAT INCCAT ASSETCAT NINCCAT NINC2CAT NWPCTLECAT INCPCTLECAT NINCPCTLECAT INCQRTCAT NINCQRTCAT
5 2 21 3790.476607 1 50 3 8 2 1 3 ... 1 2 1 2 1 1 4 4 2 2
6 2 22 3798.868505 1 50 3 8 2 1 3 ... 1 2 1 2 1 1 4 3 2 2
7 2 23 3799.468393 1 50 3 8 2 1 3 ... 1 2 1 2 1 1 4 4 2 2
8 2 24 3788.076005 1 50 3 8 2 1 3 ... 1 2 1 2 1 1 4 4 2 2
9 2 25 3793.066589 1 50 3 8 2 1 3 ... 1 2 1 2 1 1 4 4 2 2

5 rows × 351 columns

Explore

In this project, I want to make clusters using more than two features, but which of the 351 features should I choose? I'll narrow the field by determining which numerical features have the largest variance.

Calculate Variance
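
To make the next few cells easier to follow: the sample variance of a feature $x$ with $n$ observations is

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2$$

which is exactly what I compute step by step for "DEBT" below, before checking the result against the pandas .var() method (which uses the same $n-1$ denominator by default).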

In [5]:
x = df["DEBT"]
x.head()
Out[5]:
5    12200.0
6    12600.0
7    15300.0
8    14100.0
9    15400.0
Name: DEBT, dtype: float64
In [6]:
x.var()
Out[6]:
18482520920.408085
In [7]:
x.mean()
Out[7]:
72701.25848800362
In [8]:
(x - x.mean()).head()
Out[8]:
5   -60501.258488
6   -60101.258488
7   -57401.258488
8   -58601.258488
9   -57301.258488
Name: DEBT, dtype: float64
In [9]:
((x - x.mean())**2).head()
Out[9]:
5    3.660402e+09
6    3.612161e+09
7    3.294904e+09
8    3.434107e+09
9    3.283434e+09
Name: DEBT, dtype: float64
In [10]:
sum((x - x.mean())**2)
Out[10]:
81637294905442.52
In [11]:
sum((x - x.mean())**2) / (len(x) - 1)
Out[11]:
18482520920.408085
In [12]:
df.var()
Out[12]:
YY1             2.876461e+06
Y1              2.876460e+08
WGT             4.808436e+06
HHSEX           2.192779e-01
AGE             2.154642e+02
                    ...     
NWPCTLECAT      5.178676e+00
INCPCTLECAT     6.070208e+00
NINCPCTLECAT    6.136292e+00
INCQRTCAT       9.512230e-01
NINCQRTCAT      9.624869e-01
Length: 351, dtype: float64
In [13]:
df.var().sort_values()
Out[13]:
PAYILN7     0.000000e+00
PAYLCO      0.000000e+00
PAYEDU7     0.000000e+00
PAYILN5     0.000000e+00
PAYILN6     0.000000e+00
                ...     
NHNFIN      2.254163e+10
HOUSES      2.388459e+10
NETWORTH    4.847029e+10
NFIN        5.713939e+10
ASSET       8.303967e+10
Length: 351, dtype: float64
In [14]:
df.var().sort_values().tail(10)
Out[14]:
PLOAN1      1.140894e+10
ACTBUS      1.251892e+10
BUS         1.256643e+10
KGTOTAL     1.346475e+10
DEBT        1.848252e+10
NHNFIN      2.254163e+10
HOUSES      2.388459e+10
NETWORTH    4.847029e+10
NFIN        5.713939e+10
ASSET       8.303967e+10
dtype: float64
In [15]:
# Calculate the variance for all the features in df, and create a Series top_ten_var with the 10 features with the largest variance
top_ten_var = df.var().sort_values().tail(10)
top_ten_var
Out[15]:
PLOAN1      1.140894e+10
ACTBUS      1.251892e+10
BUS         1.256643e+10
KGTOTAL     1.346475e+10
DEBT        1.848252e+10
NHNFIN      2.254163e+10
HOUSES      2.388459e+10
NETWORTH    4.847029e+10
NFIN        5.713939e+10
ASSET       8.303967e+10
dtype: float64

It's harder to make sense of a list like this than it would be if I visualized it, so I'll make a graph.

Plot Variance

In [16]:
# Use plotly express to create a horizontal bar chart of "top_ten_var"
fig = px.bar(
    x=top_ten_var,
    y=top_ten_var.index,
    title="SCF: High Variance Features" # Title "SCF: High Variance Features"
)
fig.update_layout(xaxis_title="Variance", yaxis_title="Feature")  # Label x-axis "Variance" & y-axis "Feature"
fig.show()

One thing that I've seen throughout this project is that many of the wealth indicators are highly skewed, with a few outlier households having enormous wealth. Those outliers can inflate my measure of variance, so I'll check whether that's the case with one of the features from top_ten_var.

Plot Distribution of NHNFIN

In [17]:
# Use plotly express to create a horizontal boxplot of "NHNFIN" to determine if the values are skewed
fig = px.box(
    data_frame=df,
    x="NHNFIN", 
    title="Distribution of Non-home, Non-Financial Assets", # Title "Distribution of Non-home, Non-Financial Assets"
)
fig.update_layout(xaxis_title="Value [$]")  # Label the x-axis "Value [$]"
fig.show()

The distribution is massively right-skewed because of the huge outliers on the right side. Even though I already excluded households with a high net worth in my wrangle function, the variance is still being distorted by some extreme values.

The best way to deal with this is to look at the trimmed variance, where I remove extreme values before calculating the variance. I can do this using the trimmed_var function from the SciPy library.

Calculate Trimmed Variance

I now want to calculate the trimmed variance for the features in df, excluding the top and bottom 10% of observations from the calculation. Then I'll create a Series top_ten_trim_var with the 10 features with the largest trimmed variance.
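
To build some intuition for what trimming does before relying on the real function, here's a rough sketch: sort the values, cut roughly 10% of the observations from each tail, and take the variance of what's left. The helper name rough_trimmed_var is mine, and the sketch ignores trimmed_var's exact rounding rules for how many observations get masked, so the two printed numbers may differ slightly.

import numpy as np

def rough_trimmed_var(values, prop=0.1):
    # Sort, drop roughly `prop` of the observations from each tail, then take the variance
    values = np.sort(np.asarray(values))
    k = int(round(len(values) * prop))
    trimmed = values[k: len(values) - k]
    return trimmed.var(ddof=0)  # ddof=0 matches trimmed_var's default

print(rough_trimmed_var(df["DEBT"]))
print(trimmed_var(df["DEBT"], limits=(0.1, 0.1)))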

In [18]:
df["DEBT"].var()
Out[18]:
18482520920.408085
In [19]:
trimmed_var?
Signature:
trimmed_var(
    a,
    limits=(0.1, 0.1),
    inclusive=(1, 1),
    relative=True,
    axis=None,
    ddof=0,
)
Docstring:
Returns the trimmed variance of the data along the given axis.


Parameters
----------
a : sequence
    Input array
limits : {None, tuple}, optional
    If `relative` is False, tuple (lower limit, upper limit) in absolute values.
    Values of the input array lower (greater) than the lower (upper) limit are
    masked.

    If `relative` is True, tuple (lower percentage, upper percentage) to cut
    on each side of the  array, with respect to the number of unmasked data.

    Noting n the number of unmasked data before trimming, the (n*limits[0])th
    smallest data and the (n*limits[1])th largest data are masked, and the
    total number of unmasked data after trimming is n*(1.-sum(limits))
    In each case, the value of one limit can be set to None to indicate an
    open interval.

    If limits is None, no trimming is performed
inclusive : {(bool, bool) tuple}, optional
    If `relative` is False, tuple indicating whether values exactly equal
    to the absolute limits are allowed.
    If `relative` is True, tuple indicating whether the number of data
    being masked on each side should be rounded (True) or truncated
    (False).
relative : bool, optional
    Whether to consider the limits as absolute values (False) or proportions
    to cut (True).
axis : int, optional
    Axis along which to trim.

ddof : {0,integer}, optional
    Means Delta Degrees of Freedom. The denominator used during computations
    is (n-ddof). DDOF=0 corresponds to a biased estimate, DDOF=1 to an un-
    biased estimate of the variance.
File:      /opt/conda/lib/python3.9/site-packages/scipy/stats/mstats_basic.py
Type:      function
In [20]:
trimmed_var(df["DEBT"])
Out[20]:
3089864647.655702
In [21]:
df.var()
Out[21]:
YY1             2.876461e+06
Y1              2.876460e+08
WGT             4.808436e+06
HHSEX           2.192779e-01
AGE             2.154642e+02
                    ...     
NWPCTLECAT      5.178676e+00
INCPCTLECAT     6.070208e+00
NINCPCTLECAT    6.136292e+00
INCQRTCAT       9.512230e-01
NINCQRTCAT      9.624869e-01
Length: 351, dtype: float64
In [22]:
df.apply(trimmed_var)
Out[22]:
YY1             1.850508e+06
Y1              1.850507e+08
WGT             1.412290e+06
HHSEX           2.019627e-01
AGE             1.139698e+02
                    ...     
NWPCTLECAT      2.580682e+00
INCPCTLECAT     3.429553e+00
NINCPCTLECAT    3.523264e+00
INCQRTCAT       5.850373e-01
NINCQRTCAT      6.002715e-01
Length: 351, dtype: float64
In [23]:
df.apply(trimmed_var, limits=(0.1, 0.1))
Out[23]:
YY1             1.850508e+06
Y1              1.850507e+08
WGT             1.412290e+06
HHSEX           2.019627e-01
AGE             1.139698e+02
                    ...     
NWPCTLECAT      2.580682e+00
INCPCTLECAT     3.429553e+00
NINCPCTLECAT    3.523264e+00
INCQRTCAT       5.850373e-01
NINCQRTCAT      6.002715e-01
Length: 351, dtype: float64
In [24]:
df.apply(trimmed_var, limits=(0.1, 0.1)).sort_values()
Out[24]:
HOTHMA      0.000000e+00
NOTXBND     0.000000e+00
MORTBND     0.000000e+00
GOVTBND     0.000000e+00
OBND        0.000000e+00
                ...     
DEBT        3.089865e+09
NETWORTH    3.099929e+09
HOUSES      4.978660e+09
NFIN        8.456442e+09
ASSET       1.175370e+10
Length: 351, dtype: float64
In [25]:
# Calculate trimmed variance
df.apply(trimmed_var, limits=(0.1, 0.1)).sort_values().tail(10)
Out[25]:
WAGEINC     5.550737e+08
HOMEEQ      7.338377e+08
NH_MORT     1.333125e+09
MRTHEL      1.380468e+09
PLOAN1      1.441968e+09
DEBT        3.089865e+09
NETWORTH    3.099929e+09
HOUSES      4.978660e+09
NFIN        8.456442e+09
ASSET       1.175370e+10
dtype: float64
In [26]:
# Create a Series top_ten_trim_var with the 10 features with the largest trimmed variance
top_ten_trim_var = df.apply(trimmed_var, limits=(0.1, 0.1)).sort_values().tail(10)
top_ten_trim_var
Out[26]:
WAGEINC     5.550737e+08
HOMEEQ      7.338377e+08
NH_MORT     1.333125e+09
MRTHEL      1.380468e+09
PLOAN1      1.441968e+09
DEBT        3.089865e+09
NETWORTH    3.099929e+09
HOUSES      4.978660e+09
NFIN        8.456442e+09
ASSET       1.175370e+10
dtype: float64

Now that I've got a better set of numbers, I'll make another bar graph.

Plot Trimmed Variance

In [27]:
# Use plotly express to create a horizontal bar chart of top_ten_trim_var
fig = px.bar(
    x=top_ten_trim_var,
    y=top_ten_trim_var.index,
    title="SCF: High Variance Features" # Title "SCF: High Variance Features"
)
fig.update_layout(xaxis_title="Trimmed Variance", yaxis_title="Feature") # Label x-axis "Trimmed Variance" & y-axis "Feature"
fig.show()

There are three things to notice in this plot. First, the variances have decreased a lot. In my previous chart, the x-axis went up to about \$80 billion; this one only goes up to about \$12 billion. Second, the top 10 features have changed a bit. All the features relating to business ownership ("...BUS") are gone. Finally, I can see that there are big differences in variance from feature to feature. For example, the trimmed variance for "WAGEINC" is around \$500 million, while the trimmed variance for "ASSET" is nearly \$12 billion. In other words, these features have completely different scales, which is something I'll need to address before I can make good clusters.

Extract Feature Names

In [28]:
top_ten_trim_var
Out[28]:
WAGEINC     5.550737e+08
HOMEEQ      7.338377e+08
NH_MORT     1.333125e+09
MRTHEL      1.380468e+09
PLOAN1      1.441968e+09
DEBT        3.089865e+09
NETWORTH    3.099929e+09
HOUSES      4.978660e+09
NFIN        8.456442e+09
ASSET       1.175370e+10
dtype: float64
In [29]:
top_ten_trim_var.tail(5)
Out[29]:
DEBT        3.089865e+09
NETWORTH    3.099929e+09
HOUSES      4.978660e+09
NFIN        8.456442e+09
ASSET       1.175370e+10
dtype: float64
In [30]:
top_ten_trim_var.tail(5).index
Out[30]:
Index(['DEBT', 'NETWORTH', 'HOUSES', 'NFIN', 'ASSET'], dtype='object')
In [31]:
top_ten_trim_var.tail(5).index.to_list()
Out[31]:
['DEBT', 'NETWORTH', 'HOUSES', 'NFIN', 'ASSET']
In [32]:
# Generate a list high_var_cols with the column names of the five features with the highest trimmed variance
high_var_cols = top_ten_trim_var.tail(5).index.to_list()
high_var_cols
Out[32]:
['DEBT', 'NETWORTH', 'HOUSES', 'NFIN', 'ASSET']

Split

Now that I've gotten my data to a place where I can use it, I can follow the steps I've used before to build a model, starting with a feature matrix.

Vertical Split

In [33]:
# Create the feature matrix X
X = df[high_var_cols]
print("X shape:", X.shape)
X.head() #  It should contain the five columns in high_var_cols
X shape: (4418, 5)
Out[33]:
DEBT NETWORTH HOUSES NFIN ASSET
5 12200.0 -6710.0 0.0 3900.0 5490.0
6 12600.0 -4710.0 0.0 6300.0 7890.0
7 15300.0 -8115.0 0.0 5600.0 7185.0
8 14100.0 -2510.0 0.0 10000.0 11590.0
9 15400.0 -5715.0 0.0 8100.0 9685.0

Build Model

Iterate

During my EDA, I saw that my features are on very different scales. That can make it harder to cluster the data, so I'll need to address it before building my model. One strategy I can use is standardization, a statistical method for putting all the variables in a dataset on the same scale.
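
Concretely, standardization replaces each value with its z-score:

$$z = \frac{x - \mu}{\sigma}$$

where $\mu$ and $\sigma$ are the feature's mean and standard deviation, so every transformed feature ends up with a mean of 0 and a standard deviation of 1. One small detail worth knowing: StandardScaler divides by the population standard deviation (ddof=0), while the pandas .std() method defaults to the sample standard deviation (ddof=1), which is why the values I scale by hand below differ from StandardScaler's output in the later decimal places.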

Standardization: Aggregating Mean and STD

In [34]:
X["DEBT"].mean()
Out[34]:
72701.25848800362
In [35]:
X.mean()
Out[35]:
DEBT         72701.258488
NETWORTH     76387.768900
HOUSES       74530.805794
NFIN        117330.637166
ASSET       149089.027388
dtype: float64
In [36]:
X.aggregate(["mean", "std"])
Out[36]:
DEBT NETWORTH HOUSES NFIN ASSET
mean 72701.258488 76387.768900 74530.805794 117330.637166 149089.027388
std 135950.435529 220159.684405 154546.415791 239038.471726 288166.040553
In [37]:
X.aggregate(["mean", "std"]).astype(int)
Out[37]:
DEBT NETWORTH HOUSES NFIN ASSET
mean 72701 76387 74530 117330 149089
std 135950 220159 154546 239038 288166
In [38]:
# Create a DataFrame X_summary with the mean and standard deviation for all the features in X
X_summary = X.aggregate(["mean", "std"]).astype(int)
X_summary
Out[38]:
DEBT NETWORTH HOUSES NFIN ASSET
mean 72701 76387 74530 117330 149089
std 135950 220159 154546 239038 288166

That's the information I need to standardize my data.

Standardization: Standard Scaler

I can create a StandardScaler transformer, use it to transform the data in X, and then put the transformed data into a DataFrame named X_scaled.

In [43]:
x = X["DEBT"]
x.head()
Out[43]:
5    12200.0
6    12600.0
7    15300.0
8    14100.0
9    15400.0
Name: DEBT, dtype: float64
In [44]:
x = X["DEBT"]
print("mean", round(x.mean()))
print("std", round(x.std()))
mean 72701
std 135950
In [45]:
(x - x.mean()).head()
Out[45]:
5   -60501.258488
6   -60101.258488
7   -57401.258488
8   -58601.258488
9   -57301.258488
Name: DEBT, dtype: float64
In [46]:
(x - x.mean()) / x.std()
Out[46]:
5       -0.445024
6       -0.442082
7       -0.422222
8       -0.431049
9       -0.421486
           ...   
28865   -0.458265
28866   -0.458265
28867   -0.458265
28868   -0.458265
28869   -0.458265
Name: DEBT, Length: 4418, dtype: float64
In [47]:
x_scaled = (x - x.mean()) / x.std()
x_scaled.head()
Out[47]:
5   -0.445024
6   -0.442082
7   -0.422222
8   -0.431049
9   -0.421486
Name: DEBT, dtype: float64
In [48]:
x_scaled = (x - x.mean()) / x.std()
print("mean", round(x_scaled.mean()))
print("std", round(x_scaled.std()))
mean 0
std 1
In [49]:
# Instantiate transformer
ss = StandardScaler()

# Transform `X`
X_scaled_data = ss.fit_transform(X)
In [50]:
type(X_scaled_data)
Out[50]:
numpy.ndarray
In [51]:
# Instantiate transformer
ss = StandardScaler()

# Transform `X`
X_scaled_data = ss.fit_transform(X)

# Put `X_scaled_data` into DataFrame
X_scaled = pd.DataFrame(X_scaled_data, columns=X.columns)

print("X_scaled shape:", X_scaled.shape)
X_scaled.head()
X_scaled shape: (4418, 5)
Out[51]:
DEBT NETWORTH HOUSES NFIN ASSET
0 -0.445075 -0.377486 -0.48231 -0.474583 -0.498377
1 -0.442132 -0.368401 -0.48231 -0.464541 -0.490047
2 -0.422270 -0.383868 -0.48231 -0.467470 -0.492494
3 -0.431097 -0.358407 -0.48231 -0.449061 -0.477206
4 -0.421534 -0.372966 -0.48231 -0.457010 -0.483818

All five of the features use the same scale now. But just to make sure, I'll take a look at their mean and standard deviation.

Standardization: Check Mean and STD

In [52]:
# Create a DataFrame X_scaled_summary with the mean and standard deviation for all the features in X_scaled
X_scaled_summary = X_scaled.aggregate(["mean", "std"]).astype(int)
X_scaled_summary
Out[52]:
DEBT NETWORTH HOUSES NFIN ASSET
mean 0 0 0 0 0
std 1 1 1 1 1

And that's what it should look like. Standardization takes all the features and scales them so that they all have a mean of 0 and a standard deviation of 1.

Now that all my features are on the same scale, I can start making clusters. First, though, I need to figure out how many clusters I should have.

Build and Tune Model

Here I'll use a for loop to build and train a K-Means model for each value of n_clusters from 2 to 12 (inclusive). Each model will include a StandardScaler. Every time a model is trained, I'll calculate its inertia and add it to the list inertia_errors, then calculate its silhouette score and add it to the list silhouette_scores.
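
As a quick reminder of what these two metrics measure: inertia is the sum of squared distances from each point to its assigned cluster center (lower is better for a fixed number of clusters), and the silhouette coefficient for a single point $i$ is

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$

where $a(i)$ is the mean distance from point $i$ to the other points in its own cluster and $b(i)$ is the mean distance from point $i$ to the points in the nearest other cluster. The silhouette_score function averages $s(i)$ over all points, so values closer to 1 indicate tighter, better-separated clusters.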

In [53]:
n_clusters = range(2, 13)
inertia_errors = []
silhouette_scores = []

# Add `for` loop to train model and calculate inertia, silhouette score.
for k in n_clusters:
    # Build model 
    model = make_pipeline(StandardScaler(), KMeans(n_clusters=k, random_state=42))
    # Train model 
    model.fit(X)
    # Calculate inertia
    inertia_errors.append(model.named_steps["kmeans"].inertia_)
    # Calculate silhouette score
    silhouette_scores.append(
        silhouette_score(X, model.named_steps["kmeans"].labels_)
    )

print("Inertia:", inertia_errors[:3])
print()
print("Silhouette Scores:", silhouette_scores[:3])
Inertia: [11028.058082607145, 7190.526303575355, 5924.997726868041]

Silhouette Scores: [0.7464502937083215, 0.7044601307791996, 0.6962653079183132]

I'll now create an elbow plot to see how many clusters I should use.

Plot Inertia vs Clusters

In [54]:
# Use plotly express to create a line plot that shows the values of inertia_errors as a function of n_clusters
fig = px.line(
    x=n_clusters, y=inertia_errors, title="K-Means Model: Inertia vs Number of Clusters"
)
fig.update_layout(xaxis_title="Number of Clusters (k)", yaxis_title="Inertia")  # Label x-axis "Number of Clusters", y-axis "Inertia", and title "K-Means Model: Inertia vs Number of Clusters"
fig.show()

I can see that the line starts to flatten out around 4 or 5 clusters.

I'll make another line plot based on the silhouette scores.

Plot Silhouette Score vs Clusters

In [55]:
# Use plotly express to create a line plot that shows the values of silhouette_scores as a function of n_clusters
fig = px.line(
    x=n_clusters,
    y=silhouette_scores,
    title="K-Means Model: Silhouette Score vs Number of Clusters"
)
fig.update_layout(
    xaxis_title="Number of Clusters", yaxis_title="Silhouette Score") # Label x-axis "Number of Clusters", y-axis "Silhouette Score", and title "K-Means Model: Silhouette Score vs Number of Clusters"
fig.show()

This one's a little less straightforward, but I can see that the silhouette scores are highest when there are only a few clusters, and they're still reasonably high at 3 or 4 clusters.

Putting the information from this plot together with my inertia plot, it seems like the best setting for n_clusters will be 4.

Build Final Model

In [56]:
# Build and train a new k-means model named final_model
final_model = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=4, random_state=42)
)
final_model.fit(X)
Out[56]:
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('kmeans', KMeans(n_clusters=4, random_state=42))])

Communicate

Extract Labels

In [57]:
# Extract the labels that my final_model created during training and assign them to the variable labels
labels = final_model.named_steps["kmeans"].labels_
print(labels[:5])
[0 0 0 0 0]

I'm going to make a visualization, so I need to create a new DataFrame to work with.

Side-by-Side Bar Chart: Get Centroids

In [58]:
# Create a DataFrame xgb that contains the mean values of the features in X for each of the clusters in my final_model
xgb = X.groupby(labels).mean()
xgb
Out[58]:
DEBT NETWORTH HOUSES NFIN ASSET
0 26551.075439 13676.153182 13745.637777 2.722605e+04 4.022723e+04
1 218112.818182 174713.441558 257403.246753 3.305884e+05 3.928263e+05
2 116160.779817 965764.155963 264339.449541 7.800611e+05 1.081925e+06
3 732937.575758 760397.575758 826136.363636 1.276227e+06 1.493335e+06
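
These group means act as centroids expressed in the original dollar units, which makes them easy to read. The fitted k-means model also stores its own centroids, but those live in the standardized space; if I wanted to peek at them anyway, something like this would do it (just a sanity check, not something I need for the chart below):

# Centroids in standardized units (mean 0, std 1 per feature)
centroids_scaled = pd.DataFrame(
    final_model.named_steps["kmeans"].cluster_centers_,
    columns=X.columns
)
print(centroids_scaled)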

Now that I have a DataFrame, I'll make a bar chart and see how my clusters differ.

Side-by-Side Bar Chart: Build Chart

In [59]:
# Use plotly express to create a side-by-side bar chart from xgb that shows the mean of the features in X for each of the clusters in my final_model
fig = px.bar(
    xgb,
    barmode="group",
    title="Mean Household Finances by Cluster" # Title "Mean Household Finances by Cluster"
)

fig.update_layout(xaxis_title="Cluster", yaxis_title="Value [$]") # Label x-axis "Cluster" & the y-axis "Value [$]"
fig.show()

My clusters are based partially on NETWORTH, which means that the households in cluster 0 have the smallest net worth and the households in cluster 2 have the highest. Based on that, there are some interesting things to unpack here.

First, I'll take a look at the DEBT variable. I might expect debt to scale up as net worth increases, but it doesn't. Cluster 2, the group with the highest net worth, carries less debt than clusters 1 and 3, even though the value of its houses (shown in green) is roughly the same as cluster 1's. I can't really tell from this data what's going on, but one possibility is that the people in cluster 2 have enough money to pay down their debts without needing to leverage what they have into additional borrowing. The people in cluster 3, by contrast, carry the most debt, perhaps because their high net worth makes that debt easier to manage.

Finally, since I started out this project looking at home values, I'll take a look at the relationship between DEBT and HOUSES. For the households in cluster 0, the average debt is higher than the average value of their houses, which suggests that much of their debt isn't mortgage debt at all, and that many of these households may not own a home. Contrast that with the other three clusters, where the value of the homes exceeds the debt, which is consistent with mortgages making up the bulk of what those households owe.

This is different from what I did last time. At this point in the Clustering with Two Features project, I made a scatter plot. That was straightforward because I was only working with two features, so I could plot the data points in two dimensions. But now X has five dimensions.

Since I'm working with a computer screen, I don't have much of a choice about the number of dimensions I can use: it has to be two. So if I'm going to make anything like the scatter plot I made before, I'll need to reduce my five-dimensional data to something I can look at in two dimensions.

Principal Component Analysis

I'll need to create a PCA transformer, use it to reduce the dimensionality of the data in X to 2, and then put the transformed data into a DataFrame named X_pca. The columns of X_pca should be named "PC1" and "PC2".

In [62]:
# Instantiate transformer
pca = PCA(n_components=2, random_state=42)

# Transform `X`
X_t = pca.fit_transform(X)
In [63]:
X_t
Out[63]:
array([[-221525.42453041,  -22052.27300297],
       [-217775.10072188,  -22851.35806823],
       [-219519.64217476,  -19023.64633269],
       ...,
       [ 323306.39571853, -182741.08760824],
       [ 327948.76711284, -184322.85705512],
       [ 334191.95622931, -186450.06424231]])
In [64]:
# Instantiate transformer
pca = PCA(n_components=2, random_state=42)

# Transform `X`
X_t = pca.fit_transform(X)

# Put `X_t` into DataFrame
X_pca = pd.DataFrame(X_t, columns=["PC1", "PC2"] )

print("X_pca shape:", X_pca.shape)
X_pca.head()
X_pca shape: (4418, 2)
Out[64]:
PC1 PC2
0 -221525.424530 -22052.273003
1 -217775.100722 -22851.358068
2 -219519.642175 -19023.646333
3 -212195.720367 -22957.107039
4 -215540.507551 -20259.749306

My five dimensions have been reduced to two. I'll make a scatter plot and see what I get.
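
Before making the plot, it's worth checking how much of the original variance these two components actually keep. The explained_variance_ratio_ attribute of the fitted PCA object reports the fraction of the total variance captured by each component:

# Fraction of the total variance captured by PC1 and PC2
print(pca.explained_variance_ratio_)
print("Total:", pca.explained_variance_ratio_.sum())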

PCA Scatter Plot

I'll use plotly express to create a scatter plot of X_pca, coloring the data points with the labels generated by my final_model.

In [68]:
labels
Out[68]:
array([0, 0, 0, ..., 1, 1, 1], dtype=int32)
In [69]:
labels.astype(str)
Out[69]:
array(['0', '0', '0', ..., '1', '1', '1'], dtype='<U11')
In [70]:
# Create scatter plot of `PC2` vs `PC1`
fig = px.scatter(
    data_frame=X_pca,
    x="PC1",
    y="PC2",
    color=labels.astype(str),  # cast to strings so the clusters get discrete colors
    title="PCA Representation of Clusters" # Title "PCA Representation of Clusters"
)
fig.update_layout(xaxis_title="PC1", yaxis_title="PC2") # Label the x-axis "PC1" & the y-axis "PC2"
fig.show()

One limitation of this plot is that it's hard to explain what the axes represent. Each one is a linear combination of the five features I originally had in X, which makes the plot fairly abstract.
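
Still, if I want a rough sense of what each axis is made of, I can inspect the loadings the fitted PCA assigned to the original features; each row below shows how strongly PC1 and PC2 weight the five columns of X:

# How much each original feature contributes to PC1 and PC2
loadings = pd.DataFrame(
    pca.components_,
    columns=X.columns,
    index=["PC1", "PC2"]
)
print(loadings)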

So what does this graph mean? It means that my model found four tightly grouped clusters, and the households within each cluster share similar financial characteristics.