Survey of Consumer Finances: Unsupervised Clustering

Part 1: Exploring the Data

In [1]:
__author__ = "Donald Ghazi"
__email__ = "donald@donaldghazi.com"
__website__ = "donaldghazi.com"

In this project, I'm going to work with data from the Survey of Consumer Finances (SCF). The SCF is a survey sponsored by the US Federal Reserve. It tracks financial, demographic, and opinion information about families in the United States. The survey is conducted every three years, and I'll work with an extract of the results from 2019.

GOALS

  • Subset my data to focus on "credit fearful" households.

  • Explore demographic information like age, race, education, income, and assets.

  • Create side-by-side bar charts to highlight differences between "credit fearful" and non-credit fearful households.

OUTLINE

  • Prepare Data
    • Import
    • Explore: Age, race, income, and assets
      • Method chaining

In [2]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

Prepare Data

Import

First, I need to load the data, which is stored in a compressed CSV file: SCFP2019.csv.gz. In the last project, I decompressed files using gzip and the command line, but pandas' read_csv function can read compressed files directly.
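
As a quick aside, here's a sketch of mine (not a cell from the original notebook): read_csv infers the compression from the ".gz" extension, so nothing special is needed, but the compression can also be passed explicitly.

import pandas as pd

# Equivalent to the import below: compression is inferred from the file
# extension by default, but it can be set explicitly.
df = pd.read_csv("data/SCFP2019.csv.gz", compression="gzip")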

Import Data

In [3]:
# Read the file "data/SCFP2019.csv.gz" into the DataFrame df
df = pd.read_csv("data/SCFP2019.csv.gz")
print("df shape:", df.shape)
df.head()
df shape: (28885, 351)
Out[3]:
YY1 Y1 WGT HHSEX AGE AGECL EDUC EDCL MARRIED KIDS ... NWCAT INCCAT ASSETCAT NINCCAT NINC2CAT NWPCTLECAT INCPCTLECAT NINCPCTLECAT INCQRTCAT NINCQRTCAT
0 1 11 6119.779308 2 75 6 12 4 2 0 ... 5 3 6 3 2 10 6 6 3 3
1 1 12 4712.374912 2 75 6 12 4 2 0 ... 5 3 6 3 1 10 5 5 2 2
2 1 13 5145.224455 2 75 6 12 4 2 0 ... 5 3 6 3 1 10 5 5 2 2
3 1 14 5297.663412 2 75 6 12 4 2 0 ... 5 2 6 2 1 10 4 4 2 2
4 1 15 4761.812371 2 75 6 12 4 2 0 ... 5 3 6 3 1 10 5 5 2 2

5 rows × 351 columns

One of the first things I notice here is that this dataset is HUGE: almost 29,000 rows and 351 columns! SO MUCH DATA!!! I won't have time to explore all of the features in this dataset, but I can consult the official Code Book for definitions. For now, I'll just say that this dataset tracks all sorts of behaviors relating to the ways households earn, save, and spend money in the United States.

For this project, I'm going to focus on households that have "been turned down for credit or feared being denied credit in the past 5 years." These households are identified in the "TURNFEAR" column.

Subset Data

In [4]:
# Use a mask to subset df to only households that have been turned down or feared being turned down for credit ("TURNFEAR" == 1)
mask = df["TURNFEAR"] == 1
mask.head(10)
Out[4]:
0    False
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
9     True
Name: TURNFEAR, dtype: bool
In [5]:
mask.sum()
Out[5]:
4623
In [6]:
mask = df["TURNFEAR"] == 1
df_fear = df[mask]  # Assign the subset to the variable name df_fear
print("df_fear shape:", df_fear.shape)
df_fear.head()
df_fear shape: (4623, 351)
Out[6]:
YY1 Y1 WGT HHSEX AGE AGECL EDUC EDCL MARRIED KIDS ... NWCAT INCCAT ASSETCAT NINCCAT NINC2CAT NWPCTLECAT INCPCTLECAT NINCPCTLECAT INCQRTCAT NINCQRTCAT
5 2 21 3790.476607 1 50 3 8 2 1 3 ... 1 2 1 2 1 1 4 4 2 2
6 2 22 3798.868505 1 50 3 8 2 1 3 ... 1 2 1 2 1 1 4 3 2 2
7 2 23 3799.468393 1 50 3 8 2 1 3 ... 1 2 1 2 1 1 4 4 2 2
8 2 24 3788.076005 1 50 3 8 2 1 3 ... 1 2 1 2 1 1 4 4 2 2
9 2 25 3793.066589 1 50 3 8 2 1 3 ... 1 2 1 2 1 1 4 4 2 2

5 rows × 351 columns

Explore

Age

Now that I have my subset, I'll explore the characteristics of this group. One of the features is age group ("AGECL").

Age Groups

In [7]:
df_fear["AGECL"].head()
Out[7]:
5    3
6    3
7    3
8    3
9    3
Name: AGECL, dtype: int64
In [8]:
df_fear["AGECL"].nunique()
Out[8]:
6
In [9]:
# Create a list age_groups with the unique values in the "AGECL" column
age_groups = df_fear["AGECL"].unique()
print("Age Groups:", age_groups)
Age Groups: [3 5 1 2 4 6]

Looking at the Code Book, I can see that "AGECL" represents categorical data, even though the values in the column are numeric.

This simplifies data storage, but it's not very human-readable. So before I create a visualization, I'll create a version of this column that uses the actual group names.
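
As an optional refinement, here's a sketch of mine (not a cell from the original notebook): storing the labels as an ordered categorical keeps the age groups in their natural order in value counts and plots, instead of ordering them by frequency.

import pandas as pd

# Sketch: an ordered categorical sorts "Under 35" through "75 or Older"
# naturally. dict(enumerate(labels, start=1)) maps code 1 to "Under 35", etc.
labels = ["Under 35", "35-44", "45-54", "55-64", "65-74", "75 or Older"]
age_cl = df_fear["AGECL"].replace(dict(enumerate(labels, start=1))).astype(
    pd.CategoricalDtype(categories=labels, ordered=True)
)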

Replace Age Groups

In [10]:
df_fear["AGECL"].head(10)
Out[10]:
5      3
6      3
7      3
8      3
9      3
110    5
111    5
112    5
113    5
114    5
Name: AGECL, dtype: int64
In [11]:
# Create a Series agecl that contains the observations from "AGECL" using the true group names
agecl_dict = {
    1: "Under 35",
    2: "35-44",
    3: "45-54",
    4: "55-64",
    5: "65-74",
    6: "75 or Older",
}

age_cl = df_fear["AGECL"].replace(agecl_dict)
age_cl.head(10)
Out[11]:
5      45-54
6      45-54
7      45-54
8      45-54
9      45-54
110    65-74
111    65-74
112    65-74
113    65-74
114    65-74
Name: AGECL, dtype: object

Now that I have better labels, I'll make a bar chart and see the age distribution of my group.

Age Group Bar Chart

In [12]:
age_cl.head()
Out[12]:
5    45-54
6    45-54
7    45-54
8    45-54
9    45-54
Name: AGECL, dtype: object
In [13]:
age_cl.value_counts()
Out[13]:
Under 35       1307
35-44          1093
45-54           932
55-64           745
65-74           401
75 or Older     145
Name: AGECL, dtype: int64
In [14]:
# Create a bar chart showing the value counts from age_cl
age_cl_value_counts = age_cl.value_counts()

# Bar plot of `age_cl_value_counts`
age_cl.value_counts().plot(
    kind="bar",
    xlabel="Age Group",  # Label the x-axis "Age Group"
    ylabel="Frequency (count)", # Label the y-axis "Frequency (count)"
    title="Credit Fearful: Age Groups" # Title "Credit Fearful: Age Groups".
);

I noticed that by creating their own age groups, the authors of the survey have essentially made a histogram for me, with six bins. My chart is telling me that many of the people who fear being denied credit are younger. But the "Under 35" group covers a wider age range than the middle groups, so it might be useful to look inside those values to get a more granular understanding of the data.

To do that, I'll need to look at a different variable: "AGE". Whereas "AGECL" is a categorical variable, "AGE" is continuous, so I can use it to make a histogram of my own.

Age Histogram

In [15]:
# Create a histogram of the "AGE" column with 10 bins
df_fear["AGE"].hist(bins=10)
plt.xlabel("Age")  # Label the x-axis "Age"
plt.ylabel("Frequency (count)") # Label the y-axis "Frequency (count)"
plt.title("Credit Fearful: Age Distribution"); # Title "Credit Fearful: Age Distribution"

It looks like younger people are still more concerned about being able to secure a loan than older people, but the people who are most concerned seem to be between 30 and 40.
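
To pin down that peak, here's a quick cross-check of mine (not a cell from the original notebook), counting credit-fearful respondents in five-year age bins:

import pandas as pd

# Sketch: five-year bins make it easier to see exactly where the peak sits.
age_bins = pd.cut(df_fear["AGE"], bins=list(range(15, 100, 5)))
print(age_bins.value_counts().sort_index())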

Race

Now that I have an understanding of how age relates to my outcome of interest, I'll try some other possibilities, starting with race. The Code Book for "RACE" shows four categories, coded 1, 2, 3, and 5.

Note that there's no category coded 4. If a value for 4 did exist, it would be reasonable to assign it to "Asian American / Pacific Islander", a group that doesn't seem to be represented in the dataset. This dataset doesn't provide a complete picture of race in America.

Race Bar Chart: Credit Fearful

In [16]:
race_dict = {
    1: "White/Non-Hispanic",
    2: "Black/African-American",
    3: "Hispanic",
    5: "Other",
}
race = df_fear["RACE"].replace(race_dict)
race.head(10)
Out[16]:
5      White/Non-Hispanic
6      White/Non-Hispanic
7      White/Non-Hispanic
8      White/Non-Hispanic
9      White/Non-Hispanic
110    White/Non-Hispanic
111    White/Non-Hispanic
112    White/Non-Hispanic
113    White/Non-Hispanic
114    White/Non-Hispanic
Name: RACE, dtype: object
In [17]:
race_dict = {
    1: "White/Non-Hispanic",
    2: "Black/African-American",
    3: "Hispanic",
    5: "Other",
}
race = df_fear["RACE"].replace(race_dict)
race_value_counts = race.value_counts(normalize=True)
race_value_counts
Out[17]:
White/Non-Hispanic        0.539477
Black/African-American    0.237508
Hispanic                  0.162232
Other                     0.060783
Name: RACE, dtype: float64
In [18]:
# Create a horizontal bar chart showing the normalized value counts for "RACE"
race_dict = {
    1: "White/Non-Hispanic",
    2: "Black/African-American",
    3: "Hispanic",
    5: "Other",
}
race = df_fear["RACE"].replace(race_dict) # Replace the numerical values with the true group names
race_value_counts = race.value_counts(normalize=True)  

# Create bar chart of race_value_counts
race_value_counts.plot(kind="barh")
plt.xlim((0, 1))
plt.xlabel("Frequency (%)") # Label the x-axis "Frequency (%)"
plt.ylabel("Race") # Label the y-axis "Race"
plt.title("Credit Fearful: Racial Groups"); # Title "Credit Fearful: Racial Groups"

This suggests that White/Non-Hispanic people worry more about being denied credit. But thinking critically about what I'm seeing, that might simply be because there are more White/Non-Hispanic people in the United States than people of other racial groups, and the sample for this survey was specifically drawn to be representative of the population as a whole.
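
One way to check that against the population (a sketch of mine, not a cell from the original notebook; it assumes "WGT" is the household sampling weight that the SCF provides) is to weight each row instead of counting raw observations:

# Sketch: weight households by "WGT" to estimate population-level racial shares.
race_labels = df["RACE"].replace(race_dict)
weighted = df["WGT"].groupby(race_labels).sum()
print(weighted / weighted.sum())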

Race Bar Chart: Whole Dataset

In [19]:
# Recreate the horizontal bar chart I just made using the entire dataset df instead of the subset df_fear
race = df["RACE"].replace(race_dict)
race_value_counts = race.value_counts(normalize=True)
# Create bar chart of race_value_counts
race_value_counts.plot(kind="barh")
plt.xlim((0, 1))
plt.xlabel("Frequency (%)")
plt.ylabel("Race")
plt.title("SCF Respondents: Racial Groups");  # Title of this plot should be "SCF Respondents: Racial Groups"

How does this second bar chart change the perception of the first one? On the one hand, White/Non-Hispanic respondents account for around 70% of the whole dataset but only 54% of the credit-fearful subset. On the other hand, Black and Hispanic respondents represent about 23% of the whole dataset but 40% of the credit-fearful subset. In other words, Black and Hispanic households are actually more likely to be in the credit-fearful group.
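
A related cross-check (a sketch of mine, not a cell from the original notebook): racial shares within each TURNFEAR group side by side, where column 0 is the non-fearful group rather than the whole dataset.

import pandas as pd

# Sketch: each column sums to 1, giving within-group racial shares.
print(pd.crosstab(df["RACE"].replace(race_dict), df["TURNFEAR"], normalize="columns"))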

Income

What about income level? Are people with lower incomes concerned about being denied credit, or is that something people with more money worry about? In order to answer that question, I'll need to again compare the entire dataset with my subgroup using the "INCCAT" feature, which captures income percentile groups. This time, though, I'll make a single, side-by-side bar chart.

Income Categories: Credit Fearful vs Credit Fearless

In [21]:
inccat_dict = {
    1: "0-20",
    2: "21-39.9",
    3: "40-59.9",
    4: "60-79.9",
    5: "80-89.9",
    6: "90-100",
}

df_inccat = (
    df["INCCAT"]
    .replace(inccat_dict)

)

df_inccat.head()
Out[21]:
0    40-59.9
1    40-59.9
2    40-59.9
3    21-39.9
4    40-59.9
Name: INCCAT, dtype: object
In [22]:
inccat_dict = {
    1: "0-20",
    2: "21-39.9",
    3: "40-59.9",
    4: "60-79.9",
    5: "80-89.9",
    6: "90-100",
}

df_inccat = (
    df["INCCAT"]
    .replace(inccat_dict)
    .groupby(df["TURNFEAR"])

)

df_inccat
Out[22]:
<pandas.core.groupby.generic.SeriesGroupBy object at 0x144994d60>
In [23]:
inccat_dict = {
    1: "0-20",
    2: "21-39.9",
    3: "40-59.9",
    4: "60-79.9",
    5: "80-89.9",
    6: "90-100",
}

df_inccat = (
    df["INCCAT"]
    .replace(inccat_dict)
    .groupby(df["TURNFEAR"])
    .value_counts(normalize=True)

)

df_inccat
Out[23]:
TURNFEAR  INCCAT 
0         90-100     0.297296
          60-79.9    0.174841
          40-59.9    0.143146
          0-20       0.140343
          21-39.9    0.135933
          80-89.9    0.108441
1         0-20       0.288125
          21-39.9    0.256327
          40-59.9    0.228856
          60-79.9    0.132598
          90-100     0.048886
          80-89.9    0.045209
Name: INCCAT, dtype: float64
In [24]:
inccat_dict = {
    1: "0-20",
    2: "21-39.9",
    3: "40-59.9",
    4: "60-79.9",
    5: "80-89.9",
    6: "90-100",
}

df_inccat = (
    df["INCCAT"]
    .replace(inccat_dict)
    .groupby(df["TURNFEAR"])
    .value_counts(normalize=True)
    .rename("frequency")

)

df_inccat
Out[24]:
TURNFEAR  INCCAT 
0         90-100     0.297296
          60-79.9    0.174841
          40-59.9    0.143146
          0-20       0.140343
          21-39.9    0.135933
          80-89.9    0.108441
1         0-20       0.288125
          21-39.9    0.256327
          40-59.9    0.228856
          60-79.9    0.132598
          90-100     0.048886
          80-89.9    0.045209
Name: frequency, dtype: float64
In [25]:
inccat_dict = {
    1: "0-20",
    2: "21-39.9",
    3: "40-59.9",
    4: "60-79.9",
    5: "80-89.9",
    6: "90-100",
}

df_inccat = (
    df["INCCAT"]
    .replace(inccat_dict)
    .groupby(df["TURNFEAR"])
    .value_counts(normalize=True)
    .rename("frequency")
    .to_frame()

)

df_inccat
Out[25]:
frequency
TURNFEAR INCCAT
0 90-100 0.297296
60-79.9 0.174841
40-59.9 0.143146
0-20 0.140343
21-39.9 0.135933
80-89.9 0.108441
1 0-20 0.288125
21-39.9 0.256327
40-59.9 0.228856
60-79.9 0.132598
90-100 0.048886
80-89.9 0.045209
In [26]:
# Create a DataFrame df_inccat that shows the normalized frequency for income categories for both the credit fearful and non-credit fearful households in the dataset
inccat_dict = {
    1: "0-20",
    2: "21-39.9",
    3: "40-59.9",
    4: "60-79.9",
    5: "80-89.9",
    6: "90-100",
}

df_inccat = (
    df["INCCAT"]
    .replace(inccat_dict)
    .groupby(df["TURNFEAR"])
    .value_counts(normalize=True)
    .rename("frequency")
    .to_frame()
    .reset_index()
)

df_inccat
Out[26]:
TURNFEAR INCCAT frequency
0 0 90-100 0.297296
1 0 60-79.9 0.174841
2 0 40-59.9 0.143146
3 0 0-20 0.140343
4 0 21-39.9 0.135933
5 0 80-89.9 0.108441
6 1 0-20 0.288125
7 1 21-39.9 0.256327
8 1 40-59.9 0.228856
9 1 60-79.9 0.132598
10 1 90-100 0.048886
11 1 80-89.9 0.045209

Income Categories: Side-by-Side Bar Chart

In [27]:
# Create bar chart of `df_inccat`
sns.barplot(
    x="INCCAT", # Income categories are in the correct order along the x-axis
    y="frequency",
    hue="TURNFEAR", # Set hue to "TURNFEAR"
    data=df_inccat,
    order=list(inccat_dict.values())
)
plt.xlabel("Income Category") # Label x-axis "Income Category"
plt.ylabel("Frequency (%)")   # Label y-axis "Frequency (%)"
plt.title("Income Distribution: Credit Fearful vs. Non-fearful"); # Title "Income Distribution: Credit Fearful vs. Non-fearful"

Comparing the income categories across the fearful and non-fearful groups, I can see that credit fearful households are much more common in the lower income categories. In other words, the credit fearful have lower incomes.

So, based on all this, what do I know? Among the people who reported having been turned down for credit or fearing a denial in the past five years, the young and the low-income make up the largest shares of respondents. That makes sense: young people tend to make less money and rely more heavily on credit to get their lives off the ground, so having been denied credit makes them more anxious about the future.

Assets

Not all the data is demographic, though. If I were working for a bank, I would probably care less about how old people are and more about their ability to carry debt. If I were going to build a model for that, I'd want to establish some relationships among the variables, and making some correlation matrices is a good place to start.

First, I'll zoom out a little bit. I've been looking only at the people who answered "yes" when the survey asked about "TURNFEAR", but what if I looked at everyone instead? To begin with, I'll go back to the full dataset and run a single correlation.

Assets vs Home Value: Whole Dataset

In [28]:
# Calculate the correlation coefficient for "ASSET" and "HOUSES" in the whole dataset df
asset_house_corr = df["ASSET"].corr(df["HOUSES"])
print("SCF: Asset Houses Correlation:", asset_house_corr)
SCF: Asset Houses Correlation: 0.5198273544779252

That's a moderate positive correlation, which I would probably expect: for many Americans, the value of their primary residence makes up most of the value of their total assets. What about the people in my TURNFEAR subset, though? I'll run the same correlation to see if there's a difference.
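
Both versions of this correlation can also be computed in one pass (a sketch of mine, not a cell from the original notebook):

# Sketch: the ASSET vs HOUSES correlation within each TURNFEAR group.
print(df.groupby("TURNFEAR").apply(lambda g: g["ASSET"].corr(g["HOUSES"])))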

Assets vs Home Value: Credit Fearful

In [29]:
# Calculate the correlation coefficient for "ASSET" and "HOUSES" in the credit-fearful subset df_fear
asset_house_corr = df_fear["ASSET"].corr(df_fear["HOUSES"])
print("Credit Fearful: Asset Houses Correlation:", asset_house_corr)
Credit Fearful: Asset Houses Correlation: 0.5832879735979152

They're different. It's still only a moderate positive correlation, but the relationship between the total value of assets and the value of the primary residence is stronger for my TURNFEAR group than it is for the population as a whole.

I'll make correlation matrices using the rest of the data for both df and df_fear and see if the differences persist. Here, I'll look at only 5 features: "ASSET", "HOUSES", "INCOME", "DEBT", and "EDUC".

Correlation Matrix: Whole Dataset

In [30]:
cols = ["ASSET", "HOUSES", "INCOME", "DEBT", "EDUC"]
corr = df[cols]
corr.head()
Out[30]:
ASSET HOUSES INCOME DEBT EDUC
0 2153600.0 1100000.0 67195.781504 0.0 12
1 2116200.0 1100000.0 57014.602488 0.0 12
2 2145000.0 1100000.0 51924.012980 0.0 12
3 2552500.0 1100000.0 41742.833964 0.0 12
4 2176200.0 1100000.0 50905.895078 0.0 12
In [31]:
cols = ["ASSET", "HOUSES", "INCOME", "DEBT", "EDUC"]
corr = df[cols].corr()
corr
Out[31]:
ASSET HOUSES INCOME DEBT EDUC
ASSET 1.000000 0.519827 0.622429 0.261250 0.116673
HOUSES 0.519827 1.000000 0.247852 0.266661 0.169300
INCOME 0.622429 0.247852 1.000000 0.114646 0.069400
DEBT 0.261250 0.266661 0.114646 1.000000 0.054179
EDUC 0.116673 0.169300 0.069400 0.054179 1.000000
In [32]:
# Make a correlation matrix using df, considering only the columns "ASSET", "HOUSES", "INCOME", "DEBT", and "EDUC"
cols = ["ASSET", "HOUSES", "INCOME", "DEBT", "EDUC"]
corr = df[cols].corr()
corr.style.background_gradient(axis=None)
Out[32]:
  ASSET HOUSES INCOME DEBT EDUC
ASSET 1.000000 0.519827 0.622429 0.261250 0.116673
HOUSES 0.519827 1.000000 0.247852 0.266661 0.169300
INCOME 0.622429 0.247852 1.000000 0.114646 0.069400
DEBT 0.261250 0.266661 0.114646 1.000000 0.054179
EDUC 0.116673 0.169300 0.069400 0.054179 1.000000

Correlation Matrix: Credit Fearful

In [33]:
# Make a correlation matrix using df_fear
corr = df_fear[cols].corr()
corr.style.background_gradient(axis=None)
Out[33]:
  ASSET HOUSES INCOME DEBT EDUC
ASSET 1.000000 0.583288 0.722074 0.474658 0.113536
HOUSES 0.583288 1.000000 0.264099 0.962629 0.160348
INCOME 0.722074 0.264099 1.000000 0.172393 0.133170
DEBT 0.474658 0.962629 0.172393 1.000000 0.177386
EDUC 0.113536 0.160348 0.133170 0.177386 1.000000

There are some pretty important differences here. The relationship between "DEBT" and "HOUSES" is positive for both datasets, but while the coefficient for df is fairly weak at 0.26, the same coefficient for df_fear is 0.96.

The closer a correlation coefficient is to 1.0, the more closely the two variables move together. In this case, that means the value of the primary residence and the total debt held by the household are very nearly proportional. This suggests that the main source of debt carried by my "TURNFEAR" households is their primary residence, which, again, is an intuitive finding.
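
For reference, that coefficient is Pearson's r. Here's a minimal by-hand version (a sketch of mine, not a cell from the original notebook), which should agree with Series.corr up to floating-point error:

import numpy as np

# Sketch: Pearson's r for DEBT and HOUSES in the credit-fearful subset,
# computed from the definition instead of with Series.corr.
x = df_fear["DEBT"].to_numpy(dtype=float)
y = df_fear["HOUSES"].to_numpy(dtype=float)
r = ((x - x.mean()) * (y - y.mean())).sum() / (
    np.sqrt(((x - x.mean()) ** 2).sum()) * np.sqrt(((y - y.mean()) ** 2).sum())
)
print(r)  # should be close to the 0.96 in the matrix above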

"DEBT" and "ASSET" share a similarly striking difference, as do "EDUC" and "DEBT" which, while not as extreme a contrast as the other, is still big enough to catch the interest of our hypothetical banker.

I'll make some visualizations to show these relationships graphically.

Education

First, I'll start with education level ("EDUC"), comparing the credit-fearful and non-credit fearful groups.

Education: Credit Fearful vs Credit Fearless

In [34]:
df_educ = (
    df["EDUC"]
)
df_educ.head()
Out[34]:
0    12
1    12
2    12
3    12
4    12
Name: EDUC, dtype: int64
In [35]:
df_educ = (
    df["EDUC"]
    .groupby(df["TURNFEAR"])
)
df_educ
Out[35]:
<pandas.core.groupby.generic.SeriesGroupBy object at 0x144d1aa00>
In [36]:
df_educ = (
    df["EDUC"]
    .groupby(df["TURNFEAR"])
    .value_counts(normalize=True)
)
df_educ.head()
Out[36]:
TURNFEAR  EDUC
0         12      0.257481
          8       0.192029
          13      0.149823
          9       0.129833
          14      0.096117
Name: EDUC, dtype: float64
In [37]:
df_educ = (
    df["EDUC"]
    .groupby(df["TURNFEAR"])
    .value_counts(normalize=True)
    .rename("frequency")
)
df_educ.head()
Out[37]:
TURNFEAR  EDUC
0         12      0.257481
          8       0.192029
          13      0.149823
          9       0.129833
          14      0.096117
Name: frequency, dtype: float64
In [38]:
df_educ = (
    df["EDUC"]
    .groupby(df["TURNFEAR"])
    .value_counts(normalize=True)
    .rename("frequency")
    .to_frame()
)
df_educ.head()
Out[38]:
frequency
TURNFEAR EDUC
0 12 0.257481
8 0.192029
13 0.149823
9 0.129833
14 0.096117
In [39]:
# Create a DataFrame df_educ that shows the normalized frequency for education categories for both the credit fearful and non-credit fearful households in the dataset
df_educ = (
    df["EDUC"]
    .groupby(df["TURNFEAR"])
    .value_counts(normalize=True)
    .rename("frequency")
    .to_frame()
    .reset_index()
)
df_educ.head()
Out[39]:
TURNFEAR EDUC frequency
0 0 12 0.257481
1 0 8 0.192029
2 0 13 0.149823
3 0 9 0.129833
4 0 14 0.096117

Education: Side-by-Side Bar Chart

In [40]:
# Create bar chart of `df_educ`
sns.barplot(
    x="EDUC", # Make sure that the education categories are in the correct order along the x-axis
    y="frequency",
    hue="TURNFEAR", # Set hue to "TURNFEAR"
    data=df_educ
)
plt.xlabel("Education Level")  # Label the x-axis "Education Level"
plt.ylabel("Frequency (%)")    # Label the y-axis "Frequency (%)"
plt.title("Educational Attainment: Credit Fearful vs. Non-fearful"); # Title "Educational Attainment: Credit Fearful vs. Non-fearful"

In this plot, I can see that a much higher proportion of credit-fearful respondents have only a high school diploma, while university degrees are more common among the non-credit fearful.

Debt

I'll keep going with some scatter plots that look at debt.

Assets vs Debt: Whole Dataset

In [41]:
# Use df to make a scatter plot showing the relationship between DEBT and ASSET
df.plot.scatter(x="DEBT", y="ASSET");

Assets vs Debt: Credit Fearful

In [42]:
# Use df_fear to make a scatter plot showing the relationship between DEBT and ASSET
df_fear.plot.scatter(x="DEBT", y="ASSET");

I can see that the relationship in my df_fear graph is flatter than the one in my df graph; the two graphs are clearly different.

I'll end with the most striking difference from my matrices and make some scatter plots showing the relationship between HOUSES and DEBT.

Home Value vs Debt: Whole Dataset

In [43]:
# Use df to make a scatter plot showing the relationship between HOUSES and DEBT
df.plot.scatter(x="DEBT", y="HOUSES");

And I'll make the same scatter plot using df_fear.

Home Value vs Debt: Credit Fearful

In [44]:
# Use df_fear to make a scatter plot showing the relationship between HOUSES and DEBT
df_fear.plot.scatter(x="DEBT", y="HOUSES");

The outliers make it a little difficult to see the difference between these two plots, but the relationship is clear enough: my df_fear graph shows an almost perfectly linear relationship, while my df graph shows something a little more muddled. I also notice that the data points on the df_fear graph form several little groups. Those are called "clusters," and I'll talk more about how to analyze clustered data in the next project.
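
As a tiny preview of that next project, here's a sketch of mine (not part of this notebook; it assumes scikit-learn is available, and the choice of four clusters is arbitrary, for illustration only):

import pandas as pd
from sklearn.cluster import KMeans

# Sketch: fit k-means on the two columns behind the scatter plot.
X = df_fear[["DEBT", "HOUSES"]]
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
cluster_labels = kmeans.fit_predict(X)
print(pd.Series(cluster_labels).value_counts())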