Survey of Consumer Finances: Unsupervised Clustering
Part 1: Exploring the Data
__author__ = "Donald Ghazi"
__email__ = "donald@donaldghazi.com"
__website__ = "donaldghazi.com"
In this project, I'm going to work with data from the Survey of Consumer Finances (SCF). The SCF is a survey sponsored by the US Federal Reserve. It tracks financial, demographic, and opinion information about families in the United States. The survey is conducted every three years, and I'll work with an extract of the results from 2019.
GOALS
Subset my data to focus on "credit fearful" households.
Explore demographic information like age, race, education, income, and assets.
Create side-by-side bar charts to highlight differences between "credit fearful" and non-credit fearful households.
Prepare Data
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
First, I need to load the data, which is stored in a compressed CSV file: SCFP2019.csv.gz. In the last project, I decompressed files using gzip and the command line. However, the pandas read_csv function can work with compressed files directly.
Import Data
# Read the file "data/SCFP2019.csv.gz" into the DataFrame df
df = pd.read_csv("data/SCFP2019.csv.gz")
print("df shape:", df.shape)
df.head()
df shape: (28885, 351)
YY1 | Y1 | WGT | HHSEX | AGE | AGECL | EDUC | EDCL | MARRIED | KIDS | ... | NWCAT | INCCAT | ASSETCAT | NINCCAT | NINC2CAT | NWPCTLECAT | INCPCTLECAT | NINCPCTLECAT | INCQRTCAT | NINCQRTCAT | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 11 | 6119.779308 | 2 | 75 | 6 | 12 | 4 | 2 | 0 | ... | 5 | 3 | 6 | 3 | 2 | 10 | 6 | 6 | 3 | 3 |
1 | 1 | 12 | 4712.374912 | 2 | 75 | 6 | 12 | 4 | 2 | 0 | ... | 5 | 3 | 6 | 3 | 1 | 10 | 5 | 5 | 2 | 2 |
2 | 1 | 13 | 5145.224455 | 2 | 75 | 6 | 12 | 4 | 2 | 0 | ... | 5 | 3 | 6 | 3 | 1 | 10 | 5 | 5 | 2 | 2 |
3 | 1 | 14 | 5297.663412 | 2 | 75 | 6 | 12 | 4 | 2 | 0 | ... | 5 | 2 | 6 | 2 | 1 | 10 | 4 | 4 | 2 | 2 |
4 | 1 | 15 | 4761.812371 | 2 | 75 | 6 | 12 | 4 | 2 | 0 | ... | 5 | 3 | 6 | 3 | 1 | 10 | 5 | 5 | 2 | 2 |
5 rows × 351 columns
One of the first things I notice here is that this dataset is huge: nearly 29,000 rows and 351 columns! I won't have time to explore all of the features in this dataset, but I can refer to the official Code Book for their definitions. For now, I'll just say that this dataset tracks all sorts of behaviors relating to the ways households earn, save, and spend money in the United States.
For this project, I'm going to focus on households that have "been turned down for credit or feared being denied credit in the past 5 years." These households are identified in the "TURNFEAR" column.
Subset Data
# Use a mask to subset df to only households that have been turned down or feared being turned down for credit ("TURNFEAR" == 1)
mask = df["TURNFEAR"] == 1
mask.head(10)
0    False
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
9     True
Name: TURNFEAR, dtype: bool
mask.sum()
4623
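That's 4,623 households. As a quick sanity check (my own addition, not part of the original flow), the mean of a Boolean mask gives the share of flagged rows:
# Sketch: True counts as 1, so the mean is the fraction of credit-fearful households
print("Share credit fearful:", round(mask.mean(), 3))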
mask = df["TURNFEAR"] == 1
df_fear = df[mask] # Assign the subset to the variable name df_fear
print("df_fear shape:", df_fear.shape)
df_fear.head()
df_fear shape: (4623, 351)
YY1 | Y1 | WGT | HHSEX | AGE | AGECL | EDUC | EDCL | MARRIED | KIDS | ... | NWCAT | INCCAT | ASSETCAT | NINCCAT | NINC2CAT | NWPCTLECAT | INCPCTLECAT | NINCPCTLECAT | INCQRTCAT | NINCQRTCAT | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5 | 2 | 21 | 3790.476607 | 1 | 50 | 3 | 8 | 2 | 1 | 3 | ... | 1 | 2 | 1 | 2 | 1 | 1 | 4 | 4 | 2 | 2 |
6 | 2 | 22 | 3798.868505 | 1 | 50 | 3 | 8 | 2 | 1 | 3 | ... | 1 | 2 | 1 | 2 | 1 | 1 | 4 | 3 | 2 | 2 |
7 | 2 | 23 | 3799.468393 | 1 | 50 | 3 | 8 | 2 | 1 | 3 | ... | 1 | 2 | 1 | 2 | 1 | 1 | 4 | 4 | 2 | 2 |
8 | 2 | 24 | 3788.076005 | 1 | 50 | 3 | 8 | 2 | 1 | 3 | ... | 1 | 2 | 1 | 2 | 1 | 1 | 4 | 4 | 2 | 2 |
9 | 2 | 25 | 3793.066589 | 1 | 50 | 3 | 8 | 2 | 1 | 3 | ... | 1 | 2 | 1 | 2 | 1 | 1 | 4 | 4 | 2 | 2 |
5 rows × 351 columns
Now that I have my subset, I'll explore the characteristics of this group. One of the features is age group ("AGECL").
Age Groups
df_fear["AGECL"].head()
5    3
6    3
7    3
8    3
9    3
Name: AGECL, dtype: int64
df_fear["AGECL"].nunique()
6
# Create a list age_groups with the unique values in the "AGECL" column
age_groups = df_fear["AGECL"].unique()
print("Age Groups:", age_groups)
Age Groups: [3 5 1 2 4 6]
Looking at the Code Book, I can see that "AGECL" represents categorical data, even though the values in the column are numeric. This simplifies data storage, but it's not very human-readable. So before I create a visualization, I'll create a version of this column that uses the actual group names.
Replace Age Groups
df_fear["AGECL"].head(10)
5      3
6      3
7      3
8      3
9      3
110    5
111    5
112    5
113    5
114    5
Name: AGECL, dtype: int64
# Create a Series age_cl that contains the observations from "AGECL" using the true group names
agecl_dict = {
1: "Under 35",
2: "35-44",
3: "45-54",
4: "55-64",
5: "65-74",
6: "75 or Older",
}
age_cl = df_fear["AGECL"].replace(agecl_dict)
age_cl.head(10)
5      45-54
6      45-54
7      45-54
8      45-54
9      45-54
110    65-74
111    65-74
112    65-74
113    65-74
114    65-74
Name: AGECL, dtype: object
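As an aside, Series.map would do the same lookup here; the difference (a sketch of my own) is in how unmapped codes are handled:
# Sketch: map does the same lookup, but any code missing from agecl_dict
# would become NaN under map, while replace leaves unmapped values unchanged
print(df_fear["AGECL"].map(agecl_dict).head())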
Now that I have better labels, I'll make a bar chart and see the age distribution of my group.
Age Group Bar Chart
age_cl.head()
5    45-54
6    45-54
7    45-54
8    45-54
9    45-54
Name: AGECL, dtype: object
age_cl.value_counts()
Under 35       1307
35-44          1093
45-54           932
55-64           745
65-74           401
75 or Older     145
Name: AGECL, dtype: int64
# Create a bar chart showing the value counts from age_cl
age_cl_value_counts = age_cl.value_counts()
# Bar plot of `age_cl_value_counts`
age_cl_value_counts.plot(
kind="bar",
xlabel="Age Group", # Label the x-axis "Age Group"
ylabel="Frequency (count)", # Label the y-axis "Frequency (count)"
title="Credit Fearful: Age Groups" # Title "Credit Fearful: Age Groups".
);
I noticed that by creating their own age groups, the authors of the survey have essentially made a histogram for us, composed of six bins. My chart is telling me that many of the people who fear being denied credit are younger. But the first bin ("Under 35") covers a wider range than the middle bins, so it might be useful to look inside those values to get a more granular understanding of the data.
To do that, I'll need to look at a different variable: "AGE". Whereas "AGECL" was a categorical variable, "AGE" is continuous, so I can use it to make a histogram of my own.
Age Histogram
# Create a histogram of the "AGE" column with 10 bins
df_fear["AGE"].hist(bins=10)
plt.xlabel("Age") # Label the x-axis "Age"
plt.ylabel("Frequency (count)") # Label the y-axis "Frequency (count)"
plt.title("Credit Fearful: Age Distribution"); # Title "Credit Fearful: Age Distribution"
It looks like younger people are still more concerned about being able to secure a loan than older people, but the people who are most concerned seem to be between 30 and 40.
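To double-check that reading, here's a quick sketch of my own that bins ages by decade (the bin edges are my assumption):
# Sketch: count credit-fearful respondents per decade of age
decades = pd.cut(df_fear["AGE"], bins=range(10, 101, 10), right=False)
print(decades.value_counts().sort_index())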
Now that I have an understanding of how age relates to my outcome of interest, I'll try some other possibilities, starting with race. If I look at the Code Book for "RACE", I can see that there are four categories.
Note that there's no 4 category here. If a value of 4 did exist, it would be reasonable to assign it to "Asian American / Pacific Islander", a group that doesn't seem to be separately represented in the dataset. This dataset doesn't provide a complete picture of race in America.
Race Bar Chart: Credit Fearful
race_dict = {
1: "White/Non-Hispanic",
2: "Black/African-American",
3: "Hispanic",
5: "Other",
}
race = df_fear["RACE"].replace(race_dict)
race.head(10)
5      White/Non-Hispanic
6      White/Non-Hispanic
7      White/Non-Hispanic
8      White/Non-Hispanic
9      White/Non-Hispanic
110    White/Non-Hispanic
111    White/Non-Hispanic
112    White/Non-Hispanic
113    White/Non-Hispanic
114    White/Non-Hispanic
Name: RACE, dtype: object
race_dict = {
1: "White/Non-Hispanic",
2: "Black/African-American",
3: "Hispanic",
5: "Other",
}
race = df_fear["RACE"].replace(race_dict)
race_value_counts = race.value_counts(normalize=True)
race_value_counts
White/Non-Hispanic        0.539477
Black/African-American    0.237508
Hispanic                  0.162232
Other                     0.060783
Name: RACE, dtype: float64
# Create a horizontal bar chart showing the normalized value counts for "RACE"
race_dict = {
1: "White/Non-Hispanic",
2: "Black/African-American",
3: "Hispanic",
5: "Other",
}
race = df_fear["RACE"].replace(race_dict) # Replace the numerical values with the true group names
race_value_counts = race.value_counts(normalize=True)
# Create bar chart of race_value_counts
race_value_counts.plot(kind="barh")
plt.xlim((0, 1))
plt.xlabel("Frequency (%)") # Label the x-axis "Frequency (%)"
plt.ylabel("Race") # Label the y-axis "Race"
plt.title("Credit Fearful: Racial Groups"); # Title "Credit Fearful: Racial Groups"
This suggests that White/Non-Hispanic people worry more about being denied credit. But thinking critically about what I'm seeing, that might simply be because there are more White/Non-Hispanic people in the United States than there are people in other racial groups, and the sample for this survey was specifically drawn to be representative of the population as a whole.
Race Bar Chart: Whole Dataset
# Recreate the horizontal bar chart I just made using the entire dataset df instead of the subset df_fear
race = df["RACE"].replace(race_dict)
race_value_counts = race.value_counts(normalize=True)
# Create bar chart of race_value_counts
race_value_counts.plot(kind="barh")
plt.xlim((0, 1))
plt.xlabel("Frequency (%)")
plt.ylabel("Race")
plt.title("SCF Respondents: Racial Groups"); # Title of this plot should be "SCF Respondents: Racial Groups"
How does this second bar chart change my reading of the first one? On the one hand, White/Non-Hispanic respondents account for around 70% of the whole dataset, but only 54% of credit-fearful respondents. On the other hand, Black and Hispanic respondents represent 23% of the whole dataset but 40% of credit-fearful respondents. In other words, Black and Hispanic households are actually more likely to be in the credit-fearful group.
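To put exact numbers on that comparison, here's a small sketch of my own that reuses race_dict to line the two groups up side by side:
# Sketch: racial shares in the whole dataset vs. the credit-fearful subset
race_shares = pd.DataFrame({
    "whole": df["RACE"].replace(race_dict).value_counts(normalize=True),
    "fearful": df_fear["RACE"].replace(race_dict).value_counts(normalize=True),
})
print(race_shares)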
What about income level? Are people with lower incomes concerned about being denied credit, or is that something people with more money worry about? In order to answer that question, I'll need to again compare the entire dataset with my subgroup using the "INCCAT" feature, which captures income percentile groups. This time, though, I'll make a single, side-by-side bar chart.
Income Categories: Credit Fearful vs Credit Fearless
inccat_dict = {
1: "0-20",
2: "21-39.9",
3: "40-59.9",
4: "60-79.9",
5: "80-89.9",
6: "90-100",
}
df_inccat = (
df["INCCAT"]
.replace(inccat_dict)
)
df_inccat.head()
0    40-59.9
1    40-59.9
2    40-59.9
3    21-39.9
4    40-59.9
Name: INCCAT, dtype: object
inccat_dict = {
1: "0-20",
2: "21-39.9",
3: "40-59.9",
4: "60-79.9",
5: "80-89.9",
6: "90-100",
}
df_inccat = (
df["INCCAT"]
.replace(inccat_dict)
.groupby(df["TURNFEAR"])
)
df_inccat
<pandas.core.groupby.generic.SeriesGroupBy object at 0x144994d60>
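That repr just means pandas is deferring the work: the groups are defined, but nothing is computed until I ask for an aggregation. A quick sketch of my own to force it:
# Sketch: .size() triggers the computation and shows the two TURNFEAR groups
print(df["INCCAT"].groupby(df["TURNFEAR"]).size())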
inccat_dict = {
1: "0-20",
2: "21-39.9",
3: "40-59.9",
4: "60-79.9",
5: "80-89.9",
6: "90-100",
}
df_inccat = (
df["INCCAT"]
.replace(inccat_dict)
.groupby(df["TURNFEAR"])
.value_counts(normalize=True)
)
df_inccat
TURNFEAR  INCCAT 
0         90-100     0.297296
          60-79.9    0.174841
          40-59.9    0.143146
          0-20       0.140343
          21-39.9    0.135933
          80-89.9    0.108441
1         0-20       0.288125
          21-39.9    0.256327
          40-59.9    0.228856
          60-79.9    0.132598
          90-100     0.048886
          80-89.9    0.045209
Name: INCCAT, dtype: float64
inccat_dict = {
1: "0-20",
2: "21-39.9",
3: "40-59.9",
4: "60-79.9",
5: "80-89.9",
6: "90-100",
}
df_inccat = (
df["INCCAT"]
.replace(inccat_dict)
.groupby(df["TURNFEAR"])
.value_counts(normalize=True)
.rename("frequency")
)
df_inccat
TURNFEAR  INCCAT 
0         90-100     0.297296
          60-79.9    0.174841
          40-59.9    0.143146
          0-20       0.140343
          21-39.9    0.135933
          80-89.9    0.108441
1         0-20       0.288125
          21-39.9    0.256327
          40-59.9    0.228856
          60-79.9    0.132598
          90-100     0.048886
          80-89.9    0.045209
Name: frequency, dtype: float64
inccat_dict = {
1: "0-20",
2: "21-39.9",
3: "40-59.9",
4: "60-79.9",
5: "80-89.9",
6: "90-100",
}
df_inccat = (
df["INCCAT"]
.replace(inccat_dict)
.groupby(df["TURNFEAR"])
.value_counts(normalize=True)
.rename("frequency")
.to_frame()
)
df_inccat
frequency | ||
---|---|---|
TURNFEAR | INCCAT | |
0 | 90-100 | 0.297296 |
60-79.9 | 0.174841 | |
40-59.9 | 0.143146 | |
0-20 | 0.140343 | |
21-39.9 | 0.135933 | |
80-89.9 | 0.108441 | |
1 | 0-20 | 0.288125 |
21-39.9 | 0.256327 | |
40-59.9 | 0.228856 | |
60-79.9 | 0.132598 | |
90-100 | 0.048886 | |
80-89.9 | 0.045209 |
# Create a DataFrame df_inccat that shows the normalized frequency for income categories for both the credit fearful and non-credit fearful households in the dataset
inccat_dict = {
1: "0-20",
2: "21-39.9",
3: "40-59.9",
4: "60-79.9",
5: "80-89.9",
6: "90-100",
}
df_inccat = (
df["INCCAT"]
.replace(inccat_dict)
.groupby(df["TURNFEAR"])
.value_counts(normalize=True)
.rename("frequency")
.to_frame()
.reset_index()
)
df_inccat
TURNFEAR | INCCAT | frequency | |
---|---|---|---|
0 | 0 | 90-100 | 0.297296 |
1 | 0 | 60-79.9 | 0.174841 |
2 | 0 | 40-59.9 | 0.143146 |
3 | 0 | 0-20 | 0.140343 |
4 | 0 | 21-39.9 | 0.135933 |
5 | 0 | 80-89.9 | 0.108441 |
6 | 1 | 0-20 | 0.288125 |
7 | 1 | 21-39.9 | 0.256327 |
8 | 1 | 40-59.9 | 0.228856 |
9 | 1 | 60-79.9 | 0.132598 |
10 | 1 | 90-100 | 0.048886 |
11 | 1 | 80-89.9 | 0.045209 |
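As an aside, a pd.crosstab one-liner (my own sketch, not part of the original flow) yields the same normalized frequencies in wide form, which can be handy for eyeballing:
# Sketch: normalize="index" makes each TURNFEAR row sum to 1, matching the groupby result
wide = pd.crosstab(df["TURNFEAR"], df["INCCAT"].replace(inccat_dict), normalize="index")
print(wide)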
Income Categories: Side-by-Side Bar Chart
# Create bar chart of `df_inccat`
sns.barplot(
x="INCCAT", # Income categories are in the correct order along the x-axis
y="frequency",
hue="TURNFEAR", # Set hue to "TURNFEAR"
data=df_inccat,
order=inccat_dict.values()
)
plt.xlabel("Income Category") # Label x-axis "Income Category"
plt.ylabel("Frequency (%)") # Label y-axis "Frequency (%)"
plt.title("Income Distribution: Credit Fearful vs. Non-fearful"); # Title "Income Distribution: Credit Fearful vs. Non-fearful"
Comparing the income categories across the fearful and non-fearful groups, I can see that credit fearful households are much more common in the lower income categories. In other words, the credit fearful have lower incomes.
So, based on all this, what do I know? Among the households that had been turned down for credit or feared being denied in the past five years, the young and the low-income made up the largest shares of respondents. That makes sense: young people tend to make less money and rely more heavily on credit to get their lives off the ground, so having been denied credit (or expecting to be) makes them more anxious about the future.
Not all the data is demographic, though. If I were working for a bank, I would probably care less about how old people are, and more about their ability to carry more debt. If I were going to build a model for that, I'd want to establish some relationships among the variables, and making some correlation matrices is a good place to start.
First, I'll zoom out a little bit. I've been looking only at the people who answered "yes" when the survey asked about "TURNFEAR", but what if I looked at everyone instead? To begin, I'll go back to the full dataset and run a single correlation.
Assets vs Home Value: Whole Dataset
# Calculate the correlation coefficient for "ASSET" and "HOUSES" in the whole dataset df
asset_house_corr = df["ASSET"].corr(df["HOUSES"])
print("SCF: Asset Houses Correlation:", asset_house_corr)
SCF: Asset Houses Correlation: 0.5198273544779252
That's a moderate positive correlation, which I would probably expect. For many Americans, the value of their primary residence makes up most of the value of their total assets. What about the people in our TURNFEAR subset, though? I'll run that correlation to see if there's a difference.
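For reference, .corr defaults to Pearson's r. Here's a minimal by-hand sketch (assuming these columns have no missing values) that should match the result above:
# Sketch: Pearson's r computed from its definition
import numpy as np
x, y = df["ASSET"], df["HOUSES"]
num = ((x - x.mean()) * (y - y.mean())).sum()
den = np.sqrt(((x - x.mean()) ** 2).sum() * ((y - y.mean()) ** 2).sum())
print("By hand:", num / den)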
Assets vs Home Value: Credit Fearful
# Calculate the correlation coefficient for "ASSET" and "HOUSES" in the credit-fearful subset df_fear
asset_house_corr = df_fear["ASSET"].corr(df_fear["HOUSES"])
print("Credit Fearful: Asset Houses Correlation:", asset_house_corr)
Credit Fearful: Asset Houses Correlation: 0.5832879735979152
They're different. It's still only a moderate positive correlation, but the relationship between the total value of assets and the value of the primary residence is stronger for my TURNFEAR group than it is for the population as a whole.
I'll make correlation matrices using the rest of the data for both df and df_fear and see if the differences persist. Here, I'll look at only five features: "ASSET", "HOUSES", "INCOME", "DEBT", and "EDUC".
Correlation Matrix: Whole Dataset
cols = ["ASSET", "HOUSES", "INCOME", "DEBT", "EDUC"]
corr = df[cols]
corr.head()
ASSET | HOUSES | INCOME | DEBT | EDUC | |
---|---|---|---|---|---|
0 | 2153600.0 | 1100000.0 | 67195.781504 | 0.0 | 12 |
1 | 2116200.0 | 1100000.0 | 57014.602488 | 0.0 | 12 |
2 | 2145000.0 | 1100000.0 | 51924.012980 | 0.0 | 12 |
3 | 2552500.0 | 1100000.0 | 41742.833964 | 0.0 | 12 |
4 | 2176200.0 | 1100000.0 | 50905.895078 | 0.0 | 12 |
cols = ["ASSET", "HOUSES", "INCOME", "DEBT", "EDUC"]
corr = df[cols].corr()
corr
ASSET | HOUSES | INCOME | DEBT | EDUC | |
---|---|---|---|---|---|
ASSET | 1.000000 | 0.519827 | 0.622429 | 0.261250 | 0.116673 |
HOUSES | 0.519827 | 1.000000 | 0.247852 | 0.266661 | 0.169300 |
INCOME | 0.622429 | 0.247852 | 1.000000 | 0.114646 | 0.069400 |
DEBT | 0.261250 | 0.266661 | 0.114646 | 1.000000 | 0.054179 |
EDUC | 0.116673 | 0.169300 | 0.069400 | 0.054179 | 1.000000 |
# Make a correlation matrix using df, considering only the columns "ASSET", "HOUSES", "INCOME", "DEBT", and "EDUC"
cols = ["ASSET", "HOUSES", "INCOME", "DEBT", "EDUC"]
corr = df[cols].corr()
corr.style.background_gradient(axis=None)
ASSET | HOUSES | INCOME | DEBT | EDUC | |
---|---|---|---|---|---|
ASSET | 1.000000 | 0.519827 | 0.622429 | 0.261250 | 0.116673 |
HOUSES | 0.519827 | 1.000000 | 0.247852 | 0.266661 | 0.169300 |
INCOME | 0.622429 | 0.247852 | 1.000000 | 0.114646 | 0.069400 |
DEBT | 0.261250 | 0.266661 | 0.114646 | 1.000000 | 0.054179 |
EDUC | 0.116673 | 0.169300 | 0.069400 | 0.054179 | 1.000000 |
Correlation Matrix: Credit Fearful
# Make a correlation matrix using df_fear
corr = df_fear[cols].corr()
corr.style.background_gradient(axis=None)
ASSET | HOUSES | INCOME | DEBT | EDUC | |
---|---|---|---|---|---|
ASSET | 1.000000 | 0.583288 | 0.722074 | 0.474658 | 0.113536 |
HOUSES | 0.583288 | 1.000000 | 0.264099 | 0.962629 | 0.160348 |
INCOME | 0.722074 | 0.264099 | 1.000000 | 0.172393 | 0.133170 |
DEBT | 0.474658 | 0.962629 | 0.172393 | 1.000000 | 0.177386 |
EDUC | 0.113536 | 0.160348 | 0.133170 | 0.177386 | 1.000000 |
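Before digging into the numbers, here's an alternative rendering (a sketch of my own): the same matrix as an annotated seaborn heatmap.
# Sketch: annotated heatmap of the credit-fearful correlation matrix
sns.heatmap(df_fear[cols].corr(), annot=True, fmt=".2f", cmap="viridis")
plt.title("Credit Fearful: Correlation Matrix");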
There are some pretty important differences here. The relationship between "DEBT" and "HOUSES" is positive for both datasets, but while the coefficient for df is fairly weak at 0.26, the same number for df_fear is 0.96.
The closer a correlation coefficient is to 1.0, the more exactly the two variables move together. In this case, that means the value of the primary residence and the total debt held by the household are getting pretty close to being the same. This suggests that the main source of debt carried by my "TURNFEAR" folks is their primary residence, which, again, is an intuitive finding.
"DEBT"
and "ASSET"
share a similarly striking difference, as do "EDUC"
and "DEBT"
which, while not as extreme a contrast as the other, is still big enough to catch the interest of our hypothetical banker.
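One compact way to see all of these gaps at once (my own sketch) is to subtract one matrix from the other:
# Sketch: element-wise difference between the two correlation matrices;
# large values mark the relationships that differ most between the groups
corr_diff = df_fear[cols].corr() - df[cols].corr()
corr_diff.style.background_gradient(axis=None)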
I'll make some visualizations to show these relationships graphically.
First, I'll start with education levels ("EDUC"), comparing the credit-fearful and non-credit-fearful groups.
Education: Credit Fearful vs Credit Fearless
df_educ = (
df["EDUC"]
)
df_educ.head()
0    12
1    12
2    12
3    12
4    12
Name: EDUC, dtype: int64
df_educ = (
df["EDUC"]
.groupby(df["TURNFEAR"])
)
df_educ
<pandas.core.groupby.generic.SeriesGroupBy object at 0x144d1aa00>
df_educ = (
df["EDUC"]
.groupby(df["TURNFEAR"])
.value_counts(normalize=True)
)
df_educ.head()
TURNFEAR  EDUC
0         12      0.257481
          8       0.192029
          13      0.149823
          9       0.129833
          14      0.096117
Name: EDUC, dtype: float64
df_educ = (
df["EDUC"]
.groupby(df["TURNFEAR"])
.value_counts(normalize=True)
.rename("frequency")
)
df_educ.head()
TURNFEAR  EDUC
0         12      0.257481
          8       0.192029
          13      0.149823
          9       0.129833
          14      0.096117
Name: frequency, dtype: float64
df_educ = (
df["EDUC"]
.groupby(df["TURNFEAR"])
.value_counts(normalize=True)
.rename("frequency")
.to_frame()
)
df_educ.head()
frequency | ||
---|---|---|
TURNFEAR | EDUC | |
0 | 12 | 0.257481 |
8 | 0.192029 | |
13 | 0.149823 | |
9 | 0.129833 | |
14 | 0.096117 |
# Create a DataFrame df_educ that shows the normalized frequency for education categories for both the credit fearful and non-credit fearful households in the dataset
df_educ = (
df["EDUC"]
.groupby(df["TURNFEAR"])
.value_counts(normalize=True)
.rename("frequency")
.to_frame()
.reset_index()
)
df_educ.head()
TURNFEAR | EDUC | frequency | |
---|---|---|---|
0 | 0 | 12 | 0.257481 |
1 | 0 | 8 | 0.192029 |
2 | 0 | 13 | 0.149823 |
3 | 0 | 9 | 0.129833 |
4 | 0 | 14 | 0.096117 |
Education: Side-by-Side Bar Chart
# Create bar chart of `df_educ`
sns.barplot(
x="EDUC", # Make sure that the education categories are in the correct order along the x-axis
y="frequency",
hue="TURNFEAR", # Set hue to "TURNFEAR"
data=df_educ
)
plt.xlabel("Education Level") # Label the x-axis "Education Level"
plt.ylabel("Frequency (%)") # Label the y-axis "Frequency (%)"
plt.title("Educational Attainment: Credit Fearful vs. Non-fearful"); # Title "Educational Attainment: Credit Fearful vs. Non-fearful"
In this plot, I can see that a much higher proportion of credit-fearful respondents have only a high school diploma, while university degrees are more common among the non-credit fearful.
I'll keep going with some scatter plots that look at debt.
Assets vs Debt: Whole Dataset
# Use df to make a scatter plot showing the relationship between DEBT and ASSET
df.plot.scatter(x="DEBT", y="ASSET");
Assets vs Debt: Credit Fearful
# Use df_fear to make a scatter plot showing the relationship between DEBT and ASSET
df_fear.plot.scatter(x="DEBT", y="ASSET");
I can see that the relationship in my df_fear graph is flatter than the one in my df graph; the two are clearly different.
I'll end with the most striking difference from my matrices, and make some scatter plots showing the relationship between HOUSES and DEBT.
Home Value vs Debt: Whole Dataset
# Use df to make a scatter plot showing the relationship between HOUSES and DEBT
df.plot.scatter(x="DEBT", y="HOUSES");
And I'll make the same scatter plot using df_fear.
Home Value vs Debt: Credit Fearful
# Use df_fear to make a scatter plot showing the relationship between HOUSES and DEBT
df_fear.plot.scatter(x="DEBT", y="HOUSES");
The outliers make it a little difficult to see the difference between these two plots, but the relationship is clear enough: my df_fear graph shows an almost perfect linear relationship, while my df graph shows something a little more muddled. I also noticed that the data points on the df_fear graph form several little groups. Those are called "clusters," and I'll be talking more about how to analyze clustered data in the next project.
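As a closing aside, a symmetric-log scale (my own sketch, not part of the original analysis) spreads out those clusters despite the outliers and the zeros in both columns:
# Sketch: symlog axes handle zero values while compressing the extreme outliers
ax = df_fear.plot.scatter(x="DEBT", y="HOUSES", alpha=0.3)
ax.set_xscale("symlog")
ax.set_yscale("symlog")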