Overview
In this project we acquired graduation rate data for campuses in the California State University (CSU) system from the Integrated Postsecondary Education Data System (IPEDS), maintained by the National Center for Education Statistics. Specifically, we obtained rates for students who graduated within 150% of the normal time for their bachelor's degree, for the years 2009 and 2019.
Below we investigate whether graduation rates among CSU campuses changed by a statistically significant amount between 2009 and 2019.
Preliminaries: load required modules
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import nonparametric_stats as nps
import sklearn.svm as svm
import sklearn.metrics as mt
Build data frames
DATAFILE_2019 = '/home/thugwithyoyo/Documents/NullExitProjects/GradRateData/CSU_GradRates/CSV_2019.csv'
DATAFILE_2009 = '/home/thugwithyoyo/Documents/NullExitProjects/GradRateData/CSU_GradRates/CSV_2009.csv'
GR2019_df = pd.read_csv(DATAFILE_2019, header=0)
GR2009_df = pd.read_csv(DATAFILE_2009, header=0)
Check column headers
GR2019_df.columns
GR2009_df.columns
Inspect dataframe
GR2009_df
GR2019_df.info()
colName_2009 = GR2009_df.columns[4]
colName_2009
colName_2019 = GR2019_df.columns[4]
colName_2019
grRatios = GR2019_df[colName_2019] / GR2009_df[colName_2009]
grRatios = grRatios[~np.isnan(grRatios)]
fullInstNames = GR2019_df["institution name"]
fullInstNames
d = np.array(["B", "Stan", "SB", "C", "DH", "Fres", "Full", "EB",
              "LB", "LA", "N", "Sac", "Mar", "SM", "MB", "CI"])
codedInstNames = pd.Series(name="coded names", data=d)
codedInstNames
GR2009_df['coded names'] = codedInstNames.values
GR2009_df
GR2019_df['coded names'] = codedInstNames.values
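The ratio computed above divides the two columns row by row, and the coded names are also assigned by position, so both steps are only valid if the two files list campuses in the same order. As a hedged illustration of a pairing that does not depend on row order, the sketch below merges on the "institution name" column. The toy frames, the campus names, and the `rate` column are hypothetical placeholders, not the IPEDS data.

```python
import pandas as pd

# Hypothetical toy frames standing in for GR2009_df and GR2019_df;
# the values are placeholders, not real graduation rates.
old = pd.DataFrame({"institution name": ["Campus A", "Campus B", "Campus C"],
                    "rate": [40.0, 50.0, 60.0]})
new = pd.DataFrame({"institution name": ["Campus C", "Campus A", "Campus B"],
                    "rate": [66.0, 50.0, 55.0]})

# Pair rows by campus name rather than by position; validate raises an
# error if a name is duplicated on either side.
paired = old.merge(new, on="institution name",
                   suffixes=("_2009", "_2019"), validate="one_to_one")

# A ratio above 1 means the later rate is higher than the earlier one.
paired["ratio"] = paired["rate_2019"] / paired["rate_2009"]
print(paired)
```

With the default inner join, a campus missing from either file is dropped from `paired`, so comparing `len(paired)` with the lengths of the inputs is a quick check that every campus was matched.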
Plot recent and past rates on identity scatter
xLims = np.array([20, 75])
yLims = xLims
fig, ax = plt.subplots(nrows=1, ncols=1)
fig.set_size_inches(9,9)
ax.plot([xLims[0], xLims[1]], [yLims[0], yLims[1]], '--',
        color='gray', alpha=0.5)
ax.scatter(GR2009_df[colName_2009], GR2019_df[colName_2019], marker='.')
for i, txt in enumerate(GR2009_df["coded names"]):
    ax.annotate(txt,
                (GR2009_df[colName_2009][i], GR2019_df[colName_2019][i]),
                size=15
                )
ax.set_aspect(1)
ax.set_xlabel('2009 Graduation rate (%)',size=17)
ax.set_xlim(xLims)
ax.set_ylabel('2019 Graduation rate (%)',size=17)
ax.set_ylim(yLims)
ax.set_title("CSU Graduation Rates by Campus over 10-year span (2009, 2019)\nBachelor\'s Degree within 150% of normal time\n", size=20)
Generate histogram and Q-Q plot of rate ratios
fig2, ax2 = plt.subplots(nrows=1, ncols=2)
fig2.set_size_inches(11,5)
#ax2[0].hist(grRatios)
nps.histPlotter(10, grRatios, axes=ax2[0])
ax2[0].set_xlabel('rate ratio', size=17)
ax2[0].set_ylabel('count', size=17)
ax2[0].set_title('Distrib. of Grad. Rate Ratios', size=20)
nps.qqPlotter_normal(grRatios, 10, axes=ax2[1])
ax2[1].set_xlabel('data quantiles', size=17)
ax2[1].set_ylabel('theoretical normal quantiles', size=17)
ax2[1].set_title('Q-Q comparison plot', size=20)
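The helpers `nps.histPlotter` and `nps.qqPlotter_normal` come from a project-local module that is not shown here, so their internals are unknown. As a rough stand-in for readers without that module, the sketch below computes the coordinates of a normal Q-Q plot using only NumPy and the standard library. The function name `normal_qq_points`, the choice of plotting positions, and the demo values are assumptions for illustration, not the author's implementation.

```python
import numpy as np
from statistics import NormalDist

def normal_qq_points(sample):
    """Return (theoretical, observed) quantile pairs for a normal Q-Q plot."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = x.size
    # Plotting positions (i - 0.5) / n keep probabilities strictly inside (0, 1).
    probs = (np.arange(1, n + 1) - 0.5) / n
    std_normal = NormalDist()  # mean 0, standard deviation 1
    z = np.array([std_normal.inv_cdf(p) for p in probs])
    # Rescale to the sample's mean and standard deviation so that points
    # from normally distributed data fall near the identity line.
    theoretical = x.mean() + x.std(ddof=1) * z
    return theoretical, x

# Placeholder values for illustration only, not the IPEDS ratios.
demo = [1.02, 1.10, 1.15, 1.18, 1.21, 1.25, 1.31, 1.48]
theo, obs = normal_qq_points(demo)
```

Plotting `obs` against `theo` on any Matplotlib axes gives the Q-Q comparison; systematic curvature away from the identity line suggests a departure from normality.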
From the plots above, we are not convinced that the graduation rate ratios are normally distributed. To assess the statistical significance of the observed increase in the 2019 rates, a better approach may be a non-parametric bootstrap, using either a studentized hypothesis test or bootstrap confidence limits.
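One way the bootstrap idea mentioned above could look is a studentized (bootstrap-t) confidence interval for the mean rate ratio. The sketch below is a minimal version under stated assumptions: the function name `bootstrap_t_ci`, the number of resamples, the seed, and the demo values are all hypothetical, and the demo values are placeholders rather than the IPEDS ratios.

```python
import numpy as np

def bootstrap_t_ci(sample, n_boot=5000, alpha=0.05, seed=0):
    """Studentized (bootstrap-t) confidence interval for the mean."""
    x = np.asarray(sample, dtype=float)
    n = x.size
    mean = x.mean()
    se = x.std(ddof=1) / np.sqrt(n)

    rng = np.random.default_rng(seed)
    # Each row is one resample of size n drawn with replacement.
    resamples = x[rng.integers(0, n, size=(n_boot, n))]
    boot_means = resamples.mean(axis=1)
    boot_ses = resamples.std(axis=1, ddof=1) / np.sqrt(n)

    # Drop degenerate resamples whose standard error is zero.
    ok = boot_ses > 0
    t_star = (boot_means[ok] - mean) / boot_ses[ok]

    # The upper t* quantile sets the lower limit, and vice versa.
    t_lo, t_hi = np.quantile(t_star, [alpha / 2, 1 - alpha / 2])
    return mean - t_hi * se, mean - t_lo * se

# Placeholder ratios for illustration only, not the IPEDS values.
demo_ratios = [1.02, 1.10, 1.15, 1.18, 1.21, 1.25, 1.31, 1.48]
lower, upper = bootstrap_t_ci(demo_ratios)
print(f"95% bootstrap-t CI for the mean ratio: ({lower:.3f}, {upper:.3f})")
```

Applied to this notebook, the call would be `bootstrap_t_ci(grRatios)`. An interval that excludes 1 would suggest, at the chosen level, that the mean ratio differs from 1. With only 16 campuses the studentized statistic can be unstable, so the result is best read alongside the plots rather than on its own.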