Prediction of Post-Collegiate Earnings and Debt


Monica Agrawal, Priya Ganesan, Keith Wyngarden
Stanford University

I. Introduction

Background

The U.S. Department of Education launched College Scorecard in September 2015 as a means of gathering more data on degree-granting institutions, the demographics of college students, and the status of alumni of these institutions [1]. By doing so, the U.S. Department of Education hopes to empower students to make more informed college decisions through a data-driven approach. Considering the soaring cost of higher education as well as the accompanying rise of student debt, prospective students can greatly benefit from such information. However, College Scorecard has faced scrutiny for omitting over 700 colleges, particularly community colleges, from its data set [2]. Hence, applying machine learning to fill in omissions in the data set, particularly those related to earnings and debt, and finding correlations between characteristics of colleges and the future success of their alumni has great value to society.

Despite the relevance of machine learning to this issue, fairly little research has been done in this area. Machine learning has been used in several related topics, such as predicting corporate earnings and predicting income based on census data about individuals [3, 4]. However, no research has been conducted on using college data to predict the earnings and debt of its alumni, potentially because higher-education institutions do not condone a solely numbers-based approach to the college selection process.

Goals

We hope to use a variety of machine learning models to predict the post-college earnings and debt of alumni who were on federal financial aid at various institutions, based on factors that reflect the current status of each institution, such as majors and degrees offered, tuition, and admission rates. Such statistics are easier to obtain than post-college earnings, so our predictions can be used to fill in gaps in the current data set and potentially unearth interesting factors that influence alumni earnings and debt. In addition, alumni earnings can be compared with tuition costs and average student debt to determine the typical interest and length of student loans for a particular school.

Previous Work

As College Scorecard is a newly released data set and is more comprehensive than past college data sets, not much analysis has been done on College Scorecard or even on the topic of predicting post-collegiate earnings and debt. The most relevant past work in this area was conducted in the late 1980s and early 1990s.

Brewer et al. looked at the effect of college quality on future earnings based on individual and family characteristics of high school students entering college, and found that elite private institutions had a higher return on investment in terms of future wages [5]. James et al. attempted to predict future earnings (for male college graduates only) using a mix of individual student information, institutional information, individual college experience variables, and labor market variables [6]. They found a general trend that selective private schools on the East Coast correlated with higher future earnings, but also found that the college experience variables contributed the majority of the variance in the data. Hence, they concluded that each individual's college experience, and what each individual makes of the opportunities at his or her college, is the best indicator of future earnings. Lewis C. Solmon, one of the most widely cited experts in this field, performed a study on what features determine college quality and what impact college quality has on earnings [7]. He used regression analysis to find that variables like college level, average S.A.T. scores, and average faculty salaries drove up alumni earnings the most.

While these papers have made large strides in using machine learning to understand what fuels alumni earnings, and were very careful to avoid bias with respect to minority communities and other similar factors, they also have some shortcomings. All of these studies were based on individual alumni data (personal and family background, individual major, etc.); no one has yet attempted to predict alumni earnings and debt solely from anonymized institutional data. Furthermore, these studies focused on the most elite institutions and did not analyze smaller and lesser-known institutions, which are the organizations that could most benefit from a study like ours.

As we were working with a new dataset, there were a number of data quality issues to resolve. These are largely detailed in the following section, but of particular note are metrics with partially missing data (only some schools had listed values). There is ample research on missing data problems in machine learning; Marlin (2008) gives an overview of the major methods [8]. The most useful family of methods for our dataset is statistical imputation, which is detailed in Rubin (1996) in the context of an overview of multiple imputation [9]. We will return to these papers in the next section.

II. Data and Feature Set Preprocessing

Data

College Scorecard provides a publicly available data set consisting of approximately 2000 metrics for 7805 degree-granting institutions [1]. These metrics include demographic data, test scores, family income data, data about the percentages of students in each major, financial aid information, debt and debt repayment values, earnings of alumni several years after graduation, and more. We chose to focus on the 2011 data set because it was the least sparse data set of the last five years (more future earnings information was available than for more recent years). Our first tasks were to select variables to predict, transform the dataset into pairs of features and prediction variables, and segment the data for evaluation purposes.

Selecting Features and Prediction Values

We chose two values for our prediction variables: the median postgraduate debt and the median postgraduate earnings of alumni 6 years after graduation. We then went through several steps to prune the full feature set to an initial feature list. We first eliminated all features that had non-numerical/categorical values (primarily school name). Additionally, we removed unrelated features that should not be used to make predictions, such as features that provided the number of students in different data collection cohorts. We also removed all features related to debt, earnings, and repayment. All metrics in these categories are highly correlated with the two we chose to predict, so they would be weighted very strongly compared to other features and would hurt the ability of our models to generalize to schools without any of this information available, which is the motivation for this project. Finally, after the preprocessing steps listed above and the non-standard data value processing described below, we removed all features (mostly null indicators and unused categories) that had only one value across all examples, as they offer no predictive power.

Preprocessing Non-Standard Data Values

Some features in the data set were categorical fields; we chose to turn each category into a separate indicator feature. Many values in the data set were listed as "NULL", and a portion of these were meaningful (for example, indicating the absence of a binary feature) rather than indicative of missing data. In order to transform the nulls into usable numeric values while preserving their original meaning, we replaced each null value with 0 and created an extra feature for each feature that contained null values. This new feature used 1s and 0s to indicate whether the value in the original feature was null or non-null. For categorical fields that contained null values, we created just one null indicator feature in addition to the category indicators described previously.
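The null-handling and category-indicator transformation described above can be sketched in a few lines of pandas. This is a minimal illustration, not the authors' code; the two column names are hypothetical stand-ins for Scorecard metrics, and privacy-suppressed entries (discussed next) are assumed to be handled separately.

    import pandas as pd

    # Tiny illustrative frame; real College Scorecard columns are analogous.
    df = pd.DataFrame({
        "ADM_RATE": [0.45, None, 0.80],            # numeric metric with a NULL entry
        "CONTROL":  ["Public", "Private", None],   # categorical field with a NULL entry
    })

    processed = pd.DataFrame(index=df.index)
    for col in df.columns:
        is_null = df[col].isna()
        if df[col].dtype == object:
            # One indicator feature per category (nulls get no category column).
            processed = processed.join(pd.get_dummies(df[col], prefix=col, dtype=float))
        else:
            # Replace NULL with 0 so the column stays numeric.
            processed[col] = df[col].fillna(0.0)
        if is_null.any():
            # Single null-indicator feature per original feature with nulls.
            processed[col + "_isnull"] = is_null.astype(float)

    # Features with a single constant value offer no predictive power.
    processed = processed.loc[:, processed.nunique() > 1]
    print(processed)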
Handling Privacy-Suppressed Values

All values in our dataset that were computed using data from fewer than 30 students were listed as "PrivacySuppressed". Privacy-suppressed values are more common for smaller schools than for larger schools, and many privacy-suppressed values occurred in potentially useful metrics. One approach for handling these values was to simply remove all features with any privacy-suppressed entries. However, discarding hundreds of features in this fashion, especially features with a low percentage of privacy-suppressed values, was undesirable.

In Marlin's overview of approaches to missing data, alternatives to case deletion (the above strategy) include mean imputation (setting missing values to the mean of observed values), regression imputation (learning regression models based on observed values), and the class of multiple imputation solutions (sampling multiple values from a simpler/generalized model over observed values and running analyses on each for later aggregation) [8]. We determined that mean imputation was not appropriate in this case, since many features of schools vary significantly based on school size, degree level, and so on.

We implemented regression imputation by training a linear regression model (with an ordinary least squares cost function) on the fully observed features with respect to each feature with privacy-suppressed values. To avoid training these models with limited data, we imposed a requirement that imputed features must have missing data for less than 30% of schools. We then replaced the missing values with the predictions of the appropriate model. This is a single imputation method (though since the model's cost function is convex, it is very similar to multiple imputation methods with this same choice of model). As noted by Rubin, multiple imputation methods capture variability of the data that is lost with single imputation [9]. Future work might involve using more generalized models for imputation, such as a mixture of Gaussians, and running multiple imputation.
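A minimal sketch of this regression imputation step is shown below, assuming the processed features are held in a NumPy matrix X (schools by features) with privacy-suppressed entries encoded as NaN. The function name is illustrative rather than the authors' implementation; the 30% threshold follows the requirement stated above.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    def impute_by_regression(X, max_missing_frac=0.30):
        """Fill NaN entries column by column with OLS predictions from the
        fully observed columns; drop columns that are missing too often."""
        X = X.copy()
        missing_frac = np.isnan(X).mean(axis=0)
        full_cols = np.flatnonzero(missing_frac == 0)   # fully observed predictors
        keep = list(full_cols)
        for j in np.flatnonzero((missing_frac > 0) & (missing_frac < max_missing_frac)):
            observed = ~np.isnan(X[:, j])
            model = LinearRegression().fit(X[observed][:, full_cols], X[observed, j])
            X[~observed, j] = model.predict(X[~observed][:, full_cols])
            keep.append(j)
        return X[:, sorted(keep)]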

Selecting Training and Testing Examples

We removed all examples (schools) that were missing the values for our two label variables: median postgraduate debt and median postgraduate earnings 6 years after graduation (among the provided options of 6, 8, and 10 years post-graduation, 6 years had the least sparse data). From the remaining examples, we set aside 3500 for training, 1000 for development, and the rest (approximately 800) for testing.

III. Prediction Models and Methodology

Linear Regression

We pose our learning task as a regression problem: given a processed list of features for a school, we would like to predict real values for that school's students' median debt at graduation and median earnings 6 years after graduation. Linear regression is a natural choice of baseline model for regression problems, so we first ran simple linear regression on the full feature set (including imputation of privacy-suppressed features), using our 3500 training examples and 1000 development examples. The performance of this baseline was 12.97% mean absolute percent error (the average of the absolute values of the percent errors made on each school) on the development set for earnings and 20.20% for debt. In addition to tuning the number of privacy-suppressed features to include in the feature set, we saw two avenues for lowering this error: pruning the feature space and enabling our model to learn nonlinear relationships between the features and earnings/debt.

Feature Selection

After data preprocessing and statistical imputation of privacy-suppressed values, 599 features remained. This is a large number of features in comparison to the training set size of 3500 schools, especially as we moved from simple linear regression to more complex models. We therefore explored the use of feature selection to shrink the number of input features.

To select the most important features to keep, we ran sequential forward feature selection on our 3500 training examples, using our median earnings prediction variable and median debt prediction variable in turn to evaluate and select the most relevant features [10]. Features were selected based on their mean squared error, using 10-fold cross-validation, and selection was terminated at the point where the prediction error stabilized. This procedure yielded 170 features for earnings prediction and 165 features for debt prediction, with 70 features in common. The top 5 features yielded after running statistical imputation and feature selection are shown in Table 1.

Table 1: Top features for median earnings and debt.

We also tried using PCA on the school/feature data matrix to transform the data into a smaller set of uncorrelated model inputs. After full optimization under each approach, a model using PCA performed only slightly worse than a model using forward feature selection. However, the use of PCA for feature selection would require collecting data for all features when adding new schools to the dataset, since the principal components need to be recomputed when the data matrix grows. By contrast, after running feature selection on the existing dataset, adding new schools to the dataset requires collecting data only for the selected feature subset. If too many new schools were added, the feature selection results could become outdated. However, since the number of examples in our task is limited by the number of colleges in the United States, and since the initial dataset is fairly comprehensive and the rate of school closures/openings is low compared to the total number of institutions, this is not a major concern.
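For illustration, a forward selection procedure like the one above can be reproduced with scikit-learn's SequentialFeatureSelector. This is a sketch with random placeholder data and a fixed target feature count, whereas the procedure described above stops when the cross-validated error stabilizes.

    import numpy as np
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.linear_model import LinearRegression

    # Random placeholders standing in for the processed schools and their labels.
    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(200, 40))
    y_train = X_train[:, :5] @ rng.normal(size=5) + rng.normal(scale=0.1, size=200)

    # Greedy forward selection scored by 10-fold cross-validated MSE.
    selector = SequentialFeatureSelector(
        LinearRegression(),
        n_features_to_select=10,
        direction="forward",
        scoring="neg_mean_squared_error",
        cv=10,
    )
    selector.fit(X_train, y_train)
    selected_columns = np.flatnonzero(selector.get_support())
    print(selected_columns)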
Locally Weighted Linear Regression

To capture local nonlinearities between the features and debt/earnings, we added local weighting to the cost function of our linear regression model. Using the Euclidean norm, our weight function for a training example x^{(i)} with respect to an input example x was:

    w^{(i)} = \exp\left( -\frac{\lVert x^{(i)} - x \rVert^2}{\tau^2} \right)

To make the Euclidean distance (the norm in the equation above) meaningful, we standardized features to zero mean and unit variance prior to computing the weights. The parameter τ in the weighting function was tuned on the development set for various other model choices (feature selection, inclusion of privacy-suppressed values). Figures 1 and 2 show the results of tuning τ for each output variable and model. We found that the best linear regression model on the development set used local weighting, feature selection, and imputation of privacy-suppressed values.

Figure 1: τ values plotted against percent error for median earnings.
Figure 2: τ values plotted against percent error for median debt.
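Since locally weighted regression solves a separate weighted least-squares problem per query point, a direct NumPy sketch is shown below; it assumes standardized feature matrices and is illustrative rather than the authors' implementation. In practice τ would be tuned on the development set, as described above.

    import numpy as np

    def lwr_predict(X_train, y_train, x_query, tau):
        """Locally weighted least-squares prediction at a single query point.
        Features are assumed standardized to zero mean and unit variance."""
        # Gaussian weights from the query point to every training example.
        w = np.exp(-np.sum((X_train - x_query) ** 2, axis=1) / tau ** 2)
        # Solve the weighted normal equations with an intercept column.
        X = np.hstack([np.ones((X_train.shape[0], 1)), X_train])
        XtW = X.T * w                    # scales column i of X.T by weight w_i
        theta = np.linalg.pinv(XtW @ X) @ (XtW @ y_train)
        return np.concatenate(([1.0], x_query)) @ theta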

KNN Regression

We also used the non-parametric k-nearest-neighbors model to capture nonlinearities in the prediction of debt and earnings, using imputation of privacy-suppressed values and the same data standardization technique used for weighted linear regression. The KNN algorithm predicts debt and earnings as a weighted combination of the debt and earnings of an input's k nearest (defined here by Euclidean distance) neighbors. The weighting schemes tried were uniform weights and weights proportional to inverse distance.

Figure 3 shows the performance of KNN regression on the development set across values of k. Inverse-distance weighting outperformed uniform weighting, giving evidence that schools with similar graduate earnings and debt cluster in our feature space, but KNN with the optimal k had higher error than the best weighted linear regression model.

Figure 3: k values plotted against percent error for median earnings and debt.

Capturing Nonlinearities Among Features

Lastly, we explored models that can automatically capture nonlinear relationships among the variables, in addition to nonlinearities between the variables and the outputs.

First, we used a support vector machine with data standardization and feature selection to make predictions. We used the RBF kernel and L2-regularized L1-loss support vector regression; L2-regularized L2-loss support vector regression yielded similar results. We tuned our regularization term coefficients on the development set and found 0.000003 and 0.00000007 to be the optimal parameters for earnings and debt, respectively.

We also trained simple neural networks with a single hidden layer, using the previous feature selection and imputation of privacy-suppressed values [11]. A single hidden layer was chosen because there was insufficient training data (number of schools) to fit a model with more parameters without significant overfitting. The network is trained using the Levenberg-Marquardt algorithm for minimization with the logistic function as the activation function, and it uses a randomly held-out set from the training set as a validation set, ceasing training when improvement on the held-out set plateaus. The number of nodes in the hidden layer was tuned by examining performance on the development set; results were mostly consistent for networks of up to 10 nodes, after which the network suffered from overfitting. A hidden layer with 4 nodes performed optimally for debt, with 19.36% error, and a hidden layer with 6 nodes performed optimally for earnings, with 11.74% error.
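The three nonlinear models above map naturally onto scikit-learn estimators, as in the sketch below with random placeholder data. Note the substitutions: scikit-learn's MLPRegressor does not offer Levenberg-Marquardt training, so its standard optimizer with early stopping stands in for the procedure described above, and epsilon-SVR with an RBF kernel stands in for the L1-loss formulation; hyperparameters here are placeholders, not the tuned values reported.

    import numpy as np
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.svm import SVR
    from sklearn.neural_network import MLPRegressor
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import make_pipeline

    # Random placeholders standing in for the processed feature matrices and labels.
    rng = np.random.default_rng(0)
    X_train, y_train = rng.normal(size=(300, 20)), rng.uniform(20000, 60000, size=300)
    X_dev, y_dev = rng.normal(size=(80, 20)), rng.uniform(20000, 60000, size=80)

    models = {
        # Inverse-distance-weighted KNN over standardized features.
        "knn": make_pipeline(StandardScaler(),
                             KNeighborsRegressor(n_neighbors=10, weights="distance")),
        # RBF-kernel support vector regression; C would be tuned on the development set.
        "svr": make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0)),
        # Single hidden layer, logistic activation, held-out validation for early stopping.
        "nn": make_pipeline(StandardScaler(),
                            MLPRegressor(hidden_layer_sizes=(6,), activation="logistic",
                                         early_stopping=True, max_iter=2000, random_state=0)),
    }

    for name, model in models.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_dev)
        mape = 100 * np.mean(np.abs((pred - y_dev) / y_dev))
        print(f"{name}: {mape:.2f}% mean absolute percent error on the development set")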

IV. Results

Tables 2 and 3 show the test set performance of the optimized (with respect to the development set) model from each class for earnings and debt. The primary error metrics were mean absolute percent error, which penalizes errors of different sizes and directions equally, and RMSE (root mean squared error), which penalizes larger deviations superlinearly. For the best model under both metrics, weighted linear regression, the R^2 measure between predicted and actual values on the test set was 0.9079 for earnings and 0.9221 for debt. Our absolute error is lower for debt than for earnings, but since the dollar amounts for debt are typically lower than those for earnings, we have a higher percentage error for debt prediction.

Table 2: Error for earnings across all models.
Table 3: Error for debt across all models.

V. Discussion

Overall, much of the variance in earnings and debt information was in fact captured by the static school data provided in College Scorecard. Our incremental model selection process showed that regression imputation of privacy-suppressed values improved overall performance. In addition, local weighting helped adapt linear regression to nonlinear relationships between school characteristics and graduate debt/earnings. The number of training examples is limited by the number of schools, but feature selection helped constrain the complexity of our models in this setting. In addition, test set performance was very similar to development set performance, so optimizing our model parameters on the development set did not lead to excessive overfitting.

Support vector regression did worse than all other models, even after optimization of the regularization parameters. Its test set performance was only marginally worse than its errors on the training and development sets, so overfitting was not an issue. This indicates that learning decision boundaries in our kernelized feature space is not very helpful for the values we want to predict.

Several selected features relate to the socioeconomic backgrounds of the student population. The College Scorecard data set included earnings and debt data subdivided by background, but most of this data was privacy-suppressed. For future work, partnering with the U.S. Department of Education to gain access to this data could help provide more accurate or individualized estimates.

If we examine the predictions made by weighted linear regression for median earnings, approximately 40% of the test set schools had predictions within $1,000 of the true value, and almost 90% of schools had predictions within $5,000 of the true value, meaning that the function did well for the majority of schools. However, ten of the schools had absolute percent errors above 50%; in examining these schools, the majority taught only specialized skills, e.g. cosmetology or massage therapy. Therefore, it seems that the current algorithm has trouble extending to trade schools, for which future debt and earnings may be best characterized by a different set of features.
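For reference, the error metrics reported in Tables 2 and 3 and the R^2 values above can be computed as in the following sketch, given 1-D arrays of actual and predicted test-set values (a generic helper, not the authors' evaluation code).

    import numpy as np
    from sklearn.metrics import mean_squared_error, r2_score

    def report_errors(y_true, y_pred):
        """Mean absolute percent error, RMSE, and R^2 for a set of predictions."""
        mape = 100 * np.mean(np.abs((y_pred - y_true) / y_true))
        rmse = np.sqrt(mean_squared_error(y_true, y_pred))
        return mape, rmse, r2_score(y_true, y_pred)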
