Prediction of Post-Collegiate Earnings and Debt


Monica Agrawal, Priya Ganesan, Keith Wyngarden
Stanford University

I. Introduction

Background

The U.S. Department of Education launched College Scorecard in September 2015 as a means of gathering more data on degree-granting institutions, the demographics of college students, and the status of alumni of these institutions [1]. By doing so, the U.S. Department of Education hopes to empower students to make more informed college decisions through a data-driven approach. Considering the soaring cost of higher education as well as the accompanying rise of student debt, prospective students can greatly benefit from such information. However, College Scorecard has faced scrutiny for omitting over 700 colleges, particularly community colleges, from its data set [2]. Hence, applying machine learning to fill in omissions in the data set, particularly those related to earnings and debt, and finding correlations between characteristics of colleges and the future success of their alumni has great value to society.

Despite the relevance of machine learning to this issue, fairly little research has been done in this area. Machine learning has been used in several related topics, such as predicting corporate earnings and predicting income based on census data about individuals [3, 4]. However, no research has been conducted on using college data to predict the earnings and debt of its alumni, potentially because higher-education institutions do not condone a solely numbers-based approach to the college selection process.

Goals

We hope to use a variety of machine learning models to predict the post-college earnings and debt of alumni who were on federal financial aid at various institutions, based on factors that reflect the current status of each institution, such as majors and degrees offered, tuition, and admission rates. Such statistics are easier to obtain than post-college earnings, so our predictions can be used to fill in gaps in the current data set and potentially unearth interesting factors that influence alumni earnings and debt. In addition, alumni earnings can be compared with tuition costs and average student debt to determine the typical interest and length of student loans for a particular school.

Previous Work

As College Scorecard is a newly released data set and is more comprehensive than past college data sets, not much analysis has been done on College Scorecard or even on the topic of predicting post-collegiate earnings and debt. The most relevant past work in this area was conducted in the late 1980s and early 1990s.

Brewer et al. looked at the effect of college quality on future earnings based on individual and family characteristics of high school students entering college, and found that elite private institutions had a higher return on investment in terms of future wages [5]. James et al. attempted to predict future earnings (for male college graduates only) using a mix of individual student information, institutional information, individual college experience variables, and labor market variables [6]. They found a general trend that selective private schools on the East Coast correlated with higher future earnings, but also found that the college experience variables contributed the majority of the variance in the data. Hence, they concluded that each individual's college experience, and what each individual makes of the opportunities at his or her college, is the best indicator of future earnings. Lewis C. Solmon, one of the most widely cited experts in this field, performed a study on what features determine college quality and what impact college quality has on earnings [7]. He used regression analysis to find that variables like college level, average S.A.T. scores, and average faculty salaries drove up alumni earnings the most.

While these papers have made large strides in using machine learning to understand what fuels alumni earnings, and were very careful to avoid bias with respect to minority communities and other similar factors, they also have some shortcomings. All of these studies were based on individual alumni data (personal and family background, individual major, etc.); no one has yet attempted to predict alumni earnings and debt solely from anonymized institutional data. Furthermore, these studies focused on the most elite institutions and did not analyze smaller and lesser-known institutions, which are the organizations that could most benefit from a study like ours.

As we were working with a new dataset, there were a number of data quality issues to resolve. These are largely detailed in the following section, but of particular note are metrics with partially missing data (only some schools had listed values). There is ample research on missing data problems in machine learning; Marlin (2008) gives an overview of the major methods [8]. The most useful family of methods for our dataset is statistical imputation, which is detailed in Rubin (1996) in the context of an overview of multiple imputation [9]. We will return to these papers in the next section.

II. Data and Feature Set Preprocessing

Data

College Scorecard provides a publicly available data set consisting of approximately 2000 metrics for 7805 degree-granting institutions [1]. These metrics include demographic data, test scores, family income data, data about the percentages of students in each major, financial aid information, debt and debt repayment values, earnings of alumni several years after graduation, and more. We chose to focus on the 2011 data set because it was the least sparse data set of the last five years (more future earnings information was available than for more recent years). Our first tasks were to select variables to predict, transform the dataset into pairs of features and prediction variables, and segment the data for evaluation purposes.

Selecting Features and Prediction Values

We chose two values for our prediction variables: the median postgraduate debt and the median postgraduate earnings of alumni 6 years after graduation. We then went through several steps to prune the full feature set to an initial feature list. We first eliminated all features that had non-numerical/categorical values (primarily school name). Additionally, we removed unrelated features that should not be used to make predictions, such as features that provided the number of students in different data collection cohorts. We also removed all features related to debt, earnings, and repayment. All metrics in these categories are highly correlated with the two we chose to predict, so they would be weighted very strongly compared to other features and would hurt the ability of our models to generalize to schools without any of this information available, which is the motivation for this project. Finally, after the preprocessing steps listed above and the non-standard data value processing described below, we removed all features (mostly null indicators and unused categories) that had only one value across all examples, as they offer no predictive power.

Preprocessing Non-Standard Data Values

Some features in the data set were categorical fields; we chose to turn each category into a separate indicator feature. Many values in the data set were listed as "NULL", and a portion of these were meaningful (for example, indicating the absence of a binary feature) rather than indicative of missing data. In order to transform the nulls into usable numeric values while preserving their original meaning, we replaced each null value with 0 and created an extra feature for each feature that contained null values. This new feature used 1s and 0s to indicate whether the value in the original feature was null or non-null. For categorical fields that contained null values, we created just one null indicator feature in addition to the category indicators described previously.
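The null-handling and category-indicator transformation described above can be sketched in a few lines of pandas. This is a minimal illustration, not the authors' code; the two column names are hypothetical stand-ins for Scorecard metrics, and privacy-suppressed entries (discussed next) are assumed to be handled separately.

    import pandas as pd

    # Tiny illustrative frame; real College Scorecard columns are analogous.
    df = pd.DataFrame({
        "ADM_RATE": [0.45, None, 0.80],            # numeric metric with a NULL entry
        "CONTROL":  ["Public", "Private", None],   # categorical field with a NULL entry
    })

    processed = pd.DataFrame(index=df.index)
    for col in df.columns:
        is_null = df[col].isna()
        if df[col].dtype == object:
            # One indicator feature per category (nulls get no category column).
            processed = processed.join(pd.get_dummies(df[col], prefix=col, dtype=float))
        else:
            # Replace NULL with 0 so the column stays numeric.
            processed[col] = df[col].fillna(0.0)
        if is_null.any():
            # Single null-indicator feature per original feature with nulls.
            processed[col + "_isnull"] = is_null.astype(float)

    # Features with a single constant value offer no predictive power.
    processed = processed.loc[:, processed.nunique() > 1]
    print(processed)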
Handling Privacy-Suppressed Values

All values in our dataset that were computed using data from fewer than 30 students were listed as "PrivacySuppressed". Privacy-suppressed values are more common for smaller schools than for larger schools, and many privacy-suppressed values occurred in potentially useful metrics. One approach for handling these values was to simply remove all features with any privacy-suppressed entries. However, discarding hundreds of features in this fashion, especially features with a low percentage of privacy-suppressed values, was undesirable.

In Marlin's overview of approaches to missing data, alternatives to case deletion (the above strategy) include mean imputation (setting missing values to the mean of observed values), regression imputation (learning regression models based on observed values), and the class of multiple imputation solutions (sampling multiple values from a simpler/generalized model over observed values and running analyses on each for later aggregation) [8]. We determined that mean imputation was not appropriate in this case, since many features of schools vary significantly based on school size, degree level, and so on.

We implemented regression imputation by training a linear regression model (with an ordinary least squares cost function) on the fully observed features with respect to each feature with privacy-suppressed values. To avoid training these models with limited data, we imposed a requirement that imputed features must have missing data for less than 30% of schools. We then replaced the missing values with the predictions of the appropriate model. This is a single imputation method (though since the model's cost function is convex, it is very similar to multiple imputation methods with this same choice of model). As noted by Rubin, multiple imputation methods capture variability of the data that is lost with single imputation [9]. Future work might involve using more generalized models for imputation, such as a mixture of Gaussians, and running multiple imputation.
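A minimal sketch of this regression imputation step is shown below, assuming the processed features are held in a NumPy matrix X (schools by features) with privacy-suppressed entries encoded as NaN. The function name is illustrative rather than the authors' implementation; the 30% threshold follows the requirement stated above.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    def impute_by_regression(X, max_missing_frac=0.30):
        """Fill NaN entries column by column with OLS predictions from the
        fully observed columns; drop columns that are missing too often."""
        X = X.copy()
        missing_frac = np.isnan(X).mean(axis=0)
        full_cols = np.flatnonzero(missing_frac == 0)   # fully observed predictors
        keep = list(full_cols)
        for j in np.flatnonzero((missing_frac > 0) & (missing_frac < max_missing_frac)):
            observed = ~np.isnan(X[:, j])
            model = LinearRegression().fit(X[observed][:, full_cols], X[observed, j])
            X[~observed, j] = model.predict(X[~observed][:, full_cols])
            keep.append(j)
        return X[:, sorted(keep)]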

Selecting Training and Testing Examples

We removed all examples (schools) that were missing the values for our two label variables: median postgraduate debt and median postgraduate earnings 6 years after graduation (among the provided options of 6, 8, and 10 years post-graduation, 6 years had the least sparse data). From the remaining examples, we set aside 3500 for training, 1000 for development, and the rest (approximately 800) for testing.

III. Prediction Models and Methodology

Linear Regression

We pose our learning task as a regression problem: given a processed list of features for a school, we would like to predict real values for that school's students' median debt at graduation and median earnings 6 years after graduation. Linear regression is a natural choice of baseline model for regression problems, so we first ran simple linear regression on the full feature set (including imputation of privacy-suppressed features), using our 3500 training examples and 1000 development examples. The performance of this baseline was 12.97% mean absolute percent error (the average of the absolute values of the percent errors made on each school) on the development set for earnings and 20.20% for debt. In addition to tuning the number of privacy-suppressed features to include in the feature set, we saw two avenues for lowering this error: pruning the feature space and enabling our model to learn nonlinear relationships between the features and earnings/debt.

Feature Selection

After data preprocessing and statistical imputation of privacy-suppressed values, 599 features remained. This is a large number of features in comparison to the training set size of 3500 schools, especially as we moved from simple linear regression to more complex models. We therefore explored the use of feature selection to shrink the number of input features.

To select the most important features to keep, we ran sequential forward feature selection on our 3500 training examples, using our median earnings prediction variable and median debt prediction variable in turn to evaluate and select the most relevant features [10]. Features were selected based on their mean squared error, using 10-fold cross-validation, and selection was terminated at the point where the prediction error stabilized. This procedure yielded 170 features for earnings prediction and 165 features for debt prediction, with 70 features in common. The top 5 features yielded after running statistical imputation and feature selection are shown in Table 1.

Table 1: Top features for median earnings and debt.

We also tried using PCA on the school/feature data matrix to transform the data into a smaller set of uncorrelated model inputs. After full optimization under each approach, a model using PCA performed only slightly worse than a model using forward feature selection. However, the use of PCA for feature selection would require collecting data for all features when adding new schools to the dataset, since the principal components need to be recomputed when the data matrix grows. By contrast, after running feature selection on the existing dataset, adding new schools to the dataset requires collecting data only for the selected feature subset. If too many new schools were added, the feature selection results could become outdated. However, since the number of examples in our task is limited by the number of colleges in the United States, and since the initial dataset is fairly comprehensive and the rate of school closures/openings is low compared to the total number of institutions, this is not a major concern.
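For illustration, a forward selection procedure like the one above can be reproduced with scikit-learn's SequentialFeatureSelector. This is a sketch with random placeholder data and a fixed target feature count, whereas the procedure described above stops when the cross-validated error stabilizes.

    import numpy as np
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.linear_model import LinearRegression

    # Random placeholders standing in for the processed schools and their labels.
    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(200, 40))
    y_train = X_train[:, :5] @ rng.normal(size=5) + rng.normal(scale=0.1, size=200)

    # Greedy forward selection scored by 10-fold cross-validated MSE.
    selector = SequentialFeatureSelector(
        LinearRegression(),
        n_features_to_select=10,
        direction="forward",
        scoring="neg_mean_squared_error",
        cv=10,
    )
    selector.fit(X_train, y_train)
    selected_columns = np.flatnonzero(selector.get_support())
    print(selected_columns)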
Locally Weighted Linear Regression

To capture local nonlinearities between the features and debt/earnings, we added local weighting to the cost function of our linear regression model. Using the Euclidean norm, our weight function for a training example x^{(i)} with respect to an input example x was:

    w^{(i)} = \exp\left( -\frac{\lVert x^{(i)} - x \rVert^2}{\tau^2} \right)

To make the Euclidean distance (the norm in the equation above) meaningful, we standardized features to zero mean and unit variance prior to computing the weights. The parameter τ in the weighting function was tuned on the development set for various other model choices (feature selection, inclusion of privacy-suppressed values). Figures 1 and 2 show the results of tuning τ for each output variable and model. We found that the best linear regression model on the development set used local weighting, feature selection, and imputation of privacy-suppressed values.

Figure 1: τ values plotted against percent error for median earnings.
Figure 2: τ values plotted against percent error for median debt.
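Since locally weighted regression solves a separate weighted least-squares problem per query point, a direct NumPy sketch is shown below; it assumes standardized feature matrices and is illustrative rather than the authors' implementation. In practice τ would be tuned on the development set, as described above.

    import numpy as np

    def lwr_predict(X_train, y_train, x_query, tau):
        """Locally weighted least-squares prediction at a single query point.
        Features are assumed standardized to zero mean and unit variance."""
        # Gaussian weights from the query point to every training example.
        w = np.exp(-np.sum((X_train - x_query) ** 2, axis=1) / tau ** 2)
        # Solve the weighted normal equations with an intercept column.
        X = np.hstack([np.ones((X_train.shape[0], 1)), X_train])
        XtW = X.T * w                    # scales column i of X.T by weight w_i
        theta = np.linalg.pinv(XtW @ X) @ (XtW @ y_train)
        return np.concatenate(([1.0], x_query)) @ theta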

KNN Regression

We also used the non-parametric k-nearest-neighbors model to capture nonlinearities in the prediction of debt and earnings, using imputation of privacy-suppressed values and the same data standardization technique used for weighted linear regression. The KNN algorithm predicts debt and earnings as a weighted combination of the debt and earnings of an input's k nearest (defined here by Euclidean distance) neighbors. The weighting schemes tried were uniform weights and weights proportional to inverse distance.

Figure 3 shows the performance of KNN regression on the development set across values of k. Inverse-distance weighting outperformed uniform weighting, giving evidence that schools with similar graduate earnings and debt cluster in our feature space, but KNN with the optimal k had higher error than the best weighted linear regression model.

Figure 3: k values plotted against percent error for median earnings and debt.

Capturing Nonlinearities Among Features

Lastly, we explored models that can automatically capture nonlinear relationships among the variables, in addition to nonlinearities between the variables and the outputs.

First, we used a support vector machine with data standardization and feature selection to make predictions. We used the RBF kernel and L2-regularized L1-loss support vector regression; L2-regularized L2-loss support vector regression yielded similar results. We tuned our regularization term coefficients on the development set and found 0.000003 and 0.00000007 to be the optimal parameters for earnings and debt, respectively.

We also trained simple neural networks with a single hidden layer, using the previous feature selection and imputation of privacy-suppressed values [11]. A single hidden layer was chosen because there was insufficient training data (number of schools) to fit a model with more parameters without significant overfitting. The network is trained using the Levenberg-Marquardt algorithm for minimization with the logistic function as the activation function, and it uses a randomly held-out set from the training set as a validation set, ceasing training when improvement on the held-out set plateaus. The number of nodes in the hidden layer was tuned by examining performance on the development set; results were mostly consistent for networks of up to 10 nodes, after which the network suffered from overfitting. A hidden layer with 4 nodes performed optimally for debt, with 19.36% error, and a hidden layer with 6 nodes performed optimally for earnings, with 11.74% error.
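The three nonlinear models above map naturally onto scikit-learn estimators, as in the sketch below with random placeholder data. Note the substitutions: scikit-learn's MLPRegressor does not offer Levenberg-Marquardt training, so its standard optimizer with early stopping stands in for the procedure described above, and epsilon-SVR with an RBF kernel stands in for the L1-loss formulation; hyperparameters here are placeholders, not the tuned values reported.

    import numpy as np
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.svm import SVR
    from sklearn.neural_network import MLPRegressor
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import make_pipeline

    # Random placeholders standing in for the processed feature matrices and labels.
    rng = np.random.default_rng(0)
    X_train, y_train = rng.normal(size=(300, 20)), rng.uniform(20000, 60000, size=300)
    X_dev, y_dev = rng.normal(size=(80, 20)), rng.uniform(20000, 60000, size=80)

    models = {
        # Inverse-distance-weighted KNN over standardized features.
        "knn": make_pipeline(StandardScaler(),
                             KNeighborsRegressor(n_neighbors=10, weights="distance")),
        # RBF-kernel support vector regression; C would be tuned on the development set.
        "svr": make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0)),
        # Single hidden layer, logistic activation, held-out validation for early stopping.
        "nn": make_pipeline(StandardScaler(),
                            MLPRegressor(hidden_layer_sizes=(6,), activation="logistic",
                                         early_stopping=True, max_iter=2000, random_state=0)),
    }

    for name, model in models.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_dev)
        mape = 100 * np.mean(np.abs((pred - y_dev) / y_dev))
        print(f"{name}: {mape:.2f}% mean absolute percent error on the development set")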

IV. Results

Tables 2 and 3 show the test set performance of the optimized (with respect to the development set) model from each class for earnings and debt. The primary error metrics were mean absolute percent error, which penalizes errors of different sizes and directions equally, and RMSE (root mean squared error), which penalizes larger deviations superlinearly. For the best model under both metrics, weighted linear regression, the R^2 measure between predicted and actual values on the test set was 0.9079 for earnings and 0.9221 for debt. Our absolute error is lower for debt than for earnings, but since the dollar amounts for debt are typically lower than those for earnings, we have a higher percentage error for debt prediction.

Table 2: Error for earnings across all models.
Table 3: Error for debt across all models.

V. Discussion

Overall, much of the variance in earnings and debt information was in fact captured by the static school data provided in College Scorecard. Our incremental model selection process showed that regression imputation of privacy-suppressed values improved overall performance. In addition, local weighting helped adapt linear regression to nonlinear relationships between school characteristics and graduate debt/earnings. The number of training examples is limited by the number of schools, but feature selection helped constrain the complexity of our models in this setting. In addition, test set performance was very similar to development set performance, so optimizing our model parameters on the development set did not lead to excessive overfitting.

Support vector regression did worse than all other models, even after optimization of the regularization parameters. Its test set performance was only marginally worse than its errors on the training and development sets, so overfitting was not an issue. This indicates that learning decision boundaries in our kernelized feature space is not very helpful for the values we want to predict.

Several selected features relate to the socioeconomic backgrounds of the student population. The College Scorecard data set included earnings and debt data subdivided by background, but most of this data was privacy-suppressed. For future work, partnering with the U.S. Department of Education to gain access to this data could help provide more accurate or individualized estimates.

If we examine the predictions made by weighted linear regression for median earnings, approximately 40% of the test set schools had predictions within $1,000 of the true value, and almost 90% of schools had predictions within $5,000 of the true value, meaning that the function did well for the majority of schools. However, ten of the schools had absolute percent errors above 50%; in examining these schools, the majority taught only specialized skills, e.g. cosmetology or massage therapy. Therefore, it seems that the current algorithm has trouble extending to trade schools, for which future debt and earnings may be best characterized by a different set of features.
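For reference, the error metrics reported in Tables 2 and 3 and the R^2 values above can be computed as in the following sketch, given 1-D arrays of actual and predicted test-set values (a generic helper, not the authors' evaluation code).

    import numpy as np
    from sklearn.metrics import mean_squared_error, r2_score

    def report_errors(y_true, y_pred):
        """Mean absolute percent error, RMSE, and R^2 for a set of predictions."""
        mape = 100 * np.mean(np.abs((y_pred - y_true) / y_true))
        rmse = np.sqrt(mean_squared_error(y_true, y_pred))
        return mape, rmse, r2_score(y_true, y_pred)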
