Stock Market Price Prediction Using Linear and Polynomial Regression Models


Lucas Nunno
University of New Mexico
Computer Science Department
Albuquerque, New Mexico, United States
lnunno@cs.unm.edu

Abstract—The following paper describes the work that was done on investigating applications of regression techniques to stock market price prediction. The report describes the linear and polynomial regression methods that were applied, along with the accuracies obtained using these methods. It was found that support vector regression was the most effective of the models used, although there are opportunities to expand this research further using additional techniques and parameter tuning.

Keywords—stock market; regression; machine learning

I. INTRODUCTION

The stock market is known to be a complex adaptive system that is difficult to predict due to the large number of factors that determine the day-to-day price changes. We approach this in machine learning through regression, which tries to determine the relationship between a dependent variable and one or more independent variables. Here, the independent variables are the features, and the dependent variable that we would like to predict is the price. It is apparent that the features we are using are not truly independent: the volume and shares outstanding are not independent, nor are the closing price and the return on investment. However, this is an assumption that we make to simplify the model in order to use the chosen regression models.

This study aims to use linear and polynomial regression models to predict price changes, evaluating each model's success by withholding data during training and measuring the accuracy of its predictions against the known data.

This research concerns closing prices of stocks; therefore, day trading was not modeled. The model of the stock market was only concerned with the closing price of stocks at the end of a business day. High-frequency trading is an area of active research, but this study preferred a simplified model of the stock market.

II. MOTIVATION

Stock market price prediction is a problem that has the potential to be worth billions of dollars and is actively researched by the largest financial corporations in the world. It is a significant problem because it has no clear solution, although attempts can be made at approximation using many different machine learning techniques. The project exercises techniques for real-world machine learning applications, including acquiring and analyzing a large data set and using a variety of techniques to train the program and predict potential outcomes.

III. RELATED WORK

A variety of methods have been used to predict stock prices using machine learning. Some of the more interesting areas of research include using a type of reinforcement learning called Q-learning [5] and using the US's export/import growth, earnings for consumers, and other industry data to build a decision tree that determines whether a stock's price will rise or fall [3].

The Q-learning approach has been shown to be effective, but it is unclear how computationally intensive the algorithm is due to the large number of state alphas that must be generated. The decision tree approach may be particularly useful when analyzing a specific industry's growth. There has also been research on how top-performing stocks are defined and selected [7], as well as analysis of what can go wrong when modeling the stock market with machine learning [4].

IV. METHODS

A. Data Representation

The dataset that was used was collected from the CRSP US Stock Database [2] as a collection of comma-separated values, where each row consisted of a stock on a specific day along with the volume, shares outstanding, closing price, and other features for that day.

The Python scientific computing library numpy was used along with the data analysis library pandas to convert these CSV files into pandas DataFrames indexed by date. Each specific stock is a view of the master DataFrame, filtered by that stock's ticker. This allowed efficient access to stocks of interest and convenient access to date ranges.

These stock DataFrame views are then used as the data to be fed into our regression black boxes.
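The data pipeline described above might be sketched as follows. The column names (TICKER, DATE, PRC, VOL, SHROUT), the tiny inline rows, and the stock_view helper are illustrative stand-ins, not the actual CRSP field names or the study's implementation:

```python
import pandas as pd

# A tiny synthetic stand-in for a CRSP daily file; the real data would be
# loaded with pd.read_csv(..., parse_dates=["DATE"]). Column names here are
# illustrative, not the actual CRSP field names.
rows = [
    ("AAPL", "2006-11-16", 85.61, 1_000_000, 860_000),
    ("AAPL", "2006-11-17", 85.85, 1_100_000, 860_000),
    ("MSFT", "2006-11-16", 29.36, 2_000_000, 9_800_000),
]
master = pd.DataFrame(rows, columns=["TICKER", "DATE", "PRC", "VOL", "SHROUT"])
master["DATE"] = pd.to_datetime(master["DATE"])
master = master.set_index("DATE").sort_index()  # master frame indexed by date

def stock_view(df, ticker):
    """A per-ticker view of the master DataFrame, still indexed by date."""
    return df[df["TICKER"] == ticker]

aapl = stock_view(master, "AAPL")
window = aapl.loc["2006-11-16":"2006-11-17"]  # convenient date-range access
print(len(window))
```

Because every stock is just a filtered view of one date-indexed frame, date-range slicing with `.loc` comes for free.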

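As a concrete sketch of how one of these per-stock views becomes a regression-ready input, the following converts the datetime64 index to integer ordinals and standardizes each feature column; the column names and values are illustrative, and this mirrors (but is not) the study's actual preprocessing:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import scale

# Synthetic stand-in for a single stock's DataFrame view (business days).
idx = pd.date_range("2006-11-16", periods=5, freq="B")
stock = pd.DataFrame({"PRC": [85.6, 85.9, 86.5, 87.0, 86.2],
                      "VOL": [1.0e6, 1.1e6, 0.9e6, 1.2e6, 1.0e6]}, index=idx)

# datetime64 index -> vanilla datetime -> integer ordinal.
ordinals = np.array([ts.to_pydatetime().toordinal() for ts in stock.index],
                    dtype=float)

# Feature matrix: date ordinal plus volume; target: closing price.
X = np.column_stack([ordinals, stock["VOL"].to_numpy()])
X = scale(X)                    # standardize each column (zero mean, unit var)
y = stock["PRC"].to_numpy()

print(X.mean(axis=0).round(6))  # each column is centered after scaling
```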
B. Prediction through Regression

The regression process is done through the scikit-learn [1] machine learning library. This is the core of the price prediction functionality. There are some additional steps that must be done so that the data fed into the regression algorithms returns plausible results. In particular, every training dataset must be normalized to a normally distributed (or normal-looking) range between -1 and 1 before the input matrix is fit to the chosen regression model.

Figure 1: Data-flow of the program showing how stock data turns into prediction value vectors.

1) Data Normalization: There are a couple of important details to note about the way the data must be preprocessed in order to be fit into regression models. Firstly, dates are normally represented as strings of the format "YYYY-MM-DD" when it comes to database storage. This format must be converted to a single integer in order to be used as a column in the feature matrix. This is done by using the date's ordinal value. In Python, this is quite simple: the columns in the DataFrame are stored as numpy datetime64 objects, which must be converted to vanilla Python datetime objects, which are in turn converted to integers using the toordinal() built-in method of datetime objects.

Each column in the feature matrix is then scaled using scikit-learn's scale() function from the preprocessing module. Note that this is a very important step: prior to this, the polynomial regression methods would return questionable results, since it is documented that scikit-learn's non-linear regression models assume normally distributed data as the input for feature matrices.

2) Types of Regression Models: The price prediction function provides a few regression models that can be chosen to perform the prediction. These include:

1) Linear Regression
2) Stochastic Gradient Descent (SGD)
3) Support Vector Regression (SVR)

Please note that while all of these regression models were evaluated, not all had promising results. The results section details the difficulties faced with each regression model and the attempted solutions. The linear regression method initially seemed to be working well, but there were some difficulties using the polynomial regression methods, since the predictions being returned did not look like a non-linear fitting.

C. Stock and Date Selection

The stocks used in this study are a subset of the stocks publicly traded on the US market, known as the S&P 500, which is an index of the 500 largest companies traded on the NYSE or the NASDAQ. The CRSP dataset that was provided contains on the order of 5000 companies, so this data is filtered as it is loaded into the master DataFrame to avoid excessive memory usage.

D. Regression Model Evaluation

There are a number of scoring methods for regression implemented in scikit-learn, such as explained variance score and mean squared error. While these were investigated, it seemed more beneficial for this problem domain to implement mean absolute percentage error and use it to compare stocks of significantly different prices. Mean absolute error was used in the context of inspecting a single stock, where the price difference was bound not to vary as much as when comparing disparate companies.

Figure 2: An illustration of the random sampling of both stocks and dates done by the software in order to obtain error metrics. These metrics are then compared against other regression models' results to evaluate their performance.

V. RESULTS

Note that the following diagrams consistently use Apple's (AAPL) stock prices from 2006-11-16 to 2007-03-27. This is strictly for comparison reasons, so that regression methods are consistently compared on the same data. This date range

was chosen specifically for the trough and plateau features present in the period, since they provide a sufficiently challenging topography for the regression models (and for human experts as well). The evaluation described later uses a random sampling of stock tickers and dates.

A. Linear Regression

Linear regression was less sensitive to normalization techniques than the polynomial regression techniques. Some plausible results were appearing early in the study even when a small number of features were used without normalization, while this caused the polynomial regression models to overflow. Linear regression also provided plausible results after normalization with no parameter tuning required, due to its simplified model, although the accuracy was less than would be desired if relying on the results for portfolio building.

Figure 3: Price prediction for the Apple stock 10 days in the future using Linear Regression.

It is interesting how well linear regression can predict prices when it has an ideal training window, such as the 90 day window pictured above. Later we will compare these results with the other methods.

Figure 4: Price prediction for the Apple stock 45 days in the future using Linear Regression.

Large training windows appeared to overfit for larger prediction windows, as can be seen in Figure 4. However, linear regression appeared to be more accurate in instances where the price deltas were consistent with the price trends over the same period, for relatively short buying periods over a couple of weeks, as seen in Figure 3.

B. Stochastic Gradient Descent (SGD)

At first, it appeared that Stochastic Gradient Descent would be an appropriate fit to a problem of this type for long-term price prediction. However, since the dataset that was used only covered the time period 2005-2013, the training data could provide a maximum of 365 × 8 = 2920 training samples. Obviously, the stock exchange is not open every day of the year, so this number would be significantly lower. This appears to be a problem according to the algorithm's documentation, since it is only recommended for problems with a training set size greater than 10,000. The scikit-learn documentation mentions this with a few suggestions for alternatives [1]:

"The class SGDRegressor implements a plain stochastic gradient descent learning routine which supports different loss functions and penalties to fit linear regression models. SGDRegressor is well suited for regression problems with a large number of training samples (> 10,000); for other problems we recommend Ridge, Lasso, or ElasticNet."

Further along in the paper, we will investigate some of the alternatives mentioned above, but this is also an opportunity for future research on linear methods applied to this domain.

C. Support Vector Regression (SVR)

The scikit-learn documentation has an illustrative figure of the differences between the available kernels when using Support Vector Regression. Figure 5 shows what kind of fitting is done using various kernels; note the difference between the radial basis function (RBF) kernel and the other two [1].
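A minimal version of this kernel comparison can be reproduced with scikit-learn's SVR on synthetic data. This is a sketch with default parameters on noisy sine data, not the study's exact setup:

```python
import numpy as np
from sklearn.svm import SVR

# Noisy sine data, similar in spirit to the scikit-learn kernel comparison.
rng = np.random.default_rng(0)
X = np.sort(5 * rng.random((60, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(60)

# The three kernels discussed in the text; degree=3 is the default for 'poly'.
models = {kernel: SVR(kernel=kernel, degree=3).fit(X, y)
          for kernel in ("rbf", "linear", "poly")}

for name, model in models.items():
    # R^2 on the training data; the RBF kernel follows the curve most closely.
    print(name, round(model.score(X, y), 3))
```

On data like this, the linear kernel cannot bend to follow the sine wave, which is the same qualitative difference the figure below illustrates.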

Figure 5: Support Vector Regression data-fitting with an RBF, linear, and polynomial kernel on a set of normally distributed data with random noise introduced into the data-set.

1) Using the Polynomial Kernel: The degree of the polynomial is by default set to 3; this setting was used for the radial basis function as well.

Figure 6: Sample result of using the polynomial kernel with the SVR. This data was trained on the previous 48 business day closing prices and predicted the next 45 business day closing prices.

From the multiple trials performed, the polynomial kernel tended to have better predictions for a subset of the testing data, but then would tend to diverge abruptly from the ground truth at varying points. This behavior can be seen above, where it diverges around the February 28th, 2007 data point.

Figure 7: Window size comparison for SVR using the polynomial kernel.

2) Using the RBF Kernel:

Figure 8: Sample result of using the RBF kernel with the SVR. This data was trained on the previous 48 business day closing prices and predicted the next 45 business day closing prices.

The RBF kernel tended to fix the divergent behavior that we consistently saw with the polynomial kernel. However, this seemed to come at the cost of less accurate predictions at the beginning of the test data. Overall, the RBF kernel performed the best on average for each day it was tested on. It is important to note that this does not mean its results were always the most accurate: the outcome depended quite heavily on the training window, and linear regression and SVR with the polynomial kernel were sometimes able to make more accurate predictions than SVR with the RBF kernel. However, SVR with the RBF kernel was the most consistent overall, and it is important to make this distinction.

Figure 9: Window size comparison for SVR using the RBF kernel.

Support vector regression with the RBF kernel was not very sensitive to window size changes, which is very different from linear regression and SVR with the polynomial kernel, both of which were very sensitive to window size changes.

D. Summary and Comparison

Figure 10 superimposes all of the regression methods discussed previously on a single stock and time-frame.
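The two error metrics used in this comparison can be sketched as follows, assuming NumPy arrays of ground-truth and predicted closing prices (the function names are illustrative):

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Average absolute error, in dollars when the inputs are prices."""
    return float(np.mean(np.abs(y_true - y_pred)))

def mean_absolute_percentage_error(y_true, y_pred):
    """Average absolute error as a percentage of the true price, so stocks
    trading at very different price levels can be compared fairly."""
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)

y_true = np.array([90.0, 92.0, 95.0])
y_pred = np.array([91.0, 90.0, 96.0])
print(mean_absolute_error(y_true, y_pred))             # ~1.333 dollars
print(mean_absolute_percentage_error(y_true, y_pred))  # ~1.446 percent
```

Dividing by the true price is what removes the bias toward cheap stocks: a $1 miss on a $90 stock and a $10 miss on a $900 stock contribute the same percentage error.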

Figure 10: Comparison of several regression methods on a single stock over a fixed time-frame, with their training and testing models visualized.

The following is the error of each of the regression methods from the superimposed figure above. Note that error here is measured in dollar amount, not percentage as in the following figures.

Figure 11: Mean absolute error for Figure 10.

When comparing the regression methods, mean absolute percentage error was used because of the high variation of stock prices. This ensures that the results are not biased against stocks with higher prices, since the error is calculated as the percentage of the stock price that the prediction was off by.

Figure 12: Mean absolute percentage error of a 5 day stock price prediction for the three regression methods.

Support Vector Regression with the RBF kernel performed the best overall in the trials that we have run, with linear regression and SVR with the polynomial kernel varying more significantly. SVR with the RBF kernel had consistent short-term results of roughly 5% mean absolute percentage error (MAPE), while SVR with the polynomial kernel and linear regression had roughly 10% MAPE on average.

It is interesting to note that for a majority of the trials run, smaller training window sizes almost always had better results, with some of the best accuracies at around 50 prior dates.

Figure 13: Mean absolute percentage error of a 180 day stock price prediction for the three regression methods.

The results for 180 day price predictions were chosen to provide insight into the performance of these algorithms over a longer period of time. Overall, SVR with the RBF kernel performed the best, but it is interesting to note that SVR with the polynomial kernel performed better relative to the rest of the algorithms on these longer time frames. Linear regression performed very poorly when its window size was small for long-term price prediction, but actually ended up outperforming the other algorithms when the window sizes were on the order of years (365-400 days).

VI. CONCLUSION AND FUTURE WORK

Several issues have been addressed throughout the paper, including the process of feature selection, normalization, and training set window size. These issues warrant expansive research in their own regard, but this research has done its best to mitigate these factors by using features collected by a top finance research institute [2] and algorithms provided in an actively developed and maintained machine learning library [1].

The subset of CRSP data that was selected was not substantial enough on a global GDP level to create a decision tree of buy/sell decisions based on industry and sector data as in [3], but future work could include data from

other sources corroborated with the CRSP data to perform regression and/or binary buy/sell classification for whole industries or sectors at once, if so desired.

It is interesting how linear regression can perform better than polynomial methods at certain intervals due to the reduced chance of linear regression overfitting the training data. In some cases, we found that linear regression performed well for long-term projected market fluctuations. This was especially true when a polynomial method would overfit the training data and show increased performance at the beginning of the testing data, but at the cost of very inaccurate results in the later prediction dates. Conversely, linear regression was less accurate at the beginning of the prediction, but would not perform as badly as a polynomial regression method that diverged.

An opportunity for future research also emerges from applying additional linear and polynomial regression methods to this problem. This software suite was architected in such a way that the regression sits at a single critical point, so that different regression algorithms can be modularly swapped in and out as needed. This is also true of the parameters used by these algorithms; more research could be done in this area, since many of the parameters are highly domain-specific for regression models, and tuning them may result in drastic performance increases.

Higher-order polynomial regression methods were more likely to overfit the training data than linear regression, and it is often situational which order of polynomial best fits the training data without overfitting. Often, this is only apparent after we know the ground truth of the prices; therefore, it is difficult to recommend most of these models for any high-stakes financial planning, be it personal finance or otherwise.

ACKNOWLEDGMENT

The CRSP dataset was generously provided by the Department Chair of Finance at the University of New Mexico, Dr. Leslie Boni. I would also like to thank Dr. Trilce Estrada for providing guidance on the project and helping to motivate the various regression techniques used above.

REFERENCES

[1] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830, 2011.

[2] Center for Research in Security Prices, The University of Chicago. US stock databases. Web, 2014.

[3] C. Tsai and S. Wang. Stock price forecasting by hybrid machine learning techniques. Proceedings of the International MultiConference of Engineers and Computer Scientists, 20-26, 2009.

[4] E. Hurwitz and T. Marwala. Common mistakes when applying computational intelligence and machine learning to stock market modelling. University of Johannesburg Press, 2009.

[5] J. W. Lee, J. Park, J. O, and J. Lee. A multiagent approach to Q-learning for daily stock trading. IEEE Transactions on Systems, Man, and Cybernetics, 864-877, 2007.

[6] Y. Wang and I.-C. Choi. Market index and stock price direction prediction using machine learning techniques: An empirical study on the KOSPI and HSI. ScienceDirect, 1-13, 2013.

[7] R. Yan and C. Ling. Machine learning for stock selection. Industrial and Government Track Short Paper Collection, 1038-1042, 2007.

