Behavior Based Learning in Identifying High Frequency Trading Strategies


Steve Yang, Mark Paddrik, Roy Hayes, Andrew Todd, Andrei Kirilenko, Peter Beling, and William Scherer

Mr. Yang, Mr. Paddrik, Mr. Hayes, Mr. Todd, Dr. Beling, and Dr. Scherer are with the Department of Systems and Information Engineering at the University of Virginia (email: {yy6a, mp3ua, rlh8t, aet6cd, pb3a, wts}@virginia.edu). Dr. Kirilenko is with the Commodity Futures Trading Commission (email: akirilenko@cftc.gov).

Abstract— Electronic markets have emerged as popular venues for the trading of a wide variety of financial assets, and computer-based algorithmic trading has asserted itself as a dominant force in financial markets across the world. Identifying and understanding the impact of algorithmic trading on financial markets has become a critical issue for market operators and regulators. We propose to characterize traders' behavior in terms of the reward functions most likely to have given rise to the observed trading actions. Our approach is to model trading decisions as a Markov Decision Process (MDP) and to use observations of an optimal decision policy to find the reward function. This is known as Inverse Reinforcement Learning (IRL). Our IRL-based approach to characterizing trader behavior strikes a balance between two desirable features: it captures key empirical properties of order book dynamics and yet remains computationally tractable. Using an IRL algorithm based on linear programming, we are able to achieve more than 90% classification accuracy in distinguishing high frequency trading from other trading strategies in experiments on a simulated E-Mini S&P 500 futures market. The results of these empirical tests suggest that high frequency trading strategies can be accurately identified and profiled based on observations of individual trading actions.

Keywords: Limit order book, Inverse Reinforcement Learning, Markov Decision Process, Maximum likelihood, Price impact, High Frequency Trading.

I. INTRODUCTION

Many financial market participants now employ algorithmic trading, commonly defined as the use of computer algorithms to automatically make certain trading decisions, submit orders, and manage those orders after submission. By the time of the "Flash Crash" (on May 6, 2010, over the course of 25 minutes, stock index futures, options, and exchange-traded funds experienced a sudden price drop of more than 5 percent, followed by a rapid and nearly complete rebound), algorithmic trading was thought to be responsible for more than 70% of trading volume in the U.S. ([3], [9], [10], and [15]). Moreover, Kirilenko et al. [15] have shown that the key events in the Flash Crash have a clear interpretation in terms of algorithmic trading.

A variety of machine learning techniques have been applied in financial market analysis and modeling to assist market operators, regulators, and policy makers in understanding the behavior of market participants, market dynamics, and the price discovery process in the new electronic market phenomenon of algorithmic trading ([1], [2], [3], [4], [5], [6], and [8]).
We propose modeling traders' behavior as a Markov Decision Process (MDP), using observations of individual trading actions to characterize or infer trading strategies. More specifically, we aim to learn traders' reward functions in the context of multi-agent environments where traders compete using fast algorithmic trading strategies to explore and exploit market microstructure.

Our proposed approach is based on a machine learning technique ([20], [21], and [23]) known as Inverse Reinforcement Learning (IRL) ([11], [12], [13], [18], and [22]). In IRL, one aims to infer the model that underlies solutions that have been chosen by decision makers. In this case the reward function is of interest in itself as a characterization of the agent's behavior, regardless of its circumstances. For example, Pokerbots can improve performance against suboptimal human opponents by learning reward functions that account for the utility of money, preferences for certain hands or situations, and other idiosyncrasies [17]. Another objective in IRL is to use observations of the traders' actions to decide one's own behavior. In this case it is possible to learn the reward functions directly from past observations and then derive new policies from the learned reward functions in a new environment to govern a new autonomous process (apprenticeship learning). In this paper, we focus our attention on the former problem: identifying traders' behavior using reward functions.

The rest of the paper is structured as follows. In Section 2, we define notation and formulate the IRL model. In Section 3, we first propose a concise MDP model of the limit order book to obtain reward functions of different trading strategies, and then solve the IRL problem using a linear programming approach based on an assumption of rational decision making. In Section 4, we present our agent-based simulation model of the E-Mini S&P 500 futures market and provide validation results that suggest this model replicates the real E-Mini S&P 500 futures market with high fidelity. Using this simulation model we generate simulated market data and perform two experiments. In the first experiment, we show that we can reliably identify High Frequency Trading (HFT) strategies among other algorithmic trading strategies using IRL. In the second experiment, we apply IRL to HFTs and show that we can accurately identify a manipulative HFT strategy (spoofing) among the other HFT strategies. Section 5 discusses the conclusions of this study and future work.

II. PROBLEM FORMULATION - INVERSE REINFORCEMENT LEARNING MODEL

The primary objective of our study is to find the reward function that, in some sense, best explains the observed behavior of a decision agent. In the field of reinforcement learning, it is a principle that the reward function is the most succinct, robust, and transferable representation of a decision task, and completely determines the optimal policy (or set of policies) [22]. In addition, knowledge of the reward function allows a learning agent to generalize better, since such knowledge is necessary to compute new policies in response to changes in the environment. These points motivate our hypothesis that IRL is a suitable method for characterizing trading strategies.

A. General Problem Definition

Let us first define an (infinite-horizon, discounted) MDP model $M = \{S, A, P, \gamma, R\}$, where:

- $s \in S$, where $S = \{s_1, s_2, \ldots, s_N\}$ is a set of $N$ states;
- $A = \{a_1, a_2, \ldots, a_k\}$ is a set of $k$ possible actions;
- $P = \{P_{a_j}\}_{j=1}^{k}$, where $P_{a_j}$ is a transition matrix such that $P_{s a_j}(s')$ is the probability of transitioning to state $s'$ given that action $a_j$ is taken in state $s$;
- $\gamma \in (0, 1)$ is a discount factor;
- $R$ is a reward function such that $R(s)$ or $R(s, a)$ is the reward received when action $a$ is taken in state $s$.

Within the MDP construct, a trader or an algorithmic trading strategy can be represented by a set of primitives $(P, \gamma, R)$, where $R$ is a reward function representing the trader's preferences, $P$ is a Markov transition probability representing the trader's subjective beliefs about uncertain future states, and $\gamma$ is the rate at which the agent discounts reward in future periods. In using IRL to identify trading strategies, the first question that needs to be answered is whether $(P, \gamma, R)$ is identified. Rust [7] discussed this identification problem in his earlier work on economic decision modeling. He concluded that if we are willing to impose even stronger prior restrictions, namely stationarity and rational expectations, then we can use non-parametric methods to consistently estimate decision makers' subjective beliefs from observations of their past states and decisions. Hence, in formulating the IRL problem for identifying trading strategies, we make two basic assumptions: first, the policies we model are stationary; second, the trading strategies are rational expected-reward maximizers.

Here we define the value function at state $s$ with respect to policy $\pi$ and discount $\gamma$ to be $V^\pi_\gamma(s) = E[\sum_{t=0}^{\infty} \gamma^t R(s^t, \pi(s^t)) \mid \pi]$, where the expectation is over the distribution of the state sequence $\{s^0, s^1, \ldots, s^t\}$ given policy $\pi$ (superscripts index time). We also define $Q^\pi_\gamma(s, a)$ for state $s$ and action $a$ under policy $\pi$ and discount $\gamma$ to be the expected return from state $s$, taking action $a$ and thereafter following policy $\pi$. We then have the following two classical results for MDPs (see, e.g., [20], [19]):

Theorem 1: (Bellman Equations) Let an MDP $M = \{S, A, P, \gamma, R\}$ and a policy $\pi : S \to A$ be given. Then, for all $s \in S$, $a \in A$, $V^\pi_\gamma$ and $Q^\pi_\gamma$ satisfy:

$$V^\pi_\gamma(s) = R(s, \pi(s)) + \gamma \sum_{j \in S} P_{s\pi(s)}(j)\, V^\pi_\gamma(j), \quad \forall s \in S \qquad (1)$$

$$Q^\pi_\gamma(s, a) = R(s, a) + \gamma \sum_{j \in S} P_{sa}(j)\, V^\pi_\gamma(j), \quad \forall s \in S \qquad (2)$$

Theorem 2: (Bellman Optimality) Let an MDP $M = \{S, A, P, \gamma, R\}$ and a policy $\pi : S \to A$ be given. Then $\pi$ is an optimal policy for $M$ if and only if, for all $s \in S$:

$$V^\pi_\gamma(s) = \max_{a \in A} \Big[ R(s, a) + \gamma \sum_{j \in S} P_{sa}(j)\, V^\pi_\gamma(j) \Big] \qquad (3)$$
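To make Theorems 1 and 2 concrete, the following sketch evaluates $V^\pi_\gamma$ and $Q^\pi_\gamma$ for a small finite MDP by solving the linear system implied by equation (1) and then applying equation (2). This is our own illustration rather than code from the paper, and all names are hypothetical.

```python
import numpy as np

def evaluate_policy(P, R, policy, gamma):
    """Solve the Bellman equations (1)-(2) for a finite MDP.

    P      : array (k, N, N), P[a, s, s2] = probability of moving s -> s2 under action a
    R      : array (N, k), R[s, a] = reward for taking action a in state s
    policy : array (N,), policy[s] = index of the action pi takes in state s
    gamma  : discount factor in (0, 1)
    """
    N, k = R.shape
    # Transition matrix and reward vector induced by following pi.
    P_pi = np.array([P[policy[s], s, :] for s in range(N)])     # shape (N, N)
    R_pi = np.array([R[s, policy[s]] for s in range(N)])        # shape (N,)
    # Equation (1) in matrix form: (I - gamma * P_pi) V = R_pi.
    V = np.linalg.solve(np.eye(N) - gamma * P_pi, R_pi)
    # Equation (2): Q(s, a) = R(s, a) + gamma * sum_j P_a(s, j) V(j).
    Q = np.array([[R[s, a] + gamma * P[a, s, :] @ V for a in range(k)]
                  for s in range(N)])
    return V, Q

# Theorem 2 check: pi is optimal exactly when V(s) = max_a Q(s, a) for every s,
# e.g. np.allclose(V, Q.max(axis=1)).
```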
The Bellman Optimality condition can be written in matrix form as follows:

Theorem 3: Let a finite state space $S$, a set of actions $a \in A$, transition probability matrices $P_a$, and a discount factor $\gamma \in (0, 1)$ be given. A policy $\pi$ is an optimal policy for $M$ if and only if, for all $a \in A \setminus \pi$, the reward $R$ satisfies:

$$(P_\pi - P_a)(I - \gamma P_\pi)^{-1} R \ge 0 \qquad (4)$$

B. Linear Programming Approach to IRL

The IRL problem is, in general, highly underspecified, which has led researchers to consider various models for restricting the set of reward vectors under consideration. The only reward vectors consistent with an optimal policy $\pi$ are those that satisfy the set of inequalities in Theorem 3. Note that the degenerate solution $R = 0$ satisfies these constraints, which highlights the underspecified nature of the problem and the need for reward selection mechanisms. Ng and Russell [11] advance the idea of choosing the reward function to maximize the difference between the optimal and suboptimal policies, which can be done with a linear programming formulation. We adopt this approach, maximizing:

$$\sum_{s \in S} \Big[ Q^\pi_\gamma(s, \pi(s)) - \max_{a \in A \setminus \pi(s)} Q^\pi_\gamma(s, a) \Big] \qquad (5)$$

Putting (4) and (5) together, we obtain an optimization problem whose solution is a reward function consistent with an optimal policy:

$$
\begin{aligned}
\max_R \;\; & \sum_{s \in S} \beta(s) - \lambda \sum_{s \in S} \alpha(s) \\
\text{s.t.} \;\; & -\alpha(s) \le R(s) \le \alpha(s), && \forall s \in S \\
& (P_\pi - P_a)(I - \gamma P_\pi)^{-1} R \ge \beta(s), && \forall a \in A,\; \forall s \in S \\
& (P_\pi - P_a)(I - \gamma P_\pi)^{-1} R \ge 0
\end{aligned}
\qquad (6)
$$

In summary, we assume an ergodic MDP process; in particular, we assume the policy defined in the system has a proper stationary distribution. We further assume that traders' trading strategies are rational expected-reward maximizers. There are specific issues regarding the non-deterministic nature of traders' strategies when dealing with empirical observations, and we address them in the next section.
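For concreteness, the following is a minimal sketch of this style of linear program for a state-only reward $R(s)$, in the spirit of Ng and Russell [11]. It is our own illustration under stated assumptions, not the authors' implementation; the function name and parameters (lp_irl, l1_penalty, r_max) are hypothetical. Each state gets a margin variable (the $\beta(s)$ above) and an absolute-value variable ($\alpha(s)$), and the program is solved with scipy's linprog.

```python
import numpy as np
from scipy.optimize import linprog

def lp_irl(P, policy, gamma, l1_penalty=1.0, r_max=1.0):
    """Recover a state reward vector R from an observed (assumed optimal) policy.

    P      : array (k, N, N) of transition matrices, one per action
    policy : array (N,) giving the action index taken in each state
    gamma  : discount factor in (0, 1)
    """
    k, N, _ = P.shape
    P_pi = np.array([P[policy[s], s, :] for s in range(N)])
    inv_term = np.linalg.inv(np.eye(N) - gamma * P_pi)

    # Decision vector x = [R (N), beta (N), alpha (N)]; linprog minimizes c @ x,
    # so we minimize -(sum beta - l1_penalty * sum alpha).
    c = np.concatenate([np.zeros(N), -np.ones(N), l1_penalty * np.ones(N)])

    A_ub, b_ub = [], []
    for s in range(N):
        for a in range(k):
            if a == policy[s]:
                continue
            # d @ R is the s-th component of (P_pi - P_a)(I - gamma * P_pi)^{-1} R.
            d = (P_pi[s, :] - P[a, s, :]) @ inv_term
            # Optimality constraint (4):  -d @ R <= 0.
            row = np.zeros(3 * N); row[:N] = -d
            A_ub.append(row); b_ub.append(0.0)
            # Margin constraint:  beta_s <= d @ R.
            row = np.zeros(3 * N); row[:N] = -d; row[N + s] = 1.0
            A_ub.append(row); b_ub.append(0.0)
        # L1 constraints:  -alpha_s <= R_s <= alpha_s.
        row = np.zeros(3 * N); row[s] = 1.0; row[2 * N + s] = -1.0
        A_ub.append(row); b_ub.append(0.0)
        row = np.zeros(3 * N); row[s] = -1.0; row[2 * N + s] = -1.0
        A_ub.append(row); b_ub.append(0.0)

    bounds = [(-r_max, r_max)] * N + [(None, None)] * N + [(0.0, None)] * N
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=bounds, method="highs")
    return res.x[:N]
```

Given a transition model and an observed policy for a single trader, the returned vector serves as that trader's recovered reward, i.e. the behavioral signature used in the classification experiments below.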

C. Key Modeling Issues

One of the key issues that arises in applications of IRL or apprenticeship learning to algorithmic trading is that the trader under observation may not appear to follow a deterministic policy. In particular, a trader observed in the same state on two different occasions may take two different actions, either because the trader is following a randomized policy or because the state space used in the model lacks the fidelity to capture all the factors that influence the trader's decision. To address the issue of non-deterministic policies, we first need to understand the relationship between deterministic and non-deterministic policies under the assumptions made earlier. We use the notation MD for Markov deterministic policies and MR for Markov randomized (non-deterministic) policies. We can establish the relationship between the optimality of a deterministic policy and that of a non-deterministic policy through the following proposition ([17]):

Proposition: For all $v \in V$ and $0 \le \gamma \le 1$:

$$\sup_{d \in D^{MD}} \{ R_d + \gamma P_d v \} = \sup_{d \in D^{MR}} \{ R_d + \gamma P_d v \} \qquad (7)$$

Policies range in general from deterministic Markovian to randomized history-dependent, depending on how they incorporate past information and how they select actions. In the financial trading world, traders deploy different trading strategies, each of which has a unique value proposition. We can, in theory, use cumulative reward to represent the value system encapsulated in these different trading strategies. For example, in a simple keep-or-cancel strategy for buying one unit, the trader has to decide when to place an order and when to cancel it based on the market environment (which can be characterized as stochastic processes) so as to maximize cumulative reward under the constraints of the trader's risk utility and capital limit. This can be realized in a number of ways; for instance, it can be described by a function R(s), meaning that when the system is in state s the trader is always looking for a fixed reward. This notion of a value proposition drives traders to take the corresponding optimal actions according to market conditions. However, due to the uncertainty of the environment and random measurement error in the observations, a deterministic policy could very likely be perceived as non-deterministic.

Based on the proposition in equation (7), the optimal value attained by a randomized policy is the same as that attained by a deterministic policy, and the optimal value exists and is unique in V. Therefore, the supremum value obtained over all policies can be used to recover an equivalent optimal stationary deterministic policy. Essentially, we are looking for an optimal deterministic stationary policy that achieves the same optimal value as the non-deterministic policy. This guarantees that the learning agent obtains a unique reward function that achieves the optimal value. The merit of this approach is that the reward function will be unique for a specific set of observations, and we need not be concerned with whether the trader's real policy is deterministic or not. This is especially useful when we attempt to identify traders' trading strategies based on a series of observations.
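In practice, this suggests a simple pre-processing step before the linear program of Section II-B is applied to empirical observations: a trader's observed (state, action, next state) triples can be summarized into a majority-vote deterministic stationary policy together with empirical transition frequencies. The sketch below is our own illustration of such a step; the paper does not prescribe this exact procedure, and all names are hypothetical.

```python
import numpy as np
from collections import Counter, defaultdict

def summarize_observations(triples, n_states, n_actions):
    """Collapse observed (s, a, s_next) triples into a deterministic stationary
    policy (most frequent action per state) and empirical transition matrices."""
    action_counts = defaultdict(Counter)
    trans_counts = np.zeros((n_actions, n_states, n_states))
    for s, a, s_next in triples:
        action_counts[s][a] += 1
        trans_counts[a, s, s_next] += 1

    # Deterministic policy: the action observed most often in each state.
    policy = np.zeros(n_states, dtype=int)
    for s in range(n_states):
        if action_counts[s]:
            policy[s] = action_counts[s].most_common(1)[0][0]

    # Empirical transition probabilities; rows with no observations stay uniform.
    P = np.full((n_actions, n_states, n_states), 1.0 / n_states)
    row_sums = trans_counts.sum(axis=2, keepdims=True)
    np.divide(trans_counts, row_sums, out=P, where=row_sums > 0)
    return policy, P
```

The resulting (policy, P) pair can be fed directly to the linear program sketched in Section II-B.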
III. AN MDP MODEL FOR THE LIMIT ORDER BOOK

Cont et al. ([24] and [25]) make the claim that order flow imbalance and order volume imbalance have the strongest link with price changes, and these two variables appear to best capture limit order book dynamics. Hunt et al. [27] have shown this to be effective in modeling buy-one-unit and make-the-spread strategies, where three price levels show significantly good resemblance to real market characteristics. Other financial market microstructure studies also provide strong evidence for using order book imbalance to represent market supply and demand dynamics or information asymmetry ([5], [8], [6], [14] and [26]). Based on this evidence, we choose two bid/ask volume imbalance variables to capture the market environment, and we choose position/inventory level as a private variable of the trader.

In summary, we use three sensory variables to characterize the environment in which the traders operate. We define the state $s = [TIM, NIM, POS]^T$, where each variable takes the following discrete values:

- TIM - volume imbalance at the best bid/ask: {-1, 0, 1};
- NIM - volume imbalance at the 3rd best bid/ask: {-1, 0, 1};
- POS - position status: {-1, 0, 1}.

When a variable takes value 0 (the neutral state), it lies within $\mu \pm 1.96\sigma$ of its mean $\mu$; when the value is above $\mu + 1.96\sigma$, we define it as high; and when the value is below $\mu - 1.96\sigma$, we define it as low (a code sketch at the end of this section illustrates this discretization). Essentially we have two external variables, TIM and NIM, which inform the traders whether volume imbalance is moving toward the sell side (low), is neutral, or is moving toward the buy side (high), as well as the momentum of the market price movement. The private variable POS informs traders whether their inventory is low, neutral, or high. All three variables are essential for algorithmic traders to make their trading decisions. We also define a set of actions that correspond to traders' trading choices at each state, $a \in$ {PBL, PBH, PSL, PSH, CBL, CBH, CSL, CSH, TBL, TBH, TSL, TSH}, with each value defined in TABLE I.

TABLE I
PBH - place buy order higher than the 3rd best bid price
PBL - place buy order lower than the 3rd best bid price
PSH - place sell order higher than the 3rd best ask price
PSL - place sell order lower than the 3rd best ask price
CBH - cancel buy order higher than the 3rd best bid price
CBL - cancel buy order lower than the 3rd best bid price
CSH - cancel sell order higher than the 3rd best ask price
(definitions of the remaining actions omitted)

We assume a highly liquid market where market orders will always be filled, and we apply the model to a simulated order book where both limit orders and market orders are equally present.
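As a concrete illustration of this state construction, the sketch below maps raw imbalance and position series to {-1, 0, 1} using the $\mu \pm 1.96\sigma$ rule described above and packs each [TIM, NIM, POS] triple into one of the 27 possible state indices. The thresholding rule comes from the text; the function names and the base-3 encoding are our own illustrative choices.

```python
import numpy as np

def discretize(x):
    """Map a raw series to {-1, 0, 1}: -1 (low) below mu - 1.96*sigma,
    0 (neutral) within the band, 1 (high) above mu + 1.96*sigma."""
    x = np.asarray(x, dtype=float)
    mu, sigma = x.mean(), x.std()
    levels = np.zeros(len(x), dtype=int)
    levels[x > mu + 1.96 * sigma] = 1
    levels[x < mu - 1.96 * sigma] = -1
    return levels

def encode_states(tim_raw, nim_raw, pos_raw):
    """Build s = [TIM, NIM, POS] and encode each triple as an index in 0..26."""
    tim, nim, pos = (discretize(v) for v in (tim_raw, nim_raw, pos_raw))
    # Shift each component from {-1, 0, 1} to {0, 1, 2} and use base-3 encoding.
    return (tim + 1) * 9 + (nim + 1) * 3 + (pos + 1)
```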

IV. EXPERIMENTS

In this section, we conduct two experiments using the MDP model defined earlier to identify algorithmic trading strategies. We use the six trader classes defined by Kirilenko et al. [15], namely High Frequency Traders, Market Makers, Opportunistic Traders, Fundamental Buyers, Fundamental Sellers, and Small Traders. In general, HFTs have a set of distinctive characteristics, such as very high activity volume throughout a trading day, frequent modification of orders, maintenance of very low inventory levels, and an agnostic orientation toward long or short positions. Market Makers are short-horizon investors who follow a strategy of buying and selling a large number of contracts to stay around a relatively low target level of inventory. Opportunistic Traders sometimes behave as Market Makers, buying and selling around a target position, and sometimes act as Fundamental Traders, accumulating long or short positions. Fundamental Buyers and Sellers are net buyers and sellers who generally accumulate positions in a single direction. Small Traders are those with significantly less activity during a typical trading day.

In the first experiment, we are interested in separating High Frequency Trading strategies from Market Making and Opportunistic Trading strategies in the simulated E-Mini S&P 500 futures market. From panel (b) of Fig. 1, we see that the behavior of the Fundamental Buyers/Sellers is distinctly different from that of the other algorithmic traders, so classification between the HFTs and these classes of trading strategies is relatively trivial. We therefore devote our attention to separating High Frequency Trading strategies from the Market Making and Opportunistic Trading strategies. We start this section with a description of the design of our agent-based simulation of the E-Mini S&P 500 futures market [33]. We then use the data generated from this simulation as observations to recover the reward functions of different kinds of trading strategies, and we apply various classification methods to these trading strategies in the reward space to see whether we can accurately identify the different trading strategy classes (an illustrative sketch of this classification step is given below). In the second experiment, we focus on a specific High Frequency Trading strategy called spoofing, and try to separate this trading strategy from the other High Frequency Trading strategies. In general, we test the hypothesis that reward functions can be used to effectively identify High Frequency Trading strategies in both within-group and across-group situations.

A. Simulated E-Mini S&P 500 Futures Market

When simulating a system it is convenient to decompose the system into its basic parts. A financial market can be understood as a set of market participants, a trading [...]

TABLE II: TRADER GROUP CHARACTERIZATION

TABLE III: TRADER GROUP VALIDATION
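As a hedged illustration of the classification step described above, the sketch below treats each trader's recovered reward vector as a feature vector and estimates cross-validated accuracy with an off-the-shelf linear classifier. The paper does not prescribe this particular model, and the input file names used here are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical inputs: one recovered reward vector per simulated trader (rows),
# with label 1 for HFT and 0 for Market Making / Opportunistic strategies.
rewards = np.load("reward_vectors.npy")   # shape (n_traders, n_states), assumed file
labels = np.load("trader_labels.npy")     # shape (n_traders,), assumed file

clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, rewards, labels, cv=5)
print("cross-validated accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```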
