Finding Progression Stages In Time-evolving Event Sequences

3y ago
17 Views
2 Downloads
444.62 KB
11 Pages
Last View : 1m ago
Last Download : 3m ago
Upload by : Tia Newell
Transcription

Finding Progression Stages in Time-evolvingEvent SequencesJaewon Yang† Julian McAuley† Jure Leskovec† Paea LePendu‡ Nigam Shah‡†Computer Science, Stanford University, {jayang, jmcauley, jure}@cs.stanford.edu‡Biomedical Informatics, Stanford University, {plependu, nigam}@stanford.eduABSTRACTEvent sequences, such as patients’ medical histories or users’ sequences of product reviews, trace how individuals progress overtime. Identifying common patterns, or progression stages, in suchevent sequences is a challenging task because not every individualfollows the same evolutionary pattern, stages may have very different lengths, and individuals may progress at different rates.In this paper, we develop a model-based method for discovering common progression stages in general event sequences. Wedevelop a generative model in which each sequence belongs to aclass, and sequences from a given class pass through a common setof stages, where each sequence evolves at its own rate. We then develop a scalable algorithm to infer classes of sequences, while alsosegmenting each sequence into a set of stages. We evaluate ourmethod on event sequences, ranging from patients’ medical histories to online news and navigational traces from the Web. Theevaluation shows that our methodology can predict future events ina sequence, while also accurately inferring meaningful progressionstages, and effectively grouping sequences based on common progression patterns. More generally, our methodology allows us toreason about how event sequences progress over time, by discovering patterns and categories of temporal evolution in large-scaledatasets of events.Categories and Subject Descriptors: H.2.8 [Database Management]: Database applications—Data miningKeywords: User modeling, time series, event sequences1.INTRODUCTIONA variety of natural processes generate sequences of data whosecomplex temporal dynamics need to be modeled. Such event sequences, in which individual entities generate a series of observations drawn from a finite categorical vocabulary, are ubiquitous inmany applications. For example, an event sequence could represent a product purchasing history of an individual, a person’s Internet browsing history, or a sequence of symptoms exhibited by apatient. Current affiliation: LinkedInCopyright is held by the International World Wide Web Conference Committee (IW3C2). IW3C2 reserves the right to provide a hyperlink to theauthor’s site if the Material is used in electronic media.WWW’14, April 7–11, 2014, Seoul, Korea.ACM ishdLanguageUniteddNationsEarthFigure 1: Examples of the progression stages and the classes(Astronomy and U.S.A.) that we learn from Web navigationtrajectories in the online game Wikispeedia. The top five mostfrequently visited pages are shown for each stage. Players startat some Wikipedia page and then move to the pages related toU.S. In the third stage, red players move towards astronomyrelated pages, while blue players navigate towards U.S.-relatedtopics. Both reach their corresponding goal pages in stage four.Event sequence data has two natural and interesting characteristics: The first is that sequences progress through distinct stages;the second is that there may be many different types or classes ofprogression.Progression stages: Patterns of human behavior are generallynot static as individuals evolve over time. Continuing the above examples, as users acquire more products, their tastes will change andthus, their preferences will go through a series of “expertise” levels [16]; or, as users search for information on the Web, their navigation strategies will progress through a series of phases [25]; finally, as a patient’s disease progresses through various stages, theywill exhibit different sets of symptoms. Understanding and modeling such sequences requires that we understand the mechanismsthat cause them to change over time. In particular, sequences evolvethrough a series of progressing stages or phases. Other examplesof progression sequences range from living cells that undergo various stages of mitosis to chess games that progress through severalnatural stages, like the opening, middle and the end of the game.Classes of progression: In addition to understanding the variousstages through which an individual sequence progresses, it is alsonecessary to categorize or group sequences according to how theirtemporal behavior evolves. For example, given a large set of product purchasing sequences of individuals, classes could representgroups of people that undergo similar evolution of their productpurchasing patterns, e.g., some users may gradually develop a tastefor action movies, while others progress toward drama. Similarly,different patients may progress through stages of a disease differently depending on their age or gender [10]. Here classes could

correspond to groups of people with a common disease or diseaseprogression pattern.Thus, in order to understand the temporal dynamics of event sequences it is crucial to solve two tasks. The first task requiresus to identify different stages through which sequences progressand then segment individual sequences according to the discoveredstages. The second tasks requires us to model different categoriesor classes of sequences. That is, to model classes of sequences thatevolve according to different patterns of events. However, we haveto consider both tasks simultaneously in order to better capture thediversity present in real sequence data.Models for identifying progression stages are useful when solving a variety of tasks. Firstly, with richer models of temporal dynamics, we are better able to predict future events, such as the nextproduct a person will consume, or the next symptom a patient willexhibit. Secondly, the stages and categories that we discover maythemselves be meaningful. For instance, such models can help usto predict a patient’s disease stage more accurately than is possibleby examining their symptoms in isolation.Modeling these types of temporal dynamics is a fundamentallydifficult problem for a variety of reasons. Firstly, not every individual sequence evolves at the same rate. In addition, not everysequence will follow the same progression path or even progressthrough the same set of stages. Moreover, data may only be partially observed, e.g., our first observation of a patient’s symptomsmay occur only after they already exhibit advanced symptoms. Finally, as there may be many different types of progression, sequences have to be both individually classified as well as segmented.A variety of methods exist to model event sequences, though theabove complications present issues for many existing models. Approaches based on mining frequently occurring subsequences [2,18, 26] are not appropriate for this task, as the level of noise andlarge state-spaces mean that any specific pattern is extremely unlikely to appear repeatedly. Hidden Markov Models capture howlatent states change [5, 7, 19, 22], though they typically assume thatall sequences share the same set of latent states and thus progressin the same way. Other models of time-varying data, which arestate-of-the-art for tasks such as movie recommendation on Netflix[9, 15], generally assume that users evolve according to a “globalclock,”, i.e., their progression is tied to the calendar date. In contrast, modeling patient records (for example) requires that each individual progresses according to his or her own personal timescale.It is because of these above difficulties that a new model is required.Present work: Finding progression stages. In this paper, we consider a broad definition of categorical event sequences: At a basiclevel, we model ordered sets of events drawn from a finite vocabulary. This level of generality allows us to model data from a variety of sources, including product reviews, browsing logs, mediastreams, and medical records.We develop scalable methods to discover natural patterns of progression in time-evolving categorical sequence data. We achievethis by grouping sequences into different classes, based on commontemporal patterns of events, while also individually segmenting sequences into automatically discovered progression stages. Bothof these tasks are performed as part of a single optimization procedure, so that we simultaneously learn the categories and identify the progression stages of individual sequences. Our model ishighly flexible in terms of how individual sequences progress—for instance, not every sequence needs to progress through everystage, and each individual sequence may progress through stages ata different rate; this flexibility is essential to capture the noise andvariability present in real data.Stage 1Stage 2 Stage 3 Stage 4Class 1Class 2(a) Input sequences(b) OutputFigure 2: Problem definition: Given input event sequences (a),we aim to categorize sequences into classes based on how theyevolve, and we divide each sequence into progression stages (b).For example, Figure 1 illustrates the output of our algorithmwhen applied to human browsing traces on Wikipedia. We discovertwo sequence classes: people trying to reach pages about astronomy and people navigating towards U.S.-related pages. Simultaneously, we discover four stages of browsing behavior through whichusers progress when navigating towards a particular webpage.More broadly, we apply our models to real-world event sequencesfrom a variety of sources. We model people’s consumption patternson product review websites, such as RateBeer.com; we model people’s browsing behavior using log data from Wikispeedia (a gamethat requires users to navigate Wikipedia pages [25]); and we applyour models to medical data of patients with chronic kidney disease.In terms of experiments, we focus on three different aspects: predicting individual events, inferring progression stages, and grouping sequences into classes. For each aspect, we add qualitativeanalysis to show that our models help us to better understand andreason about the temporal dynamics of sequence data. First, weevaluate the ability of our method to predict future events in sequence data, e.g., the next product a person will consume, the nextpage that she will navigate to, or the next symptom that she will exhibit. We observe that our method achieves a 30% gain in accuracycompared to existing methods for future-event prediction.Second, we evaluate the accuracy and usefulness of the stagesthemselves, which we do by comparing them to known progression stages of patients with chronic kidney disease. The evaluationshows that our method can correctly estimate at what stage a symptom will appear, with a rank correlation higher than 0.8. We alsoanalyze the stages that we infer in other datasets and observe thatthe speed at which a sequence progresses between stages signalsthe longevity of the sequence. For example, reviewers who advance too quickly or too slowly tend to produce fewer reviews intotal compared to those who advance moderately.Third, in our qualitative analysis, we find that the classes of sequences that we discover from navigation trajectories on Wikipediacorrespond to different navigation strategies. We also discover thatnew users on product review websites initially consume similarproducts, before gradually “fanning out” and developing their owntastes, and then finally converging upon common subsets of products favored by “experts” in the community.The remainder of this paper is organized as follows. In Section2 we propose our model. Section 3 describes the data. Section4 shows our experiments on event prediction, Section 5 discussesexperiments on progression stages, and Section 6 presents experiments on classes of sequences. We discuss related work and conclude in Sections 7 and 8.2.PROPOSED METHODOur goal in this paper is to discover the stages of progression thatare common to a given set of event sequences. To achieve this goal,we develop a method based on a conceptual probabilistic model,which specifies how observed event sequences are generated from

latent stages. We formulate the problem, develop the generativemodel, and then show how the latent stages of this model can beefficiently learned.2.1Problem DefinitionWe begin by defining the problem of finding progression stages.We assume that we are given a set of event sequences of differentlengths, and we aim to infer their progression stages and classes.Our problem formulation is based on the following intuitions.First, each event sequence progresses through a set of latent,discrete-valued “stages” over time, and observed events are generated depending on the sequence’s current stage. Second, not onlydoes each sequence have a different length, but the duration ofprogression stages for each sequence can be substantially different; some stages progress slowly, while others do so more quickly.Moreover, sequences may not progress through all stages, i.e., theymay start and finish at some intermediate stage.The final intuition in our problem is that for any set of eventsequences, there are multiple possible patterns of progression. Tomodel different types of progression, we assume that there are latent classes or categories of event sequences, where sequences belonging to the same class progress through events in a similar way.We then aim to automatically “cluster” or “group” sequences toidentify such common patterns of progression. In this way, we develop an unsupervised approach to clustering sequence progressiondata, by identifying sets of event sequences that follow commontrajectories.We formulate the problem of event sequence segmentation andclassification as follows:P ROBLEM 1. Given a set of event sequences, the problem ofsequence segmentation and classification is to: find the class that each sequence belongs to; and assign each event to a stage, with stage assignments beingnon-decreasing over time.We illustrate the process in Fig. 2, and describe it in detail below.2.2Model DescriptionHere, we describe the generative process that we develop formodeling how observed event sequences are generated from a setof underlying latent progression stages.We denote each event sequence by xi , i 1, . . . , N , and thej-th event of xi (j ordered by time) by a categorical-valued xij {1, . . . , M } where N is the number of sequences and M is thenumber of possible events. Each sequence may have a differentlength; we Pdenote by L the sum of the lengths of all sequences(i.e., L i xi ). We also assume that there are C classes ofsequences and that each class divided into K stages. We assumefor simplicity that all classes have K stages, though our model caneasily be adapted to accommodate a different number of stages perclass.Each sequence xi belongs to a single class ci {1, . . . , C}. Foreach event xij xi , we define sij {1, . . . , K} to be the stage ofthe sequence xi at time j. Stages sij are a non-decreasing functionof time, i.e., a sequence never progresses “backward”. i, j, kj k sij sik .(1)From a modeling perspective, this constraint means that we capture patterns of temporal evolution that relate to the sequences ofevents, but are not tied to the exact time, or overall trends in thedataset. Also note that we do not require that any sequence xishould progress through all stages: some sequences may beginfrom intermediate stages, while some other sequences may endwithout reaching the last stage.Given class ci and stages sij for sequence xi , we now specifyhow individual events xij are generated. We employ a very simple generative process where xij is independently drawn from amultinomial distribution with parameter Θ(ci , sij ) RM :xij Multinomial(Θ(ci , sij ))Here, each Θ(ci , sij ) represents separate distribution of events fora given stage sij and class ci . This way, we can ensure that sequences from the same class should have similar sets of events during the same stage.Last, we also assume that Θ(ci , sij ) is drawn from a uniformDirichlet distribution with a hyperparameter λ:Θ(ci , sij ) Dirichlet(λ).Note that our approach can be generalized for generating xij withmore sophisticated models, e.g., we could model xij in each stageand each class using Latent Dirichlet Allocation (LDA) [4]. However, we found that our simple multinomial process works reliablyin practice and allows us to fit the model very efficiently.2.3Inferring Progression StagesWe now explain how we can learn the stages of progression inevent sequences based on our model. We are given a set of eventsequences {xi }. We also assume that we are given the number ofclasses C and the number of stages K. (We will explain later howto estimate values for C and K.) Our goal is to learn for each sequence xi the class membership ci of the sequence and the stageassignments sij for each event xij in the sequence. We achieve thisby fitting the model, i.e., we find classes ci , stages sij , and multinomial distributions Θ {Θ(p, q) p 1, ., C, q 1, ., K} bymaximizing the log likelihood:l(Θ, {ci }, {sij }; {xi }) log P ({xi } Θ, {ci }, {sij })Because variables xij are conditionally independent of each othergiven {ci }, {sij }, the log-likelihood becomesXlog P ({xi } Θ, {ci }, {sij }) log P (xij Θ(ci , sij )).i,jThus, we aim to solve the following optimization problem:Xargmaxlog P (xij Θ(ci , sij ))(2){ci },{sij }%j ,Θ i,jwhere {sij } %j is the monotonicity constraint in Eq. 1.Optimizing Eq. 2 jointly for all sets of variables is highly challenging, since the problem is combinatorial and non-convex. Wenote that our formulation can be naturally cast in the framework ofExpectation-Maximization (EM), where we compute soft assignments of the stages and the classes at one step, and then update Θusing these soft assignments. We note that we have experimentedwith the EM algorithm and found that EM converges prohibitivelyslowly. Thus we employ a coordinate ascent strategy, which is1,000 times faster than EM in our experience, while yielding results of similar quality. Our coordinate ascent strategy is describedbelow.As illustrated in Figure 3, we iteratively update subsets of variables. First, we update Θ while keeping {ci } and {sij } fixed(Fig. 3 (a)). Second, we update {ci } and {sij } with Θ fixed (Fig. 3(b)). We iterate these two steps until convergence, i.e., until theclasses and the stages that we learn do not change between successive iterations.

Stage 1Stage 2 Stage 3 Stage 4Class 1Class 2(b) Updating Θ(a) Updating stagesFigure 3: Method description. (a) Update of stage and classassignments. (b) Update of model parameter Θ for each classand stage. We iterate the two steps until convergence.Updating Θ. With stages {sij } and classes {ci } fixed, we aimto find parameters Θ that maximize the log-likelihood l(Θ) log P ({xi } Θ, {ci }, {sij }):argmax l(Θ) log P ({xi } Θ, {ci }, {sij }).ΘBecause variables xij are conditionally independent of each othergiven {ci }, {sij }, l(Θ) can be represented by summing log-probabilities for xij :XXl(Θ) log P (xij ci , sij , Θ) log P (xij Θ(ci , sij )).i,ji,jNote that P (xij Θ(ci , sij )) does not depend on Θ(p, q) if p 6 cior q 6 sij . Thus, l(Θ) is separable with respect to Θ(p, q):l(Θ) C XKXl(Θ(p, q)),p 1 q 1l(Θ(p, q)) X1{ci p sij q} log P (xij Θ(p, q)),i,jwhere 1 denotes an indicator function. We find the optimal valueof Θ(p, q) by maximizing l(Θ(p, q)). Because P (xij Θ(p, q)) isa multinomial distribution, the optimal value of Θ(p, q) is the sameas the empirical probability smoothed by the Dirichlet parameter λ:Pλ i,j 1{ci p sij q xij r}PΘ(p, q)r

Second, we evaluate the accuracy and usefulness of the stages themselves, which we do by comparing them to known progres-sion stages of patients with chronic kidney disease. The evaluation shows that our method can correctly estimate at what stage a symp-tom will appear, with a rank correlation higher than 0.8. We also

Related Documents:

Capacitors 5 – 6 Fault Finding & Testing Diodes,Varistors, EMC capacitors & Recifiers 7 – 10 Fault Finding & Testing Rotors 11 – 12 Fault Finding & Testing Stators 13 – 14 Fault Finding & Testing DC Welders 15 – 20 Fault Finding & Testing 3 Phase Alternators 21 – 26 Fault Finding & Testing

Enlisted Career Progression Charts 10-1-1. General This chapter contains career progression charts for each enlisted career management field (CMF) and approved for enlisted classification. 10-1-2. Specifications for Enlisted Career Progression Charts This chapter contains career progression charts for each enlisted specialty. The chapter is

Student Progression Plan Table of Contents I. Student Progression Plan Overview A. Purpose of Student Progression Plan B. Mission, Vision, and Values of School District C. Nondiscrimination Statement II. Georgia Standards of Excellence A. Georgia Standards of Excellence Ge

Children build on the last step by finding multiples of 10% and other known percentages. They explore different methods of finding certain percentages e.g. Finding 20%by dividing by 10 and multiplying by 2 or by dividing by 5. They also explore finding 5% by finding half of 10%. Using these m

progression (scheme below). In human, depending on the tissues, it takes between 10 to 30 years from the initiation stages to advanced metastisized stages of tumors. Within this peroid, there are windows of opportunity to block and intervene the progression of cancer using relatively non-toxic dietary phytochemicals and or

The progression of Alzheimer’s disease and other dementias. Symptoms in the later stages. People in the later stages of dementia become increasingly frail and . depend more on other people for support. As dementia progresses and causes changes to the person’s brain, they may struggle to do many of the things they used to.

follow-up of Bosniak IIF cysts is necessary, because of the risk of progression and malignancy [3–5]. The likelihood and time to progression is undetermined in the literature. The progression rate of IIF in our selected population has been reported as 4.6%, with all malignant cysts progressing within

Copyright National Literacy Trust (Alex Rider Secret Mission teaching ideas) Trademarks Alex Rider ; Boy with Torch Logo 2010 Stormbreaker Productions Ltd .