Time Series Clustering - UC3M

3y ago
66 Views
7 Downloads
3.36 MB
107 Pages
Last View : 2d ago
Last Download : 3m ago
Upload by : Javier Atchley
Transcription

OutlineTime Series ClusteringAndrés M. Alonso1 Department2 Instituteof Statistics, UC3MFlores de LemusASDM - C02June 24 – 28, 2019, Boadilla del MonteAndrés M. AlonsoTime series clustering

OutlineOutline1Introduction2Time series clustering by features.3Model based time series clustering4Time series clustering by dependenceAndrés M. AlonsoTime series clustering

IntroductionTime series clustering by featuresModel based time series clusteringTime series clustering by dependenceIntroduction to clusteringThe problemApproachesWhat is the meaning of clustering?DefinitionCluster analysis or clustering is the task of grouping a set ofobjects in such a way that objects in the same group (called acluster) are more similar (in some sense or another) to eachother than to those in other groups (clusters).WikipediaKey elements of the definitionObjectsGroup (that can be hard or soft).Similarity.Andrés M. AlonsoTime series clustering

IntroductionTime series clustering by featuresModel based time series clusteringTime series clustering by dependenceIntroduction to clusteringThe problemApproachesAlgorithms for clusteringConnectivity-based clustering (hierarchical clustering)Scotto, M., Alonso, A.M. and Barbosa, S. (2010) Clustering time series ofsea levels: an extreme value approach, Journal of Waterway, Port, Coastal,and Ocean Engineering, 136, 215–225.Centroid-based clusteringMaharaj, E.A., Alonso, A.M. and D’Urso, P. (2015) Clustering SeasonalTime Series Using Extreme Value Analysis: An Application to SpanishTemperature Time Series, Communications in Statistics - Case Studies andData Analysis, 1, 175–191.(Model) Distribution-based clusteringAlonso, A.M., Berrendero, J.R., Hernández, A. and Justel, A. (2006) Timeseries clustering based on forecast densities, Computational Statistics andData Analysis, 51, 762–766.Density-based clusteringAndrés M. AlonsoTime series clustering

IntroductionTime series clustering by featuresModel based time series clusteringTime series clustering by dependenceIntroduction to clusteringThe problemApproachesExamples of clustering algorithmsAndrés M. AlonsoPRBRESFPACNLCHNYLWKWNNWL0200400600800NWTime series clustering1000120014001600These algorithms connect “objects”to form "clusters" based on theirdistance/similarity.A cluster can be described by themaximum distance needed toconnect parts of the cluster.At different distances, differentclusters will form, which can berepresented using a dendrogram.BSConnectivity-based clustering

IntroductionTime series clustering by featuresModel based time series clusteringTime series clustering by dependenceIntroduction to clusteringThe problemApproachesExamples of clustering algorithms 2 4 6Centroid-based clustering 8Clusters are represented by a central“object”, which may not necessarilybe a member of the data set. 10 12 14k-means 16 18k-mediods or PAM 20 2235Andrés M. AlonsoTime series clustering404550

IntroductionTime series clustering by featuresModel based time series clusteringTime series clustering by dependenceIntroduction to clusteringThe problemApproachesExamples of clustering algorithmsCentroid-based clusteringClusters are represented by a central“object”, which may not necessarilybe a member of the data set.k-meansk-mediods or PAMAndrés M. AlonsoTime series clustering

IntroductionTime series clustering by featuresModel based time series clusteringTime series clustering by dependenceIntroduction to clusteringThe problemApproaches0.4Examples of clustering algorithms0.30.20.1The clustering model most closelyrelated to statistics is based ondistribution models.Clusters can then easily be definedas objects belonging most likely tothe same distribution/model.0.0(Model) Distribution-based clusteringuknlcyesfrnobedefi100150(b)Andrés M. AlonsoTime series clustering200luseieatlvrohusi

IntroductionTime series clustering by featuresModel based time series clusteringTime series clustering by dependenceIntroduction to clusteringThe problemApproachesExamples of clustering algorithmsDensity-based clusteringClusters are defined as areas ofhigher density than the remainder ofthe data set.Objects in sparse areas are usuallyconsidered to be noise and borderpoints.Andrés M. AlonsoTime series clustering

IntroductionTime series clustering by featuresModel based time series clusteringTime series clustering by dependenceIntroduction to clusteringThe problemApproachesThe problemTime series clustering problems arise when we observe asample of time series and we want to group them into differentcategories or clusters.This a central problem in many application fields and hencetime series clustering is nowadays an active research area indifferent disciplines including finance and economics, medicine,engineering, seismology and meteorology, among others.Andrés M. AlonsoTime series clustering

IntroductionTime series clustering by featuresModel based time series clusteringTime series clustering by dependenceIntroduction to clusteringThe problemApproachesApproaches for time series clusteringTime series clustering by features.Model based time series clustering.Time series clustering by dependence.Liao, T.W. (2005) Clustering of time series data-a survey, Pattern Recognition, 38,1857–1874.Aghabozorgi, S., Shirkhorshidi, A.S. and Wah, T.Y. (2015) Time-series clustering– A decade review. Information Systems 53 16–38.Caiado, J., Maharaj, E. A., and D’Urso, P. (2015) Time series clustering. In:Handbook of cluster analysis. Chapman and Hall/CRC.Andrés M. AlonsoTime series clustering

IntroductionTime series clustering by featuresModel based time series clusteringTime series clustering by dependenceIntroduction to clusteringThe problemApproachesApproaches for time series clusteringTime series clustering by features.Kakizawa, Y., Shumway, R.H. and Taniguchi, M. (1998) Discrimination andclustering for multivariate time series, J. Am. Stat. Assoc., 93, 328–340.Vilar, J.A. and Pértega, S. (2004) Discriminant and cluster analysis forGaussian stationary processes: Local linear fitting approach, J.Nonparametr. Stat., 16, 443–462.Caiado, J., Crato, N. and Peña, D. (2006) A periodogram-based metric fortime series classification, Comput. Statist. Data Anal. 50, 2668-2684.Scotto, M., Alonso, A.M. and Barbosa, S. (2010) Clustering time series ofsea levels: an extreme value approach, Journal of Waterway, Port, Coastal,and Ocean Engineering, 136, 215–225.D’Urso, P., Maharaj, E.A. and Alonso, A.M. (2017) Fuzzy Clustering of TimeSeries using Extremes, Fuzzy Sets and Systems, 318, 56–79.Andrés M. AlonsoTime series clustering

IntroductionTime series clustering by featuresModel based time series clusteringTime series clustering by dependenceIntroduction to clusteringThe problemApproachesApproaches for time series clusteringModel based time series clustering.Alonso, A.M., Berrendero, J.R., Hernández, A. and Justel, A. (2006) Timeseries clustering based on forecast densities, Computational Statistics andData Analysis, 51, 762–766.Scotto, M.; Barbosa, S. and Alonso, A.M. (2009) Model-based clustering ofBaltic sea-level, Applied Ocean Research, 31, 4–11.Vilar-Fernández, J.A., Alonso, A.M. and Vilar-Fernández, J.M. (2010)Nonlinear time series clustering based on nonparametric forecastdensities, Computational Statistics and Data Analysis, 54, 2850–2865.Time series clustering by dependence.Alonso, A.M. and Peña, D. (2019) Clustering time series by dependency,Statistics and Computing, 29, 655–676.Alonso, A.M.; Galeano, P. and Peña, D. (2019) A robust procedure to builddynamic factor models with cluster structure, Submitted.Andrés M. AlonsoTime series clustering

IntroductionTime series clustering by featuresModel based time series clusteringTime series clustering by dependenceIntroduction to clusteringThe problemApproachesPackages for time series clusteringTSclust: Package for Time Series Clustering.Montero, P and Vilar, J.A. (2014) TSclust: An R Package for TimeSeries Clustering. Journal of Statistical Software, 62(1), 1-43.dtwclust: Time Series Clustering Along withOptimizations for the Dynamic Time Warping és M. AlonsoTime series clustering

IntroductionTime series clustering by featuresModel based time series clusteringTime series clustering by dependenceIntroductionRaw data clusteringAutocorrelation clusteringSpectral domain clusteringExtreme value clusteringTime series clustering by featuresX 1 , X 2 , . . . , X n },We have a set of univariate time series, X {Xwhere X i (Xi,1 , Xi,2 , . . . , Xi,T ) and we want to cluster them.Starting pointTo choose a metric to assess the dissimilarity between two timeseries.This metric plays a crucial role in time series clustering.Andrés M. AlonsoTime series clustering

IntroductionTime series clustering by featuresModel based time series clusteringTime series clustering by dependenceIntroductionRaw data clusteringAutocorrelation clusteringSpectral domain clusteringExtreme value clusteringTime series clustering by featuresConceptually most of the dissimilarity criteria proposed for timeseries clustering lead to a notion of similarity relying on:Proximity between raw series data.Proximity between features of the time series.Proximity between underlying generating processes.Raw series data can be considered as naïve features of thetime series.Andrés M. AlonsoTime series clustering

IntroductionTime series clustering by featuresModel based time series clusteringTime series clustering by dependenceIntroductionRaw data clusteringAutocorrelation clusteringSpectral domain clusteringExtreme value clusteringRaw data clusteringWord NO0.60.5It consists on measure thedistance, D, between two timeseries using an element-wiseapproach:0.40.30.20.10 0.1 0.2 0.3010002000300040005000600070008000600070008000X i , X j ) d (XX i X j ),D(XWord YES0.5where d is a distance onRT .0.40.30.2This approach has a drawbacksince it requires that the series tobe aligned.0.10 0.1 0.2 0.3 0.4Andrés M. Alonso010002000Time series clustering300040005000

IntroductionTime series clustering by featuresModel based time series clusteringTime series clustering by dependenceIntroductionRaw data clusteringAutocorrelation clusteringSpectral domain clusteringExtreme value clusteringRaw data clustering110.80.80.60.60.40.40.20.200 0.2 0.2 0.4 0.4 0.60100020003000400050006000700080009000 00 0.2 0.2 0.4 0.4 0.600100020003000400050006000700080009000 0.6Euclidean distance matrix: 0 14.1777 12.3613 12.7610 14.17770 10.5822 11.3088 12.3613 10.582208.0949 12.7610 11.30888.09490Andrés M. AlonsoTime series clustering

IntroductionTime series clustering by featuresModel based time series clusteringTime series clustering by dependenceIntroductionRaw data clusteringAutocorrelation clusteringSpectral domain clusteringExtreme value clusteringRaw data clusteringDynamic time warping distance matrix: 0 43.4941 70.2141 70.1087 43.49410 75.1402 78.3575 70.2141 75.14020 36.7705 70.1087 78.3575 36.77050 Datafile yesnot.xls Andrés M. AlonsoTime series clustering

IntroductionTime series clustering by featuresModel based time series clusteringTime series clustering by dependenceIntroductionRaw data clusteringAutocorrelation clusteringSpectral domain clusteringExtreme value clusteringTime series clustering by featuresRaw data clustering it is aninteresting approach when weexpect the differences in the levelof the series.Two AR(1) and two MA(1) timeseries:43210 1 2 3Euclidean distance matrix: 0 51.206 48.73551.18451.206051.47250.70948.73551.472051.669 4 5 51.18450.709 51.669 0Andrés M. 05006007008009001000543210 1 2 3 4Time series clustering

IntroductionTime series clustering by featuresModel based time series clusteringTime series clustering by dependenceIntroductionRaw data clusteringAutocorrelation clusteringSpectral domain clusteringExtreme value clusteringAutocorrelation clusteringBut, in this case, autocorrelation functions are a “good”clustering criteria:1Sample AutocorrelationSample Autocorrelation10.50 g12141618201Sample AutocorrelationSample Autocorrelation0 0.52010.50 0.50.50246810Lag1214161820Andrés M. Alonso0.50 0.5Time series clustering

IntroductionTime series clustering by featuresModel based time series clusteringTime series clustering by dependenceIntroductionRaw data clusteringAutocorrelation clusteringSpectral domain clusteringExtreme value clusteringAutocorrelation clusteringAssume that we have two stationary series, X and Y : H0 : ρ X (ρX ,1 , ρX ,2 , . . . , ρX ,m )′ ρ Y (ρY ,1 , ρY ,2 , . . . , ρY ,m )′H1 : ρ X (ρX ,1 , ρX ,2 , . . . , ρX ,m )′ 6 ρ Y (ρY ,1 , ρY ,2 , . . . , ρY ,m )′where ρX ,k and ρY ,k are the corresponding autocorrelations.We can use the following test statistics:Tn,m nXmk 1(rX ,k rY ,k )2 ,where rX ,k and rY ,k are the estimated autocorrelations.Andrés M. AlonsoTime series clustering

IntroductionTime series clustering by featuresModel based time series clusteringTime series clustering by dependenceIntroductionRaw data clusteringAutocorrelation clusteringSpectral domain clusteringExtreme value clusteringAutocorrelation clusteringTn,m can be used as a distance measure.It is valid/correct when the series are independent.But its distribution changes if the series X and Y arecross-dependent.So, we need a procedure to derive the distribution of Tn,m inorder to be able to evaluate if a given value tn,m is significantlydifferent from zero.Andrés M. AlonsoTime series clustering

IntroductionRaw data clusteringAutocorrelation clusteringSpectral domain clusteringExtreme value clusteringIntroductionTime series clustering by featuresModel based time series clusteringTime series clustering by dependenceAutocorrelation clusteringSubsampling algorithm to obtain the distribution of Tn,m :1Let X j (Xj , Xj 1 , . . . , Xj l 1 ) and Y j (Yj , Yj 1 , . . . , Yj l 1 )with j 1, 2, . . . , n l 1 be the subsamples of l consecutiveobservations from X and Y , respectively. We calculate the j-th(j)subsampling statistic, Tl,m , by:(j)Tl,m l23Xmk 1(bρXj ,k rρbj ,k )2 ,where ρbXj ,k and ρbYj ,k are the k -th estimated autocorrelationsusing the subsamples X j and Y j , respectively.b n,l (·).We calculate gn,l (1 α) the 1 α quantile of GWe reject H0 if and only if Tn,m gn,l (1 α).Andrés M. AlonsoTime series clustering

IntroductionTime series clustering by featuresModel based time series clusteringTime series clustering by dependenceIntroductionRaw data clusteringAutocorrelation clusteringSpectral domain clusteringExtreme value clusteringAutocorrelation clustering - ExampleCode and description of interest rate series:CodeDescriptionBME08040203001Reference rates/Banks/Short term prime rateBME08040203002Banks lending rates/Current account overdrafts/EffectiverateBME08040203003Banks lending rates/Exceed in credit card/Effective rateBME08040203005Reference rates/Saving banks/Short term prime rateBME08040203006Savings banks lending rates/Current account overdrafts/Effective rateBME08040203007Savings banks lending rates/Credit account overdrafts/Effective rateAndrés M. AlonsoTime series clustering

IntroductionTime series clustering by featuresModel based time series clusteringTime series clustering by dependenceIntroductionRaw data clusteringAutocorrelation clusteringSpectral domain clusteringExtreme value clusteringAutocorrelation clustering - ExampleIt is clear that series are dependent.Interest rates (Banks). January 1985 December 200130252015Short term prime rateCurrent account overdraftsCredit account overdrafts1050050100150200250200250Interest rates (Saving banks). January 1985 December 2001302520151050050100Andrés M. Alonso150Time series clustering

IntroductionTime series clustering by featuresModel based time series clusteringTime series clustering by dependenceIntroductionRaw data clusteringAutocorrelation clusteringSpectral domain clusteringExtreme value clusteringAutocorrelation clustering - ExampleAssociated p-value for each pair stationary 9090.0000.262-Alonso, A.M. and Maharaj, E.A. (2006) Comparison of time series usingsubsampling, Computational Statistics and Data Analysis, 50,2589–2599.Datafile BME.xls Andrés M. AlonsoTime series clustering

IntroductionRaw data clusteringAutocorrelation clusteringSpectral domain clusteringExtreme value clusteringIntroductionTime series clustering by featuresModel based time series clusteringTime series clustering by dependenceAutocorrelation clustering - Example10.90.81 - ndrés M. AlonsoTime series clustering

IntroductionTime series clustering by featuresModel based time series clusteringTime series clustering by dependenceIntroductionRaw data clusteringAutocorrelation clusteringSpectral domain clusteringExtreme value clusteringSpectral domain clusteringAssume that we have two stationary series, X and Y withspectral densitiesX λX γX ,k exp( ikω)k andλY X k γY ,k exp( ikω)As before, we are interested on testing: H0 : λX (ω) λY (ω) (0 ω π).H1 : λX (ω) 6 λY (ω)Andrés M. AlonsoTime series clustering

IntroductionTime series clustering by featuresModel based time series clusteringTime series clustering by dependenceIntroductionRaw data clusteringAutocorrelation clusteringSpectral domain clusteringExtreme value clusteringSpectral domain clusteringDiggle y Fisher (1991) propose to compare the integratedperiodograms:FX (ωj ) XjIX (ωi )/XmIX (ωi ),FY (ωj ) XjIY (ωi )/XmIY (ωi ),andi 1i 1i 1i 1where ωi 2πi/n, IX (·) is the periodogram, andm ⌈(n 1)/2⌉.We can use the following test statistics:Dm sup FX (ω) FY (ω) or Wm Andrés M. AlonsoRπ0(FX (ω) FY (ω))2 d F̄ (ω).Time series clustering

IntroductionTime series clustering by featuresModel based time series clusteringTime series clustering by dependenceIntroductionRaw data clusteringAutocorrelation clusteringSpectral domain clusteringExtreme value clusteringSpectral domain clusteringWe retake the word classification problem (boat versus 00011111111000000000000000000000000101100Alonso, A.M., Casado, D., Lopez-Pintado, S. and Romo, J. (2014)Robust Functional Classification for Time Series, Journal ofClassification, 31, 325–350.Andrés M. AlonsoTime series clustering

IntroductionTime series clustering by featuresModel based time series clusteringTime series clustering by dependenceIntroductionRaw data clusteringAutocorrelation clusteringSpectral domain clusteringExtreme value clusteringExtreme value clusteringIn some applications, the main interest is the highest (lowest)level that we can observe in a time series in a given period.To build dykes you need to know the maximum level of thesea in the area that you want to protect.Rising sea levels are of great concern to coastalcommunities around the world.To prevent the effect of temperatures in health, you needinformation about the highest temperature.In finance/insurance, the lowest values correspond tocapital losses.Andrés M. AlonsoTime series clustering

IntroductionTime series clustering by featuresModel based time series clusteringTime series clustering by dependenceIntroductionRaw data clusteringAutocorrelation clusteringSpectral domain clusteringExtreme value clusteringExtreme value clusteringThe Generalized Extreme Value distribution, as the followingform:( 1 )X µ ξG(x) exp 1 ξ(1)σdefined on {x : 1 ξ( x µσ ) 0} where µ , σ 0,and ξ ,The three parameters µ, σ and ξ are the location, scaleand shape parameters, respectively where ξ determine

Caiado, J., Maharaj, E. A., and D’Urso, P. (2015) Time series clustering. In: Handbook of cluster analysis. Chapman and Hall/CRC. Andrés M. Alonso Time series clustering. Introduction Time series clustering by features Model based time series clustering Time series clustering by dependence Introduction to clustering

Related Documents:

Chapter 4 Clustering Algorithms and Evaluations There is a huge number of clustering algorithms and also numerous possibilities for evaluating a clustering against a gold standard. The choice of a suitable clustering algorithm and of a suitable measure for the evaluation depen

preprocessing step for quantum clustering , which leads to reduction in the algorithm complexity and thus running it on big data sets is feasible. Second, a newer version of COMPACT, with implementation of support vector clustering, and few enhancements for the quantum clustering algorithm. Third, an implementation of quantum clustering in Java.

6. A sample social network graph 7. Influence factor on for information query 8. IF calculation using network data 9. Functional component of clustering 10. Schema design for clustering 11. Sample output of Twitter accounts crawler 12. Flow diagram of the system 13. Clustering of tweets based on tweet data 14. Clustering of users based on .

Data mining, Algorithm, Clustering. Abstract. Data mining is a hot research direction in information industry recently, and clustering analysis is the core technology of data mining. Based on the concept of data mining and clustering, this paper summarizes and compares the research status and progress of the five traditional clustering

clustering engines is that they do not maintain their own index of documents; similar to meta search engines [Meng et al. 2002], they take the search results from one or more publicly accessible search engines. Even the major search engines are becoming more involved in the clustering issue. Clustering by site (a form of clustering that

SMB_Dual Port, SMB_Cable assembly, Waterproof Cap RF Connector 1.6/5.6 Series,1.0/2.3 Series, 7/16 Series SMA Series, SMB Series, SMC Series, BT43 Series FME Series, MCX Series, MMCX Series, N Series TNC Series, UHF Series, MINI UHF Series SSMB Series, F Series, SMP Series, Reverse Polarity

We create a general framework for ontology-driven subspace clustering. This framework can be most beneficial for the hierar-chically organized subspace clustering algorithm and ontology hi-erarchy, i.e., it is independent of the clustering algorithms and on-tology application domain. To demonstrate the usefulness of this

Language Policy in the Russian Federation: language diversity and national identity by Marc Leprêtre Abstract This paper gives an overview on the different language policies implemented in the Russian Federation, stressing the relevance of the historical background, the relations between language and nationalism, and language promotion as a tool for preventing inter-ethnic conflicts and for .