Social-STGCNN: A Social Spatio-Temporal Graph Convolutional Neural .


Social-STGCNN: A Social Spatio-Temporal Graph Convolutional Neural Network for Human Trajectory Prediction

Abduallah Mohamed1, Kun Qian1, Mohamed Elhoseiny2,3,**, Christian Claudel1,**
1 The University of Texas at Austin, 2 KAUST, 3 Stanford
udel}@utexas.edu, mohamed.elhoseiny@kaust.edu.sa

Abstract

Better machine understanding of pedestrian behaviors enables faster progress in modeling interactions between agents such as autonomous vehicles and humans. Pedestrian trajectories are not only influenced by the pedestrian itself but also by interaction with surrounding objects. Previous methods modeled these interactions by using a variety of aggregation methods that integrate different learned pedestrian states. We propose the Social Spatio-Temporal Graph Convolutional Neural Network (Social-STGCNN), which removes the need for aggregation methods by modeling the interactions as a graph. Our results show an improvement over the state of the art by 20% on the Final Displacement Error (FDE) and an improvement on the Average Displacement Error (ADE) with 8.5 times fewer parameters and up to 48 times faster inference speed than previously reported methods. In addition, our model is data efficient, and exceeds the previous state of the art on the ADE metric with only 20% of the training data. We propose a kernel function to embed the social interactions between pedestrians within the adjacency matrix. Through qualitative analysis, we show that our model inherited social behaviors that can be expected between pedestrian trajectories. Code is available at https://github.com/abduallahmohamed/Social-STGCNN .

1. Introduction

Predicting pedestrian trajectories is of major importance for several applications including autonomous driving and surveillance systems. In autonomous driving, an accurate prediction of pedestrian trajectories enables the controller to plan the motion of the vehicle ahead in an adversarial environment.
For example, it is a critical component for collision avoidance systems or emergency braking systems [2, 18, 16, 22]. In surveillance systems, forecasting pedestrian trajectories is critical in helping identify suspicious activities [15, 28, 20].

**Equal advising.

Figure 1. Pedestrian future trajectories prediction using the Social-STGCNN model. The social interactions between pedestrians and their temporal dynamics are represented by a spatio-temporal graph. We predict the future trajectories in a single pass. (Panel labels: Spatio-temporal graph of observed trajectories; Single pass; Predicted future trajectories.)

The trajectory of a pedestrian is challenging to predict, due to the complex interactions between the pedestrian and the environment. Objects potentially influencing the trajectory of a pedestrian include physical obstacles such as trees or roads, and moving objects including vehicles and other pedestrians. According to [19], 70% of pedestrians tend to walk in groups. The interactions between pedestrians are mainly driven by common sense and social conventions. The complexity of pedestrian trajectory prediction comes from different social behaviors such as walking in parallel with others, walking within a group, collision avoidance and merging from different directions into a specific point. Another source of complexity is the randomness of the motion, given that the target destination and intended path of the pedestrian are unknown.

The social attributes of pedestrian motions encouraged researchers in this area to focus on inventing deep methods to model social interactions between pedestrians. In the Social-LSTM [1] article, a deep learning based model is applied to predict pedestrian trajectories by modeling each pedestrian trajectory via a recurrent deep model. The outputs of the recurrent models are made to interact with each other via a pooling layer. Several articles [17, 14, 30] followed this direction. Social-LSTM [1] modeled the pedestrian trajectories as a bi-variate Gaussian distribution, while some others aimed at predicting deterministic trajectories. Another direction is to use Generative Adversarial Networks (GANs) for this task, assuming that the distribution of trajectories is multi-modal. Several articles [6, 23, 13] used GANs to predict distributions of future trajectories. For these models, generators are designed using recurrent neural networks, and again, aggregation methods are relied upon to extract the social interactions between pedestrians. We argue that a limitation of earlier articles comes from the use of recurrent architectures, which are parameter inefficient and expensive in training [3]. We overcome this limitation through the use of temporal convolutional architectures.

In addition to the limitation of recurrent architectures, aggregation layers used in earlier works can also limit their performance. The aggregation layer takes the hidden states of the recurrent units as inputs. It is expected to assimilate a global representation of the scene, since each recurrent unit models a pedestrian trajectory. However, there are two issues with this type of aggregation. First, the aggregation in feature states is neither intuitive nor direct in modeling interactions between people, as the physical meaning of feature states is difficult to interpret. Second, since the aggregation mechanisms are usually based on heuristics like pooling, they could fail to model interactions between pedestrians correctly. For example, the pooling operation is known to be leaky in information [26]. In order to directly capture the interactions between pedestrians and predict future paths from these, the recent article Social-BiGAT [10] relies on a graph representation to model social interactions. As the topology of graphs is a natural way to represent social interactions between pedestrians in a scene, we argue that it is a more direct, intuitive and efficient way to model pedestrian interactions than aggregation based methods.
We also argue that Social-BiGAT did not make the most of the graph representation, since it used the graph only as a pooling mechanism for the recurrent unit states. Social-STGCNN benefits more from the graph representation by modeling the scene as a spatio-temporal graph and operating on it directly.

We designed Social-STGCNN to overcome the two aforementioned limitations. First, we model the pedestrian trajectories from the start as a spatio-temporal graph to replace the aggregation layers. The graph edges model the social interactions between the pedestrians. We propose a weighted adjacency matrix in which the kernel function quantitatively measures the influence between pedestrians. To address issues associated with recurrent units, our model operates on the spatio-temporal graph using graph Convolutional Neural Networks (CNNs) and temporal CNNs. This allows our model to predict the whole sequence in a single shot. Due to the above design, our model outperforms previous models in terms of prediction accuracy, parameter count, inference speed and data efficiency.

2. Related work

The recent interest in autonomous driving has led to an increasing focus on pedestrian trajectory prediction. Recently, new deep models have been making promising progress on this task. In this section, we give a brief review of related work.

Human trajectory prediction using deep models. Social-LSTM [1] is one of the earliest deep models focusing on pedestrian trajectory prediction. Social-LSTM uses a recurrent network to model the motion of each pedestrian, then aggregates the recurrent outputs using a pooling mechanism and predicts the trajectory afterwards. Social-LSTM assumes that pedestrian trajectories follow a bi-variate Gaussian distribution, an assumption that we also follow in our model. Later works such as Peek Into The Future (PIF) [14] and State-Refinement LSTM (SR-LSTM) [30] extend [1] with visual features and new pooling mechanisms to improve the prediction precision.
It is noticeable that SR-LSTM [30] weighs the contribution of each pedestrian to others via a weighting mechanism. It is similar to the idea in Social-BiGAT [10], which uses an attention mechanism to weigh the contribution of the recurrent states that represent the trajectories of pedestrians. Based on the assumption that pedestrian trajectories follow multi-modal distributions, Social-GAN [6] extends Social-LSTM [1] into a Recurrent Neural Network (RNN) based generative model. Sophie [23] used CNNs to extract the features from the scene as a whole, then a two-way attention mechanism is used per pedestrian. Later on, Sophie concatenates the attention outputs with the visual CNN outputs, then a Long Short Term Memory (LSTM) autoencoder based generative model is used to generate the future trajectories. The work CGNS [13] is similar to Sophie [23] in terms of architecture but uses Gated Recurrent Units (GRUs) instead of LSTMs. We notice that most previous works revolve around two ideas: model each pedestrian motion using a recurrent net, and combine the recurrent nets using a pooling mechanism. The recent work Social-BiGAT [10] relies on graph attention networks to model the social interactions between pedestrians. The LSTM outputs are fed to the graph in Social-BiGAT. One key difference between our model Social-STGCNN and Social-BiGAT is that we directly model pedestrian trajectories as a graph from the beginning, where we give meaningful values for vertices.

Recent Advancements in Graph CNNs. Graph CNNs were introduced by [8], which extends the concept of CNNs to graphs. The convolution operation defined over graphs is a weighted aggregation of target node attributes with the attributes of its neighbor nodes. It is similar to CNNs but the convolution operation is taken over the adjacency matrix of the graphs. The works [9, 4, 24] extend graph CNNs to other applications such as matrix completion and Variational Auto Encoders.
One of the developments related to our work is ST-GCNN [27]. ST-GCNN is a spatio-temporal Graph

CNN that was originally designed to solve the skeleton-based action recognition problem. Even though the architecture itself was designed to work on a classification task, we adapt it to suit our problem. In our work, ST-GCNNs extract both spatial and temporal information from the graph, creating a suitable embedding. We then operate on this embedding to predict the trajectories of pedestrians. Details are shown in section 4.

Temporal Convolutional Neural Networks (TCNs). Starting from [3], the debate over the use of Recurrent Neural Networks (RNNs) versus temporal CNNs for sequential data modeling is highlighted. Introduced by [3], Temporal Convolutional Neural Networks (TCNs) take stacked sequential data as input and predict a sequence as a whole. This could alleviate the problem of error accumulation in sequential predictions made by RNNs. What is more, TCNs are smaller in size compared to RNNs. We were inspired by TCNs and designed a temporal CNN model that extends the capabilities of ST-GCNNs. More details about this are in the model description, section 4.

3. Problem Formulation

Given a set of $N$ pedestrians in a scene with their corresponding observed positions $tr_o^n$, $n \in \{1, \dots, N\}$ over a time period $T_o$, we need to predict the upcoming trajectories $tr_p^n$ over a future time horizon $T_p$. For a pedestrian $n$, we write the corresponding trajectory to be predicted as $tr_p^n = \{ p_t^n = (x_t^n, y_t^n) \mid t \in \{1, \dots, T_p\} \}$, where $(x_t^n, y_t^n)$ are random variables describing the probability distribution of the location of pedestrian $n$ at time $t$ in the 2D space. We make the assumption that $(x_t^n, y_t^n)$ follows a bi-variate Gaussian distribution such that $p_t^n \sim \mathcal{N}(\mu_t^n, \sigma_t^n, \rho_t^n)$. Besides, we denote the predicted trajectory as $\hat{p}_t^n$, which follows the estimated bi-variate distribution $\mathcal{N}(\hat{\mu}_t^n, \hat{\sigma}_t^n, \hat{\rho}_t^n)$. Our model is trained to minimize the negative log-likelihood, which is defined as:

$$ L^n(\mathbf{W}) = -\sum_{t=1}^{T_p} \log\left( \mathbb{P}\left( p_t^n \mid \hat{\mu}_t^n, \hat{\sigma}_t^n, \hat{\rho}_t^n \right) \right) \tag{1} $$

in which $\mathbf{W}$ includes all the trainable parameters of the model, $\mu_t^n$ is the mean of the distribution, $\sigma_t^n$ is the variances and $\rho_t^n$ is the correlation.

4. The Social-STGCNN Model

4.1. Model Description

The Social-STGCNN model consists of two main parts: the Spatio-Temporal Graph Convolution Neural Network (ST-GCNN) and the Time-Extrapolator Convolution Neural Network (TXP-CNN). The ST-GCNN conducts spatio-temporal convolution operations on the graph representation of pedestrian trajectories to extract features. These features are a compact representation of the observed pedestrian trajectory history. TXP-CNN takes these features as inputs and predicts the future trajectories of all pedestrians as a whole. We use the name Time-Extrapolator because TXP-CNNs are expected to extrapolate future trajectories through convolution operations. Figure 2 illustrates the overview of the model.

Graph Representation of Pedestrian Trajectories. We first introduce the construction of the graph representation of pedestrian trajectories. We start by constructing a set of spatial graphs $G_t$ representing the relative locations of pedestrians in a scene at each time step $t$. $G_t$ is defined as $G_t = (V_t, E_t)$, where $V_t = \{ v_t^i \mid i \in \{1, \dots, N\} \}$ is the set of vertices of the graph $G_t$. The observed location $(x_t^i, y_t^i)$ is the attribute of $v_t^i$. $E_t$ is the set of edges within graph $G_t$, which is expressed as $E_t = \{ e_t^{ij} \mid i, j \in \{1, \dots, N\} \}$, where $e_t^{ij} = 1$ if $v_t^i$ and $v_t^j$ are connected, and $e_t^{ij} = 0$ otherwise. In order to model how strongly two nodes influence each other, we attach a value $a_t^{ij}$, computed by some kernel function, to each $e_t^{ij}$. The $a_t^{ij}$ values are organized into the weighted adjacency matrix $A_t$. We introduce $a_{sim,t}^{ij}$ as a kernel function to be used within the adjacency matrix $A_t$. $a_{sim,t}^{ij}$ is defined in equation 2. We discuss the details of the $A_t$ kernel function later in section 6.1.

$$ a_{sim,t}^{ij} = \begin{cases} 1 / \| v_t^i - v_t^j \|_2, & \| v_t^i - v_t^j \|_2 \neq 0 \\ 0, & \text{otherwise.} \end{cases} \tag{2} $$

Graph Convolution Neural Network. With the graph representation of pedestrian trajectories, we introduce the spatial convolution operation defined on graphs. For convolution operations defined on 2D grid maps or feature maps, the convolution operation is shown in equation 3.

$$ z^{(l+1)} = \sigma\left( \sum_{h=1}^{k} \sum_{w=1}^{k} p(z^{(l)}, h, w) \cdot \mathbf{w}^{(l)}(h, w) \right) \tag{3} $$

where $k$ is the kernel size, $p(\cdot)$ is the sampling function which aggregates the information of neighbors centered around $z$ [5], $\sigma$ is an activation function and $(l)$ indicates layer $l$.

The graph convolution operation is defined as:

$$ v^{i(l+1)} = \sigma\left( \frac{1}{\Omega} \sum_{v^{j(l)} \in B(v^{i(l)})} p(v^{i(l)}, v^{j(l)}) \cdot \mathbf{w}(v^{i(l)}, v^{j(l)}) \right) \tag{4} $$

where $\frac{1}{\Omega}$ is a normalization term, $B(v^i) = \{ v^j \mid d(v^i, v^j) \le D \}$ is the neighbor set of vertex $v^i$ and $d(v^i, v^j)$ denotes the shortest path connecting $v^i$ and $v^j$. Note that $\Omega$ is the cardinality of the neighbor set. Interested readers are referred to [8, 27] for more detailed explanations and reasoning.
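As an illustration of the graph construction above, the following is a minimal pure-Python sketch of the kernel in equation 2, together with the symmetric normalization from [8] and the layer form that Section 4.2 describes. The function names (`sim_adjacency`, `normalize_adjacency`, `graph_conv`), the list-of-tuples data layout and the fixed PReLU slope are assumptions made for this example; they are not taken from the authors' released code.

```python
import math


def sim_adjacency(positions):
    """Weighted adjacency A_t for one time step, following equation 2.

    positions: list of (x, y) tuples, one per pedestrian (the vertices v_t^i).
    Entry [i][j] is 1 / ||v_i - v_j||_2, or 0 when the distance is exactly
    zero (this also covers the diagonal i == j).
    """
    n = len(positions)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dist = math.dist(positions[i], positions[j])
            if dist != 0.0:
                adj[i][j] = 1.0 / dist
    return adj


def normalize_adjacency(adj):
    """Symmetric normalization: Lambda^(-1/2) (A + I) Lambda^(-1/2)."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    # The self-loop and non-negative weights guarantee every degree is >= 1.
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def graph_conv(norm_adj, features, weight, slope=0.25):
    """One layer sigma(norm_adj @ features @ weight).

    sigma is a PReLU-style activation; the slope is learnable in practice
    and is fixed here only to keep the sketch short (assumption).
    """
    out = matmul(matmul(norm_adj, features), weight)
    return [[x if x >= 0 else slope * x for x in row] for row in out]


if __name__ == "__main__":
    frame = [(0.0, 0.0), (3.0, 4.0), (0.0, 0.5)]  # three pedestrians
    a_t = normalize_adjacency(sim_adjacency(frame))
    identity = [[1.0, 0.0], [0.0, 1.0]]
    print(graph_conv(a_t, [list(p) for p in frame], identity))
```

Closer pedestrians receive larger weights, which matches the intuition discussed in section 6.1; stacking `sim_adjacency` over all observed frames would give the set of matrices $\{A_1, \dots, A_T\}$.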

Figure 2. The Social-STGCNN Model. Given $T$ frames, we construct the spatio-temporal graph representing $G = (V, A)$. Then $G$ is forwarded through the Spatio-Temporal Graph Convolution Neural Networks (ST-GCNNs), creating a spatio-temporal embedding. Following this, the TXP-CNNs predict future trajectories. $P$ is the dimension of pedestrian position, $N$ is the number of pedestrians, $T$ is the number of time steps and $\hat{P}$ is the dimension of the embedding coming from ST-GCNN.

Spatio-Temporal Graph Convolution Neural Network (ST-GCNNs). ST-GCNNs extend spatial graph convolution to spatio-temporal graph convolution by defining a new graph $G$ whose attributes are the set of the attributes of $G_t$. $G$ incorporates the spatio-temporal information of pedestrian trajectories. It is worth noticing that the topology of $G_1, \dots, G_T$ is the same, while different attributes are assigned to $v_t^i$ when $t$ varies. Thus, we define $G$ as $(V, E)$, in which $V = \{ v^i \mid i \in \{1, \dots, N\} \}$ and $E = \{ e^{ij} \mid i, j \in \{1, \dots, N\} \}$. The attributes of vertex $v^i$ in $G$ are the set of $v_t^i$, $t \in \{0, \dots, T\}$. In addition, the weighted adjacency matrix $A$ corresponding to $G$ is the set of $\{A_1, \dots, A_T\}$. We denote the embedding resulting from ST-GCNN as $\bar{V}$.

Time-Extrapolator Convolution Neural Network (TXP-CNN). The functionality of ST-GCNN is to extract a spatio-temporal node embedding from the input graph. However, our objective is to predict further steps in the future. We also aim for a stateless system, and this is where the TXP-CNN comes into play. TXP-CNN operates directly on the temporal dimension of the graph embedding $\bar{V}$ and expands it as needed for prediction. Because TXP-CNN depends on convolution operations on the feature space, it has fewer parameters than recurrent units. One property to note is that the TXP-CNN layer is not permutation invariant: changes in the graph embedding right before TXP-CNN lead to different results.
Other than this, if the order of pedestrians is permuted starting from the input to Social-STGCNN, then the predictions are invariant.

Overall, there are two main differences between Social-STGCNN and ST-GCNN [27]. First, Social-STGCNN constructs the graph in a totally different way from ST-GCNN, with a novel kernel function. Second, beyond the spatio-temporal graph convolution layers, we added flexibility in manipulating the time dimension using the TXP-CNN. ST-GCNN was originally designed for classification. By using TXP-CNN, our model is able to use the graph embedding originating from ST-GCNN to predict the future trajectories.

4.2. Implementing Social-STGCNN

Several steps are necessary to implement the model correctly. We first normalize the adjacency matrix for ease of learning. The adjacency matrix $A$ is a stack of $\{A_1, \dots, A_T\}$; we symmetrically normalize each $A_t$ using the following form [8]:

$$ A_t = \Lambda_t^{-\frac{1}{2}} \hat{A}_t \Lambda_t^{-\frac{1}{2}} $$

where $\hat{A}_t = A_t + I$ and $\Lambda_t$ is the diagonal node degree matrix of $\hat{A}_t$. We use $\hat{A}$ and $\Lambda$ to denote the stack of $\hat{A}_t$ and $\Lambda_t$ respectively. The normalization of the adjacency matrix is essential for the graph CNN to work properly, as outlined in [8]. We denote the vertex values at time step $t$ and network layer $l$ as $V_t^{(l)}$. Suppose $V^{(l)}$ is the stack of $V_t^{(l)}$. With the above definitions, we can now implement the ST-GCNN layers defined in equation 4 as follows:

$$ f(V^{(l)}, A) = \sigma\left( \Lambda^{-\frac{1}{2}} \hat{A} \Lambda^{-\frac{1}{2}} V^{(l)} \mathbf{W}^{(l)} \right) \tag{5} $$

where $\mathbf{W}^{(l)}$ is the matrix of trainable parameters at layer $l$.

After applying the ST-GCNN, we have features that compactly represent the graph. The TXP-CNN receives the features $\bar{V}$ and treats the time dimension as feature channels. The TXP-CNN is made up of a series of residual connected CNNs. Only the first layer in TXP-CNN does not have a residual connection, because its input $\bar{V}$ from the ST-GCNNs and its output differ in dimension: the former covers the observed samples, the latter the samples to be predicted.

5. Datasets and Evaluation Metrics

The model is trained on two human trajectory prediction datasets: ETH [21] and UCY [11]. ETH contains two scenes named ETH and HOTEL, while UCY contains three scenes named ZARA1, ZARA2 and UNIV. The trajectories in the datasets are sampled every 0.4 seconds. Our method of

training follows the same strategy as Social-LSTM [1]. In Social-LSTM, the model was trained on a portion of a specific dataset and tested against the rest and validated versus the other four datasets. When being evaluated, the model observes the trajectory for 3.2 seconds, which corresponds to 8 frames, and predicts the trajectories for the next 4.8 seconds, which corresponds to 12 frames.

Two metrics are used to evaluate model performance: the Average Displacement Error (ADE) [21] defined in equation 6 and the Final Displacement Error (FDE) [1] defined in equation 7. Intuitively, ADE measures the average prediction performance along the trajectory, while the FDE considers only the prediction precision at the end points. Since Social-STGCNN generates a bi-variate Gaussian distribution as the prediction, to compare a distribution with a certain target value, we follow the evaluation method used in Social-LSTM [1], in which 20 samples are generated based on the predicted distribution. Then the ADE and FDE are computed using the closest sample to the ground truth. This method of evaluation was adopted by several works such as Social-GAN [6] and many more.

$$ \text{ADE} = \frac{\sum_{n \in N} \sum_{t \in T_p} \| \hat{p}_t^n - p_t^n \|_2}{N \times T_p} \tag{6} $$

$$ \text{FDE} = \frac{\sum_{n \in N} \| \hat{p}_t^n - p_t^n \|_2}{N}, \quad t = T_p \tag{7} $$

6. Experiments and Results Analysis

Model configuration and training setup. Social-STGCNN is composed of a series of ST-GCNN layers followed by TXP-CNN layers. We use PReLU [7] as the activation function $\sigma$ across our model. We set a training batch size of 128 and the model was trained for 250 epochs using Stochastic Gradient Descent (SGD). The initial learning rate is 0.01, and it is changed to 0.002 after 150 epochs. According to our ablation study in Table 1, the best model to use has one ST-GCNN layer and five TXP-CNN layers. Furthermore, it is noticeable that when the number of ST-GCNN layers increases, the model performance decreases. Apparently, this problem of going deep using graph CNNs was noticed by the work in [12], in which they proposed a method to solve it.
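The scoring of a predicted bi-variate Gaussian against the ground truth (the loss of equation 1, and the best-of-20 ADE/FDE protocol of equations 6 and 7) can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the authors' evaluation code: the data layout, the function names, the treatment of the spread parameters as standard deviations, the summation of the loss over pedestrians, and the choice to take the minimum per metric over whole-scene samples are all assumptions, since the text only says that the closest of 20 samples is used.

```python
import math
import random

# Assumed layout (hypothetical, for this sketch only):
#   params[n][t] = (mu_x, mu_y, sd_x, sd_y, rho) for pedestrian n, step t
#   truth[n][t]  = (x, y) ground-truth position


def bivariate_nll(params, truth):
    """Negative log-likelihood of the truth, summed over pedestrians and steps."""
    total = 0.0
    for ped_params, ped_truth in zip(params, truth):
        for (mu_x, mu_y, sd_x, sd_y, rho), (x, y) in zip(ped_params, ped_truth):
            dx, dy = (x - mu_x) / sd_x, (y - mu_y) / sd_y
            one_minus_rho2 = 1.0 - rho * rho
            quad = (dx * dx - 2.0 * rho * dx * dy + dy * dy) / (2.0 * one_minus_rho2)
            log_norm = math.log(2.0 * math.pi * sd_x * sd_y * math.sqrt(one_minus_rho2))
            total += quad + log_norm
    return total


def ade(pred, truth):
    """Average displacement error over all pedestrians and steps (equation 6)."""
    total = sum(math.dist(p, g)
                for pn, gn in zip(pred, truth) for p, g in zip(pn, gn))
    return total / (len(truth) * len(truth[0]))


def fde(pred, truth):
    """Final displacement error at the last predicted step (equation 7)."""
    return sum(math.dist(pn[-1], gn[-1]) for pn, gn in zip(pred, truth)) / len(truth)


def sample_trajectories(params, rng):
    """Draw one (x, y) per pedestrian and step from the predicted Gaussians."""
    sample = []
    for ped in params:
        traj = []
        for mu_x, mu_y, sd_x, sd_y, rho in ped:
            z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
            x = mu_x + sd_x * z1
            # Correlate y with x through rho.
            y = mu_y + sd_y * (rho * z1 + math.sqrt(1.0 - rho * rho) * z2)
            traj.append((x, y))
        sample.append(traj)
    return sample


def best_of_k(params, truth, k=20, seed=0):
    """Lowest ADE and lowest FDE among k sampled scenes (assumed protocol)."""
    rng = random.Random(seed)
    samples = [sample_trajectories(params, rng) for _ in range(k)]
    return min(ade(s, truth) for s in samples), min(fde(s, truth) for s in samples)
```

With 8 observed and 12 predicted frames, `truth[n]` would hold 12 positions per pedestrian; a per-pedestrian minimum instead of a per-scene minimum is an equally plausible reading of the protocol and would only change `best_of_k`.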
Unfortunately, the solution proposed in [12] does not extend to temporal graphs.

ST-GCNN layers   TXP-CNN layers:
                 1             3             5             7
1                0.47 / 0.78   0.47 / 0.84   0.44 / 0.75   0.48 / 0.87
3                0.59 / 1.02   0.52 / 0.92   0.54 / 0.93   0.54 / 0.92
5                0.62 / 1.07   0.57 / 0.98   0.59 / 1.02   0.59 / 0.98
7                0.75 / 1.28   0.75 / 1.27   0.62 / 1.07   0.75 / 1.28

Table 1. Ablation study of the Social-STGCNN model. The first row corresponds to the number of TXP-CNN layers. The first column from the left corresponds to the number of ST-GCNN layers. We show the effect of different configurations of Social-STGCNN on the ADE/FDE metric. The best setting is to use one layer for ST-GCNN and five layers for TXP-CNN.

6.1. Ablation Study of Kernel Function

In this section, our objective is to find a suitable kernel function to construct the weighted adjacency matrix. The weighted adjacency matrix $A_t$ is a representation of the graph edge attributes. The kernel function maps attributes at $v_t^i$ and $v_t^j$ to a value $a_t^{ij}$ attached to $e_t^{ij}$. In the implementation of Social-STGCNN, $A_t$ weights the vertices' contributions to each other in the convolution operations. The kernel function can thus be considered as prior knowledge about the social relations between pedestrians. A straightforward idea in designing the kernel function is to use the distance measured by the L2 norm, defined in equation 8, between pedestrians to model their impact on each other. However, this is against the intuition that pedestrians tend to be influenced more by closer ones. To overcome this, we use similarity measures between the pedestrians. One of the proposals is to use the inverse of the L2 norm as defined in equation 10. The $\epsilon$ term is added in the denominator to ensure numerical stability. Another candidate function is the Gaussian Radial Basis Function [25], shown in equation 9. We compare the performance of these kernel functions through experiments. The case in which all the values in the adjacency matrix between different nodes are set to one is used as a baseline.

$$ a_{L2,t}^{ij} = \| v_t^i - v_t^j \|_2 \tag{8} $$

$$ a_{exp,t}^{ij} = \frac{\exp\left( -\| v_t^i - v_t^j \|_2 \right)}{\sigma} \tag{9} $$

$$ a_{sim_\epsilon,t}^{ij} = \frac{1}{\| v_t^i - v_t^j \|_2 + \epsilon} \tag{10} $$

According to the results listed in Table 4, the best performance comes from $a_{sim,t}^{ij}$ defined in function 2. The difference between functions 10 and 2 exists in the case where $\| v_t^i - v_t^j \|_2 = 0$. In function 2, we set $a_{sim,t}^{ij} = 0$ when $\| v_t^i - v_t^j \|_2 = 0$ because it is assumed that the two pedestrians can be viewed as the same person when they stay together. Without it, the model will have an ambiguity in the relationship between pedestrians. For this reason, we use $a_{sim,t}^{ij}$ in the definition of the adjacency matrix in all of our experiments.

6.2. Quantitative Analysis

The performance of Social-STGCNN is compared with other models on ADE/FDE metrics in Table 2. Overall, Social-STGCNN outperforms all previous methods on the two metrics. The previous state of the art on the FDE metric is SR-LSTM [30] with an error of 0.94. Our model has an error of 0.75 on the FDE metric, which is about 20% less than the state of the art. The results in the qualitative analysis explain how Social-STGCNN encourages social behaviors that enhanced the FDE metric. For the ADE metric, Social-STGCNN is slightly better than the state-of-the-art SR-LSTM, by 2%. Also, it is better than the previous generative methods, with an improvement ranging between 63% compared to S-LSTM [1] and 4% compared to PIF [14]. Interestingly, our model without the vision signal that contains scene context outperforms methods that utilized it, such as SR-LSTM, PIF and Sophie.

Model               ETH           HOTEL         UNIV          ZARA1         ZARA2         AVG
Linear * [1]        1.33 / 2.94   0.39 / 0.72   0.82 / 1.59   0.62 / 1.21   0.77 / 1.48   0.79 / 1.59
SR-LSTM-2 * [30]    0.63 / 1.25   0.37 / 0.74   0.51 / 1.10   0.41 / 0.90   0.32 / 0.70   0.45 / 0.94
S-LSTM [1]          1.09 / 2.35   0.79 / 1.76   0.67 / 1.40   0.47 / 1.00   0.56 / 1.17   0.72 / 1.54
S-GAN-P [6]         0.87 / 1.62   0.67 / 1.37   0.76 / 1.52   0.35 / 0.68   0.42 / 0.84   0.61 / 1.21
SoPhie [23]         0.70 / 1.43   0.76 / 1.67   0.54 / 1.24   0.30 / 0.63   0.38 / 0.78   0.54 / 1.15
CGNS [13]           0.62 / 1.40   0.70 / 0.93   0.48 / 1.22   0.32 / 0.59   0.35 / 0.71   0.49 / 0.97
PIF [14]            0.73 / 1.65   0.30 / 0.59   0.60 / 1.27   0.38 / 0.81   0.31 / 0.68   0.46 / 1.00
STSGN [29]          0.75 / 1.63   0.63 / 1.01   0.48 / 1.08   0.30 / 0.65   0.26 / 0.57   0.48 / 0.99
GAT [10]            0.68 / 1.29   0.68 / 1.40   0.57 / 1.29   0.29 / 0.60   0.37 / 0.75   0.52 / 1.07
Social-BiGAT [10]   0.69 / 1.29   0.49 / 1.01   0.55 / 1.32   0.30 / 0.62   0.36 / 0.75   0.48 / 1.00
Social-STGCNN       0.64 / 1.11   0.49 / 0.85   0.44 / 0.79   0.34 / 0.53   0.30 / 0.48   0.44 / 0.75

Table 2. ADE / FDE metrics for several methods compared to Social-STGCNN. The models marked with * are non-probabilistic. The rest of the models used the best amongst 20 samples for evaluation. All models take 8 frames as input and predict the next 12 frames. We notice that Social-STGCNN has the best average error on both ADE and FDE metrics. The lower the better.

Figure 3. Qualitative analysis of Social-STGCNN. We compare models trained with different kernel functions (Kernel 1: equation 8 and Kernel 2: equation 2) versus previous models. Social-GAN [6] is taken as a baseline for the comparison. Illustration scenes are from the ETH [21] and UCY [11] datasets. We used the pre-trained Social-GAN model provided by [6]. A variety of scenarios are shown: two individuals walking in parallel (1)(2), two persons meeting from the same direction (3), two persons meeting from different directions (4) and one individual meeting another group of pedestrians from an angle (5). For each case, the dashed line is the true trajectory that the pedestrians are taking and the color density is the predicted trajectory distribution. (Row labels: S-GAN; Ours Kernel 2; Ours Kernel 1. Column labels: 1 to 5. Legend: Ground truth; Observed; Prediction.)

Inference speed and model size. S-GAN-P [6] previously had the smallest model size, with 46.3K parameters. The size of Social-STGCNN is only 7.6K parameters, which is about one sixth of the number of parameters in S-GAN-P. In terms of inference speed, S-GAN-P was previously the fastest method, with an inference time of 0.0968 seconds per inference step. The inference time of our model is 0.002 seconds per inference step, which is about 48 times faster than S-GAN-P. Table 3 lists the speed comparisons between our model and publicly available models which we could benchmark against. We achieved these results because the design of our model overcomes the two limitations of previous methods, which used recurrent architectures and aggregation mechanisms.

Data Efficiency. In this section, we evaluate if the efficiency in model size leads to a better efficiency in learning from

fewer samples of the data. We ran a series of experiments using 5%, 10%, 20% and 50% of the training data. The training data were randomly selected. Once selected, we fed the same data to train different models. Social-GAN is employed as a comparison baseline because it has the fewest trainable parameters amongst previous deep models. Figure 5 shows the results of the data learning efficiency experiments with mean and error. We notice that our model exceeds the state of the art on the FDE metric when only 20% of the training data is used. Also, Social-STGCNN exceeds the performance of Social-GAN on the ADE metric when trained on only 20% of the training data. The results also show that S-GAN-P did not improve much in performance with more training data, unlike the present model. It is an interesting phenomenon that S-GAN-P does not absorb more training data. We assume that this behavior is due to the fact that GANs are data efficient because they can learn a distribution from few training samples. However, the training of GANs can easily fall into the problem of mode collapse. In comparison, the data efficiency of our model comes from the parameter efficiency.

Figure 4. The first column is the ground truth, while the other columns illustrate samples from our model. The first two rows show two different scenarios where pedestrians merge into a direction or meet from opposite directions. The second and third columns show changes in speed or direction in samples from our model. The last column shows undesired behaviors. The last row shows failed samples. (Panel labels: Meet; Merge; Failures; Ground truth; Speed; Dir; Bad. Legend: Observed; Prediction.)

Model             Parameters count   Inference time (s)
S-LSTM [1]        264K (35x)         1.1789 (589x)
SR-LSTM-2 [30]    64.9K (8.5x)       0.1578 (78.9x)
S-GAN-P [6]       46.3K (6.1x)       0.0968 (48.4x)
PIF [14]          360.3K (47x)       0.1145 (57.3x)
Social-STGCNN     7.6K               0.0020

Table 3. Parameter counts and inference times of different models compared to ours. The lower the better. Models were benchmarked using an Nvidia GTX1080Ti GPU. The inference time is the average of several single inference steps. We notice that Social-STGCNN has the smallest parameter count and the lowest inference time compared to the others. The values in parentheses (shown in blue in the original) give how many times our model is smaller or faster than the others.

Kernel function          ADE / FDE
$a_{L2,t}^{ij}$          0.48 / 0.84
$a_{exp,t}^{ij}$         0.50 / 0.84
$a_{sim_\epsilon,t}^{ij}$   0.48 / 0.88
Just ones                0.49 / 0.79
$a_{sim,t}^{ij}$         0.44 / 0.75

Table 4. The effect of different kernel functions for the adjacency matrix $A_t$ on the Social-STGCNN performance.

[Figure 5 plot. x-axis: Training data %. Curves: FDE SOTA; ADE SOTA; FDE Ours; FDE S-GAN; ADE Ours; ADE S-GAN.]

Figure 5. Model performance versus shrunken training dataset. The x-axis shows several randomly sampled shrink percentages. The shade represents errors. The same shrunken data were used a
