A Fast and Robust BERT-based Dialogue State Tracker for Schema-Guided Dialogue Dataset

Vahid Noroozi (vnoroozi@nvidia.com), NVIDIA, USA
Yang Zhang (yangzhang@nvidia.com), NVIDIA, USA
Evelina Bakhturina (ebakhturina@nvidia.com), NVIDIA, USA
Tomasz Kornuta (tkornuta@nvidia.com), NVIDIA, USA

ABSTRACT
Dialogue State Tracking (DST) is one of the most crucial modules for goal-oriented dialogue systems. In this paper, we introduce FastSGT (Fast Schema Guided Tracker), a fast and robust BERT-based model for state tracking in goal-oriented dialogue systems. The proposed model is designed for the Schema-Guided Dialogue (SGD) dataset, which contains natural language descriptions for all the entities including user intents, services, and slots. The model incorporates two carry-over procedures for handling the extraction of values that are not explicitly mentioned in the current user utterance. It also uses multi-head attention projections in some of the decoders to better model the encoder outputs.

In the conducted experiments we compared FastSGT to the baseline model for the SGD dataset. Our model retains the baseline's efficiency in terms of computation and memory consumption while significantly improving accuracy. Additionally, we present ablation studies measuring the impact of different parts of the model on its performance. We also show the effectiveness of data augmentation for improving the accuracy without increasing the amount of computational resources.

KEYWORDS
goal-oriented dialogue systems, dialogue state tracking, schema-guided dialogues

ACM Reference Format:
Vahid Noroozi, Yang Zhang, Evelina Bakhturina, and Tomasz Kornuta. 2020. A Fast and Robust BERT-based Dialogue State Tracker for Schema-Guided Dialogue Dataset. In Proceedings of KDD Workshop on Conversational Systems Towards Mainstream Adoption (KDD Converse'20). ACM, New York, NY, USA, 8 pages.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
KDD Converse'20, August 2020. © 2020 Copyright held by the owner/author(s).

1 INTRODUCTION
Goal-oriented dialogue systems are a category of dialogue systems designed to solve one or multiple specific goals or tasks (e.g. flight reservation, hotel reservation, food ordering, appointment scheduling) [21]. Traditionally, goal-oriented dialogue systems are set up as a pipeline with four main modules: 1-Natural Language Understanding (NLU), 2-Dialogue State Tracking (DST), 3-Dialog Policy Manager, and 4-Response Generator. NLU extracts the semantic information from each dialogue turn, which includes e.g. user intents and slot values mentioned by the user or the system. DST takes the extracted entities to build the state of the user goal by aggregating and tracking the information across all turns of the dialogue. Dialog Policy Manager is responsible for deciding the next action of the system based on the current state. Finally, Response Generator converts the system action into natural language text understandable by the user.

The NLU and DST modules have been shown to be trainable with data-driven approaches [21]. With the most recent advances in language understanding, due to models like BERT [3], researchers have successfully combined NLU and DST into a single unified module, called Word-Level Dialog State Tracking (WL-DST) [7, 14, 18].
The WL-DST models can take the user or system utterances in natural language format as input and predict the state at each turn. The model we propose in this paper falls into this class of algorithms.

Most of the previously published public datasets, such as MultiWOZ [2] or M2M [16], use a fixed list of defined slots for each domain without any information on the semantics of the slots and other entities in the dataset. As a result, the systems developed on these datasets fail to understand the semantic similarity between the domains and slots. The capability of sharing the knowledge between the slots and domains might help a model to work across multiple domains and/or services, as well as to handle unseen slots and APIs when the new APIs and slots are similar in functionality to those present in the training data.

The Schema-Guided Dialogue (SGD) dataset [14] was created to overcome these challenges by defining and including schemas for the services. A schema can be interpreted as an ontology encompassing the naming and definition of the entities, properties and relations between the concepts. In other words, a schema not only defines the structure of the underlying data (relations between all the services, slots, intents and values), but also provides descriptions of most of the entities expressed in natural language. As a result, dialogue systems can exploit that rich information to capture more general semantic meanings of the concepts. Additionally, the availability of the schema enables the model to use the power of pre-trained models like BERT to transfer or share the knowledge between different services and domains. The recent emergence of the SGD dataset has triggered a new line of research on dialogue systems based on schemas, e.g. [1, 4, 9, 15].

Many state-of-the-art models proposed for the SGD dataset, despite showing impressive performance in terms of accuracy, appear not to be very efficient in terms of computational complexity and memory consumption, e.g. [9, 11, 12, 19]. To address these issues, we introduce a fast and flexible WL-DST model called Fast Schema Guided Tracker (FastSGT)¹. The main contributions of the paper are as follows:
- FastSGT is able to predict the whole state at each turn with just a single pass through the model, which lowers both the training and inference time.
- The model employs carry-over mechanisms for transferring the values between slots, enabling switching between services and accepting the values offered by the system during the dialogue.
- We propose an attention-based projection that attends over all the tokens of the main encoder in order to model the encoded utterances better.
- We evaluate the model on the SGD dataset [14] and show that our model has significantly higher accuracy than the baseline model of SGD, while keeping the efficiency in terms of computation and memory utilization.
- We show the effectiveness of augmentation on the SGD dataset without increasing the training steps.

¹ Source code of the model is publicly available at: https://github.com/NVIDIA/NeMo

Copyright 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2 RELATED WORKS
The availability of schema descriptions for services, intents and slots enables the NLU/DST models to share and transfer knowledge between different services that have similar slots and intents. Considering the recent advances in natural language understanding and the rise of Transformer-based models [17] like BERT [3] or RoBERTa [10], this looks like a promising approach for training a unified model on datasets which are aggregated from different sources. We categorize all models proposed for the SGD dataset into two main categories: multi-pass and single-pass models.

2.1 Multi-pass Models
The general principle of operation of multi-pass models [6, 11, 12, 15, 19] lies in passing the descriptions of every slot and intent as inputs to BERT-like encoders to produce their embeddings. As a result, the encoders are executed several times per dialogue turn. Passing the descriptions to the model along with the user or system utterances enables the model to have a better understanding of the task and facilitates learning of the similarity between intents and slots.

The SPDD model [12] is a multi-pass model which showed one of the highest performances in terms of accuracy on the SGD dataset. For instance, in order to predict the user state for a service with 4 intents and 10 slots, with 3 slots being active in a given turn, this model needs 27 passes through the encoder (4 for intents, 10 for requested status, 10 for statuses, and 3 for values). Such approaches handle unseen services well and achieve high accuracy, but seem not to be practical in many cases when time or resources are limited.

One obvious disadvantage of multi-pass models is their lack of efficiency. The other disadvantage is the memory consumption. They typically use multiple BERT-like models (e.g.
five in SPDD) for predicting intents, requested slots, slot statuses, categorical values, and non-categorical values. This significantly increases the memory consumption compared to most of the single-pass models with a single encoder.

2.2 Single-pass Models
The works that incorporate the single-pass approach [1, 14] rely on BERT-like models to encode the descriptions of services, slots, intents and slot values into representations called schema embeddings. The main difference lies in the fact that this procedure is executed just once, before the actual training starts, removing the need to pass the descriptions through the model for each of the turns/predictions.

While these models are very efficient and robust in terms of training and inference time, they have shown significantly lower accuracy than the multi-pass approaches. On the other hand, multi-pass models need significantly more computational resources for training and inference, and their additional BERT-based encoders increase the memory usage drastically.

3 THE FASTSGT MODEL
The FastSGT (Fast Schema Guided Tracker) model belongs to the category of single-pass models, keeping the flexibility along with memory and computational efficiency. Our model is based on the baseline model proposed for SGD [14] with some improvements in the decoding modules. The model architecture is illustrated in Fig. 1. It consists of four main modules: 1-Utterance Encoder, 2-Schema Encoder, 3-State Decoder, and 4-State Tracker. The first three modules constitute the NLU component and are based on neural networks, whereas the State Tracker is a rule-based module. We used BERT [3] for both encoders in our model, but similar models like RoBERTa [10] or XLNet [20] can also be used.

Assume we have a dialogue of $N$ turns. Each turn consists of the preceding system utterance ($S_t$) and the user utterance ($U_t$).
Let $D = \{(S_1, U_1), (S_2, U_2), \ldots, (S_N, U_N)\}$ be the collection of turns in the dialogue.

The Utterance Encoder is a BERT model which encodes the user and system utterances at each turn. The Schema Encoder is also a BERT model which encodes the schema descriptions of intents, slots, and values into schema embeddings. These schema embeddings help the decoders to transfer or share knowledge between different services by having some language understanding of each slot, intent, or value. The schema and utterance embeddings are passed to the State Decoder, a multi-task module. This module consists of five sub-modules producing the information necessary to track the state of the dialogue. Finally, the State Tracker module takes the previous state along with the current outputs of the State Decoder and predicts the current state of the dialogue by aggregating and summarizing the information across turns. Details of all model components are presented in the following subsections.
Figure 1: The overall architecture of FastSGT (Fast Schema Guided Tracker) with exemplary inputs from a restaurant service. The figure shows the Utterance Encoder, the Schema Encoder with the schema embedding memory, and the decoders (Intent Decoder, Slot Status Decoder, Non-categorical Value Decoder) mapping the previous state (slots {city: San Jose, restaurant name: Billy Berk's}, intent FindRestaurants) to the next state, which additionally contains party size: 2. Example Schema Encoder inputs: "[CLS] A leading provider for restaurant search and reservations [SEP] Find a restaurant of a particular cuisine in a city [SEP]", "[CLS] A leading provider for restaurant search and reservations [SEP] Party size for a reservation [SEP]", "[CLS] Party size for a reservation [SEP] 1 [SEP]". Example Utterance Encoder input: "[CLS] How many people? [SEP] Please find a table for two people. [SEP]".

3.1 Utterance Encoder
This module is responsible for encoding the current turn of the dialogue. At each turn $(S_t, U_t)$, the preceding system utterance is concatenated with the user utterance, separated by the special token [SEP], resulting in $T_t$, which serves as input to the utterance encoder module:

$$T_t = [CLS] \oplus S_t \oplus [SEP] \oplus U_t \oplus [SEP] \tag{1}$$

The output of the first token passed to the encoder is denoted as $Y_{cls}$ and is interpreted as a sentence-level representation of the turn, whereas the token-level representations are denoted as $Y_{tok} = [Y_{tok}^{1}, Y_{tok}^{2}, \ldots, Y_{tok}^{M}]$, where $M$ is the total number of tokens in $T_t$.

3.2 Schema Encoder
The Schema Encoder uses the descriptions of intents, slots, and services to produce embedding vectors which represent the semantics of slots, intents and slot values. To build these schema representations we instantiate a BERT model with the same weights as the Utterance Encoder. However, this module is used just once, before the training starts, and all the schema embeddings are stored in a memory to be reused during training.
This means that they stay fixed during training. This approach to handling the schema embeddings is one of the main reasons behind the efficiency of our model compared to the multi-pass models in terms of computation time.

We used the same approach introduced in [14] for encoding the schemas. For a service with $N_I$ intents, $N_C$ categorical slots and $N_{NC}$ non-categorical slots, the representations of the intents are denoted as $I_i$, $1 \le i \le N_I$. Schema embeddings for the categorical and non-categorical slots are indicated as $S_i^{C}$, $1 \le i \le N_C$, and $S_i^{NC}$, $1 \le i \le N_{NC}$, respectively. The embeddings for the values of the $i$-th categorical slot of a service with $N_V^{i}$ possible values are denoted as $V_k^{i}$, $1 \le k \le N_V^{i}$.

Generally, the input to the Schema Encoder is the concatenation of two sequences, with the [SEP] token used as the separator and the [CLS] token indicating the beginning of the sequence. The Schema Encoder produces four types of schema embeddings: intents, categorical slots, non-categorical slots and categorical slot values. For a single intent embedding $I_i$, the first sequence is the corresponding service description and the second one is the intent description. For each categorical ($S_i^{C}$) and non-categorical ($S_i^{NC}$) slot embedding, the service description is concatenated with the description of the slot. To produce the schema embedding $V_k^{i}$ for the $k$-th possible value of a categorical slot, the description of the slot is used as the first sequence and the value itself as the second sequence.

These sequences are given one by one to the Schema Encoder before the main training starts, and the embedding of the first output token, $Y_{cls}$, is extracted and stored as the schema representation, forming the Schema Embeddings Memory.
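As a rough illustration of the input construction described above, the sketch below assembles the sequence pairs for a toy service and encodes each of them exactly once. The descriptions are the example inputs of Figure 1. The `Service` container, the function names, the slot name `party_size`, the value list `["1", "2"]`, and the stand-in encoder (which returns the input length instead of a real BERT $Y_{cls}$ vector) are assumptions made for illustration and are not taken from the released model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


def pair_input(first: str, second: str) -> str:
    """Join two sequences in the '[CLS] a [SEP] b [SEP]' layout."""
    return f"[CLS] {first} [SEP] {second} [SEP]"


@dataclass
class Service:
    """Hypothetical container for one service schema (names map to descriptions)."""
    description: str
    intents: Dict[str, str] = field(default_factory=dict)
    cat_slots: Dict[str, str] = field(default_factory=dict)
    noncat_slots: Dict[str, str] = field(default_factory=dict)
    cat_values: Dict[str, List[str]] = field(default_factory=dict)


def schema_inputs(service: Service) -> Dict[Tuple[str, ...], str]:
    """Build one input sequence per intent, slot and categorical slot value."""
    inputs: Dict[Tuple[str, ...], str] = {}
    for name, desc in service.intents.items():
        # Intent: service description + intent description.
        inputs[("intent", name)] = pair_input(service.description, desc)
    for name, desc in service.cat_slots.items():
        # Slot: service description + slot description.
        inputs[("cat_slot", name)] = pair_input(service.description, desc)
        for value in service.cat_values.get(name, []):
            # Categorical value: slot description + the value itself.
            inputs[("cat_value", name, value)] = pair_input(desc, value)
    for name, desc in service.noncat_slots.items():
        inputs[("noncat_slot", name)] = pair_input(service.description, desc)
    return inputs


def build_memory(service: Service, encode: Callable[[str], object]) -> dict:
    """Encode every schema input once and keep the results for later reuse."""
    return {key: encode(text) for key, text in schema_inputs(service).items()}


restaurants = Service(
    description="A leading provider for restaurant search and reservations",
    intents={"FindRestaurants": "Find a restaurant of a particular cuisine in a city"},
    cat_slots={"party_size": "Party size for a reservation"},
    cat_values={"party_size": ["1", "2"]},
)

# Stand-in encoder (assumption): a real setup would return the Y_cls vector of BERT.
memory = build_memory(restaurants, encode=len)

for key, text in schema_inputs(restaurants).items():
    print(key, "->", text)

# The same two-sequence layout also matches the utterance input of Eq. (1).
turn_input = pair_input("How many people?", "Please find a table for two people.")
print(turn_input)
```

Because `build_memory` runs before training and its result is only read afterwards, the number of schema encoder calls depends on the size of the schema and not on the number of dialogue turns.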
3.3 State Decoder
The Schema Embeddings Memory along with the outputs of the Utterance Encoder are used as inputs to the State Decoder to predict the values necessary for state tracking. The State Decoder module consists of five sub-modules, each employing a set of projection transformations to decode its inputs. We use the following two projection layers in the decoder sub-modules:

1) Single-token projection: this projection transformation, which is introduced in [14], takes the schema embedding vector and the $Y_{cls}$ of the Utterance Encoder as its inputs. The projection for predicting $p$ outputs for task $K$ is defined as $F_{FC}^{K}(x, y; p)$ for two vectors $x, y \in R^{q}$ as the inputs. $q$ is the embedding size, $p$ is the size of the output (e.g. number of classes), the first input $x$ is a schema embedding vector, and $y$ is the sentence-level output embedding vector produced by the Utterance Encoder. The sources of the inputs $x$ and $y$ depend on the task and the sub-module. Function $F_{FC}^{K}(x, y; p)$ for projection $K$ is defined as:

$$h_1 = GELU(W_1^{K} y + b_1^{K}) \tag{2}$$
$$h_2 = GELU(W_2^{K} (x \oplus h_1) + b_2^{K}) \tag{3}$$
$$F_{FC}^{K}(x, y; p) = Softmax(W_3^{K} h_2 + b_3^{K}) \tag{4}$$

where $W_i^{K}$, $1 \le i \le 3$, and $b_i^{K}$, $1 \le i \le 3$, are the learnable parameters of the projection, and $GELU$ is the activation function introduced in [5]. Symbol $\oplus$ indicates the concatenation of two vectors. The Softmax function is used to normalize the outputs as a distribution over the targets. This projection is used by the Intent, Requested Slot and Non-categorical Value Decoders.

2) Attention-based projection: the single-token projection takes just one vector from the outputs of the Utterance Encoder. For the Slot Status Decoder and the Categorical Value Decoder we propose to use a more powerful projection layer based on the multi-head attention mechanism [17]. We use the schema embedding vector $x$ as the query to attend to the token representations $Y_{tok}$ output by the Utterance Encoder. The idea is that domain-specific and slot-specific information can be extracted more efficiently from the collection of token-level representations than from the sentence-level encoded vector $Y_{cls}$ alone. The multi-head attention-based projection function $F_{MHA}^{K}(x, Y_{tok}; p)$ for task $K$ to produce targets of size $p$ is defined as:

$$h_1 = MultiHeadAtt(query = x, keys = Y_{tok}, values = Y_{tok}) \tag{5}$$
$$F_{MHA}^{K}(x, Y_{tok}; p) = Softmax(W_1^{K} h_1 + b_1^{K}) \tag{6}$$

where $MultiHeadAtt$ is the multi-head attention function introduced in [17], and $W_1^{K}$ and $b_1^{K}$ are the learnable parameters of a linear projection after the multi-head attention. To accommodate padded utterances we use attention masking to mask out the padded portion of the utterances.

3.3.1 Intent Decoder. A special NONE intent indicates that there is no active intent for the current turn. An embedding vector $I_0$ is considered as the schema embedding for the NONE intent. It is a learnable embedding which is shared among all the services.

The inputs to the Intent Decoder for a service are the schema embeddings $I_i$, $0 \le i \le N_I$, from the Schema Embeddings Memory and $Y_{cls}$ of the Utterance Encoder. The predicted output of this sub-module is the active intent of the current turn, $I_{active}$, defined as:

$$I_{active} = \arg\max_{0 \le i \le N_I} F_{FC}^{K}(I_i, Y_{cls}; p = N_I) \tag{7}$$

3.3.2 Slot Request Decoder. At each turn, the user may request information about a slot instead of informing the system about a slot value. For example, when a user asks for the flight time while using a ticket reservation service, the flight_time slot of the service is requested by the user. This is a binary prediction task for each slot. For this task, slot $S_j$ is requested when $F_{FC}^{K}(S_j, Y_{cls}; p = 1) > 0.5$. The same prediction is done for both categorical and non-categorical slots.

3.3.3 Slot Status Decoder. We consider four different statuses for each slot, namely: inactive, active, dont_care, and carry_over. If the value of a slot has not changed from the previous state to the current user state, then the slot's status is "inactive". If a slot's value is updated in the current user state to "dont_care", then the status of the slot is set to "dont_care", which means the user does not care about the value of this slot. If the value of the slot is updated and the value is mentioned in the current user utterance, then its status is "active". There are many cases where the value of a slot does not appear in the last user utterance and instead comes from previous utterances in the dialogue. For such cases, the status is set to "carry_over", which means we should search the previous system or user utterances in the dialogue to find the value for this slot. More details of the carry-over mechanisms are described in subsection 3.4.

The status of the categorical slot $i$ is defined as the most probable of the four statuses:

$$status(S_i^{C}) = \arg\max F_{MHA}^{K}(S_i^{C}, Y_{tok}; p = 4), \quad 1 \le i \le N_C \tag{8}$$

A similar decoder is used for the status of the non-categorical slots:

$$status(S_i^{NC}) = \arg\max F_{MHA}^{K}(S_i^{NC}, Y_{tok}; p = 4), \quad 1 \le i \le N_{NC} \tag{9}$$
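The slot status prediction of Eqs. (8)-(9) builds on the attention-based projection of Eqs. (5)-(6). The following is a deliberately simplified sketch of that computation and not the released implementation: it uses a single attention head without learned query, key and value projections (the paper uses the multi-head attention of [17]), random weights instead of trained ones, and an assumed ordering of the four statuses. All function and variable names are hypothetical.

```python
import math
import random

# The order of the status labels is an assumption made for this sketch.
STATUSES = ["inactive", "active", "dont_care", "carry_over"]


def softmax(scores):
    """Numerically stable softmax; entries equal to -inf receive weight 0."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def masked_attention(query, tokens, mask):
    """Single-head scaled dot-product attention of one query over token vectors.

    mask[i] is True for real tokens and False for padding. At least one
    token must be real (in practice the [CLS] position always is).
    """
    scale = math.sqrt(len(query))
    scores = [
        sum(q * t for q, t in zip(query, tok)) / scale if keep else -math.inf
        for tok, keep in zip(tokens, mask)
    ]
    weights = softmax(scores)
    context = [
        sum(w * tok[d] for w, tok in zip(weights, tokens))
        for d in range(len(query))
    ]
    return context, weights


def linear(weight, bias, vec):
    """Compute W vec + b for a weight matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]


def slot_status(slot_embedding, tokens, mask, weight, bias):
    """Attention (Eq. 5), linear + softmax (Eq. 6), then argmax (Eqs. 8-9)."""
    context, _ = masked_attention(slot_embedding, tokens, mask)
    probs = softmax(linear(weight, bias, context))
    return STATUSES[probs.index(max(probs))], probs


rng = random.Random(0)
q = 8                                            # embedding size
slot = [rng.gauss(0, 1) for _ in range(q)]       # schema embedding x
tokens = [[rng.gauss(0, 1) for _ in range(q)] for _ in range(5)]  # Y_tok
mask = [True, True, True, False, False]          # last two tokens are padding
weight = [[rng.gauss(0, 1) for _ in range(q)] for _ in range(len(STATUSES))]
bias = [0.0] * len(STATUSES)

status, probs = slot_status(slot, tokens, mask, weight, bias)
_, attn = masked_attention(slot, tokens, mask)
print("predicted status:", status)
print("attention weights:", [round(a, 3) for a in attn])
```

Setting the scores of padded positions to negative infinity before the softmax gives them an attention weight of exactly zero, so the padded part of the utterance cannot influence the predicted status.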