A Fast and Robust BERT-based Dialogue State Tracker for Schema-Guided Dialogue Dataset


Vahid Noroozi, Yang Zhang, Evelina Bakhturina, and Tomasz Kornuta (NVIDIA, USA; vnoroozi@nvidia.com, yangzhang@nvidia.com, ebakhturina@nvidia.com, tkornuta@nvidia.com)

ABSTRACT

Dialog State Tracking (DST) is one of the most crucial modules of goal-oriented dialogue systems. In this paper, we introduce FastSGT (Fast Schema Guided Tracker), a fast and robust BERT-based model for state tracking in goal-oriented dialogue systems. The proposed model is designed for the Schema-Guided Dialogue (SGD) dataset, which contains natural language descriptions for all the entities, including user intents, services, and slots. The model incorporates two carry-over procedures for handling the extraction of values not explicitly mentioned in the current user utterance. It also uses multi-head attention projections in some of the decoders to better model the encoder outputs. In the conducted experiments we compared FastSGT to the baseline model for the SGD dataset. Our model keeps the efficiency in terms of computation and memory consumption while improving the accuracy significantly. Additionally, we present ablation studies measuring the impact of different parts of the model on its performance. We also show the effectiveness of data augmentation for improving the accuracy without increasing the amount of computational resources.

KEYWORDS

goal-oriented dialogue systems, dialogue state tracking, schema-guided dialogues

ACM Reference Format:
Vahid Noroozi, Yang Zhang, Evelina Bakhturina, and Tomasz Kornuta. 2020. A Fast and Robust BERT-based Dialogue State Tracker for Schema-Guided Dialogue Dataset. In Proceedings of KDD Workshop on Conversational Systems Towards Mainstream Adoption (KDD Converse’20). ACM, New York, NY, USA, 8 pages.

1 INTRODUCTION

Goal-oriented dialogue systems are a category of dialogue systems designed to solve one or multiple specific goals or tasks (e.g., flight reservation, hotel reservation, food ordering, appointment scheduling) [21]. Traditionally, goal-oriented dialogue systems are set up as a pipeline with four main modules: 1) Natural Language Understanding (NLU), 2) Dialogue State Tracking (DST), 3) Dialog Policy Manager, and 4) Response Generator. NLU extracts the semantic information from each dialogue turn, including, e.g., user intents and slot values mentioned by the user or system. DST takes the extracted entities and builds the state of the user goal by aggregating and tracking the information across all turns of the dialogue. The Dialog Policy Manager is responsible for deciding the next action of the system based on the current state. Finally, the Response Generator converts the system action into natural text understandable by the user.

The NLU and DST modules have been shown to be successfully trainable with data-driven approaches [21]. Building on recent advances in language understanding driven by models like BERT [3], researchers have successfully combined NLU and DST into a single unified module, called Word-Level Dialog State Tracking (WL-DST) [7, 14, 18].
WL-DST models can take the user or system utterances in natural language format as input and predict the state at each turn. The model we propose in this paper falls into this class of algorithms.

Most of the previously published public datasets, such as MultiWOZ [2] or M2M [16], use a fixed list of defined slots for each domain without any information on the semantics of the slots and other entities in the dataset. As a result, the systems developed on these datasets fail to understand the semantic similarity between domains and slots. The capability of sharing knowledge between slots and domains may help a model to work across multiple domains and/or services, as well as to handle unseen slots and APIs when the new APIs and slots are similar in functionality to those present in the training data.

The Schema-Guided Dialogue (SGD) dataset [14] was created to overcome these challenges by defining and including schemas for the services. A schema can be interpreted as an ontology encompassing the naming and definition of the entities, properties and relations between the concepts. In other words, a schema defines not only the structure of the underlying data (relations between all the services, slots, intents and values), but also provides descriptions of most of the entities expressed in natural language. As a result, dialogue systems can exploit this rich information to capture more general semantic meanings of the concepts.
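For illustration, each service in the SGD dataset comes with a schema that pairs the service, its intents and its slots with natural language descriptions. The abridged example below is hypothetical but follows the structure of the released dataset (some of the descriptions also appear in the exemplary inputs of Fig. 1); it is written as a Python literal so that later sketches can reuse it.

    # Abridged, hypothetical service schema in the style of the SGD dataset.
    restaurant_schema = {
        "service_name": "Restaurants_1",
        "description": "A leading provider for restaurant search and reservations",
        "intents": [
            {
                "name": "FindRestaurants",
                "description": "Find a restaurant of a particular cuisine in a city",
                "required_slots": ["city"],
                "optional_slots": {"cuisine": "dontcare"},
            },
        ],
        "slots": [
            {
                "name": "party_size",
                "description": "Party size for a reservation",
                "is_categorical": True,
                "possible_values": ["1", "2", "3", "4", "5", "6"],
            },
            {
                "name": "restaurant_name",
                "description": "Name of the restaurant",
                "is_categorical": False,
                "possible_values": [],
            },
        ],
    }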

Additionally, the availability of the schema enables the model to use the power of pre-trained models like BERT to transfer or share knowledge between different services and domains. The recent emergence of the SGD dataset has triggered a new line of research on dialogue systems based on schemas, e.g. [1, 4, 9, 15].

Many state-of-the-art models proposed for the SGD dataset, despite showing impressive performance in terms of accuracy, appear not to be very efficient in terms of computational complexity and memory consumption, e.g. [9, 11, 12, 19]. To address these issues, we introduce a fast and flexible WL-DST model called Fast Schema Guided Tracker (FastSGT); the source code of the model is publicly available at https://github.com/NVIDIA/NeMo. The main contributions of this paper are as follows:

- FastSGT is able to predict the whole state at each turn with just a single pass through the model, which lowers both the training and inference time.
- The model employs carry-over mechanisms for transferring values between slots, enabling switching between services and accepting the values offered by the system during the dialogue.
- We propose an attention-based projection that attends over all the tokens of the main encoder to model the encoded utterances better.
- We evaluate the model on the SGD dataset [14] and show that our model has significantly higher accuracy than the baseline model of SGD, while keeping the efficiency in terms of computation and memory utilization.
- We show the effectiveness of data augmentation on the SGD dataset without increasing the number of training steps.

2 RELATED WORKS

The availability of schema descriptions for services, intents and slots enables NLU/DST models to share and transfer knowledge between different services that have similar slots and intents. Considering the recent advances in natural language understanding and the rise of Transformer-based models [17] like BERT [3] or RoBERTa [10], this looks like a promising approach for training a unified model on datasets aggregated from different sources. We categorize the models proposed for the SGD dataset into two main categories: multi-pass and single-pass models.

2.1 Multi-pass Models

The general principle of operation of multi-pass models [6, 11, 12, 15, 19] lies in passing the descriptions of every slot and intent as inputs to BERT-like encoders to produce their embeddings. As a result, the encoders are executed several times per single dialogue turn. Passing the descriptions to the model along with the user or system utterances enables the model to have a better understanding of the task and facilitates learning the similarity between intents and slots. The SPDD model [12] is a multi-pass model which showed one of the highest performances in terms of accuracy on the SGD dataset. For instance, in order to predict the user state for a service with 4 intents and 10 slots, with 3 slots being active in a given turn, this model needs 27 passes through the encoder (4 for intents, 10 for requested statuses, 10 for slot statuses, and 3 for values). Such approaches handle unseen services well and achieve high accuracy, but are not practical in many cases when time or resources are limited.

One obvious disadvantage of multi-pass models is their lack of efficiency. The other disadvantage is their memory consumption. They typically use multiple BERT-like models (e.g. five in SPDD) for predicting intents, requested slots, slot statuses, categorical values, and non-categorical values.
This significantly increases the memory consumption compared to most of the single-pass models, which use a single encoder.

2.2 Single-pass Models

The works that incorporate the single-pass approach [1, 14] rely on BERT-like models to encode the descriptions of services, slots, intents and slot values into representations called schema embeddings. The main difference lies in the fact that this procedure is executed just once, before the actual training starts, mitigating the need to pass the descriptions through the model for each of the turns/predictions. While these models are very efficient and robust in terms of training and inference time, they have shown significantly lower performance in terms of accuracy compared to the multi-pass approaches. On the other hand, multi-pass models need significantly more computational resources for training and inference, and the usage of additional BERT-based encoders also increases memory usage drastically.

3 THE FASTSGT MODEL

The FastSGT (Fast Schema Guided Tracker) model belongs to the category of single-pass models, keeping the flexibility along with memory and computational efficiency. Our model is based on the baseline model proposed for SGD [14], with some improvements in the decoding modules. The model architecture is illustrated in Fig. 1. It consists of four main modules: 1) Utterance Encoder, 2) Schema Encoder, 3) State Decoder, and 4) State Tracker. The first three modules constitute the NLU component and are based on neural networks, whereas the State Tracker is a rule-based module. We used BERT [3] for both encoders in our model, but similar models like RoBERTa [10] or XLNet [20] can also be used.

Assume we have a dialogue of N turns. Each turn consists of the preceding system utterance (S_t) and the user utterance (U_t). Let D = {(S_1, U_1), (S_2, U_2), ..., (S_N, U_N)} be the collection of turns in the dialogue.

The Utterance Encoder is a BERT model which encodes the user and system utterances at each turn. The Schema Encoder is also a BERT model which encodes the schema descriptions of intents, slots, and values into schema embeddings. These schema embeddings help the decoders to transfer or share knowledge between different services by providing some language understanding of each slot, intent, or value. The schema and utterance embeddings are passed to the State Decoder, a multi-task module. This module consists of five sub-modules producing the information necessary to track the state of the dialogue. Finally, the State Tracker module takes the previous state along with the current outputs of the State Decoder and predicts the current state of the dialogue by aggregating and summarizing the information across turns. Details of all model components are presented in the following subsections.
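To make the single-pass operation concrete, the sketch below shows how a dialogue D could be processed turn by turn. It is a simplification with hypothetical function names (the actual implementation lives in the NeMo repository): the schema embeddings are read from a precomputed memory, the Utterance Encoder runs once per turn, and the decoder heads and the rule-based tracker consume that single encoder output.

    # Illustrative single-pass tracking loop (not the NeMo implementation).
    def track_dialogue(dialogue, utterance_encoder, schema_memory, state_decoder, state_tracker):
        state = {}                                                    # empty state before the first turn
        for system_utt, user_utt in dialogue:                         # D = {(S_1, U_1), ..., (S_N, U_N)}
            y_cls, y_tok = utterance_encoder(system_utt, user_utt)    # one encoder pass per turn
            decoder_out = state_decoder(y_cls, y_tok, schema_memory)  # intents, statuses, values
            state = state_tracker(state, decoder_out)                 # rule-based aggregation across turns
        return state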

[Figure 1 is an architecture diagram: the Schema Encoder fills the Schema Embedding Memory, the Utterance Encoder encodes the current turn, and the Intent, Slot Status and Value Decoders map the previous state to the next state, with example inputs from a restaurant service.]

Figure 1: The overall architecture of FastSGT (Fast Schema Guided Tracker) with exemplary inputs from a restaurant service.

3.1 Utterance Encoder

This module is responsible for encoding the current turn of the dialogue. At each turn (S_t, U_t), the preceding system utterance is concatenated with the user utterance, separated by the special token [SEP], resulting in the sequence T_t which serves as the input to the Utterance Encoder:

    T_t = [CLS] S_t [SEP] U_t [SEP]    (1)

The output of the first token passed to the encoder is denoted as Y_cls and is interpreted as a sentence-level representation of the turn, whereas the token-level representations are denoted as Y_tok = [Y^1, Y^2, ..., Y^M], where M is the total number of tokens in T_t.
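As a concrete illustration, the encoding of Eq. 1 can be reproduced with an off-the-shelf BERT from the HuggingFace transformers library. This is a minimal sketch under our own naming; the checkpoint choice is an assumption, not necessarily the one used in the paper. Passing the two utterances as a sentence pair yields exactly the [CLS] S_t [SEP] U_t [SEP] layout, and the example utterances are taken from Fig. 1.

    # Sketch of the Utterance Encoder: one BERT pass per turn.
    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-cased")   # checkpoint is illustrative
    encoder = BertModel.from_pretrained("bert-base-cased")

    def encode_turn(system_utt: str, user_utt: str):
        inputs = tokenizer(system_utt, user_utt, return_tensors="pt")  # [CLS] S_t [SEP] U_t [SEP]
        with torch.no_grad():
            outputs = encoder(**inputs)
        y_tok = outputs.last_hidden_state    # (1, M, hidden) token-level representations Y_tok
        y_cls = y_tok[:, 0]                  # (1, hidden) sentence-level representation Y_cls
        return y_cls, y_tok

    y_cls, y_tok = encode_turn("How many people?", "Please find a table for two people.")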
3.2 Schema Encoder

The Schema Encoder uses the descriptions of intents, slots, and services to produce embedding vectors that represent the semantics of slots, intents and slot values. To build these schema representations we instantiate a BERT model with the same weights as the Utterance Encoder. However, this module is used just once, before the training starts, and all the schema embeddings are stored in a memory to be reused during training; this means they are kept fixed during training. This way of handling the schema embeddings is one of the main reasons behind the efficiency of our model compared to the multi-pass models in terms of computation time.

We used the same approach introduced in [14] for encoding the schemas. For a service with N_I intents, N_C categorical slots and N_NC non-categorical slots, the representations of the intents are denoted as I_i, 1 ≤ i ≤ N_I. Schema embeddings for the categorical and non-categorical slots are denoted as S_i^C, 1 ≤ i ≤ N_C, and S_i^NC, 1 ≤ i ≤ N_NC, respectively. The embeddings for the values of the k-th categorical slot of a service with N_V^k possible values are denoted as V_i^k, 1 ≤ i ≤ N_V^k.

Generally, the input to the Schema Encoder is the concatenation of two sequences, with the [SEP] token used as the separator and the [CLS] token indicating the beginning of the sequence. The Schema Encoder produces four types of schema embeddings: intents, categorical slots, non-categorical slots and categorical slot values. For an intent embedding I_i, the first sequence is the corresponding service description and the second one is the intent description. For each categorical slot embedding S_i^C and non-categorical slot embedding S_i^NC, the service description is concatenated with the description of the slot. To produce the schema embedding V_i^k for the i-th possible value of the k-th categorical slot, the description of the slot is used as the first sequence and the value itself as the second sequence.

These sequences are fed one by one to the Schema Encoder before the main training starts, and the output embedding of the first token, Y_cls, is extracted and stored as the schema representation, forming the Schema Embeddings Memory.
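Continuing the sketch above, the Schema Embeddings Memory could be precomputed as follows. The function and dictionary layout are assumptions made for illustration (the schema object mirrors the abridged example from the introduction), but the procedure follows the text: each description pair is encoded once and only the first-token embedding is kept.

    # Sketch: build the Schema Embeddings Memory once, before training starts.
    def encode_pair(seq_a: str, seq_b: str):
        inputs = tokenizer(seq_a, seq_b, return_tensors="pt")   # [CLS] seq_a [SEP] seq_b [SEP]
        with torch.no_grad():
            out = encoder(**inputs)
        return out.last_hidden_state[:, 0]                      # Y_cls of the pair

    def build_schema_memory(schema):
        service_desc = schema["description"]
        memory = {"intents": [], "cat_slots": [], "noncat_slots": [], "cat_values": {}}
        for intent in schema["intents"]:                         # intent embeddings I_i
            memory["intents"].append(encode_pair(service_desc, intent["description"]))
        for slot in schema["slots"]:
            emb = encode_pair(service_desc, slot["description"])
            if slot["is_categorical"]:                           # S_i^C and value embeddings V_i^k
                memory["cat_slots"].append(emb)
                memory["cat_values"][slot["name"]] = [
                    encode_pair(slot["description"], value)
                    for value in slot["possible_values"]
                ]
            else:                                                # S_i^NC
                memory["noncat_slots"].append(emb)
        return memory

    schema_memory = build_schema_memory(restaurant_schema)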

3.3 State Decoder

The Schema Embeddings Memory along with the outputs of the Utterance Encoder are used as inputs to the State Decoder to predict the values necessary for state tracking. The State Decoder module consists of five sub-modules, each employing a set of projection transformations to decode its inputs. We use the two following projection layers in the decoder sub-modules:

1) Single-token projection: this projection transformation, introduced in [14], takes a schema embedding vector and the Y_cls output of the Utterance Encoder as its inputs. The projection for predicting p outputs for task K is defined as F_FFC^K(x, y; p) for two input vectors x, y ∈ R^q, where q is the embedding size, p is the size of the output (e.g. the number of classes), the first input x is a schema embedding vector, and y is the sentence-level output embedding vector produced by the Utterance Encoder. The sources of the inputs x and y depend on the task and the sub-module. The function F_FFC^K(x, y; p) for projection K is defined as:

    h_1 = GELU(W_1^K y + b_1^K)                       (2)
    h_2 = GELU(W_2^K (x ⊕ h_1) + b_2^K)               (3)
    F_FFC^K(x, y; p) = Softmax(W_3^K h_2 + b_3^K)     (4)

where W_i^K, 1 ≤ i ≤ 3, and b_i^K, 1 ≤ i ≤ 3, are the learnable parameters of the projection, and GELU is the activation function introduced in [5]. The symbol ⊕ indicates the concatenation of two vectors. The Softmax function is used to normalize the outputs as a distribution over the targets. This projection is used by the Intent, Requested Slot and Non-categorical Value Decoders.

2) Attention-based projection: the single-token projection takes just one vector from the outputs of the Utterance Encoder. For the Slot Status Decoder and Categorical Value Decoder we propose to use a more powerful projection layer based on the multi-head attention mechanism [17]. We use the schema embedding vector x as the query to attend to the token representations Y_tok output by the Utterance Encoder. The idea is that domain-specific and slot-specific information can be extracted more efficiently from the collection of token-level representations than from the sentence-level encoded vector Y_cls alone. The multi-head attention-based projection function F_MHA^K(x, Y_tok; p) for task K to produce targets of size p is defined as:

    h_1 = MultiHeadAtt(query = x, keys = Y_tok, values = Y_tok)    (5)
    F_MHA^K(x, Y_tok; p) = Softmax(W_1^K h_1 + b_1^K)              (6)

where MultiHeadAtt is the multi-head attention function introduced in [17], and W_1^K and b_1^K are the learnable parameters of a linear projection applied after the multi-head attention. To accommodate padded utterances, we use attention masking to mask out the padded portion of the utterances.
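The two projection layers can be sketched in PyTorch roughly as follows. This is a simplified reading of Eqs. 2-6 with our own class names and hyper-parameters (e.g. the number of attention heads); details such as dropout and initialization are omitted.

    # Sketches of the two projection heads used by the State Decoder sub-modules.
    import torch
    import torch.nn as nn

    class SingleTokenProjection(nn.Module):
        # F_FFC: uses the schema embedding x and the sentence-level Y_cls (Eqs. 2-4).
        def __init__(self, hidden_size: int, num_targets: int):
            super().__init__()
            self.fc1 = nn.Linear(hidden_size, hidden_size)
            self.fc2 = nn.Linear(2 * hidden_size, hidden_size)
            self.fc3 = nn.Linear(hidden_size, num_targets)
            self.act = nn.GELU()

        def forward(self, x, y_cls):
            h1 = self.act(self.fc1(y_cls))
            h2 = self.act(self.fc2(torch.cat([x, h1], dim=-1)))
            return torch.softmax(self.fc3(h2), dim=-1)

    class AttentionProjection(nn.Module):
        # F_MHA: the schema embedding x attends over all token representations Y_tok (Eqs. 5-6).
        def __init__(self, hidden_size: int, num_targets: int, num_heads: int = 4):
            super().__init__()
            self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
            self.fc = nn.Linear(hidden_size, num_targets)

        def forward(self, x, y_tok, padding_mask=None):
            # x: (batch, 1, hidden) query; y_tok: (batch, M, hidden) keys and values.
            h1, _ = self.attn(x, y_tok, y_tok, key_padding_mask=padding_mask)
            return torch.softmax(self.fc(h1.squeeze(1)), dim=-1)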
The same prediction is done for both categorical andnon-categorical slots.3.3.3 Slot Status Decoder. We consider four different statusesfor each slot, namely: π‘–π‘›π‘Žπ‘π‘‘π‘–π‘£π‘’, π‘Žπ‘π‘‘π‘–π‘£π‘’, π‘‘π‘œπ‘›π‘‘ π‘π‘Žπ‘Ÿπ‘’, π‘π‘Žπ‘Ÿπ‘Ÿπ‘¦ π‘œπ‘£π‘’π‘Ÿ . If thevalue of a slot has not changed from the previous state to thecurrent user state, then the slot’s status is "π‘–π‘›π‘Žπ‘π‘‘π‘–π‘£π‘’". If a slot’svalue is updated in the current user’s state into "π‘‘π‘œπ‘›π‘‘ π‘π‘Žπ‘Ÿπ‘’", thenthe status of the slot is set to "π‘‘π‘œπ‘›π‘‘ π‘π‘Žπ‘Ÿπ‘’" which means the userdoes not care about the value of this slot. If the value for the slot isupdated and its value is mentioned in the current user utterance,then its status is "π‘Žπ‘π‘‘π‘–π‘£π‘’". There are many cases where the value fora slot does not exist in the last user utterance and it comes fromprevious utterances in the dialogue. For such cases, the status isset to "π‘π‘Žπ‘Ÿπ‘Ÿπ‘¦ π‘œπ‘£π‘’π‘Ÿ " which means we should search the previoussystem or user utterances in the dialogue to find the value for thisslot. More details of the carry over mechanisms are described insubsection 3.4.The status of the categorical slot π‘Ÿ is defined as:πΎπ‘†π‘Ÿ argmax 𝐹𝑀𝐻𝐴(𝑆𝑖 , π‘Œπ‘‘π‘œπ‘˜ ; 𝑝 4)(8)0 𝑖 𝑁𝑐Similar decoder is used for the status of the non-categorical slotsas:πΎπ‘†π‘Ÿ argmax 𝐹𝑀𝐻𝐴(𝑆𝑖 , π‘Œπ‘‘π‘œπ‘˜ ; 𝑝 4)(9)0 𝑖 𝑁𝑛𝑐𝐾(6)𝐹𝑀𝐻𝐴(π‘₯, π‘Œπ‘‘π‘œπ‘˜ ; 𝑝) π‘†π‘œ 𝑓 π‘‘π‘šπ‘Žπ‘₯ (π‘Š1𝐾 β„Ž 1 𝑏 𝐾1)where π‘€π‘’π‘™π‘‘π‘–π»π‘’π‘Žπ‘‘π΄π‘‘π‘‘ is the multi-head attention function introduced in [17], and π‘Š1𝐾 and 𝑏 𝐾1 are learnable parameters of a linearprojection after the multi-head attention. To accommodate paddedutterances we use attention masking to mask out the padded portion
