RoBERTaIQ: An Efficient Framework for Automatic Interaction Quality Estimation of Dialogue Systems

2y ago
24 Views
2 Downloads
1.76 MB
9 Pages
Last View : 11d ago
Last Download : 3m ago
Upload by : Ryan Jay
Transcription

Saurabh Gupta (gsaur@amazon.com), Xing Fan (fanxing@amazon.com), Derek Liu (derecliu@amazon.com), Benjamin Yao (benjamy@amazon.com), Yuan Ling (yualing@amazon.com), Kun Zhou (zhouku@amazon.com), Tuan-Hung Pham (hupha@amazon.com), Chenlei Guo (guochenl@amazon.com)
Alexa AI, Amazon, Seattle, USA

ABSTRACT
Automatically evaluating the response quality of large-scale dialogue systems is a challenging task in dialogue research. Existing automated turn-level approaches train supervised models on Interaction Quality (IQ) labels or annotations provided by experts, which is costly and time-consuming. Moreover, the small quantity of annotated data limits the trained model's ability to generalize to long-tail and out-of-domain cases. In this paper, we propose a learning framework that improves the model's generalizability by leveraging various unsupervised data sources available in large-scale conversational AI systems. We mainly rely on the following three techniques to improve the performance of dialogue evaluation models. First, we propose extending the RoBERTa model to encode multi-turn dialogues and capture the temporal differences between turns. Second, we add two additional pretraining processes on top of the enhanced multi-turn RoBERTa to take advantage of the large quantity of existing historical dialogue data through self-supervised training. Third, we perform fine-tuning on IQ labels in a multi-task learning setup, leveraging domain-specific information from other tasks. We show that the above techniques significantly reduce annotated data requirements: we achieve the same F1 score on the IQ prediction task as our baseline with only 5% of the IQ training data, and further beat the baseline by 5.4% absolute F1 score if we use all of the training data.

ACM Reference Format:
Saurabh Gupta, Xing Fan, Derek Liu, Benjamin Yao, Yuan Ling, Kun Zhou, Tuan-Hung Pham, and Chenlei Guo. 2021. RoBERTaIQ: An Efficient Framework for Automatic Interaction Quality Estimation of Dialogue Systems. In Proceedings of DeMaL, 2nd International Workshop on Data-Efficient Machine Learning (KDD '21), Aug 14–18, 2021, Virtual Event. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION
Large-scale conversational agents like Amazon Alexa, Apple Siri, and Google Assistant have set a standard for conversational AI with the ability to integrate seamlessly across a wide range of functionalities. Such systems are complex in nature, with many sequential components such as Automatic Speech Recognition (ASR), Natural Language Understanding (NLU), Dialogue Manager, and Natural Language Generation.
As the scope of these systems increases to cover more scenarios and applications, it becomes vital to automatically evaluate the response quality of these agents in order to estimate user satisfaction. In particular, identifying problematic responses where the user was left dissatisfied can be useful in improving dialogue agents over time with data-driven learning [3, 16, 26].

Previous approaches for automated dialogue evaluation can be classified as dialogue-level or turn-level, based on whether we are evaluating multiple exchanges at once or each exchange (user's utterance and agent's response) individually. The PARADISE (PARAdigm for DIalogue System Evaluation) framework is the most well-known framework proposed for evaluating dialogue-level user satisfaction [23]. In PARADISE, a linear regression model is fitted to predict the dialogue-level user satisfaction for a given set of manually extracted input features and user ratings. In contrast to rating the dialogue as a whole, approaches such as Interaction Quality (IQ) [19] were proposed to capture user satisfaction at the turn level. Here, an SVM model [4] is learned based on the ratings provided by human annotators, while the input features are automatically extracted from interaction parameters and emotions. More recent approaches [1, 2, 11] extend the IQ framework and use models like Gradient Boosting Decision Trees (GBDT) [6], Recurrent Neural Networks (RNN) [17], and Long Short-Term Memory networks (LSTM) [18] to encode the dialogue session. However, in addition to using textual data, these approaches also rely on input features generated by internal components, such as NLU/ASR confidence scores and dialogue status. These signals introduce dependencies on internal components and force the model to be system specific. As a result, our work does not leverage these signals; instead, we focus on more powerful model architectures that can capture user satisfaction using the textual and temporal information alone.

Table 1: Examples of dialogue sessions with different turn-level IQ labels (Unsatisfied = 1, Satisfied = 0)

#   Timestamp (s)   Dialogue Session                                                                         IQ label
1   τ = 0           [USER] Play hello                        [AGENT] Here's Hello, by Pop Smoke.             1
    τ = 4           [USER] Stop                              [AGENT] null
2   τ = 0           [USER] Play maj and dragons.             [AGENT] Sorry, I can't find that.               1
3   τ = 0           [USER] Play hello                        [AGENT] Here's Hello, by Pop Smoke.             1
    τ = 6           [USER] Play hello by Adele               [AGENT] Here's hello by Adele                   0
4   τ = 0           [USER] Play hello                        [AGENT] Here's Hello, by Pop Smoke.             0
    τ = 60          [USER] Play hello by Adele               [AGENT] Here's hello by Adele                   0
5   τ = 0           [USER] show me shark videos              [AGENT] Here's what I found (playing video)     1
    τ = 8           [USER] play baby shark on amazon prime   [AGENT] Here's Baby Shark, by Pinkfong, on Amazon Music.   0

We hypothesize that user satisfaction can be inferred from explicit and implicit user/agent behaviors that exist in the dialogue session. Dialogues 1 and 2 in Table 1 are examples of explicit user and agent behavior, respectively. In dialogue 1, the user terminated the request because the agent did not play the right song. In dialogue 2, the agent failed to handle the request due to an error in entity resolution, caused by an ASR error. Dialogues 3 and 4 capture the user's intention implicitly and highlight the importance of temporal information. In dialogue 3, the user did not intend to listen to Pop Smoke and thus immediately interrupted the agent by rephrasing the original request. However, in dialogue 4, the user listened to "Hello, by Pop Smoke" for 60 seconds before issuing the next request. This arguably leads to the conclusion that the user intended to listen to Pop Smoke, and Adele thereafter. Dialogue 5 shows why it is important to capture context from other turns: the agent's action in the first turn did not satisfy the user's requirement, as the user was looking for Baby Shark specifically.

These examples emphasize the significance of capturing dialogue context as precisely as possible in order to correctly estimate user satisfaction. From the perspective of offline dialogue evaluation, the dialogue context should include not only the previous and following turns, but also the temporal differences between turns. To this end, we design a novel transformer-based dialogue encoder that utilizes its self-attention mechanism [22] across tokens of different turns of the dialogue, while making the model aware of the temporal differences between turns. We build on top of the RoBERTa encoder [13] and refer to our model as RoBERTaIQ.

Another major challenge imposed by automatic dialogue evaluation is collecting a large amount of human annotations or IQ labels, which can be costly and time consuming. Recent advances in pretraining using self-attention encoder architectures like BERT [5] and RoBERTa [13] have been commonly used in many NLP applications. Such models are usually trained on massive general text corpora like English Wikipedia. However, the underlying difference in linguistic patterns between general text and dialogues makes existing pretrained language models less useful in practice.
Wu et al. [24] have successfully shown that pretraining for task-oriented dialogues can be more useful than using general pretrained language models. However, there are only a few related works that leverage pretraining for automated dialogue evaluation. Liang et al. [10] learn dialogue feature representations with a self-supervised dialogue flow anomaly detection task, while Sinha et al. [20] train text encoders via noise contrastive estimation (NCE) [8]. Inspired by the success of domain-adaptive (DAPT) and task-adaptive pretraining (TAPT) [7, 9], we adopt a multi-stage pretraining process on large-scale historical dialogue data and IQ task training data. Furthermore, we make our training process more data efficient by following the Multi-Task DNN learning framework for NLU [12]. We cast our learning process in a multi-task setting, leveraging large amounts of cross-task data and regularization benefits. When fine-tuning, we learn to jointly predict the turn-wise IQ label, domain, and intent. The domain and intent signals are obtained from a separate NLU classification system and do not introduce additional annotation costs.

In summary, we make the following contributions:
- We design a novel transformer-based dialogue encoder, RoBERTaIQ, for inferring turn-level user satisfaction in multi-turn dialogues.
- We show the effectiveness of RoBERTaIQ in capturing dialogue context and temporal information across turns by comparing it with previous state-of-the-art discourse-structure-aware text encoders.
- We propose a data-efficient learning framework to significantly reduce the amount of annotated data required for learning RoBERTaIQ. We leverage unlabelled historical dialogue data for pretraining, and we perform fine-tuning in a multi-task learning setup to further utilize readily available signals like domain and intent. Unlike other works that use these signals as input features [1, 11], our model uses them as supervision signals to reduce the training data (IQ labels) requirement.

The rest of the paper is organized as follows. Section 2 reviews existing work. Section 3 presents baselines and our approach for automatic dialogue evaluation. Section 4 presents our experimental results. Section 5 shows different ablation studies. We conclude our paper in Section 6. Appendices A and B contain hyperparameter information and case studies, respectively.

2 RELATED WORK
Recent works on evaluation of response quality in dialogue systems [1, 2, 11] are closely related to our work. While [2] use human-engineered NLP features, [1, 11] propose IQ prediction models that use input features directly from raw dialogue turn contents and system metadata (e.g. ASR/NLU scores). However, we see the reliance on system metadata as a limitation, and design our approach such that no system metadata is required as input features to the model. While the above approaches focus on dialogue evaluation in Spoken Language Understanding (SLU) systems, there is another line of work that focuses more on the evaluation of open-domain chit-chat style dialogues. Lowe et al. [14] proposed a supervised approach called ADEM to mimic human annotators' assessment of response appropriateness, while Tao et al. [21] proposed an unsupervised method called RUBER. Both of these approaches use RNN-based encoders. However, both ADEM and RUBER metrics result in poor correlation with human judgements [27]. Zhao et al. [27] propose RoBERTa-eval, which uses a powerful RoBERTa-based text encoder to represent the dialogue context. Recently, Sinha et al. [20] propose MaUdE, which uses a BERT-based text encoder to encode the utterances, followed by an RNN to model dialogue transitions. We adapt MaUdE and RoBERTa-eval to our use case, use them as baselines to analyze their shortcomings, and design our dialogue encoder with enhanced contextual and temporal representations.

3 METHODOLOGY
In this section, we first define the notation and provide the problem definition. Next, we present the baseline model architectures adapted to our use case, MaUdE and RoBERTa-eval. We then share the details of our proposed architecture, RoBERTaIQ, which encodes a flattened dialogue text sequence, and explain how each dialogue session is processed before being input to this model. Finally, we introduce how we obtain the datasets used for the experiments, followed by an explanation of the training procedure, which involves pretraining and multi-task fine-tuning.

3.1 Notations and Problem Definition
We consider a dataset $D$ of $M$ multi-turn dialogue sessions, such that $D = \{S_j\}_{j=1}^{M}$, and every session $S$ is an ordered set of $N$ turns: $S = \{t_i\}_{i=1}^{N}$. Here $i$ indicates the index of the turn, and each turn $t_i$ consists of a pair $(Q_i, R_i)$, where $Q_i$ is the user's query and $R_i$ is the agent's response to query $Q_i$. Each turn $t_i$ also has a timestamp $\tau_i$ associated with it, which is the time at which $Q_i$ was received by the agent. Any two successive turns have a time gap of less than a minute. Given a dialogue session $S$ and a reference turn $t_{ref} = t_i$ for some $i \in \{1, \ldots, N\}$, the goal of our model is to predict the $IQ_{score}$ of turn $t_{ref}$. $IQ_{score} = 0$ if the agent's response $R_{ref}$ to query $Q_{ref}$ is satisfactory from the user's perspective, and 1 otherwise. We focus on offline turn-level dialogue evaluation, which means that we have both previous and following turns available at the time of evaluating $t_{ref}$.

3.2 Baseline models
3.2.1 MaUdE+. MaUdE (Metric for automatic Unreferenced dialogue Evaluation) was proposed by Sinha et al. [20] for online dialogue evaluation. Here, we adapt MaUdE's dialogue-structure-aware encoder for offline evaluation, and slightly modify the architecture so that it can encode other meta information about each turn, such as domain, intent, timestamp, active screen availability, etc. We use similar metadata features as used by Ling et al. [11]. We refer to this modification as MaUdE+.

Figure 1: MaUdE+ (baseline model) architecture

As shown in Figure 1, MaUdE+ first computes the encoding for each turn and then passes the turn encodings through a bidirectional GRU to compute the dialogue session embedding.
This session embedding is concatenated with other features and fed to the classifier for IQ prediction. Considering a dialogue session $S$ with $n$ turns $\{(Q_1, R_1), \ldots, (Q_n, R_n)\}$, we compute the IQ score as:

$$f_i = \mathrm{RoBERTa}_{CLS}([CLS];\ Q_i;\ [SEP];\ R_i) \qquad (1)$$
$$e_i = (f_i;\ meta_i) \qquad (2)$$
$$\overrightarrow{h}_n,\ \overleftarrow{h}_1 = \mathrm{BiGRU}(e_1, e_2, \ldots, e_n) \qquad (3)$$
$$IQ_{score} = \sigma\big(W \cdot (e_{ref};\ \overrightarrow{h}_n;\ \overleftarrow{h}_1)\big) \qquad (4)$$

Here, [CLS] refers to a special token prefixed to the user query, [SEP] refers to a special token inserted between the query and the response, ";" denotes the concatenation operator, $e_i$ refers to the encoding of turn $t_i$, and $meta_i$ refers to the concatenated encodings of the categorical and real-valued features of turn $t_i$. $\overrightarrow{h}_n$ and $\overleftarrow{h}_1$ are the final hidden states of the bidirectional GRU in either direction, and $e_{ref}$ refers to the encoding of the reference turn for which we want to make the IQ score prediction. To compute the text representation $f_i$, we use the [CLS] token encoding from the RoBERTa encoder. We start with a pretrained RoBERTa model and finetune it end-to-end with gradients coming from the IQ classification loss.
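As a rough illustration of equations (1)-(4), the following PyTorch sketch shows one way the MaUdE+ baseline could be assembled: a RoBERTa encoder produces a [CLS] embedding per turn, a bidirectional GRU aggregates the turn encodings, and a linear layer with a sigmoid produces the IQ score. Class and argument names (e.g. MaudePlus, meta_dim) are our own illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the MaUdE+ baseline (eqs. 1-4); names are illustrative.
import torch
import torch.nn as nn
from transformers import RobertaModel

class MaudePlus(nn.Module):
    def __init__(self, meta_dim: int, hidden: int = 256):
        super().__init__()
        self.encoder = RobertaModel.from_pretrained("roberta-base")
        d = self.encoder.config.hidden_size            # 768 for roberta-base
        self.gru = nn.GRU(d + meta_dim, hidden, batch_first=True, bidirectional=True)
        # classifier sees e_ref concatenated with the two final GRU states (eq. 4)
        self.classifier = nn.Linear(d + meta_dim + 2 * hidden, 1)

    def forward(self, turn_input_ids, turn_attention_mask, meta, ref_idx):
        # turn_input_ids / turn_attention_mask: (n_turns, seq_len), one "<s> Q_i </s> R_i" per turn
        # meta: (n_turns, meta_dim) encoded categorical / real-valued turn features
        out = self.encoder(input_ids=turn_input_ids, attention_mask=turn_attention_mask)
        f = out.last_hidden_state[:, 0, :]             # eq. (1): [CLS] embedding of each turn
        e = torch.cat([f, meta], dim=-1)               # eq. (2): append metadata encoding
        _, h = self.gru(e.unsqueeze(0))                # eq. (3): h has shape (2, 1, hidden)
        h_fwd, h_bwd = h[0, 0], h[1, 0]                # final states in either direction
        feats = torch.cat([e[ref_idx], h_fwd, h_bwd], dim=-1)
        return torch.sigmoid(self.classifier(feats))   # eq. (4): IQ score in [0, 1]
```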

3.2.2 RoBERTa-eval. RoBERTa-eval was proposed by Zhao et al. [27] as a robust dialogue response evaluator. It produces a vector $d$ given a context $c$ and a response $R_{ref}$, and then calculates its score via a multilayer perceptron with a sigmoid function. Considering a dialogue session $S$ with $n$ turns $\{(Q_1, R_1), \ldots, (Q_n, R_n)\}$:

$$c = [[CLS];\ Q_1;\ [SEP];\ R_1;\ [SEP];\ Q_2;\ [SEP];\ \ldots\ Q_{ref}] \qquad (5)$$
$$d = \mathrm{RoBERTa}_{CLS}(c;\ [SEP];\ R_{ref}) \qquad (6)$$
$$IQ_{score} = \sigma(\mathrm{MLP}(d)) \qquad (7)$$

Here, $c$, the dialogue context, is a flattened sequence of user queries and agent responses from the previous turns, including the query of the reference turn for which we want to predict the IQ score. Note that a limitation of this model is that it can only encode the left dialogue context, i.e. turns that happened before $Q_{ref}$.

3.3 Our approach: RoBERTaIQ

Figure 2: RoBERTaIQ model architecture

Figure 2 shows the RoBERTaIQ model architecture. Unlike previous works that rely on textual features together with many other system-specific signals [11, 15], which require feature engineering effort, RoBERTaIQ relies solely on textual features, i.e. the user's utterances and the agent's responses. The RoBERTaIQ model is built on top of the RoBERTa-base model with modifications at the input layer. To differentiate between utterances and responses, we add two special tokens to the vocabulary, [USER] and [AGENT], and prefix them to each utterance and response respectively. By doing this, we create a single flat sequence for the whole dialogue. We limit the dialogue length to 512 tokens. Table 2 shows how a dialogue session is pre-processed.

Table 2: Pre-processing of a dialogue session

Timestamp (s)   Dialogue Session
τ = 0           [USER] Play fearless.                   [AGENT] Playing fearless by Pink Floyd.
τ = 2           [USER] Stop                             [AGENT] null
τ = 7           [USER] Play fearless by Taylor Swift.   [AGENT] Here's fearless by Taylor Swift.

Pre-processed form: [USER] Play fearless [AGENT] Playing fearless by Pink Floyd [USER] Stop [AGENT] [USER] Play fearless by Taylor Swift [AGENT] Here's fearless by Taylor Swift

Temporal difference encoding: In addition to capturing the textual contextual information as shown above, we also capture the time difference between turns in the case of a multi-turn dialogue. Capturing the time difference is an important factor for IQ prediction, as users are likely to immediately interrupt the agent if they do not get the right response. They might even rephrase their request or add more information, hoping the agent takes the expected action in the follow-up turn.

To make the model aware of these temporal differences between the turns, we first select a reference turn in the dialogue session and refer to its timestamp as $\tau_{ref}$. This turn is selected at random when pretraining. During fine-tuning, this is the turn for which we have the IQ label. We then calculate the time difference $\Delta_i$ for all turns with respect to $\tau_{ref}$: $\Delta_i = \tau_i - \tau_{ref}$, where $\tau_i$ is the timestamp of turn $i$. The $\Delta_i$ for all turns are then discretized using equal-width binning. We create 16 bins to represent equal-sized intervals in $\Delta_i$'s range of $[-60, 60]$ seconds and map each $\Delta_i$ to its respective time bin $[BIN_i]$. The corresponding time-bin embeddings are added to each token of the turn at the input layer of the model, depending on the turn's bin, as shown in Figure 2. These embeddings are learned from scratch. The number of bins is decided in a way that ensures a uniform distribution of turns across the bins. We reserve a special bin, $[BIN_0]$, for the reference turn's tokens; $[BIN_0]$ is the key indicator by which the model recognizes the reference turn.
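To make the pre-processing and temporal binning concrete, the sketch below flattens a session into a single token sequence with [USER]/[AGENT] markers and maps each turn's time difference to one of 16 equal-width bins over [-60, 60] seconds, reserving bin index 0 for the reference turn. The helper names and the exact binning arithmetic are our own illustration of the scheme described above, not the paper's code.

```python
# Illustrative pre-processing for RoBERTaIQ input: flatten a session into one
# sequence and assign a time-difference bin per turn (bin 0 = reference turn).
from typing import List, Tuple

NUM_BINS = 16          # equal-width bins over [-60, 60] seconds (paper setting)
REF_BIN = 0            # special bin reserved for the reference turn

def time_bin(delta: float) -> int:
    """Map delta = tau_i - tau_ref (clipped to [-60, 60]) to a bin in 1..NUM_BINS."""
    delta = max(-60.0, min(60.0, delta))
    width = 120.0 / NUM_BINS
    return min(NUM_BINS, int((delta + 60.0) // width) + 1)

def flatten_session(turns: List[Tuple[str, str, float]], ref_idx: int):
    """turns: list of (query, response, timestamp); returns (flat text, per-turn bins)."""
    tau_ref = turns[ref_idx][2]
    pieces, bins = [], []
    for i, (query, response, tau) in enumerate(turns):
        pieces.append(f"[USER] {query} [AGENT] {response}")
        bins.append(REF_BIN if i == ref_idx else time_bin(tau - tau_ref))
    return " ".join(pieces), bins

text, bins = flatten_session(
    [("Play fearless", "Playing fearless by Pink Floyd", 0.0),
     ("Stop", "", 2.0),
     ("Play fearless by Taylor Swift", "Here's fearless by Taylor Swift", 7.0)],
    ref_idx=0)
print(text)   # "[USER] Play fearless [AGENT] ... [AGENT] Here's fearless by Taylor Swift"
print(bins)   # e.g. [0, 9, 9] -- the reference turn gets the special bin
```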

RoBERTaIQ: An Efficient Framework for Automatic Interaction Quality Estimation of Dialogue Systemsfrom scratch. The number of bins is decided in a way to ensure auniform distribution of turns across the bins. We reserve a specialbin: [𝐵𝐼 𝑁 0 ] for reference turn’s tokens. [𝐵𝐼 𝑁 0 ] is the key indicatorusing which the model recognizes the reference turn.Task specific heads: As shown in Figure 2, the model has various heads: MLM (Masked Language Modeling) and classifier headsfor IQ (Interaction Quality), Domain and Intent classification. During multi-stage pretraining, we only use the MLM head for calculating the loss and updating the weights. The MLM head operateson the output representations of tokens. For multi-task fine-tuning,we use the classification heads for calculating the loss. Each of theclassification heads takes the encoded representation of the [CLS]token as input. Each classifier head has a dense layer, followed bya projection layer. All heads are initialized randomly. The outputsize of the projection layer is equal to the number of labels of therespective task.Note that we do not design the architecture with real-time/onlineIQ prediction in mind. We focus on offline evaluation where wehave previous and next turns available, when evaluating the qualityof the reference turn. However, our design is easily extensible toonline evaluation, in which case, the reference turn will always bethe last turn in the dialogue session.3.4DatasetsHistorical Dialogue Sessions: We randomly sample around 2million English dialogue sessions between users and Alexa fromanonymized logged historical data. We do not use any task specifichuman annotations for these dialogue sessions. These sessions spanmany NLU domains and intents, and contain turns where the usershad both good and bad experiences. As described later, we use thesedialogue sessions for the first stage of pretraining.Interaction Quality (IQ) dataset: This dataset is sampled fromAlexa Live Traffic and is annotated with IQ labels provided by experts: 0 (Non-defect or satisfactory experience) and 1 (Defect). Onlyone turn per dialogue session has a defect/non-defect label, whichwe refer to as the reference turn. The reference turns are labelledfrom the end user’s perspective. For example, considering turn 1 asthe reference turn in Table 2, the annotators would give it an IQlabel of 1 (defective), as the agent did not play the song intendedby the user in that turn. To get the IQ labels, we use a similar Response Quality annotation workflow as described in [2]. Wehave around 500K dialogues for training, 100K for development set,and 100K for testing. The training and test sets are sampled fromdifferent time periods. This leads to a test set that has a different domain distribution than the training set, and also some new domainsthat are not present in the training set. All the turns have domainand intent strings that were produced by a separate NLU system.The IQ prediction task is the primary task at which we want to dobetter with the least amount of human annotated data possible. Forevaluation, we focus on F1-score for the defective class as the binaryclassification metric. We do so because our dataset is imbalanced(25% defect and 75% non-defect) and identifying dissatisfactoryturns is of more importance.KDD ’21, Aug 14–18, 2021, Virtual EventOut-of-domain (OOD) testset: This dataset is sampled fromannotated IQ test data. 
Out-of-domain (OOD) test set: This dataset is sampled from the annotated IQ test data. It has 30K instances in total and is only used for evaluation, in particular to see the benefits and limitations of pretraining and multi-task learning on out-of-domain instances. To ensure that there is no overlap between the domains/intents of the turns in this test set and the training set, we sample three different fractions (5%, 10%, and 25%) from the IQ training dataset (500K instances) and use these subsets for training the models.

3.5 Pretraining: Domain-adaptive and Task-adaptive

Masked Language Modeling (MLM) is a common pretraining strategy for transformer-based architectures, in which a random sample of the tokens in the input sequence is selected and replaced with the special [MASK] token. The MLM loss function is the cross-entropy loss on predicting the masked tokens. Following Liu et al. [13], we conduct token masking dynamically with each batch by masking 15% of the tokens. RoBERTaIQ is initialized from RoBERTa-base and is further pretrained as described below. The MLM loss function is defined as:

$$L_{mlm} = -\sum_{m=1}^{M} \log P(x_m) \qquad (8)$$

where $M$ is the total number of masked tokens and $P(x_m)$ is the predicted probability of token $x_m$.

Following [7], we perform the first stage of pretraining on the unlabelled historical dialogue sessions data, which we refer to as Domain Adaptive Pretraining (DAPT). Similar to [9], we then further pretrain this model using the MLM loss on the IQ training dataset in the second stage, which we refer to as Task Adaptive Pretraining (TAPT). Note that neither DAPT nor TAPT requires any task-specific labels.

3.6 Multi-task (MT) fine-tuning

After the multi-stage pretraining process, we finetune the model on the main downstream task of IQ prediction, with additional heads for Domain and Intent prediction. Our hypothesis is that we can benefit from both cross-task data and the regularization effects of MT, especially when the IQ data is small. The multi-task loss is defined in Equation 9:

$$L(\theta) = \sum_{\psi} \sum_{(x^i_\psi,\, y^i_\psi) \in D_\psi} \lambda_\psi \, l\big(y^i_\psi,\ f_\psi(\mathrm{Enc}_{CLS}(x^i_\psi))\big) \qquad (9)$$

where $\psi$ refers to one of the tasks (IQ, Domain, Intent), $x^i_\psi$ and $y^i_\psi$ refer to the raw dialogue features and task labels respectively, $\mathrm{Enc}_{CLS}(x^i_\psi)$ refers to the encoding of the [CLS] token after passing $x^i_\psi$ through the shared RoBERTaIQ encoder, $f_\psi$ is the respective task classifier, $l$ is the cross-entropy loss, and $\lambda_\psi$ is the task weight. We empirically set $\lambda_{IQ} = 1$ and $\lambda_{domain} = \lambda_{intent} = 0.5$.
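A compressed sketch of the multi-task objective in Equation (9): a shared encoder feeds three classification heads (IQ, domain, intent), and the per-task cross-entropy losses are weighted by the task weights λ. Head sizes, batch handling, and names are illustrative assumptions rather than the authors' implementation.

```python
# Sketch of multi-task fine-tuning (eq. 9): shared encoder + IQ/domain/intent heads.
import torch.nn as nn
from transformers import RobertaModel

class RobertaIQMultiTask(nn.Module):
    def __init__(self, n_domains: int, n_intents: int):
        super().__init__()
        self.encoder = RobertaModel.from_pretrained("roberta-base")
        d = self.encoder.config.hidden_size
        def head(n_out):          # dense layer + projection, as described above
            return nn.Sequential(nn.Linear(d, d), nn.Tanh(), nn.Linear(d, n_out))
        self.heads = nn.ModuleDict({"iq": head(2),
                                    "domain": head(n_domains),
                                    "intent": head(n_intents)})

    def forward(self, input_ids, attention_mask, task: str):
        cls = self.encoder(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state[:, 0]
        return self.heads[task](cls)

# Weighted loss for a batch drawn from one task at a time (lambda_psi from the paper).
task_weight = {"iq": 1.0, "domain": 0.5, "intent": 0.5}
criterion = nn.CrossEntropyLoss()

def multitask_step(model, batch):
    # batch: dict with input_ids, attention_mask, labels, and the task name
    logits = model(batch["input_ids"], batch["attention_mask"], batch["task"])
    return task_weight[batch["task"]] * criterion(logits, batch["labels"])
```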

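The DAPT/TAPT stages in Section 3.5 rely on dynamic masking of 15% of the tokens in each batch. A minimal sketch of that masking step is shown below, using HuggingFace's MLM collator; the use of this particular tooling is an assumption for illustration, not the authors' pipeline, and a model whose vocabulary is extended with [USER]/[AGENT] would also need its embedding matrix resized.

```python
# Dynamic MLM masking at 15%, as used for DAPT/TAPT pretraining (Section 3.5).
from transformers import RobertaTokenizerFast, DataCollatorForLanguageModeling

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
# Add the dialogue marker tokens so they are never split into subwords.
tokenizer.add_special_tokens({"additional_special_tokens": ["[USER]", "[AGENT]"]})

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm=True, mlm_probability=0.15)

batch_texts = ["[USER] Play hello [AGENT] Here's Hello, by Pop Smoke."]
encoded = [tokenizer(t, truncation=True, max_length=512) for t in batch_texts]
batch = collator(encoded)          # masks a fresh 15% of tokens on every call
print(batch["input_ids"].shape, batch["labels"].shape)
```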
4 EXPERIMENTS

In this section, we first compare RoBERTaIQ with the other baselines to see the effects of different model architectures. We then show the results of RoBERTaIQ with multi-stage pretraining and multi-task fine-tuning with varying amounts of IQ training data. All the experiments were conducted on an AWS p3.16xlarge instance with 8 GPUs. All the numbers reported with "+" prefixes denote absolute differences in the metric w.r.t. the corresponding baseline. Other training details and hyperparameters can be found in Appendix A.

4.1 Comparison with baselines

Table 3: RoBERTaIQ vs. baselines on the IQ test set (100K examples)

Perf (%)                          Accuracy   F1      Precision   Recall
MaUdE (text features only) [20]   86.5       77.4    78.0        76.9
MaUdE+ (+ system metadata)        +2.1       +1.9    +2.7        +1.1
RoBERTa-eval [27]                 -3.3       -5.4    -5.9        -5.0
RoBERTaIQ (this work)             +4.2       +6.1    +6.0        +6.0
(All models trained on 100% of the IQ training data.)

Table 3 shows the performance comparison between RoBERTaIQ and the other baselines on the IQ test set. We use the full IQ training data (500K instances) to train all the models. The RoBERTa encoder weights are initialized with a pre-trained model and are finetuned end-to-end with gradients coming from the IQ classification loss. We do not apply any multi-tasking or pretraining strategies for this comparison. Using system metadata features helps increase the performance of MaUdE by 1.9% F1 score, but RoBERTaIQ, which uses only the text features, still outperforms it by 4.2% absolute F1 score. Please refer to Appendix B for a case study between MaUdE+ and RoBERTaIQ.

RoBERTa-eval performs worse than the MaUdE baseline by 5.4% F1 score. This is mainly due to the fact that RoBERTa-eval sees only the left context (previous turns).

4.2 RoBERTaIQ Full Results on the IQ test set

Table 4: Model performance comparison with pretraining and fine-tuning on varying amounts of IQ training data. All rows show evaluation metrics on the IQ test set (100K instances).

Perf (%)                  Accuracy   F1      Precision   Recall
IQ (5% training data)
  Scratch (baseline)      84.7       72.2    75.1        69.5
  + Multi-task            +2.7       +5.2    +4.5        +5.7
  + DAPT                  +0.5       +1.8    -1.1        +4.5
  + DAPT + Multi-task     +3.8       +7.1    +6.3        +7.5
IQ (10% training data)
  Scratch (baseline)      88.2       78.6    81.6        75.9
  + Multi-task            +0.5       +1.9    -2.3        +5.8
  + DAPT                  +1.1       +2.7    -0.2        +5.3
  + DAPT + Multi-task     +1.4       +2.9    +0.8        +4.8
IQ (25% training data)
  Scratch (baseline)      89.7       82.1    81.7        82.4
  + Multi-task            0.0        +0.2    -0.6        +1.0
  + DAPT                  +0.8       +1.5    +0.2        +3.1
  + DAPT + Multi-task     +0.5       +1.0    -0.3        +2.5
IQ (50% training data)
  Scratch (baseline)      90.1       82.2    84.2        80.4
  + Multi-task            +0.2       +0.6    -0.3        +1.5
  + DAPT                  +0.6       +1.9    -1.8        +5.6
  + DAPT + Multi-task     +0.6       +1.4    -0.1        +2.9
IQ (100% training data)
  Scratch (baseline)      90.7       83.5    84.0        82.9
  + Multi-task            -0.3       -0.3    -0.4        -0.1
  + DAPT                  +0.4       +0.7    +0.2        +1.4
  + TAPT                  +0.5       +1.2    +0.6        +1.8
  + DAPT + Multi-task     +0.1       +0.3    +0.6        +0.1
  + TAPT + Multi-task     +0.1       +0.44   +0.7        +0.3

Table 4 shows the performance of the RoBERTaIQ model on the IQ test set under different training settings. To show the respective benefits of pretraining and multi-task learning, we train models with varying amounts of IQ training data. "Scratch" refers to training on IQ only, without pretraining or multi-task learning. In DAPT runs, we start with a RoBERTaIQ model pretrained on historical dialogue sessions and finetune it on IQ data. In TAPT runs, we start with the DAPT-pretrained model, further pretrain it on the IQ training data with the MLM loss (without using the IQ labels), and then fine-tune it on the IQ training data with the classification loss. We perform TAPT experiments only for the cases where we use 100% of the available IQ training data. For all the multi-task runs, we include the other classification tasks (Domain and Intent) in addition to IQ for fine-tuning.

Effects of multi-task learning: The performance improvements that come with multi-task learning vary with the IQ training dataset size. We see maximum benefits, i.e. an increase in F1 score by an absolute 5.2%, when we use only 5% of the IQ training dataset.
The improvements diminish with increasing IQ training dataset size, even leading to a slight decrease in F1 score when 100% of the IQ dataset is used for training. This leads to the conclusion that the additional tasks help the model learn better through extra supervision when the IQ training data is small. But after a certain point, as the training dataset size increases, cross-task knowledge transfer becomes less useful, and instead the regularization effects of the other tasks start hurting the performance on the primary task of IQ prediction.

Effects of DAPT and TAPT: As can be seen from Table 4, DAPT provides consistent benefits in terms of boosting the performance on the IQ prediction task, improving the F1 score by an absolute 2.7% in the best case. This successfully demonstrates knowledge transfer from unlabelled historical data to the downstream task of IQ prediction. For the scenario where we use 100% of the IQ training data, we see TAPT providing a further boost over DAPT.

Overall, we find that combining pretraining and multi-task learning provides gains as large as a 7.1% F1 score improvement with smaller training dataset sizes. In other words, using these techniques, we can significantly reduce the amount of annotated training data required.
