Multi-Task Learning & Transfer Learning Basics


Multi-Task Learning & Transfer Learning Basics
CS 330

Logistics
- Optional homework 0 due Monday 9/27.
- PyTorch review session tomorrow at 6:00 pm PT.
- Project guidelines posted.
- Office hours start today.

Plan for Today
Multi-Task Learning
- Problem statement
- Models, objectives, optimization
- Challenges
- Case study of real-world multi-task learning
Transfer Learning
- Pre-training & fine-tuning

Goals for the end of lecture:
- Know the key design decisions when building multi-task learning systems
- Understand the difference between multi-task learning and transfer learning
- Understand the basics of transfer learning

Multi-Task Learning

Some notation
What is a task? (more formally this time)

A task: 𝒯_i ≜ {p_i(x), p_i(y|x), ℒ_i}, where p_i(x) and p_i(y|x) are the data-generating distributions and ℒ_i is a loss function.

Corresponding datasets: 𝒟_i^tr, 𝒟_i^test (will use 𝒟_i as shorthand for 𝒟_i^tr).

Model: f_θ(y|x), e.g. x = image → y ∈ {cat, lynx, tiger, …}, or x = paper → y = length of paper.

Typical loss: negative log-likelihood
ℒ(θ, 𝒟) = −𝔼_{(x,y)∼𝒟}[log f_θ(y|x)]

Single-task learning [supervised]: given 𝒟 = {(x, y)_k}, solve min_θ ℒ(θ, 𝒟)
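The negative log-likelihood loss above can be sketched in a few lines of NumPy. This is a minimal illustration, not course-provided code; the probabilities and labels are made up for the example.

```python
import numpy as np

def nll_loss(probs, labels):
    """Negative log-likelihood: L(theta, D) = -E_{(x,y)~D}[log f_theta(y|x)].

    probs:  (N, C) array of predicted class probabilities f_theta(y|x)
    labels: (N,)   array of integer class labels y
    """
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

# Two examples, three classes; the model puts high probability on the true class.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
labels = np.array([0, 1])
loss = nll_loss(probs, labels)   # = -(log 0.7 + log 0.8) / 2
```

In practice the model outputs logits and you would use a numerically stable log-softmax (e.g. `torch.nn.functional.cross_entropy`), but the quantity computed is the same.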

Examples of Tasks
A task: 𝒯_i ≜ {p_i(x), p_i(y|x), ℒ_i} (data-generating distributions and loss). Corresponding datasets: 𝒟_i^tr, 𝒟_i^test (will use 𝒟_i as shorthand for 𝒟_i^tr).

Multi-task classification: ℒ_i same across all tasks
- e.g. per-language handwriting recognition
- e.g. personalized spam filter

Multi-label learning: ℒ_i, p_i(x) same across all tasks
- e.g. CelebA attribute recognition
- e.g. scene understanding

When might ℒ_i vary across tasks?
- mixed discrete, continuous labels across tasks
- multiple metrics that you care about

The same input can have different targets per task, e.g. x = paper → y = length of paper, summary of paper, or paper review.

Condition the model on a task descriptor z_i: f_θ(y|x) → f_θ(y|x, z_i)
- e.g. one-hot encoding of the task index
- or whatever meta-data you have:
  - personalization: user features/attributes
  - language description of the task
  - formal specifications of the task

Vanilla MTL objective: min_θ Σ_{i=1}^T ℒ_i(θ, 𝒟_i)

Decisions on the model, the objective, and the optimization:
- How should we condition on z_i?
- What objective should we use?
- How to optimize our objective?

Model: How should the model be conditioned on z_i? What parameters of the model should be shared?
Objective: How should the objective be formed?
Optimization: How should the objective be optimized?

Conditioning on the task
Let's assume z_i is the one-hot task index.
Question: How should you condition on the task in order to share as little as possible?

Conditioning on the task
Multiplicative gating: y = Σ_j 1(z_i = j) y^j
This is independent training within a single network, with no shared parameters!
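The gating equation above can be sketched in NumPy. This is a hypothetical toy (3 tasks, linear subnetworks) just to show that the one-hot descriptor selects exactly one subnetwork's output:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_in, d_out = 3, 4, 2            # hypothetical sizes: 3 tasks

# One independent linear "subnetwork" per task, packed into one model.
W = rng.normal(size=(T, d_out, d_in))

def gated_forward(x, z):
    """y = sum_j 1(z_i = j) * y^j : the one-hot task descriptor z
    selects one subnetwork, so no parameters are shared across tasks."""
    per_task_outputs = np.einsum('toi,i->to', W, x)   # y^j for every task j
    return z @ per_task_outputs                       # zeros out all but task i

x = rng.normal(size=d_in)
z = np.eye(T)[1]                    # one-hot descriptor for task index 1
y = gated_forward(x, z)             # identical to W[1] @ x
```

Since only one subnetwork's gradient is nonzero per example, the T subnetworks train independently even though they live in one model.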

The other extreme
Concatenate z_i with the input and/or activations.
All parameters are shared (except the parameters directly following z_i, if z_i is one-hot).

An Alternative View on the Multi-Task Architecture
Split θ into shared parameters θ^sh and task-specific parameters θ^i.
Then, our objective is:
min_{θ^sh, θ^1, …, θ^T} Σ_{i=1}^T ℒ_i({θ^sh, θ^i}, 𝒟_i)
Choosing how to condition on z_i is equivalent to choosing how & where to share parameters.

Conditioning: Some Common Choices
1. Concatenation-based conditioning (concat z_i, followed by a fully-connected layer)
2. Additive conditioning
These are actually equivalent!
Question: why are they the same thing? (raise your hand)
Diagram sources: distill.pub/2018/feature-wise-transformations/
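The equivalence can be checked numerically: a fully-connected layer applied to the concatenation [x; z] splits column-wise into W_x x + W_z z, which is exactly additive conditioning. A minimal sketch with hypothetical weight shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_z, d_out = 5, 3, 4           # hypothetical sizes

Wx = rng.normal(size=(d_out, d_x))  # columns of W acting on x
Wz = rng.normal(size=(d_out, d_z))  # columns of W acting on z
x, z = rng.normal(size=d_x), rng.normal(size=d_z)

# 1. Concatenation-based conditioning: one fully-connected layer on [x; z].
concat_out = np.concatenate([Wx, Wz], axis=1) @ np.concatenate([x, z])

# 2. Additive conditioning: separate linear maps on x and z, summed.
additive_out = Wx @ x + Wz @ z
```

The two outputs are identical, so the design choice is purely notational at the level of a single linear layer.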

Conditioning: Some Common Choices
3. Multi-head architecture (Ruder '17)
4. Multiplicative conditioning
Why might multiplicative conditioning be a good idea?
- more expressive per layer
- recall: multiplicative gating
Multiplicative conditioning generalizes independent networks and independent heads.
Diagram sources: distill.pub/2018/feature-wise-transformations/
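A minimal NumPy sketch of the multi-head architecture (hypothetical sizes, linear layers with a ReLU for brevity): a shared bottom computes features, and a per-task head reads them out.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_in, d_hidden, d_out = 3, 4, 8, 2     # hypothetical sizes

W_shared = rng.normal(size=(d_hidden, d_in))      # theta^sh: shared bottom
W_heads = rng.normal(size=(T, d_out, d_hidden))   # theta^i: one head per task

def multi_head_forward(x, task_index):
    """Shared trunk feeds a task-specific output head."""
    h = np.maximum(0.0, W_shared @ x)   # shared representation (ReLU)
    return W_heads[task_index] @ h      # task-specific head

x = rng.normal(size=d_in)
y0 = multi_head_forward(x, 0)
y1 = multi_head_forward(x, 1)           # same trunk features, different head
```

Here θ^sh = W_shared is updated by every task's gradient, while each W_heads[i] only sees task i's data, matching the split-parameter objective from the previous slide.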

Conditioning: More Complex Choices
- Cross-Stitch Networks. Misra, Shrivastava, Gupta, Hebert '16
- Multi-Task Attention Network. Liu, Johns, Davison '18
- Deep Relation Networks. Long, Wang '15
- Perceiver IO. Jaegle et al. '21

Conditioning Choices
Unfortunately, these design decisions are like neural network architecture tuning:
- problem dependent
- largely guided by intuition or knowledge of the problem
- currently more of an art than a science

Model: How should the model be conditioned on z_i? What parameters of the model should be shared?
Objective: How should the objective be formed?
Optimization: How should the objective be optimized?

Vanilla MTL objective: min_θ Σ_{i=1}^T ℒ_i(θ, 𝒟_i)

Often want to weight tasks differently: min_θ Σ_{i=1}^T w_i ℒ_i(θ, 𝒟_i)

How to choose w_i? Manually, based on importance or priority, or dynamically adjusted throughout training:
a. various heuristics, e.g. encourage gradients to have similar magnitudes (Chen et al. GradNorm. ICML 2018)
b. optimize for the worst-case task loss: min_θ max_i ℒ_i(θ, 𝒟_i)
   (e.g. for task robustness, or for fairness)
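The two weighted objectives above reduce to one-liners. A toy sketch with made-up per-task losses and manually chosen weights w_i:

```python
# Hypothetical per-task losses L_i and manually chosen weights w_i.
losses = [0.2, 1.5, 0.6]
weights = [1.0, 2.0, 0.5]

# Weighted objective: sum_i w_i * L_i
weighted_loss = sum(w * l for w, l in zip(weights, losses))

# Worst-case objective: max_i L_i (optimize the hardest task)
worst_case_loss = max(losses)
```

In training, `weighted_loss` (or `worst_case_loss`) would be the scalar you backpropagate; the weights can also be recomputed every step for dynamic schemes like GradNorm.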

Model: How should the model be conditioned on z_i? What parameters of the model should be shared?
Objective: How should the objective be formed?
Optimization: How should the objective be optimized?

Optimizing the objective
Vanilla MTL objective: min_θ Σ_{i=1}^T ℒ_i(θ, 𝒟_i)
Basic version:
1. Sample a mini-batch of tasks ℬ ∼ {𝒯_i}
2. Sample a mini-batch of datapoints for each task: 𝒟_i^b ∼ 𝒟_i
3. Compute the loss on the mini-batch: ℒ̂(θ, ℬ) = Σ_{𝒯_k ∈ ℬ} ℒ_k(θ, 𝒟_k^b)
4. Backpropagate the loss to compute the gradient ∇_θ ℒ̂
5. Apply the gradient with your favorite neural net optimizer (e.g. Adam)
Note: This ensures that tasks are sampled uniformly, regardless of data quantities.
Tip: For regression problems, make sure your task labels are on the same scale!
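The five steps above can be sketched end-to-end on a toy problem. This is a hypothetical setup (4 linear-regression tasks that happen to share one true regressor, plain SGD instead of Adam), just to make the sampling structure concrete:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 4                                # number of tasks
w_true = rng.normal(size=3)          # toy: all tasks share one labeling rule

# Hypothetical per-task datasets (different x draws, small label noise).
datasets = []
for _ in range(T):
    X = rng.normal(size=(50, 3))
    y = X @ w_true + 0.01 * rng.normal(size=50)
    datasets.append((X, y))

def task_loss(theta, X, y):
    return np.mean((X @ theta - y) ** 2)

theta = np.zeros(3)                  # shared parameters
lr, B, b = 0.1, 2, 16                # step size, tasks/batch, points/task

for step in range(200):
    batch = rng.choice(T, size=B, replace=False)         # 1. sample tasks uniformly
    grad = np.zeros_like(theta)
    for k in batch:
        X, y = datasets[k]
        idx = rng.choice(len(X), size=b, replace=False)  # 2. sample datapoints
        Xb, yb = X[idx], y[idx]
        grad += 2 * Xb.T @ (Xb @ theta - yb) / b         # 3-4. mini-batch gradient
    theta -= lr * grad / B                               # 5. SGD update

final_loss = np.mean([task_loss(theta, X, y) for X, y in datasets])
```

Because tasks are drawn uniformly in step 1, a task with 10× more data does not get 10× more gradient updates; that is the point of the note on the slide.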

Challenges

Challenge #1: Negative transfer
Negative transfer: sometimes independent networks work the best.
[Table: Multi-Task CIFAR-100 results comparing multi-head architectures, the cross-stitch architecture, independent training, and recent approaches (Yu et al. Gradient Surgery for Multi-Task Learning. 2020)]
Why?
- optimization challenges
  - caused by cross-task interference
  - tasks may learn at different rates
- limited representational capacity
  - multi-task networks often need to be much larger than their single-task counterparts

If you have negative transfer, share less across tasks.
It's not just a binary decision!
min_{θ^sh, θ^1, …, θ^T} Σ_{i=1}^T ℒ_i({θ^sh, θ^i}, 𝒟_i) + Σ_{t'=1}^T ‖θ^t − θ^{t'}‖
"soft parameter sharing": constrained weights across the per-task networks y^1, …, y^T
- allows for more fluid degrees of parameter sharing
- yet another set of design decisions / hyperparameters
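The soft-sharing penalty term can be sketched directly. A toy example with hypothetical per-task parameter vectors, summing the norm of the difference over task pairs:

```python
import numpy as np

def soft_sharing_penalty(thetas):
    """Sum over task pairs of ||theta^t - theta^t'||: pulls per-task
    parameters toward each other without tying them exactly."""
    T = len(thetas)
    return sum(np.linalg.norm(thetas[t] - thetas[tp])
               for t in range(T) for tp in range(t + 1, T))

# Hypothetical per-task parameter vectors theta^t.
thetas = [np.array([1.0, 2.0]), np.array([1.5, 1.0]), np.array([0.5, 2.5])]
penalty = soft_sharing_penalty(thetas)
```

Adding `penalty` (usually scaled by a tunable coefficient) to the sum of task losses interpolates between fully independent networks (zero weight) and effectively tied parameters (large weight).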

Challenge #2: Overfitting
You may not be sharing enough!
Multi-task learning acts as a form of regularization.
Solution: Share more.

Challenge #3: What if you have a lot of tasks?
Should you train all of them together? Which ones will be complementary?
The bad news: No closed-form solution for measuring task similarity.
The good news: There are ways to approximate it from one training run.
Fifty, Amid, Zhao, Yu, Anil, Finn. Efficiently Identifying Task Groupings for Multi-Task Learning. 2021

Plan for Today
Multi-Task Learning
- Problem statement
- Models, objectives, optimization
- Challenges
- Case study of real-world multi-task learning
Transfer Learning
- Pre-training & fine-tuning

Case study
Goal: Make recommendations for YouTube

Case study
Goal: Make recommendations for YouTube
Conflicting objectives:
- videos that users will rate highly
- videos that users will share
- videos that users will watch
Implicit bias caused by feedback: a user may have watched a video because it was recommended!

Framework Set-Up
Input: what the user is currently watching (query video) + user features
1. Generate a few hundred candidate videos
2. Rank candidates
3. Serve top-ranking videos to the user
Candidate videos: pool videos from multiple candidate generation algorithms
- matching topics of the query video
- videos most frequently watched with the query video
- and others
Ranking: the central topic of this paper

The Ranking Problem
Input: query video, candidate video, user & context features
Model output: engagement and satisfaction with the candidate video
Engagement:
- binary classification tasks like clicks
- regression tasks related to time spent
Satisfaction:
- binary classification tasks like clicking "like"
- regression tasks such as ratings
Ranking score: weighted combination of engagement & satisfaction predictions, with the score weights manually tuned.
Question: Are these objectives reasonable? What are some of the issues that might come up?

The Architecture
Basic option: "Shared-Bottom Model" (i.e. multi-head architecture)
- harms learning when correlation between tasks is low

The Architecture
Instead: use a form of soft parameter sharing, "Multi-gate Mixture-of-Experts (MMoE)".
Allow different parts of the network (expert neural networks) to "specialize":
1. Decide which experts to use for input x and task k (a per-task gating network)
2. Compute features from the selected experts
3. Compute the output
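The three MMoE steps can be sketched in NumPy. This is a heavily simplified toy (linear experts and heads, hypothetical sizes: 4 experts, 2 tasks), not the paper's implementation:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

rng = np.random.default_rng(0)
E, K, d_in, d_h = 4, 2, 6, 8            # hypothetical: 4 experts, 2 tasks

W_experts = rng.normal(size=(E, d_h, d_in))   # shared expert networks
W_gates = rng.normal(size=(K, E, d_in))       # one gating network per task
W_heads = rng.normal(size=(K, 1, d_h))        # task-specific output heads

def mmoe_forward(x, k):
    """Task k's gate softly selects which experts to use for input x,
    then a task-specific head reads out the mixed expert features."""
    expert_outs = np.einsum('ehi,i->eh', W_experts, x)  # 2. all expert features
    gate = softmax(W_gates[k] @ x)                      # 1. task-k mixture weights
    features = gate @ expert_outs                       #    weighted expert combo
    return (W_heads[k] @ features)[0]                   # 3. task-k output

x = rng.normal(size=d_in)
score_engagement = mmoe_forward(x, 0)    # e.g. an engagement prediction
score_satisfaction = mmoe_forward(x, 1)  # same experts, different gate + head
```

Because each task has its own softmax gate, two tasks with low correlation can route to disjoint experts, which is the soft-sharing behavior the slide motivates.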

Experiments
Set-Up:
- Implementation in TensorFlow, TPUs
- Train in temporal order, running training continuously to consume newly arriving data
- Offline AUC & squared error metrics
- Online A/B testing in comparison to the production system
  - live metrics based on time spent, survey responses, rate of dismissals
- Model computational efficiency matters
Results:
- Found a 20% chance of gating polarization during distributed training; use drop-out on the experts

Plan for Today
Multi-Task Learning
- Problem statement
- Models & training
- Challenges
- Case study of real-world multi-task learning
Transfer Learning
- Pre-training & fine-tuning

Multi-Task Learning vs. Transfer Learning
Multi-Task Learning: Solve multiple tasks 𝒯_1, …, 𝒯_T at once: min_θ Σ_{i=1}^T ℒ_i(θ, 𝒟_i)
Transfer Learning: Solve target task 𝒯_b after solving source task 𝒯_a, by transferring knowledge learned from 𝒯_a.
Key assumption: Cannot access data 𝒟_a during transfer.
Transfer learning is a valid solution to multi-task learning (but not vice versa).
Side note: 𝒯_a may itself include multiple tasks.
Question: In what settings might transfer learning make sense? (answer in chat or raise hand)

Transfer learning via fine-tuning
Start from pre-trained parameters θ and run gradient descent on the new task's training data 𝒟^tr:
φ ← θ − α ∇_θ ℒ(θ, 𝒟^tr)
(typically for many gradient steps)
Where do you get the pre-trained parameters?
- ImageNet classification
- Models trained on large language corpora (BERT, LMs)
- Other unsupervised learning techniques
- Whatever large, diverse dataset you might have
Pre-trained models are often available online.
What makes ImageNet good for transfer learning? Huh, Agrawal, Efros. '16
Some common practices:
- Fine-tune with a smaller learning rate
- Smaller learning rate for earlier layers
- Freeze earlier layers, gradually unfreeze
- Reinitialize the last layer
- Search over hyperparameters via cross-validation
- Architecture choices matter (e.g. ResNets)
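The layer-wise practices above (smaller rates for earlier layers, freezing the earliest ones, fully training the reinitialized last layer) can be sketched as a configuration. The layer names, base rate, and decay factor here are all hypothetical choices for illustration:

```python
# A simple model viewed as an ordered list of named layers (hypothetical names).
layers = ["conv1", "conv2", "conv3", "fc"]

base_lr = 1e-3   # already smaller than a typical from-scratch learning rate
decay = 0.5      # earlier layers get geometrically smaller rates (assumed factor)

# Smaller learning rates for earlier layers; the last (reinitialized) layer
# trains at the full base rate.
lr_per_layer = {name: base_lr * decay ** (len(layers) - 1 - i)
                for i, name in enumerate(layers)}

# Freeze the earliest layers first; they would be unfrozen gradually
# as fine-tuning progresses.
frozen = {name: i < 2 for i, name in enumerate(layers)}
```

In PyTorch this maps onto per-parameter-group learning rates in the optimizer and `requires_grad = False` for frozen layers; the dictionary form just makes the schedule explicit.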

Fine-tuning doesn't work well with small target task datasets.
Universal Language Model Fine-Tuning for Text Classification. Howard, Ruder. '18
Upcoming lectures: few-shot learning via meta-learning

Plan for Today
Multi-Task Learning
- Problem statement
- Models, objectives, optimization
- Challenges
- Case study of real-world multi-task learning
Transfer Learning
- Pre-training & fine-tuning

Goals for the end of lecture:
- Know the key design decisions when building multi-task learning systems
- Understand the difference between multi-task learning and transfer learning
- Understand the basics of transfer learning

Reminders
Next time: Meta-learning problem statement, black-box meta-learning, GPT-3

