Roger Longbotham, Mgr Analytics, Experimentation Platform, Microsoft


Slides available at http://exp-platform.com

What to measure
How to compare Treatment to Control
How long to run test
Start up options
Good test design
Data validation and cleansing
Before your first experiment
Common errors
MultiVariable Tests

Start with objective
  Of the site (content, ecommerce, marketing, help/support, ...)
  Of the experiment
What can you measure to tell you if you met your objective?
  Content site: clicks/user, pageviews/user, time on site
  Ecommerce: rev/visitor, units purchased/visitor, cart-adds/visitor
  Marketing: referrals/visitor, time on site
  Help/support: Pct of users engaged, Pct of users who print, email or download content, time on site

Measures of user behavior
  Number of events (clicks, pageviews, scrolls, downloads, etc.)
  Time (minutes per session, total time on site, time to load page)
  Value (revenue, units purchased)
Experimental units
  Per user (e.g. clicks per user)
  Per session (e.g. minutes per session)
  Per user-day (e.g. pageviews per user per day)
  Per experiment (e.g. clicks per pageview)

It is very helpful to have a single metric that summarizes whether the Treatment is successful or not: the Overall Evaluation Criterion, or OEC
Examples:
  Content site: OEC could be clicks/user or time on site
  Ecommerce: rev/user or lifetime value
  Help/support site: Survey responses or user engagement
OEC could also capture monetary value of the Treatment effect, aka ROI (return on investment)

Single Treatment
  Two-sample t test works well
    Large sample sizes => Normal distribution for means
    Calculate 95% Confidence Interval for difference in two means
      $(\bar{X}_T - \bar{X}_C) \pm 1.96 \, s_{\bar{X}_T - \bar{X}_C}$
    If zero is not in the interval, conclude Treatment mean is different from Control
  May have many tests, OEC critical
Multiple Treatments
  Multiple applications of two-sample t test
  Analysis of Variance
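The interval on this slide can be sketched in a few lines. This is a minimal illustration, not the Experimentation Platform's code. It assumes two independent groups, one observation per user, and samples large enough for the Normal approximation. The function names are made up for this example.

```python
import math
from statistics import mean, variance


def diff_confidence_interval(treatment, control, z=1.96):
    """Return (diff, lower, upper) for mean(treatment) - mean(control).

    z=1.96 gives the 95% interval under the Normal approximation.
    """
    diff = mean(treatment) - mean(control)
    # Standard error of the difference of two independent sample means.
    se = math.sqrt(variance(treatment) / len(treatment)
                   + variance(control) / len(control))
    return diff, diff - z * se, diff + z * se


def is_significant(treatment, control):
    """True when zero lies outside the 95% interval."""
    _, lower, upper = diff_confidence_interval(treatment, control)
    return lower > 0 or upper < 0
```

With several Treatments, the same function could be applied to each Treatment against Control, keeping in mind that more comparisons mean more chances of a false positive.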

[Figure: experiment results table for 103 metrics. Note: averages for both variants, P-values, percent change, significance, confidence intervals]

P-value is the probability of getting a difference farther from zero than observed under assumption of no difference
CI for percent effect must use special formulas
Care must be taken in calculating standard deviations
  When randomization is by user, any metric that is not per user must take into account non-independence in calculating standard deviation
  We routinely use bootstrapping to estimate standard deviations
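One way to bootstrap the standard deviation of a metric that is not per user is sketched below, using clicks per pageview as the example. It resamples whole users, so each user's pageviews stay together and their non-independence is preserved. The input layout (one `(clicks, pageviews)` tuple per user, at least one pageview each) and the function name are assumptions for illustration, not the platform's actual procedure.

```python
import random
from statistics import stdev


def bootstrap_ctr_std(users, n_resamples=1000, seed=0):
    """Bootstrap standard deviation of CTR = total clicks / total pageviews.

    users: list of (clicks, pageviews) tuples, one tuple per user.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        # Resample users (the randomization unit) with replacement.
        sample = rng.choices(users, k=len(users))
        clicks = sum(c for c, _ in sample)
        views = sum(v for _, v in sample)
        estimates.append(clicks / views)
    return stdev(estimates)
```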

$n = 16 \cdot r \cdot \sigma^2 / D^2$
The power of a test is the probability of detecting a difference (D) of a given size, i.e., it is 1 - Prob(Type II error)
Power depends on
  The size of effect you want to be able to detect, D
  Variability of the metric
  Number of users in each group (T/C)
It is typical to determine the sample size needed to achieve 80% power

Example: Total number of users needed to achieve 80% power, with equal number of users in Treatment and Control and with standard deviation s, is
  $N = 32 \cdot s^2 / D^2$
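As a quick worked version of this rule of thumb, the sketch below evaluates N = 32 * s^2 / D^2 and rounds up to a whole number of users. The function name is hypothetical. The estimate of s would normally come from historical data for the metric.

```python
import math


def total_users_for_80pct_power(s, d):
    """Total users (Treatment plus Control, equal split) from N = 32 * s^2 / D^2.

    s: standard deviation of the metric
    d: smallest difference D you want to be able to detect
    """
    if d == 0:
        raise ValueError("D must be nonzero")
    return math.ceil(32 * s ** 2 / d ** 2)
```

Because D enters squared, halving the detectable difference quadruples the users needed. For example, with s = 2, D = 0.5 gives 512 users and D = 0.25 gives 2048.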

Often good practice is to start with small percent in Treatment and increase when you have confidence Treatment is bug-free
Sample ramp up schedule:
  1% in Treatment for 4 hours
  5% in Treatment for 4 hours
  20% in Treatment for 4 hours
  50% in Treatment for 14 days
[Diagram label: Ramp-up period]

Example: Real Estate widget design
  Test five alternatives to the current design
  OEC: clicks to links weighted by revenue per click
[Figure: widget designs for Control, T1, T2, T3, T4, T5]

The widget that performed the best was the simplest
Revenue increase over control: 9.7%
Note Ronny’s example earlier compared the best Treatment to another Treatment, not the Control

Triggering
Blocking
Measuring non-test factors
Randomization

Only allow users into your experiment if they “trigger” the experiment, i.e. a user’s data should only be used in the analysis of the experiment if they saw one of the variants
Example: MSN UK Hotmail experiment
  Control: When user clicks on email, hotmail opens in same window
  Treatment: Open hotmail in separate window
Which users do you want to track as part of your experiment?

Factor is controlled such that it affects both treatment and control equally, hence not affecting the estimate of the effect
Blocking on a factor is more common than keeping it fixed (keeping it constant throughout the experiment)
Advantages to blocking
  Can estimate the effect of the different levels of the factor, e.g. what is the effect on weekends/weekdays
  Can make inference to a broader population

Time (time of day, day of week, etc.)
  Bad test design: run control at 100% M-W, then treatment at 100% Th-Sa
  Always run treatment and control concurrently in online experiments
Content
  Ex: If content of a site changes during the experiment it must be the same for both Treatment and Control at all times

The Treatment and Control groups should be as alike as possible except for application of the treatment
  Who is in the experiment
  What is done during the experiment
  etc.
Updates to the site during the test must be applied to all variants in the test

Example: One partner was conducting an A/A test (same as an A/B test but no real change is made). What would you expect?
Results: Treatment very significant (much more than it should be). Why?
Found out another group was using their Treatment group to test something, so there really was a difference between T and C

Ex: A site was testing a change to the layout of their page
Content to T and C was not the same for a 7 hour period
[Figure: Hourly Clickthrough Rate for Treatment and Control for Module; x-axis: hour, 10/15/07 14:00 to 10/19/07 14:00; y-axis: Clickthrough Rate, 0.0% to 1.2%; series: CTR Control, CTR Tmt]

Measuring the value of non-test factors allows you to
  Delve into why the treatment had the effect it did (e.g. more PVs are correlated with faster load time, which explains almost all the effect of the Treatment)
  Determine if subpopulations behave the same (e.g. did the Treatment have the same effect for new users as for returning users?)

Why randomize?
So that those factors you can’t control (or don’t know about) don’t bias your results
[Diagram label: Unknown Factors]
“Randomization is too important to be left to chance” (Robert Coveyou, ORNL)

How to randomize? (online tests)
  Randomly assign T or C to user (alternately could use user-session, search query, page view or product/SKU)
  Usually best by user (store UserID in cookie)
How persistent is the UID?
  Ideally user always gets same treatment group
  Limitations:
    Clearing cookies can change treatment
    Different computer/browser may get different treatment
Can’t allow opt-in or opt-out
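A common way to keep a user in the same group on every visit is to derive the group from a hash of the stored UserID. The slide does not say which mechanism is used, so the sketch below is an assumption for illustration. The function name, the SHA-256 choice, and the 10,000-bucket granularity are all hypothetical.

```python
import hashlib


def assign_variant(user_id, experiment, treatment_pct=50.0):
    """Deterministically map a user to "T" or "C".

    Including the experiment name in the hashed key means different
    experiments split the same users independently of each other.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest, 16) % 10000  # buckets 0..9999, i.e. 0.01% steps
    return "T" if bucket < treatment_pct * 100 else "C"
```

The same UserID always lands in the same bucket, so raising `treatment_pct` during ramp-up only moves users from Control into Treatment. If the cookie holding the UserID is cleared, the user gets a new ID and may change groups, which is the limitation noted above.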

Make sure users and conditions are as representative of launch environment as possible
  Time period: not holiday (unless holiday factor), pre-holiday, complete cycle (day, week)
  Users: all users who would see T in the future, not robots, not internal testers, outliers(?)
  Not during special events

Remove robots (web crawlers, spiders, etc.) from analysis
  They can generate many pageviews or clicks in Treatment or Control, skewing the results
  Remove robots with known identifiers (found in the user agent)
  Develop heuristics to identify robots with many clicks or pageviews in short period of time
  Other patterns may be used to identify robots as well, such as very regular activity
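The first two filters above might look roughly like the sketch below. The marker strings and the rate threshold are hypothetical placeholders. Real values would have to be tuned per site and are not taken from the slides.

```python
# Hypothetical substrings that identify self-declared robots in the user agent.
KNOWN_BOT_MARKERS = ("bot", "crawler", "spider")
# Hypothetical threshold: more events than this within 60 seconds looks automated.
MAX_EVENTS_PER_MINUTE = 60


def is_probable_robot(user_agent, event_times):
    """Flag a "user" as a robot by user agent or by burst activity.

    event_times: timestamps in seconds of the user's clicks/pageviews.
    """
    ua = user_agent.lower()
    if any(marker in ua for marker in KNOWN_BOT_MARKERS):
        return True
    times = sorted(event_times)
    start = 0
    for end, t in enumerate(times):
        # Shrink the window until it spans less than 60 seconds.
        while t - times[start] >= 60:
            start += 1
        if end - start + 1 > MAX_EVENTS_PER_MINUTE:
            return True
    return False
```

Detecting "very regular activity" would need a further check, for example on the spread of gaps between events, which is left out here.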

[Figure: Clicks for Treatment minus Control by Hour for A/A test, No Robots Removed; x-axis: hour, 6/7/07 15:00 to 6/20/07 11:00; y-axis: -8000 to 8000. Annotations: Each hour represents clicks from thousands of users. The “spikes” can be traced to single “users” (robots)]

Carry out checks to make sure data is not affected by some unknown factor
  Check that percentage of users in each variant is not different from planned (statistical test)
  Check that number of users in the experiment is approximately what was expected (and doesn’t change too much during experiment)
  Check that the Treatment effect does not change too much during experiment
  Check that means for primary metrics do not change unexpectedly
Always plot the data over time
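The first check, whether the observed split matches the planned one, can be done with a simple test on the user counts. The slide only says "statistical test", so the Normal approximation to the binomial used below is one possible choice, and the function name is made up for this sketch.

```python
import math


def split_p_value(n_treatment, n_control, planned_treatment_share=0.5):
    """Two-sided p-value for the observed Treatment count versus the plan.

    Uses the Normal approximation to the binomial, which is reasonable
    for the large user counts typical of online experiments.
    """
    n = n_treatment + n_control
    expected = n * planned_treatment_share
    sd = math.sqrt(n * planned_treatment_share * (1 - planned_treatment_share))
    z = (n_treatment - expected) / sd
    # P(|Z| >= |z|) for a standard Normal variable.
    return math.erfc(abs(z) / math.sqrt(2))
```

A very small p-value here suggests a problem in assignment or logging (for example a redirect that loses users in one variant) and should be investigated before looking at any metric.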

Conduct logging audit
  Compare data collected for experiment to system of record
  Should have approximately same number of users, clicks, pageviews, orders, etc.
Conduct A/A test
  Split users into two groups that get same experience
  Should have about 5% of tests significant
  p-values should have U(0,1) distribution
  No p-values should be extremely small (say .001)
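A small simulation shows what a healthy set of A/A results looks like. The data here are synthetic (both groups drawn from the same Normal distribution, with arbitrary mean 10 and standard deviation 3), and the function names are hypothetical. Real A/A tests would use logged user data instead.

```python
import math
import random
from statistics import mean, variance


def two_sided_p_value(a, b):
    """Large-sample two-sided p-value for a difference in means."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    z = (mean(a) - mean(b)) / se
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_aa_tests(n_tests=500, users_per_group=200, seed=1):
    """Run many A/A comparisons where both groups share one distribution."""
    rng = random.Random(seed)
    p_values = []
    for _ in range(n_tests):
        a = [rng.gauss(10, 3) for _ in range(users_per_group)]
        b = [rng.gauss(10, 3) for _ in range(users_per_group)]
        p_values.append(two_sided_p_value(a, b))
    return p_values


def share_significant(p_values, alpha=0.05):
    """Fraction of tests with p < alpha; should be near alpha for A/A tests."""
    return sum(p < alpha for p in p_values) / len(p_values)
```

If the share of significant results on real A/A data is far above 5%, or the p-values pile up near zero instead of spreading evenly over (0, 1), the variance calculation or the experiment setup deserves a closer look.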

Not conducting logging or A/A tests
  Find caching issues, UID reassignment
Not keeping all factors constant or blocking
  Content changes to site
  Redirect for Treatment but not for Control
Sample size too small
Not measuring correct metric for OEC
  Measure clicks to buy button (instead of revenue)
  Clicks to download button (instead of completed downloads)

Several factors/variables, each of which has two or more levels (C/T1/T2/...)
Main effects: Comparison of Treatments to Control for each variable (i.e. compare means for T and C same as before)
Interactions: Determine if combinations of variables have different effect than adding main effects

Factors/variables
  F1: Size of Right col ad
    C: current size
    T1: 10% larger
    T2: 10% smaller
  F2: MSNBC news stories
    C: Top international
    T: Specific to country ID’d
  F3: Sports/Money placement
    C: Sports above Money
    T: Money above Sports
OEC: Clicks per User
Other metrics: PVs, CTR
[Figure callouts: F1, F2, F3]
(This is for illustration purposes only, it does not reflect any previous or planned test on MSN HP)

Advantages:
– Can test many things at once, accelerating innovation
– Can estimate interactions between factors
Disadvantages:
– Some combinations of factors may give negative customer experience
– Analysis and interpretation is more difficult
– May take longer to set up test

On-line experiments can simply run overlapping, concurrent, independently randomized experiments
Example: Test 7 factors each at 2 levels
  Set up 7 separate experiments to run at the same time with the same users. Get all 128 combinations in the results.
Advantages:
– Easier to implement
– Can turn off one experiment if negative
– Get all interactions

Procedure for analyzing an MVT for interactions
1. Since there are potentially a very large number of interactions among the variables being tested, restrict the ones you will look at to a few you suspect may be present. (If 7 factors, 21 two-factor interactions, 35 three-factor interactions, etc.)
2. Conduct the test to determine if the interaction between two factors is present or not
3. If interaction is not significant, stop!
   If the interaction IS significant, look at the graphical output to interpret.
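The counts quoted in step 1 are binomial coefficients, which is why the number of candidate interactions grows so quickly with the number of factors. A short sketch (function name hypothetical):

```python
from math import comb


def interaction_counts(n_factors):
    """Number of k-factor interactions for k = 2 .. n_factors."""
    return {k: comb(n_factors, k) for k in range(2, n_factors + 1)}
```

For 7 factors this gives 21 two-factor and 35 three-factor interactions, and 120 interactions in total across all orders, which is why the procedure restricts attention to the few that are suspected in advance.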

Example: Factors from MSN HP illustration
  F2: MSNBC news stories
    C: Top international
    T: Specific to country ID’d
  F3: Sports/Money placement
    C: same order every day
    T: Sports higher on wkends and Money higher wkdays
Hypothesis tests for interactions similar to main effects (details omitted)

Factors/variables
  F2: MSNBC news stories
    C: Top international
    T: Specific to country ID’d
  F3: Sports/Money placement
    C: Sports above Money
    T: Money above Sports
OEC: Clicks per User
Other metrics: PVs, CTR
[Figure callouts: F2, F3]
(This is for illustration purposes only, it does not reflect any previous or planned test on MSN HP)

If hypothesis test for interaction is not significant
  Assume no interaction present
  Interaction graph would show lines approximately parallel
If interaction is statistically significant
  Plot interaction to interpret

Case 1: No Interaction (parallel lines)
Data Table
            F2 - C   F2 - T
  F3 - C    4.06     4.10
  F3 - T    4.08     4.12
Main Effects Results
                Pct Effect   p-value
  Effect(F2)    0.98%        .001
  Effect(F3)    0.49%        0.032
[Figure: F2xF3 Interaction, No Interaction; y-axis: Average Clicks per User, 4.05 to 4.13; x-axis: F2 - C, F2 - T; lines: F3 - C, F3 - T]

When interaction is statistically significant
Two types of interactions:
  Synergistic: when the presence of both is more than the sum of the individual treatments
  Antagonistic: when the presence of both is less than the sum of the individuals

Case 2: Synergistic Interaction
Data Table
            F2 - C   F2 - T
  F3 - C    4.08     4.09
  F3 - T    4.08     4.13
[Figure: Main Effects Results table and F2xF3 Synergistic Interaction plot; y-axis: 4.11 to 4.14; x-axis: F2 - C, F2 - T; lines: F3 - C, F3 - T]

Case 3: Antagonistic Interaction
Data Table
            F2 - C   F2 - T
  F3 - C    4.08     4.11
  F3 - T    4.12     4.11
[Figure: Main Effects Results table and F2xF3 Antagonistic Interaction plot; y-axis: 4.07 to 4.13; x-axis: F2 - C, F2 - T; lines: F3 - C, F3 - T]
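The three cases can be told apart numerically by comparing the effect of applying both Treatments with the sum of the two individual effects. The sketch below uses the cell means as read from the three data tables (F2 level first, F3 level second). It only computes the point estimate: deciding whether an interaction is real still requires the hypothesis test, and the names and tolerance are assumptions for illustration.

```python
def interaction_contrast(cells):
    """Effect of both Treatments minus the sum of the individual effects.

    cells maps (f2_level, f3_level) -> average clicks per user,
    where each level is "C" or "T".
    """
    base = cells[("C", "C")]
    effect_f2 = cells[("T", "C")] - base
    effect_f3 = cells[("C", "T")] - base
    effect_both = cells[("T", "T")] - base
    return effect_both - (effect_f2 + effect_f3)


def classify(cells, tol=1e-9):
    """Label the direction of the estimated interaction."""
    contrast = interaction_contrast(cells)
    if contrast > tol:
        return "synergistic"
    if contrast < -tol:
        return "antagonistic"
    return "none"


# Cell means from the three data tables.
CASE_1 = {("C", "C"): 4.06, ("C", "T"): 4.08, ("T", "C"): 4.10, ("T", "T"): 4.12}
CASE_2 = {("C", "C"): 4.08, ("C", "T"): 4.08, ("T", "C"): 4.09, ("T", "T"): 4.13}
CASE_3 = {("C", "C"): 4.08, ("C", "T"): 4.12, ("T", "C"): 4.11, ("T", "T"): 4.11}
```

A contrast of zero corresponds to parallel lines in the interaction plot. A positive contrast means the lines spread apart in the synergistic direction, and a negative contrast means they converge or cross.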

Current Model
  Pre-roll ad played before first content stream
  Don’t disturb users by playing ad when a content stream is playing
  Ad stream played before the content stream when content streams played for more than 180 seconds continuously

Business Questions
  Could removing pre-roll ad stream attract more returning users?
  Could shortening the minimum time between two ad streams attract more returning users?
  Would ad stream gain from returning users offset the loss of not playing pre-roll or playing ad less frequently?

Experiment Design
  Factor 1: Play (Control) or Do Not Play pre-roll
  Factor 2: 5 levels of minimum time between two ad streams: 90, 120, 180 (Control), 300, 900 seconds
Users who received treatments in two week observation window continued to receive treatments and were monitored for the following six weeks for their return rate

Assuming the Overall Evaluation Criterion (OEC) is Percent of Returning Users
Vote for result on Factor 1:
1. Playing pre-roll is statistically significantly better
2. Flat (no statistical difference)
3. Playing pre-roll is statistically significantly worse

Vote for result on Factor 2: which of the following attracts statistically significantly more returning users?
1. 90 seconds
2. 120 seconds
3. 180 seconds
4. 300 seconds
5. 900 seconds
6. Flat (no difference)

[Figure: Return Rate by Factor 1; x-axis: week 1 to week 6; y-axis: 6% to 13%; series: Ad, Content]

[Figure: Return Rate by Factor 2; x-axis: week 1 to week 6; y-axis: 6% to 13%; series: 90, 120, 180, 300, 900]

[Figure: Content First vs. Ads First; x-axis: week 1 to week 6; series: Content - 90, Content - 120, Content - 180, Content - 300, Content - 900, Ad - 90, Ad - 120, Ad - 180, Ad - 300, Ad - 900]

Variance calculations for metrics
Non-parametric alternatives to t-test, ANOVA
Robot detection
Automatic detection of interesting population segments
Experimentation with exploration/exploitation schemes
Predicting when a metric will be significant

Metrics that are not “per user” currently use bootstrap to estimate variance
Can we get a formula to take into account correlation of experimental units?
Example: Clickthrough rate (CTR) per experiment
  True variance is much larger than that from Binomial distribution

Permutation or Mann-Whitney tests are natural
Pros
  Can get a p-value
  May have better power for some metrics
  Works better for small sample sizes
Cons
  Understandability by business managers
  Can be computationally intensive
  Confidence intervals for effect not straight-forward
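To make the trade-offs concrete, here is a minimal sketch of a permutation test on the difference in means. It uses a random sample of permutations rather than all of them, and adds one to both counts so the p-value is never exactly zero. The function name and defaults are assumptions for illustration.

```python
import random
from statistics import mean


def permutation_p_value(treatment, control, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for a difference in means."""
    rng = random.Random(seed)
    observed = abs(mean(treatment) - mean(control))
    pooled = list(treatment) + list(control)
    n_t = len(treatment)
    extreme = 0
    for _ in range(n_permutations):
        # Under no difference, group labels are exchangeable, so reshuffle them.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_t]) - mean(pooled[n_t:]))
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
    return (extreme + 1) / (n_permutations + 1)
```

The loop makes the computational cost visible: every permutation recomputes the metric over all users, which is cheap for small samples but can become expensive for experiments with very many users and metrics.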

What is “best” way to develop heuristics to detect robots?
What is “best” way to assess how well heuristics are doing?
How to adjust robot detection parameters based on site in the test?
For example
  Sites with low traffic may need more aggressive robot filtering
  Sites that expect active users (e.g. many clicks per hour) need less aggressive robot filtering
  Sites that have more robot traffic may need more aggressive robot filtering

A population segment is interesting if their response to the Treatment is different from the overall response
Segments can be defined by a number of variables
  Browser or operating system
  Referrer (e.g. from search engine, etc.)
  Signed-in status
  Loyalty
  Demographics
  Location: country, state, size of city (use IP lookup)
  Bandwidth

Want to automatically display best content based on exploration/exploitation strategy
Is this strategy better than editor-placed content?
What are the optimal parameter values?
  Percent in exploration group?
  How long to test content in exploration group?
  What level of significance is needed?

After experiment has run for some period of time and we have estimates of effect and standard deviation, can we give a helpful estimate of how long experiment needs to run in order to get a significant result for a particular metric?
  Statistical philosophical issues
  Technical issues
