Re-Assessing the Usability Metric for User Experience (UMUX) Scale


Vol. 11, Issue 3, May 2016, pp. 89–109

Re-Assessing the Usability Metric for User Experience (UMUX) Scale

Mehmet Ilker Berkman, MSc, MA
Bahcesehir University, Faculty of Communication, Istanbul

Karahoca, PhD
Bahcesehir University, Faculty of Engineering, Istanbul

Abstract

The Usability Metric for User Experience (UMUX) and its shorter-form variant UMUX-LITE are recent additions to standardized usability questionnaires. UMUX aims to measure perceived usability by employing fewer items that are in closer conformance with the ISO 9241 definition of usability, while UMUX-LITE conforms to the technology acceptance model (TAM). UMUX has been criticized regarding its reliability, validity, and sensitivity, but these criticisms are mostly based on reported findings associated with the data collected by the developer of the questionnaire.

Our study re-evaluates the UMUX and UMUX-LITE scales using psychometric methods with data sets acquired through two usability evaluation studies: an online word processor evaluation survey (n = 405) and a web-based mind map software evaluation survey for three applications (n = 151). The data sets yielded similar results for indicators of reliability. Both UMUX and UMUX-LITE items were sensitive to the software when the scores for the evaluated software were not very close, but we could not detect a significant difference between the software when the scores were closer.

UMUX and UMUX-LITE items were also sensitive to users' level of experience with the software evaluated in this study. Neither of the scales was found to be sensitive to the participants' age, gender, or whether they were native English speakers. The scales significantly correlated with the System Usability Scale (SUS) and the Computer System Usability Questionnaire (CSUQ), indicating their concurrent validity. The parallel analysis of principal components of UMUX pointed out a single latent variable, which was confirmed through a factor analysis that showed the data fits better to a single-dimension factor structure.

Keywords

usability scale, psychometric evaluation, questionnaire, survey

Copyright 2015-2016, User Experience Professionals Association and the authors. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. URL:

Introduction

Questionnaires have been widely used to assess users' subjective attitudes related to their experience of using a computer system. Since the 1980s, human-computer interaction (HCI) researchers have developed several standardized usability evaluation instruments. Psychometric methodologies have been employed to develop standardized scales. Nunnally (1978) noted that the benefits of standardization are objectivity, replicability, quantification, economy, communication, and scientific generalization. Using standardized surveys, researchers can verify others' findings by replicating former studies. Powerful statistical methods can be applied to the quantified data collected. Standardized questionnaires save the effort required for developing a new research instrument and allow researchers to communicate their results more effectively. Finally, a series of studies conducted with the same standardized questionnaire is useful for developing generalized findings (Sauro & Lewis, 2012).

The methods for developing a standardized scale (i.e., psychometric methods) are well defined in the literature and have been applied to develop many scale instruments in clinical psychology, patient care, education, and marketing. Although these methods are well known, only a few usability scales have been fully evaluated through psychometric methodologies (Sauro & Lewis, 2012).

The Usability Metric for User Experience (UMUX) scale is a new addition to the set of standardized usability questionnaires and aims to measure perceived usability employing fewer items that are in closer conformity with the ISO 9241 (1998) definition of usability (Finstad, 2010). Although the scale was developed with a psychometric evaluation approach, it was criticized regarding its validity, reliability, and sensitivity (Bosley, 2013; Cairns, 2013; Lewis, 2013). However, these critical statements are based on the results of the original study by Finstad (2010). The only study that has attempted to replicate the results of the original study with different datasets (Lewis, Utesch, & Maher, 2013) had consistent findings on the validity and reliability of UMUX. In reference to their findings, Lewis et al. (2013) proposed a shorter form of UMUX, called UMUX-LITE, a two-item variant offered as a time-saving alternative.

Our study contributes to the psychometric evaluation of UMUX and UMUX-LITE, presenting additional evidence for their reliability, validity, and sensitivity by exploring the data collected via UMUX in two different usability evaluation studies. The study also provides a comparison of adjusted (Lewis et al., 2013) and unadjusted UMUX-LITE scores.

Psychometric Evaluation

Psychometrics was intended for studies of individual differences (Nunnally, 1975). Over the decades, the methods of psychometrics were used intensively by researchers in the fields of psychology and educational sciences. As those disciplines predominantly concentrate on developing standardized scales to identify individual differences, psychometric methods have been of considerable interest in the related literature. In the late 1980s, psychometric methods drew attention in HCI as well, because standardized scales became part of the usability testing process to assess the quality of use of a software product from the users' subjective point of view.

The primary measures of a scale's quality are reliability, validity, and sensitivity. Consistency of measurement refers to the reliability of the scale. The extent to which a scale measures what it claims to measure is its validity. Besides being reliable and valid, a scale should also be sensitive to experimental manipulations such as changes made in the selection of participants or attributes of the evaluated products.

The reliability of a scale can be assessed by three different approaches: test-retest reliability, alternate-form reliability, and internal consistency reliability.
In the test-retest approach, scale items are applied to the same group of participants twice, leaving a time interval between the two sessions. Alternate-form questionnaires are intended to measure the same concept with parallel items, introducing some changes to the wording and order of the items. A high correlation between test-retest administrations, or between two alternative forms of a questionnaire, indicates reliability. However, internal consistency, which is typically equated with Cronbach's coefficient alpha, is a widely accepted approach for indicating a scale's reliability because it is easier to obtain from one set of data. The proportion of a scale's total variance that can be attributed to a latent variable underlying the items is referred to as alpha (DeVellis, 2011).

Internal consistency estimates the average correlation among items within a test. If Cronbach's alpha, the indicator of correlation among items, is low, the test is either too short or the items have very little in common. Thus, Cronbach's alpha is a highly regarded indicator of reliability. Cronbach's alpha is reported to be remarkably similar to the alternate-forms correlation when tests are applied to more than 300 subjects (Nunnally, 1978).

A scale should be examined for criterion validity and construct validity. Criterion validity can be studied in two sub-types: concurrent and predictive. Concurrent validity examines how well one instrument stacks up against another instrument. A scale can be compared with a prior scale, seeking a correlation between their results. This approach is extensively used in usability scale development studies. Predictive validity, on the other hand, is quite similar to concurrent validity, but it is more related to how the instrument's results overlap with future measurements. The Pearson correlation between the instrument and other measurements indicates criterion validity.

Construct validation requires the specification of a domain of observables related to the construct at the initial stage. A construct is an abstract and latent structure rather than a concrete and observable one. Empirical research and statistical analyses, such as factor analysis, should be conducted to determine the extent to which some of the observables in the domain tend to measure the same construct or several different constructs.
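Coefficient alpha, described above, can be computed directly from an item-response matrix. The sketch below is illustrative only: the respondent data is invented, and the function simply implements the standard variance-based formula.

```python
# Cronbach's alpha, a minimal sketch: alpha = k/(k-1) * (1 - sum(item variances) / variance(total score)).
import numpy as np

def cronbach_alpha(responses):
    """responses: respondents x items matrix of Likert scores."""
    responses = np.asarray(responses, dtype=float)
    k = responses.shape[1]                          # number of items
    item_vars = responses.var(axis=0, ddof=1)       # per-item variances
    total_var = responses.sum(axis=1).var(ddof=1)   # variance of summed scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical data: items that rise and fall together yield a high alpha.
consistent = [[5, 5, 4, 5], [2, 1, 2, 2], [4, 4, 4, 3], [1, 2, 1, 1], [3, 3, 4, 3]]
print(round(cronbach_alpha(consistent), 2))  # -> 0.96
```

A low result on real data would suggest either too few items or items that share little common variance, exactly as the text describes.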
Subsequent studies are then conducted to determine the extent to which supposed measures of the construct are consistent with "best guesses" about the construct (Nunnally, 1978).

Another way of examining constructs is to explore convergent validity, which seeks a correlation of the instrument with another predictor. From an HCI point of view, survey results can also be compared to the user performance data gathered in usability test sessions. Significant correlations between measurements believed to correspond to the same construct provide evidence of convergent validity. To clarify the constructs, discriminant validity can be explored by checking whether there is a mismatch between the instrument's results and those of other instruments that claim to measure a dissimilar construct.

For a researcher in clinical psychology, patient care, education, or marketing, sensitivity is the change in the responses to a questionnaire across different participants with different attributes. For these disciplines, scales are designed to identify individual differences. However, from an HCI point of view, a usability scale is expected to be sensitive to the differences between systems rather than those between the people who use them. When using the scale to evaluate different systems, one would expect different results, which, in turn, is proof of the scale's sensitivity.

Psychometrics of Usability Scales

Cairns (2013) characterized the evaluation of questionnaires as a series of questions within the context of usability. Validity can be characterized with the question, "Does the questionnaire really measure usability?" When searching for the face validity of a usability questionnaire, Cairns asked, "Do the questions look like sensible questions for measuring usability?" Convergent or concurrent validity seeks the answer to the question, "To what extent does the questionnaire agree with other measures of usability?" Building on convergent validity, the predictive validity of a questionnaire can be assessed by asking, "Does the questionnaire accurately predict the usability of systems?" Discriminant validity is the degree to which the questionnaire differentiates "from concepts that are not usability, for example, trust, product support, and so on." Sensitivity, on the other hand, seeks to answer, "To what extent does the measure pick up on differences in usability between systems?" (p. 312).

There are two types of usability questionnaires: post-study and post-task. Post-study questionnaires are administered at the end of a study. Post-task questionnaires, which are shorter in form, using three items at most, are applied right after the user completes a task to gather contextual information for each task. We briefly review the post-study questionnaires in the following section because UMUX, which can also be employed to collect data in a field survey, was initially designed to be used as a post-study questionnaire.
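The concurrent and convergent validity checks described above typically come down to a Pearson correlation between per-respondent scores on the new instrument and on an established one. A minimal sketch, with invented scores on a hypothetical new scale and on SUS:

```python
# Pearson correlation between two sets of per-respondent scale scores.
import numpy as np

def pearson_r(x, y):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xd, yd = x - x.mean(), y - y.mean()
    return float((xd * yd).sum() / np.sqrt((xd ** 2).sum() * (yd ** 2).sum()))

new_scale = [72, 45, 88, 60, 51, 79, 66, 90]  # hypothetical 0-100 scores
sus_scale = [75, 50, 85, 58, 55, 80, 70, 92]  # same respondents on SUS
r = pearson_r(new_scale, sus_scale)
print(round(r, 2))
```

A strong positive r between the two instruments is the kind of evidence cited throughout this paper for concurrent validity; as noted later, even moderate correlations (.30 to .40) can justify a psychometric instrument.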

Post-Study Questionnaires

The Questionnaire for User Interaction Satisfaction (QUIS) was developed as a 27-item, 9-point bipolar scale representing five latent variables related to the usability construct. Chin, Diehl, and Norman (1988) developed the scale by assessing 150 QUIS forms that were completed for the evaluation of 46 different software programs. The study reported a significant difference in the QUIS results collected for menu-driven applications and command line systems, which provided evidence for the scale's sensitivity.

The Software Usability Measurement Inventory (SUMI) consists of 50 items with a 3-point Likert scale representing five latent variables (Kirakowski, 1996). Kirakowski's research provided evidence for construct validity and sensitivity by reporting on the collection of over 1,000 surveys that evaluated 150 different software products. The results affirm that the SUMI is sensitive, as it distinguished two different word processors in work and laboratory settings, while it also produced significantly different scores for two versions of the same product.

The Post-Study System Usability Questionnaire (PSSUQ) initially consisted of 19 items with a 7-point Likert scale and a not applicable (N/A) option. The Computer System Usability Questionnaire (CSUQ) is its variant for field studies (Lewis, 1992; 2002). Three latent variables (subscales), represented by 19 items, are system quality (SysUse), information quality (InfoQual), and interface quality (IntQual). Lewis (2002) offered a 16-item short version that was capable of assessing the same sub-dimensions and used data from 21 different usability studies to evaluate the PSSUQ. He explored the sensitivity of the PSSUQ score for significance of difference across several conditions, such as the study during which the participants completed the PSSUQ, the company that developed the evaluated software, the stage of software development, the type of software product, the type of evaluation, the gender of participants, and the completeness of the survey form. As a variant of PSSUQ, CSUQ is designed to assess the usability of a software product without conducting scenario-based usability tests in a laboratory environment (Lewis, 1992; 1995; 2002). Thus, CSUQ is useful across different user groups and research settings.

The System Usability Scale (SUS) was developed for a "quick and dirty" evaluation of usability (Brooke, 1996). Although "it had been developed at the same time period with PSSUQ, it had been less influential since there had been no peer-reviewed research published on its psychometric properties" (Lewis, 2002, p. 464) until the end of the 2000s. After it was evaluated through psychometric methods (Bangor, Kortum, & Miller, 2008; Lewis & Sauro, 2009), it was validated as a unidimensional scale, but some studies suggested that its items represent two constructs: usable and learnable (Borsci, Federici, & Lauriola, 2009; Lewis & Sauro, 2009). SUS consists of 10 items with a 5-point Likert scale. It is reported to provide significantly different scores for different interface types (Bangor et al., 2008) and for different studies (Lewis & Sauro, 2009). Although the SUS score is not affected by gender differences, there is a correlation between the age of participants and the score given to the evaluated applications. It is known that SUS items are not sensitive to participants' native language after a minor change in Item 8, where the word "cumbersome" is replaced with "awkward" (Finstad, 2006).

UMUX has four items with a 7-point Likert scale and a Cronbach's alpha coefficient of .94. Lewis, Utesch, and Maher (2013) reported the coefficient alpha as .87 and .81 for two different surveys. Finstad reported a single underlying construct that conformed to the ISO 9241 definition of usability. However, Lewis et al. (2013) stated that "UMUX had a clear bidimensional structure with positive-tone items aligning with one factor and negative-tone items aligning with the other" (p. 2101). They also reported that UMUX significantly correlated with the standard SUS (r = .90, p < .01) and another version of SUS in which all items are aligned to have a positive tone (r = .79, p < .01). These values are lower than the correlation between SUS and UMUX reported in the original study by Finstad (2010; r = .96, p < .01). However, moderate correlations (with absolute values as small as .30 to .40) are often large enough to justify the use of psychometric instruments (Nunnally, 1978). Accordingly, both studies provided evidence for the concurrent validity of UMUX. To investigate the sensitivity of UMUX to differences between systems, Finstad (2010) conducted a survey study of two systems (n = 273; n = 285). The t tests denoted that both UMUX and SUS produce a significant difference between the scores of the two systems.
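As a concrete illustration of this kind of sensitivity test, the sketch below scores hypothetical UMUX responses and compares two systems with a Welch t statistic. The scoring convention used here is the commonly reported one for UMUX (positive-tone items 1 and 3 recoded as score minus 1, negative-tone items 2 and 4 as 7 minus score, sum rescaled to 0-100); all response data is invented.

```python
# UMUX scoring plus a Welch t statistic between two systems, a minimal sketch.
import numpy as np

def umux_score(i1, i2, i3, i4):
    # Each 7-point item is recoded to 0..6; the sum (0..24) is rescaled to 0..100.
    recoded = (i1 - 1) + (7 - i2) + (i3 - 1) + (7 - i4)
    return recoded / 24 * 100

def welch_t(a, b):
    # Welch's t statistic: difference of means over the unpooled standard error.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    return (a.mean() - b.mean()) / se

system_a = [umux_score(6, 2, 7, 1), umux_score(7, 1, 6, 2), umux_score(6, 2, 6, 2)]
system_b = [umux_score(3, 5, 4, 5), umux_score(2, 6, 3, 5), umux_score(4, 5, 3, 6)]
print(welch_t(system_a, system_b))  # a large |t| suggests a real difference
```

With realistic sample sizes (hundreds of respondents, as in Finstad's study), the t statistic would be referred to a t distribution to obtain the significance levels reported in the text.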

The two-item variant of UMUX, UMUX-LITE (Lewis et al., 2013), is based on the two positive-tone items of UMUX, which are items 1 and 3. These items have a connection with the technology acceptance model (TAM) from the market research literature, which assesses usefulness and ease-of-use. UMUX-LITE has a reliability estimate of .82 and .83 on two different surveys, which is excellent for a two-item survey. These items correlated with the standard and positive versions of SUS at .81 and .85 (p < .01). The correlation of UMUX-LITE with a likelihood-to-recommend (LTR) item was above .7. These findings indicated the concurrent validity of UMUX-LITE. On the other hand, Lewis et al. (2013) reported a significant difference between SUS and UMUX-LITE scores calculated based on items 1 and 3 of UMUX. For this reason, they adjusted the UMUX-LITE score with a regression formula to compensate for the difference. A recent study (Lewis, Utesch, & Maher, 2015) confirmed that the suggested formula worked well on an independent data set. Borsci, Federici, Bacci, Gnaldi, and Bartolucci (2015) also replicated previous findings of similar magnitudes for SUS and adjusted UMUX-LITE.

Table 1 gives a quick review of the post-study questionnaires and presents information about their psychometric evaluation.

Table 1. Post-Study Questionnaires

| Scale name | No. of items | No. of subscales | Scale type | Reliability | Evidence for validity | Evidence for sensitivity | Reference | No. of respondents |
| QUIS | 27 | 5 | Bipolar (9) | .94 | Yes | Yes | Chin et al., 1988 | 150 |
| SUMI | 50 | 5 | Likert (3) | .92 | Yes | Yes | Kirakowski, 1996 | 1,000 |
| PSSUQ | 16 | 3 | Likert (7), N/A option | .94 | Yes | Yes | Lewis, 1992; Lewis, 2002 | 48; 210 |
| CSUQ | 16 | 3 | Likert (7), N/A option | .89 | Yes | Yes | Lewis, 1995 | 377 |
| SUS | 10 | 2 | Likert (5) | .92; .91 | Yes | Yes | Lewis & Sauro, 2009; Bangor et al., 2008 | 324; 2,324 |
| UMUX | 4 | 1 | Likert (7) | .94; .81, .87 | Yes | Yes | Finstad, 2010; Lewis et al., 2013 | 558; 402, 389 |
| UMUX-LITE | 2 | - | Likert (7) | .82, .83 | Yes | Yes | Lewis et al., 2013 | 402, 389 |

Criticism of UMUX and Related Work

Lewis (2013) criticized Finstad (2010) for his "motivation for scale development," "structural analysis of the scale factors," and "scale validity." The major criticism of the structural analysis points out that UMUX should be evaluated not only with a principal component analysis for its factorability, but that Finstad should also have conducted "a rotated principal or confirmatory (maximum likelihood) factor analysis to evaluate the possibility that UMUX has two or more underlying factors" (Lewis, 2013, p. 322). Finstad (2013) conducted a maximum-likelihood analysis to respond to this criticism using the data he collected in the 2010 study. The results verified the one-factor solution.

The criticism of scale validity echoes Finstad's (2013) own statement about the limitation of the study: that an industrial study should be conducted to confirm the scale's correlation with "task-based metrics like completion rates, completion times, and error rates" (p. 328). Cairns (2013) also criticized the UMUX study for the same reason, in that it does not attempt "to directly position itself in relation to objective measures of usability" (p. 314). He pointed out that the shortness of the questionnaire would also cause participants to provide "socially desirable" responses.
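The dimensionality question at the center of this debate can be probed with a parallel-analysis-style check: retain only those components whose correlation-matrix eigenvalues exceed the eigenvalues obtained from random data of the same shape. The sketch below uses synthetic responses driven by a single latent variable, so it is illustrative only, not a reanalysis of any UMUX data set.

```python
# Parallel-analysis-style dimensionality check on synthetic four-item data.
import numpy as np

rng = np.random.default_rng(0)

# Simulate 300 respondents whose four item scores share one latent variable.
latent = rng.normal(size=(300, 1))
items = latent + 0.5 * rng.normal(size=(300, 4))

def sorted_eigs(data):
    # Eigenvalues of the items' correlation matrix, largest first.
    corr = np.corrcoef(data, rowvar=False)
    return np.sort(np.linalg.eigvalsh(corr))[::-1]

observed = sorted_eigs(items)
# Baseline: average eigenvalues over 50 random data sets of the same shape.
baseline = np.mean([sorted_eigs(rng.normal(size=items.shape)) for _ in range(50)], axis=0)
n_factors = int((observed > baseline).sum())
print(n_factors)  # one retained component suggests a single latent factor
```

On data generated this way, only the first eigenvalue clears the random baseline, which is the pattern a unidimensional scale would show; a clearly bidimensional scale would retain two components.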
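For reference, the UMUX-LITE adjustment discussed earlier can be sketched as follows. The regression coefficients (0.65 and 22.9) are those reported by Lewis et al. (2013); the raw-score convention (items 1 and 3 on 7-point scales, rescaled to 0-100) is an assumption of this sketch.

```python
# Raw and regression-adjusted UMUX-LITE scores, a minimal sketch.

def umux_lite_raw(item1, item3):
    # Two 7-point items, each recoded to 0..6, rescaled to 0..100.
    return (item1 + item3 - 2) * (100 / 12)

def umux_lite_adjusted(item1, item3):
    # Regression adjustment reported to bring UMUX-LITE closer to SUS.
    return 0.65 * umux_lite_raw(item1, item3) + 22.9

print(umux_lite_raw(7, 7), umux_lite_adjusted(7, 7))  # -> 100.0 and ~87.9
```

Note that the adjustment compresses the range: the best possible adjusted score is about 87.9 and the worst is 22.9, which is why the paper compares adjusted and unadjusted scores separately.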
