Test Reliability—Basic Concepts


Research Memorandum ETS RM–18-01

Test Reliability—Basic Concepts

Samuel A. Livingston

January 2018

ETS Research Memorandum Series

EIGNOR EXECUTIVE EDITOR
James Carlson, Principal Psychometrician

ASSOCIATE EDITORS
Beata Beigman Klebanov, Senior Research Scientist
John Mazzeo, Distinguished Presidential Appointee
Heather Buzick, Senior Research Scientist
Donald Powers, Principal Research Scientist
Brent Bridgeman, Distinguished Presidential Appointee
Gautam Puhan, Principal Psychometrician
Keelan Evanini, Research Director
John Sabatini, Managing Principal Research Scientist
Marna Golub-Smith, Principal Psychometrician
Elizabeth Stone, Research Scientist
Shelby Haberman, Distinguished Research Scientist, Edusoft
Rebecca Zwick, Distinguished Presidential Appointee
Anastassia Loukina, Research Scientist

PRODUCTION EDITORS
Kim Fryer, Manager, Editing Services
Ayleen Gontz, Senior Editor

Since its 1947 founding, ETS has conducted and disseminated scientific research to support its products and services, and to advance the measurement and education fields. In keeping with these goals, ETS is committed to making its research freely available to the professional community and to the general public. Published accounts of ETS research, including papers in the ETS Research Memorandum series, undergo a formal peer-review process by ETS staff to ensure that they meet established scientific and professional standards. All such ETS-conducted peer reviews are in addition to any reviews that outside organizations may provide as part of their own publication processes. Peer review notwithstanding, the positions expressed in the ETS Research Memorandum series and other published accounts of ETS research are those of the authors and not necessarily those of the Officers and Trustees of Educational Testing Service.

The Daniel Eignor Editorship is named in honor of Dr. Daniel R. Eignor, who from 2001 until 2011 served the Research and Development division as Editor for the ETS Research Report series. The Eignor Editorship has been created to recognize the pivotal leadership role that Dr. Eignor played in the research publication process at ETS.

Test Reliability—Basic Concepts

Samuel A. Livingston
Educational Testing Service, Princeton, New Jersey

January 2018

Corresponding author: S. A. Livingston, E-mail: slivingston@ets.org

Suggested citation: Livingston, S. A. (2018). Test reliability—Basic concepts (Research Memorandum No. RM-18-01). Princeton, NJ: Educational Testing Service.

Find other ETS-published reports by searching the ETS ReSEARCHER database at http://search.ets.org/researcher/

To obtain a copy of an ETS research report, please

Editor: Gautam Puhan
Reviewers: Shelby Haberman and Marna Golub-Smith

Copyright 2018 by Educational Testing Service. All rights reserved.

ETS, the ETS logo, GRE, MEASURING THE POWER OF LEARNING, and TOEFL are registered trademarks of Educational Testing Service (ETS). All other trademarks are the property of their respective owners.

Abstract

The reliability of test scores is the extent to which they are consistent across different occasions of testing, different editions of the test, or different raters scoring the test taker's responses. This guide explains the meaning of several terms associated with the concept of test reliability: "true score," "error of measurement," "alternate-forms reliability," "interrater reliability," "internal consistency," "reliability coefficient," "standard error of measurement," "classification consistency," and "classification accuracy." It also explains the relationship between the number of questions, problems, or tasks in the test and the reliability of the scores.

Key words: reliability, true score, error of measurement, alternate-forms reliability, interrater reliability, internal consistency, reliability coefficient, standard error of measurement, classification consistency, classification accuracy

Preface

This guide grew out of a class that I teach for staff at Educational Testing Service (ETS). The class is a nonmathematical introduction to the topic, emphasizing conceptual understanding and practical applications. The class consists of illustrated lectures, interspersed with written exercises for the participants. I have included the exercises in this guide, at roughly the same points as they occur in the class. The answers are in the appendix at the end of the guide.

In preparing this guide, I have tried to capture as much as possible of the conversational style of the class. I have used the word "we" to refer to myself and most of my colleagues in the testing profession. (We tend to agree on most of the topics discussed in this guide, and I think it will be clear where we do not.)

Table of Contents

Instructional Objectives
Prerequisite Knowledge
What Factors Influence a Test Score?
The Luck of the Draw
Reducing the Influence of Chance Factors
Exercise: Test Scores and Chance
What Is Reliability?
Reliability Is Consistency
Reliability and Validity
Exercise: Reliability and Validity
Consistency of What Information?
"True Score" and "Error of Measurement"
Reliability and Measurement Error
Exercise: Measurement Error
Reliability and Sampling
Alternate-Forms Reliability and Internal Consistency
Interrater Reliability
Test Length and Reliability
Exercise: Interrater Reliability and Alternate-Forms Reliability
Reliability and Precision
Reliability Statistics
The Reliability Coefficient
The Standard Error of Measurement
How Are the Reliability Coefficient and the Standard Error of Measurement Related?
Test Length and Alternate-Forms Reliability
Number of Raters and Interrater Reliability
Reliability of Differences Between Scores
Demystifying the Standard Error of Measurement
Exercise: The Reliability Coefficient and the Standard Error of Measurement
Reliability of Essay Tests
Reliability of Classifications and Decisions
Summary
Acknowledgments
Appendix. Answers to Exercises
  Exercise: Test Scores and Chance
  Exercise: Reliability and Validity
  Exercise: Measurement Error
  Exercise: Interrater Reliability and Alternate-Forms Reliability
  Exercise: The Reliability Coefficient and the Standard Error of Measurement
Notes

Instructional Objectives

Here is a list of things I hope you will be able to do after you have read this guide and done the written exercises:

- List three important ways in which chance can affect a test taker's score and some things that test makers can do to reduce these effects.
- Give a brief, correct explanation of the concept of test reliability.
- Explain the difference between reliability and validity and how these two concepts are related.
- Explain the meaning of the terms "true score" and "error of measurement" and why it is wise to avoid using these terms to communicate with people outside the testing profession.
- Give an example of an unwanted effect on test scores that is not considered "error of measurement."
- Explain what alternate-forms reliability is and why it is important.
- Explain what interrater reliability is, why it is important, and how it is related to alternate-forms reliability.
- Explain what "internal consistency" is, why it is often used to estimate reliability, and when it is likely to be a poor estimate.
- Explain what the reliability coefficient is, what it measures, and what additional information is necessary to make it meaningful.
- Explain what the standard error of measurement is, what it measures, and what additional information is necessary to make it meaningful.
- Explain how the reliability coefficient and the standard error of measurement are related.
- Describe the relationship between the length of a test (the number of questions, problems, or tasks) and its alternate-forms reliability.

- Explain why the length of a constructed-response test (the number of separate tasks) often affects its interrater reliability.
- Explain what "classification consistency" and "classification accuracy" are and how they are related.

Prerequisite Knowledge

This guide emphasizes concepts, not mathematics. However, it does include explanations of some statistics commonly used to describe test reliability. I assume that the reader is familiar with the following basic statistical concepts, at least to the extent of knowing and understanding the definitions given below. These definitions are all expressed in the context of educational testing, although the statistical concepts are more general.

Score distribution: The number (or the percentage) of test takers at each score level.

Mean score: The average score, computed by summing the scores of all test takers and dividing by the number of test takers.

Standard deviation: A measure of the amount of variation in a set of scores. It can be interpreted as the average distance of scores from the mean. (Actually, it is a special kind of average called a "root mean square," computed by squaring the distance of each score from the mean score, averaging the squared distances, and then taking the square root.)

Correlation: A measure of the strength and direction of the relationship between the scores of the same people on two tests.

What Factors Influence a Test Score?

Whenever a person takes a test, several factors influence the test taker's score. The most important factor (and usually the one with the greatest influence) is the extent to which the test taker has the knowledge and skills that the test is supposed to measure. But the test taker's score will often depend to some extent on other kinds of knowledge and skills that the test is not supposed to measure.

Reading ability and writing ability often influence students' scores on tests that are not intended to measure those abilities.
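As a brief aside, the statistical definitions given under Prerequisite Knowledge (mean, standard deviation as a root mean square, correlation) can be written out in code. This is a minimal sketch for readers who find code clearer than prose; the function names and the scores for the two test forms are invented for illustration, not taken from the guide:

```python
import math

def mean(scores):
    """Average score: sum the scores of all test takers, divide by their number."""
    return sum(scores) / len(scores)

def standard_deviation(scores):
    """Root mean square distance from the mean: square each score's distance
    from the mean, average the squared distances, take the square root."""
    m = mean(scores)
    return math.sqrt(sum((s - m) ** 2 for s in scores) / len(scores))

def correlation(scores_x, scores_y):
    """Pearson correlation of the same people's scores on two tests."""
    mx, my = mean(scores_x), mean(scores_y)
    sx, sy = standard_deviation(scores_x), standard_deviation(scores_y)
    n = len(scores_x)
    return sum((x - mx) * (y - my)
               for x, y in zip(scores_x, scores_y)) / (n * sx * sy)

# Hypothetical scores of the same five test takers on two editions of a test.
form_a = [10, 12, 14, 16, 18]
form_b = [11, 11, 15, 15, 18]

print(mean(form_a))                # 14.0
print(standard_deviation(form_a))  # about 2.83
print(correlation(form_a, form_b)) # about 0.95: a strong positive relationship
```

A correlation near +1, as here, means people who score high on one edition tend to score high on the other; a correlation near 0 would mean the two sets of scores are essentially unrelated.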
Another influence is the collection of skills we call "test-wiseness." One such skill is using testing time efficiently. Another is knowing when and how to guess on a multiple-choice test. A kind of test-wiseness that is often useful on an essay test is

knowing how to include relevant knowledge you have, for which the question does not specifically ask.

One factor that can influence a test score is the test taker's alertness and concentration on the day of the test. In test taking, as in many other activities, most people perform better on some days than on others. If you take a test on a day when you are alert and able to concentrate, your score is likely to be higher than it would be if you took it on a day when you were drowsy or distracted.

On most tests, the questions or problems that the test taker is confronted with are not the only ones that could have been included. Different editions of the test include different questions or problems intended to measure the same kinds of knowledge or skill. At some point in your education, you have probably been lucky enough to take a test that just happened to ask about the things you knew. And you have probably had the frustrating experience of taking a test that happened to include questions about several specific things you did not know. Very few test takers (if any) would perform equally well on any set of questions that the test could include. A test taker who is strong in the abilities the test is measuring will perform well on any edition of the test—but not equally well on every edition of the test.

When a classroom teacher gives the students an essay test, typically there is only one rater—the teacher. That rater usually is the only user of the scores and is not concerned about whether the ratings would be consistent with those of another rater. But when an essay test is part of a large-scale testing program, the test takers' essays will not all be scored by the same rater. Raters in those programs are trained to apply a single set of criteria and standards in rating the essays. Still, a test taker's essay might be scored by a rater who especially likes that test taker's writing style or approach to that particular question. Or it might be scored by a rater who particularly dislikes the test taker's style or approach. In either case, the rater's reaction is likely to influence the rating. Therefore, a test taker's score can depend on which raters happened to score that test taker's essays. This factor affects any test that is scored by a process that involves judgment.

The Luck of the Draw

Which of these influences on a test taker's score can reasonably be assumed to be operating effectively at random? Where does chance ("the luck of the draw") enter into the measurement process?

Does chance affect the test taker's level of the knowledge or skills that the test is intended to measure? In the testing profession, we make a distinction between the test taker's knowledge of the specific questions on the test and the more general body of knowledge that those questions represent. We believe that each test taker has a general level of knowledge that applies to any set of questions that might have appeared on that person's test, and that this general level of knowledge is not affected by chance.

What we do consider as chance variation is the test taker's ability to answer the specific questions or solve the specific problems on the edition of the test that the test taker took. We reason that the test taker could have been presented with a different set of questions or problems that met the specifications for the test. That set of questions or problems might have been somewhat harder (or easier) for this test taker, even if they were not harder (or easier) for most other test takers.

Does chance affect the test taker's level of other kinds of knowledge and skills that affect a person's test score even though the test is not intended to measure them? Again, we make a distinction between the test taker's general level of those skills and the effect of taking the particular edition of the test that the test taker happened to take. We believe that a test taker's general level of these skills is not affected by chance, but the need for these skills could help to make a particular edition of the test especially hard (or easy) for a particular test taker.

What about the test taker's alertness or concentration on the day of the test? We generally think of it as a chance factor, because it affects different test takers' scores differently and unpredictably. (That is, the effect is unpredictable from our point of view!)

When different test takers' essays are scored by different raters, we generally consider the selection of raters who score a test taker's essay to be a chance factor. (The same reasoning applies to other kinds of performance tests.) In some testing programs, we make sure that selection of raters is truly a chance factor, by using a random process to assign responses to raters. But even when there is no true randomization, we think that the selection of raters should be considered a chance factor affecting the test taker's score.

Reducing the Influence of Chance Factors

What can we testing professionals do to reduce the influence of these chance factors on the test takers' scores? How can we make our testing process yield scores that depend as little as possible on the luck of the draw?
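The "luck of the draw" in question selection can be made concrete with a small simulation. This is a sketch under invented assumptions (a hypothetical pool of 100 questions and a test taker who happens to know 70 of them), not a model from the guide itself:

```python
import random

random.seed(7)  # fixed seed so the illustration is repeatable

# Hypothetical setup: a pool of 100 possible questions. This test taker knows
# the answers to exactly 70 of them. That general level of knowledge (70%)
# is fixed -- it is not affected by chance.
pool = list(range(100))
known = set(random.sample(pool, 70))

def score_on_random_form(num_questions=20):
    """Draw one edition of the test (a random sample of questions from the
    pool) and count how many of them this test taker can answer."""
    form = random.sample(pool, num_questions)
    return sum(1 for q in form if q in known)

# The same test taker, five different 20-question editions of the test:
scores = [score_on_random_form() for _ in range(5)]
print(scores)  # scores vary from edition to edition, around 14 (70% of 20)
```

The test taker's general knowledge never changes between runs; only the sample of questions does. Yet the scores differ from one edition to the next, which is exactly the chance variation described above.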

We cannot do much to reduce the effect of day-to-day differences in a test taker's concentration and alertness (beyond advising test takers not to be tired or hungry on the day of the test). We could reduce the effect of these differences if we could give the test in several parts, each part on a different day, but such a testing procedure would not be practical for most tests. On most tests that have important consequences for the test taker, test takers who think their performance was unusually weak can retake the test, usually after waiting a specified time.

There are some things we can do to reduce the effect of the specific selection of questions or problems presented to the test taker. We can create detailed specifications for the content and format of the test questions or problems, so that the questions on different forms will measure the same set of knowledge and skills. We can avoid reporting scores based on only a few multiple-choice questions or problems.

