ROUGE: A Package for Automatic Evaluation of Summaries


ROUGE: A Package for Automatic Evaluation of Summaries
Chin-Yew Lin
Information Sciences Institute, University of Southern California
Workshop on Text Summarization Branches Out, Barcelona, Spain, July 25-26, 2004

Summarization Evaluation

- Basic assumptions
  – We know how to summarize.
  – We know what a good summary should be.
- The reality
  – Everyone summarizes.
  – Everyone has his/her own good summary.
- The question
  – Is objective evaluation of summarization possible, if everyone has his/her own good summary?

MT and Summarization Evaluations

- Machine Translation
  – Inputs: reference translation, candidate translation
  – Methods: manually compare the two translations in adequacy, fluency, and informativeness; automatic evaluation using BLEU/NIST scores
- Automatic Summarization
  – Inputs: reference summary, candidate summary
  – Methods: manually compare the two summaries in content overlap and linguistic qualities; automatic evaluation?

Document Understanding Conference (DUC)

- Part of the US DARPA TIDES Project
- DUC 01-04 (http://duc.nist.gov)
- Tasks
  – Single-doc summarization (DUC 01 and 02: 30 topics)
  – Single-doc headline generation (DUC 03: 30 topics; DUC 04: 50 topics)
  – Multi-doc summarization
    - Generic 10, 50, 100, 200 (2002), and 400 (2001) word summaries
    - Short summaries of about 100 words in three different tasks in 2003:
      » focused by an event (30 TDT clusters)
      » focused by a viewpoint (30 TREC clusters)
      » in response to a question (30 TREC Novelty track clusters)
    - Short summaries of about 665 bytes in three different tasks in 2004:
      » focused by an event (50 TDT clusters)
      » focused by an event, but with documents translated into English from Arabic (24 topics)
      » in response to a "who is X?" question (50 persons)
- Participants
  – 15 systems in DUC 2001, 17 in DUC 2002, 21 in DUC 2003, and 25 in DUC 2004
- A new 3-year roadmap will be released during the summer.

DUC 2003 Human vs. Human (1)

Nenkova and Passonneau (HLT/NAACL 2004)

DUC 2003 Human vs. Human (2)

1. Can we get consensus among humans?
2. If yes, how many humans do we need to get consensus?
3. Single reference or multiple references?

DUC 2003 Human vs. Human (3)

- Can we get a stable estimation of human or system performance?
- How many samples do we need to achieve this?

Summary of Research Issues

- How to accommodate human inconsistency?
- Can we obtain stable evaluation results despite using only a single reference summary per evaluation?
- Will inclusion of multiple summaries make evaluation more or less stable?
- How can multiple references be used to improve the stability of evaluations?
- How is stability of evaluation affected by sample size?

Recent Results

- Van Halteren and Teufel (2003)
  – A stable consensus factoid summary could be obtained if 40 to 50 reference summaries were considered.
  – 50 manual summaries of one text.
- Nenkova and Passonneau (2003)
  – A stable consensus semantic content unit (SCU) summary could be obtained if at least 5 reference summaries were used.
  – 10 manual multi-doc summaries for three DUC 2003 topics.
- Hori et al. (2003)
  – Using multiple references would improve evaluation stability if the metric takes consensus into account.
  – 50 utterances in Japanese TV broadcast news, each with 25 manual summaries.
- Lin and Hovy (2003), Lin (2004)
  – ROUGE, an automatic evaluation method used in summarization (DUC 2004) and MT (Lin and Och, ACL, COLING 2004).

Automatic Evaluation of Summarization Using ROUGE

- ROUGE summarization evaluation package
  – The current version (v1.4.2) includes the following automatic evaluation methods:
    - ROUGE-N: n-gram based co-occurrence statistics
    - ROUGE-L: LCS-based statistics
    - ROUGE-W: weighted LCS-based statistics that favor consecutive LCSes (see ROUGE note)
    - ROUGE-S: skip-bigram-based co-occurrence statistics
    - ROUGE-SU: skip-bigram plus unigram-based co-occurrence statistics
  – Free download for research purposes at: http://www.isi.edu/~cyl/ROUGE

ROUGE-N

- N-gram co-occurrences between reference and candidate summaries.
  – Similar to BLEU in MT (Papineni et al. 2001)
- Higher-order ROUGE-N, with n-gram length greater than 1, estimates the fluency of summaries.
- Example:
  1. police killed the gunman (reference)
  2. police kill the gunman
  3. the gunman kill police
  – ROUGE-N: S2 = S3 (both match "police" and "the gunman")
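A minimal sketch of the n-gram overlap recall described above, assuming whitespace tokenization and a single reference; the released ROUGE script additionally handles stemming, stopword removal, and multiple references, none of which is shown here.

    from collections import Counter

    def ngrams(tokens, n):
        """Multiset of n-grams in a token list."""
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def rouge_n_recall(reference, candidate, n=2):
        """Clipped n-gram matches divided by the number of reference n-grams."""
        ref, cand = ngrams(reference.split(), n), ngrams(candidate.split(), n)
        matches = sum(min(count, cand[gram]) for gram, count in ref.items())
        total = sum(ref.values())
        return matches / total if total else 0.0

    s1 = "police killed the gunman"                       # reference
    print(rouge_n_recall(s1, "police kill the gunman"))   # ROUGE-2 = 1/3
    print(rouge_n_recall(s1, "the gunman kill police"))   # ROUGE-2 = 1/3, so S2 = S3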

ROUGE-L

- Longest Common Subsequence (LCS)
  – Given two sequences X and Y, a longest common subsequence of X and Y is a common subsequence with maximum length.
  – Intuition: the longer the LCS of two summaries is, the more similar the two summaries are. (Saggion et al. 2002, MEAD)
  – Score: use the LCS-based recall score (ROUGE-L) to estimate the similarity between two summaries. (See the paper for more details.)

ROUGE-L Example

- Example:
  1. police killed the gunman (reference)
  2. police kill the gunman
  3. the gunman kill police
- ROUGE-N: S2 = S3 ("police", "the gunman")
- ROUGE-L:
  – S2 = 3/4 ("police ... the gunman")
  – S3 = 2/4 ("the gunman")
  – S2 > S3
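The slide's numbers can be reproduced with a short LCS-recall sketch (whitespace tokenization assumed; the full ROUGE-L also combines recall and precision into an F-measure, which is omitted here).

    def lcs_length(x, y):
        """Length of the longest common subsequence, via dynamic programming."""
        dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
        for i in range(1, len(x) + 1):
            for j in range(1, len(y) + 1):
                if x[i - 1] == y[j - 1]:
                    dp[i][j] = dp[i - 1][j - 1] + 1
                else:
                    dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
        return dp[len(x)][len(y)]

    def rouge_l_recall(reference, candidate):
        """LCS length divided by the reference length."""
        ref, cand = reference.split(), candidate.split()
        return lcs_length(ref, cand) / len(ref)

    s1 = "police killed the gunman"                        # reference
    print(rouge_l_recall(s1, "police kill the gunman"))    # 3/4 = 0.75
    print(rouge_l_recall(s1, "the gunman kill police"))    # 2/4 = 0.5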

ROUGE-W

- Weighted Longest Common Subsequence
  – Example:
    X:  [A B C D E F G]
    Y1: [A B C D H I K]
    Y2: [A H B K C I D]
    ROUGE-L(Y1) = ROUGE-L(Y2)
  – ROUGE-W favors strings with consecutive matches.
  – It can be computed efficiently using dynamic programming.
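A sketch of the weighted-LCS dynamic program, assuming the k**alpha weighting with alpha = 1.2 used for R-W-1.2 in the result tables; it reproduces the point that Y1 (consecutive matches) outscores Y2 (scattered matches) even though their plain LCS lengths are equal.

    def wlcs(x, y, alpha=1.2):
        """Weighted LCS: a run of k consecutive matches contributes k**alpha."""
        f = lambda k: k ** alpha
        c = [[0.0] * (len(y) + 1) for _ in range(len(x) + 1)]  # weighted LCS scores
        w = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]    # consecutive-match run lengths
        for i in range(1, len(x) + 1):
            for j in range(1, len(y) + 1):
                if x[i - 1] == y[j - 1]:
                    k = w[i - 1][j - 1]
                    c[i][j] = c[i - 1][j - 1] + f(k + 1) - f(k)
                    w[i][j] = k + 1
                else:
                    c[i][j] = max(c[i - 1][j], c[i][j - 1])
                    w[i][j] = 0
        return c[len(x)][len(y)]

    def rouge_w_recall(reference, candidate, alpha=1.2):
        """Normalize by f(reference length) and invert the weighting function."""
        ref, cand = reference.split(), candidate.split()
        return (wlcs(ref, cand, alpha) / len(ref) ** alpha) ** (1 / alpha)

    X, Y1, Y2 = "A B C D E F G", "A B C D H I K", "A H B K C I D"
    print(rouge_w_recall(X, Y1))   # ~0.57: four consecutive matches
    print(rouge_w_recall(X, Y2))   # ~0.45: the same four matches, but scattered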

ROUGE-S

- Skip-Bigram
  – Any pair of words in their sentence order, allowing for arbitrary gaps.
  – Intuition
    - Considers long-distance dependencies.
    - Allows gaps in matches, like LCS, but counts all in-order pairs, while LCS counts only the longest subsequence.
  – Score: use the skip-bigram-based recall score (ROUGE-S) to estimate the similarity between two summaries. (See the paper for more details.)

ROUGE-S Example

- Example:
  1. police killed the gunman (reference)
  2. police kill the gunman
  3. the gunman kill police
  4. the gunman police killed
- ROUGE-N: S4 > S2 = S3
- ROUGE-L: S2 > S3 = S4
- ROUGE-S:
  – S2 = 3/6 ("police the", "police gunman", "the gunman")
  – S3 = 1/6 ("the gunman")
  – S4 = 2/6 ("the gunman", "police killed")
  – S2 > S4 > S3
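A sketch of skip-bigram recall that matches the fractions above, assuming unlimited skip distance (the ROUGE-S* setting). The max_gap parameter approximates the bounded variants such as ROUGE-S4 by limiting the number of words allowed between a pair; how the distance limit is counted here is an assumption, not the official definition.

    from itertools import combinations

    def skip_bigrams(tokens, max_gap=None):
        """Set of in-order word pairs; max_gap limits the words allowed between
        the two members of a pair (None = unlimited). Duplicate pairs are ignored."""
        pairs = set()
        for i, j in combinations(range(len(tokens)), 2):
            if max_gap is None or j - i - 1 <= max_gap:
                pairs.add((tokens[i], tokens[j]))
        return pairs

    def rouge_s_recall(reference, candidate, max_gap=None):
        """Matching skip-bigrams divided by the reference's skip-bigram count."""
        ref = skip_bigrams(reference.split(), max_gap)
        cand = skip_bigrams(candidate.split(), max_gap)
        return len(ref & cand) / len(ref) if ref else 0.0

    s1 = "police killed the gunman"                          # reference
    print(rouge_s_recall(s1, "police kill the gunman"))      # 3/6
    print(rouge_s_recall(s1, "the gunman kill police"))      # 1/6
    print(rouge_s_recall(s1, "the gunman police killed"))    # 2/6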

Evaluation of ROUGE

- Corpora
  – DUC 01, 02, and 03 evaluation data
  – Including human and system summaries
- Seven task formats
  – Single-doc 10 and 100 words; multi-doc 10, 50, 100, 200, and 400 words
- Three versions
  – CASE: the original summaries
  – STEM: the stemmed version of the summaries
  – STOP: STEM plus removal of stopwords
- Number of references
  – Single and different numbers of multiple references
- Quality criterion
  – Pearson's product-moment correlation coefficients between systems' average ROUGE scores and their human-assigned mean coverage scores
- Metrics
  – 17 ROUGE metrics: ROUGE-N with N = 1 to 9, ROUGE-L, ROUGE-W, ROUGE-S, and ROUGE-SU (with maximum skip-distance of 0, 4, and 9)
- Statistical significance
  – 95% confidence interval estimated using bootstrap resampling
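The methodology above reduces to two computations per condition: Pearson's correlation between per-system average ROUGE scores and the human mean coverage scores, plus a bootstrap 95% confidence interval. A minimal sketch, using hypothetical score lists (not the DUC numbers) and a simple percentile bootstrap, which may differ from the exact resampling scheme used in the paper:

    import random
    from statistics import mean

    def pearson(xs, ys):
        """Pearson product-moment correlation coefficient."""
        mx, my = mean(xs), mean(ys)
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        var_x = sum((x - mx) ** 2 for x in xs)
        var_y = sum((y - my) ** 2 for y in ys)
        if var_x == 0 or var_y == 0:   # degenerate resample with no spread
            return 0.0
        return cov / (var_x * var_y) ** 0.5

    def bootstrap_ci(xs, ys, n_resamples=1000, alpha=0.05, seed=0):
        """Percentile bootstrap confidence interval for the correlation."""
        rng, n, stats = random.Random(seed), len(xs), []
        for _ in range(n_resamples):
            idx = [rng.randrange(n) for _ in range(n)]
            stats.append(pearson([xs[i] for i in idx], [ys[i] for i in idx]))
        stats.sort()
        return stats[int(alpha / 2 * n_resamples)], stats[int((1 - alpha / 2) * n_resamples) - 1]

    # Hypothetical per-system scores, for illustration only.
    avg_rouge     = [0.38, 0.41, 0.35, 0.44, 0.40, 0.37]
    mean_coverage = [0.30, 0.34, 0.28, 0.37, 0.33, 0.31]
    print(pearson(avg_rouge, mean_coverage))
    print(bootstrap_ci(avg_rouge, mean_coverage))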

100 Words Single-Doc Task

Pearson correlations with human-assigned coverage scores.

DUC 2001, 100 words, single doc
                1 REF               3 REFS
Method      CASE STEM STOP      CASE STEM STOP
R-1         0.76 0.76 0.84      0.80 0.78 0.84
R-2         0.84 0.84 0.83      0.87 0.87 0.86
R-3         0.82 0.83 0.80      0.86 0.86 0.85
R-4         0.81 0.81 0.77      0.84 0.84 0.83
R-5         0.79 0.79 0.75      0.83 0.83 0.81
R-6         0.76 0.77 0.71      0.81 0.81 0.79
R-7         0.73 0.74 0.65      0.79 0.80 0.76
R-8         0.69 0.71 0.61      0.78 0.78 0.72
R-9         0.65 0.67 0.59      0.76 0.76 0.69
R-L         0.83 0.83 0.83      0.86 0.86 0.86
R-S*        0.74 0.74 0.80      0.78 0.77 0.82
R-S4        0.84 0.85 0.84      0.87 0.88 0.87
R-S9        0.84 0.85 0.84      0.87 0.88 0.87
R-SU*       0.74 0.74 0.81      0.78 0.77 0.83
R-SU4       0.84 0.84 0.85      0.87 0.87 0.87
R-SU9       0.84 0.84 0.85      0.87 0.87 0.87
R-W-1.2     0.85 0.85 0.85      0.87 0.87 0.87

DUC 2002, 100 words, single doc
                1 REF               2 REFS
Method      CASE STEM STOP      CASE STEM STOP
R-1         0.98 0.98 0.99      0.98 0.98 0.99
R-2         0.99 0.99 0.99      0.99 0.99 0.99
R-3         0.99 0.99 0.99      0.99 0.99 0.99
R-4         0.99 0.99 0.98      0.99 0.99 0.99
R-5         0.99 0.99 0.98      0.99 0.99 0.98
R-6         0.98 0.99 0.97      0.99 0.99 0.98
R-7         0.98 0.98 0.97      0.99 0.99 0.97
R-8         0.98 0.98 0.96      0.99 0.99 0.97
R-9         0.97 0.97 0.95      0.98 0.98 0.96
R-L         0.99 0.99 0.99      0.99 0.99 0.99
R-S*        0.98 0.98 0.98      0.98 0.97 0.98
R-S4        0.99 0.99 0.99      0.99 0.99 0.99
R-S9        0.99 0.99 0.99      0.99 0.99 0.99
R-SU*       0.98 0.98 0.98      0.98 0.98 0.98
R-SU4       0.99 0.99 0.99      0.99 0.99 0.99
R-SU9       0.99 0.99 0.99      0.99 0.99 0.99
R-W-1.2     0.99 0.99 0.99      0.99 0.99 0.99

10 Words Single-Doc Task

[Table: Pearson correlations for the DUC 2003 10-words single-doc task, 1 vs. 4 references, CASE/STEM/STOP; R-W-1.2 scored about 0.96 in all conditions.]

100 Words Multi-Doc Task

[Tables: Pearson correlations for (A1) DUC 2001, (A2) DUC 2002, and (A3) DUC 2003 100-words multi-doc tasks, single and multiple references, CASE/STEM/STOP.]

Multi-Doc Tasks of Different Summary Sizes

[Tables: Pearson correlations for (C) DUC 2002 10 words, (D1) DUC 2001 50 words, (D2) DUC 2002 50 words, (E1) DUC 2001 200 words, (E2) DUC 2002 200 words, and (F) DUC 2001 400 words multi-doc tasks, CASE/STEM/STOP.]

Summary of Results

- Overall
  – Using multiple references achieved better correlation with human judgment than using a single reference.
  – Using more samples achieved better correlation with human judgment (DUC 02 vs. other DUC data).
  – Stemming and removing stopwords improved correlation with human judgment.
  – The single-doc task had better correlation than the multi-doc tasks.
- Specific
  – ROUGE-S4, ROUGE-S9, and ROUGE-W-1.2 were the best in the 100 words single-doc task, but were statistically indistinguishable from most other ROUGE metrics.
  – ROUGE-1, ROUGE-L, ROUGE-SU4, ROUGE-SU9, and ROUGE-W-1.2 worked very well in the 10 words headline-like task (Pearson's ρ around 97%).
  – ROUGE-1, ROUGE-2, and ROUGE-SU* were the best in the 100 words multi-doc task, but were statistically equivalent to other ROUGE-S and ROUGE-SU metrics.
  – ROUGE-1, ROUGE-2, ROUGE-S, and ROUGE-SU worked well in the other multi-doc tasks.

Ongoing Work

- Summary and sentence level error analysis
  – Summary level
    - Evaluate techniques used in ETS' E-Rater and its successors in automatic evaluation of summaries.
  – Sentence level
    - Matching at the concept level instead of the lexical level:
      - Synonyms and paraphrases
      - Utilize consensus in reference summaries
    - Matching at the syntactic level:
      - Dependency-structure-based co-occurrence statistics
- Large-scale reference summary corpus creation

Q&A

Thank you!
