CS3300 - Compiler Design - Parsing

CS3300 - Compiler Design: Parsing
V. Krishna Nandivada, IIT Madras

Acknowledgement: These slides borrow liberal portions of text verbatim from Antony L. Hosking @ Purdue, Jens Palsberg @ UCLA, and the Dragon book.

Copyright (c) 2019 by Antony L. Hosking. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from hosking@cs.purdue.edu.

The role of the parser

source code -> scanner -> tokens -> parser -> IR (errors reported along the way)

A parser:
- performs context-free syntax analysis
- guides context-sensitive analysis
- constructs an intermediate representation
- produces meaningful error messages
- attempts error correction

For the next several classes, we will look at parser construction.

Syntax analysis by using a CFG

Context-free syntax is specified with a context-free grammar. Formally, a CFG G is a 4-tuple (Vt, Vn, S, P), where:
- Vt is the set of terminal symbols in the grammar. For our purposes, Vt is the set of tokens returned by the scanner.
- Vn, the nonterminals, is a set of syntactic variables that denote sets of (sub)strings occurring in the language. These are used to impose a structure on the grammar.
- S is a distinguished nonterminal (S ∈ Vn) denoting the entire set of strings in L(G). This is sometimes called the goal symbol.
- P is a finite set of productions specifying how terminals and non-terminals can be combined to form strings in the language. Each production must have a single non-terminal on its left-hand side.

The set V = Vt ∪ Vn is called the vocabulary of G.
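The 4-tuple (Vt, Vn, S, P) can be written down directly as data. This small Python sketch (an illustrative encoding of the expression grammar used throughout these slides, not part of the original material) makes the definition concrete:

```python
# A CFG G = (Vt, Vn, S, P), encoded as plain Python data.
Vt = {"num", "id", "+", "-", "*", "/"}      # terminals (tokens from the scanner)
Vn = {"goal", "expr", "op"}                  # non-terminals (syntactic variables)
S = "goal"                                   # the distinguished goal symbol

# Productions: each LHS (a single non-terminal) maps to its RHS alternatives.
P = {
    "goal": [["expr"]],
    "expr": [["expr", "op", "expr"], ["num"], ["id"]],
    "op":   [["+"], ["-"], ["*"], ["/"]],
}

V = Vt | Vn                                  # the vocabulary of G

# Sanity checks: the goal is a non-terminal, and every symbol used
# on a right-hand side belongs to the vocabulary.
assert S in Vn
assert all(sym in V for alts in P.values() for alt in alts for sym in alt)
```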

Notation and terminology

- a, b, c, ... ∈ Vt
- A, B, C, ... ∈ Vn
- U, V, W, ... ∈ V
- α, β, γ, ... ∈ V*
- u, v, w, ... ∈ Vt*

If A → γ then αAβ ⇒ αγβ is a single-step derivation using A → γ. Similarly, ⇒* and ⇒+ denote derivations of ≥ 0 and ≥ 1 steps.

If S ⇒* β then β is said to be a sentential form of G.
L(G) = {w ∈ Vt* | S ⇒+ w}; w ∈ L(G) is called a sentence of G.
Note, L(G) = {β ∈ V* | S ⇒* β} ∩ Vt*.

Syntax analysis

Grammars are often written in Backus-Naur form (BNF). Example:

1. <goal> ::= <expr>
2. <expr> ::= <expr> <op> <expr>
3.          | num
4.          | id
5. <op>   ::= +
6.          | -
7.          | *
8.          | /

This describes simple expressions over numbers and identifiers. In a BNF for a grammar, we represent
1. non-terminals with angle brackets or capital letters
2. terminals with typewriter font or underline
3. productions as in the example

Derivations

We can view the productions of a CFG as rewriting rules. Using our example CFG (for x + 2 * y):

<goal> ⇒ <expr>
       ⇒ <expr> <op> <expr>
       ⇒ <id,x> <op> <expr>
       ⇒ <id,x> + <expr>
       ⇒ <id,x> + <expr> <op> <expr>
       ⇒ <id,x> + <num,2> <op> <expr>
       ⇒ <id,x> + <num,2> * <expr>
       ⇒ <id,x> + <num,2> * <id,y>

We have derived the sentence x + 2 * y. We denote this <goal> ⇒* id + num * id. Such a sequence of rewrites is a derivation or a parse. The process of discovering a derivation is called parsing.

Derivations

At each step, we choose a non-terminal to replace. This choice can lead to different derivations. Two are of particular interest:
- leftmost derivation: the leftmost non-terminal is replaced at each step
- rightmost derivation: the rightmost non-terminal is replaced at each step

The previous example was a leftmost derivation.

Rightmost derivation

For the string x + 2 * y:

<goal> ⇒ <expr>
       ⇒ <expr> <op> <expr>
       ⇒ <expr> <op> <id,y>
       ⇒ <expr> * <id,y>
       ⇒ <expr> <op> <expr> * <id,y>
       ⇒ <expr> <op> <num,2> * <id,y>
       ⇒ <expr> + <num,2> * <id,y>
       ⇒ <id,x> + <num,2> * <id,y>

Again, <goal> ⇒* id + num * id.

Precedence

[Parse tree for this derivation: goal at the root; expr expands to expr op expr with op = *; the left expr expands to <id,x> + <num,2>, the right expr to <id,y>.]

Treewalk evaluation computes (x + 2) * y, the "wrong" answer! It should be x + (2 * y).

Precedence

These two derivations point out a problem with the grammar: it has no notion of precedence, or implied order of evaluation. To add precedence takes additional machinery:

1. <goal>   ::= <expr>
2. <expr>   ::= <expr> + <term>
3.            | <expr> - <term>
4.            | <term>
5. <term>   ::= <term> * <factor>
6.            | <term> / <factor>
7.            | <factor>
8. <factor> ::= num
9.            | id

This grammar enforces a precedence on the derivation: terms must be derived from expressions, which forces the "correct" tree.

Precedence

Now, for the string x + 2 * y:

<goal> ⇒ <expr>
       ⇒ <expr> + <term>
       ⇒ <expr> + <term> * <factor>
       ⇒ <expr> + <term> * <id,y>
       ⇒ <expr> + <factor> * <id,y>
       ⇒ <expr> + <num,2> * <id,y>
       ⇒ <term> + <num,2> * <id,y>
       ⇒ <factor> + <num,2> * <id,y>
       ⇒ <id,x> + <num,2> * <id,y>

Again, <goal> ⇒* id + num * id, but this time, we build the desired tree.

Precedence

[Parse tree for the precedence grammar: goal → expr; expr → expr + term, where the left expr derives <id,x> through term and factor, and the right term → term * factor derives <num,2> * <id,y>.]

Treewalk evaluation computes x + (2 * y).

Ambiguity

If a grammar has more than one derivation for a single sentential form, then it is ambiguous. Example:

<stmt> ::= if <expr> then <stmt>
         | if <expr> then <stmt> else <stmt>
         | other stmts

Consider deriving the sentential form:

if E1 then if E2 then S1 else S2

It has two derivations. This ambiguity is purely grammatical. It is a context-free ambiguity.

Ambiguity

We may be able to eliminate ambiguities by rearranging the grammar:

<stmt>      ::= <matched>
              | <unmatched>
<matched>   ::= if <expr> then <matched> else <matched>
              | other stmts
<unmatched> ::= if <expr> then <stmt>
              | if <expr> then <matched> else <unmatched>

This generates the same language as the ambiguous grammar, but applies the common-sense rule: match each else with the closest unmatched then. This is most likely the language designer's intent.

Ambiguity

Ambiguity is often due to confusion in the context-free specification. Context-sensitive confusions can arise from overloading. Example:

a = f(17)

In many Algol/Scala-like languages, f could be a function or a subscripted variable. Disambiguating this statement requires context:
- need values of declarations
- not context-free
- really an issue of type

Rather than complicate parsing, we will handle this separately.

Scanning vs. parsing

Where do we draw the line?

term ::= [a-zA-Z]([a-zA-Z] | [0-9])*
       | 0 | [1-9][0-9]*
op   ::= + | - | * | /
expr ::= (term op)* term

Regular expressions are used to classify:
- identifiers, numbers, keywords
- REs are more concise and simpler for tokens than a grammar
- more efficient scanners can be built from REs (DFAs) than from grammars

Context-free grammars are used to count:
- brackets: (), begin...end, if...then...else
- imparting structure: expressions

Syntactic analysis is complicated enough: a grammar for C has around 200 productions. Factoring out lexical analysis as a separate phase makes the compiler more manageable.

Parsing: the big picture

grammar -> parser generator -> parser; tokens -> parser -> IR

Our goal is a flexible parser generator system.

Different ways of parsing: top-down vs. bottom-up

Top-down parsers:
- start at the root of the derivation tree and fill in
- pick a production and try to match the input
- may require backtracking
- some grammars are backtrack-free (predictive)

Bottom-up parsers:
- start at the leaves and fill in
- start in a state valid for legal first tokens
- as input is consumed, change state to encode possibilities (recognize valid prefixes)
- use a stack to store both state and sentential forms

Top-down parsing

A top-down parser starts with the root of the parse tree, labelled with the start or goal symbol of the grammar. To build a parse, it repeats the following steps until the fringe of the parse tree matches the input string:
1. At a node labelled A, select a production A → α and construct the appropriate child for each symbol of α.
2. When a terminal is added to the fringe that doesn't match the input string, backtrack.
3. Find the next node to be expanded (it must have a label in Vn).

The key is selecting the right production in step 1. If the parser makes a wrong step, the "derivation" process does not terminate. Why is that bad?

Left-recursion

Top-down parsers cannot handle left-recursion in a grammar. Formally, a grammar is left-recursive if

∃ A ∈ Vn such that A ⇒+ Aα for some string α.

Our simple expression grammar is left-recursive.

Eliminating left-recursion

To remove left-recursion, we can transform the grammar. Consider the grammar fragment:

<foo> ::= <foo> α
        | β

where α and β do not start with <foo>. We can rewrite this as:

<foo> ::= β <bar>
<bar> ::= α <bar>
        | ε

where <bar> is a new non-terminal. This fragment contains no left-recursion.

How much lookahead is needed?

We saw that top-down parsers may need to backtrack when they select the wrong production. Do we need arbitrary lookahead to parse CFGs?
- in general, yes
- use the Earley or Cocke-Younger-Kasami algorithms

Fortunately:
- large subclasses of CFGs can be parsed with limited lookahead
- most programming language constructs can be expressed in a grammar that falls in these subclasses

Among the interesting subclasses are:
- LL(1): left-to-right scan, left-most derivation, 1-token lookahead; and
- LR(1): left-to-right scan, reversed right-most derivation, 1-token lookahead

Predictive parsing

Basic idea: For any two productions A → α | β, we would like a distinct way of choosing the correct production to expand.

For some RHS α ∈ G, define FIRST(α) as the set of tokens that appear first in some string derived from α. That is, for some w ∈ Vt*, w ∈ FIRST(α) iff α ⇒* wγ.

Key property: Whenever two productions A → α and A → β both appear in the grammar, we would like

FIRST(α) ∩ FIRST(β) = ∅.

This would allow the parser to make a correct choice with a lookahead of only one symbol!
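The <foo>/<bar> rewrite above is mechanical enough to sketch as code. This is a sketch, assuming productions are stored as lists of symbol lists; the primed name for the new non-terminal is my convention:

```python
def eliminate_immediate_left_recursion(A, alts):
    """Split A's alternatives into left-recursive (A -> A alpha) and
    non-recursive (A -> beta), then rewrite as
        A  -> beta A'
        A' -> alpha A' | epsilon
    Returns a dict of new productions; [] denotes epsilon."""
    recursive = [alt[1:] for alt in alts if alt and alt[0] == A]   # the alphas
    others = [alt for alt in alts if not alt or alt[0] != A]       # the betas
    if not recursive:
        return {A: alts}                       # no left-recursion to remove
    A2 = A + "'"                               # the new non-terminal <bar>
    return {
        A: [beta + [A2] for beta in others],
        A2: [alpha + [A2] for alpha in recursive] + [[]],  # [] = epsilon
    }

# Example: expr -> expr + term | term   becomes
#          expr -> term expr'  and  expr' -> + term expr' | epsilon
new = eliminate_immediate_left_recursion(
    "expr", [["expr", "+", "term"], ["term"]])
```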

Left factoring

What if a grammar does not have this property? Sometimes, we can transform a grammar to have this property.

For each non-terminal A, find the longest prefix α common to two or more of its alternatives. If α ≠ ε then replace all of the A productions

A → αβ1 | αβ2 | ... | αβn

with

A  → αA'
A' → β1 | β2 | ... | βn

where A' is a new non-terminal. Repeat until no two alternatives for a single non-terminal have a common prefix.

Example

There are two non-terminals to left factor:

<expr> ::= <term> + <expr>
         | <term> - <expr>
         | <term>

<term> ::= <factor> * <term>
         | <factor> / <term>
         | <factor>

Applying the transformation:

<expr>  ::= <term> <expr'>
<expr'> ::= + <expr> | - <expr> | ε
<term>  ::= <factor> <term'>
<term'> ::= * <term> | / <term> | ε

Indirect left-recursion elimination

Given a left-factored CFG, to eliminate left-recursion:

Input: Grammar G with no cycles and no ε-productions.
Output: Equivalent grammar with no left-recursion.

begin
  Arrange the non-terminals in some order A1, A2, ..., An;
  foreach i = 1 ... n do
    foreach j = 1 ... i-1 do
      Say the ith production is: Ai → Aj γ,
      and Aj → δ1 | δ2 | ... | δk;
      Replace the ith production by:
        Ai → δ1 γ | δ2 γ | ... | δk γ;
    Eliminate immediate left-recursion in Ai;

Generality

Question: By left factoring and eliminating left-recursion, can we transform an arbitrary context-free grammar to a form where it can be predictively parsed with a single token lookahead?

Answer: Given a context-free grammar that doesn't meet our conditions, it is undecidable whether an equivalent grammar exists that does meet our conditions.

Many context-free languages do not have such a grammar:

{a^n 0 b^n | n ≥ 1} ∪ {a^n 1 b^2n | n ≥ 1}

We must look past an arbitrary number of a's to discover the 0 or the 1 and so determine the derivation.
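The left-factoring transformation can likewise be sketched as a small routine. This is a sketch (helper names are mine): one call performs one round of factoring for a single non-terminal, to be repeated until nothing changes, as the slide prescribes:

```python
def common_prefix(a, b):
    """Longest common prefix of two symbol lists."""
    out = []
    for x, y in zip(a, b):
        if x != y:
            break
        out.append(x)
    return out

def left_factor_once(A, alts):
    """One round of left factoring for non-terminal A: find the longest
    prefix alpha shared by two or more alternatives, then replace
    A -> alpha b1 | ... | alpha bn  with  A -> alpha A' and A' -> b1 | ... | bn.
    Returns (changed, productions-dict); [] denotes epsilon."""
    best = []
    for i in range(len(alts)):
        for j in range(i + 1, len(alts)):
            p = common_prefix(alts[i], alts[j])
            if len(p) > len(best):
                best = p
    if not best:
        return False, {A: alts}                # no shared prefix: done
    A2 = A + "'"                               # fresh non-terminal A'
    factored = [alt[len(best):] for alt in alts if alt[:len(best)] == best]
    rest = [alt for alt in alts if alt[:len(best)] != best]
    return True, {A: rest + [best + [A2]], A2: factored}

# The dangling-if example: the shared prefix "if E then S" is factored out.
changed, prods = left_factor_once("stmt",
    [["if", "E", "then", "S"],
     ["if", "E", "then", "S", "else", "S"],
     ["other"]])
```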

Recursive descent parsing

int A()
begin
  foreach production of the form A → X1 X2 X3 ... Xk do
    for i = 1 to k do
      if Xi is a non-terminal then
        if (Xi() != 0) then
          backtrack; break;          // Try the next production
      else if Xi matches the current input symbol a then
        advance the input to the next symbol;
      else
        backtrack; break;            // Try the next production
    if i == k + 1 then
      return 0;                      // Success
  return 1;                          // Failure

Recursive descent parsing

- The parser backtracks in general; in practice it may not do much.
- How to backtrack?
- A left-recursive grammar will lead to an infinite loop.

Non-recursive predictive parsing

Rather than writing recursive code, we build tables. Why? Building tables can be automated, easily.

Table-driven parsers

A parser generator system often looks like:

grammar -> parser generator -> parser + parsing tables; source code -> scanner -> tokens -> parser -> IR

This is true for both top-down (LL) and bottom-up (LR) parsers.

Now, a predictive parser looks like: scanner -> tokens -> table-driven parser (consulting parsing tables) -> IR. This also uses a stack, but mainly to remember part of the input string; there is no recursion.
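For the left-factored expression grammar from the earlier slides, the generic backtracking scheme above specializes to one procedure per non-terminal with no backtracking at all (the grammar is LL(1)). A Python sketch, with class and method names of my choosing (E2/T2 stand for E'/T'):

```python
class Parser:
    """Recursive-descent recognizer for:
         E -> T E'    E' -> + E | - E | eps
         T -> F T'    T' -> * T | / T | eps
         F -> num | id
    One method per non-terminal; '$' marks end of input."""
    def __init__(self, tokens):
        self.toks = tokens + ["$"]
        self.pos = 0

    def peek(self):
        return self.toks[self.pos]

    def eat(self, t):
        if self.peek() != t:
            raise SyntaxError(f"expected {t}, got {self.peek()}")
        self.pos += 1

    def E(self):
        self.T(); self.E2()

    def E2(self):
        if self.peek() in ("+", "-"):          # E' -> + E | - E
            self.eat(self.peek()); self.E()
        # else: E' -> eps (take the epsilon production, consume nothing)

    def T(self):
        self.F(); self.T2()

    def T2(self):
        if self.peek() in ("*", "/"):          # T' -> * T | / T
            self.eat(self.peek()); self.T()

    def F(self):
        if self.peek() in ("num", "id"):       # F -> num | id
            self.eat(self.peek())
        else:
            raise SyntaxError(f"unexpected {self.peek()}")

    def parse(self):
        self.E(); self.eat("$")
        return True

# x + 2 * y tokenizes to: id + num * id
assert Parser(["id", "+", "num", "*", "id"]).parse()
```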

FIRST

For a string of grammar symbols α, define FIRST(α) as:
- the set of terminals that begin strings derived from α: {a ∈ Vt | α ⇒* aβ}
- if α ⇒* ε then ε ∈ FIRST(α)

FIRST(α) contains the tokens valid in the initial position in α.

To build FIRST(X):
1. If X ∈ Vt then FIRST(X) is {X}.
2. If X → ε then add ε to FIRST(X).
3. If X → Y1 Y2 ... Yk:
   (a) put FIRST(Y1) - {ε} in FIRST(X)
   (b) ∀ i: 1 < i ≤ k, if ε ∈ FIRST(Y1) ∩ ... ∩ FIRST(Yi-1) (i.e., Y1 ... Yi-1 ⇒* ε) then put FIRST(Yi) - {ε} in FIRST(X)
   (c) if ε ∈ FIRST(Y1) ∩ ... ∩ FIRST(Yk) then put ε in FIRST(X)
Repeat until no more additions can be made.

FOLLOW

For a non-terminal A, define FOLLOW(A) as the set of terminals that can appear immediately to the right of A in some sentential form. Thus, a non-terminal's FOLLOW set specifies the tokens that can legally appear after it. A terminal symbol has no FOLLOW set.

To build FOLLOW(A):
1. Put $ in FOLLOW(<goal>).
2. If A → αBβ:
   (a) put FIRST(β) - {ε} in FOLLOW(B)
   (b) if β = ε (i.e., A → αB) or ε ∈ FIRST(β) (i.e., β ⇒* ε) then put FOLLOW(A) in FOLLOW(B)
Repeat until no more additions can be made.

LL(1) grammars

Previous definition: A grammar G is LL(1) iff for all non-terminals A, each distinct pair of productions A → β and A → γ satisfies the condition FIRST(β) ∩ FIRST(γ) = ∅.

What if A ⇒* ε?

Revised definition: A grammar G is LL(1) iff for each set of productions A → α1 | α2 | ... | αn:
1. FIRST(α1), FIRST(α2), ..., FIRST(αn) are all pairwise disjoint, and
2. if αi ⇒* ε then FIRST(αj) ∩ FOLLOW(A) = ∅, ∀ 1 ≤ j ≤ n, j ≠ i.

If G is ε-free, condition 1 is sufficient.

LL(1) grammars

Provable facts about LL(1) grammars:
1. No left-recursive grammar is LL(1).
2. No ambiguous grammar is LL(1).
3. Some languages have no LL(1) grammar.
4. An ε-free grammar where each alternative expansion for A begins with a distinct terminal is a simple LL(1) grammar.

Example: S → aS | a is not LL(1) because FIRST(aS) = FIRST(a) = {a}. The grammar

S  → aS'
S' → aS' | ε

accepts the same language and is LL(1).
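The two fixed-point constructions above can be sketched together. This is a sketch under my own encoding: EPS stands for ε, "$" is the end marker, and E2/T2 stand for E'/T' in the usage example:

```python
EPS = "eps"   # stands in for epsilon

def first_of_string(symbols, FIRST):
    """FIRST of a string alpha = X1 X2 ... Xn, given per-symbol FIRST sets.
    For the empty string this correctly yields {EPS}."""
    out = set()
    for X in symbols:
        out |= FIRST[X] - {EPS}
        if EPS not in FIRST[X]:
            return out                # Xi cannot vanish: stop here
    out.add(EPS)                      # every Xi can derive epsilon
    return out

def first_follow(Vt, Vn, S, P):
    """Fixed-point computation of FIRST and FOLLOW for (Vt, Vn, S, P).
    P maps each non-terminal to a list of RHS alternatives; [] is epsilon."""
    FIRST = {t: {t} for t in Vt}
    FIRST.update({A: set() for A in Vn})
    FOLLOW = {A: set() for A in Vn}
    FOLLOW[S].add("$")                          # rule 1: $ follows the goal
    changed = True
    while changed:                              # repeat until no additions
        changed = False
        for A, alts in P.items():
            for alt in alts:
                f = first_of_string(alt, FIRST)
                if not f <= FIRST[A]:
                    FIRST[A] |= f; changed = True
                for i, B in enumerate(alt):     # FOLLOW rules: A -> alpha B beta
                    if B not in Vn:
                        continue
                    fb = first_of_string(alt[i + 1:], FIRST)
                    add = fb - {EPS}
                    if EPS in fb:               # beta can vanish
                        add = add | FOLLOW[A]
                    if not add <= FOLLOW[B]:
                        FOLLOW[B] |= add; changed = True
    return FIRST, FOLLOW

# The left-factored expression grammar (E2/T2 written for E'/T'):
P = {
    "E":  [["T", "E2"]],
    "E2": [["+", "E"], ["-", "E"], []],
    "T":  [["F", "T2"]],
    "T2": [["*", "T"], ["/", "T"], []],
    "F":  [["num"], ["id"]],
}
FIRST, FOLLOW = first_follow(
    {"+", "-", "*", "/", "num", "id"}, set(P), "E", P)
```

The results reproduce the FIRST/FOLLOW table on the following slides, e.g. FOLLOW(F) = {+, -, *, /, $}.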

LL(1) parse table construction

Input: Grammar G
Output: Parsing table M
Method:
1. ∀ productions A → α:
   (a) ∀ a ∈ FIRST(α), add A → α to M[A, a]
   (b) if ε ∈ FIRST(α):
       i.  ∀ b ∈ FOLLOW(A), add A → α to M[A, b]
       ii. if $ ∈ FOLLOW(A) then add A → α to M[A, $]
2. Set each undefined entry of M to error.

If some M[A, a] has multiple entries then the grammar is not LL(1).
Note: recall a, b ∈ Vt, so a, b ≠ ε.

Example

Our long-suffering expression grammar:

1.  S  → E
2.  E  → T E'
3.  E' → + E
4.       | - E
5.       | ε
6.  T  → F T'
7.  T' → * T
8.       | / T
9.       | ε
10. F  → num
11.      | id

FIRST and FOLLOW sets:

      FIRST        FOLLOW
S     num, id      $
E     num, id      $
E'    ε, +, -      $
T     num, id      +, -, $
T'    ε, *, /      +, -, $
F     num, id      +, -, *, /, $

Parse table M (entries are production numbers; blanks are errors):

      id   num   +   -   *   /   $
S     1    1
E     2    2
E'               3   4           5
T     6    6
T'               9   9   7   8   9
F     11   10

Table-driven predictive parsing

Input: A string w and a parsing table M for a grammar G
Output: If w is in L(G), a leftmost derivation of w; otherwise, indicate an error

push $ onto the stack; push S onto the stack;
a points to the input tape;
X = stack.top();
while X != $ do
  if X is a then
    stack.pop(); advance the input;
  else if X is a terminal then
    error();
  else if M[X, a] is an error entry then
    error();
  else if M[X, a] = X → Y1 Y2 ... Yk then
    output the production X → Y1 Y2 ... Yk;
    stack.pop();
    push Yk, Yk-1, ..., Y1 in that order;
  X = stack.top();

A grammar that is not LL(1)

<stmt> ::= if <expr> then <stmt>
         | if <expr> then <stmt> else <stmt>
         | ...

Left-factored:

<stmt>  ::= if <expr> then <stmt> <stmt'> | ...
<stmt'> ::= else <stmt> | ε

Now, FIRST(<stmt'>) = {ε, else}. Also, FOLLOW(<stmt'>) = {else, $}.
But FIRST(<stmt'>) ∩ FOLLOW(<stmt'>) = {else} ≠ ∅.

On seeing else, there is a conflict between choosing <stmt'> ::= else <stmt> and <stmt'> ::= ε, so the grammar is not LL(1)!

The fix: Put priority on <stmt'> ::= else <stmt> to associate else with the closest previous then.
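The driver algorithm above can be sketched directly. The dictionary M below transcribes the slide's parse table for the expression grammar (E'/T' written as E2/T2; names are my encoding, and [] denotes the ε production):

```python
# M[(nonterminal, lookahead)] = right-hand side to expand; blanks are errors.
M = {
    ("S", "id"): ["E"],        ("S", "num"): ["E"],
    ("E", "id"): ["T", "E2"],  ("E", "num"): ["T", "E2"],
    ("E2", "+"): ["+", "E"],   ("E2", "-"): ["-", "E"],   ("E2", "$"): [],
    ("T", "id"): ["F", "T2"],  ("T", "num"): ["F", "T2"],
    ("T2", "*"): ["*", "T"],   ("T2", "/"): ["/", "T"],
    ("T2", "+"): [],           ("T2", "-"): [],           ("T2", "$"): [],
    ("F", "id"): ["id"],       ("F", "num"): ["num"],
}
NONTERMS = {"S", "E", "E2", "T", "T2", "F"}

def ll1_parse(tokens):
    """Table-driven predictive parse: returns the list of productions used
    (a leftmost derivation) or raises SyntaxError."""
    stack = ["$", "S"]                     # push $ then the start symbol
    inp = tokens + ["$"]
    pos, output = 0, []
    while stack[-1] != "$":
        X, a = stack[-1], inp[pos]
        if X == a:                         # terminal on top matches input
            stack.pop(); pos += 1
        elif X not in NONTERMS:            # terminal mismatch
            raise SyntaxError(f"expected {X}, got {a}")
        elif (X, a) not in M:              # error entry
            raise SyntaxError(f"no rule for {X} on {a}")
        else:
            rhs = M[(X, a)]
            output.append((X, rhs))        # emit the production used
            stack.pop()
            stack.extend(reversed(rhs))    # push Yk ... Y1 so Y1 is on top
    if inp[pos] != "$":
        raise SyntaxError("trailing input")
    return output

derivation = ll1_parse(["id", "+", "num", "*", "id"])
```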

Another example of painful left-factoring

Here is a typical example where a programming language fails to be LL(1):

stmt       → assignment | call | other
assignment → id := exp
call       → id (exp-list)

This grammar is not in a form that can be left factored. We must first replace assignment and call by the right-hand sides of their defining productions:

statement → id := exp | id (exp-list) | other

We left factor:

statement → id stmt' | other
stmt'     → := exp | (exp-list)

See how the grammar obscures the language semantics.

Error recovery in predictive parsing

An error is detected when the terminal on top of the stack does not match the next input symbol, or M[A, a] = error.

Panic mode error recovery: skip input symbols till a "synchronizing" token appears.

Q: How to identify a synchronizing token? Some heuristics:
- Put all symbols in FOLLOW(A) in the synchronizing set for the non-terminal A.
- Use the semicolon after a Stmt production: assignmentStmt; assignmentStmt;
- What if a terminal on top of the stack cannot be matched? Pop the terminal, and issue a message that the terminal was inserted.

Q: How about error messages?

Some definitions

Recall: for a grammar G with start symbol S, any string α such that S ⇒* α is called a sentential form.
- If α ∈ Vt*, then α is called a sentence in L(G).
- Otherwise it is just a sentential form (not a sentence in L(G)).

A left-sentential form is a sentential form that occurs in the leftmost derivation of some sentence. A right-sentential form is a sentential form that occurs in the rightmost derivation of some sentence. An unambiguous grammar will have a unique leftmost/rightmost derivation.

Bottom-up parsing

Goal: Given an input string w and a grammar G, construct a parse tree by starting at the leaves and working to the root.
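The panic-mode recovery heuristic described above (take FOLLOW(A), plus the end marker, as the synchronizing set for A) can be sketched in a few lines. This is a sketch with names of my choosing; the FOLLOW sets are passed in, and the caller is assumed to pop A and continue once a synchronizing token is reached:

```python
def panic_recover(A, inp, pos, FOLLOW):
    """On an error while expanding non-terminal A, skip input tokens until
    one appears in A's synchronizing set (here FOLLOW(A) plus the end
    marker '$'). Returns the new input position; the caller then pops A."""
    sync = FOLLOW[A] | {"$"}
    while inp[pos] not in sync:
        pos += 1                      # discard the offending token
    return pos

# Example: recovering for T inside "num num + id $". The stray second num
# is skipped; parsing resumes at '+', which is in FOLLOW(T).
new_pos = panic_recover("T", ["num", "num", "+", "id", "$"], 1,
                        {"T": {"+", "-"}})
```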

Reductions vs. Derivations

