Lecture 3 - Stanford University

2y ago
11 Views
2 Downloads
514.60 KB
40 Pages
Last View : 1m ago
Last Download : 3m ago
Upload by : Evelyn Loftin
Transcription

Lexical AnalysisLecture 3Instructor: Fredrik KjolstadSlide design by Prof. Alex Aiken, with modifications1

Outline Informal sketch of lexical analysis– Identifies tokens in input string Issues in lexical analysis– Lookahead– Ambiguities Specifying lexers (aka. scanners)– By regular expressions (aka. regex)– Examples of regular expressions2

Lexical Analysis What do we want to do? Example:if (i j)Z 0;elseZ 1; The input is just a string of characters:\tif (i j)\n\t\tz 0;\n\telse\n\t\tz 1; Goal: Partition input string into substrings– Where the substrings are called tokens3

What’s a Token? A syntactic category– In English:noun, verb, adjective, – In a programming language:Identifier, Integer, Keyword, Whitespace, 4

Tokens A token class corresponds to a set of strings ExamplesInfinite set– Identifier: strings of letters orvar1iportsfoo Person digits, starting with a letter– Integer: a non-empty string of digits– Keyword: “else” or “if” or “begin” or – Whitespace: a non-empty sequence of blanks,newlines, and tabs5

What are Tokens For? Classify program substrings according to role Lexical analysis produces a stream of tokens which is input to the parser Parser relies on token distinctions– An identifier is treated differently than a keyword6

Designing a Lexical Analyzer: Step 1 Define a finite set of tokens– Tokens describe all items of interest Identifiers, integers, keywords– Choice of tokens depends on language design of parser7

Example Recall\tif (i j)\n\t\tz 0;\n\telse\n\t\tz 1; Useful tokens for this expression:Integer, Keyword, Relation, Identifier, Whitespace,(, ), , ; N.B., (, ), , ; above are tokens, not characters8

Designing a Lexical Analyzer: Step 2 Describe which strings belong to each token Recall:– Identifier: strings of letters or digits, startingwith a letter– Integer: a non-empty string of digits– Keyword: “else” or “if” or “begin” or – Whitespace: a non-empty sequence of blanks,newlines, and tabs9

Lexical Analyzer: Implementation An implementation must do two things:1. Classify each substring as a token2. Return the value or lexeme (value) of the token––The lexeme is the actual substringFrom the set of substrings that make up the token The lexer thus returns token-lexeme pairs– And potentially also line numbers, file names, etc.to improve later error messages10

Example Recall:\tif (i j)\n\t\tz 0;\n\telse\n\t\tz 1;11

Lexical Analyzer: Implementation The lexer usually discards “uninteresting”tokens that don’t contribute to parsing. Examples: Whitespace, Comments12

True Crimes of Lexical Analysis Is it as easy as it sounds? Not quite! Look at some history . . .13

Lexical Analysis in FORTRAN FORTRAN rule: Whitespace is insignificant E.g., VAR1 is the same as VAR1 A terrible design! Historical footnote: FORTRAN Whitespacerule motivated by inaccuracy of punch cardoperators14

FORTRAN Example Consider– DO 5 I 1,25– DO 5 I 1.2515

Lexical Analysis in FORTRAN (Cont.) Two important points:1. The goal is to partition the string. This isimplemented by reading left-to-right, recognizingone token at a time2. “Lookahead” may be required to decide where onetoken ends and the next token begins16

Lookahead Even our simple example has lookahead issues– i vs. if– vs. 17

Lexical Analysis in PL/I PL/I keywords are not reservedIF ELSE THEN THEN ELSE; ELSE ELSE THEN18

Lexical Analysis in PL/I (Cont.) PL/I Declarations:DECLARE (ARG1,. . ., ARGN) Can’t tell whether DECLARE is a keyword orarray reference until after the ).– Requires arbitrary lookahead! More on PL/I’s quirks later in the course . . .19

Lexical Analysis in C Unfortunately, the problems continue today C template syntax:Foo Bar C stream syntax:cin var; But there is a conflict with nested templates:Foo Bar Bazz 20

Review The goal of lexical analysis is to– Partition the input string into lexemes– Identify the token of each lexeme Left-to-right scan lookahead sometimesrequired21

Next We still need– A way to describe the lexemes of each token– A way to resolve ambiguities Is if two variables i and f? Is two equal signs ?22

Regular Languages There are several formalisms for specifyingtokens Regular languages are the most popular– Simple and useful theory– Easy to understand– Efficient implementations23

LanguagesDef. Let S be a set of characters. A languageover S is a set of strings of characters drawnfrom S24

Examples of Languages Alphabet Englishcharacters Language Englishsentences Alphabet ASCII Not every string ofEnglish characters is anEnglish sentence Note: ASCII characterset is different fromEnglish character set Language C programs25

Notation Languages are sets of strings. Need some notation for specifying which setswe want The standard notation for regular languages isregular expressions.26

Atomic Regular Expressions Single character' c ' {" c "} Epsilone {""}Not the empty set, but set witha single, empty, string.27

Compound Regular Expressions UnionA B {s s Î A or s Î B} ConcatenationAB {ab a Î A and b Î B} IterationA ! i ³0 A where A A.i times . A*ii28

Regular Expressions Def. The regular expressions over S are thesmallest set of expressions includinge'c 'where c Î åA B where A, B are rexp over åAB*A"""where A is a rexp over å29

Syntax vs. Semantics To be careful, we should distinguish syntaxand semantics.L(e ) {""}L(' c ') {" c "}L( A B ) L( A)È L( B )L( AB)*L( A ) {ab a Î L( A) and b Î L( B)}i ! i ³0 L( A )30

Segue Regular expressions are simple, almost trivial– But they are useful!We can describe tokens in regularexpressions. . .31

Example: KeywordKeyword: “else” or “if” or “begin” or ‘else’ ‘if’ ‘begin’ . . .Note: ‘else’ abbreviates‘e’’l’’s’’e’32

Example: IntegersInteger: a non-empty string of digitsdigit '0 ' '1' '2 ' '3' '4 ' '5' '6 ' '7 ' '8' '9 'integer digit digit * Abbreviation: A AA*33

Example: IdentifierIdentifier: strings of letters or digits,starting with a letterletter‘z’ ‘A’ . . . ‘Z’ ‘a’ . . . identifier letter (letter digit)*Breakout: is (letter* digit*) the same as (letter digit)*?34

Example: WhitespaceWhitespace: a non-empty sequence of blanks,newlines, and tabs( ' ' '\n' '\t') 35

Example: Phone Numbers Regular expressions are all around you! Consider (650)-723-3232å digits È {-,(,)}3exchange digitphonearea digit 4 digit 3phone number '(' area ')-' exchange '-' phone36

Example: Email Addresses Consider anyone@cs.stanford.eduå letters È {.,@}name letter address name '@' name '.' name '.' name37

Example: Unsigned Pascal Numbersdigitdigitsopt fraction '0' '1' '2' '3' '4' '5' '6' '7' '8' '9' digit ('.' digits) eopt exponent ('E' (' ' '-' e ) digits) enum digits opt fraction opt exponent38

Other Examples File names Grep tool family39

Summary Regular expressions describe many usefullanguages– We will look at non-regular languages next week Regular languages are a language specification– We still need an implementation Next time: Given a string s and a rexp R, iss Î L( R ) ?40

A terrible design! Historical footnote: FORTRAN Whitespace rule motivated by inaccuracy of punch card operators. 15 FORTRAN Example Consider –DO 5 I 1,25 –DO 5 I 1.25. 16 Lexical Analysis in FORTRAN (Cont.) Two importa

Related Documents:

Introduction of Chemical Reaction Engineering Introduction about Chemical Engineering 0:31:15 0:31:09. Lecture 14 Lecture 15 Lecture 16 Lecture 17 Lecture 18 Lecture 19 Lecture 20 Lecture 21 Lecture 22 Lecture 23 Lecture 24 Lecture 25 Lecture 26 Lecture 27 Lecture 28 Lecture

SEISMIC: A Self-Exciting Point Process Model for Predicting Tweet Popularity Qingyuan Zhao Stanford University qyzhao@stanford.edu Murat A. Erdogdu Stanford University erdogdu@stanford.edu Hera Y. He Stanford University yhe1@stanford.edu Anand Rajaraman Stanford University anand@cs.stanford.edu Jure Leskovec Stanford University jure@cs.stanford .

Computer Science Stanford University ymaniyar@stanford.edu Madhu Karra Computer Science Stanford University mkarra@stanford.edu Arvind Subramanian Computer Science Stanford University arvindvs@stanford.edu 1 Problem Description Most existing COVID-19 tests use nasal swabs and a polymerase chain reaction to detect the virus in a sample. We aim to

Domain Adversarial Training for QA Systems Stanford CS224N Default Project Mentor: Gita Krishna Danny Schwartz Brynne Hurst Grace Wang Stanford University Stanford University Stanford University deschwa2@stanford.edu brynnemh@stanford.edu gracenol@stanford.edu Abstract In this project, we exa

Stanford University Stanford, CA 94305 bowang@stanford.edu Min Liu Department of Statistics Stanford University Stanford, CA 94305 liumin@stanford.edu Abstract Sentiment analysis is an important task in natural language understanding and has a wide range of real-world applications. The typical sentiment analysis focus on

Lecture 1: A Beginner's Guide Lecture 2: Introduction to Programming Lecture 3: Introduction to C, structure of C programming Lecture 4: Elements of C Lecture 5: Variables, Statements, Expressions Lecture 6: Input-Output in C Lecture 7: Formatted Input-Output Lecture 8: Operators Lecture 9: Operators continued

Lecture 1: Introduction and Orientation. Lecture 2: Overview of Electronic Materials . Lecture 3: Free electron Fermi gas . Lecture 4: Energy bands . Lecture 5: Carrier Concentration in Semiconductors . Lecture 6: Shallow dopants and Deep -level traps . Lecture 7: Silicon Materials . Lecture 8: Oxidation. Lecture

TOEFL Listening Lecture 35 184 TOEFL Listening Lecture 36 189 TOEFL Listening Lecture 37 194 TOEFL Listening Lecture 38 199 TOEFL Listening Lecture 39 204 TOEFL Listening Lecture 40 209 TOEFL Listening Lecture 41 214 TOEFL Listening Lecture 42 219 TOEFL Listening Lecture 43 225 COPYRIGHT 2016