YAY Reference Manual

Thinkage Ltd.
85 McIntyre Drive
Kitchener, Ontario
Canada N2R 1H6
Copyright © 1996, by Thinkage

1. Introduction
    1.1  A Note to Novices
2. How YAY Works
    2.1  "yyparse" and "yylex"
    2.2  Grammar Rules
3. Input to YAY
    3.1  The Declarations Section
        3.1.1  Token Declarations
        3.1.2  Variable/Function Declarations
        3.1.3  Notes on Compiling Source Code Produced by YAY
        3.1.4  Summary
    3.2  Simple Grammar Rules
        3.2.1  Recognition Actions
        3.2.2  Token and Symbol Values
        3.2.3  The Start Symbol
        3.2.4  The End Marker
        3.2.5  Declarations in yyparse
        3.2.6  Embedded Actions
    3.3  The Programs Section
        3.3.1  The Header File
        3.3.2  The Lexical Analyzer
4. Precedence and Binding
    4.1  Precedence
    4.2  Binding
    4.3  Precedence and Binding of Operators
    4.4  Precedence for Grammar Rules
    4.5  Default Precedence for Rules
5. Internal Structures
    5.1  States
    5.2  Diagramming States
    5.3  State Actions
        5.3.1  The Accept Action
        5.3.2  The Shift Action
        5.3.3  The Reduce Action
        5.3.4  The Goto Action
        5.3.5  The Error Action
6. Resolving Ambiguities
    6.1  Selections and Rejections
        6.1.1  Embedded Selections and Rejections
        6.1.2  Alternate Forms
        6.1.3  Selections, Rejections, and Precedences
        6.1.4  Priority Classes
        6.1.5  Using Selections and Rejections
        6.1.6  A Typical Example
    6.2  Arbitration Rules
    6.3  Conflicts from Embedded Actions
    6.4  Null Rules
7. YAY Output
    7.1  Rules Summary
    7.2  Variable Descriptions
    7.3  Terminal Symbol Descriptions
    7.4  State Descriptions
        7.4.1  Renumbering States and Rules
        7.4.2  Deleted Items
    7.5  Conflict Summary
    7.6  Parser Statistics
8. Error Handling
    8.1  The "$catch" Symbol
    8.2  The "error" Symbol
    8.3  Default Actions
        8.3.1  Conflicts in Default Actions
        8.3.2  %nodefault
    8.4  The Error Action
        8.4.1  Error Recovery Mode
        8.4.2  Using yyerrflag
        8.4.3  The yyerrok Macro
    8.5  Writing Rules with Errors in Mind
    8.6  The yyclearin Macro
    8.7  YYSERROR
        8.7.1  The yyerror Function
        8.7.2  Changing yychar in yyerror
    8.8  Other Error Support Routines
        8.8.1  The YYABORT Macro
        8.8.2  The YYACCEPT Macro
        8.8.3  YYE_SHIFT and YYE_NONE
        8.8.4  The YYERROR Macro
        8.8.5  How YYERROR May Be Used
    8.9  The $error Symbol
9. Types
    9.1  The %union Statement
        9.1.1  The %type Statement
        9.1.2  Assigning Types to Tokens
        9.1.3  Explicit Typing
        9.1.4  Inferred Types
    9.2  The %struct Statement
    9.3  The Default Action
    9.4  Types in Rules with Embedded Actions
10. Trial-and-Error Parsing
    10.1  The YYPOP Macro
    10.2  YYPOP and "yylex"
11. Multi-Valued Grammar Rules
    11.1  The $select Operator
    11.2  Why Use Multi-Valued Grammar Rules
    11.3  The $test Operator
    11.4  $select with Multiple Arguments
12. YAY Macros
    12.1  Declaring Macros in %type Statements
    12.2  Defining a Macro
    12.3  Macro Calls
    12.4  Multi-Valued Macro Definitions
13. Advanced Topics
    13.1  Negative Numbers in $N Constructs
    13.2  Lists and Handling Null Strings
    13.3  %nonnull and %null
    13.4  Right vs. Left Recursion
    13.5  Debugging YAY Parsers
        13.5.1  YYDEBUG
        13.5.2  Converting Token Integers Into Strings
        13.5.3  Displaying Rules
        13.5.4  Displaying States
        13.5.5  Displaying Goto Operations
        13.5.6  Displaying Your Own Debugging Output
        13.5.7  Important Symbols For Debugging
    13.6  Dynamic Stack Allocation
    13.7  Synchronizing the Lookahead
    13.8  YYSHIFT
14. Using YAY Interactively
    14.1  General Principles
        14.1.1  Getting Help
        14.1.2  Abbreviating Commands
        14.1.3  Multiple Commands on a Line
    14.2  Examining a State
    14.3  The Jump Command
    14.4  Displaying Information about a State
        14.4.1  The DisPlay Command
    14.5  Displaying Other Information
        14.5.1  The Effect of an Input Item
        14.5.2  Displaying Grammar Rules
        14.5.3  Selective Display Commands
    14.6  Test Parsing
        14.6.1  Input Format
        14.6.2  Removing Input
        14.6.3  Simple Test Parsing
        14.6.4  Action Codes
        14.6.5  Displaying the Stack
        14.6.6  Quick Runs
        14.6.7  Simulating Errors
        14.6.8  Simulating YYPOP
        14.6.9  Backing Up
        14.6.10  The EXpand Command
        14.6.11  Generating Input
        14.6.12  The Avoid Command
        14.6.13  Resolving Conflicts
        14.6.14  Clearing the State Stack and Input Stream
    14.7  Miscellaneous Commands
        14.7.1  Setting the Width of Output Lines
        14.7.2  Setting the Prompt String
        14.7.3  Executing a System Command
Appendix A: Examples
    A.1  A Simple Example
    A.2  An Example of YYPOP
Appendix B: YAY vs. YACC
Appendix C: User-Accessible Variables and Macros
Appendix D: Producing C++ Source Code with YAY
    D.1  C++-Style Dynamic Stack Allocation
Appendix E: Producing B Source Code with YAY

1. Introduction

YAY is a tool for writing compilers and other programs that parse input according to strict "grammar" rules. For information about input parsing and the construction of grammars, see any standard text on parsing theory, e.g. Principles of Compiler Design by A.V.Aho and J.D.Ullman, Addison-Wesley, 1977, or LR Parsing by Aho and Johnson, in Computing Surveys.

YAY lets your code perform an unlimited amount of lookahead, then rewind to a previous state. This makes it possible to parse grammars for languages like C++, which requires unrestricted lookahead to handle certain code constructs.

The appearance of YAY input is based on the input to YACC (Yet Another Compiler-Compiler), written by S.C.Johnson of Bell Telephone Laboratories, Murray Hill, N.J. The LALR(1) version of YAY was written by K.W.Lalonde of the Software Development Group of the University of Waterloo. Enhancements to allow YAY to handle higher order grammars were made by P.J.Fraser and K.P.Martin of Thinkage Ltd.

Some of the algorithms used by YAY are derived from the article "Methods for Computing LALR(k) Lookahead" by B.B.Kristensen and O.L.Madsen, ACM Transactions on Programming Languages and Systems, Vol.3, No.1, January 1981, pp.60-82. Those interested in reading this article should first have a good grasp of parsing theory principles.

1.1 A Note to Novices

YAY can produce anything from a simple parser for a desk calculator program to a very complicated parser for a programming language. Those who are using YAY for complex tasks have to know all the idiosyncrasies of YAY, including a good deal about the internal workings of YAY's software. On the other hand, the internal workings are mostly irrelevant to someone who is making an easy straightforward parser.

For this reason, novices may want to skip some the sections of the manual which describe very technical aspects of YAY. This is particularly true on first reading, when it is more important to get an overview of how YAY is used than to try to comprehend every little detail. Novices may be particularly interested in Appendix A which gives examples of YAY input.

2. How YAY Works

The input to YAY describes the rules of a grammar. YAY uses these rules to produce the source code for a program that will parse the grammar. This source code can then be compiled to obtain a program that reads input, parses it according to the grammar, and takes action based on the result.

By default, the source code produced by YAY is written in the C programming language. The version of YAY written for GCOS8 can also produce source code in B or C++. The main body of this text deals with generating source code in C and C++, while Appendix E discusses changes needed when generating source code in B. Appendix D discusses additional details of using YAY to generate C++ code.

The source code produced by YAY consists of a number of data tables that represent the grammar, plus a C function named "yyparse". By default, all the symbol names used by YAY begin with "yy" or "YY". This is an historical convention, dating back to YAY's predecessor YACC. Users can avoid conflicts with YAY names by avoiding symbols that start with "yy"/"YY".

Users who want to use a different prefix may indicate this with a line of the form

          %prefix PREFIX

at the beginning of their YAY input. For example,

          %prefix ww

asks for a prefix of "ww" instead of "yy". The prefix chosen should be one or two characters long--longer prefixes lead to name conflicts on machines where external names are truncated to six characters during the loading process. In addition, at least one of the characters in the prefix should be a lower case letter (because YAY uses an all upper case version of the prefix for some special names, and this has to be different from the real prefix).

Different prefixes are useful when two YAY-produced parsers are to be merged into a single program. For the sake of convenience, however, this manual uses the "yy" convention.

2.1 "yyparse" and "yylex"

"yyparse" returns a value of 0 if the input it parses is valid according to the given grammar rules. It returns a value of 1 if the input is invalid and error recovery is impossible.

"yyparse" does not do its own lexical analysis. In other words, it does not pull the input apart into tokens ready for parsing. Instead, it calls a routine called "yylex" every time it wishes to obtain a token from the input.

"yylex" returns a value indicating the type of token that has been obtained. If the token has an actual value, this value (or some representation of the value, e.g. a pointer to a string containing the value) is returned in an external variable named "yylval".

You must write your own "yylex" routine. It must break the input into tokens and return the tokens one by one to "yyparse". Section 3.3.2 provides more details about the lexical analyzer.

2.2 Grammar Rules

The grammar rules given to YAY not only describe what inputs are valid according to the grammar but also specify what action should be taken when a given input is encountered. For example, if the parser recognizes a statement that assigns a value to a variable, the parser should either perform the assignment itself or take some action to ensure that the assignment eventually takes place.

As an example, if the parser is part of an interactive desk calculator, the parser could carry out arithmetic calculations as soon as the instructions are recognized. However, if the parser is acting as the first pass in a multi-pass compiler, it may simply encode the input in a way that will be used in a later code-generation pass.

In summary, you must provide the following when you use YAY to produce a parser:

grammar rules indicating what input is and isn't valid;
a lexical analyzer ("yylex") that breaks raw input into tokens for the parsing routine "yyparse";
any source code or functions that may be needed to perform appropriate actions once particular inputs are recognized;
a mainline routine that performs any necessary initializations, calls "yyparse", then performs possible clean-up actions. The simplest kind of mainline is just a function "main" that calls "yyparse", then returns.

3. Input to YAY

The input to YAY is broken into three sections:

declarations
grammar rules
programs

The contents of each section will be described shortly, but first we give some overall rules for YAY input.

Input Sections: Sections of YAY input are separated by the symbol %%. The general lay-out of YAY input is therefore

          declarations
          %%
          grammar rules
          %%
          programs

The declarations section may be omitted if no declarations are necessary. In this case, the input starts with the first %%. The programs section may also be omitted, from the second %% on. The simplest input for YAY is therefore

          %%
          grammar rules

White-Space: Blanks and tabs are used to separate items in YAY input. These are called white-space characters. Any place one white-space character is valid, you may add any number of additional blanks and tabs. The new-line is also a white-space character; however, certain declaration section statements (beginning with a % keyword) should be on a single line of input rather than being divided over several lines.

Comments: Comments may appear anywhere white-space is valid. As in C, comments begin with /* and end with */.

Identifiers: Identifiers used in YAY input may be as long as you want, and may consist of all letters (upper and lower case), all digits, and the characters dot (.), underscore (_), and the minus sign (-). The first character of an identifier may not be a digit. YAY distinguishes between upper and lower case letters, so "this" and "THIS" and "This" are all different symbols.

The dollar sign ($) is also valid in YAY identifiers but should be avoided, since YAY generates names of its own using the dollar sign. You should also avoid using an identifier that consists only of a dot or a minus sign.

Since YAY does not support negative numbers, a construct like -123 would be interpreted as an identifier (a minus sign followed by digits). Such confusing constructs should be avoided. They are only supported for backward compatibility with YACC.

Literals: Literals in YAY input consist of a single ASCII character enclosed in single quotes, e.g. 'c'. The standard C escape sequences are recognized:

          \b   -- backspace
          \e   -- ESC character
          \f   -- formfeed
          \n   -- new-line
          \r   -- carriage return
          \t   -- horizontal tab
          \v   -- vertical tab
          \'   -- single quote
          \"   -- double quote
          \\   -- backslash
          \ooo -- any ASCII character
                  ("ooo" is octal representation)
          \Xnnnnn... -- string of characters
                  ("nn..." is a sequence of hexadecimal
                   digits)

There is no limit to the number of hexadecimal digits that may be specified in a hex escape sequence. For example, '\X01ab' is a single-character token with value Ox1ab. Octal and hexadecimal representations of a value must have the same "spelling" to be recognized as the same token. For example, '\1' and '\01' are considered to be different tokens. YAY will then issue an error message that these different tokens have the same value.

Literals may be enclosed in double quotes instead of single quotes. However, you can't enclose the same literal in single quotes sometimes and double quotes other times. For example, if your input contains both "+" and '+', YAY issues an error message that two different tokens have the same value.

YAY allows multi-character literals, as in "abc" or 'abc'. YAY will associate a numeric value with each such token; however, there is no way to tell "yylex" what value has been associated with these tokens, so there is no good way for "yylex" to report that it has found one of these tokens.

3.1 The Declarations Section

The declarations section describes many of the identifiers that will be used in the rest of the YAY input. There are two types of declarations:

Token declarations;
Declarations of functions and variables used in the actions that the parser takes when a particular input is recognized.

The declarations section may also specify the precedence and binding of tokens used in the grammar. In an expression parser, for example, the standard order of arithmetic operations would normally be set up in the declarations section.

3.1.1 Token Declarations

All literals are automatically recognized as tokens. Single-character literals have an associated value equal to the ASCII value of the character. For example, 'a' stands for a token whose associated value is octal 141 (ASCII a).

Other tokens may be declared with statements of the form

          %token name1 name2 name3 ...

This tells YAY that the given names will refer to tokens. For example,

          %token INTNUM

defines the identifier INTNUM as a particular type of token returned by the lexical analyzer "yylex". If INTNUM stands for any integer number token, you might have the following code in "yylex".

          c = getchar();
          if ( isdigit(c) ) {
              yylval = 0;
              do {
                  yylval = (yylval * 10) + (c - '0');
                  c = getchar();
              } while ( isdigit(c) );
              ungetc(c, stdin);
              return( INTNUM );
          }

"yylex" returns INTNUM to indicate that a certain kind of token (an integer number) has been returned. The actual value of this number is returned in the variable named "yylval". Grammar rules later in the YAY input will dictate where an INTNUM token is valid.

In the C source code produced by YAY, the identifiers named in a %token statement appear as constants set up with #define. Token values are sufficiently large to avoid conflicts with the standard ASCII characters (which may be referenced as 'x'-style literals).

Because token identifiers are set up as manifest constants, they must not conflict with reserved words in the programming language or other identifiers used by the parser. For example,

          %token if yyparse ...

will almost certainly lead to errors when you try to compile the source code output of YAY. Token names must also be valid in the programming language used for "yyparse". For example,

          %token IF-STMT

defines a token name that is valid in YAY but not valid in C or C++. Therefore, when YAY generates a #define statement to assign a value to IF-STMT, the resulting statement will not compile.

By convention, this manual uses upper case letters for token names; most YAY programmers follow the same practice.

Note: All named tokens must be declared either in a %token statement or in a %left, %right or %nonassoc statement (described in Section 4.3).

3.1.2 Variable/Function Declarations

The declarations section may contain standard C declarations for variables and/or functions used in the actions specified in the grammar rules section. All such declarations should be included in a block that begins with %{ and ends with %}. For example,

          %{
              int i, j, k;
              static float x = 1.0;
          %}

gives a few variable declarations. These declarations are essentially transferred "as is" to the beginning of the source code that is produced by YAY. This means that they are external to "yyparse" and therefore extern definitions.

You can use %{ and %} to enclose any code that should be directly passed into the source code that YAY produces. For example, you can use them to enclose #include and #define statements that your code might need.

3.1.3 Notes on Compiling Source Code Produced by YAY

By default, C source code produced by YAY contains the line

          #define YYSCTAB const

It then uses the YYSCTAB manifest to declare various data objects as const. If you will be compiling the YAY output with an old C compiler that doesn't recognize the const qualifier, you should place the definition

          #define YYSCTAB

in the declarations section of your YAY input.

3.1.4 Summary

The source code produced by YAY contains the following.

code derived from input in the declarations section
code explicitly given in the programs section
parsing tables produced by YAY to represent the grammar
the "yyparse" routine

3.2 Simple Grammar Rules

A simple YAY grammar rule has the general form

          identifier : definition ;

A colon separates the definition from the identifier being defined. A semicolon marks the end of the definition.

The identifiers defined in the grammar rule section are known as non-terminal symbols. "Non-terminal" suggests that these symbols aren't the "end of the line". Instead, they are made up of smaller things: tokens or other non-terminal symbols.

Here is a simple example of the definition of a non-terminal symbol.

          paren_expr : '(' expr ')' ;

This says that a "paren_expr" consists of a left parenthesis, followed by an "expr", followed by a right parenthesis. The "expr" is either a token or a non-terminal symbol defined in another grammar rule. This grammar rule could be interpreted to say that a parenthesized expression consists of a normal expression inside parentheses.

A non-terminal symbol may have more than one definition. For example, we might define "if" statements with

          if_stat : IF '(' expr ')' stat ;
          if_stat : IF '(' expr ')' stat ELSE stat ;

This definition assumes that IF and ELSE are tokens recognized by the lexical analyzer (which means that this parser's "yylex" can recognize keywords). The definition also assumes that "expr" and "stat" are non- terminal symbols defined elsewhere.

When a single symbol has more than one meaning, YAY lets you join the various possibilities into a single definition. Different meanings are separated by or-bars (|). Thus you could write

          if_stat : IF '(' expr ')' stat
                  | IF '(' expr ')' stat ELSE stat ;

This technique is highly recommended, since it makes YAY input more readable.

Definitions in a grammar may be recursive. For example,

          list : item
               | list ',' item;

defines "list" to be one or more items separated by commas.

          intexp : '(' intexp ')'
                 | intexp '+' intexp
                 | intexp '-' intexp
                 | intexp '*' intexp
                 | intexp '/' intexp
                 | INTNUM ;

says that an integer expression can be another integer expression in parentheses, the sum of integer expressions, the difference of integer expressions, the product of integer expressions, the quotient of integer expressions, or an integer number standing on its own (where INTNUM is a token recognized by the lexical analyzer).

In recursive symbol definitions, it is often useful to have the empty string as one of the possible definitions. For example,

          program :
                /* the empty string */
                | statement ';' program ;

defines a program as zero or more statements which end in semicolons.

The definition

          list : item
               | list ',' item;

is an example of left recursion because "list" is on the left in the recursive definition. The definition

          program :
                /* the empty string */
                | statement ';' program ;

is an example of right recursion. As discussed in Section 13.4, left recursion is usually better than right recursion and other recursive constructs such as

          expr : '(' expr ')' ;

However, the requirements of your grammar may not give you a choice of which type of recursion to use.

3.2.1 Recognition Actions

In addition to defining what a non-terminal symbol is, a grammar rule usually describes what to do if the non-terminal symbol is encountered in parser input. We call this a recognition action.

Recognition actions are specified as part of the grammar rule. They are enclosed in brace brackets in the definition, as in

          symbol : definition { action } ;

For example, consider the following:

          break_stat : BREAK ';'  { breakfn(); } ;

In this definition, "break_stat" is a non-terminal symbol made up of the token known as BREAK, followed by a semicolon. If this symbol is recognized, the parser invokes a function named "breakfn". Presumably this is a user-defined function that handles a "break;" statement.

A semicolon is needed to mark the end of the definition even though the recognition action ends in a brace bracket. Programmers who are used to the way C works should bear this in mind.

For compatibility with some versions of YACC, YAY lets you put an equals sign (=) before the opening brace that begins a recognition action, as in

          break_stat : BREAK ';' = { breakfn(); } ;
                     | /* other definitions */
                     ;

With this syntax, a semicolon immediately following the brace brackets is considered to be part of the recognition action. Therefore, another semicolon is required to end the rule, as shown above.

When a symbol has more than one definition, a different recognition action may be associated with each definition, as in

          symbol : definition1 { action1 }
                 | definition2 { action2 }
                 | definition3 { action3 }
                 ;

A semicolon is required to end the list of definitions. As shown above, many programmers like to align the semicolon with the or-bars for preceding definitions, to make the semicolon more visible.

The code inside a recognition action must be valid C or C++ code; it is incorporated "as is" in the final version of "yyparse". For example, if you are generating C++ code, recognition actions may include C++-style comments (beginning with // and extending to the end of the line). Such comments are valid in the C++ code for recognition actions, but not in normal YAY input (outside recognition actions).

3.2.2 Token and Symbol Values

One of the most common recognition actions is to return a value. For example, if an input is recognized as an expression to be evaluated, the parser may want to return the resulting value of the expression. To return a value, the recognition action merely assigns the value to a special variable named $$. For example,

          hexdigit : '0' { $$ = 0; }
                   | '1' { $$ = 1; }
                           ...
                   | 'A' { $$ = 10; }
                   | 'B' { $$ = 11; }
                   | 'C' { $$ = 12; }
                   | 'D' { $$ = 13; }
                   | 'E' { $$ = 14; }
                   | 'F' { $$ = 15; } ;

could be a way to convert hexadecimal digits into numeric values. In this case, "yylex" just returns the digits it finds and "yyparse" performs the actual conversion.

Another common recognition action is to return a value based on one or more of the items that make up the non-terminal symbol. Inside the recognition action, $1 stands for the value of the first item in the symbol, $2 stands for the value of the second item, and so on. If the item is a token, its value is the "yylval" value returned by "yylex" when the token was read. If the item is non-terminal symbol, its value is the $$ value set by the recognition action associated with the symbol. Thus we might write

          intexp : '(' intexp ')'
                        {
                            $$ = $2;
                        }
                 | intexp '+' intexp
                        {
                            $$ = $1 + $3 ;
                        }
                 | intexp '-' intexp
                        {
                            $$ = $1 - $3 ;
                        }
                 |  /* and so on */
                 ;

Before performing a recognition action, "yyparse" performs the assignment

          { $$ = $1 ; }

This means that if there is no recognition action or if the recognition action does not assign a value to $$, the value of the non-terminal symbol ($$) is equal to the value of its first component ($1).

3.2.3 The Start Symbol

The first non-terminal symbol defined in the rules section is called the start symbol. This symbol is taken to be the largest, most general structure described by the grammar rules. For example, if you are generating the parser for a compiler, the start symbol should describe what a complete program looks like in the language to be parsed.

If you do not want the first grammar rule to be taken as the start symbol, you can use the directive

          %start NAME

in your rules section. This indicates that the non-terminal symbol named NAME is the start symbol. NAME must be defined somewhere in the rules section.

The start symbol must be all-encompassing: every other rule in the grammar must be related to the start symbol. In a sense, the grammar forms a "tree": the root is the start symbol, the first set of branches are the symbols that make up the start symbol, the next set of branches are the symbols that make up the first set, and so on. Any symbol that is "outside" this tree is reported as a unused variable in YAY output. Such variables are ignored by the parser--the parser is looking for a complete start symbol, and nothing else.

3.2.4 The End Marker

The end of parser input is marked by a special token called the end marker. This token is written as $end; the value of the token is zero.

It is the job of the lexical analyzer "yylex" to return a zero when the end of input is reached (for example, when "yylex" encounters end of file or a keyword indicating the logical end of the input).

"yyparse" terminates when it has parsed a start symbol followed by the end marker.

3.2.5 Declarations in yyparse

If "yyparse" needs any local declarations, you specify them in much the same way that you specify external declarations in the declarations section. Enclose the declarations in %{ and %} symbols, and place them before the first rule, as in

          %{
              /* External declarations */
          %}

          %%
          /* Rules Section starts here */
          {%
              /* Declarations here are local to "yyparse" */
          %}
          /* Rules */

          %%
          /* Program section */

YAY puts brace brackets around each recognition action in the final code that is generated. Therefore recognition actions may start with declarations local to that action, as in

          a : B  {
              int i;
              for (i=0; i<100, i++)
                  /* and so on */
          } ;

3.2.6 Embedded Actions

A rule may have several actions embedded in it. For example, you might have

          a : A1 {action1} A2 {action2} A3 {action3};

The non-terminal symbol "a" consists of symbols A1, A2, and A3. When A1 is recognized, "action1" is invoked; when A2 is recognized, "action2" is invoked; and when A3 is recognized (and therefore the entire symbol "a"), "action3" is invoked. In this case,

          $1   -- is the value of A1
          $2   -- is the value of $$ in action1
          $3   -- is the value of A2
          $4   -- is the value of $$ in action2
          $5   -- is the value of A3

To deal with embedded actions, YAY changes the form of the given grammar rule and creates grammar rules for dummy symbols. The dummy symbols have names made up of a $ followed by the rule number. For example,

          a : A1 {action1} A2 {action2} A3 {action3};

might be changed to the rules

          $21 : /* null */ {action1} ;
          $22 : /* null */ {action2} ;
          a : A1 $21 A2 $22 A3 {action3};

See Section 6.3 for a discussion of side-effects that may arise from this process.

3.3 The Programs Section

The programs section of YAY input may contain functions that should be linked in with the "yyparse" routine. YAY itself does nothing with these functions--it simply adds the source code on the end of the source code produced from the grammar rules. In this way, the functions can be compiled at the same time that the YAY-produced code is compiled.

3.3.1 The Header File

Additional functions may also be compiled separately and linked with the YAY-produced code later on, once everything is in object code format. Separate compilation of modules is strongly recommended for large parsers. However, functions that are compiled separately need a special mechanism if they want to use manifests (such as the token value manifests) that are defined in the YAY-produced code.

For example, consider the case of "yylex". Every time "yylex" obtains a token from the input, it returns to "yyparse" with a value that indicates the type of token found. Obviously, "yylex" and "yyparse" must agree on which return values indicate which kind of tokens. Since "yyparse" refers to tokens using manifest constants (created in the declarations section with %token directives), "yylex" should use the same manifest constants.

If you are going to compile "yylex" or other routines separately from "yyparse", use the

          Header=filename

option on the YAY command line. YAY writes the manifest definitions to the specified file. This is called the header file associated with the grammar.

A header file can be #included to obtain manifest definitions for "yylex" or any other routine that needs them. You only need to create a header file if you are compiling some modules separately.

3.3.2 The Lexical Analyzer

The lexical analyzer "yylex" reads input and breaks it into tokens. It is "yylex" that determines what constitutes a token. For example, some lexical analyzers may return numbers one digit at a time while others collect numbers in their entirety before passing them to the parser.

Similarly, some lexical analyzers may recognize keywords such as IF or WHILE and will tell the parser that an IF token or WHILE token has been found. Others may not be designed to recognize keywords, so it is up to the parser itself to distinguish between keywords and other things like variable names.

As noted before, each token named in the declarations section of the YAY input is set up as a manifest constant. The value of the first token named is 257, the value of the next is 258, and so on. You can also set your own values for tokens by placing a positive integer after the first appearance of any token in the declarations section. For example,

          %token AA 56

assigns a value of 56 to the manifest (token) symbol AA. This mechanism is seldom needed, and you should avoid it whenever possible.

There is little else to say about requirements for "yylex". If it wishes to return the value of a token as well as an indication of its type, the value is assigned to the external variable "yylval". By default, "yylval" is defined as an int value but it can also be used to hold other types of values. For more information about types for "yylval", see the description of %union and %struct in Chapter 9.

4. Precedence and Binding

Parsers that evaluate expressions usually have to establish the order in which various operations are carried out. For example, parsers for arithmetic expressions typically carry out multiplications before additions. Two factors affect order of operation: precedence and binding.

4.1 Precedence

Precedence dictates which of two different operations should be carried out first. For example, in

          A + B * C

the standard arithmetic rules of precedence dictate that the multiplication should take place before the addition. Operations that should be carried out first are said to have a higher precedence than operations that should be performed later.

Different operators can sometimes have the same precedence. In C, for example, addition and subtraction are similar enough to share the same precedence.

4.2 Binding

Binding indicates which of two similar operations should be carried out first. By "similar", we mean operations with the same precedence (e.g. addition and subtraction in C). For example, C chooses to parse

          A - B - C

          (A - B) - C

while a language like APL would use

          A - (B - C)

If you evaluate the first operation before the second (as C does), the operation is left- associative. If you evaluate the second operation before the first (as APL does), the operation is right-associative.

Occasionally, a compiler may have operations which are not associative. For example, Fortran regards

          A .GT. B .GT. C

as invalid. In this case, the operation is non-associative.

4.3 Precedence and Binding of Operator Tokens

The precedence and associativity of operator tokens may be declared in the declarations section using the keywords

          %left
          %right
          %nonassoc

For example,

          %left '+' '-'

indicates that the + and - operations have the same precedence and are left- associative.

Associativity declarations should be given in order of precedence. Operations with lowest precedence are listed first, and those with highest precedence are listed last. Operations with equal precedence are listed on the same line. For example,

          %right '='
          %left  '+' '-'
          %left  '*' '/' '%'

says that = has a lower precedence than + and -, which in turn have a lower precedence than *, /, and %. = is also right- associative, so that

          A = B = C

is parsed as

          A = (B = C)

Because of the way YAY specifies precedence and associativity, operators with equal precedence always have the same associativity. For example, if A and B have equal precedence, their precedence must have been set with one of

          %left A B
          %right A B
          %nonassoc A B

which means A and B must have the same associativity.

The names supplied with %right, %left, and %nonassoc may be literals or YAY identifiers. If they are identifiers, they are regarded as token names. YAY generates a %token directive for such names if they have not already been declared. For example, in

          %left '+' '-'
          %left '*' '/'
          %left UMINUS

UMINUS is taken to be a token identifier. There is no need to define UMINUS as a token identifier--a %token directive is generated automatically if necessary. It is perfectly valid to have an explicit

          %token UMINUS

if you want. However, it must precede the %left declaration.

Note: It is possible to write grammar rules that enforce precedence without using %left, %right and %nonassoc. For example, you might define

          assign-expr : var '=' assign_expr
                      | add_expr ;
          add_expr    : add_expr '+' mult-expr                      | add_expr '-' mult-expr
                      | mult_expr ;
          mult_expr   : mult_expr '*' var
                      | mult_expr '/' var
                      | var ;

and so on. However, using %left, %right and %nonassoc makes for simpler grammars that are easier to understand.

4.4 Precedence for Grammar Rules

%left, %right, and %nonassoc assign precedences to tokens. Precedence can also be assigned to grammar rules, and this is done in the grammar input section.

One way to give a grammar rule a precedence uses the %prec construct.

          %prec TOKEN

in a grammar rule indicates that the rule should have the same precedence as the specified token.

As a simple example, consider the case of the unary minus operator. Suppose our declaration section contains

          %left '+' '-'
          %left '*' '/'
          %left UMINUS

In the grammar rules section, you can write

          exp : exp '+' exp
              | exp '-' exp
              | exp '*' exp
              | exp '/' exp
              | '-' exp %prec UMINUS
              | /* and so on */
              ;

You could not directly set up a precedence for the unary minus, since you had already set up a precedence for the - token. Instead, the YAY input defines a token named UMINUS and gives it the precedence you want to assign the unary minus. The grammar rule for the unary minus uses

          %prec UMINUS

to show that this rule should have the precedence of UMINUS.

As another example, you might set up precedence rules for right shift and left shift operations with

          %left RS LS
              ...
          exp :
              | exp '<' '<' exp %prec LS
              | exp '>' '>' exp %prec RS
              ...

This gives the shift operations the proper precedence and avoids confusing them with the comparison operations > and <. Depending on your lexical analyzer, however, the above rules may allow white space between the two > or < characters. Therfore, you typically make your lexical analyzer clever enough to recognize >> and << and to return the RS or LS tokens directly.

Although symbols like UMINUS, LS, and RS are treated as tokens, they do not have to correspond to actual input. They may just be placeholders for operator tokens that have two different meanings.

4.5 Default Precedence for Rules

If a grammar rule is not assigned a precedence using %prec, the precedence of the rule is taken from the last token in the rule. For example, if the rule is

          expr : expr '+' expr

the last token in the rule is + (since "expr" is a non-terminal symbol, not a token). Thus the precedence of the rule is the same as the precedence of +.

If the last token in a rule has no assigned precedence, the rule does not have a precedence. This can result in some surprises if you aren't careful. For example, if you define

          expr : expr '+' expr ';'

the last token in the rule is ; so the rule probably won't have a precedence even if + does.

Note: Precedence can also be controlled by selection and rejection constructs, as described in Section 6.1. If a rule contains a selection or rejection construct (or both), YAY does not assign the rule a default precedence. If you want the rule to have a precedence, you must explicitly assign one using %prec.

5. Internal Structures

In order to use YAY effectively, it is helpful to understand some of the internal workings of the parser that YAY produces. This chapter looks at some of these structures.

Note: All examples in this chapter use a parser produced by the grammar

          %token NUM
          %left '+' '-'
          %left '*' '/'
          %%
          expr : NUM
               | expr '+' expr
               | expr '-' expr
               | expr '*' expr
               | expr '/' expr
               | '(' expr ')'
               ;

5.1 States

As the parser reads in token after token, it switches between various states. You can think of a state as point where the parser says, "I have read this particular sequence of input tokens and now I am looking for one of these tokens."

For example, a parser for the C language might be in a state where it has finished reading a complete statement and is ready for the start of a new statement. It therefore expects some token that can legitimately start a statement (e.g. a keyword like IF or WHILE, or the name of a variable for an assignment).

In this state, it reads a token. Say it finds the token corresponding to the keyword IF. It then switches to a new state where it says "I have seen an IF and now I want to see the ( that begins the IF condition." When it finds the (, it switches again to a state that says "I have found IF( and now I want the start of a condition expression."

States break the parsing process into simple steps. At each step, the parser knows what it has seen and what it is looking for next.

YAY assigns numbers to every possible state the parser can enter. The 0th state is always the one that describes the parser's condition before it has read any input. Other states are numbered arbitrarily.

Sometimes a particular input can only be the start of one construct. For example, the FOR keyword in C can only be the start of a FOR statement, and the FOR statement only has one form.

On the other hand, a grammar may have several non-terminal symbols that start the same way. In our sample grammar, all of

          expr '+' expr
          expr '-' expr
          expr '*' expr
          expr '/' expr

start with an "expr". If the parser finds that the input begins with an "expr", the parser has no idea which rule matches the input until it has read the operator following the first "expr".

The parser chooses which state to enter next by looking at the next input token. This token is called the lookahead symbol for that state.

5.2 Diagramming States

YAY uses simple diagrams to describe the various states of the parser. These diagrams show what the parser has seen and what it is looking for next. The diagrams are given in the Parser Description report produced by YAY. (Chapter 7 discusses the Parser Description report in full detail.)

Note: The examples in the rest of this section show what you see in a default Parser Description report. If you specify the +Verbose option when you invoke YAY, the Parser Description report will contain more information than what is shown here.

For example, consider the state where the parser has just read a complete "expr" at the beginning of a larger expression. It is now in a state where it expects to see one of the operators +, -, *, or /, or perhaps the $end marker indicating the end of input. YAY diagrams this state as

          $accept:  expr.$end
          expr:     expr.'+' expr
          expr:     expr.'-' expr
          expr:     expr.'*' expr
          expr:     expr.'/' expr

This lists the possible grammar constructs that the parser may be working on. (The first line $accept stands for the Start symbol.) The "." indicates how much the parser has read so far.

If the lookahead symbol is *, the parser switches to a state diagrammed by

          expr:     expr '*'.expr

In this state, the parser knows that the next thing to come is another "expr". This means that the only valid tokens that can be read next are ( or NUM, since those are the only things that start a valid "expr".

5.3 State Actions

There are several possible actions that the parser can take in a state:

accept the input;
shift to a new state;
reduce zero more input tokens to a single non-terminal symbol, according to a grammar rule;
goto a new state;
raise an error condition.

In order to decide which action to take, the parser checks the lookahead symbol (except in states where the parser can only take one possible action so the lookahead symbol is irrelevant).

This means that a typical state has a series of possible actions based upon the possible values of the lookahead symbol. In YAY output, you might see

          '+'   shift 8
          '-'   shift 7
          '*'   shift 6
          '/'   shift 5
          ')'   shift 9
           .    error

This says that if the parser is in this state and the lookahead symbol is +, the parser shifts to State 8. If the lookahead symbol is -, the parser shifts to State 7, and so on.

The "." in the final line stands for any other token not mentioned in the preceding list. If the parser finds any unexpected tokens in this particular state, it takes the error action.

The sections that follow explain precisely what each state action means and what the parser does to handle these actions.

5.3.1 The Accept Action

The Accept action only happens when the parser is in a state that indicates it has seen a complete input and the lookahead symbol is the end marker $end. When the parser takes the Accept action, "yyparse" terminates and returns a zero to indicate that the input was correct.

5.3.2 The Shift Action

The Shift action happens when the parser is part way through a grammar construct and a new token is read in. As an example, State 4 in our sample parser is diagrammed with

          expr:     expr.'+' expr
          expr:     expr.'-' expr
          expr:     expr.'*' expr
          expr:     expr.'/' expr
          expr:     '(' expr.')'

          '+'   shift 8
          '-'   shift 7
          '*'   shift 6          '/'   shift 5
          ')'   shift 9
           .    error

This shows that the parser will shift to various other states depending on the value of the lookahead symbol. For example, if the lookahead symbol is * the parser shifts to State 6 which has the diagram

          expr:     expr '*'.expr

          NUM   shift 2
          '('   shift 1
           .    error

          expr  goto 11

In this new state, the parser has further shifts it can make, depending on the next lookahead symbol.

When the parser shifts to a new state, it saves the previous state on a stack called the state stack. The stack provides a history of the states that the parser has passed through while it was reading input. It is also a control mechanism as discussed later in this chapter.

Paralleling the state stack is a value stack that records the values of tokens and non- terminal symbols encountered while parsing. The value of a token is the "yylval" value returned by "yylex" at the time the token was read. The value of a non-terminal symbol is the $$ value set by the recognition action associated with that symbol's definition. If the definition didn't have an associated recognition action, the value of the symbol is the value of the first item in the symbol's definition.

At the same time that the Shift action pushes the current state onto the state stack, it also pushes the "yylval" value of the lookahead symbol (token) onto the value stack.

5.3.3 The Reduce Action

The Reduce action takes place in states where the parser has recognized all the items that make up a non-terminal symbol. For example, the diagram of State 9 in our sample grammar is

          expr:     '(' expr ')'.

           .    reduce (7)

At this point, the parser has seen all three components that make up the non-terminal symbol "expr":

          '('
          expr
          ')'

As the line

           .    reduce (7)

shows, it doesn't matter what the lookahead symbol is at this point. The non-terminal symbol has been recognized, and the parser is ready for a Reduce action. (The (7) in the above line means that the parser has recognized the non-terminal symbol defined in rule (7) of the grammar.)

The Reduce action performs a number of operations. First, it pops states off the state stack. If the recognized non-terminal symbol had N components, a reduction pops N states off the stack. In other words, the parser goes back to the state it was in when it first began to gather the recognized construct.

Next, the Reduce action pops values off the value stack. If the definition that is being reduced consisted of N items, the Reduce action conceptually pops N values off the stack. The topmost value on the stack is assigned to $N, the next to $N-1, and so on down to $1.

Once the Reduce action has gathered all the $X values, the parser invokes the recognition action that was associated with the grammar rule being reduced. The recognition action uses the $X values to come up with a $$ value for the non-terminal symbol.

If the non-terminal symbol had no recognition action, or if the recognition action did not set $$, the parser sets $$ to the value of $1.

Lastly, the Reduce action sets things up so that the lookahead token seems to be the non- terminal symbol that was just recognized. For example, it may say that the lookahead token is now an "expr" instead of a token. The real lookahead token is hidden "underneath" the non-terminal symbol.

5.3.4 The Goto Action

The Goto action is a continuation of the Reduce process. Goto is almost the same as Shift--the only difference is that the Goto action takes place when the lookahead symbol is a non-terminal symbol while a Shift takes place when the lookahead symbol is a token.

For example, State 6 in our sample grammar reads

          expr:    expr '*'.expr

          NUM   shift 2
          '('   shift 1
           .    error

          expr  goto 11

The first time the parser enters this state, the lookahead symbol is a token and the parser Shifts on the token into some state where it begins to gather an "expr". When it has a complete "expr", it performs a Reduce action that returns to this state and sets the lookahead symbol to "expr". Now when the parser has to decide what to do next, it sees that it has an "expr" for the lookahead symbol and therefore takes the Goto action and moves to State 11.

Like the Shift action, Goto pushes the current state onto the state stack. Goto also pushes a value onto the value stack; this value is the $$ value for the non-terminal symbol. The $$ value was calculated by the recognition action executed during the preceding Reduce action.

Finally, the Goto action removes the non-terminal symbol from the input stream. This reveals the lookahead token (if any) that the Reduce action hid beneath the non-terminal symbol.

Essentially then, a Goto is like a Shift except that it takes place when you come back to a state after a Reduce action. Also, a Shift is based on the value of a single input token while a Goto is based on a non-terminal symbol.

5.3.5 The Error Action

The parser takes the Error action when it encounters any input token that cannot validly appear in a particular input location. When this happens, the parser attempts to find an error handler to deal with the problem. Error handling is discussed in Chapter 8.

6. Resolving Ambiguities

Suppose we have a grammar with the rules

          expr : expr '-' expr ;
          expr : expl '*' expr ;

and the parser is reading an expression of the form

          expr - expr - expr

The parser reads this token by token, of course, so after three tokens it has

          expr - expr

The parser recognizes this form. In fact, the parser could reduce this right away into a single "expr" according to the given grammar rule.

However, the parser has a problem. At this point, the parser doesn't know what comes next, and perhaps the entire line will be something like

          expr - expr * expr

If it is, normal rules for arithmetic precedence say that the multiplication should be performed before the subtraction, so handling the subtraction first is incorrect. The parser must therefore read another token to see if it is really all right to deal with the subtraction now, or if the correct action is to skip the subtraction for the moment and deal with whatever follows the second "expr".

In terms of parser states, this problem boils down to a choice between reducing the expression

          expr - expr

or shifting and acquiring more input before making a reduction. This is known as a shift/reduce conflict.

Sometimes a parser must also choose between two possible reductions. This kind of situation is called a reduce/reduce conflict.

In case you are curious, there is no such thing as a shift/shift conflict. To see why this is impossible, suppose that we have the following definitions.

          thing : a b
                | a c
                ;
          b : T rest_of_b;
          c : T rest_of_c;

If the parser is in the state where it has seen "a", you have the diagram

          thing : a.b
          thing : a.c

You might think that if the lookahead symbol was the token T, the parser would be confused, since T is the first token of both "b" and "c". However, there is no confusion at all. The parser just Shifts to a state that we could diagram as

          b : T.rest_of_b
          c : T.rest_of_c

6.1 Selections and Rejections

A selection consists of zero or more tokens enclosed in square brackets. The following input shows a simple example of how a selection can resolve conflicts between two rules. The following two rules will typically produce a Reduce/Reduce conflict:

          a1 : a b
               { /* Action 1 */ } ;
          a2 : a b
               { /* Action 2 */ } ;

However, the addition of a selection changes the situation:

          a1 : a b ['+' '-']
               { /* Action 1 */ } ;
          a2 : a b
               { /* Action 2 */ } ;

The first rule says that if the token following "b" is one of the tokens inside the square brackets, the parser should prefer to use the grammar rule for "a1". If the token following "b" is not one of the given tokens, the parser should prefer to use the rule for "a2". In this way, the conflict between the two rules is resolved--the selection tells which one to select, depending on the value of the lookahead token.

Notice that a selection states that a rule should be used when the next token is one of the ones listed in the brackets and should not be used if it is not in the brackets.

A rejection is the opposite of a selection. It consists of a minus sign followed by a selection, as in

          -[token1 token2 ... ]

For example, consider the following rule:

          expr: expr '+' expr -['*' '/']

The rejection states that the parser should prefer not to reduce by this rule if the lookahead token is * or /. If the input is

          A + B + C

the parser can reduce A+B (because the lookahead token is +). However, if the input is

          A + B * C

the parser should not reduce A+B because the lookahead token is *. The parser will continue gathering input, presumably reducing B*C into an expr, and then performing the addition.

With both selections and rejections, the lookahead token is merely used to determine which rule to select. It is not part of the rule itself. For example, suppose you have

          a1 : a b ['+' '-'] ;
          a2 : a b ;
          xx : a1 op expr ;

and suppose you have an "a", a "b", and + as the lookahead token. The + indicates that the "a" and "b" should be reduced to "a1". The parser does this and finds that the "a1" is part of the "xx" rule. The + lookahead token is then associated with the "op" symbol in the "xx" rule. In other words, the selection does not "use up" an input token; it just looks at the token value to help resolve a conflict.

You can use the $end symbol in a selection or rejection to indicate end-of-file in the input being parsed. For example,

          statement : line ['\n' $end]

states that this rule is preferred if "line" is followed by a new-line character or the end of the file.

6.1.1 Embedded Selections and Rejections

Selections and rejections can be put in the middle of rules as well as on the end. For example, you could write

          expr : expr ['+' '-'] op expr
                 { /* Action 1 */ }
               | expr op expr
                 { /* Action 2 */ } ;

This states that if the first "expr" is followed by + or - you want to use the first rule; otherwise, you want to use the second. The selection does not use up the + or - token: you still need a symbol ("op") to represent such tokens.

Selections and rejections that appear in the middle of a rule are implemented in the same way as multiple actions, using dummy rules. The above example would result in something like

          $23 : ['+' '-'] ;
          expr : expr $23 op expr
                 { /* Action 1 */ }
               | expr op expr
                 { /* Action 2 */ } ;

(where the 23 in $23 is just a number we chose at random). The dummy rule that is created is a null string with the selection on the end. The first token for "op" will be the + or - that was the lookahead token in rule $23.

If a selection or rejection in the middle of a rule is immediately followed by an action, only one dummy rule is created to handle both the action and the selection/rejection.

In most cases, a selection or rejection counts as a $N symbol, but with no associated value. For example, in

          expr : expr ['+' '-'] op expr

you have

          $1 -- first "expr"
          $2 -- no value
          $3 -- "op"
          $4 -- second "expr"

If a selection or rejection is followed by an action, the selection/rejection and the action count as a single $N symbol whose value is equal to the $$ value of the action. For example, in

          expr : expr ['+' '-'] {action} op expr

you have

          $1 -- first "expr"
          $2 -- $$ of action
          $3 -- op          $4 -- second "expr"

6.1.2 Alternate Forms

Selections may also be specified in the form

          [^ token1 token2 ... ]

This selection stands for all tokens except the ones listed in the square brackets. For example,

          [^ '*' '/' ]

is a selection standing for any token except '*' and '/'.

Similarly, rejections may have the form

          -[^ token1 token2 ...]

This rejection rejects all tokens except the ones listed in the square brackets.

6.1.3 Selections, Rejections, and Precedences

Selections and rejections can be used to represent operator precedence. For example, consider the following

          expr: expr '+' expr  ['+' '-']  -['*' '/']
          expr: expr '-' expr  ['+' '-']  -['*' '/']
          expr: expr '*' expr  ['+' '-' '*' '/'] -[]
          expr: expr '/' expr  ['+' '-' '*' '/'] -[]

The first rule states that you may reduce an addition operation if it is followed by another addition or a subtraction, but not if it is followed by a multiplication or a division. The second rule states the same thing about subtraction. The third rule states that you can reduce a multiplication if it is followed by any arithmetic operator, but not if it is followed by some other token. The fourth rule states the same thing about division.

Operator precedence and rule precedence can always be represented by appropriate selections and rejections. Therefore, YAY implements these precedences by generating selections and rejections, and adding them to rules which have precedences.

As demonstrated by the preceding example, a %left precedence is implemented with a selection that lets you reduce the expression if it is followed by an operator of lower or equal precedence and with a rejection that prevents you from reducing the expression if it is followed by an operator of higher precedence. %right and %nonassoc precedences are implemented in comparable manners.

You may combine selections, rejections, and explicit precedences (using %prec) in the same rule, provided they do not conflict. For example, a rule that has a precedence may contain a selection or rejection based on a token that has no precedence (e.g. a semicolon ';' used to separate one statement from another). However, a rule that contains a precedence may not contain a selection or rejection based on tokens that have precedences of their own. In general, you can explicitly select any token that is not rejected by precedence, and explicitly reject any token that is not selected by precedence.

6.1.4 Priority Classes

Reduce and Shift operations can be divided into five priority classes:

High Priority: Reduce operations which have a selection that contains the lookahead token as a member.
Middle-High Priority: Reduce operations which have a rejection but no selection, and the lookahead token is not a member of the rejection.
Middle Priority: All Shift operations, and Reduce operations which do not fit in any other priority class.
Middle-Low Priority: Reduce operations which have a selection but no rejection, and the lookahead token is not a member of the selection.
Low Priority: Reduce operations which have a rejection that contains the lookahead token as a member.

The priority of a given operation depends on the value of the lookahead token. If a state has several possible Shift and/or Reduce operations associated with the same lookahead token, the parser chooses the operation with the highest priority. If there are several possible operations that all have the same high priority, YAY reports the situation as a Shift/Reduce or Reduce/Reduce conflict. Conflicts only occur when YAY cannot resolve problems by using priority class rankings.

Note that priority decisions are made on the basis of the lookahead token. YAY considers each relevant lookahead token, one at a time, and often chooses different actions for different tokens.

Because of priority classes, [^X] and -[X] have slightly different effects in Reduce actions. [^X] is a selection for any token but X, while - [X] is a rejection for X itself. Therefore: [^X] If the lookahead token is X, the Reduce has Middle-Low priority (lookahead is not in selection). If the lookahead token is not X, the Reduce has High priority (lookahead is in selection). -[X] If the lookahead token is X, the Reduce has Low priority (lookahead is in rejection). If the lookahead token is not X, the Reduce has Middle-High priority (lookahead is not in rejection). Such differences will not affect the outcome for a single Shift/Reduce conflict, but can affect how YAY resolves a set of Reduce/Reduce conflicts.

6.1.5 Using Selections and Rejections

Selections and rejections can be used to resolve most conflicts. Indeed, there may be cases where the most practical course of action is to write a number of conflicting rules that contain selections/rejections to resolve the conflicts, as in

          expr : expr ['+' '-'] op expr
               | expr ['*' '/' '%'] op expr
               | expr ['&' '|'] op expr
                        ...

You can use empty selections and rejections to assign explicit priorities to rules. For example, consider

          symbol : def [^ ]

The selection [^ ] represents any token not listed inside the brackets, which means any token at all. Therefore, no matter what the lookahead token is, it will be inside the selection; therefore, reducing by this rule will always have High priority. Similarly,

          symbol : def [ ]

will always have Middle-Low priority--the lookahead token will never be in the selection because the selection contains no tokens. Using similar tricks with rejections, you can specifically assign a rule Low priority or Middle-High priority.

There are times when embedding [^ ] into the middle of a rule will introduce reduce/reduce conflicts. You may be able to get rid of these conflicts by listing all the relevant tokens inside the brackets.

6.1.6 A Typical Example

Precedence is well-suited for normal expression operators, but using it in other contexts often produces contorted hard-to-understand grammars. For example, consider the following:

          statmt:  IF '(' cond ')' statmt
              |    IF '(' cond ')' statmt ELSE statmt ;

Given this rule, how should the parser interpret the input

          IF ( cond1 ) IF ( cond2 ) statmt1 ELSE statmt2

There are two equally valid interpretations of this input:

          IF ( cond1 ) {
               IF ( cond2 ) statmt1
               ELSE statmt2
          }

and

          IF ( cond1 ) {
               IF ( cond2 ) statmt1
          }
          ELSE statmt2

You could solve this problem by assigning a precedence to the IF and IF- ELSE statements (or more precisely, to the keywords IF and ELSE). However, we don't normally consider such things to have a precedence...not in the same way that expression operators have precedence.

Thus consider what happens at the point when the parser has read

          IF ( cond1 ) IF ( cond2 ) statmt1

and has just picked up ELSE as the lookahead symbol.

It can immediately Reduce the
```
          IF ( cond2 ) statmt1
```
using the first definition of "statmt" and obtain
```
          IF ( cond1 ) statmt ELSE ...
```
thereby associating the ELSE with the first IF.
It can Shift, which means ignoring the first part (the IF with "cond1") and going on to handle the second part, thereby associating the ELSE with the second IF.

In this case, most programming languages choose to associate the ELSE with the second IF--they want the parser to Shift instead of Reduce. You can make this happen by rewriting the grammar to use a simple selection:

          statmt: IF '(' cond ')' statmt -[ELSE]              |   IF '(' cond ')' statmt ELSE statmt
              ;

The rejection -ELSE states that when the lookahead token is ELSE, the parser should prefer to interpret the input as an IF-ELSE statement rather than an ELSE-less IF. This resolves the ambiguity and eliminates the conflict.

6.2 Arbitration Rules

When precedence, selections, and rejections do not provide enough information to remove ambiguities, YAY-produced parsers use the following rule to resolve Shift/Reduce conflicts.

Arbitration Rule 1:: If there is a Shift/Reduce conflict, the default action is to Shift. The conflict is reported in the YAY output so you can check that Shifting is actually what you want.

This is known as an arbitration rule because it helps your parser arbitrate between conflicting possibilities. The rule is only used in situations where other rules cannot resolve the conflict.

In a similar vein, YAY-produced parsers use the following rule to resolve reduce/reduce conflicts.

Arbitration Rule 2:: If there is a Reduce/Reduce conflict, the parser Reduces by the rule that was given first in the rules section of the YAY input. Again, the conflict is reported in the YAY output to let you check that the choice of Reduction is correct.

It is important to stress that Shift/Reduce and Reduce/Reduce conflicts only occur if precedence, selections, and rejections are unable to resolve possible ambiguities. Therefore, the arbitration rules only come into play where precedence, selections, and rejections are either absent or insufficient to solve the problem.

The arbitration rules are simple to state, but they can have complex repercussions if the grammar is non-trivial. If the grammar is sufficiently complicated, these simple rules for resolving conflicts may not be capable of handling all the necessary intricacies in the way you want. You should pay close attention to all conflicts noted in the parsing table report produced by YAY and should ensure that the default actions taken by the parser are the desired ones. Many conflicts can be eliminated by adding appropriate selections or rejections.

Note: Selections and rejections are sufficient to resolve ambiguities in most grammars, without resorting to the arbitration rules stated above. However, there may be instances where you do not want to use selections and rejections. For example, there may be situations in which the choice between actions depends on a greater degree of lookahead than just a single lookahead token or symbol. In such situations, it may be simpler to let the arbitration rules resolve the conflicts and backtrack later on with YYPOP (as described in Section 10.1). In the final version of the grammar, however, it is still best to remove ambiguities with explicit selections or rejections, rather than relying on the arbitration rules.

6.3 Conflicts from Embedded Actions

As noted in Section 3.2.6, YAY handles embedded actions by changing the form of the given grammar rule and creating grammar rules for dummy symbols. For example,

          a : A1 {action1} A2 {action2} A3 {action3};

might be changed to the rules

          $21 : /* null */ {action1} ;
          $22 : /* null */ {action2} ;
          a : A1 $21 A2 $22 A3 {action3};

This technique can introduce conflicts. For example, suppose you have the definitions

          a : A1 {action1} A2 X;
          b : A1 A2 Y;

These are changed to

          $50 : /* null */ {action1};
          a : A1 $50 A2 X;
          b : A1 A2 Y;

The definitions of "a" and "b" give a shift/reduce conflict because the parser can't tell whether A1 followed by A2 has the null string for $50 in the middle. It has to decide whether to Reduce $50 or to Shift to a state diagrammed by

          b : A1 A2.Y

There is another situation to watch for. Consider the grammar

          a : A1 c A2 X ;
          b : A1 c A2 Y ;
          c : /* nothing */ {action} ;

You might think that this is equivalent to

          a : A1 {action} A2 X;
          b : A1 {action} A2 Y;

but it is not. The first will not produce a conflict, but the second one will. The second is converted into

          a : A1 $50 A2 X;
          b : A1 $51 A2 Y;
          $50 : {action} ;
          $51 : {action} ;

even when the action is the same for both "a" and "b". If the parser finds an A1 followed by a null string followed by A2, it doesn't know if it should interpret the null string as $50 or $51; therefore, it gets a conflict. Obviously, it is better to use the first format, where the identical actions are associated with a separate rule.

6.4 Null Rules

A null rule is a rule that defines a symbol as a null string. Null rules may be specified explicitly in the grammar; they may also be generated implicitly when YAY creates grammar rules for dummy symbols.

When your grammar uses embedded rules, it may happen that there are several null rules associated with the same symbol. YAY merges all such rules into a single rule, provided that the rules have no actions and have the same set of selections, rejections, and precedence. For example, consider the following rules:

          a : b
            | c ;
          b : X [A] Y ;
          c : X [A] Z ;

At first glance, there would seem to be a Reduce/Reduce conflict between the embedded rules in b and c. However, there is no actual conflict because the rules change into

          a :   b
          a :   c
          $4:   [A]
          b :   X $4 Y          c :   X $4 Z

The two selections for A are merged into a single rule.

7. YAY Output

YAY can produce several output files. Options on the YAY command line dictate which files are actually generated.

The most important output file is the one containing source code that can be compiled into the actual parser. The name of this file is specified with the Parser=file command line option.

Another possible output file contains manifest definitions. The name of this file is specified with Header=file on the command line. This file is a distillation of the declarations section of the YAY input. For example, a %token directive like

          %token IF

is converted to a statement of the form

          #define IF 257

in the C version of the manifests. By including this file with

          #include "filename"

separately compiled modules can make use of all the pertinent definitions in the YAY input.

The third output file that YAY can produce is called the Parser Description. The name of the file is specified with Description=file on the command line. The Parser Description is split into several sections:

a summary of the grammar rules;
a list of variables (non-terminal symbols) defined by the rules;
a list of the tokens used in the rules, including special symbols like $catch and error (discussed in Section 8.1 and 8.2);
a description of each state for the parser;
a list of statistics for the parser generated by YAY.

The sections that follow show what the Parser Description looks like for the following grammar:

          %token IF ELSE A
          %%
          stmt : IF stmt ELSE stmt
               | IF stmt
               | A
               ;

7.1 Rules Summary

The rules summary section of the Parser Description begins with the command line used to invoke YAY. This is intended to serve as a "heading" for the output material. We'll use the command line

          yay infile.y description=out +verbose

YAY input is in the file infile.y and output is written to the file out. We have specified the +Verbose option because this provides the largest amount of output. See "expl yay" for more information on the YAY command line.

Next comes a summary of the grammar rules. In our example, we will have

          Rules:
           (0)     $accept:  stmt $end
           (1)     $error:  $error
           (2)     stmt:    IF stmt ELSE stmt
           (3)     stmt:    IF stmt
           (4)     stmt:    A

The 0th rule is always the definition for a symbol named $accept. This describes what a complete input looks like: the Start symbol followed by the end marker. Rule 1 is always a definition for $error, used for implementing certain types of grammar rules. Other rules are those given in the grammar.

YAY puts a formfeed character on the line after the last grammar rule so that the next part of the Parser Description starts on a new page.

Note: YAY handles precedences by adding appropriate selections and/or rejections to rules. The rule summary shows the rules in their final form, with the added selections/rejections.

7.2 Variable Descriptions

The Parser Description output lists all the variables (non-terminal symbols) defined in the grammar. For example, here is the list of variable descriptions from our sample:

          Variables and their defining rules:
          V1 $error   1
          V2 stmt     4 3 2

There is a separate line for each symbol. The line gives the name of the symbol, then lists the grammar rules that define that symbol. Each line begins with a label (V1, V2, and so on) to identify these as variables. These labels are useful when parser macros are used (as described in Chapter 12).

YAY puts a formfeed character on the line after the last line so that the next part of the Parser Description starts on a new page.

7.3 Terminal Symbol Descriptions

The Parser Description output contains a list of all the tokens (terminal symbols) explicitly mentioned in the grammar rules. For example, here is the list of token descriptions from our sample:

          Terminals and their precedence and lexical value
          T1 error       256
          T2 $catch      257
          T3 IF          258
          T4 ELSE        259
          T5 A           260

There is a separate line for each symbol. The line gives the name of the symbol and the numeric value. Each line begins with a label (T1, T2, and so on) to identify these as tokens.

If a token has a precedence and associativity, the description line specifies this information as well. For example, the following lines appear in the terminal symbol descriptions for the sample grammar of Chapter 4:

          T4  '+'    L1   43
          T5  '-'    L1   45
          T6  '*'    L2   42
          T7  '/'    L2   47

The notation L1 indicates tokens which are left-associative with a precedence of 1. The notation L2 indicates tokens which are left-associative with a precedence of 2. Right-associative tokens are marked with R followed by the precedence and non-associative tokens are marked with N followed by the precedence. The lowest precedence is always 1 and increasing levels of precedence have increasing numeric values.

YAY puts a formfeed character on the line after the last line so that the next part of the Parser Description starts on a new page.

7.4 State Descriptions

The Parser Description output contains complete descriptions of every possible state. For example, here is the description of one state from our sample grammar.

          State 2
                   stmt:    IF.stmt ELSE stmt
                   stmt:    IF.stmt
                   stmt:   .IF stmt ELSE stmt
                   stmt:   .IF stmt
                   stmt:   .A

              IF     shift 2
              A      shift 1
              .      error

              stmt    goto 4

By now, this sort of diagram should be familiar to you. The numbers after the word "shift" indicate the state to which the parser Shifts if the lookahead symbol happens to be IF or A. If the lookahead symbol is anything else, the parser raises the error condition and starts error recovery.

If the parser pops back to state 2 by means of a Reduce action, the lookahead symbol is now "stmt" and the parser will Goto state 4.

As another example of a state, here is State 1.

          State 1
           (4)     stmt:    A.

              .      reduce (4)

This is the state that is entered when an A token has been found. The (4) on the end of the first line is a rule number. It indicates that this particular line sums up the whole of the fourth grammar rule that was specified in the YAY input. The line

              .    reduce (4)

indicates that no matter what token comes next, we can reduce this particular input using grammar rule (4) and say that we have successfully collected a valid stmt. The parser will perform a reduction by popping the top state off the stack and setting the lookahead symbol to stmt.

It is important to distinguish between

            A  shift 1

in State 2 and

            .  reduce (4)

in State 1. In the Shift instruction, the number that follows is the number of a state. In the Reduce instruction, the number that follows is the number of a grammar rule (using the numbers given to the grammar rules in the first part of the Parser Description). The Parser Description always encloses rule numbers in parentheses, and leaves state numbers as they are.

The complete state descriptions for the grammar are given below.

          State 0
                   $accept: .stmt $end
                   stmt:   .IF stmt ELSE stmt
                   stmt:   .IF stmt
                   stmt:   .A

              IF     shift 2
              A      shift 1
              .      error

              stmt    goto 3

          State 1
           (4)     stmt:    A.

              .      reduce (4)

          State 2
                   stmt:    IF.stmt ELSE stmt
                   stmt:    IF.stmt
                   stmt:   .IF stmt ELSE stmt
                   stmt:   .IF stmt
                   stmt:   .A

              IF     shift 2
              A      shift 1
              .      error

              stmt    goto 4

          State 3
                   $accept:  stmt.$end

              $end   accept
              .      error

          State 4, LALR(1) S/R Conflict(s)
                   stmt:    IF stmt.ELSE stmt
           (3)     stmt:    IF stmt. / [$end ELSE]

            Shift/reduce conflict (5,3) on ELSE
              $end   reduce (3)
              ELSE   shift 5
              .      error

          State 5
                   stmt:    IF stmt ELSE.stmt
                   stmt:   .IF stmt ELSE stmt
                   stmt:   .IF stmt
                   stmt:   .A

              IF     shift 2              A      shift 1
              .      error

              stmt    goto 6

          State 6
           (2)     stmt:    IF stmt ELSE stmt.

              .      reduce (2)

The parser always begins in State 0, i.e. in a state where no input has been read yet. An acceptable input is a stmt followed by the end marker. When a stmt has been collected, the parser goes to State 3. In State 3, the required end marker $end indicates that the input should be accepted. If anything else is found, it is excess input and means an error.

In State 4, the rule labeled (3) has

          / [ $end ELSE ]

on the end. This just means that the parser expects to see one of these two tokens next. The / character distinguishes this construction from a user-specified selection preference. Such constructions are typically only generated when the state requires an extra degree of lookahead.

7.4.1 Renumbering States and Rules

During execution, YAY typically renumbers the states and rules of your program for more efficient execution. This process is mostly transparent to the programmer unless you are debugging the program while it runs. For more information about analyzing your program after the states and rules have been renumbered, see the description of YYDEBUG in Section 13.5.1.

7.4.2 Deleted Items

As discussed in Section 6.1.4, YAY ranks Shift and Reduce operations into five priority classes. In states where some operations conflict, YAY attempts to resolve the conflict by choosing the operation with the highest priority. As a result, some states may have operations which are never used because the operations are overridden by higher priority operations.

The output for YAY displays these never-used operations in the state descriptions. The operations are labeled Deleted Items. Since the parser never uses these items, the parser does not make entries for the items in its data tables. The items are listed for the programmer's information.

When YAY deletes an operation from one state, it checks the effects of the deletion on other states. For example, if a particular Shift is never used, YAY checks the state that the unused operation would Shift to; this may lead to deleting one or more operations from the new state, since there are now fewer ways to reach this new state.

It is not unusual for grammars to have a few deleted items in some states or even some deleted states (where all items are deleted). However, if there are a large number of deleted items in a particular state, it may be a sign that your grammar isn't working the way you expect. It also works the other way around: if your grammar isn't working properly, examining which items are deleted may help you determine what's going wrong.

7.5 Conflict Summary

As noted in the sample output, there is a shift/reduce conflict in State 4. If an ELSE is encountered while the parser is in this state, the parser doesn't know whether to shift (into State 5) or to reduce using rule (3). Thus YAY prints the message

          Shift/reduce conflict (5,3) on ELSE

After the state descriptions, YAY prints a summary of all such conflicts. With our sample grammar, the output contains

          Conflicts:
              State  Token      Action
                  4  ELSE       shift 5
                  4  ELSE       reduce (3)

Each line in this summary gives the number of the state in which the conflict occurs, the token(s) that cause the conflict, and the conflicting actions.

The conflict report shows you what action the parser takes in case of a conflict--the parser always takes the first action shown in the report. This action is chosen in accordance with the arbitration rules.

7.6 Parser Statistics

The last section of the Parser Description is a set of statistics summarizing YAY's work. Here are the statistics we got when we ran our sample grammar through YAY.

6 rules, 6 tokens, 3 variables
1 shift/reduce conflict
Closure:  2 passes
States: 7, 4 recreated, 2 overallocated
Items:  15, examined  18 (9 kernel), average 2.57
              (1.29 kernel) per state
LALR(1) lookahead: 1 calls; recursions: 2 lalr1, 12 epred
Table entries: 0 shift, 0 goto, 8 exception
Offset table entries: 5 shift, 6 goto, 2 exception
Error default elimination saved 37% of total table space
Optimizer: in 0, out 0
Size of tables:  10 shorts
Max memory = 19K
5.00 seconds, final mem = 8

Some of these values are machine independent (e.g. the number of rules), others are machine dependent (e.g. the amount of memory used), and some can be different every time you run the job (e.g. time elapsed while YAY was running).

In general, these statistics are only of interest to the people who maintain the YAY software.

8. Error Handling

If a piece of input is invalid, the parser can do nothing with it. Except in extreme cases, however, it is inappropriate for the parser to stop all processing as soon as an error is found. Instead, the parser should skip the invalid input and resume parsing as soon after the error as possible. In this way, the parser can find many syntax errors in one pass through the input, saving time and trouble for the user.

YAY therefore tries to generate a parser that can "restart" as soon as possible after an error occurs. YAY does this by letting you specify points at which the parser can pick up after errors. You may also dictate what special processing should take place if an error is encountered at one of these points.

8.1 The "$catch" Symbol

The $catch symbol provides an easy way to deal with erroneous input. When your parser finds a lookahead token or symbol which is not valid in the current state, the parser checks to see if this state can Shift on the $catch symbol. If so, the parser Shifts on $catch. This takes the parser to a new state where it will attempt to use the same lookahead token.

As an example of how this can be useful, consider the following grammar:

          stat : stat_body ';'
                   { /* action */ }
               | stat_body $catch
                   { printf("Missing semicolon.\n"); }
               ;

The first line says that a stat consists of a stat_body followed by a semicolon. If the parser finds a stat_body followed by a token that cannot be used in any other way, it Shifts on $catch. This leads to the recognition action that displays a message telling what went wrong.

When the parser Shifts on $catch, the action pushes the current state on the stack. To keep the state stack and value stack in synch with each other, the parser also pushes a "junk" value onto the value stack.

8.2 The "error" Symbol

The error symbol is a more general way to signify an erroneous input. You may put error into a grammar rule to indicate where you will allow error recovery to take place. For example, you might write

          statement: error
              | /* other definitions of a statement */;

This tells YAY that errors may occur in statements, and that after an error, the parser is free to restart parsing at the end of a complete statement.

The difference between $catch and error is that the parser only Shifts on $catch if the error was discovered while the parser was in the current state. If the error was discovered while the parser was in some other state, the parser will not Shift on $catch but may Shift on error.

As with $catch, when the parser Shifts on error, it pushes a "junk" value onto the value stack so that the value stack and the state stack retain the same number of entries.

Grammar rules may intermix uses of $catch and error. $catch takes precedence over error. For example, consider a state diagrammed as:

          statement : . $catch [ ';' ] { action1 }
          statement : . $catch { action2 }
          statement : . error { action3 }
          statement : . statement_definition

Suppose the lookahead token is invalid. This means that the parser Shifts on $catch to the state

          statement : $catch. [ ';' ] { action1 }
          statement : $catch.  { action 2 }

If the lookahead token is ';', the parser Reduces using the first rule and executes action1. If the lookahead token is anything else, the parser typically Reduces using the second rule and executes action2 (although there are circumstances where it could take the Error action instead).

Now, consider the original state again:

          statement : . $catch [ ';' ] { action1 }
          statement : . $catch { action2 }
          statement : . error { action3 }
          statement : . statement_definition

If the parser reaches this state because of an error discovered in some other state, the parser does not Shift on $catch. Instead, it Shifts on error into the state

          statement : error. { action3 }

The parser will then Reduce and execute action3.

Note: Although error is a token, it can be thought of as a non-terminal symbol representing an arbitrary string of "invalid" tokens up to, but not including, the next valid token. More specifically, error represents tokens starting with the lookahead token of the state with the first Shift on error, up to (but not including) the first token Shifted after the Shift on error. The parser could only make sense of this string of tokens as part of the error, even if some of the string had been parsed before the error was found. For more information on error Shifting, see Section 8.4.

8.3 Default Actions

YAY picks a default action for each state. This is the action that the parser takes if the lookahead symbol is not one of the tokens or symbols that has a specific action associated with it. YAY takes the following steps to determine the default action for a particular state:

If the state can Shift on $catch, the default action will be the Error action.
Otherwise, YAY checks whether it can Reduce by a %default rule. A default rule has the form
```
          symbol: definition %default { action } ;
```
As shown, the keyword %default appears at the end of the body of the rule. If the current state contains a Reduce operation associated with a %default rule, that Reduce operation is chosen as the default action.
As an example of when to use %default, suppose you have a situation where you want to complete a particular action before an error is triggered. You can use %default label the rule with the desired action. Thus if an invalid lookahead token is encountered in input, the desired Reduce becomes the default action for the state. Thus the action code is executed before the error condition is triggered.
If the state cannot Reduce by a %default rule, YAY checks whether the state can Shift on error. If so, the default action for the state will be the Error action.
Otherwise, YAY checks whether the state has a Reduce operation that appears to lead to a state that can Shift on $catch. If so, this Reduce operation is chosen as the default action.
The chosen Reduce operation does not have to lead directly to the state that Shifts on $catch. For example, you might Reduce several times in a row before reaching the state that Shifts on $catch. If several Reduce operations lead to a state that Shifts on $catch, you can use selections and/or rejections to control which Reduce is chosen.
If the Reduce operation does not lead directly to a state that can Shift on $catch, there is no guarantee that the parser will actually reach such a state. After making the first Reduce, the parser will be in a new state. Depending on the actions offered by the new state, the parser may not keep reducing all the way to the state that can Shift on $catch.
Otherwise, YAY checks whether the state has a Reduce operation that appears to lead to a state that can Shift on error. If so, this Reduce operation is chosen as the default action.
Again, the chosen Reduce operation does not have to lead directly to the state that Shifts on error. If several Reduce operations lead to a state that Shifts on error, you can use selections and/or rejections to control which Reduce is chosen. Also, there is no guarantee that the parser will actually Reduce all the way to the state that can Shift on error.
Otherwise, YAY checks whether the state has any other Reduce operations associated with it. If so, YAY may choose one of these Reduce operations as the default action, depending on various heuristic criteria.
Otherwise, YAY sets the default action for the state to be the Error action.

As an example, you may see the following in the YAY Parser description report for a grammar:

          ';'      shift 1
          $catch   shift 2
          
          .        error

This says that if the lookahead token is ';', the parser should Shift to state 1. If the lookahead token is anything else, the parser takes the Error action. The Error action sees that this state can Shift on $catch, and therefore Shifts to state 2.

You may also see

          .     reduce (3)

This is the usual form when the default action is a Reduce operation. If the lookahead symbol is not one of the values recognized as valid in the current state, the parser will Reduce using grammar rule 3.

It is important to understand the implications of YAY choosing a Reduce operation as the default action for a state. If the parser reaches that state and finds an invalid lookahead token, the parser will perform the default Reduce rather than start any kind of error processing. In fact, there may be several Reduce operations before the parser reaches a state where the default action is Error rather than Reduce.

8.3.1 Conflicts in Default Actions

Choosing a default action for a state may result in Reduce/Reduce conflicts. For example, suppose there is no %default rule for a state and the state does not have a Shift on error or $catch. If there are several Reduce operations which lead directly or indirectly to a state that can Shift on $catch, there will be Reduce/Reduce conflicts between these operations. However, YAY does not report these conflicts; it simply chooses one of the possible Reduces using internal heuristics.

In this situation, selections/rejections of the form

          [$catch]
          [^ $catch]

may specify which Reduction to choose, when there are several that lead to a state that can Shift on $catch. Similarly, selections/rejections of the form

          [error]          [^ error]

may specify which Reduction to choose when there are several that lead to a state that can Shift on error.

Selections and rejections are only useful when there are several possible candidates of the same type (for example, several Reduces that all lead to states that can Shift on $catch). If there is only one Reduce operation leading to a state that can Shift on $catch, the parser will take that Reduce, even if you try to use

          [^ $catch]

to avoid taking that Reduce. Similarly, you cannot use

          [^ error]

to avoid taking a Reduce that leads to a state that can Shift on error if there is only one such Reduce.

8.3.2 %nodefault

As discussed in a previous section, YAY may designate a Reduce operation as the default action for a state. In some situations, you may want to make sure that the parser does not pick a particular Reduce operation as the default action. For example, suppose that the recognition action associated with Rule X has a number of side-effects (like freeing allocated memory) which should only be performed when you are sure you are working with correct input. In this case, you can use the keyword %nodefault to indicate that Reducing by the rule should never be chosen as the default action. You specify %nodefault at the end of the body of the rule, as in

          symbol: definition %nodefault { action } ;

When a rule is marked with %nodefault, YAY will never choose a Reduce operation using that rule as the default action for a state, even if there are no alternative Reduce operations or Shift on $catch or error. If there are no alternatives, the default action for the state will be the Error action.

8.4 The Error Action

As discussed in the previous section, every state has an associated default action. This is the action that will be taken if the lookahead token or symbol is not one of the values recognized as valid in the current state. The default action may be a Reduce, or the Error action.

The Error action has the following steps:

If the current state can Shift on $catch, the parser performs that Shift.
Otherwise, the parser executes the macro call
```
          YYSERROR("Syntax error");
```
By default, this calls a user-defined function to display a diagnostic message. For more details, see Section 8.7.
(c) The parser enters error recovery mode. It does this by setting the variable "yyerrflag" to the value YYE_SHIFT(0). YYE_SHIFT is a macro that takes an integer argument and returns a code value for "yyerrflag".
If the current state cannot Shift on error, the parser pops the current state off the state stack, and pops the corresponding value off the value stack. It keeps on popping until it finds a state that can Shift on error.
The parser Shifts on error. It also pushes a "junk" value onto the value stack, so the state and value stacks still have the same number of entries.

In some situations, the Error action pops all the way to the bottom of the state stack without finding a state that can Shift on error. For example, the grammar may have no provisions for error recovery. In this case, "yyparse" simply terminates and returns the value 1 to its caller.

Notice that the parser does not go into error recovery mode if it can Shift on $catch when the error is first detected.

Also notice that the parser always Shifts on error or $catch. It never Reduces on error or $catch, even if the grammar has a state where error or $catch is associated with a Reduce action. However, you can use %default to force a Reduce action to happen in a situation that would otherwise trigger an error. This is because %default tells YAY to take the associated Reduce action even if it is also possible to Shift on error.

Rationale: Section 8.2 suggested that the error symbol corresponds to an "invalid" string of input tokens. The parser may not discover that a given string of tokens is invalid until it has parsed partway through the string. Loosely speaking, the process of popping back, discarding states and values, corresponds to throwing away the first part of the string: the part that was parsed before the error was detected. When you reach a state that can Shift on error, the assumption is that you have reached the beginning of the invalid input and can begin error recovery (although subsequent parsing may reveal that you haven't gone back far enough yet, so you have to pop back farther).

8.4.1 Error Recovery Mode

After the parser Shifts on error, the lookahead symbol is whatever token caused the error condition in the first place. The parser then tries to proceed with normal processing. However, the parser remains in error recovery mode for the time being, as described in the rest of this session.

Therefore, suppose the parser goes into error recovery mode and Shifts on error into a new state. At this point, the value of "yyerrflag" is YYE_SHIFT(0). The lookahead token has not changed; it is the same token that resulted in the Error action in the previous state.

From the new state, parsing continues. As always, the possible actions are Shifts, Reduces followed by Gotos, and Error actions.

Whenever the parser Shifts, it changes the value of "yyerrflag". For example, after one successful Shift, "yyerrflag" is set to YYE_SHIFT(1). After another Shift, it is set to YYE_SHIFT(2). After a total of three successful Shifts, "yyerrflag" is set to YYE_NONE. At this point, the parser leaves error recovery mode and proceeds with normal parsing.

Whenever the parser Reduces, it performs the Reduce action as usual, followed by a normal Goto. Reduces and Gotos have no effect on error recovery mode.

If the parser takes an Error action, the result depends on the value of "yyerrflag":

If "yyerrflag" is still YYE_SHIFT(0), there has been no Shift since the last Error action. The parser discards the current lookahead token and reads a new one; the parser does not change state. If the new lookahead token also results in an Error action, the parser discards the token and reads another new one. In this way, the parser stays in the same state, discarding tokens and reading new ones until it finds a token which does not result in an Error action in this state. After that parsing continues normally. "yyerrflag" retains its value of YYE_SHIFT(0), and the parser remains in error recovery mode.
If "yyerrflag" is YYE_SHIFT(1) or YYE_SHIFT(2), the parser has Shifted at least once since entering error recovery mode. In this case, the parser checks whether the current state can Shift on error; if so, the parser takes the Shift. Otherwise, the parser pops back through the state stack one state at a time until it finds a state that can Shift on error. After Shifting on error, the parser resets "yyerrflag" to YYE_SHIFT(0) and proceeds with normal processing.

Notice that the parser does not Shift on $catch once it has entered error recovery mode. For example, suppose the parser is already in error recovery mode and executes the Error action in a state. Even if the state can Shift on $catch, the parser will not do so.

Also notice that the parser does not call YYSERROR if syntax errors occur during error recovery mode. YYSERROR is only called for "new" errors: ones encountered when "yyerrflag" is YYE_NONE.

If a rule ends in the error token and is labeled as %nodefault, the general effect will be that the parser will stay in that state until it finds a valid token. %nodefault prevents the state from Reducing as the default action, so the default action must be Error. This means that the state will either Shift or Reduce on a valid token, or discard tokens until a valid one is found.

8.4.2 Using yyerrflag

An action can check "yyerrflag" to see if there has been a recent error. For example, if an action encounters a semantic error, it can check "yyerrflag" to see if it equals YYE_NONE. If not, the parser is in error recovery mode, and the semantic error is almost certainly a result of the previous syntax error. Using this information, the action may choose to modify or suppress the semantic error message.

Actions should not change the value of "yyerrflag" themselves. Instead, call the "yyerrok" macro, as described in Section 8.4.3.

8.4.3 The yyerrok Macro

In some situations, you may want YYSERROR to be called even if the parser hasn't seen three correct tokens since the last error. There are also situations in which you want to be able to Shift on $catch, even if the parser was in the process of recovering from an error. In these cases, you can use the "yyerrok" macro to leave error recovery mode prematurely. Then, if the parser encounters another error, it is treated as a new error rather than a continuation of the previous one.

For example, suppose you have a parser for a line-by-line desk calculator. A line of input contains errors, so YYSERROR is called. YYSERROR typically prints an error message to the user, throws away the rest of the line, and prompts for new input. If the next line contains an error in the first three tokens, the parser will normally start discarding input without calling YYSERROR again. This means that YYSERROR doesn't print an error message for the user, even though the input line is wrong.

To avoid this problem, you can explicitly tell the parser to leave error recovery mode, even if it hasn't yet Shifted on three valid tokens. Simply say

          yyerrok ;

as part of the error recognition action. For example, you might have the rule

          expr : error
               {
                   yyerrok;
                   printf("Please re-enter line.\n");
                   yyclearin;               }

The "yyerrok" macro expands into code that takes the parser out of error recovery mode and lets it start fresh. As one of its actions, "yyerrok" sets "yyerrflag" to YYE_NONE.

8.5 Writing Rules with Errors in Mind

Suppose a parser has just Shifted on error and is now in error recovery mode. If the lookahead token is invalid, the parser discards the token and keeps reading new tokens until it finds one that is valid in the current state. It is therefore desirable that Shifting on error puts you into a state where a large number of tokens are valid. This minimizes the amount of input that is discarded when searching for a valid token.

For example, consider the following rules:

          block     : /* nothing */
                    | block statement
                    ;
          statement : error ';'
                    | /* defs for valid statements */
                    ;

If there is an error in the middle of a statement, the parser will typically pop back to a state of the form

          statement : . error ';'
          /* other definitions of statement */

and then Shift on error into the state

          statement : error. ';'

At this point, the only valid lookahead token is a semicolon. Therefore, the parser discards all subsequent input until it finds a semicolon. In the process, it may discard other tokens that are important for understanding the input (for example, parentheses, braces, and other delimiters that must be balanced for syntactic correctness). This often makes it very difficult to parse any of the remaining input.

On the other hand, if the rule is written as

          block     : /* nothing */
                    | block statement
                    ;
          statement : error
                    | ';'
                    | /* defs for valid statements */
                    ;

a Shift on error leaves the grammar in the state

          statement : error.

The parser can immediately Reduce to a statement, after which the parser looks for the beginning of a new statement. The result is that the parser only discards tokens until it finds a token that can start a new statement. Compare this with the first result: in the first example, the parser discarded input until it found a semicolon; in the second example, the parser only discarded input until it found any token that could start a statement. The second approach generally discards less input. More of the program gets parsed, and there is less chance of completely losing your place by skipping over parentheses, braces, and other important delimiters.

8.6 The yyclearin Macro

After Shifting on $catch or error, the parser restores the lookahead symbol to the value it had at the time the error was detected. However, this is sometimes undesirable.

For example, your grammar may have a recognition action associated with the error symbol and this may read through the next lot of input until it finds the next sure-to-be-valid data. If this happens, you don't want the parser to pick up the old lookahead symbol again once error recovery is finished.

If you want the parser to throw away the old lookahead symbol after an error, put

          yyclearin ;

in the recognition action associated with the error symbol. "yyclearin" is a macro that expands into code that discards the lookahead symbol.

8.7 YYSERROR

If a syntax error occurs outside error recovery mode and the parser cannot Shift on $catch, the parser immediately makes the macro call

          YYSERROR("Syntax error");

This happens before the parser Shifts on error.

By default, the above macro call expands to

          yyerror("Syntax error");

"yyerror" is a user-defined function that is passed one argument: a character string describing the type of error that just took place. For syntax errors, this string is always

          "Syntax error"

There are several other possible error strings related errors in the way the parser uses stacks; for full details, see Section 8.7.1.

You must write your own "yyerror" function. The simplest such functions either abort the parsing job or print the error message that the parser passes as an argument.

Your "yyerror" function may also output a more detailed diagnostic message explaining the error that has just occurred. This means that your user-written code should make note of any information required to output useful error messages. For example, it is often helpful for "yylex" to keep a count of input lines that have been read. This lets your error message include the line number where the error was detected.

In place of the default YYSERROR, you may wish to #define your own YYSERROR in the declarations section of your YAY input. For example, you may want to define YYSERROR to call a function that is specifically defined to deal with syntax errors, while the normal "yyerror" function deals with stack errors.

8.7.1 The yyerror Function

The "yyerror" function is called directly when the parser detects an error with the stacks used in the parsing process. "yyerror" is passed a single argument consisting of a character string. Possible values are

          "Not enough space for parser stacks"

which is passed if the parser cannot allocate the stacks it needs to work, and

          "Parser stack overflow"

which is used if an existing parser stack runs out of space. If your program runs into problems of this kind, you may need to change the way your program uses stacks; see Section 13.7 for details.

Remember that the default YYSERROR calls "yyerror" to handle syntax errors as well as stack errors. In this case, your "yyerror" function should be able to handle syntax errors as well as parser stack errors.

8.7.2 Changing yychar in yyerror

Your "yyerror" or YYSERROR may change the value of "yychar" (the variable that indicates the type of token just read by "yylex"). If this happens, "yyparse" immediately tries to Shift or Reduce using the new "yychar" token in the same state. If there is a valid action that can be taken, "yyparse" proceeds as if the error never happened.

In general, the ability to change "yychar" in this way is a hold-over from older parser generators. With YAY, you will find it easier to use $catch instead of changing "yychar".

8.8 Other Error Support Routines

This section discusses a number of macros that can be used in the source code of recognition actions.

8.8.1 The YYABORT Macro

YYABORT halts "yyparse" in midstream and immediately returns a 1. To the function that called "yyparse", this means that "yyparse" failed for some reason.

8.8.2 The YYACCEPT Macro

YYACCEPT halts the parser in midstream and returns a 0. To the function that called "yyparse", this means that "yyparse" terminated successfully, even if the entire input has not yet been scanned.

8.8.3 YYE_SHIFT and YYE_NONE

The YYE_SHIFT macro specifies values for "yyerrflag". YYE_SHIFT(0) is the value after an error has occurred, but before the parser has managed to Shift on any non-error token. YYE_SHIFT(1) is used after the parser has Shifted once, YYE_SHIFT(2) after the parser has Shifted twice, and so on.

The macro YYE_NONE gives the value of "yyerrflag" when you are not in error recovery mode.

8.8.4 The YYERROR Macro

YYERROR (upper case) is a macro that "fakes" a syntax error. More specifically, it sets "yyerrflag" to YYE_SHIFT(0), enters error recovery mode, and begins popping states off the stack in search of a state that can Shift on error.

Consider the following example:

          full_stat : statement ';'
                    ;
          statement : $catch [ ';' ] { action1 }
                    | $catch %default {
                            /* other code */
                            YYERROR ; }
                    | error { action3 }
                    | statement_definition
                    ;

If the program is beginning to parse a statement and an invalid lookahead token is found, the grammar executes action1 if the lookahead symbol is a semicolon. Otherwise, the parser Reduces using the rule containing the second $catch. This executes code ending in YYERROR, which starts the error-handling process. The error-handling process will Shift on error, as provided for by the third rule in the definition of statement.

The parser might also reach this state if an error is detected in the middle of a statement. In this case, the error would have been detected in another state, so the parser does not Shift on either $catch. Instead, it Shifts on error into the state

          statement : error. { action3 }

The parser then Reduces and executes action3. This lets you create different type of error recognition actions depending on how the error was encountered.

8.8.5 How YYERROR May Be Used

As discussed in the previous section, the YYERROR macro creates an artificial error condition. To show how this can be useful, suppose we have a line-by-line desk calculator that allows parenthesization of expressions and suppose we have a variable named "depth" that keeps track of how deeply parentheses are nested. Every time the parser finds an opening parenthesis, it adds 1 to "depth". Every time it finds a closing parenthesis, it subtracts 1.

Consider how the following definitions will work.

          expr : lp expr ')'
                    {depth--;}
               | lp error
                    {depth--;}
               ;
          lp : '(' {depth++;};

If no error occurs, the "depth" variable is incremented and decremented correctly. If an error does occur, however, what happens? The YYSERROR macro is invoked to recover from the error in the middle of an expression. Often, it is more reasonable to postpone this recovery until you reach a point where you have a whole expression. Therefore, you might use the following alternate definition.

          expr : lp error
                    {depth--; YYERROR;}
               ;
          line : error '\n' prompt line
                 { $$ = $4; } ;
          prompt : /* null token */
                 {printf("Please re-enter line.\n");};

Now, what happens when the grammar is asked to parse a line like

          1 + (( a +

When the end of the line is encountered, the parser recognizes an error has occurred. Going up the stack, the first state ready to handle the error will do so using the item

          expr : lp.error ;

At this point, the parser will Reduce the input

          ( a +

into an "expr". The Reduction executes the recognition action: it decrements "depth", then signals that an error has taken place. The Error action begins popping the stack again. It will find the previous opening parenthesis, recognize another

          lp error

construct and perform another reduction. The parenthesis count is again decremented, and another error condition generated.

This time, the grammar rule that deals with the error is the definition of "line". An error message is issued and a new line is requested. In this way, the parser has worked its way back to error-handling code that can deal with the situation. Along the way, the parser correctly decremented the "depth" variable to account for the missing parentheses.

This method of dealing with errors decrements "depth" for every unbalanced opening parenthesis on the line. This corrects the "depth" count properly. Our first definition (without the YYERROR call) would only have decremented "depth" once.

This example is somewhat contrived, of course--you could always just set "depth" to zero whenever you started a new line of input. The usefulness of the technique is more apparent in situations where you obtain memory with "malloc" whenever you get an opening delimiter and free the memory with "free" whenever you get a closing delimiter. In this case, it is obvious that you need to do precisely as many "free" operations as "malloc" operations, so you must raise the error condition for each unbalanced opening delimiter.

You might think that the symbol "lp" is unnecessary, and you could just define

          expr : '(' {depth++;} expr ')' {depth--;}
               | '(' error {depth--;} ;

However, this would not work in general. There is no guarantee that the action

          {depth++;}

would be executed in all cases, particularly if the token after the '(' was one that could not start an expression.

As an interesting example of another way to use YYERROR, consider the following (taken from a parser for the Pascal programming language). label_list : label_list ',' label | label | error | error [LABEL CONST VAR PROC FUNC BEGIN] {YYERROR;}; This deals with errors in two different ways:

If an error is followed by one of the tokens LABEL, CONST, etc. (representing the beginning of new declaration sections in Pascal), the input is Reduced to a "label_list". The recognition action associated with this Reduction then executes YYERROR to raise an error condition. This has the effect of popping the "label_list" off the value stack and going back to the previous state on the state stack that can Shift on error.
The other rule is used when the parser finds an error which is not followed by one of the listed tokens. This corresponds to an error in the middle of a label list and requires a different sort of handling. In this case, the parser simply Reduces the error to a "label_list". This "label_list" remains on the value stack and the parser resumes processing from that point.

This kind of approach can be used to distinguish different kinds of errors that may take place in a particular situation.

8.9 The $error Symbol

$error is a non-terminal symbol that is automatically defined for every grammar. It is used internally, for implementing multi-valued grammar rules (described in Chapter 11). YAY automatically adds a rule of the form

          $error: $error

to every grammar.

9. Types

Chapter 3 mentioned that "yylval" is int by default, as are $$, $1, $2, etc. If you want these to have different types, you may redeclare them in the declarations section of the YAY input. This is done with either a %union statement or a %struct statement.

9.1 The %union Statement

The %union statement has the form

          %union {
               /* possible types for "yylval" and
                * $$, $1, $2, etc.  */
          }

For example, suppose "yylval" can be either integer or floating point. You might write

          %union {
               int intval;
               float realval;
          }

in the declarations section of the YAY input. YAY converts the %union statement into the following source:

          typedef union {
              int intval;
              float realval;
          } YYSTYPE;

"yylval" is always declared to have type YYSTYPE. If no %union statement is given in the YAY input, the parser uses

          #define YYSTYPE int

As a more complicated example of %union, you might define

          %union {
              int intval;
              struct {
                  unsigned code;
                  SYM *symptr;
              } sym;
          };

This states that "yylval" values may either be integers or values of the given structure type. The structure type has two elements: a code indicating a particular kind of argument and a pointer to a SYM value (which presumably contains information about a symbol found in the input).

9.1.1 The %type Statement

Once YYSTYPE has been defined as a union, you may specify a particular interpretation of the union by including a statement of the form

          %type <interpretation> SYMBOL SYMBOL ...

in the declarations section of the YAY input. The "interpretation" enclosed in the angle brackets is the name of the interpretation of the union variable that you want to use. The SYMBOL names are either the names of non-terminal symbols defined in the grammar rules, or tokens that have already been declared in %token, %left, %right, or %nonassoc statements. For example, you might write

          %type <intval> intexp
          %type <realval> realexp

to indicate that an integer expression has an integer value and a real expression has a floating point value. Thus, when YAY is parsing rules like

          intexp : intexp op intexp;
          realexp : realexp op realexp;

YAY knows that the value of $$ is an integer for intexp and a floating point value for realexp.

The form

          %type <*> SYMBOL SYMBOL ...

indicates that the symbol values refer to the whole YYSTYPE, rather than a particular interpretation.

The %type statement can also take the form

          %type <> SYMBOL SYMBOL ...

with no type name inside the <>. This indicates that the given symbols do not have associated "yylval" values and therefore do not have types. For example, in a grammar for the C language, you might write

          %type <> forstat whilestat ifstat

to indicate that for statements, while statements, and if statements did not have associated values. YAY will issue an appropriate diagnostic message if you attempt to use a $$ or $N symbol in connection with such items.

The %type statement can also take the form

          %type <type1 type2 ...> SYMBOL SYMBOL ...

where each of the types inside the angle brackets are interpretations of the YYSTYPE union. This says that the given symbols may have any of the given types. The default type is type1, the first type given inside the angle brackets.

You may have multiple %type statements for the same symbol, as in

          %type <type1> X
          %type <type2> X

The above statements are equivalent to

          %type <type1 type2> X

Finally, the %type statement can take the form

          %type <type1 type2 ...> *

The * stands for all symbol names that already have some other type associated with them. Therefore, the above statement specifies additional types for any symbol that already has a type.

As a typical example of assigning multiple types to a symbol, suppose you have the following code:

          %union {
              int intval;
              struct {
                  unsigned code;
                  SYM *symptr;
              } sym;
          };

          %type <sym.code sym.symptr> SYMBOL

The "yylval" value associated with SYMBOL will be a structure with a code and a symptr. The %type statement makes it possible to refer to either the code or the symptr value separately in recognition actions (although code is the default).

As another example, suppose you define

          %type <sym> xyz

Then you can reference elements of the structure "sym" with expressions like

          $$.code    or    $$.symptr

Note: Items inside the angle brackets of the %type instruction are delimited by white space. Therefore, if you specify types like sym.symptr and sym.code above, you must not put white space around the "." character. For example, the following would lead to syntax errors in your YAY input: %type <sym . symptr sym . code> SYMBOL /* Error! */

9.1.2 Assigning Types to Tokens

Tokens may also be declared to have types. This can be done in the %token statement in the declaration section, and again shows the union interpretation in angle brackets, e.g.

          %token <realval> floatnum

You can also specify types in %left, %right, and %nonassoc statements, as in

          %left <intval> '&' '|'

This raises the possibility of assigning several types to the same token, as in

          %token <intval> NUM
          %type <realval> NUM

In this case, the token can have either of the given types. The default type is the type that appears earliest in the grammar input.

The notation

          %type <type1 type2 ...> *

adds additional types to tokens as well as to non-terminal symbols.

9.1.3 Explicit Typing

You can specify an explicit interpretation of the union by placing the name in angle brackets after the first $. For example,

          $<intval>1

takes the value of $1 and uses the intval interpretation. This means that YAY generates code similar to

          yylval.intval

which uses the dot operation to pick the desired interpretation of the union data object.

If you do not specify the type explicitly in this way, YAY assumes that the type of $1 is the default type for the associated token or symbol. As mentioned in a previous section, the default type is the first type associated with the token or symbol. For example, in

          %type <type1 type2 ...> SYMBOL

the default type is type1. YAY will use a dot operation to select the default interpretation from the YYSTYPE union.

When you specify an explicit type, the type must be one of the possible alternatives specified in a preceding %type statement. This is why declarations like

          %type <sym.symptr sym.code> SYMBOL

are useful--they let you refer to individual structure elements using input like

          $<sym.symptr>1
          $<sym.code>1

As noted in Section 3.2.2, the parser performs the operation

          $$ = $1

as a default recognition action, before performing any other recognition action specified. This assignment stores the entire YYSTYPE value of $1 in $$, NOT the default interpretation. However, YAY does attempt to verify the type compatibility of the default assignment.

Note: Although it is easier to think of the default recognition action as $$=$1, it should actually be written

          $<*>$ = $<*>1

The entire YYSTYPE value of $1 is used as the value of $$.

9.1.4 Inferred Types

If an action refers to a $N value for which there is no declared type, YAY attempts to infer a type. If the value is specified as

          $<type>N

the corresponding symbol is henceforth assumed to have the type that is specified inside the angle brackets.

YAY also attempts to infer a type from default

          $$ = $1

assignments. This means that the type of any $$ is assumed to be the same as the type of $1, in the absence of any other way to determine the type of $$.

This assumption is useful for rules which have embedded actions. Since the actions have no associated name, you cannot declare their type with a %type statement. You can still specify a type explicitly using $<type>N or $<type>$ .

9.2 The %struct Statement

The %struct statement is similar to %union and has the same purpose. It defines YYSTYPE as a structure type rather than a union. For example, you might put the following in the declarations section of your YAY input.

          %struct {
              char * filename;
              int linenum;
              union {
                  int intval;
                  float realval;
              } nums;
          }

This generates the typedef definition

          typedef struct {
              char * filename;
              int linenum;
              union {
                  int intval;
                  float realval;
              } nums;
          } YYSTYPE;

Since this version of YYSTYPE is a structure, all the "yylval" values stored on the value stack are structures. The example given above lets you associate a file name and a line number with each value placed on the stack; in this way, the parser can record the file name and line number where a particular item appeared in the input being parsed.

If $N refers to the "yylval" value of some symbol,

          $<element>N

refers to a particular element of the "yylval" structure. For example,

          $<filename>1

refers to the filename element associated with the $1 structure.

          $<nums.intval>1

refers to the integer interpretation of the nums union within the YYSTYPE structure.

In situations like this, it is useful to specify

          %type <filename linenum> *

which states that tokens may have multiple types as well as non-terminal symbols.

Note: Since C++ is very restrictive about the contents of unions, most grammars that produce C++ code should use %struct instead of %union.

9.3 The Default Action

The default action associated with any rule can be written as

          $<*>$ = $<*>1

which means that the value of associated with $1 on the value stack is assigned to $$ on the value stack when the rule is reduced. This assignment copies the entire YYSTYPE value. YAY issues a diagnostic message if this assignment appears to be type- incompatible.

Note: Suppose that the recognition action associated with a rule explicitly states

          $$ = $1

This explicit assignment may not have the same effect as the implicit assignment. For example, suppose that you define

          %union {
              float floatval;
              int intval;
          }

Also suppose that the type associated with $$ is floatval and the type associated with $1 is intval. Then the explicit statement

          $$ = $1

requests the compiler to perform an integer to floating point conversion when the value of $1 is assigned to $$, whereas the implicit statement did a union to union assignment and did not perform this conversion. You must therefore be careful and think about the effects of implicit vs. explicit assignments.

9.4 Types in Rules with Embedded Actions

Chapter 3 showed that a rule can have embedded actions, as in

          a : A1 {action1} A2 {action2} A3 {action3};

In this case,

          $1   -- is the value of A1
          $2   -- is the value of $$ in action1
          $3   -- is the value of A2
          $4   -- is the value of $$ in action2
          $5   -- is the value of A3

If types are involved, embedded actions become more complicated. If action1 mentions $$, there is no way for YAY to guess what type $$ has, since YAY does not yet know if $$ refers to an embedded rule or to the left hand side of the explicit rule. The default type for $$ is based on the left hand side of the rule. You must therefore state the type explicitly by specifying the appropriate type name in angle brackets between the two $ signs. If we had

          %union {
              int intval;
              float realval;
          }

you might say

          $<realval>$

in place of $$ in the action, to show that the result had type float. In the same way, if action2 refers to $2 (the result of action1), you might say

          $<realval>2

10. Trial-and-Error Parsing

Previous chapters described the creation of parsers that took action based on a single lookahead token. This chapter examines parsers that can look farther ahead in order to determine what action to take.

10.1 The YYPOP Macro

The YYPOP macro lets you implement parsers that perform any amount of lookahead. YYPOP has the form

          YYPOP( tokenname )

where tokenname is the name of a token that "yylex" may return.

Calls to YYPOP should appear in the recognition action code for some rule. When the macro is executed, it starts popping states from the state stack (and corresponding values from the value stack) until it finds a state that Shifts on the specified token. YYPOP then performs the Shift.

As an example of how this might be useful, consider a construct that can be interpreted as either an expression or a declaration, depending on semantic factors (not purely syntactical ones). You could handle this situation with rules of the following form:

    %token ISEXPR ISDECL  /* never returned by yylex */
        ...
    construct : ISEXPR /* handle as expression */
              | ISDECL /* handle as declaration */
              | ambiguous
              ;

The definition(s) of ambiguous would give the actual syntax of the construct. Recognition actions for ambiguous would examine semantic factors in an effort to determine whether the ambiguous construct was a declaration or expression. If a recognition action finally decided that the construct was an expression, it would execute

          YYPOP( ISEXPR )

YYPOP starts popping states off the state stack until it gets to the state corresponding to the beginning of the construct. Since that state can Shift on ISEXPR, YYPOP takes the Shift and parsing continues in the knowledge that you are processing an expression. Similarly, if a recognition action decided that the construct was a declaration,

          YYPOP( ISDECL )

pops back up the state stack until it can shift on ISDECL.

This demonstrates that YYPOP gives you the ability to perform "trial-and- error" parsing. You can perform any amount of lookahead along one possible line of interpretation, then change your mind and pop back to try a different line of interpretation.

Typically, the token used as an argument for YYPOP is one that is never returned by "yylex". It only appears in the grammar for use by YYPOP.

The standard error-handling process essentially uses

          YYPOP( error )

to find the state that should handle the current error. In other words, the parser pops back up through states on the state stack until it finds a state that can Shift on error.

10.2 YYPOP and "yylex"

If you use YYPOP to pop back to a previous state, your code must communicate with "yylex" to cope with the pop-back process. For example, when you begin a trial-and-error examination of a particular construct, you usually record the seek address where you start the examination. If you pop-back for another pass through the input, you can then restore the input position to the saved address so that you can re-examine the input from the same point. This means that your "yylex" routine must be able to seek backward and "replay" parts of the input token stream.

For one example of how you can do this, see Appendix A.2.

11. Multi-Valued Grammar Rules

The preceding chapters of this manual have discussed simple grammar rules of the form

          symbol : definition {action} ;

YAY also supports more complex definitions of the form

          (symbol1 | symbol2 | ... ) : definition {action} ;

This defines several symbols using the same definition. For example,

  (variable | typedefname | label) : IDENTIFIER {action} ;

states that a variable, a typedefname, and a label are each defined as an identifier.

A multi-valued grammar rule must have an associated recognition action. Furthermore, this action must execute a $select operation, as described in the next section.

11.1 The $select Operator

The $select operator is used in the recognition action of a multi-valued grammar rule to specify which symbol is being defined. For example, consider the rule given previously, defining variable, typedefname, and label. The recognition action could examine the actual identifier and check it against a symbol table to determine if the identifier has been used before. If the identifier is already used as a label in the current scope, the recognition action could execute the statement

          $select label;

to specify that the identifier has been identified as a label. Similarly,

          $select variable;

specifies that you have found a variable and

          $select typedefname;

indicates that you have found a typedefname.

11.2 Why Use Multi-Valued Grammar Rules

Multi-valued grammar rules are typically used in contexts where the lexical analyzer may not have enough information to identify the nature of a token. For example, consider the following example of C source code:

          typedef int a;
          int func (int x) {
              void *a;
              /* rest of function definition */
          }
          a b;

At the time that "yylex" reads the a in the last line of the example, it may be looking ahead from the context of the function definition. In that context, a should be interpreted as a variable name rather than a typedef. Therefore, the lexical analyzer is likely to identify a incorrectly as a variable.

Another example is found in the C++ construct

          A::B

In cases like this, the lexical analyzer may find it impossible to determine what B is on purely syntactic grounds (an object name, a type name, etc.). Therefore, your parser can only determine the meaning after examining semantic information (e.g. what is defined within the appropriate scope).

To avoid this problem, you can simply have the lexical analyzer report that it has found an identifier, rather than having "yylex" decide the nature of that identifier. Later on, when you have all the proper scoping information, you can use $select to tell the parser what kind of identifier you've actually found.

11.3 The $test Operator

The $test operator can be used in the recognition action of a multi-valued grammar rule to test whether selecting a symbol is "valid" in the current context. More specifically,

          $test varname

does a crude validity test to see what would happen if you perform

          $select varname

The validity test examines the state you would end up in if you did a Reduce to varname; if that state has a Goto operation, varname is considered to be valid in this context. In other words, the validity test determines whether the grammar has somewhere to go if it Reduces to varname.

$test returns a non-zero value (true) if the specified symbol passes the validity test and returns zero (false) if the specified symbol does not pass (i.e. there is no Goto operation). We should stress that the validity test is crude and may not be sufficient for sophisticated applications.

The $test operator can test several symbols at once, using the syntax

          $test symbol1 | symbol2 | ...

This tests to see if any of the given symbols pass the validity test. $test returns a non-zero value if any of the symbols pass the test, and returns zero if none of them do.

If a $test expression is followed by an OR-bar character, use parentheses to distinguish the two, as in

          ($test A|B|C) || arg

Note that the parentheses are necessary, even when the OR-bar character is part of a || operator.

11.4 $select with Multiple Arguments

The $select instruction can take the form

          $select symbol1 | symbol2 | ...

YAY performs a $test operation on each of the symbols in the order listed, and selects the first one for which $test returns true. If $test returns false for every symbol in the list, YAY selects the last symbol in the list.

12. YAY Macros

YAY macros serve the same purpose as macros in C or C++. They let you simplify the appearance of your grammar by defining generic rule descriptions which can then be used to construct specific rules.

12.1 Declaring Macros in %type Statements

Before you can define a macro, you must declare it. Declarations are provided by %type statements in the declarations section of your program.

The general form for declaring a macro is

%type <mactype1 mactype2 ...>
        (<arg1type1 arg1type2 ...>,<arg2type1 ...>,...)
        macroname1 macroname2 ...

This should all be placed on one line; it has only been broken into several lines above because of the limited width of the page.

The mactype arguments indicate the possible types of value for the rules that result from expanding the macro. The argtype arguments indicate the possible types of each macro argument. The macroname arguments list one or more identifiers that will be used as macro names. All of the type values must be interpretations of the YYSTYPE %union or members of the YYSTYPE %struct. The following code gives an example:

   %union {
       int intval;
       double realval;
   }
   %type <realval intval> (<realval>,<realval>) mac1

The %type statement declares a macro named mac1. When the macro is expanded into one or more rules, the $$ value for these rules can be either double or int (realval or intval). The default is realval because it is specified first. The macro takes two arguments, both of which are symbols with the realval type.

As shown previously, you can specify a list of types for each argument. For example,

   %type <realval> (<realval intval>) mac2

declares a macro that takes a single argument. The type of this argument may be either realval or intval. When an argument can have several types, the primary type is the first type listed inside the angle brackets.

If a macro does not take any arguments, leave the parentheses empty, as in

   %type <intval> () mac3

While a macro with zero arguments is valid, there is no good reason to create one. The above macro, for example, offers no advantage over defining an ordinary non-terminal symbol named mac3.

12.2 Defining a Macro

The actual definition of a macro is placed in the rules section of the YAY input. A simple macro definition has the form

          macroname(parm1,parm2,...) : rule {action}
                                     | rule {action}
                                     | ...
                                     ;

The parms are identifiers which serve as formal parameters. The rules that follow the : provide the definition of the macro. Within the definition of the macro, the formal parameters may appear as symbols (terminal or non-terminal) or as the argument of %null or %nonnull. (For further information on %null and %nonnull, see Section 13.3.) The formal parameters may also be passed as arguments to another macro.

Here is a simple macro definition.

          leftlist(item) : /* null */                         | item
                         | leftlist(item) ',' item
                         ;

The number of formal parameters given in macro definition must match the number of arguments specified in the %type directive that declared the macro.

12.3 Macro Calls

Once you have defined a macro, you can use it in YAY rule definitions by specifying a macro call. A macro call has the form

          macroname( arg1, arg2, ... )

where macroname is the name of the macro. The macro is expanded by defining a new non- terminal symbol of the form macroname(key,key,...) (if such a definition does not exist already). The key values are abbreviated names for the arguments as given in the token and non-terminal listings of the Parser Description report. The definition of this new non-terminal symbol is derived from the macro definition; wherever formal parameters appear in the macro definition, they are replaced by the corresponding argument value.

For example, the following grammar makes two uses of the leftlist macro defined in the previous section.

          typelist : leftlist(type);
          varlist  : leftlist(var);

YAY expands the above macro usages into the following rules:

          typelist      : leftlist(V42);
          leftlist(V42) : /* null */
                        | type
                        | leftlist(V42) ',' type
                        ;
          varlist       : leftlist(V69);
          leftlist(V69) : /* null */
                        | var
                        | leftlist(V69) ',' var
                        ;

The arguments passed to a macro must be tokens, non-terminal symbols, or %null or %nonnull expressions (as explained in Section 13.3). You cannot pass a macro name as the argument to another macro, but you can pass a macro invocation. For example, mac1(mac2(X)) is valid, but mac1(mac2) is not.

In a macro call, the primary type of each macro argument must match the primary type specified in the %type declaration for the macro. In addition, the overall types for each actual macro argument must include all the types specified in the %type declaration for the macro.

12.4 Multi-Valued Macro Definitions

The format

(macname1 | macname2 | ...)(parm1,parm2,...) : rule {action}
                                             | rule {action}
                                             | ...
                                             ;

is syntatically valid but does not work. At some later point in processing, YAY will issue an error message because of this kind of construct.

13. Advanced Topics

This chapter provides a variety of information about YAY.

13.1 Negative Numbers in $N Constructs

YAY lets you use constructs like $0, $-1, $-2, and so on in recognition actions. These were once important, but the techniques for specifying multiple actions have made them obsolete. YAY only supports the constructs for compatibility with older grammars.

To understand what these constructs mean, it is important to think in terms of the state stack. Each $N construct is associated with a state on the stack; the value of $N is the value of the token or non-terminal symbol associated with the state at the time of a Reduce operation. (Recall that recognition actions are executed when the appropriate Reduce action takes place.)

$1 is the value associated with the state that found the first component of the grammar rule; $2 is the value associated with the second state, and so on.

$0 is the value associated with the state that was on top of the stack before the first component of the grammar rule was found.

$-1 is the value associated with the state before that, and so on. All of these states are still on the stack, and their value can be obtained in this way.

As an artificial example, suppose that a grammar has the rules

          stmt : IF condition stmt
               | WHILE condition stmt
          condition : /* something */
                    { /* action */ }

The action associated with the condition can use the $-1 construct to find out if the preceding token was IF or WHILE. (Of course, this assumes that the only items that can precede a condition are the IF and WHILE tokens.) There are occasionally times when this sort of information is needed.

YAY does not retain typing information for $0, $-1, and so on. Therefore, your code must specify an explicit type for these items if YYSTYPE is a structure or union. For example, you might write $<intval>0 if YYSTYPE is a union type and the type of $0 corresponds to the intval interpretation of that union.

13.2 Lists and Handling Null Strings

Grammars often define lists of items. There are two common ways to do this:

          list : item
               | list item ;

          list : /* null */
               | list item ;

The first definition means that every "list" has at least one item. The second allows zero-length lists.

Using the second definition is sometimes necessary or convenient, but it can lead to difficulties. To understand why, consider a grammar with

          list1 : /* null */
                | list1 item1 ;
          list2 : /* null */
                | list2 item2 ;
          list  : list1
                | list2 ;

When the parser is in a position to look for a "list", it automatically finds a null string, then gets a conflict because it can't decide if the null string is an instance of "list1" or "list2". This problem is less likely to happen if you define

          list1 : item1
                | list1 item1 ;
          list2 : item2
                | list2 item2 ;
          list  : /* null */
                | list1
                | list2
                ;

The parser can determine if it has a "list1" or a "list2" by seeing if the list starts with "item1" or "item2".

13.3 %nonnull and %null

The expression

          %nonnull varname

can be used inside the body of a rule wherever a non-terminal symbol is valid. In the expression, "varname" should be the name of a non-terminal symbol defined elsewhere in the grammar. %nonnull varname stands for all the non-null definitions of varname (if any). For example, here is another way to write the example from the previous section:

          list1 : /* null */
                | list1 item1 ;
          list2 : /* null */
                | list2 item2 ;
          list  : /* null */
                | %nonnull list1
                | %nonnull list2
                ;

This states that a list is either a null string, or a non-null list1 or list2.

YAY implements %nonnull by creating a new symbol for each %nonnull expression. For example, the above input turns into the following rules:

          list1:
          list1:   list1 item1
          list2:
          list2:   list2 item2
          list:
          list:    %nonnull-list1
          list:    %nonnull-list2
          %nonnull-list1:  list1 item1
          %nonnull-list2:  list2 item2

As the example shows, YAY created new non-terminal symbols named %nonnull-list1 and %nonnull-list2, then defined rules for these symbols using the non-null rules for list1 and list2.

The expression

          %null varname

refers to all the null definitions of the non-terminal symbol varname (if any). YAY implements %null by creating a new symbol for each %null expression, in a similar manner to %nonnull. If necessary, YAY also creates new symbols for components used to define varname.

Applying %nonnull to a symbol may implicitly create new non-terminal symbols for components used to define that symbol. For example, consider

          a : b c d ;

If you use the expression %nonnull a, YAY may have to define new non-terminal symbols

          %nonnull-b
          %nonnull-c
          %nonnull-d
          %null-b
          %null-c

and then

          %nonnull-a : %nonnull-b c d
                     | %null-b %nonnull-c d
                     | %null-b %null-c %nonnull-d ;

in order to specify a non-null definition for a. (The above format is needed to prevent Reduce/Reduce conflicts.)

If %null or %nonnull is applied to a symbol defined in a multi-valued rule, as in

          (sym1 | sym2 | sym3) : definition { action };
          sym4                 : %nonnull sym1 ;

the construct is accepted syntactically. However, when YAY creates its definitions for new non- terminal symbols, the result is often erroneous.

If you apply %nonnull to a token, the result is equivalent to the token itself. For example,

          %nonnull '+'

is equivalent to '+'. If you apply %null to a token, the result is equivalent to the $error symbol (which is a non-terminal symbol that cannot be Reduced).

13.4 Right vs. Left Recursion

Chapter 3 mentioned left and right recursion. For example, if a program consists of a number of statements separated by semicolons, we might define it with right recursion as

          program : statement
                  | statement ';' program ;

or with left recursion as

          program : statement
                  | program ';' statement ;

If you think about the way that the state stack works, you will see that the second way is much to be preferred. Consider, for example, the way something like

          S1 ; S2 ; S3 ; S4

would be handled (where all the Sn's are statements).

With right recursion, the parser would gather "S1;" and then go looking for a program. To gather this program, it would gather S2. It would then look at the lookahead symbol ; and see that this program had the form

          statement ';' program

The parser would then go ahead and gather the program after the semicolon. But after S3, it would find another semicolon, so would begin gathering yet another program. If you work the process through, you will find that the state stack grows to seven entries (one for each Sn and one for each ;) before the first Reduce takes place.

On the other hand, if you have the left recursion

          program : program ';' statement

and the same input, the parser will perform a Reduce as soon as it sees

          S1 ; S2

This is Reduced to a single state corresponding to the non-terminal symbol "program". The parser reads ";S3" and Reduces

          program ; S3

to "program" again. The process repeats for the last statement.

If you follow this through, the state stack never grows longer than three states, as compared to the seven that are required for the right recursive rule. With right recursion, no reduction takes place until the entire list of elements has been read; the longer the list, the more stack space you need to handle it. With left recursion, a reduction takes place as each new list element is encountered; long lists do not require any more stack space than short ones. Left recursion can therefore save a lot of stack space.

Note: Apart from left recursion, all types of recursion can gobble up stack space. For example, consider the rule

          expr : '(' expr ')' ;

If you have the input

          (((((((( a+b ))))))))

the parser has to start a new stack entry for each '(' in the list, since the parser has to Shift on each '('. Whenever possible, use left recursion rather than other types of recursion.

The choice of left or non-left recursion can also affect the order in which recognition actions are performed. Suppose T is a token. If you define

          x : /* null */
            | y x {a1} ;
          y : T {a2} ;

then the input

          T T T

executes recognition actions in the order

          {a2} {a2} {a2} {a1} {a1} {a1}

The {a2} actions are performed each time a T is Reduced to "y". The {a1} actions don't happen until the entire list has been read, because right recursion reads the entire list before any Reduce actions take place.

On the other hand, if you define

          x : /* null */
            | x y {a1} ;
          y : T {a2};

the recognition actions for the same input take place in the order

          {a2} {a1} {a2} {a1} {a2} {a1}

With left recursion, Reduce actions take place every time a new element is read in for the list.

This means that if you want the action order

          {a2} {a2} {a2} {a1} {a1} {a1}

you must use right recursion even though it takes more stack space.

13.5 Debugging YAY Parsers

YAY provides a number of features that can be used when debugging YAY-produced parsers. For example, YAY defines a function named "yyprule" which displays a specified rule on the standard output. If you are using a debugger that lets you call functions during a debugging session, you can call "yyprule" to obtain a rule definition (without going to the trouble of checking the Parser Description report). Other debugging support features are described in the following sections.

Note: In YAY code, the macro YYfile is defined to be the standard output for debugging output. If you want to send debugging output to some other destination (e.g. a log file), #define YYfile to the name of the output stream you want to use.

13.5.1 YYDEBUG

If you #define a symbol named YYDEBUG in the declarations section and give it a non-zero value, your parser will print a good deal of debugging information as it parses input. The parser also sets a variable "yydebug" to the value of YYDEBUG. Setting YYDEBUG (and therefore "yydebug") to a positive value produces more output than setting it to a negative value or zero. The rest of this section describes the output you may see.

(Note: If you do not #define YYDEBUG, you do not get any debugging code; it is removed by #ifdef statements. If you #define YYDEBUG as zero, the code is compiled but starts out inactive. You can activate it during execution by assigning a positive value to "yydebug".)

Every time "yylex" obtains a token, the parser prints

          read      (VALUE) T

T is the name of the token and VALUE is the numeric value. Thus if "yylex" has read an IF token, you might see

          read      (257) IF

If "yydebug" is greater than zero, every time the parser enters a state, it will print out

          state  N  (X)

where N is the state number as given in the State Description report, and X is another integer. X is another number for the state--YAY actually renumbers the states and grammar rules after it generates the State Description report in order to improve the parser's efficiency, and X gives the state number after renumbering. As an example,

          state   6  (22)

indicates that the parser has entered State 6 on the State Description Report (State 22 after renumbering).

Every time the parser performs a Reduce action, it prints

          reduce   N  (X)  symbol: definition

before executing the associated recognition action. The message says that the parser is about to Reduce by grammar rule N (renumbered to X). The symbol is the symbol that was recognized in the reduction and the definition is the definition used for that symbol. If "yydebug" is greater than zero, you will also see

          pops to  M  (Y)

This states that after the reduction, the state on top of the state stack was State M (renumbered to Y); this is the state before executing the Goto after the Reduce. The Goto itself does not generate a message. For example,

          reduce   4 (3) x : y x
          pops to  2 (1)
          state    7 (11)

says that the parser reduced by grammar rule 4 (renumbered to 3). This rule had the form

          x : y x

After reducing, the grammar popped to state 2 (renumbered to 1), then did a Goto to state 7 (renumbered to 11).

If YAY encounters a $select operator while executing the action of a rule definition, you will see the message

          select instead symbol : definition

where symbol and definition give the rule that was selected by the $select operator.

During the course of error recovery, you may see the message

          Error recovery pops state N (M), uncovers X (Y)

This states that an Error operation popped state N (renumbered to M) off the stack, and therefore moved to state X (renumbered to Y). You may also see

          Error recovery discards T (C)

where T is the name of a token and C is its corresponding integer value. If your error recovery code explicitly changes the lookahead token (without going through "yylex"), you will see the message

          yychar changed to T (C)

where T is the name of the new token value and C is its corresponding integer value.

13.5.2 Converting Token Integers Into Strings


          const char * yyptok(int i)

returns a printable string representing the token with the given integer value. As mentioned in Section 3.1.1, named tokens are associated with a positive integer value. For example,

          %token INTNUM

associates an integer value with the token INTNUM. If you pass this integer value to "yyptok", it returns the string "INTNUM".

Similarly, all non-terminal symbols are represented by negative integers. If you pass one of these negative integers to "yyptok", it returns a string giving the name of the associated non-terminal symbol.

The string result returned by "yyptok" is stored in a static buffer. Therefore, each call to "yyptok" overwrites the string from the last call.

13.5.3 Displaying Rules


          int yyprule(int rule)

prints the specified rule to the standard output. If the value of rule is less than zero, it is taken to be the negative of the normal rule number (as given in the parser description report). For example,

          yyprule( -3 )

prints Rule 3. If the value is greater than zero, it is taken to be a renumbered rule number, plus one. (Remember that YAY renumbers rules internally for efficiency reasons.)

13.5.4 Displaying States


          int yypstate(int state)

prints the Shift and Reduce operations for the specified state on the standard output. If the value of state is less than zero, it is taken to be the negative of the normal state number (as given in the parser description report). For example,

          yypstate( -1 )

prints the Shifts and Reduces for State 1. If the value is greater than zero, it is taken to be a renumbered state number, plus one.

13.5.5 Displaying Goto Operations


          int yypgoto(int rule)

uses the standard output to print the Goto operations that could occur as the result of Reducing by the specified rule. If the value of rule is less than zero, it is taken to be the negative of the normal rule number (as given in the parser description report). For example,

          yypgoto( -3 )

prints the Goto operations that could occur as the result of reducing by Rule 3. Notice that Goto operations are grouped by rule, even though the Parser Description report associates them with particular states.

If the value of rule is greater than zero, it is taken to be a renumbered rule number, plus one.

13.5.6 Displaying Your Own Debugging Output

By default, the parser calls a function named "yypdebug" to display any debugging information produced as a result of defining YYDEBUG. If you wish, you can override this by defining your own information-printing function. For example, you might create your own version of the function which prints the current input line number and "yylval" value as well as the normal information displayed by "yypdebug".

To override the default function, set the symbol YYPDEBUG to the name of your new function. For example, if your function is named "mypdebug", you would specify

          #define YYPDEBUG mypdebug

in the declarations section of your YAY input. The actual source code for your function must appear in the programs section of the YAY input.

If you define this kind of function, it should have a prototype of the form

          void mypdebug(int what, int i, int j)

The "what" argument indicates the event that has just taken place. For example, there is a code indicating that the parser has changed to a new state, a code indicating that the parser has performed a Reduce operation, and so on. The other two arguments provide additional information about the event.

For many events, one or both of the arguments "i" and "j" may refer to states. The values always specify the number of the state after remapping (states are renumbered for the sake of efficiency during execution). To obtain the original state, use the "yygsmap" function. For example, if "remapped" is the number of a remapped state

          yygsmap(remapped)

is the original number for the same state (the number that appears in the Parser Description report).

The following list gives the possible values for the "what" argument, and the corresponding meanings of "i" and "j".

YYdebug_state: The parser has just changed state. "i" is the new state and "j" is the integer value of the lookahead character.
YYdebug_read: The parser has read a token. "i" is the current state and "j" is the integer value of the token read.
YYdebug_reduce: The parser has performed a Reduce operation. "i" is the new state after the Reduction and "j" is the number of the rule used in the Reduction.
YYdebug_select: The parser has selected a rule because of a $select operation. "i" is the number of the rule that has been selected.
YYdebug_char: The parser has detected that "yychar" has had its value changed (without going through "yylex"). For example, this might happen if an error-handling process replaces the real lookahead token with some other value. "i" gives the old value of "yychar" and "j" gives the new value.
YYdebug_error: The error recovery process has popped a state off the stack. "i" is the state that was popped off and "j" is the new state.
YYdebug_discard: The error recovery process has discarded an input token. "i" is the current state and "j" is the integer value of the token discarded.

It is a good idea to model your function on the "yypdebug" function that is automatically generated by YAY. To examine "yypdebug", look at the end of the source code that YAY generates. Note that the default YYPDEBUG checks "yydebug" before calling "yypdebug". You may want your code to do likewise.

13.5.7 Important Symbols For Debugging

Debugging a parser produced by YAY can be a very difficult task, since the code is only partly produced by user input. The remaining code is standard software produced by YAY itself. The situation is aggravated by another problem: the state and rule numbers shown in the State Description report are not the same as the numbers used when the parser actually runs. In the interests of optimization, the parser sorts the states into a more convenient order. Thus the internal state number used by the program is usually not the same as the external state number known to the user.

To help those who are examining parser code using a symbolic debugger, this section discusses some important variables that the parser uses.

Warning: The following information is provided only for debugging purposes. Your code should not depend on these symbols to retain their meanings from one release of YAY to the next. The meanings of these symbols may change without notice.

yyval

holds the value $$ at the time of a Reduction (while the recognition action is being executed). This has the type YYSTYPE.

yystate

is the internal number of the current state.

yyps

points to the current top of the state stack. Thus yyps[0] is the internal number of the current state, yyps[-1] is the internal number of the previous state, and so on.

yypv

points to the top of the current value stack. The entries in this stack have the type YYSTYPE. When a Reduce operation executes a recognition action, this pointer is moved down the stack to the point where

          yypv[1] = $1
          yypv[2] = $2
              etc.

In embedded actions, the offsets into yypv are incorrect. However, there is a variable named yypvt (with the same type as yypv) which always points to the last $N value during a recognition action. Outside of a recognition action, yypvt may not be valid.

yyi

is the internal number of the rule being Reduced by a Reduce action.

13.6 Dynamic Stack Allocation

The manifest YYSSIZE is used to determine the size of the state and value stacks used by "yyparse". YYSSIZE gives the maximum number of elements that these stacks will be expected to hold; the size of each value element is dictated by the YYSTYPE type and the size of each state element is determined by YAY. The default value of YYSSIZE is 150. By increasing the value of YYSSIZE, you can allow for grammars with a larger number of pending states.

If you #define YYALLOC in the declarations section, the state and value stacks used by "yyparse" in C code will be allocated dynamically and freed before "yyparse" returns. There are two ways of doing this: one using C- style memory allocation and one using C++-style. The C++-style can only be used in C++ source code, while the C-style can be used in either C or C++ code.

In C-style allocation, stack space is dynamically allocated via "malloc". This is the default if you #define YYALLOC.
In C++-style allocation, the value stack will be allocated as an auto object using the new operator and deallocated using delete. To obtain this type of allocation, you must #define a symbol named YYCPLUSPLUS in the declarations section of your code, as well as YYALLOC. For further information about using this type of allocation, see Appendix D.

In both styles of allocation, "yyparse" makes itself re-entrant by saving a number of externals when it begins execution and restoring them upon completion. The externals involved are

          yylval   yyval   yypvt
          yynerrs  yychar  yyerrflag

If you "longjmp" out of "yylex" or "yyparse" (due to an action), the externals are not restored, and "yyparse" will not be re-entrant.

If you use YYALLOC, "yyerror" will be called if "malloc" or new fails to allocate space for the state and value stacks. The argument passed to "yyerror" is

          "Not enough space for parser stacks"

If you #define YYALLOC with a value greater than 10, the parser allocates the state and value stacks dynamically, beginning with a size of YYSSIZE. If this is not big enough to hold the two stacks, "yyparse" attempts to use "realloc" (in C-style allocation) or "recalloc" (in C++- style) to grow the stacks by the amount given by YYALLOC. For example, if YYALLOC is 20, "yyparse" tries to grow the stacks to a size that will allow 20 additional elements (whenever "yyparse" needs more space for the stacks). "yyerror" is called if the call to "realloc" or "recalloc" fails to allocate additional space.

If you set up your parser in such a way that it may grow the stacks, you must be careful not to take a pointer into a stack in one action and use that pointer inside a different action. The reason is that the stack may be grown using "realloc" or "recalloc" in between the two actions. Since "realloc" and "recalloc" may actually move the entire stack, the pointer will no longer be valid. Thus you should not create pointers with expressions like

          &($1)

and expect those pointers to be valid in other actions. Within the code of a single action, however, such pointers will remain valid.

If you #define YYSTATIC, both the state and value stacks will be static. Otherwise, the state stack will be "auto" (allocated on the program stack) and the value stack will be static. Defining YYALLOC saves both stack space and static space; defining YYSTATIC saves stack space.

13.7 Synchronizing the Lookahead

If you #define YYSYNC, the parser will always have a lookahead token when it performs a Shift or Reduce action. If the symbol is not defined, the parser will only obtain a lookahead token if the value of the token is needed. In particular, it will not obtain a lookahead token for states with only a default Reduce action.

You would define YYSYNC in situations where you wanted to keep track of the line number on which various situations occurred (e.g. errors). If "yylex" always does a lookahead, you know that the line number of the token you are working with is the line number of the second last token read. If "yylex" sometimes does not do a lookahead, you don't know if the current line number in "yylex" is the line number for the current token or for a lookahead token.

You would avoid defining YYSYNC in situations where some actions do their own reading. For example, suppose that /* is a token named SLASHSTAR that indicates the beginning of a comment. You could create an action for that token which reads all input up to a closing */. With this kind of action, you would not want "yylex" to read a lookahead token, since that token would be the first token inside the comment, not the first token after the comment.

Ensuring no lookahead can be difficult, but in this case, %default is sufficient:

          comment : SLASHSTAR %default { eatcomment(); }

When the above rule Reduces the comment, the recognition action calls a function named eatcomment which discards the remainder of the comment (after the /*).

Note: The above definition of comment can be dangerous. If the parser is recovering from an error, it may begin discarding tokens, including the /* that starts the comment. In this case, the parser may start trying to parse the comment, with chaotic results. If possible, it may be best for the lexical analyzer itself to skip comments and only return uncommented data to the parser.

13.8 YYSHIFT

The generated parser program invokes a macro named YYSHIFT whenever the parser performs a Shift action in response to a token. The macro is invoked without arguments. By default, YYSHIFT is defined to do nothing; however, users can redefine YYSHIFT if desired, in the token definition part of the YAY input.

As an example of how you might use YYSHIFT, consider a situation where input is freeform and may be split over many lines of text, including the insertion of blank lines in the input. You may want to keep a count of the number of lines you've read as you parse the input so that you can provide error messages like "Line 12: Syntax error".

When the parser finishes reading a line, it often has to read ahead to the next token before it can decide whether to Reduce what it already has or keep on reading additional input. Since there may be blank lines in the input, this means that the parser may actually read ahead several lines before finding the next token. When it sees the next token, the parser may decide to Reduce what it has seen already before Shifting in response to the token.

While the reduction is happening, you want your line count to reflect the last line of input. You do not want the line count to reflect the line where the parser found the new token, because the parser isn't really using that token yet; it's working with older input. You only want to update the line count when the parser is ready to Shift on the new token.

That's where YYSHIFT comes in. You can define the YYSHIFT macro to update the line count at the point when the token is actually used, not when it is first read.

YYSHIFT is not invoked if a token is discarded (because of error handling or yyclearin). It is only invoked when the parser Shifts on a token.

14. Using YAY Interactively

YAY offers an interactive mode in which you can test the response of your grammar to various inputs. In order to invoke the interactive mode, specify the +INTeractive option on the YAY command line, as in

          yay gram.y parser=gram.c desc=gram.d +int

YAY will process the grammar normally (producing parser code and a parser description report as requested on the command line). Then YAY will prompt you for interactive input. The rest of this chapter describes the types of input accepted during an interactive session.

14.1 General Principles

This section describes general principles of entering interactive YAY commands.

14.1.1 Getting Help

Entering

          Help

in response to the prompt for input displays a list of all the commands that can be entered while you are in interactive mode.

14.1.2 Abbreviating Commands

In the list of commands, command names are displayed in mixed upper and lower case, as in

          Null_Reduce_items

When you are typing in a command, you may omit any or all of the letters shown in lower case but you must enter all the letters shown in upper case. Underscore characters can also be omitted. For example, Null_Reduce_items may be abbreviated to any of

          null_reduce
          nullreduce
          nullred
          nr

and so on. The commands themselves may be entered in either upper or lower case.

14.1.3 Multiple Commands on a Line

In most cases, you can enter several YAY commands on the same input line. For commands that always take the same number of arguments or commands for which you are specifying all arguments explicitly, you can just enter the commands one after the other, as in

          Resolve Shift   Expand <   Act

(The above line includes extra space between commands to make it easier for you to see where one command ends and the next begins. You do not have to include this extra space when you are typing your own input.)

If a command takes an arbitrary number of arguments or if you have omitted some optional arguments, you must mark the end of the command with a semicolon (or by starting a new line). For example, you might write

          Expand ; Act

This example omits the argument for the Expand command, so there must be a semicolon to mark the end of the command.

14.2 Examining a State

Interactive mode lets you examine any state in the grammar. To begin examining a state, enter the command

          Look num

where num is the number of the state you want to examine. For example,

          l 10

displays information about state 10.

The Look command also accepts keywords instead of a state number. The possibilities are:

Look start: displays the first state in the grammar (number 0).
Look accept: displays the state containing the Accept action (the state that is entered when the parser has successfully finished parsing the input).
Look conflict: displays the next state (after the current state) which contains a conflict.
Look unreachable: displays the next state (after the current state) which is unreachable.
Look here: displays the current state.

When you look at a state, YAY displays information about the state, in a format similar to the state description in the parser description report. For example, you might see

          Kernel Items:
              (7)  typelist: list ','. type;
          Shifts:
          ?   type         -> 6
          No Reduces

This particular state Shifts if it finds a type; it has no Reduce actions.

Notice that the output shown above has a '?' in front of the Shift operation. This code will be discussed in the Action Codes section later in this chapter.

The Look command does not change the current state. For example, if you issue several

          Look conflict

commands in a row, you will always see the next state after the current one that contains a conflict. Since the Look commands do not change the current state, you will keep seeing the same conflict-containing state.

14.3 The Jump Command

The Jump command puts the parser into a specified state. For example,

          jump 3

puts the parser into state 3. Once you have put a parser into a particular state, you can examine that state in various ways (as discussed in later sections of this chapter).

When you Jump to a new state, Jump displays information about the new state, as if you had just done a Look command. The difference between Jump and Look is that Jump actually moves to a specified state, while Look just displays information about a state without leaving the current state.

Jump accepts the same command formats as Look:

          Jump NUMBER
          Jump start
          Jump accept
          Jump conflict
          Jump unreachable
          Jump here

Each time Jump moves to a new state, it displays the same information as Look about the new state.

When you first start using YAY in interactive mode, the state is "undefined". Therefore, your first command in an interactive session is usually a Jump command so that you can put the parser into an initial state.

The Jump command clears any states currently stored in the state stack, but leaves the input stream unchanged. The state stack and the input stream are described in full detail later in this chapter.

14.4 Displaying Information about a State

The following commands display information about the current state:

Kernel_Items

displays the kernel items associated with the state. Each kernel item is a rule, with a "." in the middle of the rule showing how much of the rule may have been seen before reaching this state, as in

        stmt_list: stmt_list ';'. stmt

CLOsure_items

displays all the non-kernel items associated with the state. Non-kernel items are displayed in the same way as kernel items. The non-kernel items represent any non-terminal symbols that the parser is attempting to start parsing in this state. For example, suppose the current state has a single kernel item:

        stmt_list: stmt_list ';'. stmt

This says that the parser is current gathering a list of statements separated by semicolons, and has just found the semicolon at the end of a statement. The non-kernel items will be definitions for all the things that could be found at this point in input, as in

        stmt:             .while_statement;
        stmt:             .for_statement;
        while_statement:  .WHILE while_body;
        for_statement:    .FOR for_body;

and so on. Since the parser hasn't parsed any input yet, there are a large number of items that might come next. However, such items are all implied by

        stmt_list: stmt_list ';'. stmt

Null_Reduce_items

displays any null reduces associated with the state. These are a subset of the non-kernel items.

Disused_Items

displays any disused items associated with the state. These are possibilities that will never be used because they have a lower priority class than other items for the same lookahead token, or else a preceding or following state prevents the use of the items.

ITems

displays all of the above items associated with the state.

SHifts

displays all the Shift operations associated with the state.

GOtos

displays all the Goto operations associated with the state.

ReDuces

displays all the Reduce operations associated with the state.

CONflicts

displays any conflicts associated with the state.

PreDecessors

displays all states that can Shift or Goto the specified state.

ALL

displays all of the above information (all items, operations, and conflicts).

14.4.1 The DisPlay Command

The DisPlay command specifies what information you want to see when you issue a Look or Jump command. Its format is

          DisPlay item item item ...

where all the items are commands that display information about states. For example,

    display kernelitems shifts gotos reduces conflicts

tells YAY to display the given information when you Look at a state.

          display all

displays all the available information. When you first start using YAY in interactive mode, the default is display kernelitems shifts reduces Once you have used a DisPlay command to specify what information you want to see, YAY shows you that information every time you use Look or Jump. You do not have to issue another DisPlay command explicitly.

14.5 Displaying Other Information

The commands discussed in the following sections display information about the current state.

14.5.1 The Effect of an Input Item

The EFfect command displays what action would be taken in this state if the lookahead symbol was a given token or symbol. For example,

          EFfect expr

displays what action would be taken in this state if the lookahead symbol was expr. For example, YAY might say that the effect would be to Goto another state. In an EFfect command, the symbol or variable must be given as it appears in the grammar. For example, if the grammar uses the token NUM to stand for integers read from the input, you would write

          EFfect NUM

instead of entering an actual number. In interactive mode, YAY does not call your "yylex" routine to analyze the input; you must use the symbolic names specified in your grammar.

You can specify literal input characters by enclosing them in single quotes. For example,

          ef '*'

tests the effect of a '*' as the next character in input.

If the argument of EFfect is a not a symbol or token, YAY tries to interpret it as a literal token (putting quotes around the token if necessary). For example, consider

          ef X

If there is a token or variable named X, this tests the effects of having that token or variable as the lookahead symbol. If there is no such token or variable, the command is interpreted as

          ef 'X'

The exception to this principle is the semicolon character.

          ef ;

receives an error because the semicolon is interpreted as the end of the EFfect statement, not an argument for it. To avoid the error, put the semicolon in quotes.

14.5.2 Displaying Grammar Rules

The RUle command displays one or more rules from the grammar. The most common form is

          RUle variable

where variable is the name of a non-terminal symbol. YAY then displays all the grammar rules which define that symbol. You can also use

          RUle number

where number is an integer. This displays the rule with that number.

14.5.3 Selective Display Commands

The selective display commands display rules or states which meet certain criteria.

The Rules_Not_Reduced command displays any rules that are not reduced in any parsing states. For example, if a rule provides a definition of symbol SYM and no other rule in the grammar refers to SYM, the rule defining SYM would not be reduced. The command simply has the format

          Rules_Not_Reduced

The States_Reducing_Rule command displays the numbers of all the states that Reduce using a specified rule number. For example,

          srr 3

displays the numbers of all the states that reduce using grammar rule 3.

The States_Starting_Rule command displays the numbers of all the states where a particular rule number could begin. For example,

          ssr 3

displays the numbers of all the states where Rule 3 could begin. (This is identical to saying that the command displays all states containing the non-kernel items for Rule 3.)

The States_Containing_Item command displays the numbers of all the states that contain a given item pattern. The specified pattern must contain a '.' standing for the current parsing point. For example,

          sci expr.'+'

displays the numbers of all the states which contain items with expr to the left of '.' and '+' to the right. As with the EFfect command, any argument that is not recognized as a symbol or named token is interpreted as a literal token (except for '.' which always stands for the current parsing point).

The States_Containing_Item command has a second form, taking two numeric arguments. The first argument gives the number of a grammar rule; the second gives the position of '.' in this rule. For example, suppose Rule 3 is

          expr: expr '+' expr

Then the command

          sci 3 0

displays the numbers of all states which contain items of the form

          expr: .expr '+' expr

Similarly, the command

          sci 3 1

displays the numbers of all states which contain items of the form

          expr: expr.'+' expr

In the States_Reducing_Rule, States_Starting_Rule, and States_Containing_Item commands, it is possible that the relevant rule or item is disused or deleted in the state. In this case, the number of the state is enclosed in parentheses in the list of states displayed by the command.

14.6 Test Parsing

Interactive mode lets you submit sample input to the grammar and then follow how the input is parsed. You will see YAY move from state to state in response to the input, giving you a chance to make sure that the input is actually being interpreted the way you want.

14.6.1 Input Format

The input for test parsing is specified as a sequence of tokens. If your grammar refers to tokens by symbolic names, the input should use the symbolic name rather than the sequence of characters that will be used in real input. For example, suppose that you define a token named INTNUM which stands for any integer numbers recognized by "yylex". When you are using interactive mode, you would specify INTNUM in your sample input rather than an actual number.

Therefore, suppose you have a grammar with the rules

          expr : '(' expr ')'
               | expr '+' expr
               | expr '-' expr
               | expr '*' expr
               | expr '/' expr
               | INTNUM
               ;

Your sample input might be

          INTNUM + INTNUM * ( INTNUM / INTNUM )

YAY will then show how your grammar responds to this input.

To specify sample input, you use the Token command. This consists of the keyword Token, followed by any number of tokens. For example,

          token INTNUM + INTNUM * ( INTNUM / INTNUM )

specifies the sequence of tokens corresponding to the example discussed above. As shown, tokens are separated by white space.

The Token command accepts non-terminal symbols as well as tokens. For example, you could write

          token expr + INTNUM * expr

This combines a number of tokens and non-terminal symbols in the same input.

You can use the Token directive to add tokens to the input stream at any time. Token always adds tokens to the end of the current input stream.

14.6.2 Removing Input

The command

          UnToken

removes the last input token or non-terminal symbol from the input stream. You can clear the existing input stream by issuing enough UnToken commands in a row.

14.6.3 Simple Test Parsing

To begin parsing input, you must put your session into an appropriate state. For example, you might use

          jump start

to jump to the state that corresponds to the beginning of reading input.

Once you have set your session to the desired state, the command

Act

tells YAY to take action based on the current state and the contents of the input stream. For example, the program may Shift to a new state based on a lookahead token, or Goto a new state based on a lookahead non-terminal symbol.

The output after an Act command contains the following information:

A line telling the action that YAY took. For example, you might see
```
          Shift INTNUM, to state 1
```
indicating that the parser Shifted on an INTNUM token and moved to state 1. For Goto actions, you might see
```
          Goto expr, to state 3
```
This states that the parser performed the Goto based on an expr symbol, and moved to state 3. For Reduce actions, you might see
```
          Reduce 7 pops 3 states, generating expr
```
This says that the parser Reduced using grammar rule (7). In so doing, it popped three states off the state stack and reduced them to an expr non-terminal symbol. The expr symbol will now appear at the beginning of the input stream.
After the line telling what action YAY took, YAY displays the usual information about the current state (as specified by any previous DISPlay command). This is usually a new state (for example, if you Shifted to a new state). However, it may be the same state if you performed a null (zero-length) Reduce.
After information about the state, YAY displays a line which shows the state stack as well as the current contents of the input stream. For example, you might see
```
          s#0 INTNUM s#1 ^ '+' INTNUM
```
The state stack is shown to the left of the '^' character and the input stream is shown to the right. Therefore, the state stack part is
```
          s#0 INTNUM s#1
```
This says that the parser has pushed state 0 onto the state stack; the corresponding value put into the input stream is an INTNUM. The parser is now in state 1.
The input stream part is
```
          '+' INTNUM
```
This shows the input that still hasn't been read.

By issuing one Act command after another, you can trace the way in which the parser handles the sample input. For example, the following shows how YAY responds to a series of interactive commands. The commands are shown without indentation and YAY's responses are indented.

jump start
        Kernel Items:
        Shifts:
        ?   '('     -> 2
        ?   INTNUM  -> 1
        No Reduces

token INTNUM + INTNUM
        s#0   ^   INTNUM '+' INTNUM

Act
        Shift INTNUM, to state 1
        Kernel Items:
          (7)     expr:  INTNUM.
        No shifts
        Reduces:
        >D    7 as default
        s#0 INTNUM s#1   ^   '+' INTNUM

Act
        Reduce 7 pops 1 states, generating expr
        Kernel Items:
        Shifts:
            '('     -> 2
            INTNUM  -> 1
        No Reduces
        s#0   ^   expr '+' INTNUM

Act
        Goto expr, to state 3
        Kernel Items:
          (0)     $accept:  expr.$end
          (3) L2  expr:  expr.'+' expr
          (4) L2  expr:  expr.'-' expr
          (5) L1  expr:  expr.'*' expr
          (6) L1  expr:  expr.'/' expr
        Shifts:
            '*'     -> 6
        >   '+'     -> 8
            '-'     -> 7
            '/'     -> 5
        No Reduces
        s#0 expr s#3   ^   '+' INTNUM

Act
        Shift '+', to state 8
        Kernel Items:
          (3)     expr:  expr '+'.expr
        Shifts:
            '('     -> 2
        >   INTNUM  -> 1
        No Reduces
        s#0 expr s#3 '+' s#8   ^   INTNUM

Act
        Shift INTNUM, to state 1
        Kernel Items:
          (7)     expr:  INTNUM.
        No shifts
        Reduces:
        >D    7 as default
        s#0 expr s#3 '+' s#8 INTNUM s#1   ^   No input
Act
        Reduce 7 pops 1 state, generating expr
        Kernel Items:
          (3)     expr:  expr '+'.expr
        Shifts:
            '('     -> 2
            INTNUM  -> 1
        No Reduces
        s#0 expr s#3 '+' s#8   ^   expr

Act
        Goto expr, to state 13
        Kernel Items:
        (3) p2 expr:expr'+'expr.['*' '/' '+' '-'] -[]
                         / [$end '*' '/' '+' '-' ')']
        No shifts
        Reduces:
        ?D    3 for [ ^ error $catch INTNUM '(' ]
        s#0 expr s#3 '+' s#8 expr s#13   ^   No input

Act
        More lookahead tokens required

Note that the default DisPlay settings do not show Goto actions. If you plan to use Act extensively, you might want to change the DisPlay to include Gotos as well as Shifts and Reduces.

14.6.4 Action Codes

When YAY displays information about the actions associated with a state, it may mark the actions with one or more code letters. For example, the preceding output included

        Shift INTNUM, to state 1
        Kernel Items:
          (7)     expr:  INTNUM.
        No shifts
        Reduces:
        >D    7 as default
        s#0 INTNUM s#1   ^   '+' INTNUM

The Reduce action is marked with the code >D at the beginning of the line. YAY uses the following characters to label actions like Shifts, Reduces, and Gotos:

>: The action that the parser will take, based on the lookahead symbol.
*: An action that is valid but not used by the generated parser because of an unresolved conflict. If you try to use Act to execute such an action, YAY does nothing; instead, YAY suggests that you resolve the conflict using the Resolve command (discussed later in this chapter).
?: Used to label a number of actions, when the parser doesn't have enough input in the input stream to decide which action to take next.
X: Used to label actions which have been rejected for some other reason than the ones listed above. For example, the parser never Reduces on error, so such actions will be marked with X.
blank: This action will not be taken (based on the current input stream).

If any of the above action codes is followed by a D (for example, >D), the D indicates that the action is the default Reduce.

14.6.5 Displaying the Stack

At any point during a test parse, the command

          Stack

displays the current contents of the state stack and the input stream.

14.6.6 Quick Runs

The command

Run

is similar to a sequence of Act commands. It performs one action after another until it reaches an error or ambiguous state (e.g. one where there is a conflict), or else it runs out of input. Run only displays two lines for each action taken: a line describing the action and a line showing the contents of the state stack and the input stream. Here is the output from Run for the same example given in the section Simple Test Parsing.

          Shift INTNUM, to state 1
          s#0 INTNUM s#1   ^   '+' INTNUM
          Reduce 7 pops 1 states, generating expr
          s#0   ^   expr '+' INTNUM
          Goto expr, to state 3
          s#0 expr s#3   ^   '+' INTNUM
          Shift '+', to state 8
          s#0 expr s#3 '+' s#8   ^   INTNUM
          Shift INTNUM, to state 1
          s#0 expr s#3 '+' s#8 INTNUM s#1   ^   No input
          Reduce 7 pops 1 states, generating expr
          s#0 expr s#3 '+' s#8   ^   expr
          Goto expr, to state 13
          s#0 expr s#3 '+' s#8 expr s#13   ^   No input
          More lookahead tokens required

14.6.7 Simulating Errors

The command

          Error

simulates an error. YAY will first try to Shift on $catch; if the current state cannot do that, YAY goes into error recovery mode. This means that YAY pops states off the stack until it finds a Shift on error, then performs that Shift operation. This mimics what the parser would do if it found an error.

For example, you could use several Act commands to reach a certain situation in parsing, then use Error to simulate an error at that point. Successive Error commands let you track how the parser responds to a sequence of errors.

The command

          DisCard

discards the next item on the input stack. This can be used to get rid of an unshiftable token after an error. Discarding a token mimics the second stage of error recovery, after the parser has popped back to a state that can Shift or error. The parser discards input items if it has performed the error Shift and finds itself in a state that cannot Shift or Reduce on the next input value. Therefore, after you simulate an error with the Error command, you should check to see if the current state can Shift or Reduce on the next value in the input stream. If so, perform the Shift or Reduce; otherwise, use DisCard to get rid of the value, and check to see if you can Shift/Reduce on the next value in the stream.

In situations where the Act command cannot Shift or Reduce, Act does nothing. Instead, it suggests that you use Error or DisCard to deal with the problem.

14.6.8 Simulating YYPOP

The command

          POP token

simulates a YYPOP command, popping states off the stack until reaching a state which can perform a Shift on the given token. YYPOP then performs the Shift on the token.

14.6.9 Backing Up

The Back command attempts to reverse the effects of a Shift or Goto operation. The simplest form is

          Back

This pops the state that is currently on top of the state stack and pushes the associated value back into the input stream.

The Back command also has a number of forms that are useful when the state stack is empty. In this situation, you have to provide extra information for the Back command, so that it knows which state to go back to. One form of the command is

          Back number

where number is the number of a state that is a predecessor of the current state. (Recall that the predecessors of a state are all the states that can Shift or Goto to specified state.) If you Back to a predecessor, YAY puts the predecessor state on the state stack. Also, at the head of the input stream, YAY inserts whatever token or variable the predecessor state had to Shift or Goto on in order to reach the original state.

For example, suppose you are in State 10, and suppose that State 6 is a predecessor of State 10. State 6 moves to State 10 by Shifting on the token '+'. Then if you issue the command

          back 6

while you are in State 10, YAY does the following:

Changes the current state from 10 to 6.
Inserts the token '+' at the beginning of the input stream.

The Back command has two other forms to simulate a move back to a predecessor state when the state stack is actually empty.

          Back SYMBOL

moves back to the predecessor state which reaches the current state by Shifting or Gotoing on the given symbol. (Most states only have one predecessor symbol anyway. The exception is the state containing the definition for the $error symbol.) YAY must be able to determine the predecessor state uniquely. For example, if you use

          Back +

there must be one and only one predecessor state which reaches the current state by Shifting on the token '+'.

The final form of Back is

          Back <

This form tells YAY to choose one of the predecessor states for you. YAY tries to choose the state that will lead back to the start state in the shortest number of Back and Expand operations.

14.6.10 The EXpand Command

The EXpand command is similar to Back. Just as Back reverses the effects of a Shift or Goto operation, EXpand reverses the effects of a Reduce operation. EXpand has the format

          EXpand number

where number is a grammar rule number giving a definition of a variable. For example, suppose that the next item in the input stream is expr and that grammar rule 3 is

          expr : expr '+' expr

In this situation,

          EXpand 3

creates an "expansion" of the input expr using rule 3. This means that YAY removes expr from the input stream and processes it as if the stream had contained the definition of expr given in rule 3. The items

          expr
          +
          expr

are placed into the input stream, and corresponding states are pushed onto the state stack. In other words, EXpand reverses the Reduce operation by replacing the Reduced variable with its original definition.

If the grammar only has one rule defining the symbol at the beginning of the input stream, you can just say

          EXpand

without specifying the rule number.

The command

          EXpand <

works the same as other forms of EXpand, except that it chooses an expansion for you. In this case, YAY tries to choose an expansion that leaves you closest to the start of input (the start state).

The major purpose of EXpand is to help you avoid a lot of typing. For example, suppose you want to test-parse a complicated grammar rule. You don't have to type in all the tokens that make up the rule definition; you just enter the variable name itself and use EXpand to expand that according to a particular definition.

14.6.11 Generating Input

The command

          PreFix

attempts to calculate the shortest stream of input tokens that could bring you to the current state from the start state. Essentially, PreFix results in a series of

          Back <
           and/or
          Expand <

commands leading back to the start state. This means that the input stream is set up with an appropriate sequence of tokens leading to the state where you issued the PreFix command. At the end of this process, YAY leaves you in the Start state: the state corresponding to the start of reading input. Therefore, you should be able to issue successive Act commands to see how the parser handles the generated input and returns to the state you were in when you issued the PreFix command.

There are cases where YAY cannot determine input tokens that would have the desired effect.

14.6.12 The Avoid Command

The Avoid command is used before a PreFix, Back <, or Expand < command to control the input that PreFix generates. Avoid can have any of the following formats:

          AVoid variable
          AVoid token
          AVoid rulenumber

If you issue one or more of these AVoid commands and then a PreFix command, the PreFix command will try to avoid generating input that contains a specified variable or token, and will try to avoid generating input that uses the definition in rule rulenumber. The same applies if you issue several AVoid commands and then a Back < or Expand < command.

YAY automatically sets a default of

          AVoid error
          AVoid $catch

You cannot override these settings, but you can issue AVoid commands of your own to add to the list of items that YAY tries to avoid.

AVoid only says that you prefer not to use the given item. If PreFix cannot find any other way to generate input that reaches the current state, PreFix will use the items you have asked to avoid.

14.6.13 Resolving Conflicts

The Resolve command lets you resolve conflicts that you may encounter during a test parse. For a Shift/Reduce conflict, the command

          Resolve shift

says that the current conflict should be resolved by shifting. For a Reduce/Reduce or Shift/Reduce conflict, the command

          Resolve number

says that the conflict should be resolved by Reducing, using the grammar rule given by number. For example,

          Resolve 3

resolves the conflict, Reducing with grammar rule #3.

The command

          Resolve *

tells YAY to resolve the conflict using the standard disambiguating rules of the parser. This means that YAY will take the action marked with '>' in the state description.

The command

          Resolve variable

reduces using the rule that defines the given variable. In this case, there must only be one rule that defines the variable.

In all forms of Resolve, the specified action must be valid in connection with the current lookahead symbol.

14.6.14 Clearing the State Stack and Input Stream

The command

          Clear

removes all the current contents of the input stream, in preparation for testing a completely new sequence of input. It also clears the state stack.

14.7 Miscellaneous Commands

This section describes a number of other interactive YAY commands.

14.7.1 Setting the Width of Output Lines

The Width command sets the maximum width for the output lines that YAY displays.

          Width number

specifies the maximum width to be the given number of characters. The default is

          Width 80

14.7.2 Setting the Prompt String

The PROMPT command changes the string that YAY uses to prompt you for more input. For example,

          PROMPT \33[7m YAY>\33[0m

changes the prompt to a string that will stand out nicely on an ANSI terminal. Notice that there are no quotes around the specified string: PROMPT simply uses everything from the first non-whitespace character to the end of the line. If you specify quotes, the quotes will be displayed as part of the prompt. Similarly, a semicolon character is taken to be part of the prompt string, not the end of the PROMPT command.

PROMPT accepts all the usual escape sequences of C (\n, \t, \ddd, \xXX and so on). It also recognizes the following special constructs:

%s represents the current state number
%d represents the distance of the current state from the starting state.
%e represents the error handling mode (if any).
%% represents a literal % character.

The default prompt is

          PROMPT YAY state %s %e%d>

14.7.3 Executing a System Command

The ! command executes a system command. For example, you could use ! to invoke a text editor to read the original input file or the description file that YAY produces. The format of ! is

          ! command line

Note that the command line consists of everything that comes after the ! (up to the end of the input line). You do not have to enclose the command line in quotes. A semicolon is taken to be part of the specified command line, not the end of the ! command.

Appendix A: Examples

This appendix gives examples of YAY input.

A.1 A Simple Example

The following grammar implements a simple string manipulation language. It has two types of objects: strings and variables. Strings can have a maximum of 100 characters and are enclosed in double quotes. Variable names are a maximum of eight characters long. We have omitted the recognition actions of the grammar rules in order to keep the example as simple as possible.

There are three operations:

= assigns a string to a variable;
JOIN or + concatenates strings or the contents of variables;
PRINT outputs strings or the contents of variables.

Every statement is a single input line and blanks are used to separate tokens on the line. Multiple = assignments are permitted as in

          A = B = "hello"

The keyword PRINT can only appear once on the line.

If a line does not contain an assignment, the value of the expression on the line is merely printed out. This means that it is valid to have a line that only contains a string or a variable name. The end of the input file marks the end of input.

Here are the declarations section of the YAY input and the "yylex" routine (both written in C).

%{   /* declarations */
#include <stdio.h>
#define MAXSTRING 100
char str[MAXSTRING];
#define MAXVAR 8
char var[MAXVAR];
%}

%token VARIABLE STRING
%nonassoc PRINT
%right '='
%left JOIN
%union {
     char *cstr;
}
%%

program : stat
        | program stat
        ;
stat : printexpr '\n'
     | expr '\n'
     ;
printexpr : PRINT expr ;
expr : VARIABLE '=' expr
     | JOIN expr expr
     | STRING
     | VARIABLE
     ;
%%

int yylex()
{
    char c, *stringet(), *wordget();
    extern YYSTYPE yylval;
    while ( (c=getchar()) == ' ' )
           /* skip blanks */;
    switch(c) {
      case '\n':    /* end of line */
          return('\n');
      case EOF:     /* end of file */
          return(0);
      case '+':     /* same as JOIN */
          return(JOIN);
      case '"':     /* start of string */
          yylval.cstr = stringet();
          return(STRING);
      default:      /* keyword or variable */
          yylval.cstr = wordget(c);
          if ( !strcmp("JOIN",yylval.cstr) )
              return(JOIN);
          else if ( !strcmp("PRINT",yylval.cstr) )
              return(PRINT);
          else return(VARIABLE);
    }
}

char *stringet()
{
    extern char str[];
    int i;
    for ( i=0; i < MAXSTRING; i++ ) {
        str[i] = getchar();
        if ( (str[i]=='"') || (str[i]=='\n') )
            break;
    }
    str[i] = '\0';  /* mark end of string */
    return(str);
}

char *wordget(char c)
{
    extern char var[];
    int i;
    var[0] = c;  /* first letter obtained already */
    for ( i=1; i < MAXVAR; i++ ) {
        var[i] = getchar();
        if ( (var[i]==' ') || (var[i]=='\n') )
            break;
    }
    var[i] = '\0';
    return(var);
}

void yyerror(char *s)
{
    fprintf(stderr,s);
}

Note that the lexical analyzer always flags the keywords JOIN and PRINT as operations. This means that you cannot use JOIN or PRINT as variable names. In fact, this is a general limitation of LALR(1) parsers. It is very difficult to have names that are keywords in some contexts and variable names in others, without a great deal of fiddling. Therefore, we recommend that you resign yourself to having reserved keywords, i.e. keywords forbidden for use as variable names.

A.2 An Example of YYPOP

Instead of using "fgetpos" and "fsetpos" to seek within input, the following example saves the lookahead token and its value (if any) in $N values on the stack, and only asks "yylex" for a seek address when required.

/* Example for test parsing */
%union {
    struct {
        int token;
        lex_position seek;
        int suppress;
    } lexsave;
    ....
}

%type <*> savelex1
%type <lexsave> savelex2
%token <> IS_EXPR IS_DECL
%type <>(<>) couldbe test
            /* macros to save some typing */

%%

/*
 * savelex1 is a helper whose duty is to return,
 * as its $$ value, thesemantic value of the lookahead
 * token, if any. If there is no lookahead, it nevertheless
 * copies the value.
 *
 * savelex2 is another helper whose duty is to return
 * as its $$ value the syntactic value of the lookahead
 * token, if any, as well as the lex package's current
 * seek address. If there is no current lookahead, the
 * saved value will be negative. %default is used here to
 * prevent the parser from obtaining a lookahead token
 * between reducing savelex1 and savelex2.
 *
 * Finally, the suppress flag, which prevents yyerror from
 * displaying any syntax error messages, is saved and set.
 * This flag is needed so that errors which occur during
 * the trial parse do not display messages; the assumption
 * is that a *real* parse will follow, and that will
 * display the message.
 *
 * If the <lexsave> struct is making YYSTYPE too
 * big, you can add more savelexN values and split the
 * saved data among them.
 */
savelex1: { $$ = yylval; };
savelex2: %default {
               $$.token = yychar;
               $$.seek = lextell();
               $$.suppress = suppress;
               suppress = !0;
          };

/*
 * The test and couldbe macros are used by the rules that
 * handle the parsing decision once it is made.  They run
 * in parallel so that there is a state containing items
 * like:
 * test(V15):    savelex1 savelex2.ambiguous $42
 * couldbe(T12): savelex1 savelex2.IS_EXPR
 * couldbe(T13): savelex1 savelex2.IS_DECL
 * couldbe(T1):  savelex1 savelex2.error
 *
 * The test(nonterm) macro tries to parse a 'nonterm' in
 * the hope that such parsing eventually does a YYPOP
 * to indicate that it has determined what is being parsed.
 * If the nonterminal passed to the test(x) macro reduces
 * without YYPOPping, the test macro itself pops to the
 * error handler using YYPOP(error). This YYPOP must be
 * in an embedded action so that the states mentioned above
 * are still on the stack during the action. The second
 * action is there to force the first to be an embedded
 * rule.  Thus test(nonterm) never gets reduced; a YYPOP
 * always happens first.  Each such YYPOP leads to one of
 * the couldbe macros.
 */
test(nonterm): savelex1 savelex2 nonterm { YYPOP(error); } {};

/*
 * Once the could(tok) rule is reduced, it discards the
 * lookahead (if any -- this would be the token that caused
 * the parsing decision), and replaces it with the token
 * saved by the savelex... rules; it also causes the lex
 * routine to seek back so that it returns next the same * token that it would have returned after the reduce of
 * savelex2.  This rule needs a %default, since the
 * lookahead at this point is essentially irrelevant junk.
 * This rule also restores the error-message suppress flag,
 * and clears any error mode that might have resulted in
 * termination of the trial parse. It is also possible
 * instead to save and restore yyerrflag as is done with
 * 'suppress'.
 */
couldbe(tok):  savelex1 savelex2 tok %default {
                            yyerrok;
                            yyclearin;
                            yychar = $2.token;
                            yylval = $1;
                            lexseek($2.seek);
                            suppress = $2.suppress;
                        };

/*
 * This is the actual trial-parse construct.
 * It eventually reduces to either a expr_start or a
 * decl_start, but leaves the token stream reset to
 * where it was when it began.
 */
(expr_start|decl_start):
        /*
         * The first rule is the test parse. 'ambiguous'
         * is the non-terminal that actually does the
         * trial parsing. When it has decided either
         * way what the construct is, it does
         * YYPOP(IS_DECL) or YYPOP(IS_EXPR)
         * to signal the decision.
         * Even though test(xxx) never reduces, and thus
         * this rule never reduces either, we need an
         * action here to prevent YAY from complaining
         * about the lack of a $select.
         */
      test(ambiguous) {$select expr_start;}
        /*
         * The following rules use the YYPOP symbol to
         * decide what to $select.  In the case of an
         * error, we use the multi-$select to pick
         * whichever interpretation is valid here, so that
         * the error is re-detected at the right place
         * rather than here. The error handling couldbe
         * call may also be invoked via normal parser
         * syntax error handling if the first token of the
         * trial parsing is invalid.  The %default's are
         * likely superfluous here, as these reduces are
         * probably in LR(0) states.
         */
    | couldbe(IS_EXPR) %default {$select expr_start;}
    | couldbe(IS_DECL) %default {$select decl_start;}
    | couldbe(error) %default {$select expr_start|decl_start;}
    ;

/*
 * The following rule wraps around the test parse to ensure
 * that some decision is made when the parse fails to find
 * any clues. If this is not done, the test(xxx) macro will
 * select the error condition, which will then be handled
 * like a syntax error during the test parse.
 */
ambiguous: expr_or_stmt  { YYPOP(IS_DECL); };

expr_or_stmt: /* etc. */

/*
 * Using the above rule. It should only be used where both
 * interpretations are valid. If only one is valid, just go
 * ahead and parse it straight. You'll get better error
 * handling.
 */
stmt:    expr_start expr semi
    |    decl_start decl semi
    |    if_stat
    |    loop_stat
    |    jump_stat
        /* etc. */

Appendix B: YAY vs. YACC

YAY differs from its predecessor YACC in a number of respects. These are summarized below.
YACC does not support the following.

the generation of code in the B programming language
$catch (Section 8.1)
the %struct construct (Section 9.2)
specifying multiple types in a %type statement
selections and rejections, as described in Section 6.1
YYSERROR, as described in Chapter 8; YACC calls "yyerror" directly, without going through the macro
the <*> and <> types (Section 9.1.1)
%default and %nodefault (Section 8.3)
YYPOP, described in Chapter 10
all the features related to multi-valued grammar rules, described in Chapter 11
macros as described in Chapter 12
%nonnull and %null (Section 13.3)
the use of YYSTATIC, YYSYNC, or YYALLOC (Section 13.6)

If a state has several Reduce actions, YACC always chooses the most popular of these as the default action. YAY's rules for choosing a default are given in Section 8.3. In general, YAY's approach detects errors sooner after they occur, or directs them to states that have actions to handle them.
YACC and YAY differ in their treatment of YYERROR. YAY always reduces and pops the current state off the stack before generating the artificial error condition. YACC doesn't do this. With YACC, you cannot use YYERROR inside a rule with "error" in it; YAY, however, will Reduce using the current rule and then trigger error handling.
YAY handles precedences by substituting appropriate selections and rejections. In certain cases of conflicting precedences, YAY may be able to resolve conflicts that YACC cannot. Therefore, YACC may flag some states as having Reduce/Reduce conflicts where YAY does not.
In case of a three-way Shift/Reduce/Reduce conflict, YACC resolved the situation pairwise (as separate Shift/Reduce and Reduce/Reduce conflicts). The result depended on which pair was examined first, which in turn depended on the order of rules in the grammar. YAY resolves multi-way conflicts all at once.
Output from the two programs differs in many respects.

As a rule of thumb, YAY accepts any grammar that runs through YACC without errors or warnings, and creates a parser that behaves the same for correct input. However, the YAY version usually detects errors faster than the YACC parser, since YACC often performs null Reduces during error- processing while YAY avoids such Reductions.

Appendix C: User-Accessible Variables and Macros

This appendix lists a number of variables and macros that may be referenced from user-written code.

yychar

holds the most recent token value returned by "yylex".

yyerrflag

is used during error recovery mode. If "yyerrflag" equals YYE_NONE, the parser is not in error recovery mode. Otherwise, "yyerrflag" will equal YYE_SHIFT(0), YYE_SHIFT(1) or YYE_SHIFT(2), depending on how often the parser has Shifted successfully since the last Error action.

yylval

should be set by "yylex" to contain the value of the token just obtained. This value should have the YYSTYPE type.

yynerrs

is the number of syntax errors that have been encountered during parsing. This is incremented by 1 each time an error is found. More specifically, it is incremented just before calling "yyserror". This means, for example, that it is not incremented if an invalid token is handled by $catch.

yyrmap

is an array present only when YYDEBUG is defined. It is used to convert internal rule numbers (actually used during "yyparse" execution) to external ones listed in the Parser Description report. For example,

          yyrmap[yyi]

is the external number of the rule being Reduced by a Reduce action.

yysmap

is an array present only when YYDEBUG is defined. It is used to convert external state numbers listed in the Parser Description Report to internal ones (actually used during "yyparse" execution). For example,

          yysmap[reportnum]

is the internal number of the current state.

Appendix D: Producing C++ Source Code with YAY

When producing source code in C++ rather than C, there are a few things you must pay attention to. First, you must specify

          Language=C++

on the command line that invokes the YAY program.

D.1 C++-Style Dynamic Stack Allocation

Section 13.6 discussed dynamic stack allocation. If you #define the symbols YYALLOC and YYCPLUSPLUS in your declarations section, you get C++-style stack allocation. The rest of this section describes how this works.

With this style of allocation, the value stack is declared as an auto of the type

          yyarray<YYSTYPE>

In order for this to work, you must declare a template class of the form

          yyarray<class T>

This template class must have a constructor of the form

          yyarray(size)

          yyarray(size, arg, arg, ... )

The size value is an integer giving the desired size of the array. If there are additional arg values, these are also used in creating the array; for example, they might be one or more initial values to be assigned to array elements.

The yyarray template class should also define a member function of the form

          boolean realloc(int newsize);

This function changes the size of the array to the number of elements specified by newsize. The result should be true if the size is changed successfully and false otherwise.

In connection with yyarray, you may define a macro named YYSTYPE_INIT to be used in creating the value stack. "yyparse" will create the value stack using a constructor for a yyarray<YYSTYPE> array and passing an argument list of the form

          (size YYSTYPE_INIT)

This means that you can use YYSTYPE_INIT to specify any additional arguments needed by the constructor. For example, if the constructor needs a single additional integer argument, you might define YYSTYPE_INIT with

          #define YYSTYPE_INIT  , 42

so that the constructor would be called with the arguments

          (size , 42)

By default, YYSTYPE_INIT is defined to be a null string, so that the constructor is called with only the size argument.

Appendix E: Producing B Source Code with YAY

When producing source code in B rather than C, there are a few things you must pay attention to. First, you must specify

          Language=B

on the command line that invokes the YAY program.

B source code produced by YAY must always be compiled with the +Respectcase option of the B compiler.

The B source code that is produced by YAY begins with a number of comments describing how to "tune" the parser to your specifications. For example, it tells you how to do the same thing that

          #define YYDEBUG 1

does in C code. For further information, see the source code produced by YAY.

When producing source code in C, you are allowed to #define YYALLOC to allocate stacks dynamically. In B, however, stacks are always allocated dynamically, so this feature is not applicable. Similarly, you cannot #define YYSTATIC or YYSYNC in B code.

In B, the escape character for character literals is the star *, not the backslash \. This means that all your code should use the * escape. In particular, you must say

          %token '**'

instead of

          %token '*'

In C code, recognition actions may have their own local declarations. This is not possible in B.

Parsers written in B cannot use the YYSHIFT macro.

YAY Reference Manual

Table of Contents