this blog considered harmful: FIRST and FOLLOW sets, part 1

In my last post, we considered the grammar restrictions imposed by an LL(1) parser. We looked at how FIRST sets for post-increment and post-decrement statements could lead to ambiguity for the parser, and how to fix the problem.

But even a simple language like SPTL has dozens of production rules, and for many of them the FIRST set may not be obvious.

For example:

program = elementblock

elementblock = {simpleelement}+

simpleelement = statement | vardec | functiondef

We've gone through three production rules, and not found anything resembling a SPTL token. And it continues:

statement = assignstatement | breakstatement | callstatement |
closestatement | foreachstatement | forstatement |
ifstatement | openstatement | popstatement |
predecstatement | preincstatement | printstatement |
pushstatement | readstatement | repeatstatement |
shiftstatement | whilestatement

At this rate, it will take forever to get anywhere. We need a method for constructing the set of FIRST sets. Fortunately, there are several rules which will make the job a lot easier.

But first, some definitions:

Terminal: The technical term for token. These are the symbols of the language.
Nonterminal: These are the terms that appear in the BNF but not in the language. A nonterminal will always have a production rule to transform it into something else.

Now we'll define:
A and B as nonterminals
w and x as terminals

And, as in the last post, the following symbols mean:

∅ Empty set
∈ Empty sentence
∩ Intersection - the elements common to both sets
∪ Union - the elements appearing in one or both sets

Now, the FIRST rules:

FIRST(∈) = ∅
FIRST(x) = x
FIRST(w|x) = [w, x]
FIRST(∈ x) = x;
FIRST(w x) = w
but if w can be empty, FIRST(w x) = [w, x]
FIRST(A|B) = FIRST(A) ∪ FIRST(B)
FIRST(A B) = FIRST(A)
but if A can be empty, FIRST(A B) = FIRST(A) ∪ FIRST(B)

So far, so good. But how do we apply them?

We can see that every FIRST rule for a nonterminal gives us another FIRST rule, while all FIRST rules for terminals gives us a terminal (even if it's an empty terminal). Eventually we need to get a FIRST set of terminals for each rule. By building from the bottom up, we can get a FIRST set for each element of the BNF grammar.

For example, let's look at a portion of the expression rules for SPTL:

expr = boolexpr { boolop boolexpr }
boolexpr = relexpr [ relop relexpr ]
boolop = '&&' | '||'
relexpr = [ '-' ] term { addop term }
relop = '<' | '>' | '==' | '<=' | '>=' | '!='
term = factor { multop factor }
addop = '+' | '-'
factor = constant | varname | funcname | '(' expr ')' | negate
multop = '*' | '/' | '%'
negate = '!' factor
constant = boolconstant | numconstant | stringconstant
boolconstant = 'true' | 'false'
numconstant = digit {digit}
stringconstant = plainstring | escapestring
plainstring = ''' text '''
escapestring = '"' text '"'

If we start with expr = boolexpr { boolop boolexpr }, we get FIRST(expr) = FIRST(boolexpr), FIRST(boolexpr) = FIRST(relexpr), and so on. By starting at the bottom, we see:

FIRST(escapestring) = "
FIRST(plainstring) = '

By starting out at the bottom, with production rules that lead immediately to terminals, we can apply the FIRST rules repeatedly for the other production rules, to get a set of terminals for every production rule. So:

FIRST(stringconstant)
= FIRST(plainstring) ∪ FIRST(escapestring)
= [", ']

I've uploaded the complete set of FIRST sets for SPTL in gnumeric format or csv format.

My next post will look at FOLLOW sets.

Labels: FIRST, FOLLOW, grammar rules, recursive descent

this blog considered harmful

Sunday, February 8, 2009

FIRST and FOLLOW sets, part 1

0 Comments:

Links to this post:

About Me

contact me:

Previous Posts