ThinkGeek - Cool Stuff for Geeks and Technophiles

Sunday, February 8, 2009

FIRST and FOLLOW sets, part 1

In my last post, we considered the grammar restrictions imposed by an LL(1) parser. We looked at how FIRST sets for post-increment and post-decrement statements could lead to ambiguity for the parser, and how to fix the problem.

But even a simple language like SPTL has dozens of production rules, and for many of them the FIRST set may not be obvious.

For example:


program = elementblock

elementblock = {simpleelement}+

simpleelement = statement | vardec | functiondef


We've gone through three production rules, and not found anything resembling a SPTL token. And it continues:


statement = assignstatement | breakstatement | callstatement |
closestatement | foreachstatement | forstatement |
ifstatement | openstatement | popstatement |
predecstatement | preincstatement | printstatement |
pushstatement | readstatement | repeatstatement |
shiftstatement | whilestatement


At this rate, it will take forever to get anywhere. We need a method for constructing the set of FIRST sets. Fortunately, there are several rules which will make the job a lot easier.

But first, some definitions:

Terminal: The technical term for token. These are the symbols of the language.
Nonterminal: These are the terms that appear in the BNF but not in the language. A nonterminal will always have a production rule to transform it into something else.

Now we'll define:
A and B as nonterminals
w and x as terminals

And, as in the last post, the following symbols mean:

Empty set
Empty sentence
Intersection - the elements common to both sets
Union - the elements appearing in one or both sets


Now, the FIRST rules:



  1. FIRST(∈) = ∅
  2. FIRST(x) = x
  3. FIRST(w|x) = [w, x]
  4. FIRST(∈ x) = x;
  5. FIRST(w x) = w
      but if w can be empty, FIRST(w x) = [w, x]
  6. FIRST(A|B) = FIRST(A) ∪ FIRST(B)
  7. FIRST(A B) = FIRST(A)
      but if A can be empty, FIRST(A B) = FIRST(A) ∪ FIRST(B)



So far, so good. But how do we apply them?

We can see that every FIRST rule for a nonterminal gives us another FIRST rule, while all FIRST rules for terminals gives us a terminal (even if it's an empty terminal). Eventually we need to get a FIRST set of terminals for each rule. By building from the bottom up, we can get a FIRST set for each element of the BNF grammar.

For example, let's look at a portion of the expression rules for SPTL:


expr = boolexpr { boolop boolexpr }
boolexpr = relexpr [ relop relexpr ]
boolop = '&&' | '||'
relexpr = [ '-' ] term { addop term }
relop = '<' | '>' | '==' | '<=' | '>=' | '!='
term = factor { multop factor }
addop = '+' | '-'
factor = constant | varname | funcname | '(' expr ')' | negate
multop = '*' | '/' | '%'
negate = '!' factor
constant = boolconstant | numconstant | stringconstant
boolconstant = 'true' | 'false'
numconstant = digit {digit}
stringconstant = plainstring | escapestring
plainstring = ''' text '''
escapestring = '"' text '"'


If we start with expr = boolexpr { boolop boolexpr }, we get FIRST(expr) = FIRST(boolexpr), FIRST(boolexpr) = FIRST(relexpr), and so on. By starting at the bottom, we see:


FIRST(escapestring) = "
FIRST(plainstring) = '


By starting out at the bottom, with production rules that lead immediately to terminals, we can apply the FIRST rules repeatedly for the other production rules, to get a set of terminals for every production rule. So:


FIRST(stringconstant)
= FIRST(plainstring)FIRST(escapestring)
= [", ']


I've uploaded the complete set of FIRST sets for SPTL in gnumeric format or csv format.

My next post will look at FOLLOW sets.

Labels: , , ,

0 Comments:

Post a Comment

Links to this post:

Create a Link

<< Home