CHAPTER 12
A GENERAL PARSER
The presentation of our parser has two parts. The first part, this
chapter and Chapter 14, covers the essentials of the parser, including
the use of general rewrite rules, and the polynomial upper bound for
context-free grammars. The upper bound is a function of the length
of the input string. (Chapter 13, in the middle, provides a proof of
correctness.)
The second part, Section 16.2, optimizes the parser so as to be
efficient for grammars having many rules. Sometimes, our parser will
be used with computer-generated grammars, which can have many rules
indeed. This latter part does not improve the theoretical upper
bounds, but it does yield faster execution times in practice.
The parser presented in this chapter alone is similar to an earlier
one by Thompson[], which was itself based on one by Kay[]. We modify
it in Chapter 14 so that it always terminates and always runs in
polynomial time as a function of the length of the input string (for
context-free grammars).
Thompson has successfully used it to implement (formalizations of)
natural languages, including English and Japanese [].
We begin by lifting some datatypes from Chapter 9 and refitting them
for convenient parsing.
12.1
The Types PHRASE and CHOICE_OF_PHRASES
Let's redeclare the types PHRASE and CHOICE_OF_PHRASES introduced
in chapter 9:
TYPE PHRASE = [ LEFT: CHOICE_OF_PHRASES POS: INT ] ;
CHOICE_OF_PHRASES = { PHRASE } ;
Relative to Chapter 9, we replace the field formerly called CHAR
with the POS (Part-Of-Speech) field, of type INT.
Any part-of-speech, including characters on your terminal, is
represented by a unique integer. We will write things like "<DIGIT>"
to represent the unique integer chosen to represent the part-of-speech
<DIGIT>. The POS field thus holds nonterminal parts-of-speech as
well as terminals.
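For readers who want to experiment, the two declarations above might be sketched in Python roughly as follows. One assumption runs throughout these sketches: parts-of-speech are modeled as plain strings such as "<DIGIT>", purely for readability, rather than the unique INTs the text describes.

```python
from collections import namedtuple

# PHRASE has a LEFT field (a CHOICE_OF_PHRASES: the collection of PHRASE
# blocks to our immediate left, or None at the far left) and a POS field
# (the part-of-speech, terminal or nonterminal).
Phrase = namedtuple("Phrase", ["left", "pos"])

# A CHOICE_OF_PHRASES is just a set of PHRASE blocks; a tuple is used
# here so that existing data stay immutable, matching the text's
# subjective (non-destructive) style.
digit = Phrase(left=None, pos="<DIGIT>")
choice = (digit,)
```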
We also differ from Chapter 9 in that we make "the rest of the phrase",
next to the POS, appear to the ~left instead of to the right. There,
where phrases point rightward, ~new blocks are created going from
right to left.
However, we want to take user input from left-to-right. ~Newer blocks
are going to be created to the right. They now reference older blocks
to the left, via the LEFT field.
Figure 12.1 shows a CHOICE_OF_PHRASES together with each of its
constituent PHRASEs. Each PHRASE's LEFT field points to the same
place, to another CHOICE_OF_PHRASES. The leftmost part of figure 12.2
also shows the same CHOICE_OF_PHRASES. Further to the right, we see
another, similar looking CHOICE_OF_PHRASES. That rightmost
CHOICE_OF_PHRASES (omitting the bottom <FORM>) looks identical to
figure 12.1, and here we see what all the LEFT fields point to, the
middle CHOICE_OF_PHRASES.
BOX: Why do we have PHRASE point leftward?
BOX:
12.1.1
Examples Of CHOICE_OF_PHRASES
Figure 12.1 was created when the user typed a 1. The "numbers" and
"formulas" grammars (Chapter 1) were active. The rule
1 -> <DIGIT>
created the <DIGIT> phrase. The rule
<DIGIT> -> <NUMBER>
produced the <NUMBER> phrase. The <ATOM>, <TERM>, and <FORM> phrases
were generated by the rules:
<NUMBER> -> <ATOM>
<ATOM> -> <TERM>
<TERM> -> <FORM>
The leftmost part of figure 12.2, like figure 12.1, shows this. The
middle CHOICE_OF_PHRASES in figure 12.2 is created when the user types
the next character, "+". (The rightmost CHOICE_OF_PHRASES does not
exist yet).
Finally, the user types a 2, and the rightmost CHOICE_OF_PHRASES
comes into existence. Except for the very bottom <FORM> phrase, this
CHOICE_OF_PHRASES, triggered by the 2, looks identical to the leftmost
CHOICE_OF_PHRASES, and was created by the same rule applications
(except using
2 -> <DIGIT>
instead of
1 -> <DIGIT> ).
The bottom <FORM> phrase was generated by the rule:
<FORM> + <TERM> -> <FORM>
This odd block has a different LEFT field than the other PHRASE blocks.
This LEFT points to the same place that the leftmost CHOICE_OF_PHRASES
points to, to the left of the "1".
When a rule applies, it contributes its give-phrase in such a way as
to share the same left and right neighbors as the matched occurrence
of its want-phrase. This is what it means to rewrite one phrase
to another, leaving unchanged everything to the left and right.
The want-phrase and the give-phrase share ~right neighbors by
appearing in the same CHOICE_OF_PHRASES. They share ~left neighbors
by sharing the same CHOICE_OF_PHRASES to the left (via the leftmost
PHRASE block's LEFT field).
12.2
What Is A ~Visible ~Phrase?
Start at the rightmost CHOICE_OF_PHRASES in figure 12.2.
You can travel leftward. Upon each CHOICE_OF_PHRASES, you choose
~one of the PHRASE blocks therein and continue leftward from it.
A visible phrase is any leftward going sequence of PHRASE blocks that
you encounter on such a leftward trip.
More precisely, a visible phrase is a leftward going sequence of
PHRASE blocks, where PHRASE(k) must be a left neighbor of
PHRASE(k-1):
PHRASE(k) is a member of the CHOICE_OF_PHRASES PHRASE(k-1).LEFT
Figure 12.3 shows such a sequence. Many visible phrases
are represented in a CHOICE_OF_PHRASES and in a PHRASE.
----------------(parentheses in previous paragraph mean subscripting)----
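Under the modeling assumptions of the earlier sketch (phrases as `Phrase(left, pos)` records, a CHOICE_OF_PHRASES as a tuple), this definition can be turned into a small enumerator: a hypothetical `visible_phrases` that walks LEFT links and yields each leftward-going sequence, leftmost POS first.

```python
from collections import namedtuple

Phrase = namedtuple("Phrase", ["left", "pos"])  # left: tuple of Phrases or None

def visible_phrases(choice):
    """Yield every visible phrase emanating from `choice`, as a list of
    POS names in left-to-right order."""
    for p in choice or ():
        yield [p.pos]                  # the one-block phrase ending at p
        for prefix in visible_phrases(p.left):
            yield prefix + [p.pos]     # every leftward extension of it

# A two-block chain: "1" with "+" to its right.
c1 = (Phrase(None, "1"),)
c2 = (Phrase(c1, "+"),)
```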
We will be intensely interested in the ~rightmost PHRASE block of any
visible phrase. The rightmost block is the most interesting because it
is the ~last of the PHRASE blocks created. A PHRASE block always
points leftward to an older block.
The first moment that a visible phrase comes to exist is when its
rightmost PHRASE block is created. We will apply a rule of grammar
upon a visible phrase only when its rightmost block comes to exist.
We say that a visible phrase ~emanates from a PHRASE block if that
block is the visible phrase's rightmost PHRASE block. We also say
that a visible phrase emanates from a CHOICE_OF_PHRASES, when the
CHOICE_OF_PHRASES contains the rightmost PHRASE block.
Our parsing strategy is to:
1) Represent the given string as a visible phrase.
2) Apply all rules of grammar so that the results of all possible
rewrites also appear as visible phrases.
3) If the given string is a valid member of the grammar's
language, then among all those visible phrases will be an
occurrence of the grammar's root part-of-speech, spanning all
the way across the input string.
The parsing is done when that full-spanning root part-of-speech
appears.
BOX: What is the definition of a ~visible ~phrase?
BOX:
BOX: Will the original input string be a visible phrase?
BOX:
BOX: For legal input, what else will be a visible phrase?
12.2.1
Transition From Syntax To Semantics
In Chapter 1, we said that the end of syntax analysis and the beginning
of semantic processing occurred when a rule like the following is
applied:
something -> a program
For us, that rule would be:
<ROOT_POS> -> a program
This rule would trigger at least when the full-spanning occurrence of
<ROOT_POS> appears.
We should really write this rule as:
<START> <ROOT_POS> <FINISH> -> the program
and surround the given input string with <START>...<FINISH>. (<START>
and <FINISH> should be entirely new parts-of-speech, appearing in no
other rule).
This will limit the invocation of this program, the
commencement of semantics, to apply only for occurrences of the root
part-of-speech that span the entirety of the given input string. This
guards against other conceivable occurrences of the root part-of-speech,
which might occur over substrings of the given string.
The program will execute as soon as the phrase:
<START> <ROOT_POS> <FINISH>
appears as a visible phrase.
12.2.2
The Sequence Of Rewrites
Imagine the input string, and then imagine the next string derived
by applying one rewrite. You can imagine a sequence of full-spanning
strings, each one derived from the previous string by applying one
rewrite, somewhere. Such a sequence can end with a full-spanning
occurrence of the root part-of-speech, if the original string is a
member of the grammar's language.
Each string in that sequence will in fact be a full-spanning visible
phrase in the finished, rightmost CHOICE_OF_PHRASES.
12.2.3
More Growth
Figure 12.2 shows all the CHOICE_OF_PHRASES that exist after the
user types the "1+2". Figure 12.4 shows the situation after the user
types a "*" and "3", beyond the original "1+2". The middle
CHOICE_OF_PHRASES and everything to its left is identical to figure
12.2. Remember, we always apply ~subjective ~modification, and so
no existing data ever change.
The "1+2" in figure 12.2 is merely grown upon. All possible
interpretations of the "1+2" phrase already exist in figure 12.2.
Figure 12.4 is a rightward augmentation of figure 12.2.
The rightmost CHOICE_OF_PHRASES is similar to figure 12.1, with the
addition of three new phrases, a <TERM>, a <FORM>, and another <FORM>
(bottom). Of those last three phrases, the <TERM> was generated by the
rule:
<TERM> * <ATOM> -> <TERM>
upon matching the <TERM> below "2" and the <ATOM> below "3".
The <FORM> (second from the bottom) was generated by the application
of the rule:
<TERM> -> <FORM>
upon our latest <TERM>. Naturally, both the <TERM> and <FORM> have
the same span (the same left neighbors and ultimately the same right
neighbors). The very bottom <FORM> was generated by the rule
<FORM> + <TERM> -> <FORM>
It matched the <FORM> below "1" and the new <TERM> below "2*3". This
latest <FORM> spans the entire input string, because the matched phrase
did.
Figures 12.1, 12.2, and 12.4 show all possible interpretations within
the phrases "1", "1+2", and "1+2*3" respectively. All possible rules
have applied at all possible places. We will always build such rich
CHOICE_OF_PHRASES.
We need a way to see all possible visible phrases represented in a
CHOICE_OF_PHRASES. Rules need to see occurrences of their want-phrases
so that they can apply.
BOX: Is there a PHRASE block for "1+2" in the string "1+2*3"
BOX: in figure 12.4?
BOX:
BOX: Is there a <TERM> phrase for "1+2" in the figure?
BOX:
BOX: Are any of the phrases spanning "1+2" modified or
BOX: deleted by the time we take in the "*3"?
12.3
Seeing All Visible Phrases Emanating From A CHOICE_OF_PHRASES
We use the type CHOICE_OF_PHRASES to represent an ambiguous phrase, the
structure representing many individual, unambiguous phrases. Each
rule of grammar "wants" to see occurences of its want-phrase in the
CHOICE_OF_PHRASES. That is, since rules are supposed to apply
everywhere as much as possible, we say that a rule "wants" to see all
visible phrases matching its want-phrase.
Given a CHOICE_OF_PHRASES in a variable C, e.g., as in figure 12.5(a),
the following quantifier (loop generator) sets P1 to each individual
PHRASE in C:
FOR P1 $E C ;
Figure 12.5(a) shows the values taken on by P1, given C.
Let's concentrate for the moment on the rule:
<TERM> -> <FORM>
This rule "wants" to see all possible <TERM>s. If we limit our
attention to the CHOICE_OF_PHRASES C, then we can see only one <TERM>.
In general, there may be more than one <TERM> in a CHOICE_OF_PHRASES
(e.g., in the rightmost CHOICE_OF_PHRASES in figure 12.4).
With the following concise notation, we can cause one iteration not
per PHRASE in C, but per PHRASE whose POS is <TERM>:
FOR P1 $E C; WITH P1.POS = <TERM> ;
The "FOR" part causes one iteration per PHRASE in C, and the "WITH"
part ~filters ~out all iterations except for those where
P1.POS = <TERM>. Only P1's whose POS equals <TERM> survive (figure
12.5(b)). (Sections 22.3.1 and 22.3.3 present these constructions more
completely).
We always apply a quantifier to something else. The quantifier causes
iterations, and the something else is executed once per iteration.
We can apply a quantifier to a sentence, e.g.,
FOR P1 $E C;
DO something END
or
FOR P1 $E C; WITH P1.POS = <TERM>;
DO something END
The latter example could be rewritten more traditionally as:
FOR P1 $E C;
DO
IF P1.POS = <TERM> THEN something FI
END
The "WITH" notation is more concise, and we hope clearer.
The "<TERM> -> <FORM>" rule wants to match all <TERM>s, and for each
match, it gives a <FORM> having the same span as the matched <TERM>.
We can write this tendency as a program:
FOR P1 $E C; WITH P1.POS = <TERM>;
DO
GIVE( [LEFT: P1.LEFT POS: <FORM>] );
END
For each matched <TERM>, we call GIVE. We pass to GIVE the replacement
phrase. The replacement phrase, the rule's give-phrase, has a LEFT
and a POS field. In this example, the POS (Part-Of-Speech) is <FORM>
and the LEFT is P1.LEFT, the ~LEFT field of the matched phrase.
GIVE inserts the new given phrase into the CHOICE_OF_PHRASES C. For
the moment, we can imagine that GIVE is defined by:
DEFINE GIVE( P: PHRASE ):
C:= C $> P ;
ENDDEFN
C thus acquires the new phrase, and now holds both the original <TERM>
phrase and the new <FORM> phrase passed to GIVE.
The "<TERM> -> <FORM>" rule has been presented as a program, a program
which sees all <TERM>s in the CHOICE_OF_PHRASES C and then inserts
a corresponding <FORM> for each one. The inserted <FORM> has the same
span as the found <TERM>: Both have the same LEFT (P1.LEFT) and both
reside in the same CHOICE_OF_PHRASES C.
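As a sketch, with the same Python modeling assumptions as before and C as a plain list, the "<TERM> -> <FORM>" program might read:

```python
from collections import namedtuple

Phrase = namedtuple("Phrase", ["left", "pos"])

C = [Phrase(left=None, pos="<TERM>")]   # the CHOICE_OF_PHRASES C

def give(p):
    """The text's first definition of GIVE:  C := C $> P."""
    C.append(p)

# FOR P1 $E C; WITH P1.POS = <TERM>;  DO GIVE(...)  END
for p1 in list(C):                      # iterate over a snapshot of C
    if p1.pos == "<TERM>":
        # the given <FORM> shares the matched phrase's LEFT, hence its span
        give(Phrase(left=p1.left, pos="<FORM>"))
```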
Let's consider another rule, whose want-phrase has length greater than
one:
<FORM> + <TERM> -> <FORM>
Consider figure 12.6. We match the rightmost part-of-speech in the
want-phrase first:
FOR P1 $E C; WITH P1.POS = <TERM> ;
Figure 12.6 shows the value in P1 for the one iteration caused by this
quantifier. (We use the quantifier because in general, there may be
more than one match). Given that P1, we can
acquire all possible phrases to its left, by writing:
FOR P2 $E P1.LEFT ;
P2 is set to all PHRASEs in the CHOICE_OF_PHRASES P1.LEFT. We filter
out everything except "+"s via:
FOR P2 $E P1.LEFT ; WITH P2.POS = '+' ;
Of course in this example, this quantifier also causes only one
iteration. The one value for P2 appears in the figure at the "+".
We put both quantifiers together using the ICL operator "!!":
FOR P1 $E C; WITH P1.POS = <TERM> ;
!!
FOR P2 $E P1.LEFT; WITH P2.POS = '+' ;
The "!!" means:
for each iteration caused by the first quantifier,
make the second quantifier cause all of its iterations.
This is a ~nesting of loops.
(Section 22.3.2 documents this operator). For example, the quantifier:
FOR P1 $E C;
!!
FOR P2 $E P1.LEFT;
sets P1 to each phrase in C, and for each of those P1's, it sets P2 to
all phrases to the immediate left of P1. For each P1, we set P2 to
each phrase in the CHOICE_OF_PHRASES to the left. This quantifier can
be written less concisely but more familiarly as:
FOR P1 $E C;
DO
FOR P2 $E P1.LEFT ;
DO
something
END
END
We match all possible "<FORM> + <TERM>" phrases starting from C via
the quantifier:
FOR P1 $E C; WITH P1.POS = <TERM> ;
!!
FOR P2 $E P1.LEFT; WITH P2.POS = '+' ;
!!
FOR P3 $E P2.LEFT; WITH P3.POS = <FORM> ;
P1 is taken from C, P2 is taken from P1.LEFT, and P3 is taken from
P2.LEFT. The triple P3,P2,P1 represents the matched phrase, on each
iteration. This is a three-dimensional quantifier, and can be written
less concisely as:
FOR P1 $E C;
DO IF P1.POS=<TERM> THEN
FOR P2 $E P1.LEFT;
DO IF P2.POS='+' THEN
FOR P3 $E P2.LEFT;
DO IF P3.POS=<FORM> THEN
something
FI
END
FI
END
FI
END
We complete the rule by applying that 3-D quantifier to a GIVE:
FOR P1 $E C; WITH P1.POS = <TERM> ;
!!
FOR P2 $E P1.LEFT; WITH P2.POS = '+' ;
!!
FOR P3 $E P2.LEFT; WITH P3.POS = <FORM> ;
DO GIVE( [LEFT: P3.LEFT POS: <FORM>] ); END
Each match (iteration) GIVEs a <FORM> whose LEFT is that of P3, the
leftmost matched item.
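The 3-D quantifier translates directly into three nested loops. A sketch, using the same Python modeling as before, with a hand-built "<FORM> + <TERM>" chain:

```python
from collections import namedtuple

Phrase = namedtuple("Phrase", ["left", "pos"])

# Build the visible phrase <FORM> + <TERM>, rightmost block residing on C.
c_form = (Phrase(None, "<FORM>"),)
c_plus = (Phrase(c_form, "+"),)
C = [Phrase(c_plus, "<TERM>")]

given = []                              # record what the rule GIVEs

def give(p):
    given.append(p)
    C.append(p)

# FOR P1 $E C ... !! FOR P2 $E P1.LEFT ... !! FOR P3 $E P2.LEFT ...
for p1 in list(C):
    if p1.pos == "<TERM>":
        for p2 in p1.left:
            if p2.pos == "+":
                for p3 in p2.left:
                    if p3.pos == "<FORM>":
                        # the new <FORM> spans from P3.LEFT to C
                        give(Phrase(left=p3.left, pos="<FORM>"))
```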
In general, we translate a rule:
<POS(1)> <POS(2)> ... <POS(n)> -> <GIVE_POS>
into
FOR P(n) $E C; WITH P(n).POS = POS(n) ;
!!
FOR P(n-1) $E P(n).LEFT; WITH P(n-1).POS = POS(n-1) ;
!!
...
!!
FOR P(1) $E P(2).LEFT; WITH P(1).POS = POS(1);
DO
GIVE( [LEFT: P(1).LEFT POS: <GIVE_POS>] );
END
For example, a rule whose want-phrase has length 1 becomes:
FOR P(1) $E C; WITH P(1).POS = POS(1) ;
DO
GIVE( [LEFT: P(1).LEFT POS: <GIVE_POS>] );
END
---------------- Parentheses in previous paragraph mean subscripting! ---
----------------(except after "GIVE")------------------------------------
Let's see how the 3-D quantifier just shown for the "<FORM>+<TERM>"
rule behaves. Figure 12.7, which represents "1+2*3", shows the values
taken on by the variables P1, P2, and P3. The outer quantifier, the
P1 quantifier, has two iterations, defining P1(1) and P1(2). (Those
are the only <TERM>s emanating from C).
---------------- Parentheses in previous paragraph around "1" and "2"
---------------- mean subscripting! ----
For P1(1), the second quantifier, which looks for a "+", causes zero
iterations. (There is no "+" in the CHOICE_OF_PHRASES P1(1).LEFT,
only a "*"). Thus P1(1) appears in zero iterations of the full 3-D
quantifier. However, P1(2), whose LEFT is different from that of
P1(1), does land us at a place where the second quantifier ("+") can
now generate one iteration. From there, the innermost quantifier,
which is looking for a <FORM>, itself causes one iteration. The full
3-D quantifier thus generates a total of one iteration, matching only
one phrase. (There are no other occurrences of the phrase
"<FORM>+<TERM>" emanating from C).
---------------- Parentheses in previous paragraph around "1" and "2"
---------------- mean subscripting! ----
Consider figure 12.8, which represents "1+2+3". The first quantifier
causes one iteration. The second quantifier from there matches the
second plus, causing one iteration. The innermost, <FORM>, quantifier
causes two iterations, defining two values for P3. The whole 3-D
quantifier thus causes a total of two iterations.
P3 is different in those two grand iterations. P3.LEFT is also
different on each iteration. Notice how this 3-D quantifier matches
two phrases, where each has a different span (LEFT). This will cause
the "<FORM>+<TERM>" rule to apply twice, once under the "2+3" and once
under the "1+2+3".
BOX: How are visible phrases seen?
BOX:
BOX: How do you translate the rule:
BOX:
BOX: <ATOM> -> <TERM>
BOX:
BOX: into a program?
BOX:
BOX: How about the rule:
BOX:
BOX: ( <FORM> ) -> <ATOM> ?
12.3.1
The Grammar Is A Program
We have seen how to render a grammar rule as a program. The program
finds all matches in the CHOICE_OF_PHRASES C for the rule's want-
phrase. For each match, the rule GIVEs its give-phrase. The given
phrase will reside on C, just like the rightmost end of the matched
phrase does (P1, in C). Also, both the matched and the new phrase
share the same LEFT neighbors (P3.LEFT). Each iteration rewrites
(subjectively) the matched phrase into the give-phrase, preserving
left and right neighbors.
Imagine such a program for each rule of grammar. Put them all
together, one after another, in any order. That overall program is
said to be the "grammar program".
Our grammar program will apply all rules upon all the phrases emanating
from C. Each application augments C, so that the rules' give-phrases
come to exist in C.
12.3.2
Assuring That All Rules Apply
Given a CHOICE_OF_PHRASES that includes the "1" phrase in figure 12.1,
a call to the grammar will introduce the <DIGIT> phrase. Another call
will now see the <DIGIT>, and will introduce the <NUMBER> phrase.
How many times must we call the grammar in order to assure that all
rules have applied?
We need to call the grammar each time a ~new phrase comes to exist.
Remember, the grammar has to see all possible phrases, in our effort
to apply all rules in all possible ways. If we give the grammar a
crack at all new phrases, since every phrase is new at some time, all
rules will have been applied to all phrases. A new phrase comes to
exist in GIVE, when GIVE puts the new phrase onto C.
Let's make GIVE call the grammar:
DEFINE GIVE( P: PHRASE ):
C:= C $> P ;
call the grammar program
ENDDEFN
Now, whenever a phrase comes to exist (GIVE), the grammar will be
employed and will try to apply all rules to it.
In the end, if you see an occurrence of some rule's want-phrase
emanating from C, then we know that the rule's give-phrase also
emanates from C, sharing the same LEFT with the want-phrase. That is,
that occurrence of the want-phrase has a rightmost PHRASE block. At
the time GIVE put it onto C, GIVE also called the grammar, and hence
the grammar saw the want-phrase. At that time, the grammar introduced
(via GIVE) its give-phrase.
The GIVE procedure and the grammar are mutually recursive. Each calls
the other.
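A minimal sketch of this mutual recursion, under the same Python modeling, with two coercion rules standing in for a full grammar:

```python
from collections import namedtuple

Phrase = namedtuple("Phrase", ["left", "pos"])

C = []

def give(p):
    """Insert P into C, then let the grammar try every rule on it."""
    global C
    C = C + [p]
    the_grammar(p)

def the_grammar(p):
    # Two coercion rules:  1 -> <DIGIT>   and   <DIGIT> -> <NUMBER>
    for want, gives in [("1", "<DIGIT>"), ("<DIGIT>", "<NUMBER>")]:
        if p.pos == want:
            give(Phrase(left=p.left, pos=gives))

give(Phrase(left=None, pos="1"))   # recursion bottoms out at <NUMBER>
```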
12.3.3
A Necessary Refinement
We almost have a working parser. We will verify its completeness and
its termination shortly. However, our most recent definition of GIVE
causes some redundancies.
GIVE not only inserts the given phrase P onto the global variable C,
but it also calls the grammar. The grammar sees ~all phrases in C
each time. That is, if C contains a phrase X now, then each time the
grammar calls GIVE, GIVE's subsequent call to the grammar will have
the grammar see the phrase X again, because X is still on C. All
phrases in C will be seen again and again, each time GIVE calls the
grammar.
We really want the grammar to see each phrase only once. The grammar
should examine not C, but only the one phrase P passed into GIVE.
Let's change GIVE as follows:
DEFINE GIVE( P: PHRASE ):
C:= C $> P ;
"call the grammar passing in P..."
the_grammar( P );
ENDDEFN
GIVE now passes P into the grammar. We must now change slightly the
grammar program. The grammar, a function, can itself be defined as:
DEFINE the_grammar( P: PHRASE ):
program for rule #1
program for rule #2
...
program for rule #n
ENDDEFN
The grammar now takes in a parameter P. The grammar should use P
instead of C as it tries to find phrase matches. The program for the
"<FORM>+<TERM>" rule has as its first line:
FOR P1 $E C; WITH P1.POS = <TERM> ;
!!
...
Let's change that to be:
FOR P1 $E {P}; WITH P1.POS = <TERM> ;
We have replaced C with a new CHOICE_OF_PHRASES that contains only the
one PHRASE P. All the rest of the rule's program is unchanged.
Now the grammar sees only the new PHRASE given to GIVE, and not all of
C.
Here is the whole program for this rule:
FOR P1 $E {P}; WITH P1.POS = <TERM> ;
!!
FOR P2 $E P1.LEFT; WITH P2.POS = '+' ;
!!
FOR P3 $E P2.LEFT; WITH P3.POS = <FORM> ;
DO GIVE( [LEFT: P3.LEFT POS: <FORM>] ); END
For efficiency, let's replace the first line again. Notice that the
FOR quantifier on that line causes only one iteration. The WITH
clause might filter out that one iteration. Thus, the first line
causes zero or one iterations.
Let's change the first line and its subsequent "!!" into an IF
statement. (IF statements always cause zero or one "iterations" over
the IF's THEN-clause). The rule now appears as:
P1:= P;
IF P1.POS = <TERM> THEN
" The rest is unchanged. "
FOR P2 $E P1.LEFT; WITH P2.POS = '+' ;
!!
FOR P3 $E P2.LEFT; WITH P3.POS = <FORM> ;
DO GIVE( [LEFT: P3.LEFT POS: <FORM>] ); END
FI
The
P1:= P ;
IF P1.POS = <TERM> THEN
...
replaces the old, and equivalent:
FOR P1 $E {P}; WITH P1.POS = <TERM> ;
!!
...
Our general program for any one rule:
<POS(1)> <POS(2)> ... <POS(n)> -> <GIVE_POS>
appears as:
P(n) := P ;
IF P(n).POS = POS(n) THEN
FOR P(n-1) $E P(n).LEFT; WITH P(n-1).POS = POS(n-1) ;
!!
FOR P(n-2) $E P(n-1).LEFT; WITH P(n-2).POS = POS(n-2) ;
!!
...
!!
FOR P(1) $E P(2).LEFT; WITH P(1).POS = POS(1);
DO
GIVE( [LEFT: P(1).LEFT POS: <GIVE_POS>] );
END
FI
Coercion rules, whose want-phrases are of length 1, thus come out with
only an IF and no FORs:
P(1) := P;
IF P(1).POS = POS(1) THEN
GIVE( [LEFT: P(1).LEFT POS: <GIVE_POS>] );
FI
---------------- Parentheses in previous paragraph mean subscripting! ---
---------------- except after "GIVE" ------------------------------------
Recall that the call to GIVE puts the new give-phrase onto C. Thus,
the new phrase spans from P(1).LEFT to C, the same span covered by the
matched want-phrase.
---------------- Parentheses in previous paragraph mean subscripting! ---
BOX: Why is an IF-statement instead of a FOR-quantifier used to see
BOX: the rightmost token in a phrase?
BOX:
BOX: How many FOR-quantifiers are nested ("!!") within a rule
BOX: whose want-phrase's length is L?
BOX:
BOX: Why do we examine visible phrases from right-to-left?
BOX:
BOX: Why is every visible phrase seen by the grammar?
12.4
Growing An Ambiguous Phrase From User Input
We produce figure 12.1 via the following:
C:= NIL;
GIVE( [POS:'1'] );
C is now figure 12.1. Recall that C is a global variable augmented
by GIVE.
The "1" PHRASE block in that figure is the one
passed to GIVE from here. The others exist because GIVE calls the
grammar so as to see the "1" phrase. The grammar GIVEs all the other
PHRASE blocks in this figure.
We take in the next character and place it to the right of C as follows:
LEFT:= C;
C:= NIL;
GIVE( [LEFT:LEFT POS:'+'] );
The "+" PHRASE block has as its LEFT the result of the previous step,
figure 12.1.
We take in each successive character and do exactly the same thing:
LEFT:= C; "What will be to our left: the result of the
previous step"
C:= NIL; "Clear the slate"
GIVE( [LEFT:LEFT POS: the new character ] );
The whole input is processed by the loop:
LEFT:= NIL;
FOR each input character CHAR
DO
C:= NIL;
GIVE( [LEFT:LEFT POS:CHAR] );
LEFT:= C;
END
At the end, C has the answer.
The important thing is that the PHRASE block for the K'th character
has as its LEFT a CHOICE_OF_PHRASES that includes the (K-1)'th
character. That is, when GIVing the K'th character, LEFT held the
previous value of C, which contained the (K-1)'th character.
The final value in C thus contains the given input string
as a visible phrase. It is a full-spanning phrase because it
emanates from C, the far right. The LEFT of the leftmost character is
NIL. There is nothing further to the left.
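Putting the pieces together under the same Python modeling: a toy grammar in which digits coerce to <FORM> and the rule "<FORM> + <FORM> -> <FORM>" applies (a hypothetical simplification of the book's formulas grammar), driven by the input loop just shown.

```python
from collections import namedtuple

Phrase = namedtuple("Phrase", ["left", "pos"])

C = []

def give(p):
    global C
    C = C + [p]
    the_grammar(p)

def the_grammar(p):
    # 1 -> <FORM>   and   2 -> <FORM>   (coercions)
    if p.pos in ("1", "2"):
        give(Phrase(p.left, "<FORM>"))
    # <FORM> + <FORM> -> <FORM>, matched right-to-left
    if p.pos == "<FORM>" and p.left:
        for p2 in p.left:
            if p2.pos == "+" and p2.left:
                for p3 in p2.left:
                    if p3.pos == "<FORM>":
                        give(Phrase(p3.left, "<FORM>"))

def parse(text):
    """The input loop from the text; returns the final C."""
    global C
    left = None
    for char in text:
        C = []
        give(Phrase(left, char))
        left = tuple(C)
    return C

result = parse("1+2")
```

A full-spanning phrase is one whose leftmost LEFT is None; finding a <FORM> with that property in the final C signals a successful parse.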
At this point, our parser is done for context-free rules, except for
one more modification which will be necessary in order to
guarantee that this parser finishes in finite time for all context-
free grammars (Chapter 14).
BOX: Why is the input string a visible phrase in C, after
BOX: taking all the user input?
BOX:
BOX: Should we ignore ~white ~space (like spaces)? (Section
BOX: 17.2).
12.5
General Rewrite Rules
General rewrite rules are rules whose give-phrases have length greater
than one, like:
<X> -> <A> <B>
or
<X> <Y> -> <A> <B>
or
<X> <Y> <Z> -> <A> <B>
Each general rewrite rule is turned into a program very much like other
rules. Any rule:
<POS(1)> <POS(2)> ... <POS(n)> -> ??
becomes a program:
P(n):= P;
IF P(n).POS = <POS(n)> THEN
FOR P(n-1) $E P(n).LEFT; WITH P(n-1).POS = <POS(n-1)> ;
!!
...
!!
FOR P(1) $E P(2).LEFT; WITH P(1).POS = <POS(1)> ;
DO
??
END
This shows how the want-phrase is turned into a program. The "??"
represents the give-phrase (or some arbitrary action). For context-
free rules, the "??" is simply
GIVE( [LEFT: P(1).LEFT POS: <GIVE_POS>] );
---------------- Parentheses in previous paragraph mean subscripting! ---
---------------- except after "GIVE" ------------------------------------
Consider the following rule's give-phrase, of length greater than one:
... -> <GIVE_POS(1)> <GIVE_POS(2)> ... <GIVE_POS(k)>
The give part for this rule will look almost like the situation when
taking user input.
---------------- Parentheses in previous paragraph mean subscripting! ---
Each <GIVE_POS> is GIVEn, from left-to-right. (We will
come back and dissect the following):
" Leftmost PHRASE block ... "
RIGHTHAND:= C; "(Save C, for use in the rightmost GIVE)"
C:= NIL;
GIVE( [LEFT: P(1).LEFT POS: <GIVE_POS(1)>] );
" Second to leftmost PHRASE block ... "
LEFT:= C; "(This looks identical to taking user input)"
C:= NIL;
GIVE( [LEFT: LEFT POS: <GIVE_POS(2)>] );
...
" Second to rightmost PHRASE block ... "
LEFT:= C; "(This looks identical to taking user input)"
C:= NIL;
GIVE( [LEFT:LEFT POS: <GIVE_POS(k-1)>] );
" Rightmost PHRASE block ... "
LEFT:= C;
C:= RIGHTHAND; "Restore C to original value"
GIVE( [LEFT:LEFT POS: <GIVE_POS(k)>] );
This looks identical to taking user input, except for the leftmost and
rightmost GIVEs.
---------------- Parentheses in previous paragraph around "1", "2"
---------------- "k", and "k-1" mean subscripting! ---------------
The final (rightmost) GIVE is exactly like the GIVE in a context-free
rule. Both augment C, the C in force when the grammar was called.
Thus, C winds up containing the rightmost PHRASE block of both the
matched phrase and the new phrase given by the general rewrite rule.
Let's examine the give-phrase generation more closely.
Let RIGHTHAND and LEFT be (local) variables of type CHOICE_OF_PHRASES.
The ~leftmost block is always generated by:
RIGHTHAND:= C; "(Save C, for use in the rightmost GIVE)"
C:= NIL;
GIVE( [LEFT: P(1).LEFT POS: <GIVE_POS(1)>] );
We set C to NIL so that this GIVE creates a new CHOICE_OF_PHRASES.
---------------- Parentheses in previous paragraph around "1"
---------------- mean subscripting! -----------------------
The ~rightmost block is always generated by:
LEFT:= C;
C:= RIGHTHAND; "(Restore C to original value)"
GIVE( [LEFT: LEFT POS: <GIVE_POS(k)>] );
We set C back to its original value (RIGHTHAND) so that this
rightmost PHRASE block comes to reside on the same CHOICE_OF_PHRASES
that holds the want-phrase's rightmost PHRASE block. This assures
that the want- and give-phrases will ultimately share the same
right neighbors.
---------------- Parentheses in previous paragraph around "k"
---------------- mean subscripting! ------------------------------------
Any ~middle block is always generated by:
LEFT:= C; "This looks exactly like taking user input"
C:= NIL;
GIVE( [LEFT: LEFT POS: <GIVE_POS(i)>] );
Figure 12.9 shows the want-phrase and the newly generated give-
phrase. Notice that for all but the rightmost part-of-speech in the
give-phrase, each one gets its own CHOICE_OF_PHRASES, a new point of
view.
---------------- Parentheses in previous paragraph around "i"
---------------- mean subscripting! ------------------------------------
Finally, C is restored by setting it to RIGHTHAND for the last part-
of-speech. Thus, our new give-phrase, like the want-phrase, will
emanate from the same CHOICE_OF_PHRASES, C (the original). The
give- and want-phrases have the same LEFTs in their leftmost PHRASE
blocks. (Notice that we supply the want-phrase's LEFT, P(1).LEFT, as
the LEFT for our first, leftmost PHRASE block in the give-phrase).
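The give-phrase generation can also be sketched in Python (same modeling as before; the matched want-phrase and the rule "... -> <A> <B>" are hypothetical stand-ins):

```python
from collections import namedtuple

Phrase = namedtuple("Phrase", ["left", "pos"])

C = ()

def give(p):
    """GIVE without the grammar call, for this sketch:  C := C $> P."""
    global C
    C = C + (p,)

# Suppose a rule "... -> <A> <B>" has matched; P(1).LEFT is match_left,
# and the matched phrase's rightmost block (<Y>) already resides on C.
match_left = (Phrase(None, "<X>"),)
C = (Phrase(match_left, "<Y>"),)

righthand = C                         # save C for the rightmost GIVE
C = ()
give(Phrase(match_left, "<A>"))       # leftmost: same LEFT as the match
left = C
C = righthand                         # restore C to its original value
give(Phrase(left, "<B>"))             # rightmost: joins <Y>'s choice
```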
This concludes the presentation of Thompson's and Kay's parser. We
still need to modify it so as to achieve a polynomial and not
exponential (or infinite) cost. That appears in Chapter 14.
The following chapter studies the parser closely enough to prove that
it always works.
BOX: How are general rewrite rules similar to the process
BOX: of taking user input?
BOX:
BOX: What is the variable RIGHTHAND used for?