Thursday, March 30, 2006

Lab 2 a



First Point of Failure



Wrong Decision


Second Deadend


Question 1

After 14 steps, srparser finds a complete sentence, "My dog saw a man". In further steps it finds the prepositional phrases "in the park" and "with a statue", but parsing fails because the parse does not span the whole input and there is no backtracking to try a new parse. After correctly parsing "my dog" as an NP, "saw" as a V and "a man" as an NP, the parser chooses to reduce, so "saw a man" becomes a VP. The failure point is the 14th step, when this NP and VP are merged into an S. The parser should have shifted "in" instead of merging the NP and VP into an S.
The parser gets stuck for the second time after 'in the park' is reduced to a PP and merged with 'saw a man' to form a VP: at that point the new VP (VP -> VP PP) combines with the earlier NP to form an S, leaving 'with a statue' dangling outside the sentence.
It should have continued to shift instead, because "in the park with a statue" needs to be parsed as a single PP.
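
A minimal sketch of how this could be reproduced with NLTK's shift-reduce parser. The grammar below is my reconstruction of the demo grammar rather than the lab's exact file, and the calls assume a newer nltk package than the nltk_lite version used in the lab.

import nltk

# Reconstruction of the demo grammar (an assumption, not the lab's exact file).
grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> Det N | NP PP
    VP -> V NP | VP PP
    PP -> P NP
    Det -> 'my' | 'a' | 'the'
    N -> 'dog' | 'man' | 'park' | 'statue'
    V -> 'saw'
    P -> 'in' | 'with'
""")

# trace=2 prints every shift/reduce step, so the premature S -> NP VP
# reduction described above is visible in the output.
parser = nltk.ShiftReduceParser(grammar, trace=2)
tokens = 'my dog saw a man in the park with a statue'.split()
print(list(parser.parse(tokens)))   # [] -- no parse spans the whole input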

Question 2(a)
The parser uses a reduce>shift strategy: whenever a shift-reduce conflict occurs, it chooses to reduce instead of shifting (e.g. parsing "in the park" and "with a statue" as two separate PPs instead of a single PP, or building a sentence parse for "My dog saw a man" without shifting the remaining tokens).

Question 2(b)

A single strategy of shift>reduce would not work.

Suppose the strategy is to shift each token and reduce it only to its lexical category (N, V, etc.), deferring all further reductions until the end.

Then "a statue" is reduced to an NP (NP -> Det N),
"with a statue" is found (PP -> P NP),
and the dead end comes when we try to reduce 'park' (N) with the PP 'with a statue': there is no production rule that reduces N PP to a nonterminal.

On the other hand, if we assume we can defer S -> NP VP until the end, the reduce>shift approach would work: as both failure points above show, the dead ends happened exactly when an NP and a VP were reduced to a premature S.
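
A quick check of the N PP claim, reusing the reconstructed grammar from the Question 1 sketch (again an assumption, not the lab's file):

import nltk
from nltk import Nonterminal

grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> Det N | NP PP
    VP -> V NP | VP PP
    PP -> P NP
    Det -> 'my' | 'a' | 'the'
    N -> 'dog' | 'man' | 'park' | 'statue'
    V -> 'saw'
    P -> 'in' | 'with'
""")

# No production has the right-hand side (N, PP), so a stack ending in
# ... N PP can never be reduced further.
n, pp = Nonterminal('N'), Nonterminal('PP')
print(any(prod.rhs() == (n, pp) for prod in grammar.productions()))   # False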


Question 2(c)


Correct parse with reduce>shift, deep tree (deferring S->NP VP)


Tokens parsed with 5 shifts and 7 reduces

shift>reduce would produce flatter trees
reduce>shift would produce straggly trees

The reduce>shift strategy reduces items to nonterminals as soon as something is pushed onto the stack and a production rule applies. This biases the parse toward branching further down the tree, compared with shift>reduce, where branching down (reduction) can be deferred until all the tokens have been shifted. This is also why shift>reduce is more likely to produce flat trees: it may wait an arbitrary number of steps before reducing.

Question 2(d)

reduce>shift minimizes the number of elements in the stack because nonterminals are reduced to other nonterminals immediately. This contrasts with the shift>reduce strategy, where items sit in the stack waiting to be combined with other nonterminals later: shift>reduce keeps tokens on the stack as long as there is a conflict between shifting and reducing.

Question2(e)

"Human parsing" seems to follow reduce>shift with backtracking. As humans have short stacks (memory), they try to group together words as they hear them. So the sentence could have been parsed like
my dog = NP
saw a man= VP
in the park= PP
with a statue = PP
(a little bit better than pure reduce>shift)
realizing the parse does not make sense (seeing a man with a statue), humans would backtrack to find the correct parse

Question 3.

3-a
There are no competing rules in this grammar with the same RHS but different nonterminals on the LHS.
The recursions are self-contained too, so given rules such as

NP -> X Y
VP -> NP V

there is no rule such as

XP -> X Y V

that overlaps with them. So reduce-reduce conflicts cannot occur.
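
This is easy to verify mechanically: group the productions by right-hand side and check that no RHS has two different left-hand sides. A sketch, again over my reconstruction of the grammar:

import nltk
from collections import defaultdict

grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> Det N | NP PP
    VP -> V NP | VP PP
    PP -> P NP
    Det -> 'my' | 'a' | 'the'
    N -> 'dog' | 'man' | 'park' | 'statue'
    V -> 'saw'
    P -> 'in' | 'with'
""")

by_rhs = defaultdict(set)
for prod in grammar.productions():
    by_rhs[prod.rhs()].add(prod.lhs())

# Any RHS with more than one possible LHS would be a reduce-reduce conflict.
print({rhs: lhss for rhs, lhss in by_rhs.items() if len(lhss) > 1})   # {}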

3-b
A strategy for resolving reduce-reduce conflicts might involve constructing look-up tables keyed on the terminal that comes right after each of the competing nonterminals in a parse.
e.g. X -> V NP
Y -> V NP
D -> X M
F -> Y C
By looking ahead to see whether the next token is C or M, we can resolve the conflict.
Of course, this still would not resolve conflicts where both ambiguous parses of a span are followed by the same tokens; in that case we might choose the nonterminal (X or Y) that comes first in the grammar's rule table.
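
A toy sketch of the lookahead idea, using the hypothetical X/Y/D/F rules above (not the lab grammar):

# (handle on top of stack, next input token) -> nonterminal to reduce to
REDUCTIONS = {
    (('V', 'NP'), 'M'): 'X',   # because D -> X M
    (('V', 'NP'), 'C'): 'Y',   # because F -> Y C
}

def choose_reduction(stack_top, lookahead):
    """Resolve an ambiguous handle with one token of lookahead."""
    return REDUCTIONS.get((tuple(stack_top), lookahead))

print(choose_reduction(['V', 'NP'], 'M'))   # X
print(choose_reduction(['V', 'NP'], 'C'))   # Y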

Question 4

4(a)
As n grows large, the stack can grow as deep as O(n).

e.g. a sentence like
V PP PP PP PP
where the constituents combine from the right: the rightmost two PPs form a constituent X, which in turn forms a constituent with the third PP, and so on, so every token has to sit on the stack before any reduction can take place.


4(b)
SRparser is not real-time: there is no fixed number of steps before a symbol is shifted onto the stack, because we may reduce items on the stack or shift them in no predetermined fashion.

Any token may wait a variable number of reduce/shift actions before it is processed.

e.g.
I shot the good inadvertently
'the' (Det) may wait for two shifts plus two reduces, or two shifts plus three reduces, before it is moved onto the stack.

4(c)
The (n-1)st Catalan number is the number of ways to parenthesize n symbols:
C(n) = sum over i = 1..n of C(i-1) C(n-i)
where C(0) = 1,

since the Catalan number C(n) can, in this case, be interpreted as the number of binary trees with exactly n nodes.
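
A direct transcription of the recurrence, as a sanity check:

from functools import lru_cache

# C(0) = 1 and C(n) = sum over i = 1..n of C(i-1) * C(n-i)
@lru_cache(maxsize=None)
def catalan(n):
    if n == 0:
        return 1
    return sum(catalan(i - 1) * catalan(n - i) for i in range(1, n + 1))

print([catalan(n) for n in range(8)])   # [1, 1, 2, 5, 14, 42, 132, 429]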

* David M. Mount, Design and Analysis of Computer Algorithms, lecture notes, Department of Computer Science, University of Maryland, Fall 2003.

Question 5

Bottom-up parsing produces 54 edges whereas top-down produces 74 edges, so bottom-up produced roughly 27% fewer edges than top-down did for this sentence.

This happens because bottom-up processing takes advantage of the lexicon: it does not have the overhead of listing all possible expansions before looking at the input, as top-down processing does.
(In fact, most of the edges created by top-down but not by bottom-up are of the form constituent -> 'terminal word'.)
On the other hand, the top-down parser never creates S -> ... edges that cannot lead to a sentence, while with bottom-up parsing we can see attempts at forming an S in the middle of the sentence, even though there are no recursive production rules for S such as S -> S NP.

In general, I believe bottom-up parsing would perform better for grammars with a huge lexicon.
For grammars with many production rules leading to an S and a small lexicon, a top-down parser may be the more efficient choice.
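
The comparison can be reproduced along these lines. This is a sketch with a newer nltk API and a reconstructed grammar, so the exact counts of 54 and 74 will not be reproduced:

import nltk
from nltk.parse.chart import BottomUpChartParser, TopDownChartParser

grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> Det N | NP PP | 'John'
    VP -> V NP | VP PP
    PP -> P NP
    Det -> 'the' | 'a'
    N -> 'man' | 'park' | 'cookie'
    V -> 'saw' | 'ate'
    P -> 'in' | 'with'
""")

tokens = 'John saw the man in the park'.split()
for parser_cls in (BottomUpChartParser, TopDownChartParser):
    chart = parser_cls(grammar).chart_parse(tokens)
    print(parser_cls.__name__, chart.num_edges())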

Question 6

The bottom-up parser can create an edge such as

S -> NP * VP

over the 'the cookie' part.

In fact, the bottom-up parser can find sentences in the middle of a sentence simply because, given a complete edge (the NP here), it looks for all rules starting with that constituent.

The top-down parser would not produce sentence edges in middle positions with this grammar, because the only production rule that starts a sentence is S -> * NP VP and no production rule expects an S as its first constituent. In the case of 'the cookie', when the top-down parser has processed 'John ate', its only predictions are for an NP (coming from the dotted edge VP -> V * NP), and there are no production rules with NP on the LHS that start with an S.

The top-down parser produces superfluous edges such as

NP -> * 'I'

at the start of the parse, because it creates edges from the grammar alone without looking at the actual word: instead of immediately recognizing that 'John' is an NP, it plots out all possible starts of a sentence in the grammar and works its way down to listing all the terminals. Because the bottom-up parser uses the lexicon, it does not create such edges.
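
The mid-sentence S edges are easy to spot programmatically. A sketch over a reconstructed toy grammar (not the lab's file):

import nltk
from nltk import Nonterminal
from nltk.parse.chart import BottomUpChartParser, TopDownChartParser

grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> Det N | 'John'
    VP -> V NP
    Det -> 'the'
    N -> 'cookie'
    V -> 'ate'
""")

tokens = 'John ate the cookie'.split()
S = Nonterminal('S')
for parser_cls in (BottomUpChartParser, TopDownChartParser):
    chart = parser_cls(grammar).chart_parse(tokens)
    # S edges starting after position 0: the bottom-up chart has them,
    # the top-down chart does not.
    mid = [e for e in chart.edges() if e.lhs() == S and e.start() > 0]
    print(parser_cls.__name__, mid)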

Question 7



First parse for the sentence



Second parse for the sentence

I found a successful parse with 37 edges. The strategy is as follows (a rough code sketch appears after the list):

First:
1. Use bottom-up parsing to recognize each individual word, e.g.
N -> * 'table'
N -> 'table' *
2. Apply the fundamental rule to these edges, to form nonterminals out of the terminals created in the first step.
3. Do the top-down initialization.
4. Make top-down predictions to set out the possible productions for a sentence.
5. Apply the fundamental rule to the nonterminals.
6. Carry out the top-down strategy once the nonterminals created in the previous steps have become "dense" enough (S -> NP VP, VP -> V NP, VP -> V NP PP).
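
A rough approximation of this mixed strategy with nltk's chart-rule classes. nltk applies the rules exhaustively rather than in the hand-ordered steps of the lab's interactive chart tool, and the grammar is my reconstruction, so the edge count will not be exactly 37:

import nltk
from nltk.parse.chart import (ChartParser, LeafInitRule, BottomUpPredictRule,
                              TopDownInitRule, TopDownPredictRule,
                              SingleEdgeFundamentalRule)

grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> Det N | NP PP | 'John'
    VP -> V NP | VP PP
    PP -> P NP
    Det -> 'the' | 'a'
    N -> 'man' | 'park' | 'table'
    V -> 'saw'
    P -> 'in' | 'on'
""")

strategy = [LeafInitRule(),               # step 1: recognize the individual words
            BottomUpPredictRule(),        # step 2: propose rules that start with a recognized symbol
            TopDownInitRule(),            # step 3: top-down initialization
            TopDownPredictRule(),         # step 4: top-down predictions
            SingleEdgeFundamentalRule()]  # steps 5-6: combine edges via the fundamental rule

parser = ChartParser(grammar, strategy)
chart = parser.chart_parse('John saw the man in the park'.split())
print(chart.num_edges(), len(list(chart.parses(grammar.start()))))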



Question 8

The Earley parse creates 51 edges, which makes it more efficient than the other strategies for this sentence, though not dramatically better than bottom-up parsing here. It improves on bottom-up processing in that it does not create infeasible S edges (at positions where an S cannot start, such as the middle of the sentence). It is more efficient than top-down processing because, knowing the next terminal, it can cut down on the proliferation of terminal rules: e.g. knowing that 'John' is a name and that a name is a feasible first constituent of a sentence, it can match the NP directly
without spelling out the other starting possibilities such as Det -> 'the'.
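
A corresponding sketch with nltk's Earley parser (reconstructed grammar, so the count will not be exactly 51):

import nltk
from nltk.parse import EarleyChartParser

grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> Det N | NP PP | 'John'
    VP -> V NP | VP PP
    PP -> P NP
    Det -> 'the' | 'a'
    N -> 'man' | 'park'
    V -> 'saw'
    P -> 'in' | 'with'
""")

chart = EarleyChartParser(grammar).chart_parse('John saw the man in the park'.split())
print(chart.num_edges())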

Question 9

Even if a sentence has a superpolynomial number of parses, the chart can fit them, because it holds the subparses from previous states, and the whole parse can be recombined recursively from those parts by going back through earlier states.

For example, we know that at least one successful parse has been made when we see
|================| S -> NP VP
From then on we can check the edges whose VP spans to the end of the sentence.
We find two VP parses of the form
|. [------------]|
which correspond to the two rules
VP -> VP PP
VP -> V NP

Now let's take VP -> VP PP. We know this VP ends the sentence and that, within it, the PP is the last constituent, so we proceed to look for edges ending at that PP. The recursion continues until we have found all feasible parses, subject to
- the sequence of constituents required by the rules for the subtrees of the whole sentence
- the number of constituents spanned

e.g. suppose the sentence length is 7: if we find a VP at the end spanning 6 words but there is no NP at the start spanning 1 token, then the parse built on this VP edge is discarded.

Jurafsky and Martin simply add explicit pointers to previous states to achieve this compactness. I believe that approach could be used here as well, even though it may not seem as feasible as doing the look-ups.
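
A sketch of the underlying point, using a toy PP-attachment grammar of my own with a newer nltk API: the chart stays modest in size, yet every one of the ambiguous parses can still be read back out of it.

import nltk
from nltk.parse.chart import BottomUpChartParser

grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> Det N | NP PP | 'I'
    VP -> V NP | VP PP
    PP -> P NP
    Det -> 'the' | 'a'
    N -> 'man' | 'park' | 'telescope'
    V -> 'saw'
    P -> 'in' | 'with'
""")

tokens = 'I saw the man in the park with a telescope'.split()
chart = BottomUpChartParser(grammar).chart_parse(tokens)
trees = list(chart.parses(grammar.start()))
# One chart of modest size encodes all of the ambiguous parses.
print(chart.num_edges(), len(trees))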

Friday, March 10, 2006

Characters, Subsets, Rules, Lexicon

For this lab, I introduced 4 characters to the alphabet. These are

J is used to represent the verb coger: the lexical form of coger is coJer, where J is realized as g by default and takes the surface form j before a suffix beginning with o or a.

The g-j mutation is formulated in the "J realized as j" rule. This rule also uses the subset B to represent a and o; it simply ensures that J is not realized as j unless the environment to its right (across the morpheme boundary) starts with a character from subset B.

The z-c mutation is handled by the "z realized as c" rule, which converts lexical z to c only when the right context is a morpheme starting with e. This situation can also arise through the e inserted for pluralization, and the rule takes that into account as well; it interacts with pluralization because z is also a consonant. In this rule we explicitly allow the e that follows a character from the consonant subset CO to pass through.

Pluralization likewise takes the z-c mutation into account: it takes a word ending in a consonant and inserts an e when the string takes the plural suffix s.
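
As a sanity check on the intended behaviour, here is a toy generator in plain Python (not the two-level rules themselves). It follows the lexical conventions described above, with '+' as the morpheme boundary; 'luz' is a hypothetical entry added only to illustrate the pluralization interaction.

def surface(lexical):
    """Toy spell-out of the alternations described above."""
    root, _, suffix = lexical.partition('+')
    # J -> j before a suffix starting with a or o, otherwise J -> g
    root = root.replace('J', 'j' if suffix[:1] in ('a', 'o') else 'g')
    # e-insertion: a consonant-final root takes -es rather than -s in the plural
    if suffix == 's' and root[-1] not in 'aeiou':
        suffix = 'es'
    # z -> c before an e, including an e inserted for pluralization
    if root.endswith('z') and suffix.startswith('e'):
        root = root[:-1] + 'c'
    return root + suffix

for lex in ('coJ+o', 'coJ+er', 'cruz+e', 'ciudad+s', 'luz+s'):
    print(lex, '->', surface(lex))
# cojo, coger, cruce, ciudades, luces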

The other characters introduced are for verbs in the lexicon. The irregular verb cocer is represented as KOQ (coc), whose surface form is cue. Since the examples in the batch file only use cocer's irregular forms beginning with cue, this solution suffices.

For irregular verbs such as ejercer, the character C was used; lexical C becomes null in the surface form. There are two groups of irregular verbs written with C. The first group, conocer and parecer, drops the c in the root of the verb and takes zc in its conjugation (the suffixes are written with zc concatenated, as in zcamos). The second group, with verbs like vencer and ejercer, drops the c in the root and takes z instead; these verbs and their suffixes are represented similarly to the first group.

There are other irregular verbs, such as cruzar, which have irregularities in the indicative and subjunctive tenses.

I must note that wherever a suffix was ambiguous, a set of suffixes was created to avoid multiple, incorrect interpretations of a verb. E.g.
we group one suffix set for all kinds of verbs (ar, er, ir and the irregular ones together):
V_SUFFIX6:
+e V_Suffix6 (pres.subj,1p,sg)
+a V_Suffix6 (pres.subj,1p,sg)
+zca V_Suffix6 (pres.subj,1p,sg)
+za V_Suffix6 (pres.subj,1p,sg)
Now when I have an irregular form like cruce, it may also be interpreted as 3p,sg, because it takes a suffix group that does not explicitly relate each group of verbs to its own suffixes but rather aggregates them. I may have created redundant suffixes in the end.

The big design issue for me was representing the irregular verbs while staying as close to the original forms as possible.

As for extensibility, I feel I improved over my single-automaton solution from the first assignment. The rules are self-contained, up to their interactions with other rules.
I believe the same goes for the lexicon. The representation of the verbs' irregularity may be my one concern: with this approach it seems you can run out of surface characters quickly if you try to handle a lot of irregularities.

Batch Test - Results






Lab 1b - New Rules

martdokuz.rul


ALPHABET a ^ b c C d e < f g h i { j J k K l m n ~ o O * p q Q r s t u > v w x y z Z +
NULL 0
ANY @
BOUNDARY #
SUBSET V a e i o u ^ < { * >
SUBSET BACK u o a > * ^
SUBSET FRONT e i < {
SUBSET LOW e o a < * ^
SUBSET HIGH i u { >
SUBSET CO b c d f g h j J k l m n ~ p q r s t v w x y z Z
SUBSET B a o


RULE "Default Character Pairings" 1 41
a e i o u ^ < { * > b c C d f g h j J k K l m n ~ O p q Q r s t v w x y z Z + @ #
a e i o u ^ < { * > b c 0 d f g h j g k c l m n ~ u p q e r s t v w x y z z 0 @ #
1: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1



RULE "J realized as j" 5 5
J J + B @
j @ 0 B @
1: 4 2 1 1 1
2: 4 2 3 1 1
3: 4 2 1 0 1
4. 0 0 5 0 0
5. 0 0 0 1 0

RULE "z realized as c" 9 8
z z + e 0 s @ CO
c @ 0 e e s @ CO
1: 4 2 1 1 0 7 1 7
2: 4 2 3 1 0 7 1 7
3: 4 2 3 0 0 0 1 7
4. 0 0 5 0 0 0 0 0
5. 0 0 0 1 6 0 0 0
6. 0 0 0 0 0 7 0 0
7: 4 2 8 1 0 7 1 7
8. 0 0 0 1 9 0 1 0
9. 0 0 0 0 0 7 0 0


RULE "pluralization" 7 7
CO + 0 s z e @
CO 0 e s c e @
1: 2 1 0 2 5 1 1
2: 2 3 0 2 5 1 1
3. 0 1 4 0 5 1 1
4. 0 0 0 2 0 0 0
5. 0 6 0 0 0 0 0
6. 0 0 7 0 0 1 0
7. 0 0 0 2 0 0 0

END


martdokuz.lex


;State Name Arc transition label abbreviation
Begin: N_ROOT V_ROOT_AR V_ROOT_ER V_ROOT_IR V_ROOT_IRG1 V_ROOT_IRG2 V_ROOT_IRG3 V_ROOT_IRG4 V_ROOT_IRG5 END
N_Root1: NUMBER
Number: END

V_Root_Ar: V_SUFFIX1 V_SUFFIX2 V_SUFFIX3 V_SUFFIX4 V_SUFFIX5 V_SUFFIX17 V_SUFFIX18 V_SUFFIX19 V_SUFFIX20 V_SUFFIX21 V_SUFFIX16 END
V_Root_Er: V_SUFFIX16 V_SUFFIX22 V_SUFFIX23 V_SUFFIX24 V_SUFFIX25 V_SUFFIX26 V_SUFFIX27 V_SUFFIX32 V_SUFFIX33 V_SUFFIX34 V_SUFFIX35 END
V_Root_Ir: V_SUFFIX1 V_SUFFIX2 V_SUFFIX3 V_SUFFIX4 V_SUFFIX5 V_SUFFIX6 V_SUFFIX7 V_SUFFIX8 V_SUFFIX9 V_SUFFIX10 V_SUFFIX16 END
V_Root_Irg1: V_SUFFIX1 V_SUFFIX2 V_SUFFIX3 V_SUFFIX4 V_SUFFIX5 V_SUFFIX6 V_SUFFIX7 V_SUFFIX8 V_SUFFIX9 V_SUFFIX10 V_SUFFIX16 END

V_Root_Irg2: V_SUFFIX1 V_SUFFIX2 V_SUFFIX3 V_SUFFIX4 V_SUFFIX5 V_SUFFIX6 V_SUFFIX7 V_SUFFIX8 V_SUFFIX9 V_SUFFIX10 V_SUFFIX16 END
V_Root_Irg3: V_SUFFIX1 V_SUFFIX2 V_SUFFIX3 V_SUFFIX4 V_SUFFIX5 V_SUFFIX11 V_SUFFIX12 V_SUFFIX13 V_SUFFIX14 V_SUFFIX15 V_SUFFIX16 END
V_Root_Irg4: V_SUFFIX6 V_SUFFIX7 V_SUFFIX8 V_SUFFIX9 V_SUFFIX10 END
V_Root_Irg5: V_SUFFIX16 V_SUFFIX17 V_SUFFIX18 V_SUFFIX19 V_SUFFIX20 V_SUFFIX21 V_SUFFIX27 V_SUFFIX28 V_SUFFIX29 V_SUFFIX30 V_SUFFIX31 END



V_Suffix1: END
V_Suffix2: END
V_Suffix3: END
V_Suffix4: END
V_Suffix5: END
V_Suffix6: END
V_Suffix7: END
V_Suffix8: END
V_Suffix9: END

;; subjunctive suffixes for irregular verb group 3

V_Suffix10: END
V_Suffix11: END
V_Suffix12: END
V_Suffix13: END
V_Suffix14: END
V_Suffix15: END

;; Suffixes for infinitive markers

V_Suffix16: END

;; Suffixes for pres subj of ar verbs like cruzar
V_Suffix17: END
V_Suffix18: END
V_Suffix19: END
V_Suffix20: END
V_Suffix21: END

;; Suffixes for pres subj of er verbs like coger
V_Suffix22: END
V_Suffix23: END
V_Suffix24: END
V_Suffix25: END
V_Suffix26: END

V_Suffix27: END
V_Suffix28: END
V_Suffix29: END
V_Suffix30: END
V_Suffix31: END
V_Suffix32: END
V_Suffix33: END
V_Suffix34: END
V_Suffix35: END

;; Now we have the specifications for expanding out the arc transition
;; abbreviations and the next states they go to
;; Transition
;; Label name Next state Output on making this arc transition

N_ROOT:
l^piz N_Root1 Noun(pencil)
ciudad N_Root1 Noun(city)
bota N_Root1 Noun(boot)

;; more entries for root nouns follow here

V_ROOT_AR:
hablar V_Root_Ar Verb(speak)

V_ROOT_ER:
coJ V_Root_Er Verb(catch,seize,grab)

V_ROOT_IRG1:
conoC V_Root_Irg1 Verb(know)
pareC V_Root_Irg1 Verb(seem)

V_ROOT_IRG2:
venC V_Root_Irg2 Verb(conquer, defeat)
ejerC V_Root_Irg2 Verb(exercise)

V_ROOT_IRG3:
lleg V_Root_Irg3 Verb(arrive)
pag V_Root_Irg3 Verb(pay)

V_ROOT_IRG4:
KOQ V_Root_Irg4 Verb(cook)

V_ROOT_IRG5:
cruz V_Root_Irg5 Verb(cross)

V_ROOT_IR:
abr V_Root_Ir Verb(open)

V_SUFFIX1:
+o V_Suffix1 (pres.ind,1p,sg)
+co V_Suffix1 (pres.ind,1p,sg)
+zo V_Suffix1 (pres.ind,1p,sg)

V_SUFFIX2:
+as V_Suffix2 (pres.ind,2p,sg)
+es V_Suffix2 (pres.ind,2p,sg)
+ces V_Suffix2 (pres.ind,2p,sg)

V_SUFFIX3:
+a V_Suffix3 (pres.ind,3p,sg)
+e V_Suffix3 (pres.ind,3p,sg)
+ce V_Suffix3 (pres.ind,3p,sg)

V_SUFFIX4:
+amos V_Suffix4 (pres.ind,1p,pl)
+emos V_Suffix4 (pres.ind,1p,pl)
+imos V_Suffix4 (pres.ind,1p,pl)
+cemos V_Suffix4 (pres.ind,1p,pl)

V_SUFFIX5:
+an V_Suffix5 (pres.ind,3p,pl)
+en V_Suffix5 (pres.ind,3p,pl)
+cen V_Suffix5 (pres.ind,3p,pl)

V_SUFFIX6:
+e V_Suffix6 (pres.subj,1p,sg)
+a V_Suffix6 (pres.subj,1p,sg)
+zca V_Suffix6 (pres.subj,1p,sg)
+za V_Suffix6 (pres.subj,1p,sg)

V_SUFFIX7:
+es V_Suffix7 (pres.subj,2p,sg)
+as V_Suffix7 (pres.subj,2p,sg)
+zcas V_Suffix7 (pres.subj,2p,sg)
+zas V_Suffix7 (pres.subj,2p,sg)

V_SUFFIX8:
+e V_Suffix8 (pres.subj,3p,sg)
+a V_Suffix8 (pres.subj,3p,sg)
+zca V_Suffix8 (pres.subj,3p,sg)
+za V_Suffix8 (pres.subj,3p,sg)

V_SUFFIX9:
+emos V_Suffix9 (pres.subj,1p,pl)
+amos V_Suffix9 (pres.subj,1p,pl)
+zcamos V_Suffix9 (pres.subj,1p,pl)
+zamos V_Suffix9 (pres.subj,1p,pl)

V_SUFFIX10:
+en V_Suffix10 (pres.subj,3p,pl)
+an V_Suffix10 (pres.subj,3p,pl)
+zcan V_Suffix10 (pres.subj,3p,pl)
+zan V_Suffix10 (pres.subj,3p,pl)

V_SUFFIX11:
+ue V_Suffix11 (pres.subj,1p,sg)

V_SUFFIX12:
+ues V_Suffix12 (pres.subj,2p,sg)

V_SUFFIX13:
+ue V_Suffix13 (pres.subj,3p,sg)

V_SUFFIX14:
+uemos V_Suffix14 (pres.subj,1p,pl)

V_SUFFIX15:
+uen V_Suffix15 (pres.subj,3p,pl)

V_SUFFIX16:
+er V_Suffix16 (infinitive)
+ar V_Suffix16 (infinitive)
+ir V_Suffix16 (infinitive)

;; suffixes for subj pres for ar verbs

V_SUFFIX17:
+e V_Suffix17 (pres.subj,1p,sg)

V_SUFFIX18:
+es V_Suffix18 (pres.subj,2p,sg)

V_SUFFIX19:
+e V_Suffix19 (pres.subj,3p,sg)

V_SUFFIX20:
+emos V_Suffix20 (pres.subj,1p,pl)

V_SUFFIX21:
+en V_Suffix21 (pres.subj,3p,pl)

;;suffixes for pres subj for er verbs

V_SUFFIX22:
+a V_Suffix22 (pres.subj,1p,sg)

V_SUFFIX23:
+as V_Suffix23 (pres.subj,2p,sg)

V_SUFFIX24:
+a V_Suffix24 (pres.subj,3p,sg)

V_SUFFIX25:
+amos V_Suffix25 (pres.subj,1p,pl)

V_SUFFIX26:
+an V_Suffix26 (pres.subj,3p,pl)

V_SUFFIX27:
+o V_Suffix27 (pres.ind,1p,sg)


V_SUFFIX28:
+as V_Suffix28 (pres.ind,2p,sg)

V_SUFFIX29:
+a V_Suffix29 (pres.ind,3p,sg)

V_SUFFIX30:
+amos V_Suffix30 (pres.ind,1p,pl)

V_SUFFIX31:
+an V_Suffix31 (pres.ind,3p,pl)

V_SUFFIX32:
+es V_Suffix32 (pres.ind,2p,sg)

V_SUFFIX33:
+e V_Suffix33 (pres.ind,3p,sg)

V_SUFFIX34:
+emos V_Suffix34 (pres.ind,1p,pl)

V_SUFFIX35:
+en V_Suffix35 (pres.ind,3p,pl)

NUMBER:
+s Number +PL
'' Number .SG

;; more entries for main verbs that do not have prefixes would go here
;; the very last entry in the lexicon automaton is the transition arc
;; label that specifies the end of a word; note how it loops back to the
;; Begin state and outputs nothing (special keyword None)
;; Label name Next state Output on making this arc transition

END:
'#' Begin None