Saturday, February 25, 2006

Lab 1a




1. The way to work around this issue (the doubled r in inflected forms) is to add the stress/pronunciation information to the input, as in
Pronunciation: ri-'f&r (from Merriam-Webster)
(A toy sketch of the idea follows below.)

In fact, this was the most time-consuming question for me, partly because I'm not a native speaker. But it is still better than trying similar things in a completely different language, e.g. Spanish.
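To illustrate the idea, here is a toy Python sketch of my own (the stressed-vowel marking and the function are hypothetical, not the lab's actual rule file): once the input carries the stress information, a doubling step can condition on it, which plain spelling alone cannot do.

def add_ing(stem):
    # stem marks the stressed vowel with an uppercase letter, e.g. "refEr" vs. "Offer"
    # (a made-up notation standing in for the pronunciation string)
    vowels = "aeiou"
    if stem[-1].lower() not in vowels and stem[-2].isupper():
        # the word ends in a consonant right after the stressed vowel: double it
        return stem.lower() + stem[-1] + "ing"   # refEr -> referring
    return stem.lower() + "ing"                  # Offer -> offering

print(add_ing("refEr"))   # referring
print(add_ing("Offer"))   # offering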

2. Added the line

dog V_Root4 Verb(dog)

under V_ROOT_NO_PREF.

3. This statement is incorrect, because the recognizer actually checks different combinations of possible words in a branch, moving forward letter by letter. If a given path does not match, the recognizer backtracks and tries an alternative path; e.g. it can try flies and then go back and try flier.

As the algorithm is explained on the PC-KIMMO website:

“The recognizer also consults the lexicon. The lexical items recorded in the lexicon are structured as a letter tree. When the recognizer tries a lexical character, it moves down the branch of the letter tree that has that character as its head node. If there is no branch starting with that letter, the lexicon blocks further progress and forces the recognizer to backtrack and try a different lexical character.”

So the situation can arise where a path leading to the wrong branch is chosen.
The same thing would not happen with generation, because for generation we do not have to consult the lexicon letter tree; in a parallel fashion, the generator simply outputs the surface form.
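Here is a minimal Python sketch of that search, purely my own illustration (the toy lexicon and the correspondence table are made up, and in PC-KIMMO the candidate lexical characters come from the two-level rules rather than a hard-coded dictionary):

def build_letter_tree(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = w                          # mark the end of a lexical entry
    return root

# hypothetical surface-to-lexical correspondences, e.g. surface i may be lexical i or y
CANDIDATES = {"i": ["i", "y"]}

def recognize(surface, node):
    matches = []
    if not surface:
        if "$" in node:
            matches.append(node["$"])
        return matches
    for lex_ch in CANDIDATES.get(surface[0], [surface[0]]):
        child = node.get(lex_ch)
        if child is None:
            continue                           # branch blocked: backtrack and try the next candidate
        matches += recognize(surface[1:], child)
    return matches

tree = build_letter_tree(["flies", "flier", "fly"])
print(recognize("flies", tree))                # ['flies']; the f-l-y branch is tried and abandoned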

4. I chose to create a single automaton. In terms of extensibility, this approach would not fare well: if a language had many rules, trying to combine them all into a single automaton would be a bad idea, since it gives up the benefits of modularity and extensibility. As for transparency, I think this solution is more intuitive in this particular case, because in effect we are dealing with a single context with left and right environments. However, a combination of many rules with very distinct contexts might not remain transparent.

The combined automaton is written up as follows:

RULE "N realized as m" 4 6
N N p p m @
m n p m m @
1: 3 4 1 0 2 1
2: 3 4 0 2 2 1
3: 0 0 0 1 0 0
4: 3 4 0 0 1 1
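Below is a small Python sketch of how I understand such a table to be run, again just my own illustration rather than PC-KIMMO itself: the columns are the feasible lexical:surface pairs, @:@ is the catch-all, 0 means the rule blocks the correspondence, and the ':' after each state number marks the state as final.

PAIRS = [("N", "m"), ("N", "n"), ("p", "p"), ("p", "m"), ("m", "m"), ("@", "@")]

TABLE = {                                      # state -> next state, one entry per column
    1: [3, 4, 1, 0, 2, 1],
    2: [3, 4, 0, 2, 2, 1],
    3: [0, 0, 0, 1, 0, 0],
    4: [3, 4, 0, 0, 1, 1],
}
FINAL = {1, 2, 3, 4}                           # every state in this rule carries a ':'

def column(lexical, surface):
    # use the exact pair if it is listed, otherwise fall back to the @:@ catch-all column
    pair = (lexical, surface)
    return PAIRS.index(pair) if pair in PAIRS else PAIRS.index(("@", "@"))

def accepts(pairs):
    state = 1
    for lexical, surface in pairs:
        state = TABLE[state][column(lexical, surface)]
        if state == 0:
            return False                       # the automaton blocks this correspondence
    return state in FINAL

# made-up forms: lexical kaNpat surfacing as kammat is licensed,
# while realizing N as n and leaving the p unchanged is blocked
print(accepts(zip("kaNpat", "kammat")))        # True
print(accepts(zip("kaNpat", "kanpat")))        # False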

Sunday, February 19, 2006

Google Talk

The Cognitive Machinery Side

Corpus-based approaches assume a finite number of possible sentences generated by an individual over a lifetime. Let's analyze the underlying assumptions of a corpus-based inference scheme.

First of all, the corpus of speech produced over a human lifetime is only arguably finite. Take the example of a simple sentence with its words rearranged, creating similar or different meanings.

Dennis listened to the instructor while he did a crossword.
The instructor did a crossword while he listened to Dennis.
The instructor listened to Dennis while he did a crossword.
Listening to Dennis, the instructor did a crossword.
Doing a crossword, the instructor listened to Dennis.
While listening to the instructor, Dennis did a crossword.
While doing a crossword, the instructor listened to Dennis.
While doing a crossword, the instructor listened to the instructor. (Repetition may be possible for some words and contexts)
……
Now let's think about the number of words and phrases in the English language, and suppose, as in the example above, that a sentence has 10 words.
Take the 10-element combinations of 203,000 word-sense pairs from WordNet and multiply by 8 (the number of different orderings enumerated above). That calculation does not even account for repetition of some of the words, nor for sentences of lengths other than 10 words.
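For a rough sense of the magnitude, here is the back-of-the-envelope calculation in Python, using the 203,000 word-sense figure and the 8 orderings from the example above (the numbers are only illustrative):

import math

word_senses = 203_000        # approximate number of word-sense pairs in WordNet
sentence_length = 10
orderings = 8                # rearrangements enumerated in the example above

total = math.comb(word_senses, sentence_length) * orderings
print(f"{total:.2e}")        # roughly 2.6e+47 distinct 10-word selections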

The incredible number of possible sentences is a troubling factor, but it is not the sole reason that makes corpus-based schemes prohibitive. Even if we had an infinite amount of storage and infinite means to collect excerpts for GoogleTalk, we would still be saddled with bold assumptions about comprehensibility.

The reason is that with corpus-based approaches we are dealing with experimental data. We observe snippets of speech and make fancy calculations based on occurrence frequencies, but we are not making sense of what is in the speech. This is where model-based approaches fare well.

But why would we want to make sense of speech anyway?

First of all, there is the problem of inclusion: a "talk"'s eligibility to be included in the GoogleTalk repository. There are no definitive guidelines for what a standard person is, let alone for a standard set of all sentences to be included in GoogleTalk from a standard person over a lifetime. A simple example would be the classic IQ test case, where minorities do worse because of wording that is aligned toward their White counterparts. Across diverse cultures, jobs, socio-economic classes, and genders, people have different lexicons. Should we take the talk of a corporate lawyer? Should an African-American athlete be representative?

We can only hope to come close to being representative. Model-based approaches, however, try to reconstruct the machinery of speech from domain knowledge. We can argue that, given a base of domain information, we could account for the possible variations across individuals, translate between them, and enumerate all possible talks in a way more comprehensible than the corpus-based approach, because we are re-creating the machinery under the assumption that humans' representations of what they want to say outnumber what they actually say.








Finite Corpora Side

Natural language is fuzzy, and creativity cannot be modeled. These are the two underlying reasons why empirical approaches would fare better than model-based approaches.

Let's start with the first statement. Natural language is "unruly" in the sense that our speech is characterized by false starts, hesitations, and ungrammatical sentences. A model-based approach would exclude ungrammatical examples because they do not make sense and may not have been structured by rule in the first place.

We also speak in obscure sentences, long and complex sentences, and so on. Let's assume that through careful modeling we have handled the structure of complex sentences well. Unfortunately, linguistic rules can only help us to a certain extent; enumerating meaningful information often requires additional knowledge. Take an example:

“If E.T. were to call, how would we explain the distinction between left and right? Hardly surprisingly, Kant did not consider the problem directly, but his 1768 paper suggests a serious difficulty for anyone trying to talk to E.T. about hands and gloves.”

To make sense of natural language well enough to generate such a sentence, we would need a substantial amount of domain knowledge. At a minimum, we would need to know who E.T. is and who Kant is.

Perhaps the most important barrier that makes a model-based approach infeasible is the notion of creative thinking. That is the trouble with model-based approaches: we know Kant is a person (with domain knowledge) and that he could plausibly have written a paper, but the cognitive machine we create would not be equipped to entertain thoughts on how to talk to E.T. Machines do not have the "will" to synthesize new ideas from old and new input. Suppose we equip the machine with the will to entertain thoughts; we can model will in terms of goals. But still, we cannot model the will driven by emotions, which spontaneously fires the creative thinking process.

Additionally, empiricism gives us the power to encompass abstract notions we could not have developed if we were to stick to a model-based approach. That is, a writer's abstract statement such as "hear the crying of the lonely swan in my breast" could not have been generated by a model-based system.

Natural language is in principle infinite, whereas corpora are finite, so many examples will be missed even if we store millions of entries in GoogleTalk. But we can argue that the domain knowledge and rule creation required for modeling is as prohibitive as the actual storage of sentences, while corpus-based storage has the additional advantage of capturing unstructured speech.