Google Talk
The Cognitive Machinery Side
Corpora-based approaches assume that the number of possible sentences generated by an individual over a lifetime is finite. Let's analyze the underlying assumptions of a corpora-based inference scheme.
First of all, the corpus of speech produced in a human lifetime is only arguably finite. Take the example of a simple sentence whose words can be rearranged to create similar or different meanings:
Dennis listened to the instructor while he did a crossword.
The instructor did a crossword while he listened to Dennis.
The instructor listened to Dennis while he did a crossword.
Listening to Dennis, the instructor did a crossword.
Doing a crossword, the instructor listened to Dennis.
While listening to the instructor, Dennis did a crossword.
While doing a crossword, the instructor listened to Dennis.
While doing a crossword, the instructor listened to the instructor. (Repetition may be possible for some words and contexts)
……
Now let's think about the number of words and phrases in the English language.
Additionally, suppose the sentences above are typical: ten words long, with eight possible arrangements. Take the 10-element combinations of the 203,000 word-sense pairs in WordNet and multiply by 8 (the number of different arrangements possible in this case). And that calculation does not even account for repetition of some of the words in the corpora, nor for sentences other than those with 10 words.
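As a rough sanity check, here is a minimal Python sketch of that back-of-the-envelope estimate. The constants are taken directly from the figures above; the count deliberately ignores word repetition and sentence lengths other than ten.

```python
from math import comb

# Figures from the estimate above: 203,000 word-sense pairs in
# WordNet, sentences of exactly 10 distinct words, and 8 possible
# arrangements per choice of words.
WORD_SENSES = 203_000
SENTENCE_LENGTH = 10
ARRANGEMENTS = 8

sentences = comb(WORD_SENSES, SENTENCE_LENGTH) * ARRANGEMENTS
print(f"{sentences:.2e}")  # on the order of 10**47
```

Even under these deliberately restrictive assumptions, the count is astronomically larger than anything a repository could hope to store.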
The staggering number of possible sentences is a troubling factor, but it is not the sole reason that makes corpora-based schemes prohibitive. Even if we had an infinite amount of storage and infinite means to collect excerpts for GoogleTalk, we would still be stuck with bold assumptions about comprehensibility.
The reason is that with corpora-based approaches, we are dealing with experimental data. We observe snippets of speech and make elaborate calculations based on occurrence frequency, but we are not making sense of what is in the speech. This is where model-based approaches fare well.
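To make "calculations based on occurrence frequency" concrete, here is a minimal sketch of the kind of counting a corpora-based scheme performs. The miniature corpus and the choice of bigram statistics are illustrative assumptions, not part of the original argument:

```python
from collections import Counter

# Hypothetical miniature corpus; a real GoogleTalk repository would
# hold millions of utterances.
corpus = [
    "the instructor did a crossword",
    "dennis listened to the instructor",
    "the instructor listened to dennis",
]

# Count adjacent word pairs (bigrams) across the corpus.
bigrams = Counter(
    pair
    for sentence in corpus
    for pair in zip(sentence.split(), sentence.split()[1:])
)

# Relative frequency of the pair ("the", "instructor"): 3/12 = 0.25.
total = sum(bigrams.values())
print(bigrams[("the", "instructor")] / total)
```

Note that nothing in this computation understands who Dennis or the instructor is; it only tallies co-occurrences, which is exactly the point.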
But why would we want to make sense of speech anyway?
First of all, there is the problem of inclusion: a "talk"'s eligibility to be included in the GoogleTalk repository. There are no definitive guidelines for what a standard person is, let alone a standard set of all sentences a standard person would produce over a lifetime for inclusion in GoogleTalk. A simple example is the classic IQ-test case, where minorities do worse because the wording is aligned with the experience of their White counterparts. Across cultures, occupations, socio-economic classes, and genders, people have different lexicons. Should we take the talk of a corporate lawyer? Should an African-American athlete be representative?
We can only hope to come close to being representative. Model-based approaches, however, try to reconstruct the machinery of speech with domain knowledge. We can argue that, given a base of domain information, we could account for the possible variations across individuals, translate between them, and enumerate all possible talks in a way more comprehensible than a corpora-based approach, because we are re-creating the machinery under the assumption that humans' representations of what they want to say outnumber the things they actually say.
The Finite Corpora Side
Natural language is fuzzy, and creativity cannot be modeled. These are the two underlying reasons why empirical approaches would fare better than model-based approaches.
Let's start with the first statement. Natural language is "unruly" in the sense that our speech is characterized by false starts, hesitations, and ungrammatical sentences. A model-based approach would exclude ungrammatical examples because they do not make sense and may not have been structured by reason in the first place.
We also speak in obscure sentences, long and complex sentences, and so on. Let's assume that through sophisticated modeling we have dealt well with the structure of complex sentences. Unfortunately, linguistic rules can only help us to a certain extent; enumerating meaningful information often requires additional knowledge. Take an example:
“If E.T. were to call, how would we explain the distinction between left and right? Hardly surprisingly, Kant did not consider the problem directly, but his 1768 paper suggests a serious difficulty for anyone trying to talk to E.T. about hands and gloves.”
To make sense of natural language well enough to generate such a sentence, we would need a substantial amount of domain knowledge. At a minimum, we would need to know who E.T. is and who Kant is. Perhaps the most important barrier that makes a model-based approach infeasible is the notion of creative thinking. That is the trouble with model-based approaches: we know (with domain knowledge) that Kant was a person and that he could have written a paper, but the cognitive machine we create would not be equipped to entertain thoughts about how to talk to E.T. Machines do not have the "will" to synthesize new ideas from old and new input. Suppose we equip the machine with the will to entertain thoughts; we can model will in terms of goals. But we still cannot model the will that arises from emotions, which spontaneously fire the creative thinking processes.
Additionally, empiricism gives us the power to capture abstract notions we could not have developed had we stuck to a model-based approach. That is, a writer's abstract statement such as "hear the crying of the lonely swan in my breast" could never have been generated by a model-based system.
Natural language is in principle infinite, whereas corpora are finite, so many examples will be missed even if we try to store millions of entries in GoogleTalk. But we can argue that the domain knowledge and rule creation required for modeling are as prohibitive as the actual storage of sentences, and that corpora-based storage has the additional advantage of capturing unstructured speech.
