Method and apparatus for extracting short runs of ambiguity...

Data processing: generic control systems or specific application – Specific application – apparatus or process – Digital audio data processing system

Reexamination Certificate

Rate now

[ 0.00 ] – not rated yet Voters 0 Comments 0

Details Method and apparatus for extracting short runs of ambiguity... Method and apparatus for extracting short runs of ambiguity...

: 2000-12-18
: 2004-07-06
: Woo, Stella (Department: 2643)
: Data processing: generic control systems or specific application
: Specific application, apparatus or process
: Digital audio data processing system

: C704S009000
: Reexamination Certificate
: active
: 06760636
: ABSTRACT:

BACKGROUND OF THE INVENTION
1. Field of the Invention
The invention relates to finite-state language processing, and more particularly to methods for efficiently processing finite-state networks in language processing and other applications.
2. Description of Related Art
Many basic steps in language processing, ranging from tokenization to phonological and morphological analysis, disambiguation, spelling correction, and shallow parsing can be performed efficiently by means of finite-state transducers. Such transducers are generally compiled from regular expressions, a formal language for representing sets and relations. Although regular expressions and methods for compiling them into automata have been part of elementary computer science for decades, the application of finite-state transducers to natural-language processing has given rise to many extensions to the classical regular-expression calculus.
The term language is used herein in a general sense to refer to a set of strings of any kind. A string is a concatenation of zero or more symbols. In the examples set forth below, the symbols are, in general, single characters such as “a”, but user-defined multicharacter symbols such as “+Noun” are also possible. Multicharacter symbols are considered as atomic entities rather than as concatenations of single-character strings. A string that contains no symbols at all is called the empty string and the language that contains the empty string but no other strings is known as the empty string language. A language that contains no strings at all, not even the empty string, is called the empty language or null language. The language that contains every possible string of any length is called the universal language.
A set of ordered string pairs such as {<“a”, “bb”>, <“cd”, “”>} is called a relation. The first member of a pair is called the upper string, and the second member is called the lower string. A string-to-string relation is a mapping between two languages: the upper language and the lower language. They correspond to what is usually called the domain and the range of a relation. In this case, the upper language is {“a”, “cd”} and the lower language is {“bb”, “”}. A relation such as {<“a”, “a”>} in which every pair contains the same string twice is called an identity relation. If a relation pairs every string with a string that has the same length, the relation is an equal-length relation. Every identity relation is obviously an equal-length relation.
Finite-state automata are considered to be networks, or directed graphs that consist of states and labeled arcs. A network contains a single initial state, also called the start state, and any number of final states. In the figures presented herewith, states are represented as circles and arcs are represented as arrows. In the included diagrams, the start state is always the leftmost state and final states are marked by a double circle. Each state acts as the origin for zero or more arcs leading to some destination state. A sequence of arcs leading from the initial state to a final state is called a path. An arc may be labeled either by a single symbol such as “a” or a symbol pair such as “a:b”, where “a” designates the symbol on the upper side of the arc and “b” the symbol on the lower side. If all the arcs of a network are labeled by a single symbol, the network is called a simple automaton; if at least one label is a symbol pair the network is a transducer. Simple finite-state automata and transducers will not be treated as different types of mathematical objects herein. The framework set forth herein reflects closely the data structures in the Xerox implementation of finite-state networks.
A few simple examples illustrating some linguistic applications of finite-state networks are set forth below. The following sections will describe how such networks can be constructed.
Every path in a finite-state network encodes a string or an ordered pair of strings. The totality of paths in a network encodes a finite-state language or a finite-state relation. For example, the network illustrated in
FIG. 1
encodes the language {“clear”, “clever”, “ear”, “ever”, “fat”, “fatter”}.
Each state in
FIG. 1
has a number, thereby facilitating references to paths through the network. There is a path for each of the six words in the language. For example, the path <0-e-3-v-9-e-4-r-5> represents the word “ever”. A finite-state network is a very efficient encoding for a word list because all words beginning and ending in the same way can share a part of the network and every path is distinct from every other path.
If the number of words in a language is finite, then the network that encodes it is acyclic; that is, no path in the network loops back onto itself. Such a network also provides a perfect hash function for the language, a function that assigns or maps each word to a unique number in the range from 0 to n−1, where n is the number of paths in the network.
The network illustrated in
FIG. 2
is an example of a lexical transducer. It encodes the relation {<“leaf+NN”, “leaf”>, <“leaf+NNS”, “leaves”>, <“left+JJ”, “left”>, <“leave+NN”, “leave”>, <“leave+NNS”, “leaves”>, <“leave+VB”, “leave”>, <“leave+VBZ”, “leaves”>, <“leave+VBD”, “left”>}. The substrings beginning with “+” are multicharacter symbols.
In order to make the diagrams less cluttered, it is traditional to combine several arcs into a single multiply-labeled arc. For example, the arc from state
5
to state
6
abbreviates four arcs that have the same origin and destination but a different label: “+NN:0”, “+NNN:s”, “+VB:0”, “+VBZ:s”. In this example, “0” is the epsilon symbol, standing for the empty string. Another important convention illustrated in
FIG. 2
is that identity pairs such as “e:e” are represented as a single symbol “e”. Because of this convention, the network in
FIG. 1
could also be interpreted as a transducer for the identity relation on the language.
The lower language of the lexical transducer in
FIG. 2
consists of inflected surface forms “leaf”, “leave”, “leaves”, and “left” (i.e., language to be modeled). The upper language consists of the corresponding lexical forms or lemmas, each containing a citation form of the word followed by a part-of-speech tag.
Lexical transducers can be used for analysis or for generation. For example, to find the analyses for the word “leaves”, one needs to locate the paths that contain the symbols “l”, “e”, “a”, “v”, “e”, and “s” as such on the lower side of the arc label. The network in
FIG. 2
contains three such paths:
0-1-1-e-2-a-3-v-4-e-5-+NNS:s-6,
0-1-1-e-2-a-3-v-4-e-5-+VBZ:s-6,
0-1-1-e-2-a-3-f:v-8-+NNS:e-9-0:s -6.
The result of the analysis is obtained by concatenating the symbols on the upper side of the paths: “leave+NNS”, “leave+VBZ”, and “leaf+NNS”.
The process of generating a surface form from a lemma, say “leave+VBD”, is the same as for analysis except that the input form is matched against the upper side arc labels and the output is produced from the opposite side of the successful path or paths. In the case at hand, there is only one matching path:
0-1-1-e-2-a:f-12-v:t-13-e:0-14-+VBD:0-6
This path maps “leave+VBD” to “left”, and vice versa.
The term “apply” is used herein to describe the process of finding the path or paths that match a given input and returning the output. As the example above shows, a transducer can be applied downward or upward. There is no privileged input side. In the implementation described here, transducers are inherently bi-directional.
Lexical transducers provide a very efficient method for morphological analysis and generation. A comprehensive analyzer for a language such as English, French, or German contains tens of thousands of states and hundreds of thousands of arcs, but it can be compressed to a relatively small size in the range of

Affiliated with

Kempe Andre

Inventor

[ 0.00 ] – not rated yet Voters 0 Comments 0

Also associated with

Oliff & Berridg,e PLC

Law Firm

[ 0.00 ] – not rated yet Voters 0 Comments 0

Woo Stella

Examiner

[ 0.00 ] – not rated yet Voters 0 Comments 0

Xerox Corporation

Corporate Assignee

[ 0.00 ] – not rated yet Voters 0 Comments 0

LandOfFree

Say what you really think

Search LandOfFree.com for the USA inventors and patents. Rate them and share your experience with other people.

Rating

Method and apparatus for extracting short runs of ambiguity... does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Method and apparatus for extracting short runs of ambiguity..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Method and apparatus for extracting short runs of ambiguity... will most certainly appreciate the feedback.

Rate now

Comments { 0 }

Profile ID: LFUS-PAI-O-3251829

All data on this website is collected from public sources. Our data reflects the most accurate information available at the time of publication.

Canada

Charities
Companies
MP Candidates
Patents
Employee Salary Disclosure

World

Places of the World
Scientific Papers

United States

Banks
Companies
Counties
Patents
Employee Salary Disclosure