Reexamination Certificate
2000-05-31
2004-10-26
Chawan, Vijay (Department: 2654)
Data processing: speech signal processing, linguistics, language
Linguistics
Natural language
C704S002000, C704S004000, C704S008000, C704S007000, C707S793000
active
06810375
ABSTRACT:
FIELD OF THE INVENTION
The present invention relates to the field of fully automated linguistic analysis of unrestricted text in different languages. Specifically, the present invention relates to an automatic method, and a corresponding apparatus, for segmentation of a stream of text elements comprising analyzed tokens into one or more initial clauses.
BACKGROUND OF THE INVENTION
Although current technology for parsing whole sentences in unrestricted text has improved in recent years, the level of parsing accuracy is still not sufficient to support long-intended applications of parsing technology to information systems. For example, existing information systems cannot extract from unrestricted text specific pieces of information that are parallel in lexical, constructional and semantic respects.
Examples of parallel pieces of information are portions of text that have the same agent (=grammatical subject), or the same acted upon (=grammatical object), or involve the same action (=content verb). Such extraction of information is currently only possible from texts in restricted domains. This is due to the fact that commonly used methods for information extraction crucially depend on manually acquired, domain-specific world knowledge. Consequently, there are large and growing bodies of texts that contain valuable pieces of information that cannot be accessed by standard techniques of information retrieval, because the latter are currently restricted to retrieval of whole documents.
One principal reason why current parsing technology fails to achieve the accuracy required for large-scale applications to unrestricted text is the well-known observation in the art that the performance of parsers degrades as the length of input sentences increases. This is due to the fact that parsers target full sentences as the units to parse. As the length of a sentence increases, so does the combinatorial explosion of alternative ways to combine the well-formed substrings of a sentence that the parser has found.
In order to improve the coverage and accuracy of parsers for unrestricted text, a new divide-and-conquer strategy is emerging in parsing. The strategy involves the use of simple, finite state parsing techniques in a phase that is preparatory to ‘real’ parsing, which uses more complex techniques. The object of the preparatory stage is to partition text exhaustively into a sequence of units referred to as chunks or segments, in order to facilitate and improve later processing.
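By way of illustration only, the following Python sketch shows this two-phase idea in its simplest form: a cheap first pass exhaustively partitions a part-of-speech tagged sentence into short segments, and only those segments are handed to a more expensive parser. The delimiter rule and the parser stub are assumptions made for the example and are not taken from any particular technique discussed here.

def cheap_segment(tagged):
    """Exhaustively partition a list of (word, tag) pairs into segments,
    splitting naively at punctuation tags (illustrative rule only)."""
    segments, current = [], []
    for word, tag in tagged:
        current.append((word, tag))
        if tag in {",", ";", ":"}:
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments

def expensive_parse(segment):
    """Stand-in for a full parser whose cost grows quickly with segment length."""
    return {"tokens": [word for word, _ in segment], "length": len(segment)}

tagged = [("It", "PRP"), ("rained", "VBD"), (",", ","),
          ("so", "IN"), ("we", "PRP"), ("stayed", "VBD"), ("inside", "RB")]
for segment in cheap_segment(tagged):
    print(expensive_parse(segment))
# {'tokens': ['It', 'rained', ','], 'length': 3}
# {'tokens': ['so', 'we', 'stayed', 'inside'], 'length': 4}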
Clause segmentation is emerging as a recognized problem area. However, there is no agreement among practitioners in the field on the definition of the clauses that should result from clause segmentation, or on terminology. Units that are clauses or ‘clause like’ are referred to by many different names.
For the purpose of the discussion in this background section, a simple clause is a unit of information that roughly corresponds to a simple proposition, or fact. Current information retrieval technology is not based on clauses as units of information that can be used in rapid creation of databases of reported facts that involve agents and actions of interest to end-users of information systems. An important motivation for clause segmentation is that it enables automatic recognition of basic grammatical relations within clauses (subject, object, etc.). Because of this, clause segmentation makes it possible for later processes to determine which pieces of text exhibit lexical, constructional and semantic parallelism of information.
Existing methods for identifying clauses and segmenting text into clauses rely on first finding phrases within sentences, such as noun phrases and other phrases, before finding clause units within sentences. When clause units have been found, they make it possible to determine clause boundaries, i.e. where a clause begins and ends.
In Nelson, W. & Kucera, H., “Frequency Analysis of English Usage”, 1982, Houghton Mifflin Company, Boston, pp. 549-556, hereafter Nelson & Kucera 1982, Kucera used a finite state automaton for finding verb groups in part-of-speech tagged text in the Brown corpus, and for classifying verb groups into finite and non-finite. A verb group is finite if it contains a verb in the present or past tense. A verb group is non-finite if it contains no tensed verb, i.e. if it consists of an infinitive or a present or past participle. It is commonly agreed in traditional and modern grammar that a verb group implies a predication, equivalently a clause, and that finite and non-finite predications are syntactically distinct, though related, types of predications.
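As a rough illustration of this kind of processing (and not a reproduction of Kucera's actual automaton), the following Python sketch scans a part-of-speech tagged sentence for contiguous verb groups and classifies each group as finite or non-finite. The tag names follow the Penn Treebank convention and are assumptions made for the example.

FINITE_TAGS = {"VBD", "VBP", "VBZ", "MD"}      # past, present, modal
NONFINITE_TAGS = {"VB", "VBG", "VBN", "TO"}    # infinitive, participles
VERBAL_TAGS = FINITE_TAGS | NONFINITE_TAGS

def find_verb_groups(tagged):
    """tagged: list of (word, tag) pairs; returns (group, classification) pairs."""
    groups, start = [], None
    for i, (_, tag) in enumerate(tagged + [("", "EOS")]):  # sentinel flushes a final group
        if tag in VERBAL_TAGS:
            if start is None:
                start = i
        elif start is not None:
            group = tagged[start:i]
            kind = ("finite" if any(t in FINITE_TAGS for _, t in group)
                    else "non-finite")
            groups.append((group, kind))
            start = None
    return groups

sentence = [("She", "PRP"), ("smiled", "VBD"), (",", ","),
            ("hoping", "VBG"), ("to", "TO"), ("leave", "VB"), ("early", "RB")]
for group, kind in find_verb_groups(sentence):
    print([word for word, _ in group], kind)
# ['smiled'] finite
# ['hoping', 'to', 'leave'] non-finite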
The disadvantage of Kucera's 1982 finite state automaton is that it does not address the problem of identifying the location of boundaries between predication units, i.e. it is not a method that segments text into predication units. Although a subsequent patent entitled ‘Sentence analyzer’ to Kucera et al. (U.S. Pat. No. 4,864,502) indirectly locates clause boundaries, this technique is based on first finding phrases within sentences, followed by identification of clauses, and thereafter clause boundaries.
Other techniques that analyze sentences internally first, before locating clause boundaries are known, for example: Grefenstette, G., “Light parsing as finite state filtering”, in A. Kornai (Ed), Extended Finite State Models of Language, 1999, Cambridge University Press, Cambridge, U.K., pp. 86-94; and Ramshaw, L. & Marcus, M., “Text chunking using transformation-based learning”, in Proceedings of the Third Workshop on Very Large Corpora, D. Yarowsky & K. Church, Eds, June 1995, M.I.T., Cambridge, Mass., pp. 82-94. These techniques use finite state marking transducers on part-of-speech tagged text as input. The marking transducers mark both contiguous groups of nouns and contiguous groups of verbs in the output. A sentence is implicitly equated with a predication, which is assumed to be a combination of one verb group with one or more noun groups.
A serious problem with this approach is that it gives poor results for sentences that consist of several clauses. The reason is that group marking transducers typically do not recognize sentence-internal clauses as clausal units.
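The following Python sketch (an illustration only, not the cited transducers) mimics such group marking on a part-of-speech tagged sentence and also exhibits the failure mode just described: in a sentence containing an embedded clause, verbs belonging to different clauses end up in a single verb group, so the sentence-internal clause is never recognized as a unit. The bracket notation and tag names are assumptions made for the example.

NOUN_TAGS = {"DT", "JJ", "NN", "NNS", "NNP", "PRP"}
VERB_TAGS = {"MD", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}

def mark_groups(tagged):
    """Bracket contiguous noun groups as [NG ... ] and verb groups as [VG ... ]."""
    out, open_label = [], None
    for word, tag in tagged:
        label = "NG" if tag in NOUN_TAGS else "VG" if tag in VERB_TAGS else None
        if label != open_label:
            if open_label:
                out.append("]")
            if label:
                out.append("[" + label)
            open_label = label
        out.append(word)
    if open_label:
        out.append("]")
    return " ".join(out)

tagged = [("The", "DT"), ("man", "NN"), ("who", "WP"),
          ("smiled", "VBD"), ("left", "VBD"), ("early", "RB")]
print(mark_groups(tagged))
# [NG The man ] who [VG smiled left ] early
# "smiled" and "left" belong to different clauses, but the marker lumps
# them into one verb group, hiding the embedded (sentence-internal) clause.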
There are other known techniques for clause segmentation, described in: Ejerhed, E., “Finding clauses in unrestricted text by finitary and stochastic methods”, in Second Conference on Applied Natural Language Processing, 1988, ACL, Austin, Tex., pp. 219-227, and in Abney, S. P., “Rapid incremental parsing with repair”, in Proceedings of the 6th New OED Conference, 1990, Waterloo, Ontario, University of Waterloo, pp. 1-9. For Ejerhed's and Abney's techniques, the input to the recognition of clause segments consists of part-of-speech tagged text, in which basic noun phrases have also been recognized by probabilistic techniques as described by U.S. Pat. No. 5,146,405 to Church. A problem shared by both techniques is the following. If the recognition of a basic noun phrase is not correct, then this may result in an error in clause segmentation. For example, if a long noun phrase that has been recognized should in fact be analyzed as two noun phrases, then a possible clause boundary location becomes inaccessible.
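A hypothetical Python sketch of this error-propagation problem is given below. It assumes, purely for the sake of the example, that clause boundaries may only be placed between already-recognized chunks, never inside one; the chunkings shown are illustrative and are not output of the cited methods.

def candidate_boundaries(chunks):
    """Token positions at which a clause boundary could be placed,
    i.e. the positions between chunks (never inside a chunk)."""
    positions, offset = [], 0
    for chunk in chunks[:-1]:
        offset += len(chunk)
        positions.append(offset)
    return positions

# "the dog the cat chased ran away" contains an embedded relative clause;
# the clause boundary belongs between "dog" and "the cat" (position 2).
correct = [["the", "dog"], ["the", "cat"], ["chased"], ["ran", "away"]]
erroneous = [["the", "dog", "the", "cat"], ["chased"], ["ran", "away"]]

print(candidate_boundaries(correct))    # [2, 4, 5] -- boundary at position 2 is available
print(candidate_boundaries(erroneous))  # [4, 5]    -- boundary at position 2 is inaccessible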
In the framework of Constraint Grammar, there is also a module for detecting sentence-internal clause boundaries, described in Karlsson et al., “Constraint Grammar: A language independent system for parsing unrestricted text”, 1995, Mouton de Gruyter, Berlin/New York, pp. 1-430. However, the authors report (on pages 213 and 238) that the mechanism for identifying sentence-internal clause boundaries is problematic and rather unsophisticated, and that, as a result, the other modules of constraint syntax to a great extent have to do without it.
SUMMARY OF THE INVENTION
An objective of the present invention is to provide an improved method for clause boundary detection and segmentation of unrestricted text into clauses that is not subject to the foregoing disadvantages of existing methods for these tasks.
The invention is based o
Burns Doane Swecker & Mathis L.L.P.
Hapax Limited