Method and system for topical segmentation, segment...

Data processing: speech signal processing – linguistics – language – Linguistics – Natural language

Reexamination Certificate

Details

C707S793000

active

06473730

ABSTRACT:

SPECIFICATION
A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of any portion of the patent document, as it appears in any patent granted from the present application or in the Patent and Trademark Office file or records available to the public, but otherwise reserves all copyright rights whatsoever.
An appendix containing a source code listing used in practicing an exemplary embodiment of the invention is included as part of the Specification and is hereinafter referred to as Appendix A. Appendix A is found on pages 30-59 of the Specification.
FIELD OF THE INVENTION
The present invention relates in general to the field of natural language processing and automatic text analysis and summarization. More particularly, the present invention relates to a method and system for topical segmentation of a document and classification of segments according to segment function and importance.
BACKGROUND OF THE INVENTION
Identification of a document's discourse structure can be extremely useful in natural language processing applications such as automatic text analysis and summarization and information retrieval. For example, simple segmentation of a document into blocks of topically similar text can be useful in assisting text search engines to determine whether or not to retrieve or highlight a particular segment in which a query term occurs. Similarly, topical segments can be useful in assisting summary agents to provide detailed summaries by topic in accordance with a segment function and/or importance. Topical segmentation is especially useful for accurately processing long texts having multiple topics for a wide range of natural language applications.
Conventional methods for topical segmentation, such as in Hearst's TextTiling program, identify zero or more segment boundaries at various paragraph separations, which in turn identify one or more topical text segments. See M. Hearst, “Multi-Paragraph Segmentation of Expository Text,” Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics (1994). Topical segmentation is thus linear, but based solely upon the equal consideration of selected terms. Terms are regarded as equally important in deciding how to segment the document input, and as such segmentation does not leverage the differences between term types. TextTiling, in addition, makes no effort to measure the significance and function of identified topical segments.
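The equal-weight approach described above can be illustrated with a minimal sketch. It is a simplified stand-in and not Hearst's TextTiling algorithm: it compares adjacent paragraphs by the cosine similarity of their raw word counts, treats every term as equally important, and proposes a boundary wherever the similarity falls below a cutoff. The function names and the `threshold` value are assumptions made for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def gap_similarities(paragraphs):
    """Similarity across each gap between consecutive paragraphs."""
    counts = [Counter(p.lower().split()) for p in paragraphs]
    return [cosine(counts[i], counts[i + 1]) for i in range(len(counts) - 1)]


def boundaries(paragraphs, threshold=0.1):
    """Indices of paragraphs that begin a new segment.

    `threshold` is a hypothetical cutoff, not a value from the source.
    """
    sims = gap_similarities(paragraphs)
    return [i + 1 for i, s in enumerate(sims) if s < threshold]


paragraphs = [
    "wine grapes wine",
    "grapes wine harvest",
    "tax law tax",
    "law tax court",
]
print(boundaries(paragraphs))  # [2]
```

Because every word contributes equally to the similarity, a frequent pronoun counts as much as a proper noun, which is the limitation the passage attributes to this class of methods.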
Other conventional methods use hierarchical segmentation to create tree-like representations of a document's discourse structure. See U.S. Pat. No. 5,642,520; D. Marcu, “The Rhetorical Parsing of Natural Language Texts,” The Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics at pp. 96-103 (1997); Y. Yaari, “Segmentation of Expository Text by Hierarchical Agglomerative Clustering,” Recent Advances in NLP 1997, Bulgaria (1997). Hierarchical segmentation attempts to calculate not only topic boundaries, but also subtopic and sub-subtopic boundaries. This is inherently a more difficult task and can be prone to more sources of error. Researchers also define “topic” differently such that many times a topic boundary in one text can correspond to a subtopic or a supertopic in another segmentation program.
Still other conventional hierarchical schemes, for example, use complex “attentional” models or rules that look at the topic of discussion for a particular sentence; that is, the focus of the sentence. Attentional models are commonly used to determine pronominal resolution, e.g., what person does “he” or “she” refer to in the text, and usually require contextual knowledge that is often difficult to glean from the language input using automated methods. See U.S. Pat. No. 5,642,520.
Again, as with conventional linear segmentation schemes, no effort is made with conventional hierarchical schemes to determine the contextual significance or function of the identified topical segments.
SUMMARY OF THE INVENTION
The aforedescribed limitations and inadequacies of conventional topical segmentation methods are substantially overcome by the present invention, in which a primary object is to provide a method and system for segmenting text documents so as to efficiently and accurately identify topical segments of the documents.
It is another object of the present invention to provide a system and method that identifies the significance of identified topical segments.
It is yet another object of the present invention to provide a system and method that identifies the function of identified topical segments.
In accordance with a preferred method of the present invention, a method is provided that includes the steps of: extracting one or more selected terms from a document; linking occurrences of the extracted terms based upon the proximity of similar terms; and assigning weighted scores to paragraphs of the document input corresponding to the linked occurrences, wherein the scores depend upon the type of the selected terms and the position of the linked occurrences with respect to the paragraphs, and wherein the scores represent boundaries of the topical segments.
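The three steps recited above (extract terms, link nearby occurrences, score paragraphs by term type and link position) can be sketched as follows. This is an illustration under stated assumptions and not the listing of Appendix A. Terms are assumed to be already extracted and tagged with a type. The type names, the `MAX_GAP` linking distances, the weight tables, and the convention that a higher score suggests a segment boundary before that paragraph are all hypothetical.

```python
from collections import defaultdict

# Hypothetical values for illustration only; the source does not give them.
MAX_GAP = {"proper_noun": 8, "common_noun": 4, "pronoun": 1}  # in paragraphs
START_WEIGHT = {"proper_noun": 10, "common_noun": 10, "pronoun": 1}
INSIDE_WEIGHT = {"proper_noun": -3, "common_noun": -3, "pronoun": -1}


def link_occurrences(paragraphs):
    """Link occurrences of the same term that lie close together.

    `paragraphs` is a list of lists of (term, term_type) pairs.
    Returns (term_type, first_paragraph, last_paragraph) tuples.
    """
    positions = defaultdict(list)
    for index, terms in enumerate(paragraphs):
        for term, term_type in terms:
            positions[(term, term_type)].append(index)

    links = []
    for (_term, term_type), indices in positions.items():
        start = previous = indices[0]
        for index in indices[1:]:
            # A gap wider than the allowed distance closes the current link.
            if index - previous > MAX_GAP[term_type]:
                links.append((term_type, start, previous))
                start = index
            previous = index
        links.append((term_type, start, previous))
    return links


def score_paragraphs(paragraphs):
    """Sum per-paragraph weights that depend on term type and link position."""
    scores = [0] * len(paragraphs)
    for term_type, first, last in link_occurrences(paragraphs):
        scores[first] += START_WEIGHT[term_type]       # a link begins here
        for index in range(first + 1, last + 1):
            scores[index] += INSIDE_WEIGHT[term_type]  # a link continues here
    return scores


paragraphs = [
    [("wine", "common_noun")],
    [("wine", "common_noun")],
    [("tax", "common_noun")],
    [("tax", "common_noun")],
]
print(score_paragraphs(paragraphs))  # [10, -3, 10, -3]
```

In this toy input the third paragraph scores high because a new link starts there and no earlier link continues into it, so under the assumed convention it is the most likely start of a new topical segment after the first.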
In accordance with another preferred method of the present invention, a method is provided for automatically extracting significant topical information from a document, the method including the steps of: extracting topical information from a document in accordance with specified categories of information; linking occurrences of the extracted topical information based on the proximity of similar topical information; determining topical segments within the document corresponding to the linked occurrences of the topical information; and determining the significance of the topical segments.
In another aspect of the present invention, a computer program is provided for topical segmentation of a document input. The computer program includes executable commands for: extracting selected terms from a document; linking occurrences of the extracted terms based upon the proximity of similar terms; and assigning weighted scores to paragraphs of the document input corresponding to the linked occurrences, wherein the scores depend upon the type of the selected terms and the position of the linked occurrences with respect to the paragraphs, and wherein the scores represent boundaries for the topical segments.
In yet another aspect of the present invention, a computer program is provided for automatically extracting significant topical information from a document. The computer program includes executable commands for: extracting topical information from a document in accordance with specified categories of information; linking occurrences of the extracted topical information based on the proximity of similar topical information; determining topical segments within the document corresponding to the linked occurrences of the topical information; and determining the significance of the topical segments.


REFERENCES:
patent: 5642520 (1997-06-01), Takeshita et al.
patent: 5748805 (1998-05-01), Withgott et al.
patent: 5799268 (1998-08-01), Boguraev
patent: 5913185 (1999-06-01), Martino et al.
patent: 6038560 (2000-03-01), Wical
patent: 6070133 (2000-05-01), Brewster et al.
patent: 6199034 (2001-03-01), Wical
patent: 6212494 (2001-04-01), Boguraev
M. Hearst, “Multi-Paragraph Segmentation of Expository Text,” Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics (1994).
D. Marcu, “The Rhetorical Parsing of Natural Language Texts,” The Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics at pp. 96-103 (1997).
Y. Yaari, “Segmentation of Expository Text by Hierarchical Agglomerative Clustering,” Recent Advances in NLP 1997, Bulgaria (1997).
J. Justeson and S. Katz, “Technical Terminology: Some Linguistic Properties and an Algorithm for Identification in Text,” Natural Language Engineering, vol. 1(1) at pp. 9-29 (1995).