Data processing: database and file management or data structures – Database design – Data structure types
Reexamination Certificate
2000-08-28
2004-11-23
Wassum, Luke S (Department: 2177)
active
06823331
ABSTRACT:
FIELD OF THE INVENTION
The invention pertains to the field of text interpretation, representation and reduction and, more particularly, to a computer system and method for intelligently identifying concept(s) relating to an electronic document and using this knowledge to reduce and/or represent the text content of an electronic document (which may be any type of electronic document including Web pages, electronic messages such as e-mail, converted voice, fax or pager message or other type of electronic document).
BACKGROUND OF THE INVENTION
The volume of information in the form of text, particularly electronic information, being communicated to users is increasing at a very high rate, and such information can take many forms, from simple voice or electronic messages to full document attachments such as technical papers, letters, etc. Because of this, there is a growing need in the communications, database management and related electronic information industries for means to intelligently condense electronic text information for purposes of assisting the user in handling such communications and for effective classification, archiving and retrieval of the information.
The known document condensers (sometimes also referred to as key word/phrase “extractors” or as “summarizers”), which typically function to identify a set of key words/phrases by utilizing various statistical algorithms and/or pre-set rules, have had limited success and limited scope for application. One such known method of condensing text is described in Canadian Patent Application No. 2,236,623 by Turney which was laid open on 23 December, 1998; the Turney method disclosed by this reference relies upon the use of a preliminary teaching procedure in which a number of pre-set teaching modules, directed to different document categories or academic fields, are provided and a selected one is run prior to using the text condenser in order to revise and tune a set of rules used by the condenser so as to produce the best results for documents of a selected category or within the selected academic field.
However, such prior condensers do not advance the art appreciably because they are primarily statistically based and do not meaningfully address semantic or global linguistic factors which might affect or govern the document text. As such, they generally produce only lengthy sets or strings of key words and phrases per se, and the relationships or concepts between those key words and phrases are often lost in the resulting summary. The prior condensers also ignore the intent of the electronic document and, hence, treat news, articles, discussions, journal papers, etc. generically.
In the applicant's co-pending U.S. application Ser. No. 09/494,312 filed on 21 January, 2000, which is incorporated herein by reference, there is disclosed a computer-readable system for intelligently analyzing and highlighting key words/phrases, key sentences and/or key components of an electronic document by recognizing and utilizing the context of both the electronic document and the user. In accordance with that system, a document map is created by removing from the input document the white space (i.e. formatting such as line spacing) and designated first stage “exclude” words, which may be defined as conjunctive words (such as the words “and”, “with”, “but”, “to”, “however”, etc.), articles (such as the words “the”, “a”, “an”, etc.), forms and tenses of the words “to have” and “to be” and other filler words such as “thanks”, “THX”, “bye”, etc. The text is then stemmed by removing suffixes from applicable words to produce the root thereof (lower case letters only and without punctuation). For example, the words “computational” and “computer” would both be stemmed to the same root viz. “comput”. The document map preserves the sentence and paragraph structure of the document and includes stem maps, and a frequency count designation is assigned to each stem such that the map provides a complete list of all word/phrase stems with a frequency count per stem and sentence demarcation (a phrase being a preselected number of consecutive words containing no punctuation or exclude words).
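The document map described above can be illustrated with a minimal Python sketch. It is not the implementation of the co-pending application: the exclude list is abbreviated, the suffix list is a toy stand-in for a real stemmer, and the names `EXCLUDE`, `SUFFIXES`, `stem` and `build_document_map` are hypothetical. For brevity it keeps sentence demarcation only and ignores paragraphs and multi-word phrases.

```python
import re
from collections import Counter

# Hypothetical, abbreviated exclude list built from the examples in the text.
EXCLUDE = {"and", "with", "but", "to", "however", "the", "a", "an",
           "is", "are", "was", "were", "be", "been", "has", "have", "had",
           "thanks", "thx", "bye"}

# Toy suffix list, longest first; an assumption, not the stemmer of the source.
SUFFIXES = ("ational", "ation", "ing", "ers", "er", "ed", "s")


def stem(word):
    """Strip the first matching suffix, keeping a root of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_document_map(text):
    """Return per-sentence stem lists plus a frequency count per stem."""
    sentences = []
    counts = Counter()
    for raw in re.split(r"[.!?]+", text):
        words = re.findall(r"[a-z']+", raw.lower())  # lower case, no punctuation
        stems = [stem(w) for w in words if w not in EXCLUDE]
        if stems:
            sentences.append(stems)
            counts.update(stems)
    return {"sentences": sentences, "counts": counts}
```

With this toy suffix list, `stem("computational")` and `stem("computer")` both give `"comput"`, matching the example in the text.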
The negation key phrases of the document map are identified using a negation words list and by determining whether the word “not” is in any form (e.g. as “n't” in the words “couldn't”, “shouldn't”, “wouldn't”, “won't”, etc.) present in a phrase. These negation key phrases are flagged and given a weight for purposes of scoring them. The action key phrases of the document map are identified using a verbs list and they are scored on the basis of assigned context weights and conditions. The remaining words/phrases of the document are scored in the manner described in the aforementioned Canadian patent application No. 2,236,623 to Turney but with the important improvement of making use of context determinations of the system which identify “include/exclude” words/phrases. In addition, sentences are scored whereby sentences in a document having a higher number of highly ranked words/phrases are themselves, as a whole, given a relatively high ranking.
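As a rough illustration of the negation flagging and sentence ranking just described, the following Python sketch may help. The negation word list, the function names and the additive scoring rule are all assumptions made for the example; the source does not give the actual list, weights or conditions. ASCII apostrophes are assumed in the input.

```python
import re

# Hypothetical negation words list; the source does not enumerate it.
NEGATION_WORDS = {"not", "no", "never", "cannot"}


def is_negated(phrase):
    """Flag a phrase containing "not" in any form, including the n't contraction."""
    tokens = re.findall(r"[a-z']+", phrase.lower())
    return any(t in NEGATION_WORDS or t.endswith("n't") for t in tokens)


def rank_sentences(sentences, weights):
    """Order sentence indexes so sentences with more highly weighted
    words come first (assumed rule: sum of per-word weights)."""
    def score(tokens):
        return sum(weights.get(t, 0) for t in tokens)

    return sorted(range(len(sentences)),
                  key=lambda i: score(sentences[i]),
                  reverse=True)
```

Here `sentences` is a list of token lists and `weights` maps a word or stem to its score; both shapes are chosen for the sketch only.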
The inventor herein has discovered that the interpretation and summarization of the text of an electronic document is improved by determining the concept(s) to which the text relate(s) and, in appropriate cases, utilizing this knowledge of the governing concept to produce a representation of the text content rather than a simple summarization or condensed extract thereof.
SUMMARY OF THE INVENTION
In accordance with the invention there is provided a computer-readable concept identification system and method for use in reducing and/or representing text content of an electronic document. A concept knowledge base includes a plurality of concepts wherein each concept comprises one or more subconcepts linked to each other and to such concept on a hierarchical basis and wherein one or more of the subconcepts may be linked to one or more subconcepts of another concept. A concept matching module matches text of the document to subconcepts of the concept knowledge base and assesses any links between the matched subconcepts and other concepts and/or subconcepts of the concept knowledge base. From this a determination is made whether the document relates to a concept of the knowledge base. The subconcepts preferably include synonyms therefor.
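One way to picture the concept knowledge base and matching module is the Python sketch below. It is an illustration under stated assumptions, not the claimed system: the class names, the single-word matching, the scoring rule (one point per matched subconcept plus one per cross-link to another matched subconcept) and the `threshold` parameter are all hypothetical.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Subconcept:
    name: str
    synonyms: set = field(default_factory=set)
    children: list = field(default_factory=list)    # hierarchical links
    cross_links: set = field(default_factory=set)   # names of subconcepts of other concepts


@dataclass
class Concept:
    name: str
    subconcepts: list


def iter_subconcepts(nodes):
    """Walk the subconcept hierarchy depth first."""
    for node in nodes:
        yield node
        yield from iter_subconcepts(node.children)


def match_concept(text, knowledge_base, threshold=2):
    """Return the best-scoring concept name, or None if no concept reaches
    the (assumed) threshold."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    matched = {
        c.name: [s for s in iter_subconcepts(c.subconcepts)
                 if words & ({s.name} | s.synonyms)]
        for c in knowledge_base
    }
    all_matched = {s.name for hits in matched.values() for s in hits}
    best_name, best_score = None, 0
    for concept in knowledge_base:
        hits = matched[concept.name]
        links = sum(1 for s in hits for link in s.cross_links if link in all_matched)
        score = len(hits) + links
        if score > best_score:
            best_name, best_score = concept.name, score
    return best_name if best_score >= threshold else None
```

A knowledge base here is simply a list of `Concept` objects; a text that matches too few subconcepts yields `None`, corresponding to the determination that the document does not relate to any concept of the knowledge base.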
A document representation generator may be provided for producing a precis of the document based on a template associated with the determined concept. An output module is provided for communicating an identification of the concept determined by the matching module.
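A template-driven precis of this kind might look like the following Python sketch. The template text, the slot names and the `make_precis` function are invented for illustration; the source does not specify the form of the templates or how their fields are filled.

```python
from string import Template

# Hypothetical template store keyed by concept name.
TEMPLATES = {
    "meeting": Template("Meeting about $topic at $time in $place."),
}


def make_precis(concept, slots):
    """Fill the template for the determined concept; return None when the
    concept has no template. Missing slots are left as placeholders."""
    template = TEMPLATES.get(concept)
    if template is None:
        return None
    return template.safe_substitute(slots)
```

Using `safe_substitute` rather than `substitute` is a design choice for the sketch: a document that supplies only some of the expected fields still produces a partial representation instead of raising an error.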
Also in accordance with the invention there is provided a computer-readable system and method for highlighting the content of an electronic document and producing therefrom an electronic output highlight document. A concept identification system is provided according to the foregoing and a highlighter module is provided for determining key content of the input document. The highlighter module includes a comparing module for comparing content of the input document to the subconcepts of the concept knowledge base for the determined concept for purposes of determining the key content. An interface integrates the concept identification system and the highlighter module. An output module produces an output highlight document from the key content.
A document mapping module is preferably provided for producing a static document map of the content of the input document, wherein the highlighter module applies to the static document map weightings derived from determinations made by the comparing module.
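The idea of applying concept-derived weightings to a static document map can be sketched as follows in Python. The multiplicative `boost`, the use of stem frequency counts as base weights and the function name are assumptions for illustration only; the source does not state how the weightings are computed.

```python
def apply_concept_weights(stem_counts, concept_terms, boost=2.0):
    """Return a new stem -> weight mapping in which stems matching the
    determined concept's subconcept terms are boosted. The input map is
    left unmodified, in keeping with a static document map."""
    return {
        stem: count * (boost if stem in concept_terms else 1.0)
        for stem, count in stem_counts.items()
    }
```

Returning a fresh mapping keeps the document map itself unchanged, so the same map could be re-weighted for a different concept or context without being rebuilt.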
Cassan Maclean
Entrust Limited