Automated category discovery for a terminological knowledge...

Data processing: artificial intelligence – Knowledge processing system – Knowledge representation and reasoning technique

Reexamination Certificate

Details

C704S009000, C704S010000

active

06513027

ABSTRACT:

BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention is directed toward the field of morphological and ontological systems, and more particularly toward techniques to automatically discover categories in a terminological knowledge base.
2. Art Background
In general, knowledge bases include information arranged to reflect ideas, concepts, or rules regarding a particular problem set. Knowledge bases are used in natural language processing systems (a.k.a. artificial linguistic or computational linguistic systems). These knowledge bases store information about language, including how terminology relates to other terminology in that language. For example, such a knowledge base may store information that the term “buildings” is related to the term “architecture,” because there is a linguistic connection between these two terms.
Natural language processing systems use knowledge bases for a number of applications. For example, natural language processing systems use knowledge bases of terminology to classify information or documents. One example of such a natural language processing system is described in U.S. Pat. No. 5,694,523, entitled “Content Processing System for Discourse,” issued to Kelly Wical on Dec. 2, 1997, which is expressly incorporated herein by reference.
Terminological knowledge bases are also used in information search and retrieval systems. In this application, a knowledge base may be used to identify terms related to the query terms input by a user. Examples of the use of a knowledge base in an information search and retrieval system are described in U.S. patent application Ser. No. 09/095,515, entitled “Hierarchical Query Feedback in an Informative Retrieval System,” Inventor Mohammad Faisal, filed on Jun. 10, 1998, and U.S. patent application Ser. No. 09/170,894, entitled “Ranking of Query Feedback Terms in an Information Retrieval System,” Inventors Mohammad Faisal and James Conklin, filed on Oct. 13, 1998, both of which are incorporated herein by reference.
One type of terminological knowledge base, disclosed in U.S. patent application Ser. No. 09/095,515, associates one or more terms or concepts with categories of the knowledge base. For example, the category “operating systems” may include a number of concepts that, although associated with the category “operating systems,” are not categories themselves. For this example, the terms “UNIX,” “Windows '98,” and “Mac OS8” may be associated with the knowledge base category “operating systems.” In one implementation of a terminological knowledge base, there may be hundreds or even thousands of these terms associated with a single category.
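The category-and-term arrangement described above can be pictured with a small data structure. The following is only an illustrative sketch: the class name Category and its fields are hypothetical and do not come from the referenced applications. It shows terms attached to a category without being categories themselves.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Category:
    """Hypothetical node in a hierarchical terminological knowledge base."""
    name: str
    # Terms are associated with the category but are not categories themselves.
    terms: Set[str] = field(default_factory=set)
    # Child categories, reflecting the hierarchical arrangement.
    children: List["Category"] = field(default_factory=list)


# The example from the text: three terms associated with one category.
operating_systems = Category(
    name="operating systems",
    terms={"UNIX", "Windows '98", "Mac OS8"},
)
```

In this sketch a category that accumulates hundreds or thousands of entries in terms is the situation the following paragraphs identify as problematic.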
As discussed above, natural language processing systems use terminological knowledge bases to classify information, such as documents. If these natural language processing systems classify terms primarily based on categories, then it is desirable to provide as many categories as possible while still maintaining the accuracy of the ontological distinctions. If a single category has associated with it hundreds or thousands of terms, then the categorization of a particular document to a term loses distinction as the number of terms grows large in the single category. Accordingly, a document classified in a category that has too many terms associated with that category becomes difficult to accurately index, with regard to the proper classification of subject matter in that document. Similarly, if the number of concepts in a single category grows too large, then the performance of terminological knowledge bases for use in information search and retrieval systems becomes degraded. For example, if a category of a knowledge base is used to identify additional subject matter areas from a search query, and a single category is associated with 1,000 terms, then the use of that category to identify additional subject matter may become overly inclusive (i.e., too many subject matter areas are identified through the single category in the knowledge base). Accordingly, it is desirable to limit the number of concepts or terms associated with a single category of a terminological knowledge base.
One way of controlling the number of concepts associated with a single category is to split the category into one or more subcategories. Using this approach, terms within that single category that are related may become subcategories beneath the parent or original category. One approach to splitting or dividing categories is through a linguist's manual interpretation of each category to determine both whether a category should be subdivided and, if so, which terms associated with that category should be subdivided. The manual process of making these determinations is laborious. In addition, if different linguists use different criteria, the knowledge base may grow to include subcategories based on differing underlying principles. Accordingly, it is desirable to automate the process of splitting one or more groups of terms associated with a single category to generate one or more subcategories.
SUMMARY OF THE INVENTION
A terminological system automatically generates sub-categories from categories of a knowledge base. The knowledge base includes a plurality of hierarchically arranged categories, as well as terms associated with the categories. A subset of the categories of the knowledge base is designated “dimensional categories.” A target category in the knowledge base is selected to generate sub-categories for some of the terms associated with the target category. The system also stores a corpus of documents, including themes and corresponding theme weights for each document. A set of themes from the corpus of documents is selected for each term. Dimensional category vectors, one for each term, are generated by associating the set of themes for a term to a dimensional category in the knowledge base. The dimensional category vectors for each term are analyzed to determine if one or more clusters of terminological groups exist. If one or more terminological groups exist, then the terminological groups form terms associated with a new sub-category.
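The steps summarized above (select themes for each term, build one dimensional category vector per term, then look for clusters) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the method claimed by the patent: the summary does not say how themes are selected or how clusters are detected, so the co-occurrence rule, the cosine-similarity grouping, the threshold, and every function name, theme, and dimensional category below are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def dimensional_vector(term, corpus, theme_to_dimension):
    """Sum theme weights per dimensional category over documents where the
    term appears as a theme (an assumed selection rule)."""
    vector = defaultdict(float)
    for doc in corpus:  # each doc maps theme -> theme weight
        if term not in doc:
            continue
        for theme, weight in doc.items():
            dimension = theme_to_dimension.get(theme)
            if theme != term and dimension is not None:
                vector[dimension] += weight
    return dict(vector)


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def find_term_groups(terms, corpus, theme_to_dimension, threshold=0.8, min_size=2):
    """Greedy grouping (assumed, not from the patent): a term joins the first
    group whose seed term has a similar vector, else it starts a new group."""
    vectors = {t: dimensional_vector(t, corpus, theme_to_dimension) for t in terms}
    groups = []
    for term in terms:
        for group in groups:
            if cosine(vectors[term], vectors[group[0]]) >= threshold:
                group.append(term)
                break
        else:
            groups.append([term])
    # Only groups with several terms are candidates for a new sub-category.
    return [g for g in groups if len(g) >= min_size]


# Hypothetical corpus: one dict of theme weights per document.
corpus = [
    {"UNIX": 10, "servers": 8},
    {"Windows '98": 10, "games": 6, "desktop publishing": 4},
    {"Mac OS8": 9, "desktop publishing": 8},
]
# Hypothetical mapping from themes to dimensional categories.
theme_to_dimension = {
    "servers": "business computing",
    "games": "home computing",
    "desktop publishing": "home computing",
}
groups = find_term_groups(["UNIX", "Windows '98", "Mac OS8"], corpus, theme_to_dimension)
```

With this toy data, “Windows '98” and “Mac OS8” share the same dimensional category and end up in one group, while “UNIX” stays alone and is filtered out. A real system would need a more robust clustering step than this greedy pass.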

