Reexamination Certificate
1998-06-23
2001-05-15
Breene, John (Department: 2177)
Data processing: database and file management or data structures
Database design
Data structure types
C707S793000, C706S012000
active
06233575
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates, generally, to a process, system and article of manufacture for organizing and indexing information items such as documents by topic, and in preferred embodiments, to such a process, system and article which employ a topic hierarchy and involve a determination of discriminating terms and stop terms at each internal node in the topic hierarchy.
2. Description of Related Art
With modern advances in computer technology, modem speeds and network and internet technologies, vast amounts of information have become readily available in homes, businesses and educational and government institutions throughout the world. Indeed, many businesses, individuals and institutions rely on computer-accessible information on a daily basis. This global popularity has further increased the demand for even greater amounts of computer-accessible information. However, as the total amount of accessible information increases, locating specific items of information within the totality becomes increasingly difficult.
The format with which the accessible information is arranged also affects the level of difficulty in locating specific items of information within the totality. For example, searching through vast amounts of information arranged in a free-form format can be substantially more difficult and time consuming than searching through information arranged in a pre-defined order, such as by topic, date, category, or the like. However, due to the nature of certain on-line systems, such as the internet, much of the accessible information is placed on-line in the form of free-format text. Moreover, the amount of on-line data in the form of free-format text continues to grow very rapidly.
Search schemes employed to locate specific items of information among the on-line information content typically depend upon the presence or absence of key words (words included in the user-entered query) in the searchable text. Such search schemes identify those textual information items that include (or omit) the key words. However, in systems such as the web or large intranets, where the total information content is relatively large and free-form, key word searching can be problematic. For example, it can identify numerous text items that contain (or omit) the selected key words but are not relevant to the actual subject matter to which the user intended to direct the search.
As text repositories grow in number and size and global connectivity improves, there is a pressing need to support efficient and effective information retrieval (IR), searching and filtering. A manifestation of this need is the recent proliferation of over one hundred commercial text search engines that crawl and index the web, and several subscription-based information multicast mechanisms. Nevertheless, there is little structure on the overwhelming information content of the internet.
Common practices for managing such information complexity on the internet or in database structures typically involve tree-structured hierarchical indices. Many internet directories, such as Yahoo!™ (http://www.yahoo.com) and Infoseek (http://www.infoseek.com), are largely manually organized in preset hierarchies. International Business Machines Corporation has implemented a patent database (http://www.ibm.com/patents) which is organized by the U.S. Patent Office's class codes, which form a preset hierarchy. Digital libraries that mimic hardcopy libraries support some form of subject indexing such as the Library of Congress Catalogue, which is also hierarchical. Such topic hierarchies are referred to herein as “taxonomies.” Taxonomies can provide a means for designing vastly enhanced searching, browsing and filtering systems. Querying with respect to a topic can be more reliable than depending only on the presence or absence of specific words in documents. By the same token, multicast systems such as PointCast (http://www.pointcast.com) are likely to achieve higher quality by registering a user profile in terms of classes in a taxonomy rather than key words.
The danger in querying or filtering by keywords alone is that there may be many aspects to, and often different interpretations of, the key words, and many of these aspects and interpretations are irrelevant to the subject matter that the searcher intended to find.
Consider, for example, a situation in which a wildlife researcher is attempting to find information about the running speed of the jaguar, using the conventional Alta Vista™ internet search engine (http://www.altavista.digital.com), with the query “jaguar speed”. In a test search conducted with the above-noted search engine and query, a variety of responses were generated, spanning the car, the Atari™ video game, the football team, and a LAN server, in no particular order. The first page about the animal was ranked 183, and was directed to a fable.
To eliminate the responses on cars, the test query was then changed to “jaguar speed -car -auto”. The top response in the generated results read as follows:
“If you own a classic Jaguar, you are no doubt aware how difficult it can be to find certain replacement parts. This is particularly true of gearbox parts.”
The words car and auto do not occur on this page. There was no cat in the first 50 pages of the generated response. Some search engines such as Alta Vista™ propose additional keywords to refine the query, but, at the time of writing, all of the keywords were related to cars or football.
Even the query “jaguar speed +cat” gave unsatisfactory results. The responses included the word “cat”, but were often about automobiles. The 25th page was the first with information about jaguars, but did not contain the desired information.
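The failure mode seen in these test searches can be sketched in a few lines. The following Python sketch is illustrative only and is not the matching logic of any actual search engine. The function names and the three toy documents are hypothetical. It selects documents purely by the presence of required words and the absence of excluded words, and a page about car parts still matches because it never uses the excluded words.

```python
import re


def tokenize(text):
    """Lowercase a text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_search(documents, include=(), exclude=()):
    """Return ids of documents that contain every word in `include`
    and none of the words in `exclude` (pure presence/absence matching)."""
    hits = []
    for doc_id, text in documents.items():
        words = tokenize(text)
        has_all = all(w in words for w in include)
        has_excluded = any(w in words for w in exclude)
        if has_all and not has_excluded:
            hits.append(doc_id)
    return hits


# Hypothetical toy collection (assumption: invented for illustration).
documents = {
    "animal": "The jaguar is a big cat; its running speed is the subject of this page",
    "parts": "If you own a classic Jaguar, replacement gearbox parts rated for high speed are hard to find",
    "dealer": "Jaguar car dealer: compare the top speed of each auto model",
}

# Analogue of the query "jaguar speed -car -auto".
results = keyword_search(documents, include=["jaguar", "speed"], exclude=["car", "auto"])
print(results)
```

Excluding "car" and "auto" removes the dealer page but not the parts page, since word presence or absence says nothing about the topic of a page.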
In contrast, if a topic taxonomy such as Yahoo™ is used, there is no problem in insisting that the user seeks documents containing “jaguar” in the topical context of animals, not cars. Unfortunately, it is labor-intensive to maintain Yahoo™ manually as the web changes and grows faster than ever. In our test case, even though the search was easily restricted to within animals, no answer could be found within the relatively small collection returned.
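Restricting a keyword query to a topical context can be sketched as follows. This Python sketch assumes every document already carries a topic path in the hierarchy, which is exactly the labor-intensive part when assignment is manual. The topic names, document ids and function names are hypothetical and do not reflect the categories of any real directory.

```python
def in_topic(doc_path, topic_path):
    """True if doc_path lies at or below topic_path in the topic hierarchy."""
    return doc_path[:len(topic_path)] == topic_path


def search_in_topic(documents, keyword, topic_path=()):
    """Return ids of documents under `topic_path` whose text contains `keyword`."""
    hits = []
    for doc_id, (path, text) in documents.items():
        if in_topic(path, topic_path) and keyword in text.lower().split():
            hits.append(doc_id)
    return hits


# Hypothetical labeled collection: id -> (topic path, text).
labeled_documents = {
    "d1": (("Science", "Animals", "Mammals"), "jaguar habitat and diet"),
    "d2": (("Recreation", "Automotive"), "jaguar gearbox parts"),
    "d3": (("Science", "Animals", "Birds"), "falcon diving speed"),
}

# An empty topic path means no topical restriction.
everywhere = search_in_topic(labeled_documents, "jaguar")
animals_only = search_in_topic(labeled_documents, "jaguar", ("Science", "Animals"))
print(everywhere, animals_only)
```

With the restriction to the animals subtree, the automotive page is excluded without naming any car-related words in the query.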
Search engines are still an immature technology. Other areas were researched intensively long before web search engines were devised, and the following discussion surveys these overlapping areas of related research: Information Retrieval (IR) systems and text databases, data mining, statistical pattern recognition, and machine learning.
For data mining, machine learning, and pattern recognition, the supervised classification problem has been addressed in statistical decision theory (both classical, as in Wald, Statistical Decision Functions, 1950, and Bayesian, as in Berger, Statistical Decision Theory and Bayesian Analysis, 1985, each of which is incorporated herein by reference), in statistical pattern recognition (as in Duda and Hart, Pattern Classification and Scene Analysis, 1973, and Fukunaga, An Introduction to Statistical Pattern Recognition, 1990, each of which is incorporated herein by reference), and in machine learning (as in Weiss and Kulikowski, Computer Systems that Learn, 1990, Natarajan, Machine Learning: A Theoretical Approach, 1991, and Langley, Elements of Machine Learning, 1996, each of which is incorporated herein by reference).
Classifiers can be parametric or non-parametric. Two well-known classes of non-parametric classifiers are decision trees, such as CART (as in Breiman et al, Classification and Regression Trees, 1984, which is incorporated herein by reference) and C4.5 (as in Quinlan, C4.5: Programs for Machine Learning, 1993, which is incorporated herein by reference), and neural networks (as in Hush and Horne, Progress in Supervised Neural Networks, 1993, Lippmann, Pattern Classification using Neural Networks, 1989, and Jain et al, Artificial Neural Networks, 1996, each of which is incorporated herein by reference). For such classifiers,
Agrawal Rakesh
Chakrabarti Soumen
Dom Byron Edward
Raghavan Prabhakar
Breene John
Channavajjala Srirama
Gates & Cooper LLP
International Business Machines - Corporation