System and method for dynamic index-probe optimizations for...

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

Rate now

  [ 0.00 ] – not rated yet Voters 0   Comments 0

Details

C707S793000, C707S793000

Reexamination Certificate

active

06640224

ABSTRACT:

BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates generally to high-dimensional similarity searches, and more particularly to methods for classifying a large number of Web documents in a Web taxonomy.
2. Description of the Related Art
High-dimensional search is an important operation in multimedia databases that contain text documents, audio, and video. To facilitate such searching, database objects can be classified in a taxonomy that has a tree-like structure. For illustration, text documents are used herein as examples of such database objects, it being understood that the present invention applies equally to other genre of objects.
In developing a classification taxonomy, supervised learning can be used, wherein a few training documents initially are assigned to the various nodes of a taxonomy and subsequent documents are then classified based on comparisons with the training documents. Generally, a document will be classified at a leaf node in a taxonomy when the leaf node contains training documents that are “closest” to the document sought to be classified. For example, in so-called Bayesian classifiers, each node “c” (also referred to herein as a “class” or “classification”) in a taxonomy tree has an associated document model that is defined by the training documents. When a test document “d” is to be classified, a posterior probability that measures the likelihood that the test document “d” could have been generated by the class “c”, denoted Pr[c|d], is determined for each class “c”. The test document is classified as belonging to the class “c” having the highest posterior probability for that document.
In undertaking the classification process, Bayesian classifiers collect term occurrences and estimate statistical parameters &thgr;(c,t) which are measures of the fractional rate at which the term “t” occurs in the class “c”. For large data sets, e.g., databases of Web pages, the parameters &thgr;(c,t) cannot be cached entirely in local memory, but must instead be stored on, e.g., a disk or other storage device. As will be appreciated, data accesses to disks and other such storage devices consume a great deal of time, compared to data accesses to local memory.
Thus, in contrast to an index probe being routed down a so-called “B-Tree” or “R-Tree”, the above-discussed high-dimensional classification search operation cannot cache the comparison data that is to be used at each node in memory. Instead, in the present high-dimensional application both the probe and the comparison models at each node are large complex objects, and each step in the classification decision process consequently requires an “outside” data access, typically to a disk. As recognized by the present invention, it would be advantageous to optimize such “outside” data accesses for the case of high-dimensional operations, particularly in contexts, such as classifying millions of Web pages, which require near-constant classification of a seemingly endless supply of large documents.
SUMMARY OF THE INVENTION
The invention is a general purpose computer programmed according to the inventive steps herein to classify documents in a taxonomy. The invention can also be embodied as an article of manufacture—a machine component—that is used by a digital processing apparatus and which tangibly embodies a program of instructions that are executable by the digital processing apparatus to undertake the logic disclosed herein. This invention is realized in a critical machine component that causes a digital processing apparatus to undertake the inventive logic herein.
In accordance with the present invention, the computer includes a data storage device including a computer usable medium having computer readable code means for document classification. The code means include computer readable code means for establishing plural data tables, with the data tables including a taxonomy table containing data related to a classification taxonomy. Also, one of the tables contains data representing statistics related to occurrences of terms in nodes of the taxonomy. Computer readable code means receive documents, and computer readable code means are provided for classifying the documents with respect to the taxonomy by undertaking at least one table join using the plural data tables and the document, such that data access is optimized.
In a preferred embodiment, the classifying means requires no random data input/output (I/O) access, nor does it require in-place table updates. In this way, redundant probes for terms that occur in many documents are eliminated.
In a particularly preferred embodiment, an inner table join and a left outer table join are executed by the classifying means. The table joins can be represented by the expression (using Bayesian notation) logprior[c]+&Sgr;
t∈d∩c
freq[d,t](log&thgr;[c,t]+logdenom[c])−logdenom[c]&Sgr;
t∈d
freq[d,t], wherein d represents the document, t represents at least one term, and c represents a taxonomy.
To enhance the effectiveness of the above-summarized join operations, computer readable code means can process a group of documents using the means for classifying by testing all documents in the group at a test node in the taxonomy, prior to testing any document at a node other than the test node. Thus, entire taxonomy nodes at a time are processed. For such bulk processing, the data tables include an expand table and a result table, and the means for processing the group of documents recursively deletes rows in the expand table and fills the rows with entries from the result table. Moreover, the means for classifying executes a left outer table join, and the result table is populated by results from the left outer table join.
In another aspect, a computer system for wave-front classification processing of a group of documents relative to a taxonomy includes a computer including program structure that processes the group of documents by testing all documents in the group at a test node in the taxonomy, prior to testing any document at a node other than the test node.
In yet another aspect, a computer-implemented method is disclosed for classifying at least one group of documents with respect to a taxonomy while optimizing I/O access. The method includes establishing plural data tables including a taxonomy table containing data related to the taxonomy and a statistic table containing data representing statistics related to occurrences of terms in nodes of the taxonomy. Documents in the group of documents are classified with respect to the taxonomy by executing an inner table join and an outer table join. In accordance with present principles, at least one of the joins uses elements from the taxonomy table and the statistic table, and all documents in the group are tested at a test node in the taxonomy prior to testing any document at a node other than the test node.
In still another aspect, the above-summarized method is stored as a computer program on a computer program device. The program device includes a computer program storage device that is readable by a digital processing apparatus, and the program is on the program storage device. The program includes instructions that can be executed by the digital processing apparatus for performing the method steps of the present invention.
The details of the present invention, both as to its structure and operation, can best be understood in reference to the accompanying drawings, in which like reference numerals refer to like parts, and in which:


REFERENCES:
patent: 5463773 (1995-10-01), Sakakibara et al.
patent: 5625767 (1997-04-01), Bartell et al.
patent: 5835905 (1998-11-01), Pirolli et al.
patent: 5873056 (1999-02-01), Liddy et al.
patent: 5895470 (1999-04-01), Pirolli et al.
patent: 5960422 (1999-09-01), Prasad
patent: 5983170 (1999-11-01), Goodman
patent: 6055540 (2000-04-01), Snow et al.
patent: 6094653 (2000-07-01), Li et al.
patent: 6101515 (2000-08-01), Wical et al.
patent: 6125361 (2000-

LandOfFree

Say what you really think

Search LandOfFree.com for the USA inventors and patents. Rate them and share your experience with other people.

Rating

System and method for dynamic index-probe optimizations for... does not yet have a rating. At this time, there are no reviews or comments for this patent.

If you have personal experience with System and method for dynamic index-probe optimizations for..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and System and method for dynamic index-probe optimizations for... will most certainly appreciate the feedback.

Rate now

     

Profile ID: LFUS-PAI-O-3153616

  Search
All data on this website is collected from public sources. Our data reflects the most accurate information available at the time of publication.