Hierarchy statistical analysis system and method

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate


Details

C707S793000, C715S252000


active

06678692

ABSTRACT:

BACKGROUND OF THE INVENTION
This invention relates generally to analysis of data within a hierarchical structure and, more specifically, to analysis of textual data. Many computer users are familiar with textual searching techniques in which documents in a database are selected if they contain user-provided key words. Some textual search engines allow a user to specify key words or phrases in a Boolean combination, using operators such as AND, OR, NOT or NEAR. Other, more advanced textual search engines may count the number of occurrences of specified words in an effort to locate more relevant documents for the user. Frequently, however, key word searching results in a large number of “hits” in documents that are of no interest at all to the user, because the key words may be used in many documents in an incidental manner, or in a context that renders the documents of no interest. The user must then review and discard these superfluous documents, or refine and repeat the search. Conversely, documents of interest may be missed. The principal shortcoming of all key word searching techniques is that they are based on searching the literal form or expression of a document, without regard to context or the ideas or concepts expressed.
There has long been a need for a textual searching technique that allows a user to find documents based on content recognition, by matching selected concepts or ideas, rather than matching key words used in any context at all. The present invention satisfies this need and is also applicable to analyzing and searching non-textual data.
SUMMARY OF THE INVENTION
The present invention resides in a system and corresponding method for characterizing data samples in a hierarchical structure, which facilitates searching of the data based on hierarchical categories or features rather than specific data content. Briefly, and in general terms, the method of the invention comprises the steps of providing a hierarchy of features arranged in a thesaurus-like tree structure having nodes and branches, each node being representative of a feature in the hierarchy; identifying for each database record a plurality of key features that characterize the record; selecting, from the plurality of key features obtained in the identifying step, a node in the hierarchy corresponding to a predominant feature that best characterizes the database record; and associating the predominant feature and its position in the hierarchy with the database record. Database records are then accessible by their predominant features rather than by specific content.
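As an illustration of the final associating step, the following minimal Python sketch stores, for each record, the path from the top node to its predominant feature, so that records can be retrieved by category rather than by content. The hierarchy, the record identifiers and the function names are all hypothetical and are not taken from the disclosure.

```python
# Hypothetical hierarchy: each feature maps to its parent (None marks the top node).
PARENT = {"entity": None, "animal": "entity", "plant": "entity",
          "dog": "animal", "tree": "plant"}

def path_from_top(feature, parent=PARENT):
    """Return the feature's position as the list of nodes from the top node down to it."""
    path = []
    while feature is not None:
        path.append(feature)
        feature = parent[feature]
    return path[::-1]

def build_index(predominant):
    """Map each record id to the hierarchy path of its predominant feature."""
    return {rec: path_from_top(feat) for rec, feat in predominant.items()}

def find_records(index, feature):
    """Return ids of records whose predominant feature is `feature` or lies below it."""
    return sorted(rec for rec, path in index.items() if feature in path)

# Assumed input: predominant features already selected for three records.
index = build_index({"doc1": "dog", "doc2": "tree", "doc3": "animal"})
hits = find_records(index, "animal")
```

Here a query for "animal" returns both the record classified directly under "animal" and the one classified under "dog", because the stored position places "dog" below "animal".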
More specifically, the step of selecting a node in the hierarchy corresponding to a predominant feature includes:
comparing each of the selected key features in the record with features in the hierarchy;
recording numbers of occurrences and their node positions for matches between key features of the record and features of the hierarchy;
and determining which node to select, based on whether the node is general enough to encompass a large proportion of the matches, but is not so general as to be too distant from the locations of the matches in the hierarchy.
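The comparing and recording steps above can be sketched as a simple tally. This is only an illustrative sketch: the hierarchy contents and the name `count_matches` are assumptions, and exact string equality stands in for whatever matching rule an implementation would actually use.

```python
from collections import Counter

# Hypothetical hierarchy: each feature maps to its parent (None marks the top node).
PARENT = {
    "entity": None,
    "animal": "entity",
    "plant": "entity",
    "dog": "animal",
    "cat": "animal",
    "tree": "plant",
}

def count_matches(key_features, parent=PARENT):
    """Count occurrences of each hierarchy feature among a record's key features.

    Key features that do not appear in the hierarchy are ignored.
    """
    return Counter(f for f in key_features if f in parent)

# "car" is not a node of the hierarchy, so it contributes no match.
matches = count_matches(["dog", "dog", "cat", "tree", "car"])
```

The resulting counter is keyed by node, so each count is recorded together with its position in the hierarchy.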
Further, the step of determining which node to select includes:
computing a coverage value for each branch of the hierarchy, wherein the coverage value is given by a total of all matches recorded at nodes below and connected to the branch;
computing an anticoverage value for each branch of the hierarchy, wherein the anticoverage value is given by the difference between the total number of matches in the hierarchy and the coverage value for the branch;
and computing distance values for nodes of the hierarchy.
The distance value for any node is a function of the coverage and anticoverage values of branches traversed between a top node and the node for which the distance value is computed. The node selected is the one with the lowest distance value.
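The coverage and anticoverage definitions above can be sketched as follows. The hierarchy, the match counts and the function name are hypothetical; each branch is identified here by the node at its lower end, which is a representation chosen for the sketch rather than one stated in the disclosure.

```python
def branch_coverage(children, matches, top):
    """Return (coverage, anticoverage, total); each branch is keyed by its lower node."""
    coverage = {}

    def subtotal(node):
        # Matches recorded at this node plus all matches recorded below it.
        count = matches.get(node, 0)
        for child in children.get(node, []):
            coverage[child] = subtotal(child)
            count += coverage[child]
        return count

    total = subtotal(top)
    # Anticoverage: matches in the hierarchy that the branch does not cover.
    anticoverage = {node: total - cov for node, cov in coverage.items()}
    return coverage, anticoverage, total

# Hypothetical hierarchy (node -> list of child nodes) and match counts.
CHILDREN = {"entity": ["animal", "plant"], "animal": ["dog", "cat"], "plant": ["tree"]}
MATCHES = {"dog": 2, "cat": 1, "tree": 1}

coverage, anticoverage, total = branch_coverage(CHILDREN, MATCHES, "entity")
```

With these assumed counts, the branch leading to "animal" covers three of the four matches and has an anticoverage of one, while the branch leading to "plant" covers one match and has an anticoverage of three.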
Even more specifically, the step of computing distance values includes:
assigning a relatively large distance value to the top node of the hierarchy;
computing a distance value for a node that is connected to the top node through a branch, by reducing the top node distance value by the coverage value of the branch, and increasing the result by the anticoverage value of the branch multiplied by a factor ‘a,’ where ‘a’ is greater than unity;
and computing distance values for other nodes in the hierarchy in a similar manner, wherein the distance value for a node at the lower end of a branch is obtained by reducing the distance value of the node at the upper end by the coverage value of the branch, and increasing the result by the anticoverage of the branch multiplied by the factor ‘a.’
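The recurrence described above, in which the distance of a lower node equals the distance of the upper node minus the coverage of the branch plus ‘a’ times its anticoverage, can be sketched as follows. The starting value of 100 for the top node, the factor a = 2, the hierarchy and the coverage figures are all assumptions chosen for illustration; the text requires only that the top value be relatively large and that ‘a’ exceed unity.

```python
def distances(children, cov, total, top, a=2.0, top_distance=100.0):
    """Return {node: distance} using d(lower) = d(upper) - coverage + a * anticoverage."""
    dist = {top: top_distance}
    stack = [top]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            anti = total - cov[child]  # anticoverage of the branch into `child`
            dist[child] = dist[node] - cov[child] + a * anti
            stack.append(child)
    return dist

# Hypothetical hierarchy and branch coverage values (branches keyed by lower node).
CHILDREN = {"entity": ["animal", "plant"], "animal": ["dog", "cat"], "plant": ["tree"]}
COV = {"animal": 3, "plant": 1, "dog": 2, "cat": 1, "tree": 1}

dist = distances(CHILDREN, COV, total=4, top="entity")
best = min(dist, key=dist.get)  # node with the lowest distance value
```

In this example "animal" receives 100 - 3 + 2 × 1 = 99, the lowest value, and "dog" receives 99 - 2 + 2 × 2 = 101. The more specific node is penalized because the branch into it leaves two of the four matches uncovered.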
Basically, distance values are computed for successive nodes beginning at the top of the hierarchy. After assigning a distance value to the top node, and also after computing a distance value for any other node, the method includes the additional step of selecting a maximum coverage branch to a next lower node for which a distance value will be computed. The branch selected has a larger coverage value than all other branches at an equal level in the hierarchy. Distance values need to be computed only for nodes along a path that traverses the maximum coverage branch through each level of the hierarchy.
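A minimal sketch of this single-path descent follows. It assumes that "branches at an equal level" means the branches leaving the node currently being considered, and that ties are broken by taking the first branch listed. Both are assumptions of the sketch, as are the hierarchy, the coverage figures (worked out by hand for this hypothetical hierarchy), the starting value of 100 and the factor a = 2.

```python
def predominant_node(children, cov, total, top, a=2.0, top_distance=100.0):
    """Follow the maximum-coverage branch at each level; return (node, distance) of the minimum."""
    node, dist = top, top_distance
    best_node, best_dist = node, dist
    while children.get(node):
        # Select the branch with the largest coverage among those leaving `node`.
        child = max(children[node], key=lambda c: cov[c])
        dist = dist - cov[child] + a * (total - cov[child])
        node = child
        if dist < best_dist:
            best_node, best_dist = node, dist
    return best_node, best_dist

# Hypothetical hierarchy and branch coverage values (branches keyed by lower node).
CHILDREN = {"entity": ["animal", "plant"], "animal": ["dog", "cat"], "plant": ["tree"]}
COV = {"animal": 3, "plant": 1, "dog": 2, "cat": 1, "tree": 1}

result = predominant_node(CHILDREN, COV, total=4, top="entity")
```

Only three distance values are computed here (for "entity", "animal" and "dog") instead of one per node, and the lowest of them again belongs to "animal".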
The invention may also be defined as a system for classifying database records in accordance with a predominant feature. Briefly, and in general terms, the system comprises at least one thesaurus-like tree structure defining a hierarchy of features, the tree structure having nodes and branches, and each node being representative of a feature in the hierarchy; a database of records, each of which is to be classified in accordance with a predominant feature; and a system processor coupled to the database of records and to the thesaurus-like tree structure. The system processor includes means for identifying for each database record a plurality of key features that characterize the record, means for selecting from the plurality of key features a node of the hierarchy corresponding to a predominant feature that best characterizes the database record, and means for associating the predominant feature and its position in the hierarchy with the database record. Database records are then accessible by their predominant features rather than by specific content.
The means for selecting a node in the hierarchy corresponding to the predominant feature includes means for comparing each of the selected key features in the record with features in the hierarchy; means for recording numbers of occurrences and their node positions for matches between key features of the record and features of the hierarchy; and means for determining which node to select, based on whether the node is general enough to encompass a large proportion of the matches, but is not so general as to be too distant from the locations of the matches in the hierarchy. More specifically, the means for determining which node to select includes means for computing a coverage value for each branch of the hierarchy, wherein the coverage value is given by a total of all matches recorded at nodes below and connected to the branch; means for computing an anticoverage value for each branch of the hierarchy, wherein the anticoverage value is given by the difference between the total number of matches in the hierarchy and the coverage value for the branch; means for computing distance values for nodes of the hierarchy, wherein the distance value for any node is a function of the coverage and anticoverage values of branches traversed between a top node and the node for which the distance value is computed; and means for selecting the node with the lowest distance value.
In the system as disclosed, the means for computing distance values includes means for assigning a relatively large distance value to the top node of the hierarchy; and means for computing distance values for other nodes, first for a node that is connected to the top node through a branch, by reducing the top node distance value by the coverage value of the branch, and increasing the resul

