Data processing: database and file management or data structures – Database design – Data structure types
Reexamination Certificate
2000-07-07
2004-05-18
Vu, Kim (Department: 2172)
Data processing: database and file management or data structures
Database design
Data structure types
C707S793000, C715S252000
Reexamination Certificate
active
06738759
ABSTRACT:
TECHNICAL FIELD OF THE INVENTION
The present invention relates generally to similarity search engines. More particularly, the invention is a computer-implemented similarity search system and method that allows for efficiently searching very large source databases for similarity search criteria specified in a query. A database, a document or set of documents comprising the data to be searched are translated into a hierarchical database, document or set of documents of root, interior and leaf nodes that correspond to the categories that a user wants to search. The leaf nodes contain the data items to be searched. A unique identifier, called a pointer, is assigned to each unique data item within a specified context. An index associates each child node with its parent. During the similarity search, data items within the leaf nodes are assigned a score that is a quantitative measurement of the similarity between the object and the search criteria. A scoring algorithm, which may be selected by the user, assigns the similarity score. The data and indexing structures provide for efficient similarity searching and the quick reporting of results because the data is organized by the categories a user wants to search. Assigning a pointer or unique identifier only to unique data items causes the memory requirements to be reduced and optimizes the search time. Depending upon the particular set of documents being searched and the search criteria, significant improvements in speed and memory requirements are achieved. Leaf node scores are combined into its parent scores according to an algorithm, which may be specified by the user. Leaf node or child scores within a parent may be weighted so that certain child categories may be given more importance when leaf nodes (child) scores are combined into parent scores. The invention can be utilized for searching most types of large-scale databases.
BACKGROUND
Modern information resources, including data found on global information networks, form huge databases that need to be searched to extract useful information. Existing database searching technology provides the capability to search through these databases. However, traditional database search methods usually provide precise results, that is, either an object in the database meets the search criteria and belongs to the results set or it does not. However, in many cases it is desirable to know how similar an object is to the search criteria, not just whether the object matches the search criteria. This is especially important if the data in the database to be searched is incomplete, inaccurate or contains errors such as data entry errors or if confidence in the search criteria is not great. It is also important to be able to search for a value or item in a database within its particular data context to reduce the number of irrelevant “matches” reported by a database searching program. Traditional search methods of exact, partial and range retrieval paradigms fail to satisfy the content-based retrieval needs of many emerging data processing applications.
Existing database searching technology is also constrained by another factor: the problem of multiple data sources. Data relevant to investigations is often stored in multiple databases or supplied by third party companies. Combining the data by incorporating data from separate sources is usually an expensive and time consuming systems integration task. However, if a consistent ranking or scoring scheme is used for identifying how similar an object is to the search criteria, then that same search criteria can be used to rank other objects in the same search categories in multiple databases. By using a consistent ranking or scoring scheme, it is possible not only to know how similar the object is to the search criteria, but also how similar objects are to each other and then be able to choose the best match or matches for the search criteria from multiple database sources.
Existing database searching, including similarity searching technologies for searching databases, particularly-very large databases, may also take an unacceptably long period of time to complete any search due to the large quantity of data and the particular search techniques used. The amount and quantity of data being searched today, both in traditional database searching and when searching for information located throughout or accessed through a global communications network such as the Internet requires optimized search techniques to provide users with fast as well as accurate search results.
SUMMARY
The present invention, which is a system and method for performing similarity searching, solves the aforementioned needs.
The present invention is a computer implemented method for optimizing similarity searching while detecting and scoring similarities between documents and a search criteria using a set of hierarchical documents having root, interior and leaf nodes where the leaf nodes contain data items. Using assigned unique identifiers assigned to each unique data item contained within each leaf node in the set of documents, a data item score is computed for each data item in each leaf node that represents a similarity between the data item in the leaf node and the search criteria.
A root node is a node that has no parent node and is a parent of at least one child node selected from the group consisting of interior nodes and leaf nodes. An interior node has a parent node and the interior node is itself a parent node having at least one child node selected from the group consisting of interior nodes and leaf nodes. A leaf node is a child node that has no children and the leaf node has a parent node selected from the group consisting of root nodes and interior nodes.
A parent node score is computed by combining the data item scores for all its child nodes. The data item score is a number that represents how similar and dissimilar the data item is to the search criteria. The method further comprises computing an interior node score for all interior nodes by combining the scores for all the child nodes of the interior nodes. The method further comprises computing a root node score by combining the interior node scores for the children of the root node.
The method further comprises a schema having a hierarchy, wherein the schema describes an organization of the set of hierarchical documents. The schema defines a hierarchy of parent and child nodes within the set of hierarchical documents. A node label is assigned to each node in the schema.
The method further comprises converting at least one document into at least one hierarchical document having root, interior and leaf nodes, wherein said root, interior and leaf nodes correspond to the nodes of the schema. The method further comprises converting at least one document into at least one hierarchical document having at least one root node and at least one leaf node, wherein said root and leaf nodes correspond to the nodes of the schema. Converting the documents comprises allowing a user to map between the schema and documents in a preexisting database to form the set of hierarchical documents. The preexisting database may be a relational database. The hierarchical documents are stored in Extensible Markup Language (XML).
The assigned unique identifiers for each unique data item contained within each leaf node are unique within a selected context in the set of hierarchical documents. The context for a node may be its position in the schema or the set of node labels that comprise its position in the schema.
The method further comprises reserving space in a score buffer for each assigned unique identifier and associating the score for the data item for each assigned unique identifier with its reserved space in the score buffer. The score buffer may be indexed by the data item's assigned unique identifier. The assigned unique identifier may be the same for all identical data items for a selected context within the hierarchical database. The context may be selected from the group consisting of its position in the schema and the set
Clay Matthew J.
Wheeler David B.
Hamilton Monplaisir
Infoglide Corporation, Inc.
Taylor Russell & Russell P.C.
Vu Kim
LandOfFree
System and method for performing similarity searching using... does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with System and method for performing similarity searching using..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and System and method for performing similarity searching using... will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-3236786