System and method for performing similarity searching using...

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

Rate now

[ 0.00 ] – not rated yet Voters 0 Comments 0

Details System and method for performing similarity searching using... System and method for performing similarity searching using...

: 2000-07-07
: 2004-05-18
: Vu, Kim (Department: 2172)
: Data processing: database and file management or data structures
: Database design
: Data structure types

: C707S793000, C715S252000
: Reexamination Certificate
: active
: 06738759
: ABSTRACT:

TECHNICAL FIELD OF THE INVENTION
The present invention relates generally to similarity search engines. More particularly, the invention is a computer-implemented similarity search system and method that allows for efficiently searching very large source databases for similarity search criteria specified in a query. A database, a document or set of documents comprising the data to be searched are translated into a hierarchical database, document or set of documents of root, interior and leaf nodes that correspond to the categories that a user wants to search. The leaf nodes contain the data items to be searched. A unique identifier, called a pointer, is assigned to each unique data item within a specified context. An index associates each child node with its parent. During the similarity search, data items within the leaf nodes are assigned a score that is a quantitative measurement of the similarity between the object and the search criteria. A scoring algorithm, which may be selected by the user, assigns the similarity score. The data and indexing structures provide for efficient similarity searching and the quick reporting of results because the data is organized by the categories a user wants to search. Assigning a pointer or unique identifier only to unique data items causes the memory requirements to be reduced and optimizes the search time. Depending upon the particular set of documents being searched and the search criteria, significant improvements in speed and memory requirements are achieved. Leaf node scores are combined into its parent scores according to an algorithm, which may be specified by the user. Leaf node or child scores within a parent may be weighted so that certain child categories may be given more importance when leaf nodes (child) scores are combined into parent scores. The invention can be utilized for searching most types of large-scale databases.
BACKGROUND
Modern information resources, including data found on global information networks, form huge databases that need to be searched to extract useful information. Existing database searching technology provides the capability to search through these databases. However, traditional database search methods usually provide precise results, that is, either an object in the database meets the search criteria and belongs to the results set or it does not. However, in many cases it is desirable to know how similar an object is to the search criteria, not just whether the object matches the search criteria. This is especially important if the data in the database to be searched is incomplete, inaccurate or contains errors such as data entry errors or if confidence in the search criteria is not great. It is also important to be able to search for a value or item in a database within its particular data context to reduce the number of irrelevant “matches” reported by a database searching program. Traditional search methods of exact, partial and range retrieval paradigms fail to satisfy the content-based retrieval needs of many emerging data processing applications.
Existing database searching technology is also constrained by another factor: the problem of multiple data sources. Data relevant to investigations is often stored in multiple databases or supplied by third party companies. Combining the data by incorporating data from separate sources is usually an expensive and time consuming systems integration task. However, if a consistent ranking or scoring scheme is used for identifying how similar an object is to the search criteria, then that same search criteria can be used to rank other objects in the same search categories in multiple databases. By using a consistent ranking or scoring scheme, it is possible not only to know how similar the object is to the search criteria, but also how similar objects are to each other and then be able to choose the best match or matches for the search criteria from multiple database sources.
Existing database searching, including similarity searching technologies for searching databases, particularly-very large databases, may also take an unacceptably long period of time to complete any search due to the large quantity of data and the particular search techniques used. The amount and quantity of data being searched today, both in traditional database searching and when searching for information located throughout or accessed through a global communications network such as the Internet requires optimized search techniques to provide users with fast as well as accurate search results.
SUMMARY
The present invention, which is a system and method for performing similarity searching, solves the aforementioned needs.
The present invention is a computer implemented method for optimizing similarity searching while detecting and scoring similarities between documents and a search criteria using a set of hierarchical documents having root, interior and leaf nodes where the leaf nodes contain data items. Using assigned unique identifiers assigned to each unique data item contained within each leaf node in the set of documents, a data item score is computed for each data item in each leaf node that represents a similarity between the data item in the leaf node and the search criteria.
A root node is a node that has no parent node and is a parent of at least one child node selected from the group consisting of interior nodes and leaf nodes. An interior node has a parent node and the interior node is itself a parent node having at least one child node selected from the group consisting of interior nodes and leaf nodes. A leaf node is a child node that has no children and the leaf node has a parent node selected from the group consisting of root nodes and interior nodes.
A parent node score is computed by combining the data item scores for all its child nodes. The data item score is a number that represents how similar and dissimilar the data item is to the search criteria. The method further comprises computing an interior node score for all interior nodes by combining the scores for all the child nodes of the interior nodes. The method further comprises computing a root node score by combining the interior node scores for the children of the root node.
The method further comprises a schema having a hierarchy, wherein the schema describes an organization of the set of hierarchical documents. The schema defines a hierarchy of parent and child nodes within the set of hierarchical documents. A node label is assigned to each node in the schema.
The method further comprises converting at least one document into at least one hierarchical document having root, interior and leaf nodes, wherein said root, interior and leaf nodes correspond to the nodes of the schema. The method further comprises converting at least one document into at least one hierarchical document having at least one root node and at least one leaf node, wherein said root and leaf nodes correspond to the nodes of the schema. Converting the documents comprises allowing a user to map between the schema and documents in a preexisting database to form the set of hierarchical documents. The preexisting database may be a relational database. The hierarchical documents are stored in Extensible Markup Language (XML).
The assigned unique identifiers for each unique data item contained within each leaf node are unique within a selected context in the set of hierarchical documents. The context for a node may be its position in the schema or the set of node labels that comprise its position in the schema.
The method further comprises reserving space in a score buffer for each assigned unique identifier and associating the score for the data item for each assigned unique identifier with its reserved space in the score buffer. The score buffer may be indexed by the data item's assigned unique identifier. The assigned unique identifier may be the same for all identical data items for a selected context within the hierarchical database. The context may be selected from the group consisting of its position in the schema and the set

Affiliated with

Clay Matthew J.

Inventor

[ 0.00 ] – not rated yet Voters 0 Comments 0

Wheeler David B.

Inventor

[ 0.00 ] – not rated yet Voters 0 Comments 0

Also associated with

Hamilton Monplaisir

Examiner

[ 0.00 ] – not rated yet Voters 0 Comments 0

Infoglide Corporation, Inc.

Corporate Assignee

[ 0.00 ] – not rated yet Voters 0 Comments 0

Taylor Russell & Russell P.C.

Law Firm

[ 0.00 ] – not rated yet Voters 0 Comments 0

Vu Kim

Examiner

[ 0.00 ] – not rated yet Voters 0 Comments 0

LandOfFree

Say what you really think

Search LandOfFree.com for the USA inventors and patents. Rate them and share your experience with other people.

Rating

System and method for performing similarity searching using... does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with System and method for performing similarity searching using..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and System and method for performing similarity searching using... will most certainly appreciate the feedback.

Rate now

Comments { 0 }

Profile ID: LFUS-PAI-O-3236786

All data on this website is collected from public sources. Our data reflects the most accurate information available at the time of publication.

Canada

Charities
Companies
MP Candidates
Patents
Employee Salary Disclosure

World

Places of the World
Scientific Papers

United States

Banks
Companies
Counties
Patents
Employee Salary Disclosure