Data processing: database and file management or data structures – Database design – Data structure types
Reexamination Certificate
1999-09-22
2003-09-09
Vu, Kim (Department: 2172)
Data processing: database and file management or data structures
Database design
Data structure types
C707S793000
Reexamination Certificate
active
06618727
ABSTRACT:
TECHNICAL FIELD OF THE INVENTION
The present invention relates generally to similarity search engines. More particularly, the invention is a computer-implemented similarity search system and method that allows for efficiently searching very large source databases for similarity search criteria specified in a query. A database to be searched, called the source database, is translated into a hierarchical database having objects composed of children and parent objects that correspond to the categories that a user wants to search. Data to be searched in the hierarchical database is organized into a data structure according to the categories the user wants to search and is given a relative identifier. An indexing structure is created that associates parent and children objects. Children objects are assigned a score that is a quantitative measurement of the similarity between the object and the search criteria. A scoring algorithm, which may be selected by the user, assigns the similarity score. The data and indexing structures provides for efficient similarity searching and the quick reporting of results because searching is done using the data structure categories. Children scores are combined into parent scores according to an algorithm specified by the user. Children scores within a parent may be weighted so that certain child categories may be given more importance when child scores are combined into parent scores. The invention can be utilized for searching most types of large-scale databases.
BACKGROUND
Modern information resources, including data found on global information networks, form huge databases that need to be searched to extract useful information. Existing database searching technology provides the capability to search through these databases. However, traditional database search methods usually provide precise results, that is either an object in the database meets the search criteria and belongs to the results set or it does not. However, in many cases it is desirable to know how similar an object is to the search criteria, not just whether the object matches the search criteria. This is especially important if the data in the database to be searched is incomplete, inaccurate or contains errors such as data entry errors or if confidence in the search criteria is not great. It is also important to be able to search for a value or item in a database within its particular data context to reduce the number of irrelevant “matches” reported by a database searching program. Traditional search methods of exact, partial and range retrieval paradigms fail to satisfy the content-based retrieval needs of many emerging data processing applications.
Existing database searching technology is also constrained by another factor: the problem of multiple data sources. Data relevant to investigations is often stored in multiple databases or supplied by third party companies. Combining the data by incorporating data from separate sources is usually an expensive and time consuming systems integration task. However, if a consistent ranking or scoring scheme is used for identifying how similar an object is to the search criteria, then that same search criteria can be used to rank other objects in the same search categories in multiple databases. By using a consistent ranking or scoring scheme, it is possible not only to know how similar the object is to the search criteria, but also how similar objects are to each other and then be able to choose the best match or matches for the search criteria from multiple database sources.
SUMMARY
The present invention, which is a system and method for performing similarity searching, solves the aforementioned needs.
The present invention is a computer implemented method for detecting and scoring similarities between documents in a source database and a search criteria. It uses a hierarchy of parent and child categories to be searched, linking each child category with its parent category. Source database documents are converted into hierarchical database documents having parent and child objects with data values organized using the hierarchy of parent and child categories to be searched. For each child object, a child object score is calculated that is a quantitative measurement of the similarity between the hierarchical database documents and the search criteria and a parent object score are computed from its child object scores. Creating a hierarchy of parent and child categories further comprises assigning an entry in a data structure called a data band to each child category that contains no children categories. Linking each child category with its parent category further comprises assigning an index to connect each child category with its parent category. Converting the source database into a hierarchical database further comprises populating each data band with data values from each child object that contains no children. Each data value is assigned a relative identifier. Calculating a score further comprises, for each data value in the data band that is assigned a relative identifier, assigning a number for the score that represents how similar and dissimilar the value is to the search criteria. The search criteria are contained in a query, which may be generated by a user.
The source database may be a relational database. The hierarchical database may be created by a user mapping between the schema and data in a preexisting source database. The hierarchical database may be stored in a markup software language. The markup language may be Extensible Markup Language (XML) or Standard Generalized Markup Language (SGML). The similarity search criteria as specified by the user in the query is also translated into a markup language. Calculating a similarity score comprises comparing the search criteria saved in a markup software language to the data values in the data bands of the hierarchical database. The score calculated may be saved in a score buffer indexed by the relative identifier for the data value. A scoring algorithm may be used to assign a number for the score. Determining a score for each child object comprises, for each data value in the data band that is assigned a relative identifier, using a scoring algorithm to assign a number that represents how similar and dissimilar the value is to the search criteria and saving the score in a score buffer, which may be indexed by the relative identifier for the data value. Alternatively, the scoring method may be non-algorithmic. If the scoring is not algorithmic and if the data value in the data band matches the search criteria, the score number assigned is a value that represents a match between the data value and the search criteria.
The schema may further comprise a hierarchy of parent and child categories to be searched, a scoring method for calculating the score for each child object, a weighting for each child object when there are multiple child objects within a parent object and a parent score computing algorithm for computing a parent object score from the child object scores. The schema may be defined by a user using a graphical user interface or may be previously defined and stored in a database. The saved schema may be retrieved from a database containing stored schemas and used for another similarity search. The schema may further comprise specifying the maximum number of values in the data band on which to perform scoring and score summing and the type and content of a result report generated after the computing of the parent object scores has been completed. The result report may be displayed to the user on a client computer having a graphical user interface.
Schema commands may be compiled by a similarity search engine, relative identification table for the schema created, and data bands to represent the data structure and relation bands created to represent the indexing structure. A document table is created to store user documents when they are imported into the system to be searched. Relative identifiers are assigned to data values in the data bands and to the parent obje
Clay Matthew J.
Wheeler David B.
Fleurantin Jean Bolte
Infoglide Corporation
Taylor Russell & Russell P.C.
Vu Kim
LandOfFree
System and method for performing similarity searching does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with System and method for performing similarity searching, we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and System and method for performing similarity searching will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-3054176