Data processing: database and file management or data structures – Database design – Data structure types
Reexamination Certificate
1998-12-31
2002-11-12
Mizrahi, Diane D. (Department: 2175)
C707S793000
active
06480835
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to the field of computer systems. More specifically, the present invention relates to information retrieval (IR) technology, in particular to searching over multiple filtering criteria such as both text and topic criteria.
2. Background Information
Modern computer technology allows databases to incorporate ever greater amounts of information. In order to take full advantage of these advances, methods must be developed to allow a user to quickly, easily and inexpensively identify, retrieve, and order information in a database. Effective IR requires that the search be inexpensive and accessible and that the query results be presented in a manner that facilitates searching.
Conventional IR methods for text-based documents rely on large, detailed representations of document sets. Documents are represented by an index file that is derived from the terms of the documents through tokenization, stopping, stemming, elimination of capitalization, and inversion. In stopping, common words are eliminated from the document token stream. Tokens which are to be stopped are the most common words in a given language, such as “a” and “the.” Stemming strips tokens of certain suffixes such as “ing”, “ation” and indications of plurality. Thus “Work”, “working” and “works” are represented as “work.” Each term in such a full text index (“FTI”) serves as an index to the documents in which it appears.
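The indexing steps described above can be illustrated with a minimal sketch. It is not the method of any particular system: the stop list, the suffix list, the three-character minimum stem length, and all function names are assumptions chosen only to reproduce the “Work”, “working”, “works” example.

```python
import re
from collections import defaultdict

# Assumed lists: the text names only "a", "the", "ing", "ation", and plurals.
STOP_WORDS = {"a", "the"}
SUFFIXES = ("ation", "ing", "s")


def tokenize(text):
    """Split text into alphabetic tokens."""
    return re.findall(r"[A-Za-z]+", text)


def stem(token):
    """Strip the first matching suffix, keeping a stem of at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def index_terms(text):
    """Tokenize, eliminate capitalization, stop, and stem."""
    terms = []
    for token in tokenize(text):
        token = token.lower()
        if token in STOP_WORDS:
            continue
        terms.append(stem(token))
    return terms


def build_index(documents):
    """Inversion: map each term to the sorted ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in index_terms(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


documents = {
    1: "Work on the index",
    2: "She is working",
    3: "A search works",
}
index = build_index(documents)
print(index["work"])  # [1, 2, 3]
```

All three documents end up under the single term “work”, and “the” never reaches the index, which also shows the kind of information these steps discard.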
A user searches FTIs by creating term-based queries for documents that include specified keywords. The searches may include term position information. Some methods return all documents that contain the specified terms and fit the specified term location criteria. Other methods calculate a similarity function between the terms in a query and the terms in each document. Such methods may include a document in a search result as being relevant, even if the document does not fit all the query criteria, as long as the similarity value is greater than a threshold.
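A similarity-based search of this kind might look like the following sketch. The text does not say which similarity function is used; the fraction of query terms found in the document is an assumption made here for simplicity, and the names `similarity` and `search` are hypothetical.

```python
def similarity(query_terms, doc_terms):
    """Assumed measure: fraction of distinct query terms that occur in the document."""
    query = set(query_terms)
    if not query:
        return 0.0
    return len(query & set(doc_terms)) / len(query)


def search(query_terms, documents, threshold):
    """Return (doc_id, score) pairs whose score exceeds the threshold, best first."""
    scored = (
        (doc_id, similarity(query_terms, terms))
        for doc_id, terms in documents.items()
    )
    hits = [(doc_id, score) for doc_id, score in scored if score > threshold]
    return sorted(hits, key=lambda pair: (-pair[1], pair[0]))


documents = {
    1: ["patent", "search", "index"],
    2: ["patent", "search"],
    3: ["weather", "report"],
}
results = search(["patent", "search", "index"], documents, threshold=0.5)
print([doc_id for doc_id, _ in results])  # [1, 2]
```

Document 2 lacks the term “index” but is still returned, because its score of 2/3 exceeds the threshold of 0.5.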
Certain FTIs preserve information on the location of terms within documents. This allows users to specify adjacency criteria when searching the document set; i.e., to specify that documents matching a query include instances of terms which are adjacent or are in the same sentence, for example.
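As a rough illustration of how stored term locations support adjacency criteria, the sketch below keeps a list of positions for each term in each document. Whitespace tokenization and the names `build_positional_index` and `adjacent` are assumptions, not details taken from any described system.

```python
from collections import defaultdict


def build_positional_index(documents):
    """Map each lowercased whitespace token to {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return index


def adjacent(index, first, second):
    """Return sorted ids of documents where `second` immediately follows `first`."""
    first_postings = index.get(first, {})
    second_postings = index.get(second, {})
    hits = []
    for doc_id, positions in first_postings.items():
        following = set(second_postings.get(doc_id, []))
        if any(position + 1 in following for position in positions):
            hits.append(doc_id)
    return sorted(hits)


documents = {
    1: "full text index of documents",
    2: "index of full length text",
}
index = build_positional_index(documents)
print(adjacent(index, "full", "text"))  # [1]
```

Both documents contain “full” and “text”, but only document 1 has them next to each other. Every occurrence of every term carries a position, which is where much of the storage cost of such an index comes from.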
Such FTI methods require large amounts of storage space. Despite the use of stemming and stopping, virtually every word in the document set must be represented in the index with information on the location of each occurrence of the term in each document in the document set. An FTI may be 50-300% of the size of the document set itself. Generation and maintenance of an index often requires dedicated computers having processing and storage capacities whose cost is beyond the reach both of those maintaining and those accessing the database. Such indexed document sets are typically available only through services, such as Lexis®/Nexis® and Dialog®, and the available indexes are limited to those document sets for which the costs can be justified.
Because such indexes are costly to generate and take up a large amount of storage space, searching on these indexes is typically performed at a site remote from the user but near the document set, since transmitting the indexes to a user and storing them there is impractical. In addition, some FTIs contain enough information to reconstruct the original document set, which may be proprietary. Search performance depends on data transmission performance and on the availability and workload of remote processors.
Conventional IR methods have limitations in addition to their resource requirements. By the use of stopping, stemming and elimination of capitalization, these methods eliminate information useful to searching. This information is eliminated in order to genericize terms entered as queries and to lower the storage costs of the indexes. While these methods allow for searching based on phrases comprising more than one token, these phrases may not include information eliminated by stopping, stemming and elimination of capitalization.
Conventional IR methods often require a user to enter an exact representation of a phrase and all its variants (i.e., synonyms) in each search query. This is time-consuming for the user, and since a user will typically not have the time to contemplate the existence of such variants, documents containing variants of a phrase may not be found. Furthermore, due to the loss of information as a result of stopping, stemming and capitalization elimination, compound terms (i.e., phrases) cannot be fully defined. Few conventional IR methods allow a definition of a compound term or of the variants of a term to be created prior to indexing using that term. For example, conventional IR methods will not allow for the equivalence of “Federal Bureau of Investigation”, “FBI” and “Federal Bureau” to be defined before indexing.
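To make the missing capability concrete, here is a sketch of what defining such an equivalence before indexing could look like. It is purely illustrative and is not drawn from the text: the `VARIANTS` table, the choice of “FBI” as the canonical form, and the `canonicalize` helper are all hypothetical.

```python
# Hypothetical variant table: each surface form maps to one canonical term.
VARIANTS = {
    "Federal Bureau of Investigation": "FBI",
    "Federal Bureau": "FBI",
    "FBI": "FBI",
}


def canonicalize(text, variants=VARIANTS):
    """Replace every variant with its canonical term, longest variant first."""
    for variant in sorted(variants, key=len, reverse=True):
        text = text.replace(variant, variants[variant])
    return text


sentence = "The Federal Bureau of Investigation and the Federal Bureau"
print(canonicalize(sentence))  # The FBI and the FBI
```

Matching is case-sensitive and runs on the raw text, so the phrase keeps the stop word “of” and its capitalization up to the point where it is recognized. Replacing the longest variant first prevents “Federal Bureau” from matching inside the full name.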
Conventional IR methods conduct searching over the text of a document set, using combinations of terms as queries. Conventional IR methods allow for searching and categorization by topic (an area of subject matter or any other categorization); however, such methods require that the topics be defined after the documents are indexed.
Some search methods include pre-defined topic definitions as well as term specifications. However, such relevancy determinations typically contain terms which are added to a text search query, where the terms are selected to gather documents relevant to the topic. The topic itself is not evaluated relative to the documents.
Because of the resource requirements of conventional IR methods, and because of their limitations when using topics, it is difficult to integrate these methods with graphical searching and graphical query result representation.
Current IR methods do not easily allow for a document index to be filtered prior to use. Thus the full index must usually be accessed by a user, who may be interested in only a small part of the index, and who may not wish to support the resource requirements of the full index.
Current search methods do not allow a user to search using different processors having different capabilities, or to store the state of a search for later use. When a user searches using conventional methods, the search domain—the set of documents over which the user searches (or the set of references to these documents)—is not adjustable at the client level. In an effort to adjust the number of documents returned and narrow a search over a series of iterations, the user often enters an entirely new search for every iteration, replicating information from a previous search. Storing the state of a user search (for example, a set of documents to be searched) eliminates this problem. While current commercial search engines allow for a search state to be maintained, this state is maintained at the server processor, which must devote large amounts of storage resources to maintain process states for the numerous users serviced by the server processor. A user must communicate with a server processor to choose between search states, and thus is limited by communication delays and server processor workload delays. A need exists for a search method which stores a search state locally to the user.
Therefore, there is a need for a less expensive and more resource-efficient, yet effective, method to search a set of documents. There is a need to perform such a search on a processor which is local to the user and which is remote from the document set. There is a need for an efficient and effective search method which allows users to search across different filtering criteria. There is a need for a search method which may allow for graphical searching and graphical query result representation on a local, user processor. There is no search method allowing for searching based on phrases which include information normally eliminated by stopping, stemming, and elimination of capitalization or searching based on variants of phrases or terms. Th
Intel Corporation
Kenyon & Kenyon
Method and system for searching on integrated metadata