Data processing: database and file management or data structures – Database design – Data structure types
Reexamination Certificate
2000-04-18
2004-04-27
Alam, Shahid (Department: 2172)
C707S793000, C707S793000, C704S001000, C704S009000
active
06728701
ABSTRACT:
FIELD OF THE INVENTION
The present invention relates generally to the field of information retrieval. In particular, the present invention relates to a method and apparatus for selecting the optimal number of terms for retrieving documents using a vector space analysis technique.
BACKGROUND OF THE INVENTION
Advances in electronic storage technology have resulted in the creation of vast databases of documents stored in electronic form. These databases can be accessed from remote locations around the world. As a result, vast amounts of information are available to a wide variety of individuals. Moreover, information is not only stored in electronic form but is also created in electronic form and disseminated throughout the world. Sources for the electronic creation of such information include news services and periodicals, as well as radio, television, and Internet services. All of this information is also made available to the world through computer networks, such as the World Wide Web, on a real-time basis. The problem with this proliferation of electronic information, however, is how any one individual may access useful information in a timely manner.
When a user wants to search for information, she may provide a computer system with a query or description of her interest. For example, a user interested in sports may type the query “basketball from Olympics '96” (the query is a phrase) or may just type the terms “basketball” and “Olympics '96”. Using grammar rules and a lexicon, a search engine may extract the terms from the query and construct its internal representation of the query, called a profile. In the above examples, the profile will contain the terms “basketball” and “Olympics '96”.
Profile training is the process of improving the formulation of a profile using a set of documents that the user considers representative of her interest (training data). The search engine extracts new terms from the training data and adds them to the initial profile. For example, after entering the query “basketball from Olympics '96”, the user may point the system to an article that describes a basketball game from two days ago. From this article, the system extracts the terms “basketball”, “game”, “ball”, and “score”. Then the profile will contain the terms “basketball”, “Olympics '96”, “game”, “ball”, and “score”. The user may not even provide an initial description of her interest (in which case the initial profile is empty), but just give the system some training data to extract terms from. In the above example, without an initial description, the profile will contain only the terms extracted from the article: “basketball”, “game”, “ball”, and “score”.
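The profile-training example above can be sketched in a few lines of Python. This is only an illustration, not the patented method: the function name `train_profile` is hypothetical, and the sketch assumes the terms have already been extracted from the query and from the training article.

```python
def train_profile(initial_terms, extracted_terms):
    """Return the initial profile extended with newly extracted terms.

    Hypothetical helper: order is preserved and duplicates are skipped,
    so a term already in the profile is not added twice.
    """
    profile = list(initial_terms)
    for term in extracted_terms:
        if term not in profile:
            profile.append(term)
    return profile


# Terms taken from the example in the text.
query_terms = ["basketball", "Olympics '96"]
article_terms = ["basketball", "game", "ball", "score"]

with_query = train_profile(query_terms, article_terms)
without_query = train_profile([], article_terms)
```

With an initial description, the trained profile holds the query terms followed by the new article terms; with an empty initial profile, it holds only the article terms.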
The main input components of profile training are the initial description, the training database, the reference database, and the terms extraction algorithm. The training database contains articles that match the user's interest (training data). The terms extraction algorithm extracts terms from the training data and adds them to the profile. The reference database contains information that helps the extraction algorithm decide whether or not to include a term from the training data in the profile. This is necessary because the training data may contain terms that are not related to the user's interest and that, if included in the profile, may cause non-relevant documents to be returned. In the above example, if the training article mentions that a basketball player likes piano, then adding the term “piano” to the profile will make the search engine retrieve articles related to music, which do not correspond to the user's interest in basketball. The assumption in using a reference database is that it allows the terms extraction algorithm to differentiate between the terms in the training data that are linked to the user's interest and the terms that are not.
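The text does not say how the reference database is used to score terms, so the following Python sketch substitutes one common choice purely for illustration: a term's frequency in the training data multiplied by a smoothed inverse document frequency computed over the reference database. The function name `term_weights`, the smoothing, and the toy documents are all assumptions.

```python
import math
from collections import Counter


def term_weights(training_docs, reference_docs):
    """Assumed weighting: training term frequency times smoothed IDF.

    Each document is a list of already-extracted terms. The IDF is computed
    over the reference documents, so a term that is rare in the training
    data receives a lower weight than one that dominates it.
    """
    tf = Counter(term for doc in training_docs for term in doc)
    df = Counter(term for doc in reference_docs for term in set(doc))
    n_ref = len(reference_docs)
    return {
        term: count * math.log((n_ref + 1) / (df[term] + 1))
        for term, count in tf.items()
    }


# Toy data (invented for illustration only).
training = [["basketball", "game", "score", "basketball"],
            ["basketball", "ball", "piano"]]
reference = [["piano", "music"], ["game", "news"],
             ["piano", "concert"], ["basketball", "score"]]

weights = term_weights(training, reference)
```

On this toy data, “basketball” receives a higher weight than the incidental term “piano”, which is the kind of separation the reference database is meant to support.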
Typically, the training documents contain a large number of terms. Selecting only the most representative terms from this set can improve the efficiency and effectiveness of the retrieval process. To make use of training documents, a terms extraction algorithm creates a list of all the terms from the training data. To every term it attaches a weight based on the information in the reference database. The terms are then sorted in decreasing order of their weights, such that the term with the highest weight is first. If the search engine wants to add n terms from the training data to the profile, then the first n terms from the sorted list are added to the profile. Therefore, to train a profile we need two important elements: (1) a method to assign weights to the terms, and (2) a cut-off method to determine the number of terms to be added to a profile.
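The sort-and-cut step described above can be sketched as follows, assuming the weights have already been computed by some method. The function name `select_top_terms` and the example weights are hypothetical; ties are broken alphabetically only to make the result deterministic.

```python
def select_top_terms(weights, n):
    """Return the n highest-weighted terms, in decreasing order of weight.

    `weights` maps each candidate term to its weight. Ties are broken
    alphabetically (an assumption, to keep the ordering deterministic).
    """
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:n]]


# Invented weights for illustration only.
example_weights = {"basketball": 2.7, "ball": 1.6, "game": 0.9, "piano": 0.5}
top_two = select_top_terms(example_weights, 2)
```

Here `n` is the cut-off: choosing it is the second of the two elements named above, and it is independent of how the weights themselves are assigned.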
Many term selection methods based on the vector space and probabilistic models have been proposed in the literature. Regardless of the method, the number of terms in a profile is generally a value for which experiments show reasonable behavior (e.g., the first 30 or 50 terms), and it is constant across all profiles. There are also methods that assign a different number of terms to each profile. One example is to compute the number of terms with the formula 10 + 10 log(T), where T is the number of training documents per profile. However, the number of terms chosen according to such a formula is generally too large, and there are many cases when more flexibility is needed. For example, there are document collections in which many profiles achieve the best average precision with just one term. Another method is to compute the sum of the weights of all the terms and add terms to a profile until a specified fraction of that sum is reached. This approach, again, may not detect the situations in which profiles need very few terms.
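The two per-profile cut-off rules mentioned above can be sketched in Python as follows. The function names are hypothetical, and because the text does not state the base of the logarithm, the sketch takes it as a parameter with base 10 as an assumed default; rounding to the nearest integer is also an assumption.

```python
import math


def cutoff_by_formula(num_training_docs, base=10):
    """Number of terms given by 10 + 10 log(T), rounded to an integer.

    The logarithm base is an assumption; the text does not specify it.
    """
    return round(10 + 10 * math.log(num_training_docs, base))


def cutoff_by_weight_fraction(weights_desc, fraction):
    """Number of top terms needed to reach `fraction` of the total weight.

    `weights_desc` must already be sorted in decreasing order.
    """
    target = fraction * sum(weights_desc)
    running = 0.0
    for count, weight in enumerate(weights_desc, start=1):
        running += weight
        if running >= target:
            return count
    return len(weights_desc)


one_doc = cutoff_by_formula(1)
hundred_docs = cutoff_by_formula(100)
by_fraction = cutoff_by_weight_fraction([5.0, 3.0, 1.0, 1.0], 0.75)
```

Because log(T) is never negative for T of at least 1, the formula always yields ten or more terms, so it cannot select a one-term profile, which is the inflexibility noted above.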
OBJECTS OF THE INVENTION
It is an object of the present invention to provide a method and apparatus to effect improved information extraction from a variety of data sources.
It is a further object of the present invention to improve information extraction by selecting the appropriate number of terms in creating a profile.
It is a further object of the present invention to improve information extraction by minimizing the number of terms in a profile.
REFERENCES:
patent: 5652829 (1997-07-01), Hong
patent: 5659766 (1997-08-01), Saund et al.
patent: 5675819 (1997-10-01), Schuetze
patent: 5694594 (1997-12-01), Chang
patent: 5774888 (1998-06-01), Light
patent: 5777892 (1998-07-01), Nabity et al.
patent: 5778363 (1998-07-01), Light
patent: 5794178 (1998-08-01), Caid et al.
patent: 5806061 (1998-09-01), Chaudhuri et al.
patent: 6026389 (2000-02-01), Nakajima et al.
patent: 6070133 (2000-05-01), Brewster et al.
patent: 6070134 (2000-05-01), Richardson et al.
patent: 6105023 (2000-08-01), Callan
patent: 6115709 (2000-09-01), Gilmour et al.
patent: 6233575 (2001-05-01), Agrawal et al.
patent: 6314420 (2001-11-01), Lang et al.
patent: 6327590 (2001-12-01), Chidlovskii et al.
patent: 6338057 (2002-01-01), Weeks
patent: 6377949 (2002-04-01), Gilmour
patent: 6389412 (2002-05-01), Light
patent: 6473753 (2002-10-01), Katariya et al.
patent: 6510406 (2003-01-01), Marchisio
patent: 6601026 (2003-07-01), Appelt et al.
K. L. Kwok, “A Network Approach to Probabilistic Information Retrieval”, ACM Transactions on Information Systems, vol. 13, no. 3, Jul. 1995, pp. 324-353.
Alam Shahid
Claritech Corporation
Harper Blaney
Jones Day
Ly Anh
Profile ID: LFUS-PAI-O-3201080