Sort system for text retrieval

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

Rate now

  [ 0.00 ] – not rated yet Voters 0   Comments 0

Details

C707S793000

Reexamination Certificate

active

06505198

ABSTRACT:

FIELD OF THE INVENTION
This invention relates to the field of database systems. More particularly, this invention relates to a system for efficient document retrieval from a database.
BACKGROUND OF THE INVENTION
The volume of documents in databases is rapidly expanding. It has been estimated that in excess of 90% of all desired intelligence information is available in documents residing in accessible databases. In order for the information in databases to be useful, a user must be able to locate specific documents relating to specific queries. Existing information retrieval systems use inefficient techniques for returning relevant documents. Generally, the existing techniques miss highly relevant documents associated with a user's query. For example, many systems use Boolean logic-based query execution techniques wherein keywords are connected together via logical or proximity operators. Such a Boolean system merely returns a list of documents, wherein each of the documents includes one of the keyword combinations.
The result of a Boolean search has no quantifiable measure of how similar the returned documents are to the query. Quantifiable measures of similarity are very useful in retrieving documents from databases because the documents can be ranked by the quantitifiable measure. In response to the shortcomings of Boolean type searches, vector space type search systems have been developed. In a vector space type search system, a score related to a particular query is computed for each document in the database. In general, a query “Q” and a document“D” can be compared by computing the shared and disjoint features of the query and the document over an orthogonal space of T terms. In such a comparison, for example, a similarity score can be computed by the following formula:
S

(
Q
i
,
D
j
)
=
Q
i
·
D
j
&LeftBracketingBar;
Q
&RightBracketingBar;
·
&LeftBracketingBar;
D
&RightBracketingBar;
=

k
=
1
t

(
q
i
k
·
d
i
k
)

k
=
1
t

q
i
k
2
·

k
=
1
t

d
i
k
2
Where Qi refers to the terms in the query and Dj refers to the terms in the document.
A quantifiable similarity score for a document and query such as computed above is useful because the scores over various documents for a single query can be compared against each other. However, as is clear from an examination of the scoring formula, this scoring formula is significantly affected by variations in the number of terms per document. Since documents in a database typically have a wide range of sizes (e.g., from less than one page to more than hundreds of pages), the scoring must be normalized by size. One way to normalize the scoring is to divide individual documents into subdocuments having approximately the same size. The scoring is then computed on the basis of the subdocument. Also, scores between subdocuments are then analyzed. In this way, mere differences in a number of terms do not significantly skew the similarity analysis.
There are a variety of ways to create subdocuments from documents. A simple way is to create subdocuments that have precisely the same number of terms. Another way is to create subdocuments that have the same number of sentences. Each of these techniques helps to solve the problem of differing size documents. However, each of these techniques ignores the content of the text of the document in creating the subdocument. A technique for creating subdocuments that both forms comparable size subdocuments and takes account of the content of the subdocuments, is to make the subdocuments correspond to the paragraphs in the document.
The result of calculating similarity scores of text based on subdocuments is that a large list is generated that associates a score with a subdocument identifier and a document identifier. The number of entries on this list is much larger than the number of documents in a database because there may be many subdocuments for each document. Additionally, this list is not sorted relative to the subdocument score. Since the reason for calculating the similarity score is typically to operate on a rank ordered (by score) list of subdocuments, this entire list must be sorted by score before any other analysis can be started. The sort operation is generally an inefficient and time consuming process because a complete sort requires N log N operations where N represents the number of subdocuments.
OBJECTS OF THE INVENTION
It is an object of the present invention to analyze documents in a database.
It is a further object of the present invention to retrieve documents or parts thereof from a database that are the most relevant to a query.
It is still a further object of the present invention to retrieve the most relevant documents or parts thereof without completely sorting all of the documents or parts thereof in a database.
It is still a further object of the present invention to reduce the processing time of the computer in retrieving the most relevant documents or parts thereof from a database.
It is still a further object of the present invention to reduce the number of sort operations required by the computer in retrieving the most relevant documents or parts thereof from a database.


REFERENCES:
patent: 4531186 (1985-07-01), Knapman
patent: 4868733 (1989-09-01), Fujisawa et al.
patent: 5043872 (1991-08-01), Cheng et al.
patent: 5099426 (1992-03-01), Carlgren et al.
patent: 5202840 (1993-04-01), Wong
patent: 5206949 (1993-04-01), Cochran et al.
patent: 5369577 (1994-11-01), Kadashevich et al.
patent: 5375235 (1994-12-01), Berry et al.
patent: 5459861 (1995-10-01), Oda
patent: 5465353 (1995-11-01), Hull et al.
patent: 5519608 (1996-05-01), Kupiec
patent: 5519857 (1996-05-01), Kato et al.
patent: 5576954 (1996-11-01), Driscoll
patent: 5619718 (1997-04-01), Correa
patent: 5696962 (1997-12-01), Kupiec
patent: 5715443 (1998-02-01), Yanagihara et al.
patent: 5745891 (1998-04-01), Minakuchi et al.
patent: 5787001 (1998-07-01), Dietrich, Jr. et al.
patent: 5819273 (1998-10-01), Vora et al.
patent: 5907840 (1999-05-01), Evans
patent: 5926808 (1999-07-01), Evans et al.
patent: 5953728 (1999-09-01), Horowitz et al.
patent: 5995962 (1999-11-01), Horowitz
patent: 6138114 (2000-10-01), Horowitz
Lin et al., “Fast generation of long sorted runs for sorting a large file”, pp. 445-456, Sep. 1991.

LandOfFree

Say what you really think

Search LandOfFree.com for the USA inventors and patents. Rate them and share your experience with other people.

Rating

Sort system for text retrieval does not yet have a rating. At this time, there are no reviews or comments for this patent.

If you have personal experience with Sort system for text retrieval, we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Sort system for text retrieval will most certainly appreciate the feedback.

Rate now

     

Profile ID: LFUS-PAI-O-3012859

  Search
All data on this website is collected from public sources. Our data reflects the most accurate information available at the time of publication.