Reexamination Certificate
2000-01-03
2002-05-28
Alam, Hosain T. (Department: 2771)
Data processing: database and file management or data structures
Database design
Data structure types
C707S793000, C707S793000, C707S793000, C704S009000, C709S219000
active
06397211
ABSTRACT:
FIELD OF THE INVENTION
This invention relates to the field of digital libraries. Specifically, it discloses a system and method for identifying useless documents in a document hit list assembled after performing a search among documents stored in a digital library collection, such that these documents can be filtered and eliminated from the document hit list.
BACKGROUND OF THE INVENTION
The task of finding important and relevant documents in an online document collection is becoming increasingly difficult as documents proliferate. Several techniques have been developed within document retrieval systems to help users focus or direct their queries more effectively, such as the Prompted Query Refinement technique described by Cooper et al. in “Lexical Navigation: Visually Prompted Query Expansion and Refinement,” Proceedings of DIGLIB97, Philadelphia, Pa., July, 1997; and by Cooper et al. in “OBIWAN - A Visual Interface for Prompted Query Refinement,” Proceedings of HICSS-31, Kona, Hi., 1998. These references, and all other references cited in this specification, are herein incorporated by reference in their entirety. However, even after a query has been refined, the problem of having to read too many documents still remains.
To ease the daunting task of having to read too many documents, techniques have been developed for producing rapid displays of the most salient sentences in a document, as described by Neff et al. in “Document Summarization for Active Markup,” Proceedings of the 32nd Hawaii International Conference on System Sciences, Wailea, Hi., January, 1999; and by Neff et al. in “A Knowledge Management Prototype,” Proceedings of NLDB99, Klagenfurt, Austria, 1999. Based on these techniques, users can choose to read or browse through only those documents returned by a search engine which are important to the area they are investigating. However, even with these summarization techniques, document retrieval systems are still not able to predict which documents will be most useful to the user.
Other techniques for solving document retrieval problems entail having the user interact with the document retrieval system. For example, one technique described in the literature entails having a user, in a multi-window document interface, drag terms into search windows and see relationships between terms in a graphical environment. Further, Schatz et al. in “Interactive Term Suggestion for Users of Digital Libraries,” ACM Digital Library Conference, 1996, describe a multi-window interface that offers the user access to a variety of published thesauruses and computed term co-occurrence data. However, these techniques are prone to user error (e.g., the user selects a term that is not pertinent to the investigation to further refine the search) and are time-consuming, since user intervention is required. Accordingly, these prior art document retrieval techniques and other known techniques are not capable of filtering document hit lists such that documents having limited utility, even though they may match many of the search terms fairly accurately, can be removed or downgraded in ranking in order to present the most useful documents to the user. Hence, an object of this invention is a system and method for identifying useless documents in a document hit list, such that these documents can be filtered and eliminated from the document hit list.
SUMMARY
The present invention is essentially a system and method for identifying useless or insignificant documents in a document hit list assembled from documents stored in one or more document collection database memories. A search engine is used to compose the document hit list based on a query presented by a user. A text extraction algorithm run by a processor is then used to process the documents identified by the document hit list to produce a table of terms and their corresponding collection-level importance ranking, called the IQ or Information Quotient. The text extraction algorithm also produces a table of the most important terms per document. The documents are scanned independently as well, producing a table of documents with filenames and lengths.
A summarizing text algorithm is also run by a processor against the documents of the document hit list to produce a table of terms having a high tf*idf (term frequency times inverse document frequency) value for each document. All of the tables are stored in a relational database, which allows the system of the present invention to generate a table of terms per document ranked by decreasing IQ. To determine whether a document is useful or useless, the table of terms and IQs, the table of most important terms per document, the table of documents with filenames and lengths, and the table of high tf*idf values are examined. A document is found to be useless if either of the following two conditions is true: (i) the document has a document length of less than 2,000 bytes, or (ii) the document has fewer than five terms with an IQ greater than 60, the document has fewer than six appearances of terms having a tf*idf value of greater than 2.2, and the document has a document length of less than 40,000 bytes. The document length parameter may vary depending on the document format.
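The two-condition test described above can be sketched in Python as follows. This is a minimal illustration only, not the patented implementation: the thresholds are those quoted in the summary, while the `DocumentStats` record, its field names, the helper `filter_hit_list`, and the choice to represent "appearances" as one tf*idf value per term appearance are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Thresholds quoted in the summary; the constant names are illustrative.
MIN_LENGTH_BYTES = 2_000        # condition (i): shorter documents are useless
MAX_LENGTH_BYTES = 40_000       # condition (ii) applies only below this length
IQ_THRESHOLD = 60
MIN_HIGH_IQ_TERMS = 5
TFIDF_THRESHOLD = 2.2
MIN_HIGH_TFIDF_APPEARANCES = 6


@dataclass
class DocumentStats:
    """Hypothetical per-document record joined from the stored tables."""
    filename: str
    length_bytes: int
    # term -> collection-level IQ (assumed layout)
    term_iq: Dict[str, float] = field(default_factory=dict)
    # one tf*idf value per term appearance (assumed layout)
    tfidf_appearances: List[float] = field(default_factory=list)


def is_useless(doc: DocumentStats) -> bool:
    """Apply conditions (i) and (ii) from the summary to one document."""
    # Condition (i): very short documents are useless outright.
    if doc.length_bytes < MIN_LENGTH_BYTES:
        return True
    # Condition (ii): all three sub-conditions must hold together.
    high_iq_terms = sum(1 for iq in doc.term_iq.values() if iq > IQ_THRESHOLD)
    high_tfidf = sum(1 for v in doc.tfidf_appearances if v > TFIDF_THRESHOLD)
    return (
        high_iq_terms < MIN_HIGH_IQ_TERMS
        and high_tfidf < MIN_HIGH_TFIDF_APPEARANCES
        and doc.length_bytes < MAX_LENGTH_BYTES
    )


def filter_hit_list(hit_list: List[DocumentStats]) -> List[DocumentStats]:
    """Return the hit list with useless documents removed, order preserved."""
    return [doc for doc in hit_list if not is_useless(doc)]
```

Because the summary notes that the document length parameter may vary with document format, the two length constants would in practice likely be chosen per format rather than fixed as they are in this sketch.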
Al Alam Shahid
Alam Hosain T.
Dilworth & Barrese LLP
Percello Louis J.