Information retrieval system and method that generates weighted

Data processing: database and file management or data structures – Database design – Data structure types

Patent

Details

707/2, 707/3, 707/4, G06F 17/30

Patent

active

061673986

DESCRIPTION:

BRIEF SUMMARY
BACKGROUND OF THE INVENTION

1. Field of the Invention
The present invention relates to information retrieval, and particularly, but not exclusively, to an Internet information agent which analyses candidate documents for dissimilarity with a reference corpus identified by a user of the agent.
2. Related Art
In the art of information retrieval it is known for a user to specify the initial conditions for retrieval by means of a set of keywords. Various search engines are known which have search languages adapted for advanced searching using Boolean operators for combining keywords.


SUMMARY OF THE INVENTION

According to a first aspect of the present invention there is provided a method of information retrieval comprising the steps of: [...] with a first predetermined function and producing a first output, [...] predetermined function and producing a second output, the second output being referred to as a dissimilarity measure and being indicative of the degree of dissimilarity between the analysed part of the reference corpus and the analysed part of the retrieved text, and [...] of dissimilarity less than a predetermined degree of dissimilarity.
It will be appreciated that the larger the value of the dissimilarity measure, the greater the degree of dissimilarity between the analysed part of the reference corpus and the analysed part of the retrieved text, and, conversely, the smaller the value, the lesser the degree of dissimilarity. In particular, the dissimilarity measure will have a zero value if the two documents are identical.
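The selection rule described above (keep a retrieved text only when its dissimilarity measure falls below a predetermined value) can be sketched as follows. This is a minimal illustration, not the patented method: `toy_dissimilarity` and `select_candidates` are hypothetical names, and the word-count difference used as the measure is an assumed stand-in chosen only because it is zero for identical texts and grows as texts diverge.

```python
from collections import Counter
from typing import Callable, Iterable, List


def toy_dissimilarity(reference: str, candidate: str) -> float:
    """Hypothetical stand-in measure: sum of absolute word-count differences.

    It is 0.0 for identical texts and increases as the texts diverge.
    """
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    terms = set(ref_counts) | set(cand_counts)
    return float(sum(abs(ref_counts[t] - cand_counts[t]) for t in terms))


def select_candidates(
    reference: str,
    candidates: Iterable[str],
    threshold: float,
    measure: Callable[[str, str], float] = toy_dissimilarity,
) -> List[str]:
    """Keep only candidates whose dissimilarity is below the threshold."""
    return [c for c in candidates if measure(reference, c) < threshold]


if __name__ == "__main__":
    reference = "speech recognition model"
    candidates = ["speech recognition model", "cooking pasta recipes"]
    print(select_candidates(reference, candidates, threshold=2.0))
```

Any other measure with the same properties could be passed in through the `measure` parameter; the threshold value is application-specific and assumed here.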
The present invention differs from the above prior art retrieval techniques in that the user provides a reference corpus (a start document) as an example of the type of document that the user would like to find, and the method of the invention, as performed by an information retrieval agent, analyses the reference corpus in accordance with one or more of a range of metrics, these relating to word (term) frequency of the title of the candidate document, character-level n-gram frequency, word frequency of the whole text of the candidate document, and word-level n-gram language model. The more of these metrics that are combined, the better the agent performs.
A method of the present invention can be used for information retrieval on demand by a user, or may be used to improve a language model used in a speech application, for example a speech recognition application.
Preferably, the analysed part of said retrieved text is the title of the candidate document.
Preferably, the first predetermined function comprises the steps of: [...] reference corpus, [...] the first TFL, and [...] respective elements, each of which elements is the term frequency (TF) of a respective term of the first TFL multiplied by its corresponding IDF (giving TFIDF), said first corresponding vector constituting said first output; and wherein the second predetermined function comprises the steps of: [...] respective elements, each of which elements is the TF of a respective term of the second TFL, and [...] second vector, said difference measure constituting said dissimilarity measure.
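As a rough sketch of the vector comparison described above: the first vector holds TF multiplied by IDF for each term of the reference, the second holds plain TF for each term of the candidate (for example, its title), and a difference measure between the two serves as the dissimilarity measure. The text here does not say how the IDF values are obtained or which difference measure is used, so this sketch assumes a caller-supplied `idf` table and uses one minus cosine similarity. All function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Mapping


def term_frequency_list(text: str) -> Counter:
    """Count whitespace-separated, lower-cased terms (assumed tokenisation)."""
    return Counter(text.lower().split())


def tfidf_vector(
    tfl: Mapping[str, int], idf: Mapping[str, float], default_idf: float = 1.0
) -> Dict[str, float]:
    """First vector: each element is TF multiplied by the term's IDF."""
    return {term: tf * idf.get(term, default_idf) for term, tf in tfl.items()}


def cosine_difference(first: Mapping[str, float], second: Mapping[str, float]) -> float:
    """Assumed difference measure: 1 - cosine similarity, clamped at 0."""
    dot = sum(w * second.get(term, 0.0) for term, w in first.items())
    norm_first = math.sqrt(sum(w * w for w in first.values()))
    norm_second = math.sqrt(sum(w * w for w in second.values()))
    if norm_first == 0.0 or norm_second == 0.0:
        return 1.0  # nothing to compare, treat as maximally dissimilar
    return max(0.0, 1.0 - dot / (norm_first * norm_second))


if __name__ == "__main__":
    idf = {"speech": 2.0, "recognition": 2.0, "the": 0.1}  # assumed values
    first = tfidf_vector(term_frequency_list("the speech recognition system"), idf)
    second = term_frequency_list("speech recognition for the web")
    print(cosine_difference(first, second))
```

Because the first vector is IDF-weighted and the second is not, this particular measure reaches zero for identical texts only when the IDF weights are uniform; a different difference measure may behave differently.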
Alternatively, or additionally, the first predetermined function comprises generating a first character-level n-gram frequency list having n-grams from bigrams up to m-grams, where m is a predetermined integer, said first character-level n-gram frequency list constituting said first output, or, as the case may be, a component of said first output; and the second predetermined function comprises generating a second character-level n-gram frequency list having n-grams from bigrams up to m-grams, and performing a rank-based correlation process between said first and said second character-level n-gram frequency lists and obtaining a correlation result, the correlation result constituting said dissimilarity measure, or, as the case may be, a respective component of said dissimilarity measure and, in this latter case, the difference measure of said vectors constitutes another respective component of said dissimilarity measure.
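The character-level comparison can be illustrated with a short sketch. The text does not specify which rank-based correlation is used, so this example substitutes a simple out-of-place rank distance, a common rank-based way of comparing n-gram profiles: each n-gram is ranked by frequency in both lists and the rank differences are summed. The function names and the default m = 3 are assumptions for illustration only.

```python
from collections import Counter
from typing import List, Sequence


def char_ngram_ranking(text: str, m: int = 3) -> List[str]:
    """Return character n-grams (bigrams up to m-grams), most frequent first."""
    counts: Counter = Counter()
    for n in range(2, m + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    # Sort by descending frequency; ties are broken alphabetically so the
    # ranking is deterministic.
    return [gram for gram, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]


def out_of_place_distance(first: Sequence[str], second: Sequence[str]) -> int:
    """Sum of rank differences for n-grams of `first` looked up in `second`.

    N-grams missing from `second` incur a fixed maximum penalty. The result
    is 0 for identical rankings and is not symmetric in its arguments.
    """
    second_rank = {gram: rank for rank, gram in enumerate(second)}
    max_penalty = max(len(first), len(second))
    return sum(
        abs(rank - second_rank[gram]) if gram in second_rank else max_penalty
        for rank, gram in enumerate(first)
    )


if __name__ == "__main__":
    reference = char_ngram_ranking("information retrieval agent")
    candidate = char_ngram_ranking("information retrieval system")
    print(out_of_place_distance(reference, candidate))
```

A distance of this kind can be used directly as a dissimilarity measure, or combined with the vector difference measure as one component among several.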
Alterna

REFERENCES:
patent: 5625767 (1997-04-01), Bartell et al.
patent: 5724571 (1998-03-01), Woods
patent: 5873076 (1999-02-01), Barr et al.
patent: 5907839 (1999-05-01), Roth
patent: 5937422 (1999-08-01), Nelson et al.
W. Bruce Croft, Intelligent Internet Services Effective Text Retrieval Based on Combining Evidence from the Corpus and Users, vol. 10 issue 6 IEEE electronic library online, pp.59-63, Dec. 1995.
Besancon et al., Textual Similarities Based on a Distributional Approach, IEEE electronic library online, p. 180-184, Sep. 1999.
Chapter 4 of the book "Introduction to Modern Information Retrieval" by G. Salton, published by McGraw Hill, 1983.
Dunning, "Accurate Methods for the Statistics of Surprise and Coincidence", Computational Linguistics, vol. 19, No. 1, 1993.
Katz, "Estimation of Probabilities from Sparse Data", IEEE Transactions on Acoustics, Speech and Signal Processing, vol. ASSP-35, 1987.
Jelinek, "Self-Organised Language Modelling for Speech Recognition", Readings in Speech Recognition, edited by A. Waibel and K. Lee, published by Morgan Kaufmann, 1990.
Pearce et al, Generating a Dynamic Hypertext Environment with n-gram Analysis, Proceedings of the International Conference on Information and Knowledge Management CIKM, Nov. 1, 1993, pp. 148-153, XP000577412.
Wong et al, "Implementations of Partial Document Ranking Using Inverted Files", Information Processing & Management (Incorporating Information Technology), vol. 29, No. 5, Sep. 1993, pp. 647-669, XP002035616.
