System and method for topic-based document analysis for...

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

Details

C704S245000

active

06751614

ABSTRACT:

BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to online information filtering in general and, more particularly, to the filtering of Web or Intranet search results. The method employs a rich supervised learning paradigm that accepts relevance feedback to cluster the information. More particularly, it employs an efficient, user-interaction-based method of text representation and cluster neighbourhood analysis, providing personalized information filtering for online search applications.
2. The Background of the Invention
The problem of information overload overwhelms almost every Web surfer and every user scouring an Intranet for information. Information seekers on the Internet go through one search engine or another to research a topic they are interested in. It is estimated that there are over 2500 search services. Some are directory-based: the user drills down through various levels of pre-classified information to arrive at one or two documents of interest. Others are keyword-driven search engines, where the user specifies the keywords that drive the search process and the search engine brings up numerous results, which the user has to browse to find out whether they are of any relevance. Most search services combine both.
Most of these search engines offer few or no personalized features. The user is treated as an anonymous visitor and is inundated with irrelevant information. For instance, if the user is searching for information about the insect, Cricket, obviously all the documents that are related to the game of Cricket will be nonsense to the user. But most search engines have little or no facility for the user to specify an interest and interact with the search so as to get the right type of information. If we could instead somehow recognise the user (as if we knew his or her interests) under a Topic profile, it would be far more effective at getting accurate information in front of the user.
Another big limitation with most search engines is that the amount of time and expertise spent in researching a subject area is never remembered. There is nothing like a “stop and resume” interface. The work involved in researching and judging documents as relevant and irrelevant has to be repeated over and over each time the search engine is used to look for information in that subject area.
The volume of publicly indexed information available to a Web user grows by the day. A typical search engine throws up hundreds of results for a user query, and a very good document could be at the bottom of the pile. Not all of these hundreds of results will be useful to the user. Instead, the user would like the information to be presented in a classified manner, either by relevancy or by the nature of the concepts the documents cover and the concepts the user likes.
DESCRIPTION OF RELATED ART
Information filtering algorithms are designed to sort through large volumes of dynamically generated information and present the user with those that are likely to satisfy his/her information requirement. With the growth of the Internet and other networked information, research in the development of information filtering algorithms has exploded in recent years. A number of ideas and algorithms have emerged.
Some of the earlier approaches have adopted what is known as the classical supervised learning paradigm. In this paradigm, when a new icon (document) arrives, the learning agent suggests a classification, the supervisor (user) provides a classification, and the difference is used to adjust the parameters of the learning algorithm. In such a paradigm, the agent's classification and the user's classification can be independent processes. The user can also give a classification even before seeing the agent's classification.
Learning itself can be either “supervised” or “unsupervised”. In a supervised learning network, the input and the desired output are both submitted to the network, and the network has to “learn” to produce answers as close as possible to the correct answer. In an unsupervised learning network, the answer for each input is omitted, and the network has to learn the correlations between input data and organise the inputs into categories from these correlations.
Supervised learning is a process that incorporates an external teacher. It employs Artificial Neural Networks, which are particularly good at dealing with ill-structured document handling and classification tasks that are usually characterised by a lack of pre-defined rules. The network is given a set of training patterns and the outputs are compared with desired values. The weights are modified in order to minimise the output error. Supervised algorithms rely on the principle of minimal disturbance, trying to reduce the output error with minimal disturbance to responses already learned.
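The weight adjustment described above can be sketched with a single linear unit trained by the delta (least mean squares) rule. This is only an illustration: the text does not name a specific update rule, and the function names, learning rate, epoch count and toy data below are all assumptions.

```python
def train_linear_unit(patterns, targets, learning_rate=0.1, epochs=200):
    """Fit one linear unit with the delta (LMS) rule (assumed for illustration)."""
    weights = [0.0] * len(patterns[0])
    bias = 0.0
    for _ in range(epochs):
        for x, desired in zip(patterns, targets):
            output = bias + sum(w * xi for w, xi in zip(weights, x))
            # Compare the network output with the desired value.
            error = desired - output
            # Apply a small correction proportional to the error, so that
            # responses already learned are disturbed as little as possible.
            weights = [w + learning_rate * error * xi
                       for w, xi in zip(weights, x)]
            bias += learning_rate * error
    return weights, bias


def predict(weights, bias, x):
    """Return the unit's raw output for input vector x."""
    return bias + sum(w * xi for w, xi in zip(weights, x))


# Hypothetical toy data: two keyword weights per document,
# target 1.0 = relevant, 0.0 = irrelevant.
patterns = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
targets = [1.0, 1.0, 0.0, 0.0]
weights, bias = train_linear_unit(patterns, targets)
```

After training, `predict(weights, bias, [1.0, 0.0])` should lie near 1 and `predict(weights, bias, [0.0, 1.0])` near 0. A small learning rate is what keeps each correction from disturbing earlier patterns too strongly.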
The application of supervised learning paradigms can improve the performance of a search system. While an unsupervised approach may be easier to implement, since it does not require external intervention, a supervised approach could provide much better results in situations where a thesaurus or a knowledge base already exists or when a human expert can interact with the system. The objective is to employ neural techniques to add the “intelligence” needed to fulfil the user requirements better. Systems employing these models exhibit some of the features of the biological prototypes, such as the capability to learn by example and to generalise beyond the training data.
Both supervised and unsupervised approaches rely upon a technique of document representation. This is a numerical representation of the document, which is used to produce an ordered document map.
One of the standard practices of document representation in information retrieval (IR) systems is the Vector Space information paradigm. This approach encodes the document set to generate the vectors necessary to train the document map. Each document is represented as a vector (V) of weights of the keywords identified in the document. The word weight is calculated using the Term Frequency*Inverse Document Frequency (TFIDF) scheme, which calculates the “interestingness” value of the word. Such formulae are used to calculate word weights, which are then used to train the networks to create the information map.
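As a rough illustration of this weighting scheme, the following Python sketch builds TFIDF vectors for a toy corpus. The whitespace tokenisation, the particular formula (term count divided by document length, multiplied by log(N/df)), the function name and the sample documents are assumptions; the text does not say which TFIDF variant is intended.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return (vocabulary, vectors); each vector holds TF*IDF keyword weights.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    tokenised = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenised for word in tokens})
    n_docs = len(tokenised)

    # Document frequency: number of documents that contain each word.
    df = Counter(word for tokens in tokenised for word in set(tokens))

    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        vector = []
        for word in vocabulary:
            tf = counts[word] / len(tokens) if tokens else 0.0
            idf = math.log(n_docs / df[word])
            vector.append(tf * idf)
        vectors.append(vector)
    return vocabulary, vectors


# Hypothetical three-document corpus.
docs = [
    "cricket match score",
    "cricket insect habitat",
    "insect habitat survey",
]
vocab, vecs = tfidf_vectors(docs)
```

In this corpus, "match" occurs in only one document while "cricket" occurs in two, so "match" receives the higher weight in the first vector. A word that appears in every document would receive a weight of zero under this variant.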
Document representation techniques are used in the classification of textual documents by grouping (or clustering) similar concepts/terms as a category or topic, a process calling for cluster analysis. Two approaches to cluster analysis exist: the serial, statistical approach and the parallel, neural network approach.
In the serial approach, classes of similar documents are basically found by doing pairwise comparisons among all of the key elements. This clustering technique is serial in nature in that pairwise comparisons are made one at a time and the classification structure is created in a serial order. The parallel neural network approach is based on establishing multiple connections among the documents and clusters of documents allowing for independent, parallel comparisons.
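The serial approach can be sketched as a loop that compares one pair of document vectors at a time and merges the pair into a common class when they are similar enough. Cosine similarity, the 0.8 threshold, the single-link merging and the sample vectors are assumptions chosen for illustration, not details given in the text.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def serial_clusters(vectors, threshold=0.8):
    """Group vectors by comparing every pair, one pair at a time."""
    parent = list(range(len(vectors)))

    def find(i):
        # Follow parent links to the representative of i's class.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Pairwise comparisons are made serially, in a fixed order.
    for i, j in combinations(range(len(vectors)), 2):
        if cosine(vectors[i], vectors[j]) >= threshold:
            parent[find(j)] = find(i)

    groups = {}
    for index in range(len(vectors)):
        groups.setdefault(find(index), []).append(index)
    return sorted(groups.values())


# Hypothetical keyword-weight vectors for four documents.
doc_vectors = [
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.0],
    [0.0, 0.1, 1.0],
    [0.1, 0.0, 0.9],
]
clusters = serial_clusters(doc_vectors)
```

With these vectors, documents 0 and 1 fall into one class and documents 2 and 3 into another. The number of comparisons grows quadratically with the number of documents, which is the cost of doing them one at a time.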
A significant number of text-based classification algorithms for documents are based on supervised learning techniques such as Bayesian probability, decision trees or rule induction, linear discriminant analysis, logistic regression, and backpropagation-like neural networks.
Despite the many complex techniques researched to solve the problem of information filtering, searching (especially on the Internet) remains an unresolved problem.
Accordingly, it is the object of our invention to attempt to provide a more effective and efficient neural-network-based supervised learning process that learns incrementally as documents arrive and the user grades them by providing feedback to the learning agent. The technique described in this invention can be called Supervised Clustering Analysis.
SUMMARY OF THE INVENTION
There is a need for personalization of the information searching either
