Method and apparatus for indexing documents for message...

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate


Details

C707S793000, C709S201000

active

06314421

ABSTRACT:

FIELD OF THE INVENTION
The present invention relates to document indexing, and more specifically, to approximate, content-based classification of documents.
BACKGROUND
There has been a great deal of work on automatic content-based document search techniques and document classifiers for various applications. Some examples of prior art document classification mechanisms are listed below.
In U.S. Pat. No. 5276741, entitled “Fuzzy string matcher,” an algorithm compares strings into which error has been introduced, using a measure of approximate similarity. However, the types of errors introduced do not include different orderings of the original message.
In U.S. Pat. No. 5375235, entitled “Method of indexing keywords for searching in a database recorded on an information recording medium,” the matching technique employs a similarity measure based on keyword frequency. Senders who know which keywords are likely to be matched will avoid those word choices when they want to present the information against the wishes of the receiver. Therefore, this method cannot be used against knowing senders who want to avoid matching.
In U.S. Pat. No. 5418951, entitled “Method of retrieving documents that concern the same topic,” the document characterization algorithm uses a word n-gram weighting method. The method has the same problem seen in Patent '235. If the second party rearranges the message, the message characterization mechanism fails.
In U.S. Pat. No. 5276869, entitled “System for selecting document recipients as determined by technical content of document and for electronically corroborating receipt of document,” profiles of documents are created for matching against profiles of documents of interest to a potential receiver. However, the method assumes a limited range of document types, specifically disclosures of inventions.
In U.S. Pat. No. 5701459, entitled “Method and apparatus for rapid full text index creation,” a full text index creation algorithm is used. The method assumes no capricious or evasive reordering or rewording of text to evade searches.
In U.S. Pat. No. 5469354, entitled “Document data processing method and apparatus for document retrieval,” the search method breaks the document into shorter character strings that are used to build an index. However, the method is not sensitive to common phrases, and is an optimization technique for phrase-level searching, rather than a searching technique.
In U.S. Pat. No. 5107419, entitled “Method of assigning retention and deletion criteria to electronic documents stored in an interactive information handling system,” varying criteria for deletion are suggested. However, the method requires user response and input, and is not automatic.
In U.S. Pat. No. 5613108, entitled “Electronic mail processing system and electronic mail processing method,” documents are classified and located within a file system by attempting to automatically determine the type. However, the method assumes that the senders are not trying to subvert the classification system.
Generally, prior art document classification approaches start from common assumptions about the motivations for content-based document search, index and retrieval. These motivations, found on the publishing side and on the consumption side, are listed below.
The indexing and search techniques reflect, for the most part, searcher subject-matter interests, and a desire to find a uniquely fitting subset of documents. Most prior art document classification systems are not designed to avoid unsolicited documents, or to determine whether or not a given document is truly unique.
Generally, the information provider, out of concern for managing costs, maintaining profitability, and/or maintaining a reputation for courtesy, strongly desires that the document reach only interested audiences. The information provider therefore uses the automatic indexing service to improve the chances of the document being automatically identified by such audiences. Generally, the information provider is not interested in providing many copies of the document with insignificant variations, automatically or otherwise. Such copies could be taken by searchers as frivolous reproduction of essentially the same information, a cost in consumer time, and would require resources, such as disk space, that the publisher has to pay for.
Generally, prior art document classification systems assume that the documents have relatively little time-value, in the sense that they are expected to be stored for purposes of retrieval for periods of years, and usually need not be indexed, promoted, and propagated immediately. While a timely response from the search and retrieval system is important for attracting and retaining users, there is no real-time response requirement, especially for generating document indexes. Most such systems need not index documents before some real-world event in order to be of real value to information providers and their client searchers.
Document source text with original information is assumed to be produced at human input rates. Usually prior art document search systems assume that there is a desire to make the documents available on networks and in computers using only the amount of redundancy needed for information integrity and user convenience.
There is an advantage to both publishers and searchers in using indexing schemes that are standard, consistent, and independent of time of search and particular physical repository. Indexing techniques that are opaque and variable according to time and place would defeat the purposes of interested parties. Indexing systems for document retrieval systems must be highly reliable, as a basic measure of their quality-of-service. In general, people would distrust a system that provided them with different, and incorrect, results at different times or from different sites, even occasionally.
Prior art on-line document retrieval systems still largely assume limits in both computer power and network bandwidth. The algorithms and technology still reflect these prevailing assumptions. In particular, since power and bandwidth resources were scarce, closed, and closely held, there was a low tolerance for conspicuously frivolous uses of them.
Since that time, however, dramatic improvements in computer power and network bandwidth have weakened a number of the above assumptions. The digital “information explosion” was made possible by the rapid growth of secondary storage, processing power, and public networking. But it has been followed by several kinds of “information pollution.” Computers can duplicate and propagate information much more cheaply and quickly than human beings. This has always been true, of course, but only recently have these capabilities become inexpensive enough to also give people opportunities to inconvenience others.
In particular, there is now information in, or appearing via, computers that concerned parties cannot easily avoid, however much they might wish to. Such information is unlikely to be efficiently indexed in any database. Some promoters wishing to reach interested audiences have found ways to actively present information to many people. They show little concern for the large numbers of uninterested parties they also reach in the process. An example of this is the embedding of popular-but-irrelevant keywords in invisible text on web pages, to increase hit ratios.
One prior art method of filtering “information pollution” is by comparing every suspect message word for word against a list of messages thought to be undesirable. Such an approach can, however, be easily frustrated by automating the production of minor changes to the text. Such changes might include changing the order of phrases, sentences and paragraphs, without changing their meaning. Such permutations can be made at a cost not significantly greater than that required for simply copying the text. Only a small amount of extra text preparation effort and simple software tools are required. Given motives to do so, text perm
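The weakness of word-for-word comparison described above can be illustrated with a minimal sketch. It is not taken from the patent: the function names, the sample message, and the choice of reversing sentence order as the permutation are all assumptions made for this illustration only.

```python
from typing import Iterable


def is_known_message(message: str, known_messages: Iterable[str]) -> bool:
    """Word-for-word filter (hypothetical): flag a message only if its
    sequence of words is identical to that of a listed message."""
    words = message.split()
    return any(words == known.split() for known in known_messages)


def split_sentences(message: str) -> list:
    """Naive sentence splitter on periods; adequate for this toy example only."""
    return [s.strip() for s in message.split(".") if s.strip()]


def reorder_sentences(message: str) -> str:
    """Reverse the sentence order: a cheap permutation that leaves every
    individual sentence untouched."""
    return ". ".join(reversed(split_sentences(message))) + "."


# Sample data invented for the illustration.
original = "Buy now. Limited offer. Visit our site."
known_messages = [original]
variant = reorder_sentences(original)

# The exact copy is caught by the filter; the reordered copy is not.
caught_original = is_known_message(original, known_messages)
caught_variant = is_known_message(variant, known_messages)

# The two messages nevertheless contain exactly the same sentences.
same_sentences = sorted(split_sentences(original)) == sorted(split_sentences(variant))
```

In this sketch the permutation costs little more than copying the text, yet it is enough to make the word-for-word comparison miss a message whose sentences are all unchanged.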
