Method for learning character patterns to interactively...

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

Rate now

  [ 0.00 ] – not rated yet Voters 0   Comments 0

Details

C707S793000

Reexamination Certificate

active

06411952

ABSTRACT:

FIELD OF THE INVENTION
This invention relates generally to Web crawlers, and more particularly to learning character patterns in queries to control the scope of Web crawler searches for Web pages.
BACKGROUND OF THE INVENTION
In the context of the Internet and the World Wide Web (Web)—an application layer of the Internet, Web crawlers are software programs that search for resources, such as Web pages, by following hyperlinks that connect the pages. This activity is sometimes known as “walking” the Web. Search services, such as Digital Equipment Corporation's AltaVista service employ Web crawlers to build indexes of words contained in the pages located by the search. Typically, the scope of the search is specified in a query.
In general, prior art Web crawlers search and walk the Web without much discrimination on where the pages are located. The location or address of each Web page is specified by a string of characters called a Universal Resource Locator (URL), e.g.,
“http://www.digital.com/homepage/topic/topic.html”.
Hyperlinks are URLs embedded in Web pages to identify connected or “linked” pages. Potentially, the scope of most prior art Web crawlers is the entire Web. During a search, the links are followed to locate as many connected pages as possible.
On the other hand, an end-user crawler performs custom searches on behalf of an end-user. For example, an end-user crawler may locate and copy a subset of pages to a local disk. End-user crawlers have a restricted scope and typically walk only a small portion of the Web. For example, an end-user crawler may confine its walk to a single domain or “host” computer. That is, the scope of the search is restricted to a single or small number of sites.
Customizing an end-user crawler involves, in part, specifying the portion of the Web to be walked, that is, the scope of the search. This specification can be called a walking rule. It is possible to program very effective walking rules based on the URL character strings alone. Unfortunately, even this amount of programming is too complicated for many end-users.
In another approach, the search scope is manually defined. However, hand coded rules are tedious to write, error-prone and often sub-optimal. Learning walking rules from examples is another possibility. Users typically are incapable or unwilling to provide enough examples to infer a perfect walking rule, i.e., a rule that exactly matches the desired crawl scope.
Several end-user crawlers provide broad control on crawl scope. The MacroBot crawler (http://www.ipgroup.com/macrobot/) does not have restrictions on individual pages. However, it supports limits on depth or number of pages retrieved.
The Mapuccino/WebCutter crawler (http://www.ibm.com/java/mapuccino/) supports page restriction based on keyword matching with the textual content of the document. This approach can be time consuming, because pages have to be fetched before it can be decided whether or not the content of pages are useful to the end-user. Also, content-based walking rules do not exploit syntactic patterns in the character strings that name URLs within a given site.
The HotCargo Express crawler (http://www.documagix.com/products/hotcargo_express/) uses regular expressions to identify the URLs of pages that should be downloaded. However, the regular expressions need to be hand-coded.
There are prior art methods for inferring grammars and regular expressions in character strings from examples, such as in natural language processing. These inference methods are designed to work on a large training set without the need for much generalization.
These methods do not usually make use of negative examples, i.e., character strings that specify portions of the Web not to be searched. The grammars learned there are usually too complex to be understood or edited by end-users. Hence, given small set of examples that users are likely provide, the prior art natural language processing might not work well.
Some methods attempt to infer a regular grammar from positive examples in the context of specific types of Web content, e.g., phone numbers or e-mail addresses in Web pages. These methods assume a large training set and learn patterns that are hard for humans to verify or edit, see Goan et al. “A Grammar Inference Algorithm for the World Wide Web”, AAAI Spring Symposium, 1996, (http://www.parc.xerox.com/istl/projects/mlia/papers/goan.ps). Please see “http://www-cse.ucsd.80/users/rik/MLIA.html for a listing of work in applying machine learning to information retrieval.
Another approach to rule learning uses a decision tree. There, a tree of decision nodes is constructed from training data based on various attributes. For each node in the tree, an attribute is tested. Based on the result of the test, either a “yes” or “no” result is returned, or the decision process moves to a descendant node in the tree, where another attribute is examined similarly. In constructing such a tree, when a new node is added, the method chooses the attribute which is likely to yield the largest information gain at that stage.
An example of such a method is described by Quinlan in “Induction of decision trees,” Machine Learning, Vol. 1, pp. 81-106, 1986. For an overview of Decision Tree learning see Russell et al in Section 18.3 of “Learning Decision Trees,” Artificial Intelligence: A Modern Approach Prentice Hall Series in Artificial Intelligence, 1995.
Decision tree learning is optimized to learn trees that are either concise in their description using a “minimum description length” optimization, or are fast to evaluate. Neither of these optimizations are important criteria in the context of Web crawlers. The time required to fetch and process Web pages significantly exceeds the time required to match URLs. The decision trees learned by these methods tend to be highly nested and hard for end-users to comprehend or modify.
None of the known learning and grammar inference methods can be “tuned” to match the characteristic textual grammar of URLs. Hence, none of them incorporate biases specified in URLs.
Hence, an automated approach to generating walking rules is desirable. One goal of such an automated approach would allow end-users who are non-programmers to interactively train a crawler to walk a specific portion of the Web by giving a small set of simple examples. From the training set, a walking rule would be inferred that is comprehensible to the end-user. Then, the user could modify the input or the rule directly, to develop a walking rule that better matches the desired crawl scope. Thus, the second goal is to specifically generate rules that are comprehensible to end-users, and to support iterative refinement. The third goal is to tune the rule inference to the specific domain of URLs and exploit knowledge about URL naming. Hence, in building a walking rule based on character patterns, the system could preferentially select patterns that match actual structures of character strings that specify a target Web site, such as directories, paths, filenames and file-extensions typically found in URLs.
SUMMARY OF THE INVENTION
The invention as a first goal provides a method for interactively training a Web crawler to search the Web according to a desired scope. During training, input comprises of examples of URLs that should be walked (positive examples) and not walked (negative examples) by the crawler. This input is used to infer a walking rule.
Users typically are incapable or unwilling to provide enough examples to infer a perfect walking rule, i.e., a rule that exactly matches the desired crawl scope. Hence, it is preferable that the inferred rule is comprehensible to the end-user. This allows the user to modify the input, or the rule directly, and to develop a rule that better matches the scope of the desired crawl. Thus, a second goal is to specifically generate rules that are comprehensible to end-users, and to support iterative refinement.
The third goal is to allow the user to tune the rule inferred to a specific domain of URLs, and to exploit knowledge about URL naming. Hence,

LandOfFree

Say what you really think

Search LandOfFree.com for the USA inventors and patents. Rate them and share your experience with other people.

Rating

Method for learning character patterns to interactively... does not yet have a rating. At this time, there are no reviews or comments for this patent.

If you have personal experience with Method for learning character patterns to interactively..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Method for learning character patterns to interactively... will most certainly appreciate the feedback.

Rate now

     

Profile ID: LFUS-PAI-O-2945232

  Search
All data on this website is collected from public sources. Our data reflects the most accurate information available at the time of publication.