Reexamination Certificate
1998-06-18
2001-08-21
Teska, Kevin J. (Department: 2123)
Data processing: structural design, modeling, simulation, and em
Emulation
C703S022000, C703S026000, C709S224000, C714S028000
active
06278966
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to a system to simulate the behavior of visitors navigating an internet web site. More particularly, the invention concerns a generative model to simulate hypothetical traffic over a web site, and to use this traffic in emulation of actual traffic observed at the web site.
2. Description of the Related Art
In internet web site (site) applications, database logs record the movement of traffic caused by visitors traversing a site. In medium to large sites, the amount of data that accumulates on a daily to weekly basis is immense. Commonly, this data contains a great deal of information about the behaviors of visitors to the web site; however, analyzing it using conventional statistical tools is prohibitive due to the sheer volume of data.
Instead, data mining tools may be used to analyze the data and to automatically “discover” interesting patterns and relationships within the data. Such data mining tools include association rule discovery methods such as those disclosed in R. Srikant et al., “Mining Generalized Association Rules,” 1995, Proceedings of the 21st VLDB Conference, Zurich, Switzerland, and R. Agrawal et al., “Fast Discovery of Association Rules,” 1996, Advances in Knowledge Discovery and Data Mining, U. M. Fayyad et al., eds., AAAI Press/The MIT Press, Menlo Park, Calif., USA. These types of association rules can be used to identify patterns in a transaction database, where a transaction is a visitation session that occurs when a user peruses a web site. A web site server records the actions of visitors to the site in a “web log” database. This database is “sessionized” by identifying sequences of actions that correspond to distinct visits. Applied to such a sessionized web log, association rules can be used to discover the presence of content usage patterns (traffic flow) over a web site. Such rules may deliver statements of the form “75% of visits of referrer A belong to segment B,” or “45% of visitors to page A also visit page B.”
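The sessionizing step and a rule of the form “45% of visitors to page A also visit page B” can be illustrated with a minimal Python sketch. It is not taken from the cited references or from the invention itself: the record layout (visitor, timestamp, page), the 30-minute idle timeout, and the function names sessionize and rule_confidence are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical idle timeout; the source does not specify how visits are delimited.
SESSION_GAP_SECONDS = 30 * 60


def sessionize(log, gap=SESSION_GAP_SECONDS):
    """Group (visitor_id, timestamp, page) records into visits.

    A new session starts whenever the same visitor is idle for more
    than `gap` seconds.
    """
    by_visitor = defaultdict(list)
    for visitor, ts, page in log:
        by_visitor[visitor].append((ts, page))

    sessions = []
    for visitor in sorted(by_visitor):
        hits = sorted(by_visitor[visitor])
        current = [hits[0][1]]
        for (prev_ts, _), (ts, page) in zip(hits, hits[1:]):
            if ts - prev_ts > gap:
                sessions.append(current)
                current = []
            current.append(page)
        sessions.append(current)
    return sessions


def rule_confidence(sessions, antecedent, consequent):
    """Fraction of sessions containing `antecedent` that also contain `consequent`."""
    with_a = [s for s in sessions if antecedent in s]
    if not with_a:
        return 0.0
    return sum(consequent in s for s in with_a) / len(with_a)


# Made-up web log records for illustration only.
log = [
    ("v1", 0, "A"), ("v1", 60, "B"),
    ("v1", 10_000, "A"),  # long idle gap: a second visit by v1
    ("v2", 5, "A"), ("v2", 20, "C"),
    ("v3", 7, "B"),
]
sessions = sessionize(log)
print(sessions)
print(rule_confidence(sessions, "A", "B"))
```

In this sketch the figure reported by rule_confidence is the share of sessions containing page A that also contain page B, which is the quantity commonly called the confidence of the rule A implies B.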
One problem that arises in the internet web site domain, due to the sheer volume of data that can be generated by a site with heavy user traffic, is that saving all this data for future reference can be prohibitively expensive. One way to reduce the size of the data is to compress it into a set of summary statistics. However, this requires considerable foresight in choosing the set of statistics and does not allow one to pose questions that become apparent only at a later date.
Although the internet is relatively new and few inventions exist for application to the internet in general, much less to web sites in particular, computer science, discrete mathematics, and graph theory provide significant guidance in modeling static graphs. Given a static and completely described web site, such models can be applied to estimate the traffic flow over the site without need to resort to a generative model or probabilistic simulation. However, characteristics of present day web sites preclude the application of such classical graph theoretic tools.
Present day web sites tend to be dynamic, not static, and cannot be completely described in advance. Web pages can be constructed dynamically, or links between pages can be created dynamically, thereby yielding a dynamic cyclic graph structure. Even web sites that are relatively static in their design, such as sites that are stable over a span of a few weeks and do not rely upon dynamic page creation or dynamic link creation, are extremely difficult or tedious to model using conventional graph modeling tools due to the sheer size of the connected graph and the special nature of visitor behavior.
To overcome these difficulties, there is a pressing need for an invention that automates the step of “describing” a graph to a web site modeling tool, and that automatically takes into account the special nature of web site users themselves, such that the model accounts not only for the topology of the web site but also for regularities evident in user traffic. The invention should be capable of generating the distribution of visitor behavior that would result if visitors demonstrated no preferences and were influenced mostly by the site topology. This emulated distribution could then be used as a reference distribution against which the distribution generated by actual users could be compared.
Preferably, the user characteristics processed by such an invention should also be reducible into a small number of descriptive statistics that, along with web site topology, could be used to emulate user behavior and approximate summary statistics not anticipated at the time the original data was collected. This would allow the statistics to be applied to determine “future” visitor behavior, such as how past users would behave today when navigating a site topology previously unavailable.
SUMMARY OF THE INVENTION
Broadly, the present invention concerns a method and apparatus for generating hypothetical web site traffic that simulates the behavior of actual web site users. Data Mining Association Rules may be applied to this simulated traffic and used to identify usage patterns for users of a web site, such as discussed in the U.S. patent application entitled “ASSOCIATION RULE RANKER FOR WEB SITE EMULATION” by Steven Howard et al., assigned to the assignee of the current invention, incorporated by reference herein and being filed concurrently herewith.
Further, the present invention includes a method to discount topology-affected rules. For example, one may use the Web Walk Emulator of the present invention to generate the distribution of visitor behavior that would result if visitors demonstrated no personal preferences and were influenced mostly by the site topology alone. This “emulated” distribution can then be used as a reference distribution against which to compare the distribution generated by actual users who display personal preferences.
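One way to picture a topology-only reference distribution is a random walk over the site's link graph in which every outgoing link is equally likely. The Python sketch below is a simplified illustration under stated assumptions, not the Web Walk Emulator as claimed: the four-page SITE graph, the exit_prob parameter, the max_steps cap, and the function names are hypothetical.

```python
import random
from collections import Counter

# Hypothetical site topology: each page maps to the pages it links to.
SITE = {
    "home": ["products", "about"],
    "products": ["home", "cart"],
    "about": ["home"],
    "cart": [],
}


def emulate_visit(site, start, exit_prob, rng, max_steps=50):
    """Walk the link graph, choosing uniformly among outgoing links.

    At every page the visitor leaves with probability `exit_prob`
    (an assumed parameter), or when the page has no outgoing links.
    """
    path = [start]
    page = start
    for _ in range(max_steps):
        links = site[page]
        if not links or rng.random() < exit_prob:
            break
        page = rng.choice(links)
        path.append(page)
    return path


def reference_distribution(site, start, n_visits, exit_prob=0.3, seed=0):
    """Estimate the share of emulated visits that touch each page."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_visits):
        # Count each page at most once per visit.
        counts.update(set(emulate_visit(site, start, exit_prob, rng)))
    return {page: counts[page] / n_visits for page in site}


ref = reference_distribution(SITE, "home", n_visits=10_000)
print(ref)
```

Because the emulated visitor has no preferences, any page that scores highly here does so because of where it sits in the link graph. A rule observed in actual traffic that merely reproduces these shares could therefore be discounted as topology-affected.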
The present invention allows user characteristics to be compressed into a small number of descriptive statistics, which, along with the site topology, can be used to emulate visitor behavior at a later time. An example of this use is approximating novel summary statistics that were not anticipated at the time the original data was being collected.
In one embodiment, the invention may be implemented to provide a method to generate behavior for hypothetical visitors (visitors) traversing a site. This generated data emulates the behavior of actual users. The hypothetical visitors may display behavior that is indistinguishable from that of actual users or of a subset of the actual users, or the behavior may be purely hypothetical, such as when a user acts without evidence of having made an intentional choice. The present invention tracks the actions of the visitors and develops reference distributions that may be compared to a site's usage distributions as obtained from actual visitors to the site. The reference distributions are then used in one embodiment of the invention to implement statistical estimation methods that measure relative information content, for example, the Kullback-Leibler Information Criterion or Bayesian criteria.
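As an illustration of measuring relative information content, the Kullback-Leibler divergence between an actual usage distribution and a reference distribution can be computed as in the Python sketch below. The source does not say how the comparison is carried out, so this is only one possible reading: the two example distributions are made up, and the eps smoothing floor is an assumption added to avoid division by zero.

```python
import math


def kl_divergence(actual, reference, eps=1e-12):
    """D(actual || reference) in bits over a shared set of outcomes.

    `eps` is an assumed smoothing floor so that outcomes absent from
    the reference do not cause a division by zero.
    """
    total = 0.0
    for outcome, p in actual.items():
        if p > 0.0:
            q = max(reference.get(outcome, 0.0), eps)
            total += p * math.log2(p / q)
    return total


# Hypothetical distributions over the page on which a visit ends.
actual = {"home": 0.2, "products": 0.3, "cart": 0.5}
reference = {"home": 0.5, "products": 0.3, "cart": 0.2}

print(kl_divergence(actual, reference))
print(kl_divergence(actual, actual))
```

A divergence of zero means the actual traffic carries no information beyond the reference. Larger values indicate behavior that the topology-only reference does not explain.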
In one version of the method, the invention comprises a general implementation; in another, a deterministic implementation. The general version may be applied to live production web sites. The deterministic version is suited to offline processing and does not burden the active web site with additional traffic. In another embodiment, this version also exploits certain types of data in order to reduce the cost of its implementation.
In another embodiment, the invention may be implemented to provide an apparatus for generating web site traffic that substantially emulates actual web site traffic. The apparatus may include storage, a processor, and an emulation system comprising various hardware components and circuitry.
In still another embodiment, the invention may be implemented to provide a signal-bearing medium tangibly embodying a program of machine-re
Howard Steven Kenneth
Martin David Charles
Plutowski Mark Earl Paul
Gray Cary Ware & Friedenrich
International Business Machines Corporation
Sergent Douglas W.
Teska Kevin J.