System and method for detecting a web page template

Data processing: presentation processing of document – operator i – Presentation processing of document – Hypermedia

Reexamination Certificate

Details

active

07987417

ABSTRACT:
An improved system and method are provided for detecting a web page template. A web page template detector may be provided for performing page-level template detection on a web page. In general, a web page template classifier may be trained using automatically generated training data and then applied to web pages to identify web page templates. A web page template may be detected by classifying segments of a web page as template structures, by assigning classification scores to the segments so classified, and then by smoothing the classification scores assigned to the segments. Generalized isotonic regression may be applied to smooth the scores associated with the nodes of a hierarchy by minimizing an optimization function using dynamic programming.
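The smoothing step in the abstract can be illustrated with a small sketch. It is not the patented method: the abstract does not state the optimization function or the direction of the ordering constraint, so the sketch assumes an L1 (absolute error) cost and assumes that a child segment should score at least as high as its parent. The function name `smooth_scores`, the segment names, and the raw scores are hypothetical. Under those assumptions, an optimal solution can be found by dynamic programming over the tree, restricting each node to the distinct values that appear among the raw scores.

```python
from typing import Dict, List


def smooth_scores(children: Dict[str, List[str]],
                  raw: Dict[str, float],
                  root: str) -> Dict[str, float]:
    """Tree isotonic smoothing under an assumed L1 cost.

    Minimizes sum(|y[v] - raw[v]|) subject to y[child] >= y[parent]
    (an assumed constraint direction), with y restricted to the
    distinct raw values.
    """
    levels = sorted(set(raw.values()))  # candidate smoothed values
    m = len(levels)
    cost: Dict[str, List[float]] = {}   # cost[v][k]: best subtree cost if y[v] = levels[k]
    best: Dict[str, List[int]] = {}     # best[v][k]: index j >= k minimizing cost[v][j]

    # Pre-order listing; reversing it visits children before parents.
    order, stack = [], [root]
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(children.get(v, []))

    for v in reversed(order):
        c = [abs(raw[v] - levels[k]) for k in range(m)]
        for ch in children.get(v, []):
            for k in range(m):
                # The child may take any level at or above the parent's level.
                c[k] += cost[ch][best[ch][k]]
        cost[v] = c
        arg = [m - 1] * m
        for k in range(m - 2, -1, -1):
            arg[k] = k if c[k] <= c[arg[k + 1]] else arg[k + 1]
        best[v] = arg

    # Top-down pass: pick the best level for each node given its parent's level.
    smoothed: Dict[str, float] = {}
    stack2 = [(root, best[root][0])]
    while stack2:
        v, k = stack2.pop()
        smoothed[v] = levels[k]
        for ch in children.get(v, []):
            stack2.append((ch, best[ch][k]))
    return smoothed


# Hypothetical page: the classifier scored <body> high although two of
# its three child segments look like content.
children = {"body": ["nav", "main", "footer"]}
raw = {"body": 0.75, "nav": 0.25, "main": 0.25, "footer": 1.0}
smoothed = smooth_scores(children, raw, "body")
print(smoothed)
```

In this example the score for `body` is pulled down to 0.25 so that no child scores below its parent, while `footer` keeps its raw score of 1.0. A different cost function or constraint direction would change only the per-node cost line and the direction of the running minimum.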

REFERENCES:
patent: 6256629 (2001-07-01), Sproat et al.
patent: 2004/0006452 (2004-01-01), Gluhovsky
patent: 2007/0009167 (2007-01-01), Dance et al.
patent: 2007/0255707 (2007-11-01), Tresser et al.
Niculescu-Mizil, et al., “Predicting Good Probabilities With Supervised Learning”, appearing in Proceedings of the 22nd International Conference on Machine Learning, Bonn, Germany, 2005, p. 625-632.
Angelov, et al., “Weighted Isotonic Regression under the L1 Norm”, SODA 2006, Jan. 2006, p. 783-791.
Yi et al., “Eliminating Noisy Information in Web Pages for Data Mining”, SIGKDD '03, Aug. 2003, copyright ACM, p. 1-10.
Z. Bar-Yossef and S. Rajagopalan, “Template detection via data mining and its applications,” in Proc. 11th WWW, pp. 580-591, 2002.
D. Gibson, K. Punera, and A. Tomkins, “The volume and evolution of web page templates,” in Proc. 14th WWW (Special Interest Tracks and Posters), pp. 830-839, May 2005.
L. Yi, B. Liu, and X. Li, “Eliminating Noisy Information in web pages for data mining,” in Proc. 9th KDD, pp. 296-305, 2003.
L. Yi and B. Liu, “Web page cleaning for web mining through feature weighting,” in Proc. 18th IJCAI, pp. 43-50, 2003.
K. Vieira, A. Silva, N. Pinto, E. Moura, J. Cavalcanti, and J. Freire, “A fast and robust method for web page template detection and removal,” in Proc. 15th CIKM, pp. 256-267, 2006.
H.Y. Kao, J.M. Ho, and M.S. Chen, “WISDOM: Web intrapage informative structure mining based on document object model,” TKDE, 17(5):614-627, 2005.
S. Debnath, P. Mitra, N. Pal, and C.L. Giles, “Automatic Identification of Informative Sections of Web Pages,” TKDE, 17(9):1233-1246, 2005.
H.Y. Kao, M.S. Chen, S.H. Lin, and J.M. Ho, “Entropy-based link analysis for mining web informative structures,” in Proc. 11th CIKM, pp. 574-581, 2002.
R. Song, H. Liu, J.R. Wen, and W.Y. Ma, “Learning block importance models for web pages,” in Proc. 13th WWW, pp. 203-211, 2004.
B. Davison, “Recognizing nepotistic links on the web,” in AAAI-2000 Workshop on Artificial Intelligence for Web Search, pp. 23-28, 2000.
N. Kushmerick, “Learning to remove Internet advertisements,” in Proc. 3rd Agents, pp. 175-181, 1999.
