Forum web page clustering based on repetitive regions

Data processing: database and file management or data structures – Database and file access – Preparing data for information retrieval

Reexamination Certificate

Details

C707S729000, C707S730000, C707S731000, C715S700000

active

08051083

ABSTRACT:
Described is a technology by which forum web pages are processed into clusters for classification purposes, including by determining repetitive regions between pages and associating pages that have similar repetitive regions into a common cluster. Patterns corresponding to the regions are determined, and a feature set based at least in part on those patterns (e.g., pattern frequency) is extracted from the page. The feature set of a page is compared against the feature set of another page to determine similarity therewith, e.g., via a feature space distance computation that is evaluated against a threshold distance.
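The comparison step in the abstract can be illustrated with a minimal sketch. It is not the patented method: the pattern labels, the use of relative pattern frequency as the only feature, the Euclidean distance, the greedy single-pass grouping, and the 0.5 threshold are all assumptions chosen for illustration, and the function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def pattern_features(patterns):
    """Map a page's repetitive-region patterns to relative frequencies.

    `patterns` is a list of pattern labels (hypothetical; the record does
    not say how patterns are encoded).
    """
    counts = Counter(patterns)
    total = sum(counts.values())
    return {p: c / total for p, c in counts.items()} if total else {}


def distance(a, b):
    """Euclidean distance between two sparse feature vectors (an assumption)."""
    keys = set(a) | set(b)
    return sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))


def cluster_pages(pages, threshold=0.5):
    """Greedy single-pass clustering.

    A page joins the first cluster whose representative (the first page
    placed in it) lies within `threshold`; otherwise it starts a new cluster.
    """
    clusters = []  # list of (representative_features, [page_ids])
    for page_id, patterns in pages.items():
        feats = pattern_features(patterns)
        for rep, members in clusters:
            if distance(rep, feats) <= threshold:
                members.append(page_id)
                break
        else:
            clusters.append((feats, [page_id]))
    return [members for _, members in clusters]


# Illustrative input: two thread pages dominated by repeated "post" regions
# and one board page dominated by repeated "thread_row" regions.
pages = {
    "thread_1": ["post", "post", "post", "nav"],
    "thread_2": ["post", "post", "nav"],
    "board_1": ["thread_row", "thread_row", "thread_row", "nav"],
}
clusters = cluster_pages(pages)
print(clusters)
```

With this made-up input, the two thread pages have nearly identical pattern frequencies and fall in one cluster, while the board page is far from both in feature space and forms its own.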

REFERENCES:
patent: 6119124 (2000-09-01), Broder et al.
patent: 7143365 (2006-11-01), Gallella
patent: 7185001 (2007-02-01), Burdick et al.
patent: 7225397 (2007-05-01), Fukuda et al.
patent: 7293007 (2007-11-01), Ma et al.
patent: 2002/0188602 (2002-12-01), Stubler et al.
patent: 2004/0199546 (2004-10-01), Calistri-Yeh et al.
patent: 2005/0065959 (2005-03-01), Smith et al.
patent: 2005/0120006 (2005-06-01), Nye
patent: 2005/0246296 (2005-11-01), Ma et al.
patent: 2005/0267915 (2005-12-01), Zhulong et al.
patent: 2005/0278324 (2005-12-01), Fan et al.
patent: 2006/0004717 (2006-01-01), Ramarathnam et al.
patent: 2006/0143158 (2006-06-01), Ruhl et al.
patent: 2007/0174269 (2007-07-01), Jing et al.
patent: 2007/0208701 (2007-09-01), Sun et al.
patent: 2007/0208703 (2007-09-01), Shi et al.
patent: 2008/0010291 (2008-01-01), Poola et al.
patent: 2008/0010292 (2008-01-01), Poola
patent: 2008/0046441 (2008-02-01), Wen et al.
patent: 2008/0114800 (2008-05-01), Gazen et al.
patent: 2007-080061 (2007-03-01), None
“Common layout extraction from Web pages”, webpage available at http://sciencelinks.jp/j-east/article/200121/000020012101A0644892.php.
He, et al., “ImageSeer: Clustering and Searching WWW Images Using Link and Page Layout Analysis”, Apr. 1, 2004. Technical Report MSR-TR-2004-38. 12 Pages.
Bekkerman, et al., “Web Page Clustering using Heuristic Search in the Web Graph”, Proceedings of the Twentieth International Joint Conference on Artificial Intelligence. Hyderabad, India, Jan. 6-12, 2007. pp. 2280-2286.
Lage, et al., “Automatic generation of agents for collecting hidden Web pages for data extraction”, Data & Knowledge Engineering 49 (2004). pp. 177-196.
Reis, et al., “Automatic Web News Extraction Using Tree Edit Distance”, WWW2004, May 17-22, 2004, New York, USA. pp. 502-511.
Guo, et al., “Board Forum Crawling: A Web Crawling Method for Web Forum”, Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2006 Main Conference Proceedings). pp. 745-748.
Crescenzi, et al., “Clustering Web pages based on their structure”, Data & Knowledge Engineering 54 (2005). pp. 279-299.
Brandman, et al., “Crawler-Friendly Web Servers”, ACM SIGMETRICS Performance Evaluation Review archive vol. 28, Issue 2 (Sep. 2000) pp. 1-16.
Baeza-Yates, et al., “Crawling a Country: Better Strategies than Breadth-First for Web Page Ordering”, May 10-14, 2005, Chiba, Japan. 9 Pages.
Raghavan, et al., “Crawling the Hidden Web”, Proceedings of the 27th VLDB Conference, Roma, Italy, 2001. 10 Pages.
Glance, et al., “Deriving Marketing Intelligence from Online Discussion”, KDD'05, Aug. 21-24, 2005, Chicago, Illinois, USA. pp. 419-428.
Manku, et al., “Detecting Near-Duplicates for Web Crawling”, WWW 2007, May 8-12, 2007, Banff, Alberta, Canada. pp. 141-149.
Yossef, et al., “Do Not Crawl in the DUST: Different URLs with Similar Text”, WWW 2007, May 8-12, 2007, Banff, Alberta, Canada. pp. 111-120.
Zhang, et al., “Expertise Networks in Online Communities: Structure and Algorithms”, WWW 2007, May 8-12, 2007, Banff, Alberta, Canada. pp. 221-230.
Henzinger, et al., “Finding Near-Duplicate Web Pages: A Large-Scale Evaluation of Algorithms”, SIGIR'06, Aug. 6-11, 2006, Seattle, Washington, USA. pp. 284-291.
Chakrabarti, et al., “Focused crawling: a new approach to topic-specific Web resource discovery”, Published by Elsevier Science B.V. 1999. pp. 1623-1640.
Rosenfeld, et al., “Information Architecture for the World Wide Web”, webpage available at http://www.oreilly.com/catalog/infotecture/.
“Category:Internet forum software” From Wikipedia the free encyclopedia, webpage available at http://en.wikipedia.org/wiki/Category:Internet_forum_software.
“Introduction to Algorithms, Second Edition” webpage available at http://mitpress.mit.edu/algorithms/.
Zheng, et al., “Joint Optimization of Wrapper Generation and Template Detection”, SIGKDD 2007, San Jose, California, USA. 28 Pages.
Song, et al., “Learning Important Models for Web Page Blocks based on Layout and Content Analysis”, SIGKDD Explorations. vol. 6, Issue 2, pp. 14-23.
Datar, et al., “Locality-Sensitive Hashing Scheme Based on p-Stable Distributions”, SCG'04, Jun. 9-11, 2004, Brooklyn, New York, USA. pp. 253-262.
Baeza-Yates, et al., “Modern Information Retrieval”, webpage available at http://people.ischool.berkeley.edu/~hearst/irbook/.
Berners-Lee, et al., “Uniform Resource Locators (URL)” RFC 1738. Dec. 1994. pp. 1-25.
“Sitemaps XML format” webpage available at http://www.sitemaps.org/protocol.php.
Zhai, et al., “Structured Data Extraction from the Web Based on Partial Tree Alignment”, IEEE Transactions on Knowledge and Data Engineering, vol. 18, No. 12, Dec. 2006. pp. 1614-1628.
Vidal, et al., “Structure-Driven Crawler Generation by Example”, SIGIR'06, Aug. 6-11, 2006, Seattle, Washington, USA. pp. 292-299.
Broder, et al., “Syntactic Clustering of the Web”, SRC Technical Note 1997-015. Jul. 25, 1997. pp. 1-13.
Brin, et al., “The Anatomy of a Large-Scale Hypertextual Web Search Engine”, Computer Networks and ISDN Systems vol. 30, Issue 1-7 (Apr. 1998) pp. 107-117.
Pandey, et al., “User-Centric Web Crawling”, WWW 2005, May 10-14, 2005, Chiba, Japan. pp. 401-411.
Cai, et al., “VIPS: a Vision-based Page Segmentation Algorithm”, Nov. 1, 2003 Technical Report. MSR-TR-2003-79. pp. 1-29.
International Search Report and Written Opinion for PCT Application No. PCT/US2009/040881, mailed on Dec. 7, 2009, 11 pages.
