Table of contents extraction with improved robustness

Data processing: presentation processing of document – operator i – Presentation processing of document – Edit – composition – or storage control

Reexamination Certificate

Details

active

07743327

ABSTRACT:
In a method for identifying a table of contents in a document (10), text fragments are extracted (12) from the document. The method identifies (20, 30, 34, 38): (i) a substantially contiguous group of text fragments as table of contents entries and (ii) a different group of text fragments as linked text fragments, each linked with a corresponding table of contents entry. During the identifying, the number of text fragments that are candidates for identification as linked text fragments is reduced based on at least one reduction criterion (130). The identified table of contents entries and linked text fragments (110) are validated based on at least one validation criterion (162) related to the distribution of the linked text fragments.
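
The abstract does not say which reduction or validation criteria are used, so the following Python fragment is only a rough sketch of the general flow (extract, link, reduce, validate), not the patented method. The function names `normalize` and `find_toc`, the parameters `max_words`, `min_entries` and `min_spread`, the exact-text matching, and both criteria (short fragments only as link candidates; linked fragments must span a minimum share of the document) are assumptions made for illustration.

```python
from typing import List, Optional, Tuple


def normalize(text: str) -> str:
    """Lowercase, turn punctuation (e.g. dot leaders) into spaces, drop bare numbers."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return " ".join(w for w in cleaned.split() if not w.isdigit())


def find_toc(fragments: List[str], max_words: int = 8,
             min_entries: int = 2, min_spread: float = 0.3
             ) -> Optional[List[Tuple[int, int]]]:
    """Return (entry_index, linked_index) pairs, or None if nothing validates."""
    n = len(fragments)
    keys = [normalize(f) for f in fragments]
    # Assumed reduction criterion: only short fragments may be link targets.
    candidates = [i for i, k in enumerate(keys) if k and len(k.split()) <= max_words]

    best: List[Tuple[int, int]] = []
    for start in range(n):
        links: List[Tuple[int, int]] = []
        last_target = -1
        pos = start
        while pos < n:
            if links and pos >= links[0][1]:
                break  # the run has reached its own first linked fragment
            # A link target must follow the entry and preserve document order.
            target = next((j for j in candidates
                           if j > max(pos, last_target) and keys[j] == keys[pos]), None)
            if target is None:
                break
            links.append((pos, target))
            last_target = target
            pos += 1
        if len(links) > len(best):
            best = links  # keep the longest contiguous run of linked entries

    if len(best) < min_entries:
        return None
    # Assumed validation criterion: linked fragments must be spread over
    # at least `min_spread` of the document rather than clustered together.
    spread = (best[-1][1] - best[0][1]) / n
    return best if spread >= min_spread else None


sample = [
    "Table of Contents",
    "1 Introduction ..... 1",
    "2 Methods ..... 5",
    "3 Results ..... 9",
    "1 Introduction",
    "This report describes a long study of document structure and its uses.",
    "2 Methods",
    "We extracted text fragments from each page of every scanned document.",
    "3 Results",
    "The approach recovered most entries in the sample that we examined.",
]
print(find_toc(sample))
```

Each returned pair holds the index of a table of contents entry and the index of the fragment it links to. A practical system would presumably use fuzzy rather than exact text matching and more than one criterion at each stage.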

REFERENCES:
patent: 5434962 (1995-07-01), Kyojima et al.
patent: 5491628 (1996-02-01), Wakayama et al.
patent: 5832520 (1998-11-01), Miller
patent: 5923334 (1999-07-01), Luken
patent: 6298357 (2001-10-01), Wexler et al.
patent: 6336124 (2002-01-01), Alam et al.
patent: 6487566 (2002-11-01), Sundaresan
patent: 6490603 (2002-12-01), Keenan et al.
patent: 6539387 (2003-03-01), Oren et al.
patent: 2002/0143818 (2002-10-01), Roberts et al.
patent: 2003/0093760 (2003-05-01), Suzuki et al.
patent: 2003/0208502 (2003-11-01), Lin
patent: 2004/0003028 (2004-01-01), Emmett et al.
patent: 2004/0024780 (2004-02-01), Agnihotri et al.
patent: 2004/0205461 (2004-10-01), Kaufman et al.
patent: 2006/0253441 (2006-11-01), Nelson
Déjean et al., “Structuring Documents According to Their Table of Contents,” Doc. Eng. '05, Bristol, UK, Nov. 2-4, 2005.
Déjean et al., “A System for Converting PDF Documents into Structured XML Format,” 7th IAPR Workshop on Document Analysis Systems, Nelson, New Zealand, Feb. 13-15, 2006.
Chanod et al., “From Legacy Documents to XML: A Conversion Framework,” 9th European Conf. on Research and Advanced Technology for Digital Libraries, Vienna, Austria, Sep. 18-23, 2005.
Adler, S., et al., “Extensible stylesheet language (XSL), Version 1.0,” W3C 2001, http://www.w3.org/TR/2001/REC-xsl-20011015/.
Aiello, M., Monz, C., Todoran, L., Worring, M., “Document understanding for a broad class of documents”, International Journal on Document Analysis and Recognition (IJDAR), vol. 5, 2002, Springer-Verlag, pp. 1-16.
Anjewierden, A., “AIDAS: Incremental logical structure discovery in PDF documents”, Proceedings of the International Conference on Document Analysis and Recognition (ICDAR), Seattle, 2001.
Belaïd, A., Pierron, L., Valverde, N., “Part-of-speech tagging for table of contents recognition”, International Conference on Pattern Recognition (ICPR 2000), vol. 4, Sep. 3-8, 2000 Barcelona, Spain.
Dori, D., Doermann, D., Shin, C., Haralick, R., Phillips, I., Buchman, M., Ross, D., “The representation of document structure: A generic object-process analysis”, Chapter XX, Handbook on Optical Character Recognition and Document Image Analysis, World Scientific Publishing Company, 1995/1996, pp. 000-000.
Dori, D., Doermann, D., Shin, C., Haralick, R., Phillips, I., Buchman, M., Ross, D., “The representation of document structure: A generic object-process analysis”, Chapter 16, Handbook of Character Recognition and Document Image Analysis, World Scientific Publishing Company, 1997, pp. 421-456.
Klink, S., Dengel, A., Kieninger, T., “Document structure analysis based on layout and textual features”, Proceedings of Fourth IAPR International Workshop on Document Analysis Systems, DAS 2000, Rio de Janeiro, Brazil, 2000, pp. 99-111.
U.S. Appl. No. 11/032,817, filed Jan. 10, 2005, Dejean et al.
U.S. Appl. No. 11/033,016, filed Jan. 10, 2005, Dejean et al.
U.S. Appl. No. 11/116,100, filed Apr. 27, 2005, Dejean et al.
U.S. Appl. No. 11/032,814, filed Jan. 10, 2005, Dejean et al.
U.S. Appl. No. 11/137,566, filed May 26, 2005, Meunier.
U.S. Appl. No. 10/756,313, filed Jan. 14, 2004, Chidlovskii et al.
Lin, C.C., Niwa, Y., Narita, S., “Logical structure analysis of book document images using contents of information”, 4th International Conference on Document Analysis and Recognition (ICDAR'97), Ulm, Germany, Aug. 1997, pp. 1048, 1054.
Lin, X., “Header and footer extraction by page-association”, Hewlett-Packard Company Technical Report, 2002, www.hpl.hp.com/techreports/2002/hpl-2002-129.pdf.
Lin, X., “Text-mining based journal splitting”, Proceedings of the Seventh International Conference on Document Analysis and Recognition (ICDAR 2003), vol. II, Aug. 3-6, 2003, Edinburgh, Scotland.
Lin, X., Simske, S., “Automatic document navigation for digital content re-mastering”, SPIE Conference on Document Recognition and Retrieval XI, Jan. 18-22, 2004, San Jose, CA.
Power, R., Scott, D., Bouayad-Agha, N., “Document Structure”, Computational Linguistics, vol. 29, No. 2, 2003, pp. 211-260.
Satoh, S., Takasu, A., Katsura, E., “An automated generation of electronic library based on document image understanding”, Proceedings of the Third International Conference on Document Analysis and Recognition (ICDAR'95), vol. 1, Aug. 14-15, 1995, Tokyo, Japan, pp. 163-166.
Summers, K.M., “Automatic discovery of logical document structure”, PhD thesis, Cornell University, Computer Science Department, Aug. 1998, pp. 1-181.
Virk, R., “Converting PDF files into XML”, CambridgeDocs, 2004, www.cambridgedocs.com.

Profile ID: LFUS-PAI-O-4162546
