Compressed document matching

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

Details

C382S168000, C382S173000, C382S181000, C382S232000, C382S276000, C345S427000

active

06363381

ABSTRACT:

FIELD OF THE INVENTION
The present invention relates to the field of document management, and more particularly to detecting duplicate documents.
BACKGROUND OF THE INVENTION
With the increased ease of creating and transmitting electronic document images, it has become common for document images to be maintained in database systems that include automated document insertion and retrieval utilities. Consequently, it has become increasingly important to be able to efficiently and reliably determine whether a duplicate of a document submitted for insertion is already present in a database. Otherwise, duplicate documents will be stored in the database, needlessly consuming precious storage space. Determining whether a database contains a duplicate of a document is referred to as document matching.
In currently available image-content based retrieval systems, color, texture and shape features are frequently used for document matching. Matching document images that are mostly bitonal and similar in shape and texture poses different problems.
A common document matching technique is to perform optical character recognition (OCR) followed by a text based search. Another approach is to analyze the layout of the document and look for structurally similar documents in the database. Unfortunately, both of these approaches require computationally intensive page analysis. One way to reduce the computational burden is to embed specially designed markers in the documents so that the documents can be reliably identified.
Recently, alternatives to the text based approach have been developed by extracting features directly from images, with the goal of achieving efficiency and robustness over OCR. An example of such a feature is word length. Using sequences of word lengths in documents as indexes, matching documents may be identified by comparing the number of hits in each of the images generated by the query. Another approach is to map alphabetic characters to a small set of character shape codes (CSC's) which can be used to compile search keys for ASCII text retrieval. CSC's can also be obtained from text images based on the relative positions of connected components to baselines and x-height lines. In this way CSC's can be used for word spotting in document images. The application of CSC's has been extended to document duplicate detection by constructing multiple indexes using short sequences of CSC's extracted from the first line of text of sufficient length.
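The word-length indexing idea described above can be sketched as follows. This is only an illustration under stated assumptions: the word lengths are taken as already measured (for example from word bounding boxes), and the function names, the key length n=3, and the sample data are hypothetical rather than taken from any cited system.

```python
from collections import Counter, defaultdict


def word_length_keys(word_lengths, n=3):
    """Return every run of n consecutive word lengths as a hashable key."""
    return [tuple(word_lengths[i:i + n]) for i in range(len(word_lengths) - n + 1)]


def build_index(documents, n=3):
    """Map each word-length key to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, lengths in documents.items():
        for key in word_length_keys(lengths, n):
            index[key].add(doc_id)
    return index


def count_hits(index, query_lengths, n=3):
    """Count, per stored document, how many query keys hit the index."""
    hits = Counter()
    for key in word_length_keys(query_lengths, n):
        for doc_id in index.get(key, ()):
            hits[doc_id] += 1
    return hits


# Hypothetical word-length sequences for two stored documents.
documents = {
    "doc_a": [3, 7, 9, 7, 2, 3, 5, 2, 8, 10],
    "doc_b": [4, 4, 6, 2, 9, 3, 3, 5, 1, 7],
}
index = build_index(documents)

# A query that repeats doc_a with one mis-measured word (9 read as 8).
query = [3, 7, 8, 7, 2, 3, 5, 2, 8, 10]
hits = count_hits(index, query)
best_match = hits.most_common(1)[0][0]
```

Because each key covers only a short run of words, a single mis-measured word spoils only the keys that include it, and the remaining keys still vote for the correct document.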
A significant disadvantage of the above-described approaches is that they are inherently text line based. Line, word or even character segmentation must usually be performed. In one non-text-based approach, duplicate detection is based on horizontal projection profiles. The distance between wavelet coefficient vectors of the profiles represents document similarity. This technique may out-perform the text-based approach on degraded documents and documents with small amounts of text.
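A minimal sketch of the projection-profile comparison might look like the following. The text does not say which wavelet is used or how many coefficients are kept, so the Haar transform, the `keep` parameter, the Euclidean distance, and the tiny 8-row sample pages are all assumptions made for illustration.

```python
import math


def horizontal_profile(image):
    """Count black pixels (1s) in each row of a bitonal image."""
    return [sum(row) for row in image]


def haar_transform(signal):
    """Full Haar decomposition of a signal whose length is a power of two."""
    coeffs = []
    current = list(signal)
    while len(current) > 1:
        averages = [(current[i] + current[i + 1]) / 2 for i in range(0, len(current), 2)]
        details = [(current[i] - current[i + 1]) / 2 for i in range(0, len(current), 2)]
        coeffs = details + coeffs  # coarser details end up in front
        current = averages
    return current + coeffs


def profile_distance(image_a, image_b, keep=4):
    """Euclidean distance between the first `keep` Haar coefficients."""
    ca = haar_transform(horizontal_profile(image_a))[:keep]
    cb = haar_transform(horizontal_profile(image_b))[:keep]
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(ca, cb)))


def make_page(black_per_row, width=8):
    """Build a toy bitonal page with the given number of black pixels per row."""
    return [[1] * n + [0] * (width - n) for n in black_per_row]


# Hypothetical pages: an original, a slightly degraded copy, and a different page.
original = make_page([0, 6, 5, 0, 0, 7, 6, 0])
noisy_copy = make_page([0, 6, 4, 0, 1, 7, 6, 0])
other = make_page([5, 0, 0, 6, 7, 0, 0, 4])

d_copy = profile_distance(original, noisy_copy)
d_other = profile_distance(original, other)
```

No line, word, or character segmentation is needed here: the only input is a per-row pixel count, which is why this style of comparison can tolerate degraded pages.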
Because the majority of document images in databases are stored in compressed formats, it is advantageous to perform document matching on compressed files. This eliminates the need for decompression and recompression and makes commercialization more feasible by reducing the amount of memory required. Of course, matching compressed files presents additional challenges. For CCITT Group 4 compressed files, pass codes have been shown to contain information useful for identifying similar documents. In one prior-art technique, pass codes are extracted from a small text region and used with the Hausdorff distance metric to correctly identify a high percentage of duplicate documents. However, calculation of the Hausdorff distance is computationally intensive and the number of distance calculations scales linearly with the size of the database.
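To make the cost concern concrete, the following sketch computes a brute-force Hausdorff distance between two point sets and scans a database with it. It assumes that pass-code positions have already been extracted as (x, y) points; the function names and sample coordinates are hypothetical, and the prior-art technique may compute the distance differently.

```python
import math


def directed_hausdorff(points_a, points_b):
    """Largest distance from any point in A to its nearest point in B."""
    return max(min(math.dist(a, b) for b in points_b) for a in points_a)


def hausdorff(points_a, points_b):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    return max(directed_hausdorff(points_a, points_b),
               directed_hausdorff(points_b, points_a))


def linear_scan(query_points, database):
    """Return the id of the closest document, computing one distance per entry."""
    return min(database, key=lambda doc_id: hausdorff(query_points, database[doc_id]))


# Hypothetical pass-code positions from a small text region of each document.
database = {
    "doc_a": [(0, 0), (3, 5), (10, 0)],
    "doc_b": [(20, 20), (25, 25)],
}
query = [(0, 0), (3, 4), (10, 0)]

closest = linear_scan(query, database)
```

Each call to `hausdorff` compares every point in one set with every point in the other, and `linear_scan` repeats that for every stored document, which is the linear scaling the paragraph above describes.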
SUMMARY OF THE INVENTION
A method and apparatus for determining if a query document matches one or more of a plurality of documents in a database are disclosed. A bit profile of the query document is generated based on the number of bits required to encode each of a plurality of rows of pixels in the document. The bit profile is compared against bit profiles associated with the plurality of documents in the database to identify one or more candidate documents. Endpoint features are identified in the query document and a set of descriptors for the query document is generated based on locations of the endpoint features. The set of descriptors generated for the query document is compared against respective sets of descriptors for the one or more candidate documents to determine if the query document matches at least one of the one or more candidate documents.
Other features and advantages of the invention will be apparent from the accompanying drawings and from the detailed description that follows below.
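The two-stage comparison outlined in the summary can be sketched roughly as follows. The summary does not specify how bit profiles are compared, how descriptors are derived from endpoint locations, or how descriptor sets are scored, so the mean absolute difference, the Jaccard overlap, the thresholds, and the tuple-valued descriptors below are all placeholder assumptions. The per-row bit counts are taken as given rather than read from a compressed file.

```python
def profile_difference(profile_a, profile_b):
    """Mean absolute difference between two equal-length bit profiles."""
    return sum(abs(a - b) for a, b in zip(profile_a, profile_b)) / len(profile_a)


def select_candidates(query_profile, database, max_difference):
    """Stage 1: keep documents whose bit profile is close to the query's."""
    return [doc_id for doc_id, record in database.items()
            if profile_difference(query_profile, record["profile"]) <= max_difference]


def descriptor_overlap(descriptors_a, descriptors_b):
    """Fraction of descriptors shared by two sets (Jaccard similarity)."""
    union = descriptors_a | descriptors_b
    return len(descriptors_a & descriptors_b) / len(union) if union else 0.0


def find_matches(query, database, max_difference=10.0, min_overlap=0.5):
    """Stage 2: confirm stage-1 candidates by comparing descriptor sets."""
    candidates = select_candidates(query["profile"], database, max_difference)
    return [doc_id for doc_id in candidates
            if descriptor_overlap(query["descriptors"],
                                  database[doc_id]["descriptors"]) >= min_overlap]


# Hypothetical records: bits needed per pixel row, plus placeholder descriptors.
database = {
    "doc_a": {"profile": [12, 85, 90, 14, 70, 12],
              "descriptors": {(1, 4), (2, 7), (5, 3), (6, 6)}},
    "doc_b": {"profile": [12, 80, 95, 10, 72, 15],
              "descriptors": {(9, 9), (8, 2), (3, 3), (7, 1)}},
    "doc_c": {"profile": [60, 12, 12, 95, 12, 40],
              "descriptors": {(2, 2), (4, 8), (6, 1), (7, 7)}},
}
query = {"profile": [12, 84, 92, 14, 69, 12],
         "descriptors": {(1, 4), (2, 7), (5, 3), (4, 4)}}

candidates = select_candidates(query["profile"], database, 10.0)
matches = find_matches(query, database)
```

The point of the split is that the inexpensive profile comparison narrows the database to a few candidates, so the more detailed descriptor comparison runs on only those documents.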


