Method for document comparison and classification using...

Image analysis – Image segmentation

Reexamination Certificate


Details

C382S175000, C382S176000, C382S177000, C382S180000, C382S199000, C382S202000, C382S218000, C382S224000, C382S225000

Reexamination Certificate

active

06542635

ABSTRACT:

FIELD OF THE INVENTION
The present invention relates generally to document processing. Specifically, a new method is taught that uses document layout data to compare and/or classify documents by type.
BACKGROUND OF THE INVENTION
In various applications it is desirable to classify documents by their type, e.g., business letter, article, fax cover sheet, etc. Documents can be classified as belonging to any identifiable type. Applications calling for document classification include database management and document routing. Prior art methods of document classification involve three general steps before document types can be identified. First, each document is preprocessed, which often includes page segmentation. Page segmentation is the process by which neighboring characters in the document are grouped into blocks of text and white space. Second, one of a variety of known optical character recognition (“OCR”) methods is applied to part of, or the entire, document. Finally, keywords that reflect the document type are extracted from each document.
The prior art classification processes are relatively inefficient because they require substantial OCR, which is costly in resources (memory, computational load and time). This is especially true where only a relatively small portion of the document is required to identify its type. Moreover, it may be that the user requires OCR only for a particular document type, yet OCR must be applied to all of the documents in a database to determine the set of documents characterized by the desired type. Accordingly, it would be advantageous to have a method for comparing and classifying document types without requiring OCR.
SUMMARY OF THE INVENTION
In accordance with the present invention, a new method uses document layout information to classify a document type. Documents are first processed with a page segmentation method to obtain blocks of data. A grid of rows and columns, forming bins, is created on the page to intersect the blocks. Layout information is identified using a unique fixed-length vector scheme, referred to herein as interval encoding, to represent each row of the segmented document. Using this new vector scheme, documents can be compared using a warping function to compute the relative interval distances of two or more documents. In addition, documents stored in a database may be retrieved, deleted, or otherwise managed by type, using their corresponding vector sets.
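The grid, row encoding, and warped comparison described above can be sketched as follows. This is a minimal illustration, not the patented procedure: the summary does not give the interval encoding formula, so `encode_row` uses an assumed stand-in (each text bin stores the length of the run of text bins containing it), and the warping function is assumed to be standard dynamic time warping over rows. The block format `(x0, y0, x1, y1)` and all function names are hypothetical.

```python
from typing import List, Tuple

# Hypothetical block format: (x0, y0, x1, y1) in page units.
Block = Tuple[float, float, float, float]


def occupancy_grid(blocks: List[Block], page_w: float, page_h: float,
                   rows: int, cols: int) -> List[List[int]]:
    """Mark every bin of a rows x cols grid that a text block overlaps."""
    grid = [[0] * cols for _ in range(rows)]
    bin_w, bin_h = page_w / cols, page_h / rows
    for x0, y0, x1, y1 in blocks:
        for r in range(rows):
            for c in range(cols):
                bx0, by0 = c * bin_w, r * bin_h
                overlaps_x = x0 < bx0 + bin_w and x1 > bx0
                overlaps_y = y0 < by0 + bin_h and y1 > by0
                if overlaps_x and overlaps_y:
                    grid[r][c] = 1
    return grid


def encode_row(row: List[int]) -> List[int]:
    """ASSUMED encoding: a text bin holds the length of its run of
    consecutive text bins; a background bin holds 0."""
    out = [0] * len(row)
    c = 0
    while c < len(row):
        if row[c]:
            start = c
            while c < len(row) and row[c]:
                c += 1
            for k in range(start, c):
                out[k] = c - start
        else:
            c += 1
    return out


def row_distance(u: List[int], v: List[int]) -> int:
    """City-block distance between two fixed-length row vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def warped_distance(doc_a: List[List[int]], doc_b: List[List[int]]) -> float:
    """Dynamic time warping over the two row sequences (assumed warping)."""
    n, m = len(doc_a), len(doc_b)
    inf = float("inf")
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = row_distance(doc_a[i - 1], doc_b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]
```

Because every row vector has the same length regardless of how many blocks the row crosses, rows from different documents can be compared directly, and the warping step absorbs vertical stretching such as a longer letter body.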
Documents may also be classified by type using an extension of the foregoing layout scheme without requiring OCR. In this embodiment, an arbitrary number of clusters is formed for grouping interval encoding vectors. Each cluster is identified with a cluster center vector which relates to the interval encoding vectors of that group. A document to be processed in accordance with the present invention is first, as with document comparison, segmented into data blocks. Interval encoding in accordance with the present invention is then performed on the segmented document. Thereafter, each interval encoding vector in the document is replaced with the cluster center for the cluster to which it belongs.
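The replacement of each row vector by its cluster center can be sketched as a nearest-center lookup. This is a sketch under assumptions: the summary does not say how the clusters are formed or which distance is used, so the centers are taken as given and squared Euclidean distance is assumed. The function names are hypothetical. The sketch returns the index of the center rather than the center itself, which carries the same information and yields a symbol sequence.

```python
from typing import List, Sequence


def squared_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Squared Euclidean distance (an assumed choice of metric)."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def nearest_center(vector: Sequence[float],
                   centers: List[Sequence[float]]) -> int:
    """Index of the cluster center closest to the given row vector."""
    return min(range(len(centers)),
               key=lambda k: squared_distance(vector, centers[k]))


def quantize_document(rows: List[Sequence[float]],
                      centers: List[Sequence[float]]) -> List[int]:
    """Replace every row vector with the index of its cluster center."""
    return [nearest_center(row, centers) for row in rows]
```

For example, with centers `[[0, 0], [4, 4]]`, the rows `[[0, 1], [3, 4], [5, 5]]` map to the symbol sequence `[0, 1, 1]`.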
Each document type desired for classification is modeled with a Hidden Markov Model (“HMM”). Using known algorithms, new documents are compared with the document type models to classify all documents by the model types.
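One way to compare a document with the type models is to score its symbol sequence under each model and pick the best score. The sketch below assumes a plain discrete HMM scored with the scaled forward algorithm; the summary only says "known algorithms", so this choice, the parameter layout `(start, trans, emit)`, and the function names are assumptions. The observation sequence is assumed to be non-empty.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical model layout: (start probabilities, transition matrix,
# emission matrix indexed as emit[state][symbol]).
Model = Tuple[List[float], List[List[float]], List[List[float]]]


def log_likelihood(obs: List[int], start: List[float],
                   trans: List[List[float]], emit: List[List[float]]) -> float:
    """Log-probability of a non-empty symbol sequence (scaled forward pass)."""
    n = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    log_p = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [sum(alpha[p] * trans[p][s] for p in range(n))
                     * emit[s][obs[t]] for s in range(n)]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # the model cannot produce this sequence
        log_p += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return log_p


def classify(obs: List[int], models: Dict[str, Model]) -> str:
    """Name of the document type model with the highest likelihood."""
    return max(models, key=lambda name: log_likelihood(obs, *models[name]))
```

Here `obs` would be the sequence of cluster indices for the rows of one page, read top to bottom, and `models` would hold one trained model per document type.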
Furthermore, based on the classification, it is a simple matter to locate which blocks of data contain certain information. Where only that information is desired, it is not necessary to perform OCR on the entire document. Rather, OCR may be limited to those blocks where the particular information is expected based on the document type. For example, suppose it is desired to organize all business letters by addressee. A business letter has a predictable format, with the addressee information found in the left upper third of the first page of the document. Once the document is identified by layout to be a business letter, it is an easy matter to examine only the left upper third of the document to recognize the addressee. There is no need to perform character recognition on the entire document before identifying the addressee.
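Limiting OCR to the expected region amounts to filtering the segmented blocks by position before any recognition runs. The sketch below is illustrative only: the block format `(x0, y0, x1, y1)`, the origin at the top left with y increasing downward, and the function name are assumptions, and the OCR call itself is left out.

```python
from typing import List, Tuple

# Hypothetical format for blocks and regions: (x0, y0, x1, y1),
# origin at the top left of the page, y increasing downward.
Box = Tuple[float, float, float, float]


def blocks_in_region(blocks: List[Box], region: Box) -> List[Box]:
    """Keep only the blocks that overlap the region of interest."""
    rx0, ry0, rx1, ry1 = region
    return [b for b in blocks
            if b[0] < rx1 and b[2] > rx0 and b[1] < ry1 and b[3] > ry0]
```

For a 612 by 792 page, a caller might pass an assumed addressee region such as `(0, 0, 306, 264)` and send only the returned blocks to an OCR engine.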


REFERENCES:
patent: 5687253 (1997-11-01), Huttenlocher et al.
patent: 5745600 (1998-04-01), Chen et al.
patent: 5757963 (1998-05-01), Ozaki et al.
patent: 5999664 (1999-12-01), Mahoney et al.
patent: 6249604 (2001-06-01), Huttenlocher et al.
patent: 6356864 (2002-03-01), Foltz et al.
Henry S. Baird, “Background Structure In Document Images”, International Journal of Pattern Recognition and Artificial Intelligence vol. 8 No. 5 (1994) 1013-1030.
Wei Zhu, “Image Organization and Retrieval using a Flexible Shape Model”, pp 31-39 1997 IEEE.
Andreas Dengel and Gerhard Barth, “High Level Document Analysis Guided by Geometric Aspects”, International Journal of Pattern Recognition and Artificial Intelligence vol. 2 No. 4 (1988) 641-655.
Hanno Walischewski, “Automatic Knowledge Acquisition for Spatial Document Interpretation”, pp. 243-247 1997 IEEE.
David Doermann, Huiping Li and Omid Kia, “The Detection of Duplicates in Document Image Databases”, pp. 314-318, 1997 IEEE.
Jonathan J. Hull and John F. Cullen, “Document Image Similarity and Equivalence Detection”, pp. 308-312 1997 IEEE.
R.S. Kashi, J. Hu, W.L. Nelson, “On-line Handwritten Signature Verification using Hidden Markov Model Features”.
Douglas E. Critchlow, “Metric Methods for Analyzing Partially Ranked Data,” in 34 Lecture Notes in Statistics (D. Brillinger, et al eds. 1985).
Allen Gersho, Robert M. Gray, “Vector Quantization and Signal Compression” (1992).
Lawrence Rabiner, Biing-Hwang Juang, “Fundamentals of Speech Recognition” (1993).
Jianying Hu, Michael K. Brown, and William Turin, “HMM Based On-Line Handwriting Recognition”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 18, No. 10, Oct. 1996.
John D. Ferguson, “Variable Duration Models for Speech” pp. 143-147.
Hiroaki Sakoe and Seibi Chiba, “Dynamic Programming Algorithm Optimization for Spoken Word Recognition”, IEEE Transactions on Acoustics, Speech and Signal Processing vol. ASSP-26, No. 1 pp. 43-49 (1978).
Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison (D. Sankoff and J. Kruskal eds. 1983).
William Turin, “Digital Transmission Systems Performance Analysis and Modeling”, (McGraw-Hill).
S.E. Levinson, “Continuously variable duration hidden Markov models for automatic speech recognition”, Computer Speech and Language pp. 29-45 (1986).


Profile ID: LFUS-PAI-O-3062324
