Identification, separation and compression of multiple forms...

Image analysis – Pattern recognition – Classification

Reexamination Certificate

Details

C382S176000, C382S180000, C382S217000, C382S225000, C382S305000, C382S306000, C358S403000, C715S252000

active

06640009

ABSTRACT:

FIELD OF THE INVENTION
The present invention relates generally to document image processing, and specifically to methods for recognition of preprinted form documents and extraction of information that is filled into them.
BACKGROUND OF THE INVENTION
In many document imaging systems, large numbers of forms are scanned into a computer, which then processes the resultant document images to extract pertinent information. Typically, the forms comprise preprinted templates, containing fields that have been filled in by hand or with machine-printed characters. To extract the information that has been filled in, the computer must first identify the template. Various methods of image analysis are known in the art for these purposes. One such method is described in U.S. Pat. No. 5,434,933, whose disclosure is incorporated herein by reference.
In order to precisely identify the location of fields in the template, a common technique is for the computer to register each document image with a reference image of the template. Once the template is registered, it can be dropped from the document image, leaving only the handwritten or printed characters in their appropriate locations on the page. For example, U.S. Pat. Nos. 5,182,656, 5,191,525, 5,793,887, and 5,631,984, whose disclosures are incorporated herein by reference, describe methods for registering a document image with a form template so as to extract the filled-in information from the form. After drop-out of the template, the characters remaining in the image are typically processed using optical character recognition (OCR) or other techniques known in the art. Dropping the template from the document image is also crucial in compressing the image, substantially reducing the volume of memory required to store the image and the bandwidth required to transmit it. For example, U.S. Pat. No. 6,020,972, whose disclosure is incorporated herein by reference, as well as the above-mentioned U.S. Pat. No. 5,182,656, describe methods for compressing document images based on template identification. The template itself need be stored and/or transmitted only once for an entire group of images of filled-in forms.
Methods of template registration and drop-out that are known in the art generally require the template to be known before compression or other processing can take place. The computer must be informed of the template type or be able to select the template from a collection of templates that are known in advance. In other words, the computer must have on hand the appropriate empty template for every form type that it processes. However, it frequently happens that not all templates or template variations are known at start-up. Furthermore, experience shows that in most systems, there is not a single template for all form types, but rather several, and that unexpected template variations may occur that cannot be distinguished by any combination of the global features that are currently used for form recognition. In the context of the present patent application and in the claims, such template variants are referred to as “mutants.”
Thus, in form processing systems known in the art, it is generally not possible to use template drop-out in the presence of such mutants, without costly involvement by a human operator in identifying the template to use for each form.
SUMMARY OF THE INVENTION
In preferred embodiments of the present invention, a document image processing system receives images of filled-in forms, at least some of which are based on templates that are not known in advance. The system automatically aligns and sorts these images into groups having similar template features, using any suitable method known in the art. Each such group, however, may contain multiple mutant templates, differing in one or more of their features. The present invention provides novel methods for recognizing these mutants and sorting the images in each group accordingly into precise sub-groups, or classes, each with its own mutant template. Preferably, the mutant template in each class is then extracted and dropped out of the images, thus enabling optimal image compression and other subsequent processing.
In order to distinguish the mutants in a given group one from another, the system preferably generates a gray-scale accumulation image by combining the images in the group. This accumulation image is then analyzed in order to distinguish areas that belong to the template common to all of the images from areas in which variations occur from image to image. These variations are further analyzed to determine, in each area, whether they are due to mutations of the template or to content filled into the individual forms. When it is determined that the variation in a given area is due to template mutation, the images in the group are sorted into mutant sub-groups according to their content in this area, which is referred to herein as a reference area. Typically, a sub-group created by sorting the original group on one reference area may then be subdivided into smaller sub-groups by sorting it on another reference area. This sorting process preferably continues until substantially all of the images have been divided into mutant sub-groups, each having its own template that is common to all of the images in the sub-group.
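The accumulation step described above can be illustrated with a minimal sketch. It assumes the images are already aligned binary images of equal size (1 for ink, 0 for background), and it labels individual pixels rather than areas. The function names and the thresholds `high` and `low` are hypothetical and do not come from the patent text. The sketch does not attempt the further analysis that decides whether a varying area is due to template mutation or to filled-in content.

```python
from typing import List

Image = List[List[int]]  # binary image: 1 = ink, 0 = background


def accumulate(images: List[Image]) -> List[List[int]]:
    """Sum aligned binary images pixel by pixel into a gray-scale count image."""
    height, width = len(images[0]), len(images[0][0])
    acc = [[0] * width for _ in range(height)]
    for img in images:
        for y in range(height):
            for x in range(width):
                acc[y][x] += img[y][x]
    return acc


def label_pixels(acc: List[List[int]], n_images: int,
                 high: float = 0.9, low: float = 0.1) -> List[List[str]]:
    """Label each pixel by the fraction of images in which it is inked.

    'common'  : inked in nearly all images (candidate shared template)
    'blank'   : inked in nearly none
    'varying' : anything in between (mutation or filled-in content)
    The thresholds are illustrative assumptions only.
    """
    labels = []
    for row in acc:
        out = []
        for count in row:
            fraction = count / n_images
            if fraction >= high:
                out.append("common")
            elif fraction <= low:
                out.append("blank")
            else:
                out.append("varying")
        labels.append(out)
    return labels


# Four one-row images: pixel 0 is inked in all, pixel 1 in half, pixel 2 in none.
group = [[[1, 1, 0]], [[1, 1, 0]], [[1, 0, 0]], [[1, 0, 0]]]
accumulated = accumulate(group)
pixel_labels = label_pixels(accumulated, len(group))
```

In this toy group, the middle pixel is the only one labeled as varying, which makes it a candidate for the kind of further analysis the text describes.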
Preferably, after the sorting is completed, the respective template for each sub-group is extracted from one of the images and is dropped out of all of the images in the sub-group. The images are then automatically processed by compression, OCR and/or other document processing methods known in the art. Preferably, the extracted template is stored in a library for use in processing subsequent forms. The ability provided by preferred embodiments of the present invention to recognize and sort all mutants allows the images to be processed efficiently, reducing both the required storage volume and the costs of manual processing in dealing with large numbers of forms.
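Template drop-out for one sub-group can be sketched as follows. This is a simplification under stated assumptions, not the extraction method of the invention: the text says the template is extracted from one of the images, whereas this sketch estimates it naively as the set of pixels inked in every image of the sub-group. The images are assumed to be aligned binary images, and the function names are hypothetical.

```python
from typing import List

Image = List[List[int]]  # binary image: 1 = ink, 0 = background


def estimate_template(images: List[Image]) -> Image:
    """Assumed estimate: a pixel belongs to the template if every image inks it."""
    height, width = len(images[0]), len(images[0][0])
    return [[int(all(img[y][x] for img in images)) for x in range(width)]
            for y in range(height)]


def drop_out(img: Image, template: Image) -> Image:
    """Clear every template pixel from the image, keeping only the remainder."""
    return [[0 if t else pixel for pixel, t in zip(row, template_row)]
            for row, template_row in zip(img, template)]


# Two filled-in forms sharing the first two pixels as template.
sub_group = [[[1, 1, 0, 1]], [[1, 1, 1, 0]]]
template = estimate_template(sub_group)
residuals = [drop_out(img, template) for img in sub_group]
```

Storing `template` once and only the `residuals` per form reflects the storage saving described above. A naive pixel-wise drop-out like this one also erases filled-in strokes that overlap the template.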
Although the preferred embodiments described herein relate to processing of images of form documents, the principles of the present invention may similarly be applied in extracting information from groups of images of other types, in which the images in a group contain a common, substantially fixed part along with individual, variable parts.
There is therefore provided, in accordance with a preferred embodiment of the present invention, a method for processing images, including:
receiving a group of the images having similar characteristics, the group including multiple classes, such that each image belongs to one of the classes and includes a fixed portion common to all of the images in the class to which it belongs, and a variable portion, which distinguishes the image from the other images in the class;
finding a reference area in the images, in which the fixed portion of the images in a first one of the classes differs consistently from the fixed portion of the images in a second one of the classes; and
sorting the images into the classes based on the reference area.
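The sorting step can be sketched as grouping images by the content of the reference area. The sketch assumes aligned binary images and an exact pixel match inside the area. Real scanned images would need a tolerant comparison, which is not shown. The `Area` layout and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Image = List[List[int]]           # binary image: 1 = ink, 0 = background
Area = Tuple[int, int, int, int]  # (top, left, bottom, right), end-exclusive
Key = Tuple[Tuple[int, ...], ...]


def area_key(img: Image, area: Area) -> Key:
    """Return the pixels inside the area as a hashable tuple of rows."""
    top, left, bottom, right = area
    return tuple(tuple(row[left:right]) for row in img[top:bottom])


def sort_by_reference_area(images: List[Image], area: Area) -> Dict[Key, List[int]]:
    """Group image indices by the exact content of the reference area."""
    classes: Dict[Key, List[int]] = defaultdict(list)
    for index, img in enumerate(images):
        classes[area_key(img, area)].append(index)
    return dict(classes)


# The middle pixel is the reference area; the last pixel is filled-in content.
group = [[[1, 1, 0]], [[1, 0, 1]], [[1, 1, 1]], [[1, 0, 0]]]
classes = sort_by_reference_area(group, (0, 1, 1, 2))
```

Here images 0 and 2 fall into one class and images 1 and 3 into the other, regardless of the filled-in pixel. Applying the same function to a resulting class with a different reference area would subdivide it further.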
Preferably, receiving the group of the images includes processing the images to determine the characteristics thereof, and selecting the images for inclusion in the group by finding a similarity in the characteristics.
Further preferably, the characteristics include image features recognizable by a computer, and receiving the group of the images includes mutually aligning the images in the group responsive to the features. In a preferred embodiment, the images include images of form documents, the fixed portion of the images includes form templates, and the features include features of the templates.
Preferably, finding the reference area includes:
classifying a plurality of areas of the images into areas of a first type, in which substantially all of the images in the group are substantially the same, a second type, in which a sub-group of the images in the group, but not all of the images in the group, are substantially the same, and a third type, in which substantially all of the images in the group are differe
