System and method for data extraction from digital images

Image analysis – Image segmentation – Distinguishing text from other regions

Reexamination Certificate

Details

C358S462000, C382S305000, C707S793000

active

06400845

ABSTRACT:

CROSS-REFERENCE TO RELATED INVENTIONS
Not Applicable
STATEMENT REGARDING FEDERALLY FUNDED RESEARCH
Not Applicable
REFERENCE TO A MICROFICHE INDEX
Not Applicable
COPYRIGHT NOTICE
Copyright 1999 Computer Services, Inc. A portion of the disclosure of this patent document contains materials which are subject to copyright protection. The owner has no objection to the facsimile reproduction by anyone of the patent document or patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
BACKGROUND OF THE INVENTION
1. Field Of The Invention
This invention generally relates to systems and methods for the extraction of data from digital images and more particularly, to a system and method for the extraction of textual data from digital images.
2. Background Information
Systems are known which import data from scanned paper documents. Typically, these systems identify by physical location a data field in a scanned image of a blank document. When the system scans documents conforming to that blank document type, the data field location information is used to identify the area in the scanned document where the corresponding data appears and that data is then converted from bit mapped image data to text data for storage in a database.
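The location-based technique described above can be sketched as a fixed-coordinate crop. The sketch below is illustrative only: the image is modelled as a list of pixel rows, `FieldLocation` and `crop_field` are hypothetical names, and the character recognition step that would follow the crop is omitted.

```python
from typing import List, NamedTuple

Image = List[List[int]]  # rows of pixel values (0 = white, 1 = black)


class FieldLocation(NamedTuple):
    """Hypothetical record of where a data field sits on the blank form."""
    top: int
    left: int
    bottom: int  # exclusive
    right: int   # exclusive


def crop_field(image: Image, loc: FieldLocation) -> Image:
    """Return the pixels inside a field location recorded from the blank form."""
    return [row[loc.left:loc.right] for row in image[loc.top:loc.bottom]]


# Field location recorded once from the blank document ...
name_field = FieldLocation(top=1, left=2, bottom=3, right=5)

# ... then reused on every scanned document of that type.
scanned = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
field_pixels = crop_field(scanned, name_field)
```

Whatever falls inside the recorded rectangle is passed on for conversion, which is why this kind of approach depends on the data sitting where the blank form says it should.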
In U.S. Pat. No. 4,949,392 entitled “Document Recognition and Automatic Indexing for Optical Character Recognition,” issued Aug. 14, 1990, preprinted lines appearing on the form are used to locate text data and then the preprinted lines are filtered out of the image prior to optical character recognition processing. In U.S. Pat. No. 5,140,650 entitled “Computer Implemented Method for Automatic Extraction of Data from Printed Forms,” issued Aug. 18, 1992, lines in the image of a scanned document or form are used to define a data mask based on pixel data which is then used to locate the text to be extracted. In U.S. Pat. No. 5,293,429 entitled “System and Method for Automatically Classifying Heterogeneous Business Forms,” issued Mar. 8, 1994, the system uses a definition of lines within a data form to identify fields where character image data exists. Blank forms are used to create a form dictionary which is used to identify areas in which character data may be extracted. In U.S. Pat. No. 5,416,849 entitled “Data Processing System and Method for Field Extraction of Scanned Images of Document Forms,” issued May 16, 1995, the system of document image processing uses Cartesian coordinates to define data field location. In U.S. Pat. No. 5,815,595 entitled “Method and Apparatus for Identifying Text Fields and Checkboxes in Digitized Images,” issued Sep. 29, 1998, a system locates data fields using graphic data such as lines. In U.S. Pat. No. 5,822,454 entitled “System and Method for Automatic Page Registration and Automatic Zone Detection during Forms Processing,” issued Oct. 13, 1998, the system uses positional coordinate data to identify areas within a scanned document for data extraction. In U.S. Pat. No. 5,841,905 entitled “Business Form Image Identification Using Projected Profiles of Graphical Lines and Text String Lines,” issued Nov. 24, 1998, the system uses cross-correlation of graphical image data to identify a form and the areas within the form for data extraction.
Each of these patents discloses a system which uses graphical data to identify forms or regions within a form for data extraction. Because these systems rely on graphical data to identify areas for data extraction, any additional or wrong type of textual data present in such areas will also be extracted and stored. It would be advantageous to be able to determine whether the extracted data matches the type of data that is expected to be on the form. Also, if the data were of the correct type but mispositioned somewhat with respect to its expected position on the document, it would be advantageous to be able to locate and extract such mispositioned data. Further, where data is on multiple pages, such as a two-page phone bill, with the systems mentioned above each page of the phone bill that looks different would have to be defined as a new template. It would be advantageous to have a system that can process data from multiple-page forms without requiring additional preprocessing effort.
SUMMARY OF THE INVENTION
The present invention is a system and method for the extraction of textual data from digital images using a predefined pattern of visible and invisible characters contained in the textual data. The system comprises an image comparator, a template mapper, a zone optical character reader (OCR), a zone pattern comparator and data extractor, an extracted data parser, and a datastore. The datastore comprises a master document image database comprised of at least one table containing at least one master document image, a template database, and an extracted data database. The template database comprises at least one table comprising at least one template associated with a master document image. The template has at least one zone, and associated with each zone is a unique pattern comprised of one or more data segments. Each data segment comprises a predefined sequence of visible and invisible characters, with selected ones of the data segments being associated with an extracted data field in an extracted database record. The extracted data database comprises at least one table of extracted database records, and each record comprises at least one data field for storing textual information extracted from the digital image.
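The template, zone, and pattern structure described above can be sketched as a small data model. This is an illustrative sketch only: the class names (`DataSegment`, `Zone`, `Template`), the use of regular-expression fragments to encode each segment's sequence of visible and invisible characters, and the example label and field names are assumptions, not taken from the patent.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DataSegment:
    # Regex fragment standing in for a predefined sequence of visible
    # and/or invisible (whitespace) characters. Hypothetical encoding.
    regex: str
    # Name of the extracted-data field this segment feeds, if any.
    field_name: Optional[str] = None


@dataclass
class Zone:
    name: str
    # Zone bounding box on the master image: (left, top, right, bottom).
    box: tuple
    segments: List[DataSegment] = field(default_factory=list)

    def compiled_pattern(self) -> re.Pattern:
        # Concatenate the segments in order; segments tied to a field
        # become named groups so they can be pulled out after a match.
        parts = []
        for seg in self.segments:
            if seg.field_name:
                parts.append(f"(?P<{seg.field_name}>{seg.regex})")
            else:
                parts.append(f"(?:{seg.regex})")
        return re.compile("".join(parts))


@dataclass
class Template:
    master_image_id: str
    zones: List[Zone] = field(default_factory=list)


# Hypothetical zone whose pattern alternates visible and invisible segments.
invoice_zone = Zone(
    name="invoice_header",
    box=(0, 0, 400, 60),
    segments=[
        DataSegment(r"Invoice"),                             # visible label
        DataSegment(r"[ \t]+"),                              # invisible: spaces or tabs
        DataSegment(r"No\.?"),                               # visible label
        DataSegment(r"[ \t]*"),                              # invisible, optional
        DataSegment(r"\d{6}", field_name="invoice_number"),  # visible data
    ],
)
template = Template(master_image_id="master-001", zones=[invoice_zone])
```

Only the last segment carries a `field_name`, mirroring the statement that selected data segments, rather than all of them, are associated with an extracted data field.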
The image comparator receives from the master document image database in the datastore a master document image for comparison with a digital image. The image comparator provides an output indicative of the success of the comparison. The template mapper, on receiving the image comparator output indicating a successful comparison, retrieves from the template database in the datastore the template associated with the successfully compared master document image and applies this template to the digital image. The template mapper provides as an output an image of each zone associated with the applied template. The zone optical character reader (OCR) receives the zone images and creates as an output a zone data file of the characters in each zone image. The zone pattern comparator receives from the template database the pattern associated with the zone and compares the pattern to the zone data file. In the event that the pattern is found, the data matching the pattern is extracted. The extracted data parser receives the extracted data, parses it based on the pattern, and populates the data field of the database record associated with the digital image, which is stored in the extracted data database.
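The comparator, extractor, and parser stages described above can be sketched as follows. This is a minimal sketch under stated assumptions: the zone data file is modelled as a plain string already produced by OCR (no OCR engine is called), the pattern is written as a Python regular expression with named groups, and the function names, sample text, and field names are all hypothetical.

```python
import re
from typing import Dict, Optional

# Hypothetical zone pattern: visible labels, invisible separators
# (spaces, tabs, a line break), and visible data tied to database fields.
ACCOUNT_PATTERN = re.compile(
    r"Account[ \t]+Number:[ \t]*(?P<account_number>\d{3}-\d{4})"
    r"[ \t]*\r?\n[ \t]*"  # invisible: line break with optional padding
    r"Amount[ \t]+Due:[ \t]*\$(?P<amount_due>\d+\.\d{2})"
)


def compare_and_extract(zone_data: str, pattern: re.Pattern) -> Optional[re.Match]:
    """Zone pattern comparator and data extractor: look for the pattern
    anywhere in the zone text, so surrounding stray text is ignored."""
    return pattern.search(zone_data)


def parse_into_record(match: re.Match) -> Dict[str, str]:
    """Extracted data parser: split the matched text into the fields
    named by the pattern's data segments."""
    return dict(match.groupdict())


def process_zone(zone_data: str, pattern: re.Pattern) -> Optional[Dict[str, str]]:
    match = compare_and_extract(zone_data, pattern)
    if match is None:
        return None  # pattern not found: nothing is extracted
    return parse_into_record(match)


# Stand-in for a zone data file produced by the zone OCR stage.
zone_text = "PAY BY MAIL\nAccount Number: 555-0199\n  Amount Due: $42.10\nThank you"
record = process_zone(zone_text, ACCOUNT_PATTERN)
```

In this sketch the match is made on the character sequence rather than on position, so text that has shifted within the zone is still found, while text that does not fit the pattern yields no record.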
The method for the extraction of textual data comprises:
a) selecting from a database a master document image having associated therewith a template, zone, and associated with each zone a pattern comprised of one or more data segments containing a data sequence of one or more characters;
b) creating an unpopulated database table having one or more data records, each data record having one or more data fields for containing visible character data extracted from the digital image, associating the database table with the master document image and the database record with the digital image, and, for at least one of the data segments containing visible data, associating it with a database field;
c) comparing the digital image to the master document image and upon a successful match occurring:
applying the template and zone therein to the digital image,
performing optical character recognition on the character images within the zone,
creating a zone data file containing the characters optically read from the zone;
comparing the zone data file with the pattern associated with the zone;
extracting the data in the zone data file that matches the pattern, and, for each data segment associated with a data field, populating the data field with the visible data extracted from the zone data file corresponding
