File structure for scanned documents

Image analysis – Image segmentation – Region labeling

Reexamination Certificate

Details

C382S176000, C382S305000

active

06275610

ABSTRACT:

FIELD OF THE INVENTION
The present invention relates to systems and methods for information storage, retrieval, and transmission, and more specifically to systems and methods for storing, retrieving, and transmitting documents.
DESCRIPTION OF RELATED ART
A document scanned into a computer system can be represented by a raster image of the document. This image can be used to reproduce the document to a degree limited by the fidelity of the scanning and storage system. However, without performing character recognition on the image, the document cannot be searched or edited as a text document, limiting the overall practical utility of scanned images of documents.
Ordinarily, to overcome these shortcomings, a scanned document will be input into a character recognition program. The document can then be treated as a text document. Typical character recognition programs, however, have significant shortcomings, including misrecognition of text (e.g. misidentifying a “b” as an “h”), misidentification of fonts, and, potentially, the loss of significant formatting information. Often these shortcomings of traditional character recognition programs can only be overcome through time-consuming and potentially expensive detailed proofreading of the document by a human operator.
Furthermore, when a document is stored in a computer file, whether mechanically entered into a computer or scanned in and proofread, a typical computer file may store each and every occurrence of a word, phrase, picture or formatting instruction. As a result, the exact same information may be repeated numerous times within the file. This redundancy means that more information is stored in the file than is required to represent the information content of the document. Files stored in this manner take more computer memory to store, more bandwidth to transmit, and more time to process. For example, such a document may require more time to search through due to the redundancy of the information stored in the file.
One system that is sometimes used to enable a document to be quickly searched involves textual indexing schemes which store exactly one copy of each word contained within an electronic document. Although this technique makes it easier to search a document for text or text patterns, textual indexing schemes are not able to recreate the original document since formatting and other information is lost.
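The trade-off described above can be illustrated with a minimal sketch of a textual index. It is not taken from the patent: the function name `build_index` and the tokenization rule (lowercase alphanumeric runs) are assumptions made for illustration only.

```python
import re
from typing import Dict, List


def build_index(text: str) -> Dict[str, List[int]]:
    """Hypothetical textual index: each distinct word is stored once,
    mapped to the token positions at which it occurs."""
    index: Dict[str, List[int]] = {}
    # Assumed tokenization: lowercase the text and keep alphanumeric runs.
    # Case, punctuation, and spacing are discarded at this step.
    for position, word in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
        index.setdefault(word, []).append(position)
    return index


index = build_index("The cat. The  DOG!")
print(index)
```

Looking up a word is a single dictionary access, but the capitalization, punctuation, and spacing of the input cannot be recovered from `index`, which is the limitation the paragraph above points out.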
What is needed is a file format which can overcome the shortcomings listed above, including storing the information content present in a document while reducing the redundancy inherent in the document.
SUMMARY OF THE INVENTION
The present invention provides an electronic file and file structure solution for comprehensive management of traditional word processor documents and documents captured as scanned images, raster images or a representation. In an embodiment of the present invention a representation of a document is received. The representation of the document includes a plurality of objects. The locations in the document of objects in the plurality of objects are identified. A plurality of sets of objects in the plurality of objects is generated wherein objects in the plurality of objects in each set in the plurality of sets are classified as similar. A file is created containing the locations of objects in the plurality of objects and one copy of an object from each set in a group of sets in the plurality of sets.
According to one aspect of the present invention objects in the plurality of objects include characters. According to another aspect of the present invention the file contains at most one object from each set in the plurality of sets. According to still another aspect of the present invention, the group of sets includes all of the sets in the plurality of sets. According to yet another aspect of the present invention, any suitable imaging device can be used to generate the representation of the document, including but not limited to a scanner, a fax machine, a photocopier, a digital photocopier, or a hand-held screen input computer.
In another embodiment of the invention a resource receives a representation of a document. The representation includes a plurality of objects. Objects in the plurality of objects are classified as similar. A resource identifies locations in the document of objects in the plurality of objects. A resource creates a file. The file contains one copy of each object in the plurality of objects classified as different and the locations of the objects in the document.
In yet another embodiment of the present invention a resource receives a representation of a document. The representation includes a plurality of objects. A resource identifies locations in the document of objects in the plurality of objects. A resource generates a plurality of sets of objects in the plurality of objects wherein objects in the plurality of objects in each set in the plurality of sets are classified as similar. A resource creates a file containing the locations of objects in the plurality of objects and containing one copy of an object from each set in a group of sets in the plurality of sets. In one aspect of the invention objects in the plurality of objects include characters. In another aspect of the invention the file contains at most one object from each set in the plurality of sets. In yet another aspect of the present invention the group of sets includes all of the sets in the plurality of sets.
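The embodiments above can be sketched as follows. This is an illustrative sketch only, not the patented implementation: the names `DocumentFile` and `build_file` are hypothetical, objects are assumed to arrive already segmented as small bitmaps with coordinates, and two objects are assumed to be "similar" only when their bitmaps are identical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical representation: a bitmap is a tuple of equal-length strings,
# where '#' marks an inked pixel and '.' marks background.
Bitmap = Tuple[str, ...]


@dataclass
class DocumentFile:
    # One stored copy per set of similar objects.
    prototypes: List[Bitmap] = field(default_factory=list)
    # One entry per object in the document: (prototype index, x, y).
    locations: List[Tuple[int, int, int]] = field(default_factory=list)


def build_file(objects: List[Tuple[Bitmap, int, int]]) -> DocumentFile:
    """Store each distinct bitmap once and record where every object sits.

    Assumption: similarity is exact bitmap equality. A real classifier
    would be more tolerant of scanning noise.
    """
    doc = DocumentFile()
    index_of: Dict[Bitmap, int] = {}
    for bitmap, x, y in objects:
        if bitmap not in index_of:
            index_of[bitmap] = len(doc.prototypes)
            doc.prototypes.append(bitmap)
        doc.locations.append((index_of[bitmap], x, y))
    return doc


glyph_a = ("#.", "##")
glyph_b = ("##", "#.")
doc = build_file([(glyph_a, 0, 0), (glyph_b, 10, 0), (glyph_a, 20, 0)])
print(len(doc.prototypes), len(doc.locations))  # 2 3
```

Three objects are recorded by location, but only two bitmaps are stored, because the first and third objects fall into the same set.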
The file thus may contain all of the information required to faithfully reproduce the original document. In order to reconstruct the document, the objects are placed at the locations identified in the file. The file stores the location of the objects in any format which retains enough information content to allow the original document to be reproduced to the extent desired by the user. For example, location information can be stored as absolute coordinates of the objects in the document, or as relative coordinates of the objects with respect to each other. Additionally, the location information can be stored as a distance from a fixed point in the document such as the upper left-hand corner, or the location information could be stored as the distance of the objects from a calculated point or a user defined point such as the center of a page, or the centroid of the objects on the page. In one aspect of the invention, the location information is stored in a spatial location index in the file.
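The location formats and the reconstruction step described above might look like the following sketch. All function names are hypothetical, coordinates are assumed to be integer pixel positions measured from the upper left-hand corner, and the text canvas stands in for a real raster image.

```python
from typing import List, Sequence, Tuple

Point = Tuple[int, int]


def to_relative(points: Sequence[Point]) -> List[Point]:
    """Encode each point as an offset from the previous point.
    The first point is measured from the origin (0, 0)."""
    offsets, px, py = [], 0, 0
    for x, y in points:
        offsets.append((x - px, y - py))
        px, py = x, y
    return offsets


def from_relative(offsets: Sequence[Point]) -> List[Point]:
    """Invert to_relative, recovering absolute coordinates."""
    points, x, y = [], 0, 0
    for dx, dy in offsets:
        x, y = x + dx, y + dy
        points.append((x, y))
    return points


def centroid(points: Sequence[Point]) -> Tuple[float, float]:
    """Mean position of the objects, usable as a calculated reference point."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def render(prototypes: Sequence[Sequence[str]],
           placements: Sequence[Tuple[int, int, int]],
           width: int, height: int) -> List[str]:
    """Reconstruct a page by stamping each stored bitmap at its locations.
    Each placement is (prototype index, x, y); '#' marks an inked pixel."""
    canvas = [["."] * width for _ in range(height)]
    for idx, x, y in placements:
        for r, row in enumerate(prototypes[idx]):
            for c, pixel in enumerate(row):
                if pixel == "#":
                    canvas[y + r][x + c] = "#"
    return ["".join(row) for row in canvas]


absolute = [(10, 5), (30, 5), (30, 25)]
relative = to_relative(absolute)
page = render([("##", "##")], [(0, 0, 0), (0, 3, 0)], width=5, height=2)
print(relative)
print("\n".join(page))
```

Both encodings carry the same information, so `from_relative(to_relative(points))` returns the original points. Which one is preferable depends on the application; this sketch does not attempt to model the spatial location index mentioned above.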
According to another aspect of the present invention the representation is a representation format of a type generated by a scanner, an imager, a fax machine, a photocopier, a digital photocopier, or a hand-held screen input computer. In one aspect of the invention, the representation includes only bit mapped images or image primitives, but not traditional word processor application formatting codes or text codes.
According to yet another aspect of the invention, objects can be classified into sets of similar objects depending on user preferences and specific applications. For example, a user may desire to classify an “e” in Helvetica font as being the same as an “e” in Times font, but may want a five-pointed star with rounded corners to be classified as different from a five-pointed star with pointed corners. It is noted that some sets may contain only one object. In this aspect of the invention, the file will store only one copy of an object from each set of objects which the user wishes to classify as similar, and the file will store the location or locations of each object within the original document.
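One way to picture user-controlled classification is a key function supplied by the user, as in the sketch below. It is an assumption-laden illustration: the object fields (`kind`, `char`, `style`, `pos`) and the names `user_key` and `build_entries` are invented here, and the objects are assumed to have been described in advance rather than recognized from pixels.

```python
from typing import Callable, Dict, Hashable, List


def user_key(obj: dict) -> Hashable:
    """Hypothetical user preference: letters are similar regardless of font,
    while other shapes are similar only if their style also matches."""
    if obj["kind"] == "letter":
        return ("letter", obj["char"])  # font (style) deliberately ignored
    return (obj["kind"], obj["style"])


def build_entries(objects: List[dict],
                  key: Callable[[dict], Hashable]) -> Dict[Hashable, dict]:
    """Group objects into sets by the user's key. For each set, keep one
    representative copy and the locations of every member."""
    entries: Dict[Hashable, dict] = {}
    for obj in objects:
        entry = entries.setdefault(key(obj), {"prototype": obj, "locations": []})
        entry["locations"].append(obj["pos"])
    return entries


objects = [
    {"kind": "letter", "char": "e", "style": "Helvetica", "pos": (10, 5)},
    {"kind": "letter", "char": "e", "style": "Times", "pos": (40, 5)},
    {"kind": "star", "char": None, "style": "rounded", "pos": (10, 50)},
    {"kind": "star", "char": None, "style": "pointed", "pos": (40, 50)},
]
stored = build_entries(objects, user_key)
print(len(stored))  # 3
```

The two letters collapse into one set with two locations, while each star forms a set of its own, matching the note that some sets contain only one object. Changing `user_key` changes which distinctions are preserved without touching the rest of the code.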
According to this aspect of the invention, only those distinctions which are important to the user are noted. This saves storage space, reduces processor time, and allows the file to be more quickly transmitted over a network. According to one aspect of the present invention, similar objects can be identified and classified in the representation as follows. When a new object is identified in the representation, the identified object is used as a template to search the representation for similar objects. The template will
