Rollup functions for efficient storage presentation and...

Image analysis – Pattern recognition – Context analysis or word recognition


Details

US Classifications: C382S226000, C382S228000, C382S187000, C704S251000, C715S252000
Type: Reexamination Certificate
Status: active
Patent number: 06597809


TECHNICAL FIELD
The present invention relates to computer-implemented methods and data structures for producing candidate parent entities that are ranked in accordance with ranking information associated with given child entities and, in particular, to such methods for use with software parsers and data dictionaries, for example, of the kind utilized in a system for automated reading, validation, and interpretation of hand print, machine print, and electronic data streams.
BACKGROUND OF THE INVENTION
Optical character recognition (OCR) systems and digital image processing systems are known for use in automatic forms processing. These systems deal with three kinds of data: physical data, textual data, and logical data. Physical data may be pixels on a page or positional information related to those pixels. In general, physical data is not in a form to be effectively used by a computer for information processing. Physical data by itself has neither useable content nor meaning. Textual data is data in textual form. It may have a physical location associated with it. It occurs in, for example, ASCII strings. It has content but no meaning. We know what textual data says, but not what it means. Logical data has both content and meaning. It often has a name for what it is.
For example, there may be a region of black pixels in a certain location on an image. Both the value of the pixels and their location are physical data. It may be determined that those pixels, when properly passed through a recognizer, say: “(425) 867-0700”. Content has been derived from the physical data to generate textual data. If we now know that text of this format (or possibly at this location on a preprinted form) is a telephone number, the textual data becomes logical data.
To facilitate reconciliation of imperfections in physical data and shortcomings of the recognition process, each recognized element of textual data, e.g., a character, may be represented by a ranked group of unique candidates called a “possibility set.” A possibility set includes one or more candidate information pairs, each including a “possibility” and an associated confidence. In the context of an OCR system, the confidence is typically assigned as part of the recognition process. For computational efficiency, the confidences may be assigned within an appropriate base-2 range, e.g., 0 to 255, or a more compact range, such as 0 to 7. For example, FIG. 1 shows an enlarged view of an individual glyph 20 that may be physically embodied as a handwritten character or as a digital pixel image of the handwritten character. From glyph 20, an optical character recognition process may generate the possibility set shown in TABLE 1 by assigning possibilities and associated confidences:
TABLE 1
possibility    confidence
c              200
e              123
o              100
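By way of a concrete sketch (the Python names below are illustrative and not part of the patent disclosure), a possibility set such as the one in TABLE 1 can be modeled as a list of (possibility, confidence) pairs:

```python
# Illustrative sketch only: a possibility set as (possibility, confidence) pairs.
from typing import List, Tuple

PossibilitySet = List[Tuple[str, int]]

# The possibility set of TABLE 1 for glyph 20.
glyph_20: PossibilitySet = [("c", 200), ("e", 123), ("o", 100)]

def best_candidate(pset: PossibilitySet) -> Tuple[str, int]:
    """Return the highest-confidence (possibility, confidence) pair."""
    return max(pset, key=lambda pair: pair[1])

print(best_candidate(glyph_20))  # ('c', 200)
```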
FIG. 2 shows a series of sibling glyphs 22, which are known as “siblings” because they share the same parent word 24. The sibling glyphs 22 can be represented by the four possibility sets as shown in the following TABLE 2:
TABLE 2
poss    conf    poss    conf    poss    conf    poss    conf
c       200     h       190     o       100     r       125
o       150     n       100     a       80      n       100
e       100     r       80
The possibilities of these four possibility sets can be readily combined to form 36 unique strings: “chor”, “ohor”, “ehor”, “cnor”, “cror”, etc. The number of unique strings is determined by the product of the number of character possibilities in each possibility set, i.e., 3×3×2×2=36.
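A minimal sketch of this combination step, using the TABLE 2 possibility sets (variable names are illustrative):

```python
# Sketch: enumerate every candidate string formed from the TABLE 2 possibility sets.
from itertools import product

sibling_sets = [
    [("c", 200), ("o", 150), ("e", 100)],  # first glyph
    [("h", 190), ("n", 100), ("r", 80)],   # second glyph
    [("o", 100), ("a", 80)],               # third glyph
    [("r", 125), ("n", 100)],              # fourth glyph
]

candidates = ["".join(ch for ch, _ in combo) for combo in product(*sibling_sets)]
print(len(candidates))   # 36 (= 3 * 3 * 2 * 2)
print(candidates[:4])    # ['chor', 'chon', 'char', 'chan'] -- enumeration order, not ranked
```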
To gauge or verify their accuracy, the unique “candidate” strings may be processed by a “dictionary” of valid outcomes. In the context of OCR, a dictionary is a filter. It has content and rules. Each candidate string processed by the dictionary is subject to one of three possible outcomes: it is passed, it is rejected, or it is modified into a similar string that passes. One example of a dictionary is based on the English language. For parent word 24 of FIG. 2, the candidate strings “chor” and “ehar” would be rejected by such a dictionary, while “char” would be passed.
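The pass/reject behavior of such a dictionary can be sketched as a simple membership test (the word list is an illustrative stand-in, and the modify-into-a-passing-string outcome is omitted):

```python
# Sketch: a dictionary acting as a pass/reject filter over candidate strings.
english_words = {"char", "chat", "core"}  # illustrative stand-in content

def passed(candidates, dictionary=english_words):
    """Return only the candidate strings that the dictionary passes."""
    return [s for s in candidates if s in dictionary]

print(passed(["chor", "ehar", "char"]))  # ['char']
```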
Because dictionaries often have a very large amount of content against which a candidate string is compared, it may be unduly time-consuming to apply the dictionary to all possible strings. To improve efficiency, it is desirable, before applying a dictionary, to rank the candidate strings in order of some confidence based on the accuracy of recognition. In this way the candidate strings having the highest confidence of having been accurately recognized are processed by the dictionary first. Rules can then be used to determine when to stop dictionary processing, e.g., when enough candidate strings have been processed to have isolated the best candidate strings (with a certain probability). A convenient way to rank candidate strings is to calculate string confidences based on the confidences of the component character possibilities that make up each candidate string. A set of candidate strings and their associated string confidences is referred to as an “alt-set.”
One way to rank parent candidates for creating an alt-set is to add the child confidences for each parent candidate. In the above example, “chor” would have a ranking of 615 (the sum of the confidences associated with the individual characters c-h-o-r), “ohor” would have a ranking of 565, “ehor” would have a ranking of 515, etc. Combining the possibility sets to form the 36 unique strings and calculating their rankings is simple in this example. However, there is no obvious way to read the strings out in ranked order. The strings must first be assigned a ranking, then ordered or sorted based on their assigned rank. This ordering or sorting step becomes especially problematic for longer strings formed from sibling possibility sets having a greater number of possibilities. By way of illustration, a hypothetical 10-character parent word in which each child possibility set includes 10 possibilities would result in 10 billion unique strings. It would be a very time-consuming and computationally expensive task to rank and order 10 billion 10-character strings.
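The add-then-sort approach described above can be sketched as follows; this brute-force illustration enumerates and sorts every candidate, which is exactly the step that becomes prohibitive for large possibility sets:

```python
# Brute-force sketch: score every combined string by summing child confidences, then sort.
from itertools import product

sibling_sets = [
    [("c", 200), ("o", 150), ("e", 100)],
    [("h", 190), ("n", 100), ("r", 80)],
    [("o", 100), ("a", 80)],
    [("r", 125), ("n", 100)],
]

alt_set = sorted(
    (("".join(ch for ch, _ in combo), sum(conf for _, conf in combo))
     for combo in product(*sibling_sets)),
    key=lambda pair: pair[1],
    reverse=True,
)
print(alt_set[:3])  # [('chor', 615), ('char', 595), ('chon', 590)]
```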
Another known way of improving the efficiency of dictionaries is to use specialized dictionaries that contain smaller amounts of content than a more generalized dictionary but that are limited in their application. One such specialized dictionary is an “n-gram” dictionary, which includes information about the frequency with which certain character sequences (e.g., two-letter, three-letter, etc.) occur in the English language. For example, the two-letter combination “Qu” (a 2-gram) occurs in English words much more frequently than “Qo.” To benefit from an n-gram dictionary, the confidence assigned to an n-gram is some combination of (1) the aggregate character confidences and (2) the n-gram frequency provided by the n-gram dictionary. Thus, recognition may have produced “Oueen” and “Queen,” where the first character has the possibility set poss=O, conf=200; poss=Q, conf=100. Because “Qu” occurs much more often in the English language than “Ou,” the 2-gram dictionary would help determine that “Queen” is the more likely parent string.
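As a rough, hypothetical sketch of combining character confidences with 2-gram frequencies (the weights and the combining formula below are assumptions made for illustration, not the patent's):

```python
# Sketch: weight summed character confidences by an assumed 2-gram frequency.
bigram_freq = {"Qu": 0.95, "Ou": 0.01}   # hypothetical relative frequencies

first_glyph = [("O", 200), ("Q", 100)]    # possibility set for the first character
second_glyph = [("u", 180)]               # the second character is confidently "u"

scored = []
for c1, conf1 in first_glyph:
    for c2, conf2 in second_glyph:
        bigram = c1 + c2
        scored.append((bigram, (conf1 + conf2) * bigram_freq.get(bigram, 0.001)))

print(max(scored, key=lambda p: p[1]))  # ('Qu', 266.0): "Queen" beats "Oueen" despite lower raw confidence
```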
A need exists for a method of generating candidate strings in ranked order on an as-needed basis and, more generally, for a method of generating ranked parent candidates on an on-demand basis from a series of sibling possibilities. A need also exists for such a method that can be used with data at different logical levels in a logical data hierarchy, such as n-grams, words, and phrases.
SUMMARY OF THE INVENTION
In accordance with the present invention, methods of organizing a series of sibling data entities are provided for preserving sibling ranking information associated with the sibling data entities and for attaching the sibling ranking information to a joint parent of the sibling data entities to facilitate on-demand generation of ranked parent candidates. A rollup function of the present invention builds a rollup matrix containing information about the sibling entities and the sibling ranking information and provides a method for reading out the parent candidates in ranked order on an as-needed basis.
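For illustration only, the goal of reading candidates out in ranked order without first enumerating the full product space can be met generically with a best-first (priority-queue) search; the sketch below is such a generic approach and is not the rollup-matrix construction of the invention:

```python
# Generic best-first sketch: lazily yield combined strings in descending order of
# summed confidence. Illustrative only; not the patent's rollup matrix.
import heapq

def ranked_candidates(sibling_sets):
    """Yield (string, total_confidence) pairs, highest total first, on demand."""
    sets = [sorted(s, key=lambda p: p[1], reverse=True) for s in sibling_sets]

    def score(indices):
        return sum(sets[i][j][1] for i, j in enumerate(indices))

    start = tuple(0 for _ in sets)            # index of the best child in each set
    heap = [(-score(start), start)]
    seen = {start}
    while heap:
        neg_total, indices = heapq.heappop(heap)
        yield "".join(sets[i][j][0] for i, j in enumerate(indices)), -neg_total
        for i in range(len(sets)):            # demote one child at a time
            if indices[i] + 1 < len(sets[i]):
                nxt = indices[:i] + (indices[i] + 1,) + indices[i + 1:]
                if nxt not in seen:
                    seen.add(nxt)
                    heapq.heappush(heap, (-score(nxt), nxt))

gen = ranked_candidates([
    [("c", 200), ("o", 150), ("e", 100)],
    [("h", 190), ("n", 100), ("r", 80)],
    [("o", 100), ("a", 80)],
    [("r", 125), ("n", 100)],
])
print(next(gen))  # ('chor', 615), produced without building all 36 strings first
```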
