Natural-language information processor with association...

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

Details

C707S793000, C706S045000, C706S062000

active

06374242

ABSTRACT:

FIELD OF THE INVENTION
This invention relates to information extraction from natural language text, and more particularly to a block finder for interpreting the structure of text documents in relation to their content, for aiding in establishing search blocks for information extraction.
BACKGROUND OF THE INVENTION
Many organizations maintain databases of information which they believe to be important to their functions. Formerly, such organizations might employ people to read newspaper and magazine articles and other text reports, and to summarize the information in the relevant database, whatever its form. More recently, the advent of computerized systems has made it possible to perform some of these tasks by use of information extraction or natural-language processing software operating on digitized text. The digitized text can be generated by optically scanning paper reports and applying character recognition software to the scanned images. As an alternative, the digitized text may be derived from on-line publications, such as those on the Internet.
Natural-language processing systems or software attempt to extract information from digitized text by a variety of techniques, including linguistic parsing, pattern matching, statistical methods, or a combination thereof. These information extraction systems tend to apply a serial view of the text to be processed, in that one character follows another in the stream of text. This serial approach is limited in its ability to utilize an author's formatting clues to understand the correct grouping of the extracted information. For example, newspaper text is usually arranged in the form of paragraphs within columns. If an automatic information extraction system were to ingest and operate on this text without some idea of the boundaries of each story, it might incorrectly group unrelated information, such as news of a visit by a foreign dignitary with that of a drug lord's arrest. Natural-language processing systems are useful, but the art is not well advanced at this time.
Improved natural-language processors are desired.
SUMMARY OF THE INVENTION
An information-extraction block-finding method determines the structure of documents represented by one-dimensional text files. Most (if not all) computer files are stored as a one-dimensional sequence of characters. The method includes the step of extracting from the text files at least some symbols representing two-dimensional spatial information. The text file is at least temporarily stored, using the spatial information, in a memory having a two-dimensional structure of grid cells. For at least some of the grid cells, at least two grid cells orthogonally adjacent to the grid cell under consideration are examined. Each grid cell under consideration is assigned from zero to four of the (a) left, (b) right, (c) top, and (d) bottom edge attributes. An attribute is assigned to each boundary between the grid cell under examination and an adjacent cell for which (a) the grid cell under examination includes a text symbol and the adjacent cell to the left lacks a text symbol, (b) the grid cell under examination includes a text symbol and the adjacent cell to the right lacks a text symbol, (c) the grid cell under examination includes a text symbol and the adjacent cell above it lacks a text symbol, or (d) the grid cell under examination includes a text symbol and the adjacent cell below it lacks a text symbol, respectively. This generates a list of cell edges, each defined by its edge attribute and its end locations on the two-dimensional grid. Cell edges having the same left, right, top, or bottom edge attribute and sharing a common end point are joined to form left, right, top, and bottom block edges. A block edge is defined by its edge attribute and the locations of its end points. Block edges are distinguished from cell edges in that they can have a length exceeding unity.
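The grid storage, edge-attribute assignment, and edge-joining steps can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes that newline characters are the only spatial symbols in the file, that any non-whitespace character counts as a text symbol, and that an edge is recorded as a tuple (attribute, start, end) whose end locations are (x, y) cell corners with y increasing downward. The function names to_grid, is_text, cell_edges, and join_edges are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[int, int]
Edge = Tuple[str, Point, Point]  # (attribute, start corner, end corner)


def to_grid(text: str) -> List[List[str]]:
    """Lay a one-dimensional text string out on a rectangular grid of cells."""
    lines = text.split("\n")
    width = max((len(line) for line in lines), default=0)
    return [list(line.ljust(width)) for line in lines]


def is_text(grid: List[List[str]], row: int, col: int) -> bool:
    """A cell holds text if it lies inside the grid and is not whitespace."""
    if 0 <= row < len(grid) and 0 <= col < len(grid[row]):
        return not grid[row][col].isspace()
    return False


def cell_edges(grid: List[List[str]]) -> List[Edge]:
    """Give each text cell zero to four edge attributes, one per empty neighbor."""
    edges: List[Edge] = []
    for r, row in enumerate(grid):
        for c in range(len(row)):
            if not is_text(grid, r, c):
                continue
            if not is_text(grid, r, c - 1):
                edges.append(("left", (c, r), (c, r + 1)))
            if not is_text(grid, r, c + 1):
                edges.append(("right", (c + 1, r), (c + 1, r + 1)))
            if not is_text(grid, r - 1, c):
                edges.append(("top", (c, r), (c + 1, r)))
            if not is_text(grid, r + 1, c):
                edges.append(("bottom", (c, r + 1), (c + 1, r + 1)))
    return edges


def join_edges(edges: List[Edge]) -> List[Edge]:
    """Merge cell edges that have the same attribute and share an end point."""
    joined: List[Edge] = []
    for attr in ("left", "right", "top", "bottom"):
        vertical = attr in ("left", "right")
        # Sort so that touching edges on the same line become consecutive.
        if vertical:
            key = lambda e: (e[1][0], e[1][1])
        else:
            key = lambda e: (e[1][1], e[1][0])
        current = None
        for _, start, end in sorted((e for e in edges if e[0] == attr), key=key):
            if current is not None and current[2] == start:
                current = (attr, current[1], end)  # extend the running block edge
            else:
                if current is not None:
                    joined.append(current)
                current = (attr, start, end)
        if current is not None:
            joined.append(current)
    return joined
```

For a two-line text such as "ab\ncd", every cell contributes two cell edges, and join_edges reduces the eight unit-length cell edges to four block edges of length two.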
Each top and bottom block edge is associated with those left and right edges having common end points therewith, to form closed two-dimensional regions. The spatial coordinates of a bounding box about each of the closed two-dimensional regions are determined. The block structure information thus produced may be used in various ways by the natural-language processor. It can be used to refine the segmentation of the document, ending sentences and paragraphs even when proper punctuation is not provided in the text. The blocking information can also be used to put a two-column document into reading order. When searching in the text for an information element which is associated with a particular other information element, the search is performed in that one of the bounding boxes which contains the other information element.
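One way to associate block edges through their common end points is sketched below. This is an illustrative assumption, not the patent's stated procedure: it uses a union-find structure (a swapped-in technique) and treats any two block edges that meet at a point as part of the same boundary. Block edges are assumed to be tuples (attribute, start, end) with (x, y) corner coordinates, and closed_regions is a hypothetical name. Blocks that touch only at a corner would be merged by this simplification.

```python
from collections import defaultdict


def closed_regions(block_edges):
    """Group block edges that are linked through common end points."""
    parent = list(range(len(block_edges)))

    def find(i):
        # Follow parent links to the representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Index the edges by the end points they touch.
    by_point = defaultdict(list)
    for index, (_, start, end) in enumerate(block_edges):
        by_point[start].append(index)
        by_point[end].append(index)

    # Edges meeting at the same point belong to the same boundary.
    for indices in by_point.values():
        for other in indices[1:]:
            parent[find(other)] = find(indices[0])

    regions = defaultdict(list)
    for index, edge in enumerate(block_edges):
        regions[find(index)].append(edge)
    return list(regions.values())
```

Given the eight block edges of two separate rectangular columns, this grouping returns two regions of four edges each.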
In a particular mode of the method, the step of determining the spatial coordinates of a bounding box includes the steps of identifying (i) the upper-left corner of the bounding box, and (ii) the lower-right corner of the bounding box. The upper-left corner is identified by selecting that point which represents the spatial coordinates of the intersection of (a) the projection of the topmost upper edge of the closed region with (b) the projection of the leftmost of the left edges of the closed region. The lower-right corner is identified as that point which represents the spatial coordinates of the intersection of (a) the projection of the lowermost lower edge of the closed region with (b) the projection of the rightmost of the right edges of the closed region.
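The corner selection described above reduces to taking extreme coordinates over a region's edges. The following sketch assumes the edges of one closed region are tuples (attribute, start, end) with (x, y) corner coordinates and y increasing downward. The name bounding_box is hypothetical.

```python
def bounding_box(region_edges):
    """Return ((left, top), (right, bottom)) for the edges of one closed region."""
    # Upper-left corner: topmost top edge crossed with leftmost left edge.
    top = min(start[1] for attr, start, _ in region_edges if attr == "top")
    left = min(start[0] for attr, start, _ in region_edges if attr == "left")
    # Lower-right corner: lowermost bottom edge crossed with rightmost right edge.
    bottom = max(end[1] for attr, _, end in region_edges if attr == "bottom")
    right = max(end[0] for attr, _, end in region_edges if attr == "right")
    return (left, top), (right, bottom)
```

For a non-rectangular region, for example an L-shaped block with a short last line, the projections still meet at the corners of the smallest rectangle enclosing the region.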
In a particular version of the method, the step of performing the search includes the step of confining the search for the information element associated with a pronoun to that one of the bounding boxes which includes the pronoun.
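A hedged sketch of restricting a pronoun's candidate referents to a single bounding box follows. The data layout is an assumption made for illustration: boxes are ((left, top), (right, bottom)) corner pairs, each mention is a (name, (x, y)) pair giving the grid cell where it begins, and a cell lies in a box when left <= x < right and top <= y < bottom. The names box_containing and candidates_for are hypothetical.

```python
def box_containing(boxes, position):
    """Return the first bounding box that contains the given grid cell, if any."""
    x, y = position
    for box in boxes:
        (left, top), (right, bottom) = box
        if left <= x < right and top <= y < bottom:
            return box
    return None


def candidates_for(pronoun_pos, mentions, boxes):
    """Keep only the mentions that lie in the same bounding box as the pronoun."""
    box = box_containing(boxes, pronoun_pos)
    if box is None:
        return []
    return [name for name, pos in mentions
            if box_containing([box], pos) is not None]


# Two side-by-side stories: a pronoun in the right-hand box is matched
# only against mentions found in that box.
boxes = [((0, 0), (10, 5)), ((14, 0), (24, 5))]
mentions = [("the dignitary", (2, 1)), ("the suspect", (15, 1))]
print(candidates_for((16, 3), mentions, boxes))
```

Selecting among the remaining candidates is left to whatever coreference strategy the natural-language processor already uses.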
In a preferred mode of the method of the invention, the text is prefiltered to eliminate single and double spaces between sentences. This step can be done before the step of examining grid cells and assigning edge attributes. The prefiltering may include the step of deeming to be a cell occupied by text each grid cell which is occupied by a space symbol and which is (a) bounded on the left by a text grid cell and (b) bounded on the right by a text grid cell. The preferred method also includes the step of deeming to be a set of four text cells (TTTT) those cells having the form TWWT, where T represents a text cell and W represents a space or white cell. More particularly, this includes deeming to be text cells all left-right adjacent pairs of space-symbol grid cells which are bounded on the left and on the right by text characters.
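The prefiltering rules can be sketched as a pass that produces an occupancy mask for each grid row. This is an illustration under assumptions, not the patented implementation: each row is a sequence of single characters, any whitespace character counts as a white cell (W), and the hypothetical function prefilter returns booleans instead of altering the stored characters.

```python
def prefilter(grid):
    """Return per-row text-occupancy masks with one- and two-space gaps filled."""
    filled = []
    for row in grid:
        text = [not ch.isspace() for ch in row]  # True for T cells, False for W
        out = list(text)
        for c in range(1, len(row) - 1):
            if text[c]:
                continue
            if text[c - 1] and text[c + 1]:
                # TWT is deemed TTT: a single space between text cells.
                out[c] = True
            elif text[c - 1] and c + 2 < len(row) and text[c + 2]:
                # TWWT is deemed TTTT: a pair of spaces between text cells.
                out[c] = out[c + 1] = True
        filled.append(out)
    return filled
```

Gaps of three or more white cells are left unfilled, so wider gutters such as those between columns still separate blocks. The tests read the unmodified mask, so filling one gap never causes a neighboring gap to be filled.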


REFERENCES:
patent: 5604854 (1997-02-01), Glassey
patent: 5848416 (1998-12-01), Tikkanen
patent: 5852819 (1998-12-01), Beller
patent: 5963956 (1999-10-01), Smartt
patent: 6100985 (2000-08-01), Scheiner et al.
patent: 6134539 (2000-10-01), O'Connor et al.
patent: 6281974 (2001-08-01), Scheiner et al.
patent: 6292809 (2001-09-01), Edelman
patent: 6292810 (2001-09-01), Richards
Guerra, Concettina, “Survey of Parallel Algorithms for Structural Pattern Matching”, Proceedings of the 12th IAPR International Conference on Pattern Recognition, 1994, vol. 3, Conference C: Signal Processing, pp. 275-278.*
Sartipi, Kamran et al., “A Pattern Matching Framework for Software Architecture Recovery and Restructuring”, Proceedings of the 8th International Workshop on Program Comprehension, IWPC 2000, Jun. 10-11, 2000, pp. 37-47.*
Childs, Lois C., “An Evaluation of Coreference Resolution Strategies for Acquiring Associated Information”, Advances in Text Processing, Tipster Program Phase II, Apr. 1994-Sep. 1996, pp. 179-184.
