Data processing: database and file management or data structures – Database design – Data structure types
Reexamination Certificate
2001-03-20
2003-12-09
Amsbury, Wayne (Department: 2171)
active
06662190
ABSTRACT:
FIELD OF THE INVENTION
The present invention relates to processing data. In particular, the present invention is related to converting text into data records.
BACKGROUND OF THE INVENTION
Data extraction is the process of converting digital text to digital data records. For example, the text of a web page found on a web site that sells cars may be converted into a set of records, one record for each car that is offered for sale. Each car may be associated with “values” for its attributes of make, model, year, color and price. The set of attributes for a particular car makes up the record associated with that car. Some of these attributes may have no value; that is, an attribute is left unassigned merely to indicate that no value for it was extracted from the web page text for that car.
A record for a particular car may include the value “Alpha Romeo” for the attribute “make”; “red” for the attribute “color”; and “$1900” for the attribute “price”. The other attributes, “model” and “year”, are left blank.
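The record described above can be sketched as a small data structure. This is only an illustration: the class name `CarRecord`, the use of `None` to mean "no value extracted", and the choice of strings for every attribute are assumptions, not part of the patent text.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class CarRecord:
    """One record per car; None means no value was extracted (assumed convention)."""
    make: Optional[str] = None
    model: Optional[str] = None
    year: Optional[str] = None
    color: Optional[str] = None
    price: Optional[str] = None


# The example record from the text: three attributes filled, two left blank.
record = CarRecord(make="Alpha Romeo", color="red", price="$1900")

# Attributes for which nothing was extracted, in declaration order.
missing = [name for name, value in asdict(record).items() if value is None]
```

Representing a blank attribute explicitly keeps "no value extracted" distinct from an empty string that actually appeared in the text.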
Converting text information to records is useful because it allows searching, sorting and presenting of the data based on the values of the different attributes. However, not all records come from the same text or are presented in the same text format. Therefore, it is desirable to extract records from a variety of different texts in different formats. Usually, data extraction can be done by using software called a “data extractor” that is tailored to the text format of interest, one extractor for each type of text format. Alternatively, it is possible to develop a data extractor that deduces the format of the source text and then uses that format to guide it in extracting records. These data extractors are referred to as “automatic data extractors.” Automatic data extractors can be used on texts from different sources or on texts from the same source but where that text format may change from time to time.
In order to deduce the format of a newly encountered text, the automatic data extractor may use its knowledge of the attribute values and various other formats. Knowledge of attribute values, referred to as “domain knowledge,” may be contained in a vocabulary list stored in the memory of a computer or digital storage device. To use the example given above, domain knowledge of color values of cars may include not only colors such as “red”, “blue” and “green” but also a list of color labels such as “color” and “coloring.”
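A vocabulary list of this kind might be held in memory as shown below. The layout (a dictionary keyed by attribute, with separate sets for values and labels) and the function name `recognize` are hypothetical; the text only says that values and labels are stored in a vocabulary list.

```python
# Domain knowledge for one attribute, using the color example from the text.
VOCABULARY = {
    "color": {
        "values": {"red", "blue", "green"},
        "labels": {"color", "coloring"},
    },
}


def recognize(token, vocabulary=VOCABULARY):
    """Return (attribute, kind) pairs for a token, where kind is 'value' or 'label'."""
    word = token.strip().lower()
    hits = []
    for attribute, entry in vocabulary.items():
        if word in entry["values"]:
            hits.append((attribute, "value"))
        if word in entry["labels"]:
            hits.append((attribute, "label"))
    return hits
```

Case-insensitive matching is a simplification made here for the sketch; a real recognizer could use patterns or fuzzy matching instead.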
Referring now to FIG. 1, which depicts a schematic view of a prior art automatic data extractor, automatic data extractors may use focuser procedures to identify regions of interest in text that is read in. These procedures include vocabularies of “recognizers” to identify attribute values and labels in texts. To identify various formats, automatic data extractors use “focus parsers” or “parsers”. Parsers identify regions in a source text where the text may contain attribute values. Recognizers can be used to evaluate regions of text identified by parsers. Because different parsers may be more appropriate for a given region of text known to have attribute values, the results provided by different parsers must be evaluated or “graded” for their appropriateness. The focuser procedures that grade parsers in this way are called “focus graders” or “graders.” Thus, focuser procedures include three components: recognizers, parsers and graders. Recognizers are used by parsers to identify attribute values in a source text. Then, graders are used to determine which parser produced results that best fit the text.
For example, suppose the source text is a web page that contains a banner advertisement, several pages of free text, and a table that contains record data. The goal of the focuser procedure is to identify the region of interest; here, the table. First, each focus parser is applied to the text. One focus parser may identify the free text and another may identify the table. The first parser returns the free text region; and the second one, the table region. The focus grader is applied to both regions returned. The graders determine which region contains the most attribute values, or the most attribute values and labels, or the most attribute values and labels per number of words in the region, depending on the grading algorithm. The region that achieves the highest grade becomes the “region of interest.”
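The grading step in this example can be sketched as follows, using the third grading option mentioned above (recognized values and labels per number of words). The vocabulary contents, the sample regions, and the names `grade_region` and `region_of_interest` are all assumptions made for illustration.

```python
# Hypothetical domain knowledge: attribute values and attribute labels.
VALUES = {"red", "blue", "green", "ford", "honda"}
LABELS = {"color", "make", "price"}


def grade_region(text):
    """Grade a region as recognized values and labels per word."""
    words = [w.strip(".,:;").lower() for w in text.split()]
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in VALUES or w in LABELS)
    return hits / len(words)


def region_of_interest(regions):
    """Return the candidate region with the highest grade."""
    return max(regions, key=grade_region)


# Two regions, as if returned by two different focus parsers.
free_text = "Welcome to our dealership. We have been selling cars for many years."
table = "Make Color Price Ford red $1900 Honda blue $2500"

best = region_of_interest([free_text, table])
```

Normalizing by word count keeps a long free-text region from winning simply because it is long.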
Automatic data extractors may also contain segmenter procedures that include segment parser and segment grader components. Segmenter procedures are designed to identify a series of “record regions” in the text that each contain data for a single record. If the region of interest is a table, each row of the table may include the attribute values of a record and thus be a record region.
After a region of interest has been identified, segment parsers are applied to it. The first parser may return each cell as a record region; the second, each row; and the third, each column. Then the segment grader is applied. Recognizers are again used by the graders to evaluate the different record regions returned. The graders apply the programmed algorithm, which may penalize record regions that have fewer than one or more than one value per attribute. As before, the series of record regions that returns the best grade becomes the series of interest.
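One way to picture the segment grading described above is a penalty that grows whenever a record region holds fewer or more than one value for an attribute. The sketch below assumes a tiny two-attribute vocabulary and treats the lowest total penalty as the best grade; the names `count_values` and `penalty` are hypothetical.

```python
# Hypothetical vocabulary: attribute name -> known values.
VOCAB = {"make": {"ford", "honda"}, "color": {"red", "blue"}}


def count_values(region, attribute):
    """Count cells in a record region that are recognized values of the attribute."""
    return sum(1 for cell in region if cell.lower() in VOCAB[attribute])


def penalty(segmentation):
    """Sum, over regions and attributes, the distance from exactly one value."""
    return sum(
        abs(count_values(region, attribute) - 1)
        for region in segmentation
        for attribute in VOCAB
    )


# A two-row table and the three candidate segmentations named in the text.
table = [["Ford", "red"], ["Honda", "blue"]]
by_row = table
by_column = [list(column) for column in zip(*table)]
by_cell = [[cell] for row in table for cell in row]

best = min([by_cell, by_row, by_column], key=penalty)
```

Here the row segmentation is the only one in which every region carries exactly one make and one color, so it receives no penalty.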
Once the record regions are identified, automatic data extractors produce records as follows. For each record region, a record is formed with all the attributes initially having no values. Then for each attribute, if there is at least one recognized value in the record region, the first such value becomes the record value. That first value is “extracted” from the text and entered into the data record.
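The extraction rule above (start with empty attributes, then take the first recognized value for each) can be sketched as below. The vocabulary and the function name `extract_record` are assumptions for illustration, and a record region is simplified to a list of words.

```python
# Hypothetical vocabulary: attribute name -> known values.
VOCAB = {
    "make": {"ford", "honda"},
    "color": {"red", "blue"},
    "year": {"1998", "1999"},
}


def extract_record(region_words):
    """Build one record from one record region, keeping the first recognized value."""
    # Every attribute starts with no value.
    record = {attribute: None for attribute in VOCAB}
    for attribute, values in VOCAB.items():
        for word in region_words:
            if word.lower() in values:
                record[attribute] = word
                break  # later recognized values for this attribute are ignored
    return record


record = extract_record(["Ford", "red", "blue", "coupe"])
```

Note that "coupe" is dropped entirely because it is not in the vocabulary, which is the limitation the remainder of the background section describes.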
This kind of automatic data extractor relies heavily on the domain knowledge in its vocabulary lists. The more comprehensive the list of recognizers, the better will be the deduction of source text information and the more complete the data records. Therefore, having a larger vocabulary list is better. However, building a large vocabulary list is labor intensive. Furthermore, vocabulary changes over time. New values come into existence and old values become out of date. Thus not only is building a large list labor intensive, so also is maintaining an up-to-date list. Thus there remains a need for a better way to develop and maintain vocabulary lists in automatic data extractors.
SUMMARY OF THE INVENTION
The present invention is a method for increasing the vocabulary of an automatic data extractor and it is also an automatic data extractor that automatically learns new vocabulary. The present automatic data extractor increases its vocabulary by learning as it is applied to extract data records from text. An automatic data extractor that learns new vocabulary can extract more data records from text. The present automatic data extractor uses domain knowledge to deduce data structure, then uses both the new structure and domain knowledge to extract new values not previously in its vocabulary and adds them to the records and to its vocabulary.
The method includes procedural components in addition to those in prior art data extractors, namely, field parsers and field graders. Each field parser is applied to a series of record regions to create a candidate series of field lists. Then the field grader uses recognizers to choose a single best series of field lists from the various candidates created by the field parsers. Next, an attribute mapper is applied to the selected series of field lists to determine the positions of the attributes in the list. Once it is known that a particular attribute corresponds to a particular position in the field lists, the fields in that position of the field lists are written as the attribute values to the corresponding record whether they are in the vocabulary list or not. In this way, new values of attributes are deduced or “gleaned” from a text source. If a field is not in the vocabulary list, it is added to the vocabulary list. Thus, the data extractor learns new vocabulary values.
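The attribute mapping and learning steps can be sketched as follows, assuming the field parsers and field grader have already selected one series of field lists. The mapping rule used here (assign each attribute to the position that holds the most known values) is an assumption chosen for illustration, since the summary does not specify how the mapper decides; the names `map_attributes` and `extract_and_learn` are likewise hypothetical.

```python
def map_attributes(field_lists, vocabulary):
    """Map each attribute to the field position holding the most known values."""
    if not field_lists:
        return {}
    width = min(len(fields) for fields in field_lists)
    positions = {}
    for attribute, known in vocabulary.items():
        counts = [
            sum(1 for fields in field_lists if fields[i].lower() in known)
            for i in range(width)
        ]
        if counts and max(counts) > 0:
            positions[attribute] = counts.index(max(counts))
    return positions


def extract_and_learn(field_lists, vocabulary):
    """Write fields to records by position and add unseen fields to the vocabulary."""
    positions = map_attributes(field_lists, vocabulary)
    records = []
    for fields in field_lists:
        record = {attribute: None for attribute in vocabulary}
        for attribute, i in positions.items():
            record[attribute] = fields[i]               # written whether known or not
            vocabulary[attribute].add(fields[i].lower())  # new values are learned
        records.append(record)
    return records


vocabulary = {"make": {"ford", "honda"}, "color": {"red", "blue"}}
field_lists = [["Ford", "red"], ["Honda", "blue"], ["Saab", "teal"]]
records = extract_and_learn(field_lists, vocabulary)
```

In this run, "Saab" and "teal" are unknown at the start, but because the first two field lists establish which position holds the make and which holds the color, both are extracted and added to the vocabulary.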
Bax Eric T
Pellico Julian
Amsbury Wayne
iSpheres Corporation
Mann Michael A.
Nexsen Pruet Jacobs & Pollard LLC