Data processing: database and file management or data structures – Database design – Data structure types
Reexamination Certificate
1998-08-06
2001-08-07
Alam, Hosain T. (Department: 2172)
C707S793000
active
06272495
ABSTRACT:
The present invention relates generally to the processing, storage and analysis of information in the form of free-format data, and particularly, but not exclusively, to a method and apparatus for interpreting free-format text.
BACKGROUND OF THE INVENTION
One of the main purposes of computer systems is to manage information. This management of information is performed internally by data management systems. Generally, data management systems may be divided into two categories: 1) Database management systems; and 2) Text search and retrieval systems.
The first type of data management system imports and manipulates data into internal representations so that the data may be located and modified. When required, these systems generate a suitable representation of this data which is read by humans or used by another system. This category of data management system includes: hierarchical, network, relational, object-oriented database management systems and knowledge based management systems.
Within hierarchical, network and relational databases, information about an entity (a transaction, a stock item, a person, a company, an address etc.) is usually referred to as a “record” (although sometimes a record may contain information about many entities). Within each record the various “attributes” of the entity are usually classified into “fields”.
Within object-oriented database management systems and knowledge based management systems these basic units may have other names such as “object”, and the information regarding the object may have names such as “slot” or “member”. Each of the attribute fields/slots has a format which can be, for example, integer, real number, boolean or character. Other fields/slots are themselves records/objects. Some fields/slots have specific formats (e.g., date, time), while others are free-format text.
Once the database has been constructed, it may be used to perform the following operations:
Add a record/object
Locate and change a record/object
Locate and delete a record/object
Retrieve information
These operations will be referred to as “normal database operations”.
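The four operations listed above can be sketched against a small in-memory table. This is only an illustration: the table name `person`, its columns and the helper function names are assumptions made for the example and do not come from the specification.

```python
import sqlite3

# Hypothetical table and column names, chosen only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, address TEXT)"
)


def add_record(name, address):
    """Add a record and return its generated id."""
    cur = conn.execute(
        "INSERT INTO person (name, address) VALUES (?, ?)", (name, address)
    )
    return cur.lastrowid


def change_record(record_id, address):
    """Locate a record by id and change its address field."""
    conn.execute(
        "UPDATE person SET address = ? WHERE id = ?", (address, record_id)
    )


def delete_record(record_id):
    """Locate a record by id and delete it."""
    conn.execute("DELETE FROM person WHERE id = ?", (record_id,))


def retrieve(name):
    """Retrieve all records whose name field matches exactly."""
    return conn.execute(
        "SELECT id, name, address FROM person WHERE name = ?", (name,)
    ).fetchall()
```

Each helper works on a whole field: the address can be stored, replaced or returned, but nothing here can reach inside it.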
Storing information about an entity in fields/slots is suitable for many types of data. There are, however, some types of data which do not have a suitable standard structure. A good example of data which does not have a standard structure is “address” data. As most databases store people's address information in one, two or three free-format fields, performing normal database operations on individual attributes of the address is very difficult. Note that the term “attribute” is used in this specification to refer to a property of an “element” of data.
For example, the free-format data “12 Pitt Street, NORTH SYDNEY” has a number of “elements”. Each element has an associated “attribute”. An attribute of the element “NORTH” is that it is a “geographical indicator”. An attribute of the element “12” is that it is a “number”. Note that the “low level” elements correspond to the “tokens” of the data, i.e., the element “NORTH” is a token of the data. The data also includes higher level elements, however: “NORTH SYDNEY” is an element which includes two tokens, and this element has the attribute of being a “town”. An attribute of the entire data “12 Pitt Street, NORTH SYDNEY”, i.e., the total “element”, is that it is an “address”. An alternative term for element is “component”.
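The relationship between tokens, elements and attributes described above can be pictured with a small data structure. This is a minimal sketch and not the method of the invention: the `Element` class, the `TOKEN_ATTRIBUTES` lookup and the hard-coded grouping of two tokens into a town are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """A piece of the data together with one attribute describing it."""
    text: str
    attribute: str


# Hypothetical hand-written lookup; a real system would need far more.
TOKEN_ATTRIBUTES = {"NORTH": "geographical indicator"}


def tokenize(data):
    """Split free-format data into tokens on whitespace and commas."""
    return re.findall(r"[^\s,]+", data)


def token_attribute(token):
    """Assign an attribute to a single low-level element (token)."""
    if token.isdigit():
        return "number"
    return TOKEN_ATTRIBUTES.get(token.upper(), "unknown")


data = "12 Pitt Street, NORTH SYDNEY"
tokens = tokenize(data)

# Low-level elements: one per token.
low_level = [Element(t, token_attribute(t)) for t in tokens]

# Higher-level element spanning two tokens (grouping hard-coded here).
town = Element(" ".join(tokens[3:5]), "town")

# The total element: the entire data.
whole = Element(data, "address")
```

The same text therefore carries attributes at several levels at once, which a single free-format field cannot expose.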
Providing each element of this free-format data with its own field for the associated attribute would increase the size and complexity of the database significantly, even for this simple example of addresses. Where a database holds information on people together with their addresses, the address data may therefore be stored in a single field labelled “address”, particularly in older databases. This field contains the address in free-format form, so present database technology cannot perform normal database operations on individual elements of the address: those elements cannot be accessed separately. Only the total combination of elements which makes up the address can be accessed, as a whole, as “address”.
This problem is to some extent addressed by the science of database scrubbing/cleansing. This field of commercial endeavour applies parsing processes to free-format text with the objective of creating new database fields for the attributes of the free-format text and entering completely standardised data into those fields. This standardising of data includes converting all spelling variations into one consistent set (e.g., “Street” → “St”). The above example would produce the following:
House Number | Street Name | Street Type | City
12           | Pitt        | St          | Sydney
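A scrubbing step of the kind described above might look like the following. This is a deliberately naive sketch, not any commercial package: it assumes every address has the fixed layout "number, street name, street type, comma, city", and the `STREET_TYPES` table, the `PATTERN` expression and the `scrub` function are hypothetical names introduced for the example.

```python
import re

# Hypothetical table mapping spelling variations to one standard form.
STREET_TYPES = {
    "street": "St", "st": "St",
    "road": "Rd", "rd": "Rd",
    "avenue": "Ave", "ave": "Ave",
}

# Assumed layout: "<number> <street name> <street type>, <city>".
PATTERN = re.compile(r"^\s*(\d+)\s+(.+?)\s+(\w+)\.?\s*,\s*(.+?)\s*$")


def scrub(address):
    """Split a free-format address into standardised fields.

    Returns None when the address does not fit the assumed layout or
    uses a street type that is not in the lookup table.
    """
    match = PATTERN.match(address)
    if match is None:
        return None
    number, name, street_type, city = match.groups()
    standard = STREET_TYPES.get(street_type.lower())
    if standard is None:
        return None
    return {
        "House Number": number,
        "Street Name": name.title(),
        "Street Type": standard,
        "City": city.title(),
    }
```

Under these assumptions, `scrub("12 Pitt Street, SYDNEY")` yields one value per new field, while any address outside the assumed layout is rejected. That rejection is the weakness: every new address layout needs new rules and, potentially, new fields.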
The new database fields are then used to perform normal database operations. An entire industry is devoted to this field, applying large, complex and expensive software packages to take information stored in databases, analyse and process the information to produce new databases including more fields for the attributes of the information records, thus providing more flexibility for operations which can be applied to the records.
Much has been written about the field of database cleansing/scrubbing (see e.g., “Dealing with Dirty Data”, DBMS Magazine, September 1996). The process is expensive: a complete cleansing operation for a large database can cost millions of dollars, because it is so time consuming and the software packages that have been developed to cleanse databases are very complex. It is also still limited by the fundamental requirement that, to perform database operations on an element, the element must have a field to itself.
This brings us to the second major problem which afflicts the present methods of storing computerised information in commercial databases. Practically all commercial data is stored within hierarchical or relational databases, or in flat data files, whose structure is fixed at the time of design. Information, however, is by its very nature complex and can have an almost infinite number of different attributes. Creating a database containing fields for each and every attribute of all the different types of information is impractical, if not totally impossible, and the cost of any attempt to produce a database containing fields for all the types of information available to humanity would be prohibitive.
Even a fairly trivial (although very important) example illustrates the scale of the problem. Consider international addresses, i.e., addresses the world over. Although four or five free-format fields can contain any address, a database table designed with a data field for every possible attribute of all international addresses would contain hundreds, if not thousands, of data fields. England has counties, the USA and Australia have states, Japan has districts and a different order of address elements, etc.
The field of database cleansing/scrubbing is therefore a partial solution at best. It still requires the same fundamental database structure of a field for each data attribute. One can build more and more complex databases but this problem can never be completely resolved, and limits the computerised handling of information significantly.
Natural language processing systems are known that employ “Semantic Grammars” to encode semantic information into a syntactic grammar. These systems are mainly used to provide a natural language interface to other systems such as a database management system. The following description comes from a book by Patterson, D. W., “Artificial Intelligence and Expert Systems”.
“. . . They use context-free rewrite rules with non-terminal semantic constituents. The constituents are categories or metasymbols such as attribute, object, present (as in display or print), and ship, rather than NP (Noun Phrase), VP (Verb Phrase), N (Noun), V (Verb), and so on. . . . Semantic grammars have proven to be successful in limited applications including LIFER, a data base query system distributed by the (US) Navy . . . and a tutorial system
Alam Hosain T.
Colbert Ella
Davis & Bujold P.L.L.C.