Data processing: artificial intelligence – Adaptive system
Reexamination Certificate
2000-05-19
2003-12-09
Khatri, Anil (Department: 2121)
C706S045000, C706S011000
active
06662168
ABSTRACT:
FIELD OF THE INVENTION
The present invention relates generally to methods and systems for automated text processing, and specifically to methods for automated coding of textual data.
BACKGROUND OF THE INVENTION
The tasks involved in conducting a large-scale survey, such as a population census, generally fall into three essential stages:
Data collection, generally either by filling out paper forms or electronic data entry;
Data coding, in which data collected in free text form are converted into unambiguous codes, typically numbers or alphanumeric values; and
Data analysis.
The present patent application is concerned with the coding stage. In response to a given question, such as “What is your occupation?”, there are typically many different answers that can correspond to the same code. As a simple example, the responses “I drive heavy trucks” and “driver of a heavy truck” should receive the same code. A computer, however, will have a difficult time recognizing this fact. Because of such ambiguities, coding has not generally been automated up to now. The personnel engaged to perform the coding must have a high level of expertise, including familiarity with coding procedures and with a large catalog of codes that is typically provided for this purpose. For example, coders must know whether such job descriptions as “childcare worker,” “babysitter,” “nanny” and “playgroup assistant” fall under the same coding classification or different ones. The same coder must be capable of coding “semi-trailer driver” and “driver of a heavy truck.” Because of the huge volume of data to be coded, with relatively little computer assistance, and the high level of skill that is required, the coding stage is generally the single most expensive activity in a census.
The Inference Group, of Manuka, Australia, offers a system known as “Precision Data” for automated coding of textual data. The system is described at www.inferencegroup.com.au. Precision Data offers two types of automated coding: automatic coding, performed by a computer strictly without human intervention, giving either one or no answer; and computer-assisted coding, wherein the computer output may be zero, one or several answers. In the latter case, a human coder must choose a code from a list suggested by the system. Precision Data is based on a coding engine, which is described as a “semi-linguistic” system. The engine parses input phrases, looks up words and other objects in a dictionary, and calculates a confidence level. The dictionary links the words to a classification index. A selection algorithm is then used to determine if there is an acceptable coding match. Coding parameters can be set to control how strict or loose a match must be in order to be acceptable. The system has a user interface with different levels of user access.
SUMMARY OF THE INVENTION
It is an object of the present invention to provide improved methods and systems for automated coding of textual data.
It is a further object of some aspects of the present invention to provide interactive methods for automated coding that make more efficient use of human coding resources.
It is still a further object of some aspects of the present invention to provide methods and systems for automatic coding of textual data with enhanced accuracy and speed.
In some preferred embodiments of the present invention, an automatic text coding system receives a collection of reference phrases along with their corresponding codes, which have been assigned by one or more human experts. After preprocessing the text to remove superfluous words and characters, the system analyzes the phrases to generate respective code lists for all of the remaining words. The code list for any given word includes the codes assigned to all of the phrases in which the word appeared. Preferably, a weight is assigned to each code in the code list, which reflects the likelihood of the code being the correct one when the given word appears in an unknown phrase. Thus, the system prepares the code lists substantially autonomously, based on coding results known to be correct.
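The preparation of code lists described above can be illustrated by a minimal sketch. It is not taken from the specification: the stop-word list, the codes "DRV01" and "MEC02", the function names, and in particular the choice of weight (the fraction of reference phrases containing the word that carry the given code) are assumptions made for illustration only.

```python
from collections import Counter, defaultdict

# Hypothetical stop-word list; the text only says superfluous words are removed.
STOP_WORDS = {"i", "a", "an", "the", "of"}

def preprocess(phrase):
    """Lowercase, replace punctuation with spaces, and drop stop words."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " "
                      for ch in phrase.lower())
    return [word for word in cleaned.split() if word not in STOP_WORDS]

def build_code_lists(reference):
    """Map each word to {code: weight} from expert-coded (phrase, code) pairs.

    Assumed weight: share of the phrases containing the word that were
    assigned that code.
    """
    counts = defaultdict(Counter)
    for phrase, code in reference:
        for word in set(preprocess(phrase)):  # count each word once per phrase
            counts[word][code] += 1
    code_lists = {}
    for word, counter in counts.items():
        total = sum(counter.values())
        code_lists[word] = {code: n / total for code, n in counter.items()}
    return code_lists

# Hypothetical reference collection with made-up codes.
reference = [
    ("I drive heavy trucks", "DRV01"),
    ("driver of a heavy truck", "DRV01"),
    ("heavy machinery mechanic", "MEC02"),
]
code_lists = build_code_lists(reference)
```

In this sketch the word "heavy" appears under both codes and so carries a split weight, while "mechanic" points to a single code. No stemming is applied, so "truck" and "trucks" receive separate code lists.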
The system subsequently uses the code lists to code further phrases whose coding is not known a priori. For each phrase, the system computes a respective cumulative matching score for each of the codes that appears in the code list of one or more of the words in the phrase. The matching score of a given code is determined by summing the weights listed for that code in the code lists of all of the words in the phrase (although the weight may be zero in some of the code lists). Preferably, the sum is weighted to account for factors such as the order of the words in the phrase. When the system finds that for a given phrase, one of the codes has a cumulative matching score much higher than the score of any other code, it unequivocally selects the code with the highest score. Furthermore, if the phrase exactly matches one of the phrases in the collection that was coded by human experts, the code assigned by the expert is preferably selected automatically.
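The scoring and selection steps of the preceding paragraph might be sketched as follows. The function names, the geometric `decay` factor standing in for word-order weighting, and the `margin` ratio used to decide that one score is "much higher" than the rest are all assumptions; the text does not specify these details. The code lists and codes shown are made-up sample data.

```python
def score_codes(words, code_lists, decay=1.0):
    """Sum each code's weights over the words of the phrase.

    A decay below 1.0 down-weights later words (an assumed form of
    order weighting); a word with no entry for a code contributes zero.
    """
    scores = {}
    for position, word in enumerate(words):
        factor = decay ** position
        for code, weight in code_lists.get(word, {}).items():
            scores[code] = scores.get(code, 0.0) + factor * weight
    return scores

def select_code(words, code_lists, exact_matches, margin=2.0):
    """Return a code when the choice is clear-cut, otherwise None."""
    key = " ".join(words)
    if key in exact_matches:  # phrase already coded by a human expert
        return exact_matches[key]
    ranked = sorted(score_codes(words, code_lists).items(),
                    key=lambda item: item[1], reverse=True)
    if not ranked:
        return None
    # Assumed rule: the top score must be at least `margin` times the runner-up.
    if len(ranked) == 1 or ranked[0][1] >= margin * ranked[1][1]:
        return ranked[0][0]
    return None

# Hypothetical code lists and expert-coded phrases.
code_lists = {
    "driver": {"DRV01": 0.75, "TAX03": 0.25},
    "heavy": {"DRV01": 0.5, "MEC02": 0.5},
    "truck": {"DRV01": 1.0},
}
exact_matches = {"semi trailer driver": "DRV01"}

clear = select_code(["driver", "heavy", "truck"], code_lists, exact_matches)
ambiguous = select_code(["heavy"], code_lists, exact_matches)
```

Here `clear` receives the dominant code, while `ambiguous` is None because the two candidate codes for "heavy" have equal scores, which is the situation in which a human would be consulted.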
In some preferred embodiments of the present invention, if there are a number of candidate codes for a given phrase that have roughly comparable cumulative scores, the system passes the phrase to a human specialist. Typically, multiple specialists are available, each with a particular field or fields of expertise. The system automatically chooses the most appropriate specialist, typically one who is expert in a category to which the candidate code with the highest score belongs. The system presents the human specialist with the candidate code or codes in the specialist's field of expertise. The specialist verifies or rejects the code (or indicates that he or she is unable to decide). In the case of rejection, if the next candidate code is in a different category, the phrase is passed on to another specialist with expertise in that category. The system thus makes optimal use of the human resources at its disposal, increasing the speed at which ambiguous phrases can be handled while reducing the level of training and ability required of most of the human operators.
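The routing of an ambiguous phrase among specialists can be illustrated roughly as below. This is a sketch under stated assumptions: specialists are modeled as hypothetical callables returning "accept", "reject" or "undecided", the categories and codes are invented, and the handling of an "undecided" verdict (immediate escalation) is one possible reading rather than a rule given in the text.

```python
def route_to_specialists(ranked_codes, category_of, specialists):
    """Offer candidate codes, best first, to the specialist for each category.

    Returns the accepted code, or None when the phrase should be passed
    on for manual coding.
    """
    for code in ranked_codes:
        specialist = specialists.get(category_of[code])
        if specialist is None:
            continue  # assumption: skip candidates nobody is qualified to judge
        verdict = specialist(code)
        if verdict == "accept":
            return code
        if verdict == "undecided":
            return None  # assumption: escalate rather than keep asking
        # On "reject", fall through to the next candidate, which may
        # belong to a different category and hence a different specialist.
    return None

# Hypothetical categories and stand-in specialists.
category_of = {"DRV01": "transport", "MEC02": "trades"}
specialists = {
    "transport": lambda code: "reject",
    "trades": lambda code: "accept",
}

chosen = route_to_specialists(["DRV01", "MEC02"], category_of, specialists)
```

In this example the transport specialist rejects the top candidate, so the phrase moves on to the trades specialist, who accepts the second candidate.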
It may also occur that the system is unable to find any codes with sufficient cumulative weights, or that there is an excessive number of codes, or that the chosen specialist (or specialists) rejected all of the candidate codes or was unable to reach a decision. In such a case, the phrase is passed to an expert human operator for manual coding. Optionally, methods of natural language processing, as are known in the art, are first applied in order to classify the field of the phrase, so that it can be routed to an operator with the appropriate field of expertise. Preferably, after the phrase has been coded, the phrase and its assigned code are added to the collection of reference phrases with known codes. The assigned code, with the appropriate weights, is then automatically added to the code lists of the words in the phrase, as described above. In this manner, the system automatically learns from the phrases that it was unable to code automatically.
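The fallback test and the learning step described above might look like the following sketch. The thresholds `min_score` and `max_candidates`, the class and method names, the sample codes, and the count-based weights are assumptions made for illustration; the text states only that the newly coded phrase and its code are added to the reference collection and to the code lists of its words.

```python
from collections import Counter, defaultdict

def needs_manual_coding(scores, min_score=1.0, max_candidates=5):
    """True when no code scores high enough or there are too many candidates."""
    if not any(score >= min_score for score in scores.values()):
        return True
    return len(scores) > max_candidates

class CodeListStore:
    """Reference phrases plus per-word code counts that grow as phrases are coded."""

    def __init__(self):
        self.reference = {}                 # phrase -> expert-assigned code
        self.counts = defaultdict(Counter)  # word -> Counter of codes

    def learn(self, words, code):
        """Record a manually coded phrase and update the code lists of its words."""
        self.reference[" ".join(words)] = code
        for word in set(words):
            self.counts[word][code] += 1

    def weights(self, word):
        """Assumed weight: share of the word's reference phrases carrying each code."""
        counter = self.counts.get(word)
        if not counter:
            return {}
        total = sum(counter.values())
        return {code: n / total for code, n in counter.items()}

# Hypothetical phrases coded manually by an expert operator.
store = CodeListStore()
store.learn(["playgroup", "assistant"], "CHD04")
store.learn(["dental", "assistant"], "DEN05")
```

After these two phrases are learned, the word "assistant" is shared between both codes, whereas "playgroup" points only to the first, so a later phrase containing "playgroup" can be scored without human help.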
Preferred embodiments of the present invention are thus based on a combination of a number of component inventive concepts. These concepts include automatic routing of phrases with suggested candidate codes to appropriately-specialized human operators, and automatic learning of codes from previously-coded text. It will be understood, however, that these inventive concepts may also be used independently of one another. Furthermore, while preferred embodiments described herein are directed to coding of text phrases, the principles of the present invention may also be applied in automated coding of data of other types. Such coding may be used, for example, in classifying images (as in automated visual inspection or sorting) or sounds. All such applications are considered to be within the scope of the present invention.
There is therefore provided, in accordance with a preferred embodiment of the present invention, a method for automated coding of a text phrase relative to a catalog
Wallach Eugene
Zlotnick Aviad
Hirl Joseph P.
Khatri Anil