Data processing: artificial intelligence – Knowledge processing system
Reexamination Certificate
2000-01-17
2003-01-14
Black, Thomas (Department: 2121)
C704S009000
active
06507829
ABSTRACT:
MICROFICHE APPENDIX
This application contains a microfiche appendix consisting of three microfiche comprising 127 frames.
FIELD OF THE INVENTION
The present invention is directed to a computer-based method and apparatus for classifying textual data.
BACKGROUND OF THE INVENTION
A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by any one of the patent disclosure, as it appears in the U.S. Patent and Trademark Office patent files or records, but otherwise reserves all copyright rights whatsoever.
The present invention is directed to a computer-based method and apparatus for classifying textual data. One application of the invention is a computer-based method and apparatus for classifying clinical trial adverse event reports. In the field of pharmaceuticals intended for the treatment of humans, clinical trials are used to validate the efficacy and safety of new drugs. These clinical trials are conducted with the participation of physicians, who monitor the health of persons involved in the trial.
Any symptom of a disease or malady, or degradation in the health of a patient participating in a clinical trial, is termed an adverse event. Once such adverse events are observed or reported to the physician responsible for monitoring the health of a patient, an adverse event report is generated. These adverse event reports typically include short descriptions of the symptoms or health effects that resulted in the report. The reports generally omit all but those terms that are significant to the description of the adverse event being reported. However, given the nature of language, it is possible to describe one event in a very large number of ways. Accordingly, one patient who experiences headaches during a trial may have their symptoms described by a physician as “headache, migraine”, while another patient who experiences headaches may have their symptoms described as “migraine headache” or simply as “headache.” In addition to the variations in describing adverse events due to differing word combinations, the physicians who prepare the adverse event reports may resort to synonyms (e.g. describing pain in the abdomen as “pain in the stomach” or “pain in the belly”) or to abbreviations. Additionally, the reports are abbreviated in their syntax (e.g. “allergy, arms and legs” rather than “skin allergy on the arms and legs”). Adverse event reports are also often collected from all over the world. Therefore, adverse event reports can be in a number of languages, in addition to English.
The text that comprises an individual adverse event report is known as a verbatim. The verbatims must be collected and the information they contain must be sorted, so that the significance of the various symptoms and health effects reported in the verbatims can be considered. Traditionally this work has been carried out by humans who read the verbatims and assign them to predefined categories of adverse events. A number of systems exist for categorizing verbatims. These include WHOART, COSTART, and MedDRA. However, coding verbatims is tedious and human coders introduce a certain amount of error. Furthermore, a physician is often required to interpret verbatims and to put them into their proper classification. For reasons such as cost, however, physicians are generally not employed for such work.
Computer programs that exist for assisting human coders in properly and easily classifying verbatims in accordance with the above-mentioned systems suffer from a number of drawbacks. In particular, existing systems are often incapable of automatically coding verbatims that do not conform to a verbatim that has been coded before. Therefore, existing automated systems generally cannot code a verbatim that varies from previously coded verbatims in the significant terms it employs. Although existing systems may sometimes include the capability of memorizing new verbatims, so that coding will be automatic if a previously coded verbatim occurs again, such a capability has limited use. This is because, as described above, similar adverse events can be described in an almost infinite number of ways. The possible combinations of words in a language number in the billions, even when the length of the combinations is restricted to those of five words or fewer.
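The scale of the combinatorial claim above can be checked with a short calculation. The sketch below is illustrative only: it assumes ordered word sequences with repetition allowed, and the vocabulary size of 100 is a hypothetical figure chosen to show that even a tiny vocabulary exceeds a billion phrases of five words or fewer. The function name `phrase_count` is not from the patent.

```python
def phrase_count(vocabulary_size, max_length=5):
    """Count ordered word sequences of length 1..max_length.

    Assumption for illustration: any word may follow any other word,
    and words may repeat within a sequence.
    """
    return sum(vocabulary_size ** length for length in range(1, max_length + 1))


# Hypothetical 100-word vocabulary, far smaller than any real language.
small_vocabulary_total = phrase_count(100)
print(small_vocabulary_total)  # 10101010100
```

A memorization-only coder would need one stored entry per distinct phrase, which is why exact-match lookup covers so little of this space.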
Another impediment to producing reliable automated coding systems is that it is difficult to obtain real world verbatims that can be used as the basis for the automatic coding provisions of existing systems. This is because such data is usually proprietary, and even if not proprietary, is difficult to obtain. As a result, existing automated systems have typically been developed using the English language definitions of the categories set forth in the various classification schemes, rather than actual verbatims produced in the course of clinical trials.
As a result of the above-described limitations and difficulties, existing automated systems are rarely successful in identifying and classifying a verbatim that has not been seen before. Where a verbatim cannot be automatically coded, existing automated systems provide machine assistance in hand coding the verbatim. This is done by means of weak pattern matching functions, such as spelling normalization and stemming. Following pattern matching, the system typically offers the coder categories that the program has determined the verbatim may properly fall into.
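To make the limits of such weak pattern matching concrete, the following is a minimal sketch of spelling normalization and stemming. It is not the method of any particular existing system: the functions `crude_stem` and `normalize` and the two-suffix rule are hypothetical, and a real stemmer would be considerably more careful.

```python
import re


def crude_stem(token):
    """Strip a trailing "ing" or "s" when at least three letters remain.

    A toy rule for illustration only, not a production stemmer.
    """
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def normalize(verbatim):
    """Lowercase the text, drop punctuation, and stem each word."""
    tokens = re.findall(r"[a-z]+", verbatim.lower())
    return [crude_stem(token) for token in tokens]


# Surface variation in case, punctuation, and inflection is removed...
same_event = normalize("Migraine Headaches!") == normalize("migraine headache")
# ...but synonyms still fail to match.
synonyms_match = normalize("pain in the belly") == normalize("pain in the abdomen")
print(same_event, synonyms_match)  # True False
```

The second comparison shows why matching of this kind is described as weak: it reconciles spelling and inflection but cannot relate verbatims that use different significant terms.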
Another difficulty in the field of clinical trial adverse event reporting is the translation of study results coded according to one classification scheme to another classification scheme. A system and method for performing such a translation would be useful because there is only limited correspondence between the categories of the various classification schemes. Translation is often desirable to compare the results obtained by different trials. However, at present, no program exists that can perform this function effectively and without large amounts of human assistance.
SUMMARY OF THE INVENTION
The present invention is capable of automatically coding the majority of the verbatims it encounters. In addition, the present invention is capable of reliably flagging those verbatims that still require coding by a human. Experiments have shown that existing auto coding systems are capable of coding only about one quarter of the verbatims in a study with high confidence. In contrast, the present invention is capable of auto coding approximately two thirds of the verbatims in a study with an error rate that is comparable to the error rate encountered using human coders. For those verbatims that the system of the present invention is incapable of auto coding, the human coder is presented with a list containing up to ten categories from which to choose the proper code. These categories are ordered, with the most likely ones appearing at the top of the list. By intelligently limiting the codes presented, the present invention is capable of improving the error rates encountered when coding verbatims. In addition, the automated system of the present invention is capable of coding a large number of verbatims in a short period of time, and of reducing the amount of time and level of intervention required of human coders. Furthermore, the present invention allows verbatims classified in one coding scheme to be translated to another coding scheme in a highly automated process.
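The division of labor described above, automatic coding where possible and a ranked shortlist of up to ten categories otherwise, can be sketched as follows. This is an assumption-laden illustration, not the patented procedure: the function `triage`, the confidence threshold of 0.8, the score values, and the category labels are all hypothetical.

```python
def triage(scores, threshold=0.8, max_candidates=10):
    """Auto-code when the best score clears the threshold.

    Otherwise flag the verbatim for a human coder and return at most
    max_candidates categories, most likely first. The threshold rule is
    an assumption made for this sketch.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if ranked and ranked[0][1] >= threshold:
        return {"auto_code": ranked[0][0], "shortlist": []}
    shortlist = [code for code, _ in ranked[:max_candidates]]
    return {"auto_code": None, "shortlist": shortlist}


# Hypothetical category scores for two verbatims.
confident = triage({"HEADACHE": 0.93, "MIGRAINE": 0.05})
uncertain = triage({"RASH": 0.41, "PRURITUS": 0.38, "URTICARIA": 0.12})
print(confident["auto_code"])  # HEADACHE
print(uncertain["shortlist"])  # ['RASH', 'PRURITUS', 'URTICARIA']
```

Keeping the shortlist short and ordered is what limits the choices a human coder must weigh for the flagged verbatims.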
The present invention uses a sparse vector framework to enable the system to, in effect, employ vectors of sizes up to 10^300. This is done by performing the multiplication steps necessary to evaluate information-containing dimensions in the matrix only on those dimensions that have a non-zero value. This method allows the program to effectively use a large amount of knowledge gained from training data to evaluate natural language text that has not been seen by the system before.
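The idea of multiplying only over non-zero dimensions can be illustrated with a small sketch. It assumes a sparse vector is stored as a mapping from a dimension (here, a word) to its non-zero value. The names `count_vector` and `sparse_dot`, the tokenization, and the category weights are hypothetical and are not taken from the microfiche appendix.

```python
import re
from collections import Counter


def count_vector(text):
    """Build a sparse count vector: only words that occur get an entry."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def sparse_dot(a, b):
    """Dot product that visits only dimensions stored in the smaller vector.

    Dimensions absent from either mapping are implicitly zero, so they
    contribute nothing and are never touched.
    """
    if len(a) > len(b):
        a, b = b, a
    return sum(value * b[key] for key, value in a.items() if key in b)


verbatim = count_vector("headache, migraine")
# Hypothetical weights for one category, as might be learned from training data.
category_weights = {"headache": 2.0, "migraine": 1.5, "nausea": 0.7}
score = sparse_dot(verbatim, category_weights)
print(score)  # 3.5
```

The cost of `sparse_dot` depends on the number of stored entries, not on the nominal dimensionality, which is what makes very large nominal vector sizes workable in practice.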
According to an embodiment of the present invention, a count vector is constructed for each verbatim. The size o
Kornai Andras
Richards Jon Michael
Black Thomas
Hirl Joseph P.
PPD Development, Lp
Sheridan Ross P.C.