Reexamination Certificate
2001-01-02
2004-03-30
Nghiem, Michael (Department: 2863)
Data processing: measuring, calibrating, or testing
Measurement system
Measured signal processing
active
06714897
ABSTRACT:
FIELD OF THE INVENTION
The present invention relates to methods for exploratory analysis of categorical data. More specifically, the invention is a method for generating analyses of categorical data that will allow the application of exploratory multivariate analysis procedures.
BACKGROUND OF THE INVENTION
A categorical measurement on an object is a measurement that takes one of a set of known, fixed values, but has a discontinuous relationship with a previous or next measurement. For example: an observation as to whether a switch is “on” or “off” is a categorical measurement; the answer to each question in a political poll or other survey is a categorical measurement. Clock, calendar, and angle measure are also categorical data inasmuch as there are discontinuities, for example 60 minutes per hour, leap year, and 60 minutes per degree.
In addition to clinical and survey data [the “multiple choice” parts of a survey (as opposed to the free text)], other sources of categorical data include, but are not limited to, data mining, patents, warranty cards, and combinations thereof. Much of the data that are often the subject of “data mining” (e.g., for marketing) are categorical (e.g., income level, age bracket, favorite sports and hobbies). However, the data sets to be analyzed in some data mining applications are much larger than the anticipated size of clinical trials data sets. Patents, thought of as data, contain significant categorical data as well as significant data of other types.
Table 1 shows a typical matrix arrangement of categorical data. For definiteness and convenience the data are discussed as though they are obtained as the result of a survey, poll, or questionnaire of multiple choice questions. In the table, each object (or individual) responds to the 4 questions. Possible values are shown for two of the objects; the answers for the first 3 questions are listed in a manner suggesting some character-coded response. The fourth question is listed as though the response is one of a finite list of positive whole numbers. Note that different questions can have different numbers of allowable answers and different coding schemes.
TABLE 1
EXAMPLE OF CATEGORICAL DATA, QUESTIONNAIRE
Person/Object   Q1   Q2   Q3   Q4
1               D    A    Q    1
2               B    B    S    99
3               A    A    Q    2
. . .
M               A    C    D    4
. . .
N               D    C    B    2
The current general strategy for summarizing categorical data is to model all of the outcomes of a question (e.g., Q1) as representing outcomes from a single probability distribution, for example a multinomial. Previously, categorical data have been difficult to use for exploratory cause-effect analysis. Most often a query or hypothesis is posed and categorical data are collected and tested statistically to confirm or deny the query or hypothesis. Further treatments of such data (see references [1], [2], and [3]) concentrate largely on describing classes of probabilistic models that might explain or fit the data; the resulting models are then used to confirm whether suspected effects exist. Some methodology for exploratory analysis of categorical data is presented in [4]; these methods focus on calculating optimized encodings of categorical (and other) data.
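As an illustration of the conventional strategy described above, the following minimal sketch pools all answers to one question and estimates a single multinomial distribution from their relative frequencies. The data are the five rows of Table 1 that show explicit values; the dictionary layout and the function name multinomial_estimate are assumptions made for this example and do not come from the cited references.

```python
from collections import Counter

# Rows of Table 1 that show explicit values (objects 1, 2, 3, M, N).
# The dictionary layout is an assumption made for this illustration.
responses = {
    "1": {"Q1": "D", "Q2": "A", "Q3": "Q", "Q4": 1},
    "2": {"Q1": "B", "Q2": "B", "Q3": "S", "Q4": 99},
    "3": {"Q1": "A", "Q2": "A", "Q3": "Q", "Q4": 2},
    "M": {"Q1": "A", "Q2": "C", "Q3": "D", "Q4": 4},
    "N": {"Q1": "D", "Q2": "C", "Q3": "B", "Q4": 2},
}


def multinomial_estimate(responses, question):
    """Estimate one multinomial for a question from pooled answer counts."""
    counts = Counter(row[question] for row in responses.values())
    total = sum(counts.values())
    # Relative frequency of each observed answer.
    return {answer: n / total for answer, n in counts.items()}


q1 = multinomial_estimate(responses, "Q1")
print(q1)
```

Under this strategy every object's answer to Q1 is treated as one draw from the same distribution, which is why the summary describes the question as a whole and not the individual objects.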
However, categorical data may contain useful information, supporting a second hypothesis if you will, beyond the data needed to address the first hypothesis, and this information would not be recognized by methods focused on the first hypothesis. For example, clinical treatments, designed for a particular purpose, sometimes have desirable side effects. Discovering beneficial side effects and the conditions under which they occur can lead to medically and economically significant pharmaceutical products. Isolating detrimental side effects and the conditions under which they occur is also clinically useful. Data relevant to uncovering these side effects arise from clinical trials when a patient's symptoms and associated properties, either elicited or reported to the health care provider, are encoded into standard classes.
Work with similar intent, that is, retrieving objects similar to a specified object, or summarizing the relations among objects (but using differently typed data), has long been underway in the information retrieval community [5], [6]. However, the data in these works are unstructured text.
Hence there is a need for a method of handling categorical data in a manner that permits identification of additional hypotheses and relationships in the data.
Background References
[1] Y. M. M. Bishop, S. E. Fienberg, and P. W. Holland. Discrete Multivariate Analysis: Theory and Practice. MIT Press, 1975.
[2] Alan Agresti. Categorical Data Analysis. John Wiley & Sons, 1990.
[3] N. E. Breslow and N. E. Day. Statistical Methods in Cancer Research. IARC Scientific Publications No. 32, 1980.
[4] George Michailidis and Jan de Leeuw. “GIFI System of Descriptive Multivariate Statistics.” Statistical Science, 13(4): 307-336, 1998.
[5] S. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. “Indexing by latent semantic analysis.” Journal of the American Society for Information Science, 41(6): 391-407, 1990.
[6] Howard R. Turtle and W. Bruce Croft. “A comparison of text retrieval models.” Computer Journal, 35(3): 279-290, Jun. 1992.
SUMMARY OF THE INVENTION
The present invention provides a method of generating analyses of categorical data that will allow the application of exploratory multivariate analysis procedures constructed from inner products, distances, vector additions, and scalar multiplications to said categorical data having a plurality of responses. The method comprises the steps of encoding categorical data to provide a plurality of probability distribution representations, transforming exploratory multivariate analysis procedures based on inner products, distances, vector additions and scalar multiplications to work with probability distribution representations, and applying the transformed exploratory multivariate analysis procedures to the probability distribution representation to allow browsing, retrieving and viewing of said converted categorical data.
Whereas previously, each response to a question might have been modeled as an outcome from a multinomial probability distribution, according to the present invention each response is represented as a probability distribution. With this encoding or conversion, the vector of measurements for each individual can be viewed as a member of the linear space that includes vectors of probability distributions.
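One way to picture this encoding is sketched below. It assumes that an observed answer is represented by a point-mass distribution over the allowable answers to its question (probability 1 on the observed answer, 0 elsewhere); this section does not fix the form of the distribution, so that choice is an assumption. The category lists in category_sets and the function names are likewise hypothetical.

```python
# Hypothetical lists of allowable answers; Table 1 shows only some values.
category_sets = {
    "Q1": ["A", "B", "C", "D"],
    "Q4": [1, 2, 4, 99],
}


def encode_response(answer, categories):
    """Return a point-mass distribution over `categories` (an assumed encoding)."""
    if answer not in categories:
        raise ValueError(f"{answer!r} is not an allowable answer")
    return [1.0 if c == answer else 0.0 for c in categories]


def encode_object(row, category_sets):
    """Encode one object's answers as a vector of probability distributions."""
    return {q: encode_response(row[q], cats) for q, cats in category_sets.items()}


# Object 1 of Table 1, restricted to Q1 and Q4 for brevity.
object_1 = {"Q1": "D", "Q4": 1}
encoded_1 = encode_object(object_1, category_sets)
print(encoded_1)
```

Each entry of the encoded object is itself a distribution, so different questions may carry different numbers of allowable answers without any change to the representation.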
An advantage of the present invention is that existing methods for representing and manipulating numerical data can be adapted for the converted categorical data. In other words, the representation of categorical data as vectors of discrete probability distributions allows the use of standard clustering, projection, and/or visualization algorithms. A collection of vectors of probability distributions can be used to create a linear space by the standard method of taking all linear combinations of the vectors of probability distributions. The present invention has the further advantage of permitting identification of more than one hypothesis from a categorical data set.
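The operations named above can be illustrated with a short sketch. It assumes each object is stored as a mapping from question to a list of probabilities, and it uses the ordinary Euclidean inner product and distance summed over questions; the particular inner product is an assumption made here, and the function names are hypothetical.

```python
import math


def inner(u, v):
    """Inner product: sum over questions of the dot product of the distributions."""
    return sum(a * b for q in u for a, b in zip(u[q], v[q]))


def distance(u, v):
    """Euclidean distance between two vectors of distributions."""
    return math.sqrt(sum((a - b) ** 2 for q in u for a, b in zip(u[q], v[q])))


def combine(weights, vectors):
    """Linear combination of vectors of distributions, question by question."""
    result = {}
    for q in vectors[0]:
        size = len(vectors[0][q])
        result[q] = [
            sum(w * vec[q][i] for w, vec in zip(weights, vectors))
            for i in range(size)
        ]
    return result


# Two illustrative objects: they differ on Q1 and agree on Q2.
u = {"Q1": [1.0, 0.0, 0.0], "Q2": [1.0, 0.0]}
v = {"Q1": [0.0, 0.0, 1.0], "Q2": [1.0, 0.0]}

print(inner(u, v))                   # 1.0, one matching answer
print(distance(u, v))                # square root of 2
centroid = combine([0.5, 0.5], [u, v])
print(centroid)                      # Q1 mass split evenly, Q2 unchanged
```

With point-mass encodings the inner product counts the questions on which two objects agree, and an equal-weight combination yields a centroid whose entries are again probability distributions. Quantities of this kind are what clustering and projection algorithms built from inner products and distances require.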
The data are represented and treated so that a visual, exploratory analysis of the data becomes possible. The present invention effectively permits the data to suggest hypotheses by virtue of the distribution encoding and adaptation of existing exploratory analysis methods.
One object of the present invention is to provide a method for clustering objects based on categorical measurements taken on objects and/or responses.
Another object is to provide a method for trending objects and/or responses.
It is a further object to provide a method for trending objects and/or responses based on categorical measurements taken on said objects and/or responses.
Yet another object is to provide a method for segmenting a sequence or time series of objects and/or responses.
It is a further
Daly Don S.
Ferryman Thomas A.
Whitney Paul D.
Battelle (Memorial Institute)
Lau Tung
Nghiem Michael
Woodard Emhardt Moriarty McNett & Henry LLP