Data analyzing method for generating rules

Data processing: artificial intelligence – Knowledge processing system – Knowledge representation and reasoning technique

Reexamination Certificate

Details

C706S045000, C706S048000, C706S059000, C706S061000

Status: active

Patent number: 06321217

ABSTRACT:

BACKGROUND OF THE INVENTION
The present invention relates to a data analyzing method and system for analyzing a collection of data, expressed in terms of numeric values or symbols, which is stored in an information storage unit as a data base. More particularly, the present invention relates to a data analyzing method and system for analyzing a collection of data in a data base and for processing and converting the analyzed data to obtain an expression or rule useful to users.
With the advancement of computer technology, the volume of data accumulated in a computer has increased year by year. This tendency is becoming more and more pronounced, mainly in on-line systems, as networking advances. At present, a data base of one million records, which corresponds to giga (=10^9) bytes, is by no means rare.
Data stored in a computer are a mere collection of numerical values or symbols. In view of this, techniques have been proposed for converting such a collection of data into information useful to users, thereby attaining effective utilization of the data. The most widely known method is a statistical one involving correlation analysis and multiple regression analysis.
Further, a relatively new method is known that converts the data into a rule form easily understandable to users, such as IF-THEN rules ("if . . . , then . . ."), that is, a method which uses a knowledge acquisition technique called rule induction. For example, on pages 23-31 of the Hitachi Creative Work Station 2050 (trade name) ES/TOOL/W-RI Explanation/Operation Manual, a method is described which expresses a relation present between data in the form of a rule.
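Purely as an illustration of the rule form involved (the representation, field names, and helper below are hypothetical and are not taken from the cited manual), such an IF-THEN rule can be held as a conjunction of attribute tests together with a single conclusion:

```python
# Hypothetical sketch: an IF-THEN rule as data. The IF part is a conjunction
# of attribute tests; the THEN part is a single conclusion.
rule = {
    "if": [("age", ">=", 40), ("deposit_balance", ">=", 10_000_000)],
    "then": ("buys_commodity_A", True),
}

OPS = {">=": lambda a, b: a >= b, "<": lambda a, b: a < b, "==": lambda a, b: a == b}

def satisfies(case: dict, conditions: list) -> bool:
    """True if one record (case) meets every test in the IF part."""
    return all(OPS[op](case[attr], value) for attr, op, value in conditions)

customer = {"age": 45, "deposit_balance": 12_000_000, "occupation": "office worker"}
print(satisfies(customer, rule["if"]))  # True: the rule's IF part covers this customer
```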
The method was originally aimed at creating, from given data, a rule that can be input to an expert system. However, it is also applicable when a human user wants to find characteristics, such as causality and regularity, that are contained in stored data.
The above-described conventional method aims at creating a rule that can be utilized by a computer. Although it is possible for a human user to interpret the rule, the rule is not formed in a way that is easy for a human to understand. Thus it has been impossible to create a rule suited to a human interpreting it and understanding the characteristics of the data used. The above-described method will be explained in more detail below using various examples.
First, suppose that the data is a collection of individual cases. For example, in an application that analyzes the cause of a semiconductor defect by using a quality control data base in a semiconductor manufacturing process, each individual case is managed in a manufacturing unit called a wafer, and a set of information pieces, such as processing parameters in each manufacturing step or various test results, can be handled as one case.
FIG. 1 shows examples of such data.
In a method of checking the financial commodity purchasing trend of each customer from a customer data base kept by a bank, a set of information pieces for each customer, such as age, deposit balance, occupation, annual income, and financial commodity purchasing history, is one case, and the data to be analyzed can be regarded as a collection of such cases. As to this example, a detailed explanation will be given in an embodiment of the invention which will be described later.
Reference will now be made to an example of forming a rule according to the above-described conventional method. As an example, suppose that features common to customers who have bought a certain financial commodity ("commodity A" hereinafter) are to be checked. In this case, the object is to create a rule that classifies, as accurately as possible, cases corresponding to the customers who have bought commodity A from cases corresponding to the customers who have not bought it.
According to the foregoing conventional method, from among sets of item values (e.g., "The age is 40 or more and the deposit balance is 10,000,000 yen or more."), the set which classifies the given data most accurately is created. In this case, the term "accurately" is used in the following sense: in the subset of cases having the specified values, the higher the proportion of cases corresponding to customers who have bought the financial commodity A, the more accurately the features of those customers are classified. This set of values can be expressed in the form of a rule such as "IF the age is 40 or more AND the deposit balance is 10,000,000 yen or more, THEN purchase the financial commodity A."
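A minimal sketch of this selection criterion, assuming each case is a simple record with a bought_A flag (all names and values below are invented for illustration): for each candidate condition set, compute the proportion of covered cases whose customers bought commodity A, and keep the candidate with the highest proportion.

```python
# Hypothetical sketch of the selection criterion: among candidate condition
# sets, prefer the one whose covered cases contain the highest proportion of
# customers who bought commodity A.
cases = [
    {"age": 45, "deposit_balance": 12_000_000, "bought_A": True},
    {"age": 52, "deposit_balance": 15_000_000, "bought_A": True},
    {"age": 31, "deposit_balance": 2_000_000,  "bought_A": False},
    {"age": 48, "deposit_balance": 3_000_000,  "bought_A": False},
]

candidates = [
    [("age", ">=", 40)],
    [("age", ">=", 40), ("deposit_balance", ">=", 10_000_000)],
]

def covers(case, conditions):
    ops = {">=": lambda a, b: a >= b}
    return all(ops[op](case[attr], v) for attr, op, v in conditions)

def accuracy(conditions, cases):
    """Proportion of purchasers among the cases covered by the IF part."""
    covered = [c for c in cases if covers(c, conditions)]
    if not covered:
        return 0.0
    return sum(c["bought_A"] for c in covered) / len(covered)

best = max(candidates, key=lambda cond: accuracy(cond, cases))
print(best, accuracy(best, cases))
# [('age', '>=', 40), ('deposit_balance', '>=', 10000000)] 1.0
```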
Next, the cases explained by the created rule are removed from the entire set of cases. In the above example, the cases which satisfy the condition of the age being 40 or more and the deposit balance being 10,000,000 yen or more are removed. With respect to the remaining set of cases, the set of item values which classifies most accurately is then determined. By repeating this processing, it is possible to obtain a group of rules for distinguishing the customers who have bought the financial commodity A from the customers who have not bought it.
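The remove-and-repeat procedure just described is, in effect, a covering loop. The following is a rough sketch under simplifying assumptions (equality-only conditions, a greedy single-condition search standing in for the real rule search, and invented field names and threshold); it is not the patented method itself.

```python
# Hypothetical sketch of the covering loop: find the rule that classifies the
# remaining cases most accurately, emit it, remove the cases its IF part
# covers, and repeat until no useful rule is found.
def covers(case, conditions):
    return all(case.get(attr) == value for attr, value in conditions)

def find_best_rule(cases, target="bought_A"):
    """Greedy placeholder: pick the single attribute/value test whose covered
    cases have the highest proportion of purchasers (a stand-in for the real search)."""
    best, best_acc = None, 0.0
    for case in cases:
        for attr, value in case.items():
            if attr == target:
                continue
            cond = [(attr, value)]
            covered = [c for c in cases if covers(c, cond)]
            acc = sum(c[target] for c in covered) / len(covered)
            if acc > best_acc:
                best, best_acc = cond, acc
    return best, best_acc

def induce_rules(cases, target="bought_A", min_accuracy=0.6):
    rules, remaining = [], list(cases)
    while remaining:
        conditions, acc = find_best_rule(remaining, target)
        if conditions is None or acc < min_accuracy:
            break                                     # no sufficiently accurate rule left
        rules.append({"if": conditions, "then": (target, True)})
        remaining = [c for c in remaining if not covers(c, conditions)]
    return rules

cases = [
    {"occupation": "self-employed", "age_band": "40s", "bought_A": True},
    {"occupation": "self-employed", "age_band": "30s", "bought_A": True},
    {"occupation": "office worker", "age_band": "40s", "bought_A": True},
    {"occupation": "office worker", "age_band": "20s", "bought_A": False},
]
for r in induce_rules(cases):
    print("IF", r["if"], "THEN", r["then"])
```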
As will be seen from the above explanation, the rule group obtained by the foregoing conventional method takes the form IF . . . ELSE IF . . . ELSE IF . . . , for example: IF the age is 40 or more AND the deposit balance is 10,000,000 yen or more, THEN purchase the financial commodity A; ELSE IF the occupation is self-employed AND the annual income is 8,000,000 yen or more, THEN purchase the financial commodity A; ELSE IF . . .
In the case where a computer makes a classification by using this rule group, the processing can be executed mechanically merely by checking successively from the head IF. However, the larger the number of rules, the more difficult it is for a human to understand the features of the customers who have bought the financial commodity A. A further problem is that, as the number of cases increases, the required processing time increases rapidly, because the search for a rule over the remaining set of cases is repeated every time a rule is generated.
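Checking "successively from the head IF" amounts to evaluating an ordered decision list. A small, hypothetical sketch (the condition names below are invented stand-ins for the tests above):

```python
# Hypothetical sketch of classifying with the ordered rule group: test the
# IF parts in order from the head and return the first matching conclusion.
rules = [
    {"if": {"age_40_or_more": True, "deposit_10M_or_more": True}, "then": "buys A"},
    {"if": {"self_employed": True, "income_8M_or_more": True},    "then": "buys A"},
]

def classify(case: dict, rules: list, default: str = "does not buy A") -> str:
    for rule in rules:                       # IF ... ELSE IF ... ELSE IF ...
        if all(case.get(k) == v for k, v in rule["if"].items()):
            return rule["then"]              # first matching rule wins
    return default                           # ELSE: no rule applied

print(classify({"age_40_or_more": True, "deposit_10M_or_more": True}, rules))  # buys A
print(classify({"self_employed": False, "income_8M_or_more": True}, rules))    # does not buy A
```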
A still more serious problem is that real-world data such as those in the above example must generally be regarded as containing a great deal of noise. That is, whether or not the financial commodity A is purchased may be influenced by items not contained in the data base, so the formation of a highly accurate classification rule cannot be expected. Likewise, in the case of analyzing the cause of a semiconductor defect referred to above, the data contain much noise because the occurrence of a defect is influenced by factors which vary randomly. In such cases it is often unreasonable to demand the formation of a definite rule.
To address the above problems, it is effective to adopt an analyzing method which expresses rough features of the data. In the foregoing conventional method, however, the values of many items are combined and a search is made for a rule that classifies as accurately as possible, so the number of conditions appearing in the IF portion of the rule generally increases while the number of cases falling under the rule decreases. Consequently, it is difficult to satisfy the purpose of understanding rough features of the data.
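As a small numeric illustration of this trade-off (randomly generated data, not data from the patent): every condition added to the IF part can only shrink the set of cases the rule covers, so a highly specific rule explains only a small portion of the data even if it classifies its few cases accurately.

```python
# Hypothetical illustration: adding conditions to the IF part monotonically
# shrinks the number of covered cases.
import random

random.seed(0)
cases = [{"age": random.randint(20, 70),
          "deposit": random.randint(0, 20_000_000),
          "self_employed": random.random() < 0.3}
         for _ in range(1000)]

conditions = [
    ("age >= 40",      lambda c: c["age"] >= 40),
    ("deposit >= 10M", lambda c: c["deposit"] >= 10_000_000),
    ("self-employed",  lambda c: c["self_employed"]),
]

covered = cases
for name, test in conditions:
    covered = [c for c in covered if test(c)]
    print(f"after adding '{name}': {len(covered)} of {len(cases)} cases covered")
```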
A wide variety of information pieces are stored in an actual data base. These include items obviously having nothing to do with the purpose of the analysis, such as the wafer number and the manufacturing start date (year, month, day) in the foregoing semiconductor quality control data, or the name and telephone number in the customer data. On the other hand, there are also information pieces which may be effective in the analysis, such as the product classification code in the semiconductor quality control data and the address in the customer data.
In making such an analysis as in the above example by using data comprising so many kinds of information pieces . . .
